Oracle Scratchpad

July 16, 2018

Direct IOT

Filed under: 12c,Infrastructure,IOT,Oracle — Jonathan Lewis @ 1:02 pm BST Jul 16,2018

A recent (automatic ?) tweet from Connor McDonald highlighted an article he’d written a couple of years ago about an enhancement introduced in 12c that allowed for direct path inserts to index organized tables (IOTs). The article included a demonstration seemed to suggest that direct path loads to IOTs were of no benefit, and ended with the comment (which could be applied to any Oracle feature): “Direct mode insert is a very cool facility, but it doesn’t mean that it’s going to be the best option in every situation.”

Clearly it’s necessary to pose the question – “so when would direct mode insert be a good option for IOTs?” – because if it’s never a good option you have to wonder why it has been implemented. This naturally leads on to thinking about which tests have not yet been done – what aspects of IOTs did Connor not get round to examining in his article. (That’s a standard principle of trouble-shooting, or testing, or investigation: when someone shows you a test case (or when you think you’ve finished testing) one thing you should do before taking the results as gospel is to ask yourself what possible scenarios have not been covered by the test.)

So if you say IOT what are the obvious tests once you’ve got past the initial step of loading the IOT and seeing what happens. First, I think, would be “What if the IOT weren’t empty before the test started”; second would be “IOTs can have overflow segments, what impact might one have?”; third would be “Do secondary indexes have any effects?”; finally “What happens with bitmap indexes and the requirement for a mapping table?” (Then, of course, you can worry about mixing all the different possibilities together – but for the purposes of this note I’m just going to play with two simple examples: non-empty starting tables, and overflow segments.)

Here’s some code to define a suitable table:


create table t2 
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
	3 * rownum			id,
	lpad(rownum,10,'0')		v1,
	lpad('x',50,'x')		padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
order by
	dbms_random.value
;

begin
	dbms_stats.gather_table_stats(
		ownname     => null,
		tabname     => 'T2',
		method_opt  => 'for all columns size 1'
	);
end;
/

create table t1(
	id,
	v1,
	padding,
	constraint t1_pk primary key(id)
)
organization index
-- including v1
-- overflow
nologging
as
select * from t2
;

begin
	dbms_stats.gather_table_stats(
		ownname     => null,
		tabname     => 'T1',
		method_opt  => 'for all columns size 1'
	);
end;
/

I’ve created a heap table t2 that holds 100,000 rows with an id column that is randomly ordered; then I’ve used this ta1ble as a source to create an IOT, with the option to have an overflow segment that contains just the 100 character padding columns. I’ve used 3 * rownum to define the id column for t2 so that when I insert another copy of t2 into t1 I can add 1 (or 2) to the id and interleave the new data with the old data. (That’s another thought about IOT testing – are you loading your data in a pre-existing order that suits the IOTs or is it arriving in a way that’s badly out of order with respect to the IOT ordering; and does your data go in above the current high value, or spread across the whole range, or have a partial overlap with the top end of the range and then run on above it.)

Have created the starting data set, here’s the test:


execute snap_my_stats.start_snap
execute snap_events.start_snap

insert 
	/*  append */
into t1
select
	id + 1, v1, padding
from
	t2
;


execute snap_events.end_snap
execute snap_my_stats.end_snap

All I’m doing is using a couple of my snapshot packages to check the work done and time spent while insert 100,000 interleaved rows – which are supplied out of order – into the existing table. As shown the “append” is a comment, not a hint, so I’ll be running the test case a total of 4 times: no overflow, with and without the hint – then with the overflow, with and without the hint. (Then, of course, I could run the test without the overflow but an index on v1).

Here are some summary figures from the tests – first from the test without an overflow segment:

                                      Unhinted       With Append
                                  ============      ============
CPU used when call started                 153               102
CPU used by this session                   153               102
DB time                                    166               139

redo entries                           130,603            42,209
redo size                           78,315,064        65,055,376

sorts (rows)                                30           100,031

You’ll notice that with the /*+ append */ hint in place there’s a noticeable reduction in redo entries and CPU time, but this has been achieved at a cost of sorting the incoming data into order. The reduction in redo (entries and size) is due to an “array insert” effect that Oracle can take advantage of with the delayed index maintenance that takes place when the append hint is legal (See the section labelled Option 4 in this note). So even with an IOT with no overflow there’s a potential benefit to gain from direct path loading that depends on how much the new data overlaps the old data, and there’s a penalty that depends on the amount of sorting you’d have to do.

What happens in my case when I move the big padding column out to an overflow segment – here are the equivalent results:


Headline figures                      Unhinted       With Append
================                  ============      ============
CPU used when call started                 158                52
CPU used by this session                   158                52
DB time                                    163                94
redo entries                           116,669            16,690
redo size                           51,392,748        26,741,868
sorts (memory)                               4                 5
sorts (rows)                                33           100,032

Interestingly, comparing the unhinted results with the previous unhinted results, there’s little difference in the CPU usage between having the padding column in the “TOP” section of the IOT compared to having it in the overflow segment, though there is a significant reduction in redo (the index entries are still going all over the place one by one, but the overflow blocks are being pinned and packed much more efficiently). The difference between having the append hint or not, though, is damatic. One third of the CPU time (despited still having 100,000 rows to sort), and half the redo. One of the side effects of the overflow, of course, is that the things being sorted are much shorted (only the id and v1 columns that go into the TOP section, and not the whole IOT row.

So, if you already have an overflow segment that caters for a significant percentage of the row, it looks as if the benefit you could get from using the /*+ append */ hint would far outweigh the penalty of sorting you have to pay. Of course, an IOT with a large overflow doesn’t look much different from a heap table with index – so perhaps that result isn’t very surprising.

I’ll close by re-iterating Connor’s closing comment:

Direct mode insert is a very cool facility, but it doesn’t mean that it’s going to be the best option in every situation.

Before you dive in and embrace it, or ruthlessly push it to one side, make sure you do some testing that reflects the situations you have to handle.

June 29, 2018

Truncate upgrade

Filed under: 12c,Infrastructure,Oracle,Upgrades — Jonathan Lewis @ 8:22 am BST Jun 29,2018

Connor McDonald produced a tweet yesterday linking to a short video he’d created about an enhancement to the truncate command in 12c. If you have referential integrity declared between a parent and child table then in 12c you can truncate the parent table and Oracle will truncate the child table for you – rather than raising an error. The feature requires the foreign key constraint to be declared “on delete cascade” – which is an option that I don’t see used very often. Unfortunately if you try to change an existing foreign key constraint to meet this requirement you’ll find that you can’t (yet) use the “alter table modify constraint” to make the necessary change. As Connor pointed out, you’ll have to drop and recreate the constraint – which leaves you open to bad data getting into the system or an outage while you do the drop and recreate.

If you don’t want to stop the system but don’t want to leave even a tiny window for bad data to arrive here’s a way to do it. In summary:

  1. Add a virtual column to the child table “cloning” the original foreign key column
  2. Create an index on the  virtual column (if you have concerns about “foreign key locking”)
  3. Add a foreign key constraint based on the virtual column
  4. Drop the old foreign key constraint
  5. Recreate the old foreign key constraint “on delete cascade”
  6. Drop the virtual column

Here’s some sample SQL:


rem
rem	Script:		122_truncate_workaround.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Jun 2018
rem	Purpose:	
rem
rem	Last tested 
rem		18.1.0.0	via LiveSQL
rem		12.2.0.1
rem		12.1.0.2

drop table child;
drop table parent;

create table parent (
	p number,
	constraint p_pk primary key(p)
);

create table child (
	c	number,
	p	number,
	constraint c_pk primary key(c),
	constraint c_fk_p foreign key (p) references parent
);

create index c_fk_p on child(p);

insert into parent values(1);
insert into child values(1,1);

commit;

prompt	==========================================================================
prompt	Truncate  should fail with
prompt	ORA-02266: unique/primary keys in table referenced by enabled foreign keys
prompt	==========================================================================

truncate table parent;

alter table child add (
	pv generated always as (p+0) virtual
)
;

create index c_ipv on child(pv) online;

alter table child add constraint c_fk_pv
	foreign key (pv)
	references parent(p)
	on delete cascade
	enable novalidate
;
alter table child modify constraint c_fk_pv validate;

alter table child drop constraint c_fk_p;

prompt	===================================================================================
prompt	Illegal insert (first 1) should fail with
prompt	ORA-02291: integrity constraint (TEST_USER.C_FK_PV) violated - parent key not found
prompt	===================================================================================

insert into child (c,p) values(2,2);
insert into child (c,p) values(2,1);
commit;

alter table child add constraint c_fk_p
	foreign key (p)
	references parent(p)
	on delete cascade
	enable novalidate
;

alter table child modify constraint c_fk_p validate;

prompt	===================================================
prompt	Dropping the virtual column results in Oracle
prompt	dropping the index and constraint at the same time
prompt	===================================================

alter table child drop column pv;

The overhead of this strategy is significant – I’ve created an index (which you may not need, or want, to do) in anticipation of a possible “foreign key locking” issue – and I’ve used the online option to avoid locking the table while the index is created which means Oracle has to use a tablescan to acquire the data. I’ve enabled a new constraint without validation (which takes a brief lock on the table) then validated it (which doesn’t lock the table but could do a lot of work). Then I’ve dropped the old constraint and recreated it using the same novalidate/validate method to minimise locking time. If I were prepared simply to drop and recreate the original foreign key I wouldn’t need to create that index and I’d only do one validation pass rather than two.

 

June 1, 2018

Index Bouncy Scan 4

Filed under: 12c,Execution plans,Indexing,Oracle,Partitioning,Performance — Jonathan Lewis @ 9:19 am BST Jun 1,2018

There’s always another hurdle to overcome. After I’d finished writing up the “index bouncy scan” as an efficient probing mechanism to find the combinations of the first two columns (both declared not null) of a very large index a follow-up question appeared almost immediately: “what if it’s a partitioned index”.

The problem with “typical” partitioned indexes is that the smallest value of the leading column might appear in any of the partitions, and the combination of that value and the smallest value for the second column might not appear in all the partitions where the smallest value appears. Consider a table of 10 partitions and a locally partitioned index on (val1, val2) where neither column is the partition key. The smallest value of val1 – call it k1 may appear only in partitions 4, 7, 8, 9, 10; the lowest combination of (val1, val2) – call it (k1, k2) may appear only in partitions 8 and 10. In a global (or globally partitioned) index the pair (k1, k2) would be at the low (leftmost) end of the index, but to find the pair in a locally partitioned index we have to probe the leftmost end of 10 separate index partitions – and once we’ve done that each “bounce” requires us to probe 10 index partitions for the first (val1, val2) pair where val1 = k1 and val2 is just just greater than k2, or val1 is just greater than k1 and val2 is the minimum for that value of val1. The more partitions we have the greater the number of index partitions we have to probe at each step and the more likely it is that we ought to switch to a brute force index fast full scan with aggregate.

Here’s the starting point for solving the problem (maybe) – I’ll create a simple partitioned table, and use the “bouncy scan” code from the earlier posting with the table and column names adjusted accordingly:


rem
rem     Script:         bouncy_index_3.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10) ; alter table pt1 modify(status not null); execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

prompt  ==================================================
prompt  Make some rows in the last partition have a status
prompt  that won't be found in the first partition.
prompt  ==================================================

column namespace format 99999999
column partition_name new_value m_part

select  partition_name
from    user_tab_partitions
where   table_name = 'PT1'
order by
        partition_position
;

update pt1 partition (&m_part) set status = 'MISSING' where rownum <= 10;

select
        dbms_mview.pmarker(rowid), status, namespace
from    pt1
where   status = 'MISSING'
;

I’ve created a hash partitioned copy of view all_objects, duplicating it 10 times and created a local index on the columns (status, namespace). My data has two values for status, ‘VALID’ and ‘INVALID’, and there are about 10 values for the namespace. I’ve then updated a few rows in the last partition, giving them a status value that is between the two current values – this is just one little test case to help me check that my code is going to catch all values even if they don’t appear in the first table partition.

Here’s the query from the earlier posting – and it does get the right results – followed by the execution plan:


alter session set statistics_level = all;

set serveroutput off
set linesize 180
set pagesize 60

prompt  =============================================================
prompt  Original Query, showing expensive access for driving minimums
prompt  =============================================================

with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select
                        /*+ index(pt1) no_index_ffs(pt1) */
                        status, namespace,
                        row_number() over(order by status, namespace) rn
                from    pt1
        )
        where
                rn = 1
        union all
        select
                v1.status, v1.namespace
        from    bounce1,
                lateral (
                              select  /*+ index(pt1) no_index_ffs(pt1) no_decorrelate */
                                      pt1.status, pt1.namespace
                              from    pt1
                              where   pt1.status > bounce1.status
                              and     rownum = 1
                ) v1
        where   bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last outline'));


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        | 16378 (100)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 | 16378   (4)|       |       |     10 |00:00:00.58 |    1869 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 | 16377   (4)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     12 |00:00:00.58 |    1869 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |  8157   (4)|       |       |      2 |00:00:00.58 |    1747 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      2 |00:00:00.58 |    1747 |  1024 |  1024 | 2048  (0)|
|*  6 |       VIEW                                   |                 |      1 |      1 |  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |                 |      1 |    617K|  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |  2048 |  2048 | 2048  (0)|
|   8 |         PARTITION HASH ALL                   |                 |      1 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.34 |    1732 |       |       |          |
|   9 |          INDEX FULL SCAN                     | PT1_I1          |      4 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.15 |    1732 |       |       |          |
|  10 |       NESTED LOOPS                           |                 |      2 |      1 |  4110   (4)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  11 |        RECURSIVE WITH PUMP                   |                 |      2 |        |            |       |       |      2 |00:00:00.01 |       0 |       |       |          |
|  12 |        VIEW                                  | VW_LAT_1BBF5C63 |      2 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|* 13 |         COUNT STOPKEY                        |                 |      2 |        |            |       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  14 |          PARTITION HASH ALL                  |                 |      2 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|* 15 |           INDEX RANGE SCAN                   | PT1_I1          |      5 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|  16 |     SORT AGGREGATE                           |                 |     10 |      1 |            |       |       |     10 |00:00:00.01 |     122 |       |       |          |
|  17 |      PARTITION HASH ALL                      |                 |     10 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  18 |       FIRST ROW                              |                 |     40 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     122 |       |       |          |
|* 19 |        INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     40 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  20 |     RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     10 |00:00:00.01 |       0 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
   4 - filter("BOUNCE1"."STATUS" IS NOT NULL)
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "STATUS","NAMESPACE")<=1) 13 - filter(ROWNUM=1) 15 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  19 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

In terms of time the query doesn’t seem to have done too badly – but I’m only using a small data set and we can see from the numbers that we haven’t produced an efficient plan. Operations 8 and 9 tell us that we’ve done an index full scan on every single partition before passing the data up for a window sort operation. That’s clearly a bad thing, but we did have an index() hint at that bit of code that worked very well for the simple (global) index so maybe we should have taken that out before testing (except it doesn’t help much to do so since Oracle still scans all 617K rows, changing to an index fast full scan).

Apart from that massive load the rest of the query looks quite efficient. We keep seeing “partition hash all” of course – whatever we do we tend to do it to 4 separate partitions one after the other – but everything else we do looks rather efficient. But there is another problem – and this is where the importance of inserting the rows with status = ‘MISSING’ shows up: this query didn’t find them! We have a predicate “rownum = 1” in the second half of the bounce1 recursive subquery and because we’re using a partitioned index we’ve managed to find a row that looks appropriate in an early partition when the row we really needed doesn’t appear until the last partition.

Let’s return to this problem later – first we want to check if the rest of the query will run efficiently and give us the right answer if we can find some way of getting the starting values; so let’s use a strategy we’ve used before – replace the bounce1 subquery with a union all select from dual:


with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select 'INVALID' status, 1 namespace from dual
                union all
                select 'MISSING', 4 from dual
                union all
                select 'VALID', 1 from dual
        )
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last partition outline'));

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |        |      1 |        |    76 (100)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   1 |  SORT ORDER BY                             |        |      1 |      6 |    76   (2)|       |       |     11 |00:00:00.01 |     132 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |        |      1 |      6 |    75   (0)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|        |      1 |        |            |       |       |     14 |00:00:00.01 |     132 |  1024 |  1024 |          |
|   4 |     VIEW                                   |        |      1 |      3 |     6   (0)|       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   5 |      UNION-ALL                             |        |      1 |        |            |       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   6 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   7 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   8 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   9 |     SORT AGGREGATE                         |        |     11 |      1 |            |       |       |     11 |00:00:00.01 |     132 |       |       |          |
|  10 |      PARTITION HASH ALL                    |        |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  11 |       FIRST ROW                            |        |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     132 |       |       |          |
|* 12 |        INDEX RANGE SCAN (MIN/MAX)          | PT1_I1 |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  13 |     RECURSIVE WITH PUMP                    |        |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
  12 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

This gets us the right answer, very efficiently. There are only 11 rows in the result set and we have an average 12 buffer visits per row – which is reasonble given that we (probably) have to probe 4 index partitions for every row. So that’s 11 * 4 * 3 buffer visits per probe – which seems just about optimal.

The next step is to figure out a way of getting the (three in our case) starting points while using a partitioned index. Here’s a query we can use for bounce1:


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
)
select * from bounce1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

It looks a little convoluted with all the inline select statements, but they all do very small amounts of work and they’re only reading the index leaf blocks that you have to read. We know from yesterday’s post that Oracle can execute the scalar subqueries at lines 3 and 4 very efficiently; we can hope (and check) that the lateral() subqueries driven by the single values from the recursive row in bounce1 will operate just as efficiently – and here’s the plan:


----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                 |      1 |        |   166 (100)|       |       |      4 |00:00:00.01 |     132 |
|   1 |  VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   3 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   4 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   5 |      INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   6 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   7 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   8 |      FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  9 |       INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  11 |         PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  12 |          INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  13 |    FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  14 |    NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  15 |     NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  16 |      RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  17 |      VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |       SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  19 |        PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  20 |         FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 21 |          INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |     VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  23 |      SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  24 |       PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  25 |        FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 26 |         INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |           PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |            FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |             INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - access("STATUS"=)
  21 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  26 - access("PT1"."STATUS"=)
  30 - access("PT2"."STATUS">:B1)

Although we have done lots of individual probes into the index they have all been very efficient using a min/max access and an average of about 3 buffer visits per probe. So we can now insert this new bounce1 subquery into the previous query in place of the union all of dual and check that the two pieces of the query cooperate.


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace from bounce1
        union all
        select  bounce2.status, (select min(t.namespace) namespace from pt1 t where t.namespace > bounce2.namespace and status=bounce2.status) namespace
        from    bounce2
        where   bounce2.status is not null
        and     bounce2.namespace is not null
)
select  *
from    bounce2
where   namespace is not null
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |                 |      1 |        |   396 (100)|       |       |     11 |00:00:00.01 |     266 |
|*  1 |  VIEW                                       |                 |      1 |      4 |   396   (1)|       |       |     11 |00:00:00.01 |     266 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     15 |00:00:00.01 |     266 |
|   3 |    VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   4 |     UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   5 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   8 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   9 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|* 11 |         INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  12 |          SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  13 |           PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  14 |            INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  15 |      FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  16 |      NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  17 |       NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |        RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  19 |        VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  20 |         SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  21 |          PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |           FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 23 |            INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  24 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  25 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  26 |         PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 28 |           INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  29 |            SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  30 |             PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  31 |              FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 32 |               INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  33 |    SORT AGGREGATE                           |                 |     11 |      1 |            |       |       |     11 |00:00:00.01 |     134 |
|  34 |     PARTITION HASH ALL                      |                 |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  35 |      FIRST ROW                              |                 |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     134 |
|* 36 |       INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  37 |    RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |
------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("NAMESPACE" IS NOT NULL)
  11 - access("STATUS"=)
  23 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  28 - access("PT1"."STATUS"=)
  32 - access("PT2"."STATUS">:B1)
  36 - access("STATUS"=:B1 AND "T"."NAMESPACE">:B2)

Job done. We’ve found the distinct set of pairs without having to scan the entire index. We’ve found 11 pairs at a total cost of 266 buffer gets. For comparitive purposes the query totalled 56 buffer visits when I recreated the table as a non-partitioned table (again updating a few rows to status = ‘MISSING’).

It’s important to note that this query can only work this efficiently in 12.2 (and possibly in a suitably patched 11.2.0.4) because of the optimizer’s ability to use the min/max operation for queries like: “select max(col2) where col1 = (select max()…))”. When I ran the final query on 12.1.0.2 the execution plan changed around lines 11 and 28 where 12.2.0.1 could use the aggregate subquery to drive the min/max scan 12.1.0.2 did a real range scan with aggregate (which was extremely expensive at one point).

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|  23 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |  1206   (2)|       |       |      3 |00:00:05.43 |    2414 |
|  24 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:05.43 |    2414 |
|  25 |         PARTITION HASH ALL                  |                 |      3 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:05.84 |    2414 |
|* 26 |          INDEX RANGE SCAN                   | PT1_I1          |     12 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:02.03 |    2414 |
|  27 |           SORT AGGREGATE                    |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |            PARTITION HASH ALL               |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |             FIRST ROW                       |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |              INDEX RANGE SCAN (MIN/MAX)     | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, this makes a dramatic difference to the work Oracle has to do – in this case 2,414 buffer gets and 845K rows examined. As I said in yestrday’s post – there’s a patch for 11.2.0.4, so there could be a patch for 12.1.0.2 if you ask for it, but it looks like no-one has done so yet.

<h3>Footnote:</h3>

I could have used a lateral() view in the first half of bounce1 to reduce the reported number of probes of pt1_i1 in the plan – but it made the code extremely messy, I had to include a /*+ no_decorrelate */ hint in it, and it increased the number of buffer visits slightly because the optimizer seemed to lose the option for a min/max scan in this particular lateral join.

 

May 31, 2018

Min/Max upgrade

Filed under: 12c,Indexing,Oracle,Partitioning,Performance — Jonathan Lewis @ 2:13 pm BST May 31,2018

Here’s a nice little optimizer enhancement that appeared in 12.2 to make min/max range scans (and full scans) available in more circumstances. Rather than talk through it, here’s a little demonstration:

rem
rem     Script:         122_minmax.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1        Good path
rem             12.1.0.2        Bad path

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10)  -- > comment to avoid format wordpress issue
;

alter table pt1 modify(status not null);

execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

alter session set statistics_level = all;
set serveroutput off
set linesize 156
set pagesize 60
set trimspool on

select  min(status) from pt1;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = 'INVALID';
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = (select min(status) from pt1);
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

The basic “min/max” optimisation allows Oracle to avoid a massive sort aggregate – Oracle doesn’t need to acquire a lot of data and sort it when it knows that the “left hand” end of an index is the low values and the “right hand” is the high values so, for example, in the first query above the optimizer could simply walk down the index branches to the left hand leaf and look at the single lowest entry in the leaf block to determine the lowest value for status … if the index had been a global index.

Things get a little messy, though, when the index is locally partitioned and your query isn’t about the partition key and there’s no suitable global index. Once upon a time (IIRC) Oracle would simply have to do an index fast full scan across all index partitions to handle such a query, but some time ago it got a lot cleverer and was enhanced to do a min/max scan on each partition in turn getting one value per partition very efficiently, then aggregating across those values to find the global minimum.

Here are the three execution plans (with rowsource execution stats pulled from memory) taken from 12.1.0.2 for the queries above:


-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   3 |    INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
-----------------------------------------------------------------------------------------------------------------------------


-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE               |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL          |        |      1 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
|   3 |    FIRST ROW                  |        |      4 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      12 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
-------------------------------------------------------------------------------------------------------------------------------


-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |        |      1 |        |   337 (100)|       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   2 |   PARTITION HASH ALL           |        |      1 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|*  3 |    INDEX FAST FULL SCAN        | PT1_I1 |      4 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|   4 |     SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |      0 |
|   5 |      PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
|   6 |       INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
-----------------------------------------------------------------------------------------------------------------------------------------

In the first plan Oracle has done an “index full scan (min/max)” across each of the four partitions in turn to return one row very cheaply from each, then aggregated to find the overall minimum.

In the second plan Oracle has done an “index range scan (min/max)” in exactly the same way, since it was able to find the start point in the index for the status ‘INVALID’ very efficiently.

In the third plan Oracle has been able to find the minimum value for the status (‘INVALID’) very efficiently in the subquery, and has passed that single value up to the main query, which has then used a brute force approach to search the whole of every partition of the index for every occurrence (all 10 of them) of the value ‘INVALID’ and then aggregated them to find the minimum namespace. Despite “knowing”, by the time the main query runs, that there will be a single value to probe for the status, the optimizer has not anticipated the fact that the final query will effectively become the same as the preceding one. As a result we’ve read 2,242 data blocks into the cache.

Turn, then, to the execution plan from 12.2.0.1 for this last query:


---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      24 |
|   1 |  SORT AGGREGATE                 |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   2 |   PARTITION HASH ALL            |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   3 |    FIRST ROW                    |        |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)  | PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   5 |      SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
---------------------------------------------------------------------------------------------------------------------------------

In 12.2 you can see that the main query is now doing an “index range scan (min/max)” on each index partition in turn, based on the incoming (though unknown at parse time) single value from the subquery. As a result the total work done is a mere 24 buffer visits.

There have been a couple of occasions in the past where I’ve had to write some PL/SQL to work around little details like this. It’s nice to know simple tables and partitoned tables with local indexes can now behave the same way. I also wonder whether there may be sites that could drop (or drop columns from, or make local) some indexes that they’ve previously created to  handle queries of the “most recent occurence” type.

If, for any reason, you need to disable this enhancement, it’s controlled by fix_control (v$system_fix_control) “18915345 Allow MIN/MAX optimization for pred having single row subquery” which can be set in the startup file, at the system level, or in the session.

Update

Checking MoS for the bug number I found that the limitation had been reported for 11.2.0.3, with “Fixed in product version” reported as 12.2; but there are patches for various releases of 11.2.0.4, though none yet for 12.1.0.2 – but if you think you need it you can always try raising an SR.

 

May 30, 2018

Index Bouncy Scan 3

Filed under: 12c,Bugs,CBO,Oracle — Jonathan Lewis @ 1:15 pm BST May 30,2018

This is a follow-up to a problem I had with yesterday’s example of using recursive CTEs to “bounce” along a multi-column index to pick out the unique set of combinations of the first two columns. Part of the resulting query used a pair of aggregate scalar subqueries in a select list – and Andrew Sayer improved on my query by introducing a “cross apply” (which I simply hadn’t thought of) which the optimizer transformed into a lateral view (which I had thought of, but couldn’t get to work).

After seeing what the Andrew and the optimizer had done I looked a little more closely at my lateral view experiment and modified it so that it worked. Here are the three critical versions of the relevant code fragment; first is my original code, then Andrew’s cross apply, then my working lateral view version:

select
        (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
        (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
from    bounce1
where   bounce1.val1 is not null
 
 
select
        ca.val1 ,ca.val2
from    bounce1
cross  apply (select val1, val2
              from  (select /*+ index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
                    )
             ) ca
where  bounce1.val1 is not null
 
----

select
        ca.val1 ,ca.val2
from    bounce1, 
        lateral(select val1, val2
              from  (select /*+ index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
                    )
             ) ca
where  bounce1.val1 is not null

All I’ve done to modify Andrew’s code is put a comma after the table (actually CTE) bounce1, then change “cross apply” to “lateral”. Compare the resulting text with the following lateral version that doesn’t work:


select
        ca.val1 ,ca.val2
from    bounce1, 
        lateral (
                   select /*+ index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
             ) ca
where  bounce1.val1 is not null

To get from not working to working all I’ve done is wrap the text in my lateral() subquery inside one more (apparently redundant) layer of “select * from ()”!

In fact my assumption that my code wasn’t working was incorrect – what was really going on was that the code I had written was producing the wrong results but I thought that I had made a mistake in the way I was writing it and couldn’t figure out what I had done wrong.

Problem Solving:

To get a better idea of what’s going on, I took a closer look at the execution plans. Here are the plans (main body only) for the two variants of using the lateral() view – the first from the SQL with the “redundant” select, the second as I originally wrote it. Notice that the number of rows (A-Rows) returned in the first case is the 30 expected while in the second case it’s only 10.


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        |   125 (100)|     30 |00:00:00.01 |      40 |     28 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 |   125   (3)|     30 |00:00:00.01 |      40 |     28 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 |   124   (2)|     30 |00:00:00.01 |      40 |     28 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |     33 |00:00:00.01 |      40 |     28 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |    61   (2)|      3 |00:00:00.01 |       8 |      4 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |      3 |00:00:00.01 |       8 |      4 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |                 |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |      1 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |                 |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 |      1 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK           |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |      1 |       |       |          |
|   9 |       NESTED LOOPS                           |                 |      3 |      1 |    31   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|  10 |        RECURSIVE WITH PUMP                   |                 |      3 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
|  11 |        VIEW                                  | VW_LAT_1BBF5C63 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|  12 |         VIEW                                 |                 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|* 13 |          COUNT STOPKEY                       |                 |      3 |        |            |      2 |00:00:00.01 |       6 |      3 |       |       |          |
|* 14 |           INDEX RANGE SCAN                   | T1_PK           |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|  15 |     SORT AGGREGATE                           |                 |     30 |      1 |            |     30 |00:00:00.01 |      32 |     24 |       |       |          |
|  16 |      FIRST ROW                               |                 |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|* 17 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK           |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|  18 |     RECURSIVE WITH PUMP                      |                 |     11 |        |            |     30 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        |   125 (100)|     10 |00:00:00.01 |      16 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 |   125   (3)|     10 |00:00:00.01 |      16 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 |   124   (2)|     10 |00:00:00.01 |      16 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |     11 |00:00:00.01 |      16 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |    61   (2)|      1 |00:00:00.01 |       4 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |      1 |00:00:00.01 |       4 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |                 |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |                 |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK           |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       NESTED LOOPS                           |                 |      1 |      1 |    31   (0)|      0 |00:00:00.01 |       2 |       |       |          |
|  10 |        RECURSIVE WITH PUMP                   |                 |      1 |        |            |      1 |00:00:00.01 |       0 |       |       |          |
|* 11 |        VIEW                                  | VW_DCL_1BBF5C63 |      1 |      1 |     2   (0)|      0 |00:00:00.01 |       2 |       |       |          |
|* 12 |         COUNT STOPKEY                        |                 |      1 |        |            |      1 |00:00:00.01 |       2 |       |       |          |
|  13 |          INDEX FULL SCAN                     | T1_PK           |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|  14 |     SORT AGGREGATE                           |                 |     10 |      1 |            |     10 |00:00:00.01 |      12 |       |       |          |
|  15 |      FIRST ROW                               |                 |     10 |      1 |     2   (0)|      9 |00:00:00.01 |      12 |       |       |          |
|* 16 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK           |     10 |      1 |     2   (0)|      9 |00:00:00.01 |      12 |       |       |          |
|  17 |     RECURSIVE WITH PUMP                      |                 |     11 |        |            |     10 |00:00:00.01 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------

Most importantly we can see that the optimizer has used two different transformations. For the working query we see the view name VW_LAT_xxxxxxxx at operation 11, this is Oracle implementing a lateral view; for the problem query we see the view name VW_DCL_xxxxxxxx at operation 11, which is Oracle implementing a transformation to a “decorrelated lateral view”.

My first test after noting this difference was to see what would happen in I added the hint /*+ no_query_transformation */ to the query: it resulted in the VW_DCL_xxxxxxxx view name changing to VW_LAT_xxxxxxxx and the query producing the right result. Andrew Sayer, on the ODC thread, then pointed out that he’d done a couple more experiments and used the /*+ no_decorrelate() */ hint so I tried that with my query, adding it (with no parameters) to the subquery inside the lateral() clause – again the plan changed from using VW_DCL to VW_LAT and the results were correct.

Test Case

Bottom line on this – it looks like the optimizer is decorrelating a subquery when it shouldn’t, leading to wrong results. To make it easier to see this anomaly I stripped the original sample down to a basic test case starting with the table that I used in the previous posting:

rem
rem     Script:         decorralate.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem
rem     Last tested 
rem             18.1.0.0  -- via liveSQL
rem             12.2.0.1
rem             12.1.0.2
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,3)                 val1,
        mod(rownum-1,10)                val2,
        lpad('x',100,'x')               padding
from
        generator       v1
order by
        dbms_random.value
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add constraint t1_pk primary key(val1, val2, id);

Now two versions of a simplified piece of code that should select the distinct values of val1 greater than the lowest value (each row in the UNION ALL of dual is emulating the way in which yesterday’s recursive CTE was effectively saying “this is a current known value, find the next higher”):


prompt  =============
prompt  Right results
prompt  =============

select
        v1.val1, v1.val2
from    (
        select  0 val1, 0 val2 from dual
        union all
        select 1,0 from dual
        union all
        select 2,0 from dual
        ) bounce1,
        lateral (
            select val1, val2 from (
              select  /*+ index(t1) no_index_ffs(t1) */
                      t1.val1, t1.val2
              from    t1
              where   t1.val1 > bounce1.val1
              and     rownum = 1
            )
        ) v1
;

prompt  ===========================================
prompt  Wrong results -- "redundant" select removed
prompt  ===========================================

select
        v1.val1, v1.val2
from    (
        select  0 val1, 0 val2 from dual
        union all
        select 1,0 from dual
        union all
        select 2,0 from dual
        ) bounce1,
        lateral (
            -- select val1, val2 from (
              select  /*+ index(t1) no_index_ffs(t1) */
                      t1.val1, t1.val2
              from    t1
              where   t1.val1 > bounce1.val1
              and     rownum = 1
            -- )
        ) v1
;

Here’s a cut-n-paste from running the two queries:


=============
Right results
=============

      VAL1       VAL2
---------- ----------
         1          0
         2          0

2 rows selected.

============================================
Wrong results  -- "redundant" select removed
============================================

no rows selected

Finally, to get an idea of what’s gone wrong – and to show that the optimizer has done something wrong when attempting to decorrelate – we can take a look at the optimizer trace file to see the final transformed SQL that the optimizer has produced a plan for. (I enabled the trace with the command “alter session set events ‘trace [rdbms.SQL_Transform.*]’;” to limit the trace to just the information about optimizer transformations.) This – cosmetically altered – is the final “unparsed” query:

select 
        vw_dcl_a18161ff.val1 val1,
        vw_dcl_a18161ff.val2 val2 
from    ( 
                (select 0 val1 from sys.dual dual) 
                union all  
                (select 1 1 from sys.dual dual) 
                union all  
                (select 2 2 from sys.dual dual)
        ) bounce1, 
        (
        select
                 /*+ no_index_ffs (t1) index (t1) */ 
                t1.val1 val1_0,
                t1.val2 val2_1 
        from
                test_user.t1 t1
        where 
                rownum = 1
        ) vw_dcl_a18161ff 
where 
        vw_dcl_a18161ff.val1 > bounce1.val1

As you can see, the lateral view has turned into a non-mergeable inline view which selects the first row available from t1 by following the supplied hints, and joins that single row result set to bounce1. I have a suspicion that lateral views which include rownum predicates should not be decorrelated. I have looked on MoS to see if I can find any bugs related to decorrelating lateral views, but either there are none or my search terms weren’t good enough.

 

Upgrades

Filed under: 12c,Bugs,Function based indexes,Indexing,Oracle,Upgrades — Jonathan Lewis @ 10:08 am BST May 30,2018

One of my maxims for Oracle performance is: “Don’t try to be too clever”. Apart from the obvious reason that no-one else may be able to understand how to modify your code if the requirements change at a future date, there’s always the possibility that an Oracle upgrade will mean some clever trick you implemented will simply stop working.

While searching for information about a possible Oracle bug recently I noticed the following fix control (v$system_fix_control) in 12.2.0.1:


     BUGNO OPTIMIZE SQL_FEATURE                        DESCRIPTION                                                             VALUE
---------- -------- ---------------------------------- ---------------------------------------------------------------- ------------
  18385778          QKSFM_CARDINALITY_18385778         avoid virtual col usage if FI is unusable or invisible 

Maybe that’s just invalidated an idea I published 12 years ago.

I haven’t researched the bug or any underlying SR, but I can think of valid argument both for and against the fix as described.

 

 

May 29, 2018

Index Bouncy Scan 2

Filed under: 12c,Index skip scan,Oracle,Performance — Jonathan Lewis @ 12:27 pm BST May 29,2018

I wrote a note some time last year about taking advantage of the “index range scan (min/max)” operation in a PL/SQL loop to find the small number distinct values in a large single column index efficiently (for example an index that was not very efficient but existed to avoid the “foreign key locking” problem. The resulting comments included pointers to other articles that showed pure SQL solutions to the same problem using recursive CTEs (“with” subqueries) from Markus Winand and Sayan Malakshinov: both writers also show examples of extending the technique to cover more cases than the simple list of distinct values.

The topic came up again on the ODC (OTN) database forum a couple of days ago; one of the replies linked back to my original posting, another gave the recursive solution for a single column index – so I ended up seeing the following question twice, once as a comment on my blog, once in the forum: “Can you extend this method to a two column index, what about an N column index ?”

Here’s a walk-through of working out one possible solution for the two-column requirement – how to find all the distinct combinations for the first two columns of a very large index without having to scan and aggregate the whole index. We start with a suitable table and index.


rem
rem     Script:         bouncy_index.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,3)                 val1,
        mod(rownum-1,10)                val2,
        lpad('x',100,'x')               padding
from
        generator       v1
order by
        dbms_random.value
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add constraint t1_pk primary key(val1, val2, id);

I’ve created a table with 3 values for val1, 10 values for val2, with a total of 30 combinations. The addition of the primary key starting with (val1, val2) is just a lazy way to ensure that I have a suitable index AND val1 and val2 are both declared not null.

With this data my first step will be to demonstrate the recursive CTE (“with” subquery) used by Andrew Sayer in the ODC posting to get the distinct values for val1 using three index “index range scan (min/max)”probes. I’ve included the in-memory execution plan with rowsource execution stats to show that this does a minimal amount of work.

The results in this note come from 12.2.0.1:


set serveroutput off
alter session set statistics_level = all;

with bouncy (val1)
as (
        select  min(val1) val1
        from    t1
        union all
        select  (select min(t1.val1) val1 from t1 where t1.val1 > bouncy.val1) val1
        from    bouncy
        where   bouncy.val1 is not null
    )
select  *
from    bouncy
where   bouncy.val1 is not null
order by
        val1
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    19 (100)|      3 |00:00:00.01 |       7 |      4 |       |       |          |
|   1 |  SORT ORDER BY                             |       |      1 |      2 |    19   (6)|      3 |00:00:00.01 |       7 |      4 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |       |      1 |      2 |    18   (0)|      3 |00:00:00.01 |       7 |      4 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |      4 |00:00:00.01 |       7 |      4 |  1024 |  1024 |          |
|   4 |     SORT AGGREGATE                         |       |      1 |      1 |            |      1 |00:00:00.01 |       2 |      1 |       |       |          |
|   5 |      INDEX FULL SCAN (MIN/MAX)             | T1_PK |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       2 |      1 |       |       |          |
|   6 |     SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       5 |      3 |       |       |          |
|   7 |      FIRST ROW                             |       |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |      3 |       |       |          |
|*  8 |       INDEX RANGE SCAN (MIN/MAX)           | T1_PK |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |      3 |       |       |          |
|   9 |     RECURSIVE WITH PUMP                    |       |      4 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("BOUNCY"."VAL1" IS NOT NULL)
   8 - access("T1"."VAL1">:B1)

As you can see I’ve done an “index full scan (min/max)” as the first step of the recursive query, visiting just two buffered blocks (the index leaf-block count is 27 – roughly 9 per value of val1 – so Oracle is clearly doing an efficient access for that value, it’s not rally a “full” scan. We then see 3 “index range scan (min/max)” at roughly 2 buffer visits each to collect the remaining values. (There’s probably a small saving in buffer gets due to the pinning that takes place).

So we can get the val1 values very easily and efficiently with this recurstive CTE technology. Let’s write some code that uses the same technology to find the val2 values for each possible val1 value in turn:

with bounce2 (val1, val2)
as (
        select val1, val2 from (
                select  0 val1, 0 val2 from dual
                union all
                select 1,0 from dual
                union all
                select 2,0 from dual
        )
        union all
        select  bounce2.val1, (select min(t1.val2) val2 from t1 where t1.val1 = bounce2.val1 and t1.val2 > bounce2.val2) val2
        from    bounce2
        where   bounce2.val2 is not null
--      and     bounce2.val1 is not null
)
select * from bounce2
where
        bounce2.val2 is not null
and     bounce2.val1 is not null        -- > redundant predicate
order by
        val1, val2
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    27 (100)|     30 |00:00:00.01 |      32 |     24 |       |       |          |
|   1 |  SORT ORDER BY                             |       |      1 |      6 |    27   (4)|     30 |00:00:00.01 |      32 |     24 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |       |      1 |      6 |    26   (0)|     30 |00:00:00.01 |      32 |     24 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |     33 |00:00:00.01 |      32 |     24 |  1024 |  1024 |          |
|   4 |     VIEW                                   |       |      1 |      3 |     6   (0)|      3 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      UNION-ALL                             |       |      1 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   7 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   8 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   9 |     SORT AGGREGATE                         |       |     30 |      1 |            |     30 |00:00:00.01 |      32 |     24 |       |       |          |
|  10 |      FIRST ROW                             |       |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|* 11 |       INDEX RANGE SCAN (MIN/MAX)           | T1_PK |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|  12 |     RECURSIVE WITH PUMP                    |       |     11 |        |            |     30 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
  11 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)


In this example of the code the second half of the CTE looks remarkably similar to the previous statement – except I now have a two-column CTE and I’ve included an equality predicate against val1 based on the first of the two columns. In the first half of the code I’ve cheated (as a temporary measure) and supplied three rows of data which list the three distinct values of val1 with their associated minimum values for val2.

The execution plan shows that I’ve done 30 “index range scan (min/max)” of the index with 32 buffer visits. And that’s exactly the right number of probes to return my result set. So if I can manage to generate the starting values efficiently I can execute the whole query efficiently. So let’s find a way of changing that “union all on dual” fudge into a generic statement. Let’s replace it with a recursive CTE:


with bounce1(val1, val2) as (
        select val1, val2 
        from    (
                select
                        /*+ index(t1) */
                        val1, val2,
                        row_number() over(order by val1, val2) rn
                from    t1
        )
        where
                rn = 1
        union all
        select
                (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
                (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
        from    bounce1
        where   bounce1.val1 is not null
),
bounce2 (val1, val2)
as (
        select  val1, val2 
        from    bounce1
--      where   bounce1.val1 is not null
        union all
        select  bounce2.val1, (select min(t1.val2) val2 from t1 where t1.val1 = bounce2.val1 and t1.val2 > bounce2.val2) val2
        from    bounce2
        where   bounce2.val2 is not null
--      and     bounce2.val1 is not null
)
select * from bounce2
where
        bounce2.val2 is not null
and     bounce2.val1 is not null        -- > redundant predicate
order by
        val1, val2
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |       |      1 |        |   189 (100)|     30 |00:00:00.01 |      45 |       |       |          |
|   1 |  SORT ORDER BY                               |       |      1 |      4 |   189   (2)|     30 |00:00:00.01 |      45 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |       |      1 |      4 |   188   (2)|     30 |00:00:00.01 |      45 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |       |      1 |        |            |     34 |00:00:00.01 |      45 |  1024 |  1024 |          |
|   4 |     VIEW                                     |       |      1 |      2 |    87   (2)|      4 |00:00:00.01 |      13 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |      4 |00:00:00.01 |      13 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |       |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |       |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       5 |       |       |          |
|  10 |        FIRST ROW                             |       |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |       |       |          |
|* 11 |         INDEX RANGE SCAN (MIN/MAX)           | T1_PK |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |       |       |          |
|  12 |       SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       6 |       |       |          |
|* 13 |        COUNT STOPKEY                         |       |      3 |        |            |      2 |00:00:00.01 |       6 |       |       |          |
|* 14 |         INDEX RANGE SCAN                     | T1_PK |      3 |    500 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  15 |       RECURSIVE WITH PUMP                    |       |      4 |        |            |      3 |00:00:00.01 |       0 |       |       |          |
|  16 |     SORT AGGREGATE                           |       |     30 |      1 |            |     30 |00:00:00.01 |      32 |       |       |          |
|  17 |      FIRST ROW                               |       |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|* 18 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|  19 |     RECURSIVE WITH PUMP                      |       |     11 |        |            |     30 |00:00:00.01 |       0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "VAL1","VAL2")<=1) 11 - access("T1"."VAL1">:B1)
  13 - filter(ROWNUM=1)
  14 - access("T1"."VAL1">:B1)
  18 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)


Again we see 30 probes using “index range scan (min/max)” with 32 buffer gets to get 30 rows; plus a further 13 buffer gets to generate the three driving rows. The 13 buffer gets break down to: 2 to get the minimum (val1, val2) combination using an “index full scan (min/max)”, then 5 for the probes to get the three minimum values for val1, and 6 for the probes to get the three corresponding minimum values of val2.

You’ll notice that I’ve got various “is not null” predicates scattered throughout the code. In some cases this is to stop Oracle from running into an infinite loop and reporting Oracle error: ORA-32044: cycle detected while executing recursive WITH query” This will occur because of the way that “(select max()…)” inline scalar subqueries returning a null if there is no data found which would lead to the next cycle of the recursive descent taking that null as an input – hence starting the infinite recursion. In some cases the “is not null” predicates are my default pattern for recurstive CTEs and some of them could probably be removed with no change in meaning (or workload).

The /*+ index() */ hint in the starting point for bounce1 was necessary to avoid an “index fast full scan” in 12.2; but that was purely a case of the statistics – number of distinct values, leaf_block count, etc – making the optimizer pick an option that was appropriate for this tiny data set, but not appropriate for the demonstration.  In fact this looks like the side effect of two defects in the 12.1 optimizer code, of which only one has been fixed in 12.2.

Optimizer Limitations

Here’s an extract from the execution plan for the final query with an /*+ index(t1) */ hint in place. The extract is identical for 12.1.0.2 and 12.2.0.1:

--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |       |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |

You’ll notice the Cost at operation 8 is appropriate for a real (i.e. all leaf blocks) full scan of the index. (The leaf_block value was 27 as I mentioned earlier on). You’ll also see that the OMem (PGA requirement for optimum workarea operation) figure is consistent with Oracle processing 10,000 rows in the index. Since the optimizer managed to work out that it could do a full scan with nosort and stopkey it looks a little surprising that the algorithms didn’t manage to make some allowance for the limited access that would occur. (I’d view this as a current limitation, rather than a bug, though).

Now compare the equivalent extracts when we hint an index fast full scan 12.1.0.2 first, then 12.2.0.1:

12.1.0.2
--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |    39   (8)|      1 |00:00:00.03 |      32 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |       |      1 |  10000 |    39   (8)|      1 |00:00:00.03 |      32 |  2048 |  2048 | 2048  (0)|
|   8 |         INDEX FAST FULL SCAN                 | T1_PK |      1 |  10000 |     5   (0)|  10000 |00:00:00.01 |      32 |       |       |          |

12.2.0.1
--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |     7  (29)|      1 |00:00:00.01 |      34 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |       |      1 |  10000 |     7  (29)|      1 |00:00:00.01 |      34 |  2048 |  2048 | 2048  (0)|
|   8 |         INDEX FAST FULL SCAN                 | T1_PK |      1 |  10000 |     5   (0)|  10000 |00:00:00.01 |      34 |       |       |          |

In both cases the cost of the index fast full scan is the same – and much cheaper; but in 12.1.0.2 the cost of the query looks as if it is allowing for sorting (and spilling) the entire 10,000 rows of returned from the index fast full scan (even though the OMem indicates otherwise), while the cost in 12.2.0.1 looks as if it recognises that it just has to do a running comparison through the data set as it returns, keeping only the current minimum in memory at any one moment. This clearly matches our expectations of how Oracle ought to behave, which is why I’d call this a bug in 12.1, fixed by 12.2.

The dramatic change in cost of operation 7 on the upgrade explains the change in plan and the necessity for the /*+ index(t1) */ hint – but if the “first row” predicate were also reflected in the costing then the cost of the “stopkey” index full scan would drop to 2 (probably) and the original 12.1 path would be re-appear.

Footnote

I don’t think there’s a lot of scope for improving the efficiency of this query for getting the (relatively) small number of distinct combinations from the first two columns of a very large index – but there are some very clever SQL bunnies on the ODC forum, so I won’t be surprised if someone comes up with a better solution.

Update

Well it didn’t take very long for someone to improve my SQL. Andrew Sayer took advantage of the “cross apply” feature of Oracle 12c to get rid of that nasty little bit of SQL where I’d used two scalar subqueries in the select list of the driving CTE. Here are the before and after versions of that fragment:


        select
                (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
                (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
        from    bounce1
        where   bounce1.val1 is not null


        select
                ca.val1 ,ca.val2
        from    bounce1
        cross  apply (select val1, val2
                      from  (select /*+ index(t1) no_index_ffs(t1) */
                                     val1, val2
                             from    t1
                             where   t1.val1 > bounce1.val1
                             and     rownum = 1
                            )
                     ) ca
        where  bounce1.val1 is not null

This “cross apply” has the effect of running a correlated subquery for every row selected from (this level of) bounce1 and then joining the results back to (this level of) bounce1. With this change in place (and with my original data set) the following plan appears:


------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        |   161 (100)|     30 |00:00:00.01 |      40 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 |   161   (2)|     30 |00:00:00.01 |      40 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 |   160   (2)|     30 |00:00:00.01 |      40 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |     33 |00:00:00.01 |      40 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |    73   (2)|      3 |00:00:00.01 |       8 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |      3 |00:00:00.01 |       8 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |                 |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |                 |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK           |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       NESTED LOOPS                           |                 |      3 |      1 |    43   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  10 |        RECURSIVE WITH PUMP                   |                 |      3 |        |            |      3 |00:00:00.01 |       0 |       |       |          |
|  11 |        VIEW                                  | VW_LAT_A83890C2 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  12 |         VIEW                                 |                 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|* 13 |          COUNT STOPKEY                       |                 |      3 |        |            |      2 |00:00:00.01 |       6 |       |       |          |
|* 14 |           INDEX RANGE SCAN                   | T1_PK           |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  15 |     SORT AGGREGATE                           |                 |     30 |      1 |            |     30 |00:00:00.01 |      32 |       |       |          |
|  16 |      FIRST ROW                               |                 |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|* 17 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK           |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|  18 |     RECURSIVE WITH PUMP                      |                 |     11 |        |            |     30 |00:00:00.01 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
   4 - filter("BOUNCE1"."VAL1" IS NOT NULL)
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "VAL1","VAL2")<=1) 13 - filter(ROWNUM=1) 14 - access("T1"."VAL1">"BOUNCE1"."VAL1")
  17 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)

If you compare this with my final plan further up the page you can see that operations 9 – 14 look completely different and while my plan shows two “sort aggregate” probes against t1_pk, Andrew’s plan does an interesting “nested loop” driven by a “recursive pump” that effectively halves the work done in this section of the plan.

Another little detail about this plan that I found interesting was that the “cross apply” had been converted to a “lateral join” internally – note the VW_LAT_xxxx view name. This was a little irritating because I had actually tried to write the query with a lateral join in the first place and ended up getting the wrong results. I’ve got a follow-up posting about this – but (spoiler alert) I think it means I’ve found another bug.

April 10, 2018

exp catch

Filed under: 12c,Histograms,Oracle,Statistics — Jonathan Lewis @ 5:52 pm BST Apr 10,2018

No-one should be using exp/imp to export and import data any more, they should be using the datapump equivalents expdp/impdp – but if you’re on an older (pre-12c) version of Oracle and still using exp/imp to do things like moving tables with their production statistics over to test systems then be careful that you don’t fall into an obsolescence trap when you finally upgrade to 12c (or Oracle 18).

exp/imp will mess up some of your histograms if you’re still using them to move tables/statistics in 12c.

Remember that 12c can create “Top-N” and “hybrid” histograms – and exp/imp were written long before these new histogram types came into existence. The code has not been updated to allow for the new histogram types so if you happen to generate any histograms of these type in a 12c system and then use exp/imp to move some table stats (and it’s particularly an issue relating to stats) from one system to another – the stats that arrive at the destination system won’t match the stats that left the source system.

Here’s a little sample code to build a model that I can use to demonstrate the problem. It creates a table with three columns that will make it easy for me to create one frequency histogram, one Top-N histogram and one hybrid histogram. I’ve included a couple of substitution variables in the code so that you can specify an Oracle instance to connect to and a directory for the export file that expdp is going to produce. Don’t forget to check that the directory I create in this script doesn’t overwrite a directory that already exists for other reasons on your test system.


rem
rem     Script:         12c_histograms.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2015
rem
rem     Define m_service to be the service name you connect to
rem     Define m_directory to be the O/S directory you to use for
rem     the export/import/log files
rem
rem     Make sure this code is not over-writing an existing 
rem     definition for a directory called DMPDIR before you 
rem     start


define m_service = 'orcl'
define m_service = 'or32'

define m_directory = '/mnt/working'

host rm &m_directory/expdat.dmp
host rm &m_directory/expdp.dmp

create or replace directory dmpdir as '&m_directory';

drop table t1 purge;

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id 
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        trunc(sqrt(rownum + 0)) frequency,
        trunc(sqrt(rownum + 0)) top_n,
        trunc(sqrt(rownum + 0)) hybrid
from
        generator
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for columns frequency size 254 for columns top_n size 95 for columns hybrid size 50'
        );
end;
/

select
        column_name,
        num_distinct,
        histogram,
        num_buckets
from
        user_tab_cols
where
        table_name = 'T1'
;

column endpoint_actual_value format a22
break on column_name skip 1

select 
        column_name, 
        endpoint_number, endpoint_value, endpoint_actual_value, 
        endpoint_repeat_count 
from 
        user_tab_histograms 
where 
        table_name = 'T1'
order by 
        column_name, endpoint_number
;

Here’s an extract, from a 12.1.0.2 instance, of the results of the two queries with a large number of the rows from the histogram data deleted:


COLUMN_NAME              Distinct HISTOGRAM          Buckets
-------------------- ------------ --------------- ----------
FREQUENCY                     100 FREQUENCY              100
TOP_N                         100 TOP-FREQUENCY           95
HYBRID                        100 HYBRID                  50

COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE  ENDPOINT_REPEAT_COUNT
-------------------- --------------- -------------- ---------------------- ---------------------
FREQUENCY                          3              1 1                                          0
                                   8              2 2                                          0
                                  15              3 3                                          0
...
                                9800             98 98                                         0
                                9999             99 99                                         0
                               10000            100 100                                        0


HYBRID                             3              1 1                                          3
                                 224             14 14                                        29
                                 440             20 20                                        41
...
                                9603             97 97                                       195
                                9800             98 98                                       197
                               10000            100 100                                        1

TOP_N                              1              1 1                                          0
                                  16              7 7                                          0
                                  33              8 8                                          0
...
                                9753             98 98                                         0
                                9952             99 99                                         0
                                9953            100 100                                        0

The most important detail is the endpoint_repeat_count column of the hybrid histogram, although you should note that the endpoint_actual_value columns is populated with a copy of the endpoint_value for all three histograms.

Now I’m going to use exp / drop table / imp (or the datapump equivalents) to export, drop, and re-import the table with (one hopes) the exact same statistics. To do this I’ll be using the imp command with the option “statistics=always” with the intention of copying the stats from the export file into the destination database (you’ll have to substitute your own userid/password, of course):


host exp   userid=test_user/test@&m_service file=expdat.dmp tables='(t1)'
-- host expdp userid=test_user/test@&m_service DIRECTORY=dmpdir DUMPFILE=expdp.dmp TABLES='(t1)'

drop table t1 purge;

host imp   userid=test_user/test@&m_service file=expdat.dmp tables='(t1)' statistics=always
-- host impdp userid=test_user/test@&m_service DIRECTORY=dmpdir DUMPFILE=expdp.dmp TABLES='(t1)'

So what do we see now when we re-run the two queries to report the histogram information:


COLUMN_NAME              Distinct HISTOGRAM          Buckets
-------------------- ------------ --------------- ----------
FREQUENCY                     100 FREQUENCY              100
TOP_N                         100 FREQUENCY               95
HYBRID                        100 FREQUENCY               50

COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE  ENDPOINT_REPEAT_COUNT
-------------------- --------------- -------------- ---------------------- ---------------------
FREQUENCY                          3              1                                            0
                                   8              2                                            0
                                  15              3                                            0
...
                                9800             98                                            0
                                9999             99                                            0
                               10000            100                                            0

HYBRID                             3              1                                            0
                                 224             14                                            0
                                 440             20                                            0
...
                                9603             97                                            0
                                9800             98                                            0
                               10000            100                                            0

TOP_N                              1              1                                            0
                                  16              7                                            0
                                  33              8                                            0
...
                                9753             98                                            0
                                9952             99                                            0
                                9953            100                                            0

The histograms on all three columns are now labelled as FREQUENCY.
The endpoint_actual_value is null for all three – but that may be a purely cosmetic detail with no side effects.
The “hybrid” column really has become a frequency histogram – and that’s the critical one – the endpoint_repeat_count columns are all zero.

tl;dr

If you’re still using exp/imp instead of expdp/impdp to move tables (and, more importantly, their statistics) from one database to another then the upgrade to 12c may mean you end up with hybrid histograms on the source system that are “downgraded” to frequency histograms on the destination system, with the effect that execution plans vary between the two systems.

March 15, 2018

Keeping Intervals

Filed under: 12c,Oracle,Partitioning — Jonathan Lewis @ 8:03 am BST Mar 15,2018

I’ve recently been reminded of a blog post I wrote a couple of years ago that discussed the issue of running into the hard limit of 2^20 -1 as the number of segments for a (composite) partitioned table – a problem that could arise in a relatively short time if you used a large number of hash subpartitions in an interval/hash composite partitioned table (you get about 2 years and 10 months of daily partitions at 1,024 subpartitions per day, for example).

A natural follow-on from that article is to think through a strategy for dropping old partitions sufficiently early that you don’t hit the limit as new partitions are created. This, of course, pretty much defeats the point of interval partitioning – instead of planning to add partitions “just in time” you now have to eliminate them “just in time”. Amongst other issues, we’re going to find that interval partitioning manages to re-introduce a problem with range partitioning that Oracle got rid of in Oracle 10g.

So let’s test the obvious option: drop the oldest partition(s) in time to keep head-room for new partitions; for convenience we’ll start with a simple interval partitioned table with a few pre-declared range partitions and a few automatically generated interval partitions. All the examples here were run under 12.1.0.2:


rem
rem     Script:         pt_merge.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1(id, v1, padding)
partition by range (id) interval (1e4)
(
        partition p10000 values less than (1e4),
        partition p20000 values less than (2e4),
        partition p30000 values less than (3e4),
        partition p40000 values less than (4e4),
        partition p50000 values less than (5e4)
)
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

SEGMENT_NAME              PARTITION_NAME         HEADER_BLOCK     BLOCKS
------------------------- ---------------------- ------------ ----------
T1                        P10000                          128        256
T1                        P20000                          384        256
T1                        P30000                          640        256
T1                        P40000                          896        256
T1                        P50000                         1152        256
T1                        SYS_P69838                     1408        256
T1                        SYS_P69839                     1664        256
T1                        SYS_P69840                     1920        256
T1                        SYS_P69841                     2176        256
T1                        SYS_P69842                     2432        256
T1                        SYS_P69843                     2688        128

11 rows selected.


I’ve created 100,000 rows and since the partitions I’ve pre-declared have an (unreachable) upper bound of only 50,000 Oracle will have added a further 6 partitions to the table to hold the data for values up to 110,000 (with just one row in the last partition). For testing purposes I’ve created the table in an otherwise empty tablespace so when I check the block address of each segment I can see the location (and size) of the segments so far. So here’s the list of names and locations:

SEGMENT_NAME              PARTITION_NAME         HEADER_BLOCK     BLOCKS
------------------------- ---------------------- ------------ ----------
T1                        P10000                          128        256
T1                        P20000                          384        256
T1                        P30000                          640        256
T1                        P40000                          896        256
T1                        P50000                         1152        256
T1                        SYS_P69838                     1408        256
T1                        SYS_P69839                     1664        256
T1                        SYS_P69840                     1920        256
T1                        SYS_P69841                     2176        256
T1                        SYS_P69842                     2432        256
T1                        SYS_P69843                     2688        128

11 rows selected.

No surprises so far. So let’s pretend we know the dreaded ORA-14299 or ORA-14300 will be arriving soon and try to drop the first 5 partitions to keep the partition count below the limit. Here’s a cut-n-paste from an SQL*Plus session that tries to do that one partition at a time:

SQL> alter table t1 drop partition p10000;

Table altered.

SQL> alter table t1 drop partition p20000;

Table altered.

SQL> alter table t1 drop partition p30000;

Table altered.

SQL> alter table t1 drop partition p40000;

Table altered.

SQL> alter table t1 drop partition p50000;
alter table t1 drop partition p50000
                              *
ERROR at line 1:
ORA-14758: Last partition in the range section cannot be dropped

We can’t drop partition p50000 – it’s the highest partition that wasn’t created automatically, and we have to leave an “anchor” partition in place for interval partitioning to work from. By querying user_tab_partitions we can even see that this partition is flagged a little differently from the others:


select
        partition_name, interval, high_value 
from
        user_tab_partitions
where
        table_name = 'T1'
order by
        partition_position
;


PARTITION_NAME         INT HIGH_VALUE
---------------------- --- --------------------------
P50000                 NO  5e4
SYS_P69844             YES 60000
SYS_P69845             YES 70000
SYS_P69846             YES 80000
SYS_P69847             YES 90000
SYS_P69848             YES 100000
SYS_P69849             YES 110000

7 rows selected.

So, at first sight, we’re stuck. If we’re dropping old partitions we will eventually get to a point where there’s only one “real” range partition at the bottom and then we can’t drop any more historic partitions. There are two solutions to this problem, explained a long time ago here and here by Harald van Breederode.

Option 1

Convert the interval partitioned table to a range partitioned table and back again, and if you know the interval (and you can always look it up in the data dictionary) there’s a quick and dirty way of doing that. Here’s a cut-n-paste demonstrating the method and effect:


SQL> alter table t1 set interval (10000);

1Table altered.

SQL> select partition_name, interval, high_value from user_tab_partitions where table_name = 'T1' order by partition_position ; 

PARTITION_NAME         INT HIGH_VALUE
---------------------- --- --------------------------
P10000                 NO  1e4
P20000                 NO  2e4
P30000                 NO  3e4
P40000                 NO  4e4
P50000                 NO  5e4
SYS_P69850             NO  60000
SYS_P69851             NO  70000
SYS_P69852             NO  80000
SYS_P69853             NO  90000
SYS_P69854             NO  100000
SYS_P69855             NO  110000

11 rows selected.

SQL> select table_name, partitioning_type, interval from user_part_tables;

TABLE_NAME           PARTITION INTERVAL
-------------------- --------- --------------------
T1                   RANGE     1E4

1 row selected.

Every single partition has just become a range-based partition, but the table is still interval partitioned. This is a tidy solution, but there’s one obvious, generic, drawback to the method.  The “theory” of interval partitioning is that you don’t have to pre-create partitions in anticipation of the data arriving – so what will happen if a (possibly bad) row arrives weeks ahead of schedule and you find that Oracle has created (say) partition 85,001 with a gap of 12,000 partitions between the current high partition and the new one. If you use this “convert to range and back” trick then you’ll have a single partition covering the entire range where you were expecting (eventually) to have 12,000 partitions. Every time you convert from interval to range and back you’d better have code that checks if there are any gaps first, and then does loads of “split partition” –  or comes up with some other strategy – to address the side effects.

Option 2

When you’ve got just one range partition left, merge the bottom two partitions – this makes the next partition up a range partition without affecting any other partitions. After recreating the original table and dropping the first 4 partitions this is how things go:


SQL> alter table t1 drop partition p50000;
alter table t1 drop partition p50000
                              *
ERROR at line 1:
ORA-14758: Last partition in the range section cannot be dropped


SQL> alter table t1 merge partitions for (45000), for (55000) into partition p_low;

Table altered.

SQL> select partition_name, interval, high_value from user_tab_partitions where table_name = 'T1' order by partition_position;

PARTITION_NAME         INTERVAL             HIGH_VALUE
---------------------- -------------------- --------------------------
P_LOW                  NO                   60000
SYS_P69863             YES                  70000
SYS_P69864             YES                  80000
SYS_P69865             YES                  90000
SYS_P69866             YES                  100000
SYS_P69867             YES                  110000

6 rows selected.

Is this too good to be true ? Of course it is, but you may have to pause for a moment to think why. When you merge two partitions Oracle copies the contents of the two segments into a new segment – always; even if one of the two segments is empty. When you do a “split partition” Oracle runs a check to see if the split would leave all the data in a single segment and if it would then Oracle doesn’t do any copying but simply plays clever games in the data dictionary – unfortunately Oracle doesn’t use the same sort of trick to optimise a merge.

So the merge partition mechanism carries less risk than the “interval/range/interval”, but you either pay the cost of the merge or you carefully code the mechanism so that the bottom two partitions are always empty when you merge: for example you might always leave the bottom (range) partition empty and use your scheduled code to truncate (or exchange out) the lowest interval partition, then do the merge.

The good news

When you upgrade to 12.2.0.1 you can drop the lowest partition – and Oracle will simply turn the lowest interval partition currently in existence into a range partition. (That may be a bit of a nuisance if there’s a gap between the range partition and the current lowest interval partition.)

The Bad News

It doesn’t really matter which strategy you use to deal with this problem (even if you’ve upgraded to 12.2) – you still pay one other penalty for both mechanisms. And that’s the bit which re-introduces a problem that last existed in 9i.

Ask youself “How does Oracle know which interval a partition is for and what the limit is on the partitioning key ?” Then look at the data dictionary, or maybe build a very simple model and trace what happens when you use either of the methods above – but in your model create a significant number or partitions first. I’m going to take the data dictionary method – starting from the point where I’ve created and populated the table. Again this is cut-n-paste, and do note that I switch to the sys account after creating the table:


SQL> select object_id, object_name, subobject_name from user_objects;

 OBJECT_ID OBJECT_NAME          SUBOBJECT_NAME
---------- -------------------- ----------------------
    185164 T1
    185165 T1                   P10000
    185166 T1                   P20000
    185167 T1                   P30000
    185168 T1                   P40000
    185169 T1                   P50000
    185170 T1                   SYS_P69868
    185171 T1                   SYS_P69869
    185172 T1                   SYS_P69870
    185173 T1                   SYS_P69871
    185174 T1                   SYS_P69872
    185175 T1                   SYS_P69873

12 rows selected.

SQL> connect / as sysdba
Connected.

SQL> select obj#, dataobj#, part# from tabpart$ where bo# = 185164 order by part#;

      OBJ#   DATAOBJ#      PART#
---------- ---------- ----------
    185165     185165         10
    185166     185166         20
    185167     185167         30
    185168     185168         40
    185169     185169         50
    185170     185170 2147483648
    185171     185171 2147483649
    185172     185172 2147483650
    185173     185173 2147483651
    185174     185174 2147483652
    185175     185175 2147483653

11 rows selected.

I’ve queried user_objects to find the object_id of the table then used that as the “base object number” (bo#) to query tabpart$, which holds the table partition definitions. Note how there are 5 partitions where the partition number goes up 10 at a time, and 6 where it goes up one at a time. Prior to 10g (and interval partitions, of course) the stored partition number would increase in steps of 1 but if you wanted to do a split, merge or drop partition (and the last of the three was the most significant one) every single partition position about the split/merge/drop point would have to be renumbered, and that was done by a single row update to the data dictionary to keep the numbering intact. The steps of 10 were introduced in 10g to deal with the inherent performance problems – particularly the shared pool catastrophe that this could cause.

The steps of 1 for interval partitions allows Oracle to keep track (easily) of what high_value each partition partition represents, and the highest legal partition. Try inserting the values 1,000,000 into the table and re-run the query against tabpart$ and you’ll see Oracle adding part# = 2147483743. So what do you think is going to happen if you try to apply the two mechanisms ?

If you do the interval/range/interval switch every interval part# will be renumbered so to follow the “increment by 10” pattern. If you drop partitions p10000 to p40000 nothing happens to the existing part# values until you get to the command to merge p50000 with the next partition up and then you see this:


SQL> alter table test_user.t1 merge partitions for (45000), for (55000) into partition p_low;

Table altered.

SQL> select obj#, dataobj#, part# from tabpart$ where bo# = 185164 order by part#;

      OBJ#   DATAOBJ#      PART#
---------- ---------- ----------
    185177     185177         10
    185171     185171 2147483648
    185172     185172 2147483649
    185173     185173 2147483650
    185174     185174 2147483651
    185175     185175 2147483652
    185176     185176 2147483742

7 rows selected.


The newly merged partition is a new object, of course, so has a completely new obj# and dataobj#, and it’s been given the part# of 10 (the lowest value for a clean range-partitioned object). Every single interval partition has had its part# decreased by one. The lowest possible interval partition is always given the part# of 2147483648 (0x80000000) and the partition numbering increments by 1 from there onwards. (The numbering gets a little more subtle when you have composite partitioning but a similar approach takes place in tabcompart$).

Pause for thought – if you’re thinking of creating an interval partitioned table that could get close to a running level of 1 million partitions and you start to get rid of old partitions in any version of Oracle then each “drop/merge” partition will update about 1 million rows in the data dictionary – and that’s assuming you don’t have any local indexes that will need to be renumbered in the same way!

Here’s a critical part of the output from tkprof when I recreated the table with 1,000,000 rows – which means 101 partitions – and created a local index on it, before dropping the first 4 partitions and then enabled tracing just before merging the bottom interval partition with the anchor range partition.


update indpart$ set dataobj# = :1, part# = :2, flags = :3, ts# = :4, file# =
  :5, block# = :6, pctfree$ = :7, initrans = :8, maxtrans = :9, analyzetime =
  :10, samplesize = :11, rowcnt = :12, blevel = :13, leafcnt = :14, distkey =
  :15, lblkkey = :16, dblkkey = :17, clufac = :18, pctthres$ = :19
where
 obj# = :20


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       94      0.00       0.00          0          0          0           0
Execute     94      0.00       0.01          0         94        480          94
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      188      0.01       0.01          0         94        480          94


update tabpart$ set dataobj# = :1, part# = :2, ts# = :3, file# = :4, block# =
  :5, pctfree$ = :6, pctused$ = :7, initrans = :8, maxtrans = :9, flags = :10,
   analyzetime = :11, samplesize = :12, rowcnt = :13, blkcnt = :14, empcnt =
  :15, avgspc = :16, chncnt = :17, avgrln = :18
where
 obj# = :19


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       94      0.00       0.00          0          0          0           0
Execute     94      0.00       0.00          0        188        489          94
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      188      0.00       0.00          0        188        489          94

That’s not a lot of work for my little example with less than 100 partitions – but when you’ve got a million of them, with a handful of indexes, and the partitions have been created over time that’s going to turn into a lot of work that’s going to disrupt the shared pool for a long time, generate a lot of redo, and do a lot of disk reads and writes.

So be cautious with interval partitioning – even in 12.2 (and 18.1, possibly) the ease of use may disappear if you realise too late that you’re going to get into a cycle of partition maintenance.

Footnote for composite partitioning – the limits of 2^20-1 segments (hence subpartitions) still applies, but the necessary update is relevant only at the partition level, not at the subpartition level. The objects updated are tabcompart$ and indcompart$.

Update (included for ironic effect)

The day I posted this note my “Oracle Support Hot Topics” email with a report of the following bug:

Bug 19294302 : DBMS_REDEFINITION DOES NOT WORK WITH INTERVAL PARTITIONS

This was reported for 11.2.0.4, fixed in 12.2. The rediscovery information is:

ORA-14024 during copy_table_dep when the interim table is interval partitioned.

The problem arises if you change a table from simple range partitioned to range with interval – so might be relevant if you have a strategy of doing the interval/range/interval trick.

 

 

March 13, 2018

Deferred Invalidation

Filed under: 12c,CBO,Infrastructure,Oracle,Troubleshooting,Upgrades — Jonathan Lewis @ 6:30 pm BST Mar 13,2018

I was going to write an article on the way 12.2 has introduced the option for “deferred invalidation” for a number of DDL operations, but I did a quick google search before I started writing and found that both Franck Pachot and Richard Foote (yes, rebuild index is one of the operations) had got there long ago so here are a couple of links – as much for my own benefit as anything else:

Richard Foote:

Franck Pachot:

Franck’s 2nd example may be particularly relevant to some clients of mine who were having problems with SQL queries that were crashing (slowly and randomly) instead of running very efficiently because they were running queries against one subpartition of a table while another subpartition of the same table was subject to exchange. With a little bad luck in the timing an exchange that took place between a parse and an execute would cause a query to have its cursor invalidated and re-parsed in a way that failed to do (sub-)partition elimination the way it should have because the local indexes were in an indeterminate state.

 

March 6, 2018

Match_recognise – 2

Filed under: 12c,Match_recognize,Oracle,Tuning — Jonathan Lewis @ 7:59 am BST Mar 6,2018

In my previous post I presented a warning about the potential cost of sorting and the cost of failing to find a match after each pass of a long search. In a comment on that post Stew Ashton reminded me that the cost of repeatedly trying to find a match starting from “the next row down” could be less of a threat than the cost of “back-tracking” before moving to the next row down.

Taking the example from the previous posting to explain – the requirement was for customers who had executed a transaction in September but not October, and a match_recognize() clause suggested on the ODC (formerly OTN) database forum to implement this requirement was as follows:

match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
        pattern(x+ y* $) 
        define
                x as mth  = 9,
                y as mth != 10
);

In the blog post I raised the issue of an extreme case where there were 100,000 transactions for a single customer of which all but the last was a September transaction and the last was an October transaction. This would have two effects – first that we could have to sort 100,000 rows, including the cust_id that appeared in the “partition by” clause and the 1000-character padding column that was listed in the measures clause, leading to a huge dump to, and constant re-read of, the temporary tablespace; secondly that having failed to find a match starting from row 1 Oracle would go back to row 2 and try again, then to row 3, and so on.

The point that Stew Ashton made was that Oracle doesn’t just “go back to” row 2, it will be unwinding a stack, or reversing out a recursive descent to get there. What this means is that Oracle will fail as it reaches the October at row 100,000 and say “no more X rows, is this a Y row ? no”, backtrack to row 999,998 and say “what if I stop collecting X rows here and start looking for Y rows?”, so it reads row 999,999 as a Y row (since 9 != 10), then finds row 100,000 and fails the match. So it backtracks again to row 999,997 and says “what if I stop collecting X rows here and start looking for Y rows?”, and this time it finds identifies 999,998 and 999,999 as Y rows, then fails on row 100,000.

Remember, this is still part of the attempt to match the pattern starting at row 1 – and there are 999,996 more steps backwards still to go, and the further we go back the further we come forward again until we fail — and there are 999,998 steps we have to back-track before we start to consider a pattern starting are row 2..

To demonstrate the costs I’ve got three variants of the original query. First, the query as it was but limited to just 1,000 rows for a single customer; second a changed pattern that highlights the cost of trying to use back-tracking to match the pattern just once, starting from row 1 (the pattern doesn’t actually meet the original requirement because it would only find customers whose first transaction of the year was in September); finally a changed pattern that achieves the required result much more efficiently than the original (but still very slowly) by adding some human intelligence to the implementation.

Here’s version 1 – which took 257 CPU seconds to handle just 1,000 rows:

select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(x+ y* $)
        define
                x as mth  = 9,
                y as mth != 10
);

You’ll see that I’ve included the “debug” functions of classifier() and match_number() in the SQL above – these are particularly useful with the options “all rows per match” and “with unmatched rows” when you’re trying to figure out why your match_recognize() clause is not producing the right results, so I’ve left them there purely for reference.

Then there’s a version where I’ve made the modification suggested by Stew Ashton to demonstrate the full cost of an attempt to match only if the pattern starts on the first row of the partition. This took just 0.83 CPU seconds to complete. This may sound fairly reasonable, but if you compare that to the time it might take simply to sort and walk once through 1,000 rows you’ll realise that it’s actually pretty expensive – and it’s not surprising that when we had to do the same thing 1,000 times (on a slowly decreasing set, of course, as we work our way down the partition) the original task took 257 CPU seconds.

select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(^ x+ y* $)
        define
                x as mth  = 9,
                y as mth != 10
);

You’ll notice the caret “^” at the start of the pattern – this means the pattern must start at the first row of the partition (just as the “$” means the pattern has to end at the end of the partition).

Finally, thinking of a better way of using match_recognize() for this requirement we realise that we know that November comes after October, and December comes after November so (in the context of our example) the predicate “!= 10” is equivalent to “> 10”. With this code change the original query took 0.82 CPU seconds.


select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(x+ y* $)
        define
                x as mth  = 9,
                y as mth  > 10
);

In this case we still have to do a lot of back tracking, but each time we backtrack one step we then check just one row forward for the match to fail (9 is not greater than 10), whereas with the original if we have backtracked 750 steps (for example) we would then have to check 750 rows before we reached the October row for the match to fail.

Bottom line: back-tracking is a massive cost if you have to take a lot of steps backwards to the previous starting row; and you need the match to fail (or succeed) as fast as possible as you start stepping forward again.

Addendum

Since Stew Ashton had highlighted the omission in the previous blog post I passed him a copy of this post before publishing it, asking him to check whether there were any errors or omissions in the way I had described the work Oracle would do back tracking in this example. He said that he couldn’t think of anything to improve the explanation (though I will still claim responsibility for any errors, omissions, or ambiguities) and then suggested another, far more efficient, way of getting the required answer by (again) re-thinking the question before writing the code. His solution looks like this:


select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt nulls first
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(x{1})
        define
                x as mth  = 9 and (
                             next(mth) is null or next(mth) > 10
                     )
)
;

The pattern here simply says: for the partition find the first “X”-row, but an X-row is defined as “month is september and either there are no more rows or the next row is after October”. You’ll notice that I’ve modified the “order by” clause to put nulls first – there are none in the sample data, but if there were this change to the order would ensure that for a row where “mth = 9″ the “next(mth)” could only be null if the current row were the last in the partition.

If you imagine walking through the pattern-matching process now, you start looking at rows and keep going until you reach the first September, and each time you find a September you check the next row to see if it’s past the end of partition, or a November or December; if it is you report the current row and move to the end of the partition, if it isn’t you just walk to the next row and repeat the process – you never back-track. Effectively the workload here is simply to sort then walk non-stop through the whole list – and Oracle even tells us that we are using this optimum strategy in the execution plan:


---------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                       | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                |      |      1 |        |      0 |00:00:00.01 |     146 |       |       |          |
|   1 |  VIEW                                           |      |      1 |   1000 |      0 |00:00:00.01 |     146 |       |       |          |
|   2 |   MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO|      |      1 |   1000 |      0 |00:00:00.01 |     146 |  1186K|   567K| 1054K (0)|
|   3 |    VIEW                                         |      |      1 |   1000 |   1000 |00:00:00.01 |     146 |       |       |          |
|   4 |     TABLE ACCESS FULL                           | T1   |      1 |   1000 |   1000 |00:00:00.01 |     146 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------

Operation 2 – the Match Recognize Sort operation – is reported as “deterministic finite auto”, which basically means the duration of the process is predictable because Oracle knows it is a straight end to end walk with no back-tracking. This is the ideal thing to see when you try to design code using match_recognize().

February 26, 2018

Match_recognize

Filed under: 12c,Match_recognize,Oracle — Jonathan Lewis @ 2:59 pm BST Feb 26,2018

In the spirit of Cary Millsap’s comment: “The fastest way to do anything is to not do it at all”, here’s my take (possibly not an original one) on solving problems:

“The best time to solve a problem is before it has happened.”

I spend quite a lot of my “non-contact” time thinking about boundary cases, feature collisions, contention issues, and any other things that could go wrong when you start to implement real systems with (new) Oracle features. The benefit of doing this, of course, is that when I’m looking at a client’s system I can often solve problems because I recognise symptoms that I’ve previously created “in the lab”. The strange thing about this is that there have been times when I’ve pushed Oracle to a breaking point, documented it, and then dismissed the threat because “no one would do that in real life” only to find that someone has done it in real life.

All this is just a preamble to a demonstration of a threat with a terrific feature that is just beginning to gain greater acceptance as a solution to some interesting problems – and the demonstration is going to exaggerate the problem to a level that (probably) won’t appear in a production. The driving example appeared as a question on the OTN/ODC database forum:

“I need customers who have done a transaction in September but not in October.”

There are obviously many ways to address this type of requirement (my first thought was to use the MINUS operator), and a few questions you might ask before trying to address it, but the OP had supplied some data to play which consisted of just a few rows of a table with three columns and some data restricted to just one year, and one solution offered was a very simple query using the 12c feature match_recognize():


CREATE TABLE TEST_TABLE   
  ( T_ID NUMBER, -- trans-id  
    CUST_ID NUMBER,   
    TRANS_DT DATE  
  ) ;  
                  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (1,100,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (2,100,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (3,200,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (4,300,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (5,400,to_date('12-JAN-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (6,500,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (7,500,to_date('12-MAR-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (8,600,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (9,600,to_date('12-JUL-17','DD-MON-RR'));  

commit;

select * from test_table
match_recognize
(
  partition by cust_id
  order by trans_dt
  pattern( x+ y* $)
  define
    x as extract(month from trans_dt)  = 9,
    y as extract(month from trans_dt) != 10
);
 
   CUST_ID
----------
       200
       600
      

The obvious benefit of this solution over a solution involving a set-wise MINUS is that it need only scan the data set once (whereas the MINUS strategy will be scanning it twice with a select distinct in each scan) – but it’s a solution that is likely to be unfamiliar to many people and may need a little explanation.

The partition by cust_id order by trans_dt means we sort the data by those two columns, breaking on cust_id. Then for each cust_id we walk through the data looking for a pattern which is defined as: “one or more rows where the month is september followed by zero or more rows where the month is NOT october followed by the end of the set for the customer”. The SQL leaves many details to default so the result set is just the cust_id column and only one row per occurrence of the pattern (which, given the data set, can occur at most once per customer).

For a cust_id that shows a matching pattern the work we will have done is:

  • Walk through rows for Jan to Aug until we reach the first September – which is the start of pattern
  • Keep on walking through to the last of the Septembers – which is a partial match
  • One of
  • Walk through zero rows of November and December and reach the end of cust_id
  • Walk through one or more rows of November and/or December then reach the end of cust_id
  • Record the end of pattern by reporting one row
  • Move on to next cust_id

The excitement starts when we think about a cust_id that doesn’t have a matching pattern – and for that I’m going to generate a new, extreme, data set.


rem
rem     Script:         match_recognize_07.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level  comment to avoid WordPress format issue
)
select
        rownum                          id,
        99                              cust_id,
        to_date('01-Sep-2017')          trans_dt,
        lpad(rownum,1000,'0')           padding
from
        generator       v1,
        generator       v2
where
        rownum  comment to avoid WordPress format issue
;

update t1
set
        trans_dt = to_date('01-Oct-2017','dd-mon-yyyy')
where
        rownum = 1
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

select  *
from    (
        select 
                t1.*,
                extract(year from trans_dt) yr, 
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding
        pattern( x+  y*  $ )
        define
                x as mth = 9,
                y as mth != 10
);

I’ve moved the calculation of month number from the define clause into an in-line view purely to make the match_recognize() clause a little tidier.

I’ve created a table with just one customer with 100,000 transactions on 1st September 2017, then I’ve updated one row from September to October. Thanks to that one row Oracle is not going to be able to find the requested pattern. I’ve added a padding column of 1,000 characters to the table and included it in the measures that I want to select, so Oracle will have to sort roughly 100MB of data (100,000 rows at roughly 1KB per row) before it starts walking the data to find matches – and, though it’s not visible in the script, the workarea settings mean the session won’t be allowed to expand its PGA to accommodate the whole 100MB.

Test 1 – comment out the update and see how long it takes to produce a result: 0.67 seconds, and the padding value reported was the last one from the pattern.
Test 2 – put the update back in place and try again:

After running for 46 seconds with no result and interrupting the query these are some figures from a snapshot of the session stats:

Name                                                 Value
----                                                 -----
CPU used when call started                           3,662
DB time                                              3,711
user I/O wait time                                   1,538
consistent gets                                     14,300
physical reads direct                            1,298,939
physical read IO requests                          736,478
physical read bytes                         10,640,908,288      
physical writes                                     25,228
physical writes direct                              25,228
physical reads direct temporary tablespace       1,298,939
physical writes direct temporary tablespace         25,228
table scan rows gotten                             100,000
table scan blocks gotten                            14,286

  • I’ve scanned a table of 14,286 blocks to find 100,000 rows.
  • I’ve sorted and spilled to disc, using roughly 25,000 blocks of direct path writes and reads to do the sort.
  • Then I’ve spend the rest of the time burning up CPU and reading 1.27 million blocks from the temporary tablespace trying to find a match

The way that basic pattern matching works on a match failure is to go back to the row after the one where the current match attempt started, and begin all over again. So in this example, after dumping 100MB of Septembers to temp Oracle started at row 1, read 999,999 rows, then found the October that failed the match; so it went to row 2 [ed: doing some very expensive back-tracking: see comment #2 from Stew Ashton], read 999,998 rows, then found the October that failed the match; so it went to row 3 and so on. Every time it went back to (nearly) the beginning it had to start re-reading that 100,000 rows from temp because the session wasn’t allowed to keep the whole 100MB in memory.

You need to avoid defining a pattern that has to scan large volumes of data to identify a single occurrence of the pattern if the matching process is likely to fail. Even if you can keep the appropriate volume of data in memory for the entire time and avoid a catastrophic volume of reads from the temporary tablespace you can still see a huge amount of CPU being used to process the data – when I reduced the table from 100,000 rows to 10,000 rows it still took me 99 CPU seconds to run the query.

tl;dr

The 12c match_recognize() is a terrific tool, but you must remember two important details about the default behaviour when you think about using it:

  • You will sort a volume of data that is the number of input rows multiplied but the total length of the measures/partition output.
  • If you have a long sequence of rows that ends up failing to match a pattern Oracle goes back to the row after the start of the previous match attempt.

With the usual proviso that “large”, “small” etc. are all relative: keep the data volume small, and try to define patterns that will be short  runs of rows.

Do note, however, that I engineered this example to produce a catastrophe. There are many non-default actions you can choose to minimise the workload you’re likely to produce with match_recognize(), and if you just spare a little time to think about worst case events you probably won’t need to face a scenario like this in a real production environment.

See also:

Part 6 (which includes a list of earlier installments) of an introductory series to match_recognize() by Keith Laker.

A pdf file of Keith Laker’s presentation on match_recognize(), including some technical implementation details.

 

January 19, 2018

Nested MVs

Filed under: 12c,Bugs,Infrastructure,Materialized view,Oracle — Jonathan Lewis @ 2:43 pm BST Jan 19,2018

A recent client was seeing a very large redo penalty from refreshing materialized views. Unfortunately they had to be refreshed very frequently, and were being handled with a complete refresh in atomic mode – which means delete every row from every MV then re-insert every row.  The total redo was running at about 5GB per hour, which wasn’t a problem for throughput, but the space for handling backup and recovery was getting a bit extreme.

The requirement consisted of two MVs which extracted and aggregated row and column subsets in two different ways from a single table; then two MVs that aggregated one of the first MVs in two different ways; then two MVs which each joined one of the first level MVs to one of the scond level MVs.

No problem – join MVs are legal, aggregate MVs are legal, “nested” MVs are legal: all you have to do is create the right MV logs and pick the right refresh command.  Since the client was also running Standard Editions (SE2) there was no need to worry about how to ensure that query rewrite would work (feature not implemented on SE).

So here, simplified and camouflaged, is a minimum subset of just the first few stages of the construction: a base table with MV log, one first-level aggregate MV with its own MV log, and two aggregate MVs based on the first MV.

drop materialized view log on req_line;
drop materialized view log on jpl_req_group_numlines;

drop materialized view jpl_req_group_numlines;
drop materialized view jpl_req_numsel;
drop materialized view jpl_req_basis;

drop table req_line;

-- ----------
-- Base Table
-- ----------

create table req_line(
        eventid         number(10,0),
        selected        number(10,0),
        req             number(10,0),
        basis           number(10,0),
        lnid            number(10,0),
        area            varchar2(10),
        excess          number(10,0),
        available       number(10,0),
        kk_id           number(10,0),
        eventdate       number(10,0),
        rs_id           number(10,0)
)
;

-- --------------------
-- MV log on base table
-- --------------------

create materialized view log 
on
req_line
with rowid(
        req, basis, lnid, eventid, selected, area,
        excess, available, kk_id, eventdate, rs_id
)
including new values
;

-- --------------------
-- Level 1 aggregate MV
-- --------------------

create materialized view jpl_req_group_numlines(
        eventid, selected, 
        row_ct, req_ct, basis_ct, req, basis, 
        maxlnid, excess, numsel, area, available, kk_id, 
        rs_id, eventdate
)
segment creation immediate
build immediate
refresh fast on demand 
as 
select 
        eventid,
        selected,
        count(*)        row_ct,
        count(req)      req_ct,
        count(basis)    basis_ct,
        sum(req)        req,
        sum(basis)      basis,
        max(lnid)       maxlnid,
        excess,
        count(selected) numsel,
        area,
        available,
        kk_id,
        rs_id,
        eventdate
from 
        req_line
group by 
        eventid, selected, area, excess,
        available, kk_id, eventdate, rs_id
;

-- ------------------------
-- MV log on first level MV
-- ------------------------

create materialized view log 
on
jpl_req_group_numlines
with rowid 
(
        eventid, area, selected, available,
        basis, req, maxlnid, numsel
)
including new values
;


-- ----------------------------
-- First "level 2" aggregate MV
-- ----------------------------

create materialized view jpl_req_numsel(
        eventid, selected, 
        row_ct, totalreq_ct, totalbasis_ct, totalreq, totalbasis, 
        maxlnid, numsel_ct, numsel, area
)
segment creation immediate
build immediate
refresh fast on demand
as 
select 
        eventid,
        selected,
        count(*)        row_ct,
        count(req)      req_ct,
        count(basis)    basis_ct,
        sum(req)        req,
        sum(basis)      basis,
        max(maxlnid)    maxlnid,
        count(numsel)   numsel_ct,
        sum(numsel)     numsel,
        area
from 
        jpl_req_group_numlines
group by 
        eventid, selected, area
;


-- -----------------------------
-- Second "level 2" aggregate MV
-- -----------------------------

create materialized view jpl_req_basis(
        eventid, 
        row_ct, totalbasis_ct, totalreq_ct, totalbasis, totalreq, 
        area, selected, available, maxlnid ,
        numsel_ct, numsel
)
segment creation immediate
build immediate
refresh fast on demand
as 
select 
        eventid,
        count(*)        row_ct,
        count(basis)    totalbasis_ct,
        count(req)      totalreq_ct,
        sum(basis)      totalbasis,
        sum(req)        totalreq,
        area,
        selected,
        available,
        max(maxlnid)    maxlnid,
        count(numsel)   numsel,
        sum(numsel)     numsel
from
        jpl_req_group_numlines
group by 
        eventid, area, available, selected
;

Once the table, MV logs and MVs exist we can insert some data into the base table, then try refreshing the views. I have tried three different calls to the dbms_refresh package, dbms_mview.refresh_all_mviews(), dbms_mview.refresh_dependent(), and dbms_mview.refresh(), specifying the ‘F’ (fast) refresh method, atomic refresh, and nested. All three fail in the same way on 12.2.0.1. The code below shows only the refresh_dependent() call.

I’ve included a query to report the current state of the materialized views before and after the calls, and set a two second sleep before the refresh so that changes in “last refresh” time will appear. The final queries are just to check that the expected volume of data has been transferred to the materialized views.


-- ------------------------------------
-- Insert some data into the base table
-- ------------------------------------

begin
        for i in 1..100 loop
                execute immediate 'insert into req_line values( :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx)' 
                using i,i,i,i,i,i,i,i,i,i,i;
                commit;
        end loop;
end;
/

set linesize 144
column mview_name format a40

select
        mview_name, staleness, compile_state, last_refresh_type, 
        to_char(last_refresh_date,'dd-mon hh24:mi:ss')          ref_time
from
        user_mviews
ORDER by
        last_refresh_date, mview_name
;

prompt  Waiting for 2 seconds to allow refresh time to change

execute dbms_lock.sleep(2)

declare
        m_fail_ct       number(6,0);
begin
        dbms_mview.refresh_dependent(
                number_of_failures      => m_fail_ct,
                list                    => 'req_line',
                method                  => 'F',
                nested                  => true,
                atomic_refresh          => true
        );

        dbms_output.put_line('Failures: ' || m_fail_ct);
end;
/

select
        mview_name, staleness, compile_state, last_refresh_type, 
        to_char(last_refresh_date,'dd-mon hh24:mi:ss')          ref_time
from
        user_mviews
order by
        last_refresh_date, mview_name
;

-- --------------------------------
-- Should be 100 rows in each table
-- --------------------------------

select count(*) from jpl_req_basis;
select count(*) from jpl_req_group_numlines;
select count(*) from jpl_req_numsel;

Both the earlier versions of Oracle are happy with this code and refresh all three materialized view without fail. Oracle 12.2.0.1 crashes the procedure call with a deadlock error which, when traced, shows itself to be a self-deadlock while attempting to select a data dictionary row for update:


MVIEW_NAME                               STALENESS	     COMPILE_STATE	 LAST_REF REF_TIME
---------------------------------------- ------------------- ------------------- -------- ------------------------
JPL_REQ_BASIS                            FRESH		     VALID		 COMPLETE 19-jan 14:03:01
JPL_REQ_GROUP_NUMLINES			 NEEDS_COMPILE	     NEEDS_COMPILE	 COMPLETE 19-jan 14:03:01
JPL_REQ_NUMSEL                           FRESH		     VALID		 COMPLETE 19-jan 14:03:01

3 rows selected.

Waiting for 2 seconds to allow refresh time to change

PL/SQL procedure successfully completed.

declare
*
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 2952
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 85
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 245
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 1243
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 2414
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 2908
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 3699
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 3723
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 75
ORA-06512: at line 4


MVIEW_NAME				 STALENESS	     COMPILE_STATE	 LAST_REF REF_TIME
---------------------------------------- ------------------- ------------------- -------- ------------------------
JPL_REQ_NUMSEL                           NEEDS_COMPILE	     NEEDS_COMPILE	 COMPLETE 19-jan 14:03:01
JPL_REQ_BASIS                            FRESH		     VALID		 FAST	  19-jan 14:03:04
JPL_REQ_GROUP_NUMLINES                   FRESH		     VALID		 FAST	  19-jan 14:03:04

The deadlock graph from the trace file, with a little extra surrounding information, looks like this:


Deadlock graph:
                                          ------------Blocker(s)-----------  ------------Waiter(s)------------
Resource Name                             process session holds waits serial  process session holds waits serial
TX-00020009-00000C78-A9B090F8-00000000         26      14     X        40306      26      14           X  40306


*** 2018-01-19T14:18:03.925859+00:00 (ORCL(3))
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
----- Current SQL Statement for this session (sql_id=2vnzfjzg6px33) -----
select log, oldest, oldest_pk, oldest_oid, oldest_new, youngest+1/86400,  flag, yscn, oldest_seq, oscn, oscn_pk, oscn_oid, oscn_new, oscn_seq  from sys.mlog$ where mowner = :1 and master = :2 for update
----- PL/SQL Stack -----

So far I haven’t been able to spot whether or not I’m doing something wrong, or prohibited, and I haven’t been able to find a matching problem on MoS. Since the code works on 11gR2 and 12cR1 I’m inclined to believe it’s a bug introduced in the 12cR2 timeline – which is a nuisance for my client, but if it is a bug then perhaps a fix will appear fairly promptly.

January 9, 2018

ASSM argh!

Filed under: 12c,ASSM,Bugs,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 5:53 pm BST Jan 9,2018

Here’s a problem with ASSM that used to exist in older versions of Oracle had disappeared by 11.2.0.4 and then re-appeared in 12.1.0.2 – disappearing again by 12.2.0.1. It showed up on MoS a few days ago under the heading: “Insert is running long with more waits on db file sequential read”.

The obvious response to this heading is to question the number of indexes on the table – because big tables with lots of indexes tend to give you lots of random I/O as Oracle maintains the indexes – but this table had no indexes. The owner of the problem supplied several of bits of information in the initial post, with further material in response to follow-up questions, including the tkprof summary of the 10046/level 12 trace of the insert and two extracts from the trace file to show us some of the “db file sequential read” waits – the first extract made me wonder if there might be some issue involving 16KB blocks but the second one dispelled that illusion.

There are several buggy things that can appear with ASSM and large-scale DML operations, and sometimes the problems can appear long after the original had done the dirty deed, so I thought I’d create a simple model based on the information supplied to date – and discovered what the problem (probably) was. Here’s how it starts – I’ve created a tablespace using ASSM, and in this tablespace I’ve created a table which has 48 columns with a row length of 290 bytes (roughly matching the OP’s table), and I’ve hacked out a simple PL/SQL block that loops around inserting arrays of 100 rows at a time into the table for a total of 1M rows before committing.


rem
rem     Script:         assm_cleanout.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2        Lots of blocks left "not full"
rem             11.2.0.4
rem

rem
rem     using OMF, so no file-name needed
rem     Ran this bit as SYS, and altered test user to have unlimited quota
rem

/*
create
        tablespace test_8k_assm
        datafile size 1G
        extent management local
        autoallocate
        segment space management auto
;
*/

rem
rem     Create the table, pre-allocate some space.
rem     This means we should get consistent 8M extents and not initial little ones
rem

create table t1 (
        v001 varchar2(5), v002 varchar2(5), v003 varchar2(5), v004 varchar2(5), v005 varchar2(5),
        v006 varchar2(5), v007 varchar2(5), v008 varchar2(5), v009 varchar2(5), v010 varchar2(5),
        v011 varchar2(5), v012 varchar2(5), v013 varchar2(5), v014 varchar2(5), v015 varchar2(5),
        v016 varchar2(5), v017 varchar2(5), v018 varchar2(5), v019 varchar2(5), v020 varchar2(5),
        v021 varchar2(5), v022 varchar2(5), v023 varchar2(5), v024 varchar2(5), v025 varchar2(5),
        v026 varchar2(5), v027 varchar2(5), v028 varchar2(5), v029 varchar2(5), v030 varchar2(5),
        v031 varchar2(5), v032 varchar2(5), v033 varchar2(5), v034 varchar2(5), v035 varchar2(5),
        v036 varchar2(5), v037 varchar2(5), v038 varchar2(5), v039 varchar2(5), v040 varchar2(5),
        v041 varchar2(5), v042 varchar2(5), v043 varchar2(5), v044 varchar2(5), v045 varchar2(5),
        v046 varchar2(5), v047 varchar2(5), v048 varchar2(5)
)
segment creation immediate
tablespace test_8k_assm
storage(initial 8M)
;

alter table t1 allocate extent (size 8M);
alter table t1 allocate extent (size 8M);

rem
rem     Simple anonymous pl/sql block
rem     Large insert, handled with array inserts
rem     Can modify loop count and array size very easily
rem

declare
        type tab_array is table of t1%rowtype;
        junk_array tab_array;
begin

        select
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx'
        bulk collect into
                junk_array
        from
                all_objects
        where
                rownum  <= 100 -- > comment to avoid WordPress format issue
        ;

        for i in 1..10000 loop
                forall j in 1..junk_array.count
                        insert into t1 values junk_array(j) ;
        end loop;

end;

commit;

The number of rows per block after this insert is 24, with 1038 bytes free space left (808 due to the pctfree = 10, then the bit that was too small to take a 25th row before breaching the pctfree barrier). This means we should report 1M/24 = 41,666 full blocks and one block with some free space. So we query the table using the dbms_space package:


declare
        m_unformatted_blocks    number;
        m_unformatted_bytes     number;
        m_fs1_blocks            number;
        m_fs1_bytes             number;
        m_fs2_blocks            number;
        m_fs2_bytes             number;

        m_fs3_blocks            number;
        m_fs3_bytes             number;
        m_fs4_blocks            number;
        m_fs4_bytes             number;
        m_full_blocks           number;
        m_full_bytes            number;

begin
        dbms_space.SPACE_USAGE(
                segment_owner           => 'TEST_USER',
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks              => m_fs1_blocks ,
                fs1_bytes               => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,
                fs2_bytes               => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,
                fs3_bytes               => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,
                fs4_bytes               => m_fs4_bytes,
                full_blocks             => m_full_blocks,
                full_bytes              => m_full_bytes
        );

        dbms_output.new_line;
        dbms_output.put_line('Unformatted                   : ' || to_char(m_unformatted_blocks,'999,999,990') || ' / ' || to_char(m_unformatted_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 1 (  0 -  25% free) : ' || to_char(m_fs1_blocks,'999,999,990') || ' / ' || to_char(m_fs1_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 2 ( 25 -  50% free) : ' || to_char(m_fs2_blocks,'999,999,990') || ' / ' || to_char(m_fs2_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 3 ( 50 -  75% free) : ' || to_char(m_fs3_blocks,'999,999,990') || ' / ' || to_char(m_fs3_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 4 ( 75 - 100% free) : ' || to_char(m_fs4_blocks,'999,999,990') || ' / ' || to_char(m_fs4_bytes,'999,999,999,990'));
        dbms_output.put_line('Full                          : ' || to_char(m_full_blocks,'999,999,990') || ' / ' || to_char(m_full_bytes,'999,999,999,990'));

end;
/

The results aren’t what we expect:


Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :       35,001 /      286,728,192
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :        6,665 /       54,599,680

We have one block marked as 25 – 50% free (that’s the one block with 16 rows in it, which means about 40% space currently free) but our 41,666 full blocks are actually reported as 6,665 full blocks and 35,001 blocks with some space available. That’s going to hurt eventually if some process wants to insert more rows and finds that it has to fail its way through 35,001 blocks before finding a block which has enough free space.

So what happens when I repeat the PL/SQL block (and commit)? Here are the results from calls to dbms_space after the next two cycles:


Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :       70,002 /      573,456,384
Freespace 2 ( 25 -  50% free) :            2 /           16,384
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          220 /        1,802,240
Full                          :       13,330 /      109,199,360

Unformatted                   :          256 /        2,097,152
Freespace 1 (  0 -  25% free) :      105,003 /      860,184,576
Freespace 2 ( 25 -  50% free) :            3 /           24,576
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          117 /          958,464
Full                          :       19,995 /      163,799,040

Every time we execute the PL/SQL block we leave a trail of 35,001 more blocks which are flagged as “not quite full”.

Looking at the session stats while running the insert loop I can tell that Oracle isn’t checking to see whether or not it should be using those blocks. (A quick way of proving this is to flush the buffer cache before each execution of the PL/SQL and note that Oracle doesn’t read back the 105,000 blocks before inserting any data). So somehow, sometime, someone might get a nasty surprise – and here’s one way that it might happen:

Since I know I my data fits 24 rows per block I’m going to modify my PL/SQL block to select one row into the array then loop round the insert 25 times – so I know I’m inserting a little bit more than one block’s worth of data. Starting from the state with 105,003 blocks marked as “Freespace 1” this is what I saw – first, the free space report after inserting 25 rows:


Unformatted                   :          240 /        1,966,080
Freespace 1 (  0 -  25% free) :        1,074 /        8,798,208
Freespace 2 ( 25 -  50% free) :            0 /                0
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          133 /        1,089,536
Full                          :      123,927 /    1,015,209,984

Then a few wait events and session statistics for the insert:


---------------------------------------------------------
SID:    39:TEST_USER - jonathan
Session Events - 09-Jan 16:57:18
Interval:-      6 seconds
---------------------------------------------------------
Event                                             Waits   Time_outs        Csec    Avg Csec    Max Csec
-----                                             -----   ---------        ----    --------    --------
db file sequential read                          15,308           0         128        .008           3
db file scattered read                           20,086           0         271        .014           4

---------------------------------
Session stats - 09-Jan 16:57:18
Interval:-  6 seconds
---------------------------------
Name                                                                     Value
----                                                                     -----
session logical reads                                                  269,537
physical read total IO requests                                         35,401
db block gets                                                          229,522
consistent gets                                                         40,015
physical reads                                                         124,687
physical reads cache                                                   124,687
db block changes                                                       208,489
physical reads cache prefetch                                           89,293
redo entries                                                           207,892
redo size                                                           16,262,724
undo change vector size                                                  1,720
deferred (CURRENT) block cleanout applications                         103,932
table scan blocks gotten                                                20,797
HSC Heap Segment Block Changes                                              25

The session has read and updated almost all of the level 1 bitmap blocks. I don’t know exactly what triggered this re-read, but seems to be related to the number of rows inserted (or, perhaps, the amount of space used rather than the row count) as an insert crosses the pctfree boundary and fails over to the next block. I’ve only done a couple of little tests to try and get a better idea of why an insert sometimes sweeps through the bitmap blocks – so I know that inserting 2 or 3 rows at a time will also trigger the cleanout – but there are probably several little details involved that need to be identified.

You might note a couple of details in the stats:

  • Because I had flushed the buffer cache before the insert Oracle did its “cache warmup” tablescanning trick – if this had not happened I would probably have done a single block read for every single bitmap block I touched.
  • There are 103,932 block cleanout applications – but 208,000 db block changes and redo entries. Roughly half the latter are for data block cleanouts (OP code 4.1) and half are the state changes on the level 1 bitmap blocks (OP code 13.22). You’ll notice that neither change produces any undo.
  • I’ve also included the HSC Heap Segment Block Changes statistics to show you that not all changes to Heap Segment Blocks show up where you might expect them.

And finally:

If you re-run the tests on 11.2.0.4 and 12.2.0.1 you get the following results after the intial script run – the problem doesn’t exist:


11.2.0.4
========
Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       41,666 /      341,327,872

12.2.0.1
========
Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       41,666 /      341,327,872

So how does this help the OP.

  • First, there may be a huge mess still waiting to be cleaned in the table – but at 34M blocks I’m not too keen to suggest running the dbms_space routine to find out what it looks like – but maybe that’s necessary.
  • Secondly – an upgrade to 12.2 will probably avoid the problem in future.
  • Thirdly – if the number of rows per block is very close to uniform, write a little code to do a loop that inserts (say) 2 * expected number of rows per block as single row inserts and rolls back; the inserts will probably trigger a massive though perhaps not complete cleanout, so rinse and repeat until the cleanout is complete. Try to find a time when you don’t mind the extra load to get this job done.
  • Finally – on the big job that does the bulk insert – repeat the dummy insert/rollback at the end of the job to clean up the mess made by the job.

Addenda

Prompted by comment #2 below, I should add that if the problem has been fixed in 12.2 then possibly there’s a bug report and patch for it already. If there isn’t then the OP could raise an SR (referencing this blog note), and request a bug fix or back-port from 12.2.

And with 24 hours of publication, comment #4 (from Yury Pudovchenko) tells us that the bug is fixed by the Oct 2017 Bundle Patch.

 

 

October 6, 2017

12c Parse

Filed under: 12c,Oracle,Upgrades — Jonathan Lewis @ 9:07 am BST Oct 6,2017

Following on from a comment to a recent posting of mine about “bad” SQL ending up in the shared pool and the specific detail that too much bad SQL could cause contention problems while staying virtually invisible, there’s a related note today on the ODC (formerly OTN) forum of a little change in 12.2 that alerts you to the problem.

Try executing the following anonymous block (on a non-production system):


declare
        m1 number;
begin
        for i in 1..10000 loop
        begin
                execute immediate 'select count(*) frm dual' into m1;
                dbms_output.put_line(m1);
        exception
                when others then null;
        end;
        end loop;
end;
/

Then check your alert log (if you want to be a little cautious, change the 10,000 in the loop to something like 200). If you’re running 12.2.0.1 you’ll find something like the following:


ORCL(3):WARNING: too many parse errors, count=100 SQL hash=0x19a22496
ORCL(3):PARSE ERROR: ospid=4577, error=923 for statement:
2017-10-06T03:46:15.842431-04:00
ORCL(3):select count(*) frm dual
ORCL(3):Additional information: hd=0x7673c258 phd=0x765151a8 flg=0x28 cisid=135 sid=135 ciuid=135 uid=135
2017-10-06T03:46:15.842577-04:00
ORCL(3):----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x76734f18         5  anonymous block
ORCL(3):WARNING: too many parse errors, count=200 SQL hash=0x19a22496
ORCL(3):PARSE ERROR: ospid=4577, error=923 for statement:
2017-10-06T03:46:15.909523-04:00
ORCL(3):select count(*) frm dual
ORCL(3):Additional information: hd=0x7673c258 phd=0x765151a8 flg=0x28 cisid=135 sid=135 ciuid=135 uid=135
2017-10-06T03:46:15.909955-04:00
ORCL(3):----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x76734f18         5  anonymous block

The warning will be repeated every hundred occurrences. As you can see the guilty (ORA-00923: missing FROM) SQL appears in the report so you know what you’re looking for. In my particular case, with the silly PL/SQL block, the address of the calling anonymous pl/sql block was also reported:


select sql_text from V$sql where child_address = '0000000076734F18';

SQL_TEXT
--------------------------------------------------------------------------------
declare m1 number; begin  for i in 1..10000 loop  begin   execute immediate 'sel
ect count(*) frm dual' into m1;   dbms_output.put_line(m1);  exception	 when ot
hers then null;  end;  end loop; end;

In the case of the OP on ODC the SQL reported in the alert log was simply: “SELECT 1”. As Billy Verreynne suggested in the thread, this looks like the sort of code that would be sent to the database by some of the connection pooling clients to check that the database is up. Unfortunately (apart from the waste of effort) this particular setup seems to think it’s talking to some database other Oracle!

Footnote:

This is a feature of 12.2 – 11g and 12.1 don’t write such warnings to the alert log.

Lagniappe

A tweet from Mohamed Houri reminds me that parse failures like these, of course, show up in the instance activity stats, in particular:


Name                               Value
----                               -----
opened cursors cumulative         10,006
enqueue requests                  10,002
enqueue releases                  10,002
sql area purged                   10,000
sql area evicted                  10,000
parse count (total)               10,008
parse count (hard)                10,002
parse count (failures)            10,000

The enqueue requests are for the ‘CU’ (cursor) enqueue which, I think, appeared in 10g – they’re acquired (and released) on every hard parse.

Most of the figures that my session reports here are likely to be highly camouflaged by the rest of the activity from a normal system, so the most important number is the “parse count (failures)” – so it’s useful to know that you can subtract that number the other statistics to give you an idea of the impact that would be eliminated if you could located and stop the thing generating the failing statements.

Update

Patrick Joliffe (see pingback below) has published an article pointing out that in earlier versions of Oracle you can set event 10035 to get the same information dumped into the alert log on every parse failure.

 

Next Page »

Powered by WordPress.com.