Oracle Scratchpad

July 14, 2018

Quiz Night

Filed under: Execution plans,Oracle,Performance,sorting — Jonathan Lewis @ 7:07 pm BST Jul 14,2018

Here’s a question prompted by a recent thread on the ODevCom database forum – how many rows will Oracle sort (assuming you have enough rows to start with in all_objects) for the final query, and how many sort operations will that take ?


drop table t1 purge;

create table t1 nologging as select * from all_objects where rownum < 50000;

select owner, count(distinct object_type), count(distinct object_name) from t1 group by owner;

Try to resist the temptation of doing a cut-n-paste and running the code until after you’ve thought about the answer.

And the answer is:

It was nice to see a few ideas being volunteered in response to this question; I think that getting a diverse set of comments makes a nice point about how it’s always worth spending a little time to think along the lines of: “If I do X how might Oracle handle it”. Having the ideas before trying to check the effects can make it a lot easier to understand what’s happening and, sometimes, how to take advantage of what Oracle does to improve the way you design a query.

The first point to make, as Michael D O’Shea  pointed out in comment #2, is that computer systems don’t usually “sort” data – they tend to create pointers to data and shuffle the pointers in some way. In Oracle’s case “sorting” used to mean inserting pointers into a balanced binary tree, and aggregating used to be a case of accumulating values at the leaf nodes of the insertion tree. Then in 10g Oracle introduced a new sorting algorithm that often works more efficiently than the binary insertion tree. I’m still going to refer to the binary tree method as “sorting”, though.

Looking at the query we can see that there is no “order by” clause so it’s possible that Oracle will do whatever it does using hash aggregation throughout and no sorting, but that leaves open the question of how a hash table on owner can also record a distinct count of both object_type and object_name because every single owner hash bucket would have to link to its own hash tables for object_type and object_name and do a sort of “recursive hash aggregation” which starts to sound a little complicated. Maybe the alternative suggested by Kaley in comment #1 is closer to the truth – maybe Oracle just “buckets” all the data by owner and then sorts within each owner twice to do the count distincts, but then we’re still going to be hanging on to a lot of data, doing a two-level open-ended process.

Having waved hands for a little bit to try and head in the direction of possible solutions we need to look for clues that tell us whether we ought to eliminate or refine some of our guesses. There are several bits of information we could look at and running the query (although I asked you not to) is the next step we have to take. But when we run the query we want to see the session statistics, pick up the actual execution plan with rowsource execution statistics, and enable the 10032 and 10033 (sort) traces. So let’s fold the query into a longer script, something like:


set linesize 255
set trimspool on
set pagesize 60

set serveroutput off
alter session set statistics_level = all;

execute snap_my_stats.start_snap

alter session set events '10032 trace name context forever';
alter session set events '10033 trace name context forever';

select owner ... etc.

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

alter session set events '10033 trace name context off';
alter session set events '10032 trace name context off';

execute snap_my_stats.end_snap

To avoid blurring around the edges we may have to isolate the three different tests – the query against dbms_xplan.display_cursor(), for example, is obviously going to have some impact on the session stats – and we may then want to run each test twice in succession so that any warm-up or parsing activities don’t confuse the issue. It would also be a good idea to run the tests after creating a new session in case there are some distracting side effects from creating the data set. But with these details addressed, here are a few results:

First the execution plan (I got these results from 12.2.0.1, all recent versions of Oracle behave similarly):


----------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |     16 |00:00:00.34 |    1018 |       |       |          |
|   1 |  SORT GROUP BY     |      |      1 |     16 |     16 |00:00:00.34 |    1018 |  5014K|  1445K| 4456K (0)|
|   2 |   TABLE ACCESS FULL| T1   |      1 |  50000 |  50000 |00:00:00.07 |    1018 |       |       |          |
----------------------------------------------------------------------------------------------------------------

Oracle uses a SORT GROUP BY, not a HASH GROUP BY, and the indications are that we used about 4.4MB of memory to sort 50,000 rows. Then there’s the session statistics:


Name                                                     Value
----                                                ----------

session uga memory max                               3,255,648
session pga memory max                               3,211,264

table scan rows gotten                                  50,000

sorts (memory)                                               1
sorts (rows)                                           150,000

This says we did just one sort operation and sorted 150,000 rows using an excess of 3.2MB over the starting pga/uga (rather than the 4.4MB suggested by the plan); possibly the variation in apparent memory usage could be explained by the way that Oracle allocates reasonably large chunks to grow the PGA when you grow a workarea but since I’m taking deltas there’s also some scope for being misled by the change in maximum pga/uga memory.

Finally the 10032 trace file shows the following (we didn’t spill to disc, so the 10033 trace wasn’t triggered):


---- Sort Parameters ------------------------------
sort_area_size                    4562944
sort_area_retained_size           4562944
sort_multiblock_read_count        7
max intermediate merge width      38

---- Sort Statistics ------------------------------
Input records                             150000
Output records                            49907
Total number of comparisons performed     1945863
  Comparisons performed by in-memory sort 1945863
Total amount of memory used               4562944
Uses version 1 sort
---- End of Sort Statistics -----------------------

These figures are ones we have to trust – it seems we really did do one sort operation of 150,000 rows, and we did allocate 4.5MB of memory. There’s an obvious guess for the 150,000 input rows – Oracle has turned every row into three rows – the original row, a row to count the object_type, and a row to count the object_name – and with that in mind maybe we will be able to make sense of getting an output of 49,907 rows using 4.5M of memory. Let’s create a query that could produce the right numbers but with a differently arranged output:


select distinct owner, 0, null        from t1
union all
select distinct owner, 1, object_type from t1
union all
select distinct owner, 2, object_name from t1
order by 1, 2, 3
;

For my data set this query produced 49,907 rows of output (which is a number we wanted to see) and here are the first 9 rows of output – followed by the first row of output from the original query:


OWNER                    0 NULL
--------------- ---------- ---------------------
APPQOSSYS                0

APPQOSSYS                1 SYNONYM
APPQOSSYS                1 TABLE

APPQOSSYS                2 DBMS_WLM
APPQOSSYS                2 WLM_CLASSIFIER_PLAN
APPQOSSYS                2 WLM_FEATURE_USAGE
APPQOSSYS                2 WLM_METRICS_STREAM
APPQOSSYS                2 WLM_MPA_STREAM
APPQOSSYS                2 WLM_VIOLATION_STREAM


OWNER           COUNT(DISTINCTOBJECT_TYPE) COUNT(DISTINCTOBJECT_NAME)
--------------- -------------------------- --------------------------
APPQOSSYS                                2                          6

Spot the pattern ? All I have to do after this “union all with simple sort” is walk the result set in order accumulating distinct counts as I do so.

But what about the memory requirements ? Check back to the original 10032 trace file, it reported 4562944 bytes as the total memory needed, but it also reported using a “Version 1” sort – so before I ran my union all query I set “_newsort_enabled”=false, to get the following 10032 trace details:


---- Sort Statistics ------------------------------
Input records                             49907
Output records                            49907
Total number of comparisons performed     730667
  Comparisons performed by in-memory sort 730667
Total amount of memory used               4562944
Uses version 1 sort
---- End of Sort Statistics -----------------------

The memory used is exactly the number I wanted to see. (The  version 2 sort got exactly the same result using 3.5MB of memory, so I don’t know why it’s not used at this point – but maybe that’s because the implementation isn’t quite what I think.)

So, hypothesis (to date, until a reader shows me that my answer is incomplete – or wrong):

As the “SORT GROUP BY” operation accepts each row from the “TABLE ACCESS FULL” operation it converts each row into N rows (one base row and one for each column for which there is a count(distinct)). The base row holds just the set of “group by” columns and is tagged with a zero, each subsequent row holds the “group by” columns, a tag value to identify which column it carries, and one of the “count(distinct)” columns. Oracle then operates the normal “group by” aggregation mechanism but is actually aggregating on the “group by” columns plus the “tag” column. So each leaf node in the binary tree ends up holding {owner_value, tag_value, count}, and once all the data has been aggregated into the binary tree Oracle can walk the tree and perform a pivot to turn three rows for each owner value into a single row.

If you start thinking about nasty scenarios you will realise that the upshot of this implementation (if my hypothesis is correct) is that if you have a query with a long “group by” list, and several columns where there are lots of distinct values for each combination of the “group by” list then the volume of the B-tree could actually be much larger than the volume of the original table, and the amount of memory and CPU needed to build that tree (before collapsing it) could be huge.

Footnote:

There is one special case with this count(distinct …) query. If you have only ONE distinct operation in the query Oracle can use the “distinct aggregation” transformation with “view merging” to produce a completely different plan.

June 11, 2018

dbms_random

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 8:31 am BST Jun 11,2018

In a recent ODC thread someone had a piece of SQL that was calling dbms_random.string(‘U’,20) to generate random values for a table of 100,000,000 rows. The thread was about how to handle the ORA-30009 error (not enough memory for operation) that is almost inevitable when you use the “select from dual connect by level <= n” strategy for generating very large numbers of rows, but this example of calling dbms_random.string() so frequently prompted me to point out an important CPU saving , and then publicise through this blog a little known fact (or deduction) about the dbms_random.string() function.

If you generate a random string of length 6 using only upper-case letters there are 308,915,766 different combinations (266); so if you’re after “nearly unique” values for 100 million rows then a six character string is probably good enough – it might give you a small percentage of values which appear in a handful rows but most of the values are likely to be unique or have two rows. If you want to get closer to uniqueness then 7 characters will do it, and 8 will make it almost certain that you will get a unique value in every row.

So if you want “nearly unique” and “random 20 character strings” it’s probably sufficient to generate random strings of 6 to 8 characters and then rpad() them up to 20 characters with spaced – the saving in CPU will be significant; roughly a factor of 3 (which is going to matter when you’re trying to generate 100 million rows. As a little demo I supplied the OP with a script to create a table of just one million random strings – first of 20 random characters, then of 6 random characters with 14 spaces appended. The run time (mostly CPU) dropped from 1 minute 55 seconds to 41 seconds.

Why is there such a difference ? Because to generate a random string of 6 characters Oracle generates a random string of one character six times in a row and concatenates them. The difference between 6 calls and 20 calls per row gives you that factor of around 3. For a quick demo, try running the following anonymous PL/SQL block:

rem
rem     Script:         random_speed.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2010
rem

begin
        dbms_random.seed(0);
        dbms_output.put_line(dbms_random.string('U',6));
        dbms_output.new_line;

        dbms_random.seed(0);
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
end;
/

Jere are the results I got from instances of 12.1.0.2, 12.2.0.1, and 18.1.0.0 (from LiveSQL):


BVGFJB
B
V
G
F
J
B

I haven’t shown the tests for all the possible dbms_random.string() options but, unsurprisingly, changing the test to use the ‘L’ (lower case alpha) option produces the same effect (and the same 6 letters changed to lower case). The same effect, with different characters, also appeared using the ‘A’ (mixed case alpha), ‘X’ (uppercase alphanumeric) and ‘P’ (all printable characters) options.

I haven’t considered the effect of using a multi-byte character set – maybe Oracle calls its random number generator once per byte rather than once per character. The investigation is left as an exercise to the interested reader.

tl;dr

When generating a very large number of random strings – keep the “operational” part of the string as short as you can and leave the rest to be rpad()‘ed.

June 8, 2018

Massive Delete

Filed under: Oracle,Performance — Jonathan Lewis @ 9:14 am BST Jun 8,2018

The question of how to delete 25 million rows from a table of one billion came up on the ODC database forum recently. With changes in the numbers of rows involved it’s a question that keeps coming back and I wrote a short series for AllthingsOracle a couple of years ago that discusses the issue. This is note is just a catalogue of links to the articles:

There is an error in part 2 in the closing paragraphs – it says that the number of index entries deleted varies “from just one to 266″, it actually varies from 181 to 266.

 

June 1, 2018

Index Bouncy Scan 4

Filed under: 12c,Execution plans,Indexing,Oracle,Partitioning,Performance — Jonathan Lewis @ 9:19 am BST Jun 1,2018

There’s always another hurdle to overcome. After I’d finished writing up the “index bouncy scan” as an efficient probing mechanism to find the combinations of the first two columns (both declared not null) of a very large index a follow-up question appeared almost immediately: “what if it’s a partitioned index”.

The problem with “typical” partitioned indexes is that the smallest value of the leading column might appear in any of the partitions, and the combination of that value and the smallest value for the second column might not appear in all the partitions where the smallest value appears. Consider a table of 10 partitions and a locally partitioned index on (val1, val2) where neither column is the partition key. The smallest value of val1 – call it k1 may appear only in partitions 4, 7, 8, 9, 10; the lowest combination of (val1, val2) – call it (k1, k2) may appear only in partitions 8 and 10. In a global (or globally partitioned) index the pair (k1, k2) would be at the low (leftmost) end of the index, but to find the pair in a locally partitioned index we have to probe the leftmost end of 10 separate index partitions – and once we’ve done that each “bounce” requires us to probe 10 index partitions for the first (val1, val2) pair where val1 = k1 and val2 is just just greater than k2, or val1 is just greater than k1 and val2 is the minimum for that value of val1. The more partitions we have the greater the number of index partitions we have to probe at each step and the more likely it is that we ought to switch to a brute force index fast full scan with aggregate.

Here’s the starting point for solving the problem (maybe) – I’ll create a simple partitioned table, and use the “bouncy scan” code from the earlier posting with the table and column names adjusted accordingly:


rem
rem     Script:         bouncy_index_3.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10) ; alter table pt1 modify(status not null); execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

prompt  ==================================================
prompt  Make some rows in the last partition have a status
prompt  that won't be found in the first partition.
prompt  ==================================================

column namespace format 99999999
column partition_name new_value m_part

select  partition_name
from    user_tab_partitions
where   table_name = 'PT1'
order by
        partition_position
;

update pt1 partition (&m_part) set status = 'MISSING' where rownum <= 10;

select
        dbms_mview.pmarker(rowid), status, namespace
from    pt1
where   status = 'MISSING'
;

I’ve created a hash partitioned copy of view all_objects, duplicating it 10 times and created a local index on the columns (status, namespace). My data has two values for status, ‘VALID’ and ‘INVALID’, and there are about 10 values for the namespace. I’ve then updated a few rows in the last partition, giving them a status value that is between the two current values – this is just one little test case to help me check that my code is going to catch all values even if they don’t appear in the first table partition.

Here’s the query from the earlier posting – and it does get the right results – followed by the execution plan:


alter session set statistics_level = all;

set serveroutput off
set linesize 180
set pagesize 60

prompt  =============================================================
prompt  Original Query, showing expensive access for driving minimums
prompt  =============================================================

with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select
                        /*+ index(pt1) no_index_ffs(pt1) */
                        status, namespace,
                        row_number() over(order by status, namespace) rn
                from    pt1
        )
        where
                rn = 1
        union all
        select
                v1.status, v1.namespace
        from    bounce1,
                lateral (
                              select  /*+ index(pt1) no_index_ffs(pt1) no_decorrelate */
                                      pt1.status, pt1.namespace
                              from    pt1
                              where   pt1.status > bounce1.status
                              and     rownum = 1
                ) v1
        where   bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last outline'));


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        | 16378 (100)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 | 16378   (4)|       |       |     10 |00:00:00.58 |    1869 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 | 16377   (4)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     12 |00:00:00.58 |    1869 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |  8157   (4)|       |       |      2 |00:00:00.58 |    1747 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      2 |00:00:00.58 |    1747 |  1024 |  1024 | 2048  (0)|
|*  6 |       VIEW                                   |                 |      1 |      1 |  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |                 |      1 |    617K|  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |  2048 |  2048 | 2048  (0)|
|   8 |         PARTITION HASH ALL                   |                 |      1 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.34 |    1732 |       |       |          |
|   9 |          INDEX FULL SCAN                     | PT1_I1          |      4 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.15 |    1732 |       |       |          |
|  10 |       NESTED LOOPS                           |                 |      2 |      1 |  4110   (4)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  11 |        RECURSIVE WITH PUMP                   |                 |      2 |        |            |       |       |      2 |00:00:00.01 |       0 |       |       |          |
|  12 |        VIEW                                  | VW_LAT_1BBF5C63 |      2 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|* 13 |         COUNT STOPKEY                        |                 |      2 |        |            |       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  14 |          PARTITION HASH ALL                  |                 |      2 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|* 15 |           INDEX RANGE SCAN                   | PT1_I1          |      5 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|  16 |     SORT AGGREGATE                           |                 |     10 |      1 |            |       |       |     10 |00:00:00.01 |     122 |       |       |          |
|  17 |      PARTITION HASH ALL                      |                 |     10 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  18 |       FIRST ROW                              |                 |     40 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     122 |       |       |          |
|* 19 |        INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     40 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  20 |     RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     10 |00:00:00.01 |       0 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
   4 - filter("BOUNCE1"."STATUS" IS NOT NULL)
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "STATUS","NAMESPACE")<=1) 13 - filter(ROWNUM=1) 15 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  19 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

In terms of time the query doesn’t seem to have done too badly – but I’m only using a small data set and we can see from the numbers that we haven’t produced an efficient plan. Operations 8 and 9 tell us that we’ve done an index full scan on every single partition before passing the data up for a window sort operation. That’s clearly a bad thing, but we did have an index() hint at that bit of code that worked very well for the simple (global) index so maybe we should have taken that out before testing (except it doesn’t help much to do so since Oracle still scans all 617K rows, changing to an index fast full scan).

Apart from that massive load the rest of the query looks quite efficient. We keep seeing “partition hash all” of course – whatever we do we tend to do it to 4 separate partitions one after the other – but everything else we do looks rather efficient. But there is another problem – and this is where the importance of inserting the rows with status = ‘MISSING’ shows up: this query didn’t find them! We have a predicate “rownum = 1” in the second half of the bounce1 recursive subquery and because we’re using a partitioned index we’ve managed to find a row that looks appropriate in an early partition when the row we really needed doesn’t appear until the last partition.

Let’s return to this problem later – first we want to check if the rest of the query will run efficiently and give us the right answer if we can find some way of getting the starting values; so let’s use a strategy we’ve used before – replace the bounce1 subquery with a union all select from dual:


with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select 'INVALID' status, 1 namespace from dual
                union all
                select 'MISSING', 4 from dual
                union all
                select 'VALID', 1 from dual
        )
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last partition outline'));

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |        |      1 |        |    76 (100)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   1 |  SORT ORDER BY                             |        |      1 |      6 |    76   (2)|       |       |     11 |00:00:00.01 |     132 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |        |      1 |      6 |    75   (0)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|        |      1 |        |            |       |       |     14 |00:00:00.01 |     132 |  1024 |  1024 |          |
|   4 |     VIEW                                   |        |      1 |      3 |     6   (0)|       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   5 |      UNION-ALL                             |        |      1 |        |            |       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   6 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   7 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   8 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   9 |     SORT AGGREGATE                         |        |     11 |      1 |            |       |       |     11 |00:00:00.01 |     132 |       |       |          |
|  10 |      PARTITION HASH ALL                    |        |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  11 |       FIRST ROW                            |        |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     132 |       |       |          |
|* 12 |        INDEX RANGE SCAN (MIN/MAX)          | PT1_I1 |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  13 |     RECURSIVE WITH PUMP                    |        |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
  12 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

This gets us the right answer, very efficiently. There are only 11 rows in the result set and we have an average 12 buffer visits per row – which is reasonble given that we (probably) have to probe 4 index partitions for every row. So that’s 11 * 4 * 3 buffer visits per probe – which seems just about optimal.

The next step is to figure out a way of getting the (three in our case) starting points while using a partitioned index. Here’s a query we can use for bounce1:


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
)
select * from bounce1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

It looks a little convoluted with all the inline select statements, but they all do very small amounts of work and they’re only reading the index leaf blocks that you have to read. We know from yesterday’s post that Oracle can execute the scalar subqueries at lines 3 and 4 very efficiently; we can hope (and check) that the lateral() subqueries driven by the single values from the recursive row in bounce1 will operate just as efficiently – and here’s the plan:


----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                 |      1 |        |   166 (100)|       |       |      4 |00:00:00.01 |     132 |
|   1 |  VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   3 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   4 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   5 |      INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   6 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   7 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   8 |      FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  9 |       INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  11 |         PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  12 |          INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  13 |    FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  14 |    NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  15 |     NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  16 |      RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  17 |      VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |       SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  19 |        PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  20 |         FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 21 |          INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |     VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  23 |      SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  24 |       PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  25 |        FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 26 |         INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |           PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |            FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |             INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - access("STATUS"=)
  21 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  26 - access("PT1"."STATUS"=)
  30 - access("PT2"."STATUS">:B1)

Although we have done lots of individual probes into the index they have all been very efficient using a min/max access and an average of about 3 buffer visits per probe. So we can now insert this new bounce1 subquery into the previous query in place of the union all of dual and check that the two pieces of the query cooperate.


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace from bounce1
        union all
        select  bounce2.status, (select min(t.namespace) namespace from pt1 t where t.namespace > bounce2.namespace and status=bounce2.status) namespace
        from    bounce2
        where   bounce2.status is not null
        and     bounce2.namespace is not null
)
select  *
from    bounce2
where   namespace is not null
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |                 |      1 |        |   396 (100)|       |       |     11 |00:00:00.01 |     266 |
|*  1 |  VIEW                                       |                 |      1 |      4 |   396   (1)|       |       |     11 |00:00:00.01 |     266 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     15 |00:00:00.01 |     266 |
|   3 |    VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   4 |     UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   5 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   8 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   9 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|* 11 |         INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  12 |          SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  13 |           PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  14 |            INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  15 |      FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  16 |      NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  17 |       NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |        RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  19 |        VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  20 |         SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  21 |          PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |           FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 23 |            INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  24 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  25 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  26 |         PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 28 |           INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  29 |            SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  30 |             PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  31 |              FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 32 |               INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  33 |    SORT AGGREGATE                           |                 |     11 |      1 |            |       |       |     11 |00:00:00.01 |     134 |
|  34 |     PARTITION HASH ALL                      |                 |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  35 |      FIRST ROW                              |                 |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     134 |
|* 36 |       INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  37 |    RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |
------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("NAMESPACE" IS NOT NULL)
  11 - access("STATUS"=)
  23 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  28 - access("PT1"."STATUS"=)
  32 - access("PT2"."STATUS">:B1)
  36 - access("STATUS"=:B1 AND "T"."NAMESPACE">:B2)

Job done. We’ve found the distinct set of pairs without having to scan the entire index. We’ve found 11 pairs at a total cost of 266 buffer gets. For comparitive purposes the query totalled 56 buffer visits when I recreated the table as a non-partitioned table (again updating a few rows to status = ‘MISSING’).

It’s important to note that this query can only work this efficiently in 12.2 (and possibly in a suitably patched 11.2.0.4) because of the optimizer’s ability to use the min/max operation for queries like: “select max(col2) where col1 = (select max()…))”. When I ran the final query on 12.1.0.2 the execution plan changed around lines 11 and 28 where 12.2.0.1 could use the aggregate subquery to drive the min/max scan 12.1.0.2 did a real range scan with aggregate (which was extremely expensive at one point).

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|  23 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |  1206   (2)|       |       |      3 |00:00:05.43 |    2414 |
|  24 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:05.43 |    2414 |
|  25 |         PARTITION HASH ALL                  |                 |      3 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:05.84 |    2414 |
|* 26 |          INDEX RANGE SCAN                   | PT1_I1          |     12 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:02.03 |    2414 |
|  27 |           SORT AGGREGATE                    |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |            PARTITION HASH ALL               |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |             FIRST ROW                       |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |              INDEX RANGE SCAN (MIN/MAX)     | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, this makes a dramatic difference to the work Oracle has to do – in this case 2,414 buffer gets and 845K rows examined. As I said in yestrday’s post – there’s a patch for 11.2.0.4, so there could be a patch for 12.1.0.2 if you ask for it, but it looks like no-one has done so yet.

<h3>Footnote:</h3>

I could have used a lateral() view in the first half of bounce1 to reduce the reported number of probes of pt1_i1 in the plan – but it made the code extremely messy, I had to include a /*+ no_decorrelate */ hint in it, and it increased the number of buffer visits slightly because the optimizer seemed to lose the option for a min/max scan in this particular lateral join.

 

May 31, 2018

Min/Max upgrade

Filed under: 12c,Indexing,Oracle,Partitioning,Performance — Jonathan Lewis @ 2:13 pm BST May 31,2018

Here’s a nice little optimizer enhancement that appeared in 12.2 to make min/max range scans (and full scans) available in more circumstances. Rather than talk through it, here’s a little demonstration:

rem
rem     Script:         122_minmax.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1        Good path
rem             12.1.0.2        Bad path

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10)  -- > comment to avoid format wordpress issue
;

alter table pt1 modify(status not null);

execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

alter session set statistics_level = all;
set serveroutput off
set linesize 156
set pagesize 60
set trimspool on

select  min(status) from pt1;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = 'INVALID';
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = (select min(status) from pt1);
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

The basic “min/max” optimisation allows Oracle to avoid a massive sort aggregate – Oracle doesn’t need to acquire a lot of data and sort it when it knows that the “left hand” end of an index is the low values and the “right hand” is the high values so, for example, in the first query above the optimizer could simply walk down the index branches to the left hand leaf and look at the single lowest entry in the leaf block to determine the lowest value for status … if the index had been a global index.

Things get a little messy, though, when the index is locally partitioned and your query isn’t about the partition key and there’s no suitable global index. Once upon a time (IIRC) Oracle would simply have to do an index fast full scan across all index partitions to handle such a query, but some time ago it got a lot cleverer and was enhanced to do a min/max scan on each partition in turn getting one value per partition very efficiently, then aggregating across those values to find the global minimum.

Here are the three execution plans (with rowsource execution stats pulled from memory) taken from 12.1.0.2 for the queries above:


-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   3 |    INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
-----------------------------------------------------------------------------------------------------------------------------


-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE               |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL          |        |      1 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
|   3 |    FIRST ROW                  |        |      4 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      12 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
-------------------------------------------------------------------------------------------------------------------------------


-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |        |      1 |        |   337 (100)|       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   2 |   PARTITION HASH ALL           |        |      1 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|*  3 |    INDEX FAST FULL SCAN        | PT1_I1 |      4 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|   4 |     SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |      0 |
|   5 |      PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
|   6 |       INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
-----------------------------------------------------------------------------------------------------------------------------------------

In the first plan Oracle has done an “index full scan (min/max)” across each of the four partitions in turn to return one row very cheaply from each, then aggregated to find the overall minimum.

In the second plan Oracle has done an “index range scan (min/max)” in exactly the same way, since it was able to find the start point in the index for the status ‘INVALID’ very efficiently.

In the third plan Oracle has been able to find the minimum value for the status (‘INVALID’) very efficiently in the subquery, and has passed that single value up to the main query, which has then used a brute force approach to search the whole of every partition of the index for every occurrence (all 10 of them) of the value ‘INVALID’ and then aggregated them to find the minimum namespace. Despite “knowing”, by the time the main query runs, that there will be a single value to probe for the status, the optimizer has not anticipated the fact that the final query will effectively become the same as the preceding one. As a result we’ve read 2,242 data blocks into the cache.

Turn, then, to the execution plan from 12.2.0.1 for this last query:


---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      24 |
|   1 |  SORT AGGREGATE                 |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   2 |   PARTITION HASH ALL            |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   3 |    FIRST ROW                    |        |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)  | PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   5 |      SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
---------------------------------------------------------------------------------------------------------------------------------

In 12.2 you can see that the main query is now doing an “index range scan (min/max)” on each index partition in turn, based on the incoming (though unknown at parse time) single value from the subquery. As a result the total work done is a mere 24 buffer visits.

There have been a couple of occasions in the past where I’ve had to write some PL/SQL to work around little details like this. It’s nice to know simple tables and partitoned tables with local indexes can now behave the same way. I also wonder whether there may be sites that could drop (or drop columns from, or make local) some indexes that they’ve previously created to  handle queries of the “most recent occurence” type.

If, for any reason, you need to disable this enhancement, it’s controlled by fix_control (v$system_fix_control) “18915345 Allow MIN/MAX optimization for pred having single row subquery” which can be set in the startup file, at the system level, or in the session.

Update

Checking MoS for the bug number I found that the limitation had been reported for 11.2.0.3, with “Fixed in product version” reported as 12.2; but there are patches for various releases of 11.2.0.4, though none yet for 12.1.0.2 – but if you think you need it you can always try raising an SR.

 

May 29, 2018

Index Bouncy Scan 2

Filed under: 12c,Index skip scan,Oracle,Performance — Jonathan Lewis @ 12:27 pm BST May 29,2018

I wrote a note some time last year about taking advantage of the “index range scan (min/max)” operation in a PL/SQL loop to find the small number distinct values in a large single column index efficiently (for example an index that was not very efficient but existed to avoid the “foreign key locking” problem. The resulting comments included pointers to other articles that showed pure SQL solutions to the same problem using recursive CTEs (“with” subqueries) from Markus Winand and Sayan Malakshinov: both writers also show examples of extending the technique to cover more cases than the simple list of distinct values.

The topic came up again on the ODC (OTN) database forum a couple of days ago; one of the replies linked back to my original posting, another gave the recursive solution for a single column index – so I ended up seeing the following question twice, once as a comment on my blog, once in the forum: “Can you extend this method to a two column index, what about an N column index ?”

Here’s a walk-through of working out one possible solution for the two-column requirement – how to find all the distinct combinations for the first two columns of a very large index without having to scan and aggregate the whole index. We start with a suitable table and index.


rem
rem     Script:         bouncy_index.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,3)                 val1,
        mod(rownum-1,10)                val2,
        lpad('x',100,'x')               padding
from
        generator       v1
order by
        dbms_random.value
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add constraint t1_pk primary key(val1, val2, id);

I’ve created a table with 3 values for val1, 10 values for val2, with a total of 30 combinations. The addition of the primary key starting with (val1, val2) is just a lazy way to ensure that I have a suitable index AND val1 and val2 are both declared not null.

With this data my first step will be to demonstrate the recursive CTE (“with” subquery) used by Andrew Sayer in the ODC posting to get the distinct values for val1 using three index “index range scan (min/max)”probes. I’ve included the in-memory execution plan with rowsource execution stats to show that this does a minimal amount of work.

The results in this note come from 12.2.0.1:


set serveroutput off
alter session set statistics_level = all;

with bouncy (val1)
as (
        select  min(val1) val1
        from    t1
        union all
        select  (select min(t1.val1) val1 from t1 where t1.val1 > bouncy.val1) val1
        from    bouncy
        where   bouncy.val1 is not null
    )
select  *
from    bouncy
where   bouncy.val1 is not null
order by
        val1
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    19 (100)|      3 |00:00:00.01 |       7 |      4 |       |       |          |
|   1 |  SORT ORDER BY                             |       |      1 |      2 |    19   (6)|      3 |00:00:00.01 |       7 |      4 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |       |      1 |      2 |    18   (0)|      3 |00:00:00.01 |       7 |      4 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |      4 |00:00:00.01 |       7 |      4 |  1024 |  1024 |          |
|   4 |     SORT AGGREGATE                         |       |      1 |      1 |            |      1 |00:00:00.01 |       2 |      1 |       |       |          |
|   5 |      INDEX FULL SCAN (MIN/MAX)             | T1_PK |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       2 |      1 |       |       |          |
|   6 |     SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       5 |      3 |       |       |          |
|   7 |      FIRST ROW                             |       |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |      3 |       |       |          |
|*  8 |       INDEX RANGE SCAN (MIN/MAX)           | T1_PK |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |      3 |       |       |          |
|   9 |     RECURSIVE WITH PUMP                    |       |      4 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("BOUNCY"."VAL1" IS NOT NULL)
   8 - access("T1"."VAL1">:B1)

As you can see I’ve done an “index full scan (min/max)” as the first step of the recursive query, visiting just two buffered blocks (the index leaf-block count is 27 – roughly 9 per value of val1 – so Oracle is clearly doing an efficient access for that value, it’s not rally a “full” scan. We then see 3 “index range scan (min/max)” at roughly 2 buffer visits each to collect the remaining values. (There’s probably a small saving in buffer gets due to the pinning that takes place).

So we can get the val1 values very easily and efficiently with this recurstive CTE technology. Let’s write some code that uses the same technology to find the val2 values for each possible val1 value in turn:

with bounce2 (val1, val2)
as (
        select val1, val2 from (
                select  0 val1, 0 val2 from dual
                union all
                select 1,0 from dual
                union all
                select 2,0 from dual
        )
        union all
        select  bounce2.val1, (select min(t1.val2) val2 from t1 where t1.val1 = bounce2.val1 and t1.val2 > bounce2.val2) val2
        from    bounce2
        where   bounce2.val2 is not null
--      and     bounce2.val1 is not null
)
select * from bounce2
where
        bounce2.val2 is not null
and     bounce2.val1 is not null        -- > redundant predicate
order by
        val1, val2
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    27 (100)|     30 |00:00:00.01 |      32 |     24 |       |       |          |
|   1 |  SORT ORDER BY                             |       |      1 |      6 |    27   (4)|     30 |00:00:00.01 |      32 |     24 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |       |      1 |      6 |    26   (0)|     30 |00:00:00.01 |      32 |     24 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |     33 |00:00:00.01 |      32 |     24 |  1024 |  1024 |          |
|   4 |     VIEW                                   |       |      1 |      3 |     6   (0)|      3 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      UNION-ALL                             |       |      1 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   7 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   8 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   9 |     SORT AGGREGATE                         |       |     30 |      1 |            |     30 |00:00:00.01 |      32 |     24 |       |       |          |
|  10 |      FIRST ROW                             |       |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|* 11 |       INDEX RANGE SCAN (MIN/MAX)           | T1_PK |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|  12 |     RECURSIVE WITH PUMP                    |       |     11 |        |            |     30 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
  11 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)


In this example of the code the second half of the CTE looks remarkably similar to the previous statement – except I now have a two-column CTE and I’ve included an equality predicate against val1 based on the first of the two columns. In the first half of the code I’ve cheated (as a temporary measure) and supplied three rows of data which list the three distinct values of val1 with their associated minimum values for val2.

The execution plan shows that I’ve done 30 “index range scan (min/max)” of the index with 32 buffer visits. And that’s exactly the right number of probes to return my result set. So if I can manage to generate the starting values efficiently I can execute the whole query efficiently. So let’s find a way of changing that “union all on dual” fudge into a generic statement. Let’s replace it with a recursive CTE:


with bounce1(val1, val2) as (
        select val1, val2 
        from    (
                select
                        /*+ index(t1) */
                        val1, val2,
                        row_number() over(order by val1, val2) rn
                from    t1
        )
        where
                rn = 1
        union all
        select
                (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
                (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
        from    bounce1
        where   bounce1.val1 is not null
),
bounce2 (val1, val2)
as (
        select  val1, val2 
        from    bounce1
--      where   bounce1.val1 is not null
        union all
        select  bounce2.val1, (select min(t1.val2) val2 from t1 where t1.val1 = bounce2.val1 and t1.val2 > bounce2.val2) val2
        from    bounce2
        where   bounce2.val2 is not null
--      and     bounce2.val1 is not null
)
select * from bounce2
where
        bounce2.val2 is not null
and     bounce2.val1 is not null        -- > redundant predicate
order by
        val1, val2
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |       |      1 |        |   189 (100)|     30 |00:00:00.01 |      45 |       |       |          |
|   1 |  SORT ORDER BY                               |       |      1 |      4 |   189   (2)|     30 |00:00:00.01 |      45 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |       |      1 |      4 |   188   (2)|     30 |00:00:00.01 |      45 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |       |      1 |        |            |     34 |00:00:00.01 |      45 |  1024 |  1024 |          |
|   4 |     VIEW                                     |       |      1 |      2 |    87   (2)|      4 |00:00:00.01 |      13 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |      4 |00:00:00.01 |      13 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |       |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |       |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       5 |       |       |          |
|  10 |        FIRST ROW                             |       |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |       |       |          |
|* 11 |         INDEX RANGE SCAN (MIN/MAX)           | T1_PK |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |       |       |          |
|  12 |       SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       6 |       |       |          |
|* 13 |        COUNT STOPKEY                         |       |      3 |        |            |      2 |00:00:00.01 |       6 |       |       |          |
|* 14 |         INDEX RANGE SCAN                     | T1_PK |      3 |    500 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  15 |       RECURSIVE WITH PUMP                    |       |      4 |        |            |      3 |00:00:00.01 |       0 |       |       |          |
|  16 |     SORT AGGREGATE                           |       |     30 |      1 |            |     30 |00:00:00.01 |      32 |       |       |          |
|  17 |      FIRST ROW                               |       |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|* 18 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|  19 |     RECURSIVE WITH PUMP                      |       |     11 |        |            |     30 |00:00:00.01 |       0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "VAL1","VAL2")<=1) 11 - access("T1"."VAL1">:B1)
  13 - filter(ROWNUM=1)
  14 - access("T1"."VAL1">:B1)
  18 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)


Again we see 30 probes using “index range scan (min/max)” with 32 buffer gets to get 30 rows; plus a further 13 buffer gets to generate the three driving rows. The 13 buffer gets break down to: 2 to get the minimum (val1, val2) combination using an “index full scan (min/max)”, then 5 for the probes to get the three minimum values for val1, and 6 for the probes to get the three corresponding minimum values of val2.

You’ll notice that I’ve got various “is not null” predicates scattered throughout the code. In some cases this is to stop Oracle from running into an infinite loop and reporting Oracle error: ORA-32044: cycle detected while executing recursive WITH query” This will occur because of the way that “(select max()…)” inline scalar subqueries returning a null if there is no data found which would lead to the next cycle of the recursive descent taking that null as an input – hence starting the infinite recursion. In some cases the “is not null” predicates are my default pattern for recurstive CTEs and some of them could probably be removed with no change in meaning (or workload).

The /*+ index() */ hint in the starting point for bounce1 was necessary to avoid an “index fast full scan” in 12.2; but that was purely a case of the statistics – number of distinct values, leaf_block count, etc – making the optimizer pick an option that was appropriate for this tiny data set, but not appropriate for the demonstration.  In fact this looks like the side effect of two defects in the 12.1 optimizer code, of which only one has been fixed in 12.2.

Optimizer Limitations

Here’s an extract from the execution plan for the final query with an /*+ index(t1) */ hint in place. The extract is identical for 12.1.0.2 and 12.2.0.1:

--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |       |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |

You’ll notice the Cost at operation 8 is appropriate for a real (i.e. all leaf blocks) full scan of the index. (The leaf_block value was 27 as I mentioned earlier on). You’ll also see that the OMem (PGA requirement for optimum workarea operation) figure is consistent with Oracle processing 10,000 rows in the index. Since the optimizer managed to work out that it could do a full scan with nosort and stopkey it looks a little surprising that the algorithms didn’t manage to make some allowance for the limited access that would occur. (I’d view this as a current limitation, rather than a bug, though).

Now compare the equivalent extracts when we hint an index fast full scan 12.1.0.2 first, then 12.2.0.1:

12.1.0.2
--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |    39   (8)|      1 |00:00:00.03 |      32 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |       |      1 |  10000 |    39   (8)|      1 |00:00:00.03 |      32 |  2048 |  2048 | 2048  (0)|
|   8 |         INDEX FAST FULL SCAN                 | T1_PK |      1 |  10000 |     5   (0)|  10000 |00:00:00.01 |      32 |       |       |          |

12.2.0.1
--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |     7  (29)|      1 |00:00:00.01 |      34 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |       |      1 |  10000 |     7  (29)|      1 |00:00:00.01 |      34 |  2048 |  2048 | 2048  (0)|
|   8 |         INDEX FAST FULL SCAN                 | T1_PK |      1 |  10000 |     5   (0)|  10000 |00:00:00.01 |      34 |       |       |          |

In both cases the cost of the index fast full scan is the same – and much cheaper; but in 12.1.0.2 the cost of the query looks as if it is allowing for sorting (and spilling) the entire 10,000 rows of returned from the index fast full scan (even though the OMem indicates otherwise), while the cost in 12.2.0.1 looks as if it recognises that it just has to do a running comparison through the data set as it returns, keeping only the current minimum in memory at any one moment. This clearly matches our expectations of how Oracle ought to behave, which is why I’d call this a bug in 12.1, fixed by 12.2.

The dramatic change in cost of operation 7 on the upgrade explains the change in plan and the necessity for the /*+ index(t1) */ hint – but if the “first row” predicate were also reflected in the costing then the cost of the “stopkey” index full scan would drop to 2 (probably) and the original 12.1 path would be re-appear.

Footnote

I don’t think there’s a lot of scope for improving the efficiency of this query for getting the (relatively) small number of distinct combinations from the first two columns of a very large index – but there are some very clever SQL bunnies on the ODC forum, so I won’t be surprised if someone comes up with a better solution.

Update

Well it didn’t take very long for someone to improve my SQL. Andrew Sayer took advantage of the “cross apply” feature of Oracle 12c to get rid of that nasty little bit of SQL where I’d used two scalar subqueries in the select list of the driving CTE. Here are the before and after versions of that fragment:


        select
                (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
                (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
        from    bounce1
        where   bounce1.val1 is not null


        select
                ca.val1 ,ca.val2
        from    bounce1
        cross  apply (select val1, val2
                      from  (select /*+ index(t1) no_index_ffs(t1) */
                                     val1, val2
                             from    t1
                             where   t1.val1 > bounce1.val1
                             and     rownum = 1
                            )
                     ) ca
        where  bounce1.val1 is not null

This “cross apply” has the effect of running a correlated subquery for every row selected from (this level of) bounce1 and then joining the results back to (this level of) bounce1. With this change in place (and with my original data set) the following plan appears:


------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        |   161 (100)|     30 |00:00:00.01 |      40 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 |   161   (2)|     30 |00:00:00.01 |      40 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 |   160   (2)|     30 |00:00:00.01 |      40 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |     33 |00:00:00.01 |      40 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |    73   (2)|      3 |00:00:00.01 |       8 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |      3 |00:00:00.01 |       8 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |                 |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |                 |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK           |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       NESTED LOOPS                           |                 |      3 |      1 |    43   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  10 |        RECURSIVE WITH PUMP                   |                 |      3 |        |            |      3 |00:00:00.01 |       0 |       |       |          |
|  11 |        VIEW                                  | VW_LAT_A83890C2 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  12 |         VIEW                                 |                 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|* 13 |          COUNT STOPKEY                       |                 |      3 |        |            |      2 |00:00:00.01 |       6 |       |       |          |
|* 14 |           INDEX RANGE SCAN                   | T1_PK           |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  15 |     SORT AGGREGATE                           |                 |     30 |      1 |            |     30 |00:00:00.01 |      32 |       |       |          |
|  16 |      FIRST ROW                               |                 |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|* 17 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK           |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|  18 |     RECURSIVE WITH PUMP                      |                 |     11 |        |            |     30 |00:00:00.01 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
   4 - filter("BOUNCE1"."VAL1" IS NOT NULL)
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "VAL1","VAL2")<=1) 13 - filter(ROWNUM=1) 14 - access("T1"."VAL1">"BOUNCE1"."VAL1")
  17 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)

If you compare this with my final plan further up the page you can see that operations 9 – 14 look completely different and while my plan shows two “sort aggregate” probes against t1_pk, Andrew’s plan does an interesting “nested loop” driven by a “recursive pump” that effectively halves the work done in this section of the plan.

Another little detail about this plan that I found interesting was that the “cross apply” had been converted to a “lateral join” internally – note the VW_LAT_xxxx view name. This was a little irritating because I had actually tried to write the query with a lateral join in the first place and ended up getting the wrong results. I’ve got a follow-up posting about this – but (spoiler alert) I think it means I’ve found another bug.

March 19, 2018

Reference Costs

Filed under: Oracle,Partitioning,Performance — Jonathan Lewis @ 7:44 am BST Mar 19,2018

The partitioning option “partition by reference” is a very convenient option which keeps acquiring more cute little features, such as cascading truncates and cascading splits, as time passes – but what does it cost and would you use it if you don’t really need to.

When reference partitioning came into existence many years ago, I had already seen several performance disasters created by people’s enthusiasm for surrogate keys and the difficulties this introduced for partition elimination; so my first thought was that this was a mechanism that would have a hugely beneficial effect on systems which (in 20:20 – or 6:6 if you’re European – hindsight) had been badly designed and would otherwise need a lot of re-engineering to use partitioning effectively.

(Side note: Imagine you have partitioned an orders table on colX which is a column in the real (business-oriented) candidate key, but you’ve created a surrogate key which is used as the target for a foreign key from the order_lines tables – how do you get partition-wise joins between orders and order_lines if you haven’t got the partitioning column in the order_lines table ?)

So ref partitioning was a good way to workaround a big existing problem and, whatever overheads it introduced, the benefit was potentially so  huge that you wouldn’t care (or, probably, notice) that your system was less efficient than it ought to be. But what if you’re working on a new project and still have control of the physical design – how does that change the cost/benefit analysis.

It’s actually taken me several years to get round to producing a little demonstration to highlight one of the obvious costs of reference partitioning – even though it’s a very simple demo raising the obvious question: ‘how much work does Oracle have to do to find the right partition when inserting a “child” row ?’ If you chose to implement reference partitioning without asking that simple question you may be using a lot more machine resources than you really need to, although you may not actually be heading for a disastrous performance problem.

As a demonstration of the issue I’m going to set up something that approximates an order/order_lines model in two ways, one using reference partitioning and one using a copied column, to see what differences show up when you start loading data.

rem
rem     Script:         pt_ref.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem

create table orders (
        id              number(10,0) not null,
        id_cust         number(10,0) not null,
        date_ordered    date not null,
        padding         varchar2(150)
)
partition by range (date_ordered)
(
        partition p201801       values less than (to_date('01-Feb-2018')),
        partition p201802       values less than (to_date('01-Mar-2018')),
        partition p201803       values less than (to_date('01-Apr-2018')),
        partition p201804       values less than (to_date('01-May-2018')),
        partition p201805       values less than (to_date('01-Jun-2018')),
        partition p201806       values less than (to_date('01-Jul-2018')),
        partition p201807       values less than (to_date('01-Aug-2018'))
);

create unique index ord_pk on orders (id);
alter table orders add constraint ord_pk primary key(id);

create table order_lines (
        id_ord          number(10,0) not null,
        line_number     number(4,0)  not null,
        id_product      number(6,0)  not null,
        qty             number(6,0)  not null,
        value           number(10,2) not null,
        padding         varchar2(150),
        constraint orl_fk_ord foreign key (id_ord) references orders
                on delete cascade
)
partition by reference (orl_fk_ord)
;

create unique index orl_pk on order_lines (id_ord, line_number);
alter table order_lines add constraint orl_pk primary key (id_ord, line_number);

create table order_lines_2 (
        date_ordered    date,
        id_ord          number(10,0) not null,
        line_number     number(4,0)  not null,
        id_product      number(6,0)  not null,
        qty             number(6,0)  not null,
        value           number(10,2) not null,
        padding         varchar2(150),
        constraint orl2_fk_ord foreign key (id_ord) references orders
                on delete cascade
)
partition by range (date_ordered)
(
        partition p201801       values less than (to_date('01-Feb-2018')),
        partition p201802       values less than (to_date('01-Mar-2018')),
        partition p201803       values less than (to_date('01-Apr-2018')),
        partition p201804       values less than (to_date('01-May-2018')),
        partition p201805       values less than (to_date('01-Jun-2018')),
        partition p201806       values less than (to_date('01-Jul-2018')),
        partition p201807       values less than (to_date('01-Aug-2018'))
)
;

create unique index orl2_pk on order_lines_2 (id_ord, line_number);
alter table order_lines_2 add constraint orl2_pk primary key (id_ord, line_number);

It’s a bit of a bodge job as far as modelling goes, but that’s to keep workload comparisons easy and make a point without writing too much code. All I’ve got is an orders table partitioned by date and an order_lines table that I want partitioned the same way. I’ve handled the requirement for partitioning order_lines in two ways, one is partition by reference and the other is to copy down the partitioning column from the orders table. (In my view the “real” key for an orders table should be (customer identifier, order date, counter) and if I thought efficient partitioning was going to be a necessary feature for scalability I would copy down all three columns. Depending on the nature of the business I would compress the primary key index on orders on one or two of the columns, and the foreign key index on order_lines on one, two, or three of its columns)

Now all I have to do is load some data into the tables. First the orders table:

insert into orders(
        id, id_cust, date_ordered, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        rownum                                  id,
        trunc(dbms_random.value(10000,20000))   id_cust,
        to_date('01-Jan-2018') +
                trunc((rownum-1)/100)           date_ordered,
        rpad('x',40)                            padding
from
        g,g
where
        rownum <= 2e4
;

commit;
execute dbms_stats.gather_table_stats(user,'orders')

This produces 100 orders per day, for 200 days which fits within the seven months of pre-declared partitions. I’ve gathered table stats on the table because that’s probably the best way to deal with any requirements for block cleanout after the insert. (Note: I’m avoiding interval partitioning in this example because that’s just another complication to add to the comparison and, as I reported a few days ago, introduces another massive inefficiency on data loading.)

Now I’ll insert some order_lines rows at 5 lines per order into the two versions of the order_lines tables. One of them, of course, has to have a date generated using the same algorithm that I used for the orders table. Note that I’ve made a call to dbms_random.seed(0) before each insert to guarantee that the same “random” values will be inserted in both table.

execute dbms_random.seed(0)

insert into order_lines_2(
        date_ordered, id_ord, line_number, id_product, qty, value, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        to_date('01-Jan-2018') +
                trunc((rownum-1)/500)           date_ordered,
        1 + trunc((rownum-1)/5)                 id_ord,
        1 + mod(rownum,5)                       line_number,
        trunc(dbms_random.value(10000,20000))   id_product,
        1 qty,
        1 value,
        rpad('x',80)                            padding
from
        g,g
where
        rownum <= 10e4
;

commit;


execute dbms_random.seed(0)

insert into order_lines(
        id_ord, line_number, id_product, qty, value, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        1 + trunc((rownum-1)/5)                 id_ord,
        1 + mod(rownum,5)                       line_number,
        trunc(dbms_random.value(10000,20000))   id_product,
        1 qty,
        1 value,
        rpad('x',80)                            padding
from
        g,g
where
        rownum <= 10e4
;

commit;

What I haven’t shown in the code is the snapshot wrapping I used to check the session stats, system latch activity and system rowcache activity – which I thought would give me the best indication of any variation in workload. In fact, of course, the first and simplest variation was the elapsed time: 4.5 seconds for the ref partitioned table, 2.5 seconds for the explicitly created table (regardless of which insert I did first), and it was nearly all pure CPU time.

It turned out that the rowcache stats showed virtually no variation, and the latch stats only showed significant variation in the latches that I could have predicted from the session stats, and here are the most significant session stats that highlight and explain the difference in times:

Explicitly Created
------------------
CPU used by this session                                                   231
DB time                                                                    242
db block gets                                                          219,471
db block changes                                                        27,190
redo entries                                                            15,483
redo size                                                           24,790,224
HSC Heap Segment Block Changes                                           2,944

Ref partitioned
---------------
CPU used by this session                                                   515
DB time                                                                    532
db block gets                                                          615,979
db block changes                                                       418,025
redo entries                                                           209,918
redo size                                                           70,043,676
HSC Heap Segment Block Changes                                         100,048

These results were from 12.1.0.2, but the figures from 12.2.0.1 and 11.2.0.4 were similar though the CPU time dropped as the version number went up: what you’re seeing is the effect of turning an array insert (for the precreated table) into single row processing for the ref partitioned table. Basically it seems that for every row inserted Oracle has to do something to work out which partition the row should go into, and while it does that work it release any pins of buffers it would have been holding from the previous row’s insert; in other words, various optimisations relating to array inserts are not taking place.

  • Looking in more detail at the figures for the ref partition insert:
  • The 100,000 “HSC heap Segment Block Changes” equate to the 100,000 rows inserted into the table
  • Add the single row index updates to the primary key and you get 200,000 redo entries.
  • For every individual row inserted Oracle has to do a current mode (db block gets) check against the primary key of the orders table – but when array processing the root block can be pinned.

We can get a closer look at the differences by taking snapshots of v$segstat (or v$segment_statistics), to see the following (pre-created table on the left):


ORD_PK                                |   ORD_PK
  logical reads          199,440      |     logical reads          300,432
                                      |
ORDER_LINES_2 - P201801               |   ORDER_LINES - P201801
  logical reads            2,112      |     logical reads           16,960
  db block changes         1,280      |     db block changes        16,944
                                      |
ORDER_LINES_2 - P201802               |   ORDER_LINES - P201802
  logical reads            2,256      |     logical reads           16,144
  db block changes         1,248      |     db block changes        15,088
                                      |
ORDER_LINES_2 - P201803               |   ORDER_LINES - P201803
  logical reads            2,288      |     logical reads           17,264
  db block changes         1,376      |     db block changes        16,560
                                      |
ORDER_LINES_2 - P201804               |   ORDER_LINES - P201804
  logical reads            2,672      |     logical reads           16,768
  db block changes         1,280      |     db block changes        16,144
                                      |
ORDER_LINES_2 - P201805               |   ORDER_LINES - P201805
  logical reads            2,224      |     logical reads           17,472
  db block changes         1,264      |     db block changes        16,528
                                      |
ORDER_LINES_2 - P201806               |   ORDER_LINES - P201806
  logical reads            2,624      |     logical reads           16,800
  db block changes         1,328      |     db block changes        16,160
                                      |
ORDER_LINES_2 - P201807               |   ORDER_LINES - P201807
  logical reads            1,376      |     logical reads           10,368
  db block changes           864      |     db block changes        10,752
                                      |
ORL2_PK                               |   ORL_PK
  logical reads           10,640      |     logical reads          206,352
  db block changes         7,024      |     db block changes       104,656

The right hand data set does an extra 100,000 logical reads on the ORD_PK index (top set of lines) which I think are the 100,000 gets on the root block that was pinned for the table on the left – the numbers don’t quite add up, so there’s some extra complexity that I haven’t guessed correctly.

The insert into the ORL[2]_PK index (lines) is single row processed for the right hand table – with, I think, the logical reads recording two current gets per insert.

Every partition of the table, except the last, shows 15,000 db block changes, totalling a difference of about 100,000 db block changes corresponding to the single rows being inserted. Then ORL[2]_PK shows another 100,000 db block changes, giving us the 200,000 we saw as redo entries and 400,000 (when doubled up to allow for the undo) db block changes that we saw in total.

Finally we need to explain the difference of 400,000 db block gets between the two sets of session stats – and I think this is the extra 100,000 for ORD_PK, the 100,000 for the table inserts, and 200,000 for the ORL[2]_PK index, which I think might be explained as 100,000 as a current get that checks for “duplicate key” and 100,000 gets to do the actual insert.

Bottom Line, though – if you use reference partitioning every array insert seems to turn into single row processing with the attendant increase in buffer gets, undo and redo generated,  latch activity, and CPU used as Oracle checks for every single row which partition it should go into: and there doesn’t seem to be any optimisation that caters for “this row belongs in the same partition as the previous row”. You may decide that this extra cost due to reference partitioning is worth it for the benefits that reference partitioning supplies – it’s all down to what your application does, especially in terms of aging data perhaps – but it’s nice to know that this cost is there so that you can do a better cost/benefit analysis.

Footnote:

Interested readers might like to extend this test to a multi-layered set of ref-partitioned tables to see if the increase in overheads is linear or geometric.

 

June 14, 2017

Unpivot

Filed under: Execution plans,Oracle,Performance — Jonathan Lewis @ 3:46 pm BST Jun 14,2017

An interesting observation appeared recently as a side-channel on a question on the OTN database forum – how does Oracle execute an unpivot() operation. Here’s an example of such a query:

rem
rem     Script:         unpivot_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2017
rem

create table t1( id, col1, col2, col3, col4, col5, padding )
cache
pctfree 95 pctused 5
-- compress for query low
as
select
        1, 100 , 200 , 300 , 400 , 500,rpad('x',100)
from
        all_objects
where
        rownum <= 50000       -- > needed to bypass wordpress format glitch
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

select
        /*+ gather_plan_statistics find this */
        id, max(val) as high_val
from
        t1
unpivot include nulls (
        val for source in (col1, col2, col3, col4, col5)
)
group by id
order by id
; 

I’ve created a table with 50,000 rows (all_objects is quite large in 12.1.0.2 and 12.2.0.1), but with lots of free space per block so that I get three rows per block for a total of roughly 16,667 blocks which is going to make it fairly easy to spot any interesting session statistics. Then I’ve used an unpivot() call that has the effect of turning one row with five columns into five rows with one column.

Here’s the basic execution plan for the query (as pulled from memory):


-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        | 10695 (100)|      1 |00:00:00.18 |   16671 |       |       |          |
|   1 |  SORT GROUP BY       |      |      1 |      1 | 10695   (2)|      1 |00:00:00.18 |   16671 |  2048 |  2048 | 2048  (0)|
|   2 |   VIEW               |      |      1 |    250K| 10653   (2)|    250K|00:00:01.41 |   16671 |       |       |          |
|   3 |    UNPIVOT           |      |      1 |        |            |    250K|00:00:00.52 |   16671 |       |       |          |
|   4 |     TABLE ACCESS FULL| T1   |      1 |  50000 |  2131   (2)|  50000 |00:00:00.12 |   16671 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

According to the plan Oracle has done one full tablescan on the data (Starts = 1), but the unpivot operation has used this to produce 250,000 rows of output from 50,000 rows (A-Rows=) of input. Note, however, that the cost of the view operation is 5 times the cost of the tablescan but, on the other hand, the number of buffers visited is 16,671 (which matches the size of the table). So have we done 5 tablescans with a union all, or have we done one tablescan ?

The next step is to look at the 10053 (optimizer) trace file, specifically for the “unparsed query” which (despite the final plan table showing the plan we’ve just seen above) looked like this – after applying a few purely cosmetic changes:

Final query after transformations:******* UNPARSED QUERY IS *******
SELECT "from$_subquery$_002"."ID" "ID",MAX("from$_subquery$_002"."VAL") "HIGH_VAL"
FROM    (
                   (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL1' "SOURCE","T1"."COL1" "VAL" FROM "TEST_USER"."T1" "T1")
        UNION ALL  (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL2' "SOURCE","T1"."COL2" "VAL" FROM "TEST_USER"."T1" "T1")
        UNION ALL  (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL3' "SOURCE","T1"."COL3" "VAL" FROM "TEST_USER"."T1" "T1")
        UNION ALL  (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL4' "SOURCE","T1"."COL4" "VAL" FROM "TEST_USER"."T1" "T1")
        UNION ALL  (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL5' "SOURCE","T1"."COL5" "VAL" FROM "TEST_USER"."T1" "T1")
        ) "from$_subquery$_002"
GROUP BY "from$_subquery$_002"."ID"
ORDER BY "from$_subquery$_002"."ID"
;

And then there’s the outline (which I can see in the 10053 trace, or in the plan pulled from memory by a call to dbms_xplan.display_cursor()).


Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.1.0.2')
      DB_VERSION('12.1.0.2')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      OUTLINE_LEAF(@"SEL$2")
      OUTLINE_LEAF(@"SEL$3")
      OUTLINE_LEAF(@"SEL$4")
      OUTLINE_LEAF(@"SEL$5")
      OUTLINE_LEAF(@"SET$1")
      OUTLINE_LEAF(@"SEL$6")
      NO_ACCESS(@"SEL$6" "from$_subquery$_002"@"SEL$6")
      FULL(@"SEL$5" "T1"@"SEL$5")
      FULL(@"SEL$4" "T1"@"SEL$4")
      FULL(@"SEL$3" "T1"@"SEL$3")
      FULL(@"SEL$2" "T1"@"SEL$2")
      FULL(@"SEL$1" "T1"@"SEL$1")
      END_OUTLINE_DATA
  */

And if I also pull the alias (query block and fully qualified table name) information from memory:


Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$6
   2 - SET$1 / from$_subquery$_002@SEL$6
   3 - SET$1
   4 - SEL$5 / T1@SEL$5

Then there’s the alternative (though not necessarily correct) way to find an “unparsed” version of the query – dbms_sql2.expand_sql_text() in 11g, dbms_utility.expand_sql_text() in 12c:


variable m_sql_out clob

declare
        m_sql_in    clob :=
                'select /*+ gather_plan_statistics find this */ id, max(val) as high_val
                from    t1
                        unpivot include nulls ( val for source in (col1, col2, col3, col4, col5) )
                group by id
                order by id
                '
        ;
begin

--      dbms_sql2.expand_sql_text(        -- 11g
        dbms_utility.expand_sql_text(     -- 12c
                m_sql_in,
                :m_sql_out
        );

end;
/

set long 20000
print m_sql_out

M_SQL_OUT
--------------------------------------------------------------------------------
SELECT "A1"."ID" "ID",MAX("A1"."VAL") "HIGH_VAL" FROM  ( (SELECT "A3"."ID" "ID",
"A3"."PADDING" "PADDING",'COL1' "SOURCE","A3"."COL1" "VAL" FROM "TEST_USER"."T1"
 "A3") UNION ALL  (SELECT "A4"."ID" "ID","A4"."PADDING" "PADDING",'COL2' "SOURCE
","A4"."COL2" "VAL" FROM "TEST_USER"."T1" "A4") UNION ALL  (SELECT "A5"."ID" "ID
","A5"."PADDING" "PADDING",'COL3' "SOURCE","A5"."COL3" "VAL" FROM "TEST_USER"."T
1" "A5") UNION ALL  (SELECT "A6"."ID" "ID","A6"."PADDING" "PADDING",'COL4' "SOUR
CE","A6"."COL4" "VAL" FROM "TEST_USER"."T1" "A6") UNION ALL  (SELECT "A7"."ID" "
ID","A7"."PADDING" "PADDING",'COL5' "SOURCE","A7"."COL5" "VAL" FROM "TEST_USER".
"T1" "A7")) "A1" GROUP BY "A1"."ID" ORDER BY "A1"."ID"

So at some level the optimizer does think it’s doing 5 tablescans – and the t1 reported in the unpivot plan is from the fifth (i.e. last) copy of the table in the union all. But the execution statistics and the session statistics say I’ve done just one tablescan – and this persists even when I make the table so large that it (a) won’t fit the buffer cache, and/or (b) uses direct path reads, and/or (c) runs under Exadata with Hybrid columnar compression.

So what’s (probably) happening ?

I think Oracle has a code path that says it’s doing a union all of tablescans (at least for this example) but tells it that the union all is there as an expansion of an unpivot so (sticking with an example that does a tablescan into the buffer cache) Oracle reads the number of blocks dictated by the current multiblock read count into the cache, pins that one batch of blocks, scans the batch (or, possibly, each block in turn) 5 times (or as required), unpins the batch and then reads the next batch. So the session does five tablescans but does them in a way that lets you see only one tablescan in the statistics.

Footnote

There was a change in the results when I tested this on 12.2.0.1; the unparsed query reported only a two-part union all subquery, and the table alias information in the plan identified the referenced table as the copy of the table from the second (i.e. still last) subquery in the union all. More significantly the cost of the VIEW operation was a close match to the cost of a single tablescan, rather than being a multiple thereof:


select * from table(dbms_xplan.display_cursor('1k077bzp0t6mn',null,'outline alias cost 

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |  1117 (100)|      1 |00:00:00.13 |    8350 |   8334 |       |       |          |
|   1 |  SORT GROUP BY       |      |      1 |    250K|  1117   (6)|      1 |00:00:00.13 |    8350 |   8334 |  2048 |  2048 | 2048  (0)|
|   2 |   VIEW               |      |      1 |    250K|  1076   (2)|    250K|00:00:00.13 |    8350 |   8334 |       |       |          |
|   3 |    UNPIVOT           |      |      1 |        |            |    250K|00:00:00.09 |    8350 |   8334 |       |       |          |
|   4 |     TABLE ACCESS FULL| T1   |      1 |  50000 |  1074   (2)|  50000 |00:00:00.07 |    8350 |   8334 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$3
   2 - SET$1 / from$_subquery$_002@SEL$3
   3 - SET$1
   4 - SEL$2 / T1@SEL$2

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      DB_VERSION('12.2.0.1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      OUTLINE_LEAF(@"SEL$2")
      OUTLINE_LEAF(@"SET$1")
      OUTLINE_LEAF(@"SEL$3")
      NO_ACCESS(@"SEL$3" "from$_subquery$_002"@"SEL$3")
      FULL(@"SEL$2" "T1"@"SEL$2")
      END_OUTLINE_DATA
  */

M_SQL_OUT
--------------------------------------------------------------------------------
SELECT "A1"."ID" "ID",MAX("A1"."VAL") "HIGH_VAL" FROM  (SELECT "A3"."ID" "ID","A
3"."PADDING" "PADDING",'COL1' "SOURCE","A3"."COL1" "VAL" FROM "TEST_USER"."T1" "
A3" UNION ALL SELECT "A3"."ID" "ID","A3"."PADDING" "PADDING",'COL2' "SOURCE","A3
"."COL2" "VAL" FROM "TEST_USER"."T1" "A3" UNION ALL SELECT "A3"."ID" "ID","A3"."
PADDING" "PADDING",'COL3' "SOURCE","A3"."COL3" "VAL" FROM "TEST_USER"."T1" "A3"
UNION ALL SELECT "A3"."ID" "ID","A3"."PADDING" "PADDING",'COL4' "SOURCE","A3"."C
OL4" "VAL" FROM "TEST_USER"."T1" "A3" UNION ALL SELECT "A3"."ID" "ID","A3"."PADD
ING" "PADDING",'COL5' "SOURCE","A3"."COL5" "VAL" FROM "TEST_USER"."T1" "A3") "A1
" GROUP BY "A1"."ID" ORDER BY "A1"."ID"
allstats last'));

It used to be hard enough working out what the run-time engine was doing when the optimizer and the run-time engine claimed that they were working on the same (transformed version of the) SQL; but now we can get the impression that the optimizer is directing the run-time engine to execute a plan that couldn’t possibly match the (unparsed) query that the optimizer had been considering.

April 27, 2017

Quiz Night

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 5:29 pm BST Apr 27,2017

If this is the closing section of thetkprof output from the trace file of a single end-user session that has a performance problem, what’s the most obvious deduction you can make about the cause of the problem, and what sort of action would you take next ?


OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse      296      0.06       0.10          0        651          0           0
Execute    440      1.39       2.24          7       4664          0         146
Fetch      345     29.38      48.27          0    1709081          0         346
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     1081     30.84      50.63          7    1714396          0         492

Misses in library cache during parse: 5
Misses in library cache during execute: 7

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                     497        0.00          0.00
  SQL*Net message from client                   496       27.03         50.35
  direct path read                                1        0.00          0.00


OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse   128199      6.94      11.46          0       2740          0           0
Execute 2274845    371.25     605.60         30   10031162          0       68200
Fetch   2225314     10.94      18.17          5     879297          0      577755
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   4628358    389.14     635.23         35   10913199          0      645955

Misses in library cache during parse: 9701
Misses in library cache during execute: 134

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  direct path read                                1        0.00          0.00
  latch: shared pool                              3        0.00          0.00

12666  user  SQL statements in session.
  495  internal SQL statements in session.
13161  SQL statements in session.
********************************************************************************
Trace file: {concealed file name}.trc
Trace file compatibility: 11.1.0.7
Sort options: default

       1  session in tracefile.
   12666  user  SQL statements in trace file.
     495  internal SQL statements in trace file.
   13161  SQL statements in trace file.
    5266  unique SQL statements in trace file.
 39046623  lines in trace file.
     742  elapsed seconds in trace file.


There’s no absoutely right answer to the last question, by the way – there are a couple of variations in approach that are likely to get to the same result in the same time, and some of those variations might have made looking at the tail end data the 2nd or 3rd step in the sequence.

Update

One of the skills of using your time effectively when trouble-shooting is the way you balance the time you spend noting the little details that might be relevant, and the time you then spend following up each detail to check for relevance and corroborating evidence. The amateurs might throw themselves into in-depth research on the first detail they notice; or if they’ve started with a short list of observations simply spend too much time on the first before moving on to the second.

An exercise like looking at this tail-end this tkprof file and talking about what you spot and what you might do is just a little exercise in how to pace yourself as you tackle a problem. So, from my perspective, here a couple of obvious starting points:

  • I said it was an end-user session complaining about performance – the last line of the file tells use that the elapsed time was 742 seconds, and the summary of recursive statement tells us there were 2.2 million executions. How can ANYTHING an end-user wants to do “quickly” require 3,000 executions per second of 12 minutes? I can ask that question because I know that “recursive” executions aren’t necessarily “sys-recursive”, SQL statements executed inside a PL/SQL block are also recursive – on top of that I can see, anyway, that there are 12,466 (probably all different) USER statements in the file – the user (or the user’s code) is doing something it almost certainly shouldn’t be doing. Those 2.2M executions are responsible for 10M buffer visits – does that suggest a lot of single row processing ?
  • Associated with the 13,161 statements in the trace file there are 9,700 misses in the library cache during parse – that means “hard” parsing, probably means the user is getting through about 800 “new” statements per minute – but there are only 5,266 unique statements so part of the problem is that some recently used statements (or child cursors, at least) are being flushed from the library cache: again that suggests that they’re doing something wrong.
  • As one of the commentators pointed out – a detail that I hadn’t noted initially – the number of “current” buffers is zero. This isn’t a mini-batch updating the database, it’s just a report (done badly).

What would I do next ? If it wasn’t already immediately obvious I’d look for the statements that were responible for the very large numbers of executions; a call to tkprof with sort=execnt as a parameter would push the high execution counts to the top of the file.  (The worst one had over 900,000 executions, the next 600,000). Then I’d do a grep, sed, sort with uniq -c to find the statement which (I’d guess – and there were 3 of them responsible for a few thousand variants each) were the generated texts using concatenation instead of bind variables. Then I’d go and find the owner of the code and sit down with them to work out how it should be re-engineered.

 

March 7, 2017

Guesswork

Filed under: Infrastructure,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 7:57 pm BST Mar 7,2017

A recent posting on the OTN database forum described a problem with an insert (as select) statement that sometimes ran extremely slowly: nothing interesting yet, there could be plenty of boring reasons for that to happen. The same SQL statement (by SQL_ID) might take 6 hours to insert 300K rows one night while taking just a few minutes to insert 900K another night (still nothing terribly interesting).

An analysis of the ASH data about the statement showed that the problem was on the “LOAD TABLE CONVENTIONAL” operation – which starts to get interesting if you also assume that someone who was competent to look at ASH would probably have noticed whether or not the time spent was on (the first obvious guess) some variant of “TX enqueue” waits or “log file” waits or something amazingly different. It’s also a little more interesting if you’ve noticed that the title of the posting is about “consuming a lot of CPU time” – so where could that be going if we see most of the excess time going on the insert rather than on the select.

Here’s an enormous clue that this might be a “non-standard” problem – the SQL statement starts like this:


INSERT  /*+ ignore_row_on_dupkey_index(tgt OTC_DAT_TV_PROC_STATUS_UK) */
INTO OTC_DAT_TV_PROC_STATUS TGT (
    {list of columns}
)
SELECT  /*+ parallel(8) */
    ...

See the hint which says “ignore rows if they raise duplicate key errors against index otc_dat_tv_proc_status_uk” ? Think what it might take to implement code that obeys the hint – Oracle can’t know that a row is duplicating a value until it has inserted the row (to get a rowid) then tried to maintain the index and discovered the pre-existing value, at which point it raises an exception, then handles it by undoing the single row insert (I wonder what that does to array inserts, and the logic of triggers), then carries on with the next row. (I suppose the code could check every relevant index before doing the insert – but that would mean a very big, and possibly very resource-intensive, change to all existing “insert a row” code.)

Surely we’re going to see some dramatic effects if a large fraction of our rows result in duplicate values.

Warning – the guess might not be right but it’s much more interesting than all the other guesses you might make at this point and a good excuse for doing a bit of experimentation and learning. So here’s a simple model to test the “excess work” hypothesis – running against 11.2.0.4:

rem
rem     Script:         ignore_dupkey.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2017
rem

drop table t2;
drop table t1;

create table t1
nologging
as
select  *
from    all_objects
where   rownum <= 50000 ; -- > comment to protect WordPress formatting

create table t2
nologging
as
select  *
from    all_objects
where   rownum <= 50000 ; -- > comment to protect WordPress formatting

update t1 set object_id = object_id + (select max(object_id) from t1)
;

commit;

create unique index t1_i1 on t1(object_id);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

All I’ve done is create two copies of the same data – with an update in place that changes one set of data so that the object_id doesn’t overlap with the object_id in the other set. The update statement is something that I’m going to include in one test but exclude from the second. Now for the statement that tests for the effects of the hint:


execute snap_events.start_snap
execute snap_my_stats.start_snap

insert
        /*+ ignore_row_on_dupkey_index(t1 (object_id)) */
        into t1
select  *
from    t2
;

execute snap_my_stats.end_snap
execute snap_events.end_snap


There are several different things I could do once I’ve got this basic test set up if I want to refine what I’m testing and isolate certain aspects of the mechanism, but all I’ll do for this note is a simple report of some figures from the two tests – one where the object_id values don’t overlap and one where t1 and t2 are exact copies of each other. The two procedures wrapping the statement are just my standard routines for capturing changes in v$mystat and v$session_event for my session.

For the insert where I’ve done the update to avoid any duplicates appearing the insert completed in about 0.3 seconds, generating 10MB of redo and 2MB of undo.

When I removed the update statement the (continuously failing) insert took 35.5 seconds to complete, of which almost all the time was CPU time. The redo jumped to 478MB with 14MB of undo. The extreme level of redo surprised me slightly especially since the scale of the change was so much greater than that of the undo – I think it may be due to a problem with Oracle needing to unwind one row from an (internal) array insert before retrying. Here, taken from the session stats of the problem run, is a little indication of why the time (and especially the CPU time) increased so much:

Name                                                                     Value
----                                                                     -----
rollback changes - undo records applied                                 95,014
active txn count during cleanout                                        24,627
cleanout - number of ktugct calls                                       24,627
HSC Heap Segment Block Changes                                          97,509
Heap Segment Array Inserts                                              97,509
recursive calls                                                        682,574
recursive cpu usage                                                      2,193
session logical reads                                                1,341,077

Obviously there’s a lot of work done rolling back changes that should not have been made (though why it’s reported as 95,000 rather than 100,000 I don’t know and I’d rather not do a trace of buffer activity to find out) with an associated extra load of data block activity. Most visible, though, is the huge number of recursive calls with, perhaps associated, a very large number of session logical reads. Clearly it’s worth enabling extended tracing to see what’s going on – if you haven’t already guessed what some of those calls are about. Here’s an extract from the top of an appropriate tkprof output:


tkprof test_ora_32298.trc ignore_dupkey sort=execnt

select /*+ rule */ c.name, u.name
from
 con$ c, cdef$ cd, user$ u  where c.con# = cd.con# and cd.enabled = :1 and
  c.owner# = u.user#


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    48753      1.86       2.08          0          0          0           0
Execute  48753      3.75       4.17          0          0          0           0
Fetch    48753      2.86       3.12          0      97506          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   146259      8.47       9.38          0      97506          0           0

select o.name, u.name
from
 obj$ o, user$ u  where o.obj# = :1 and o.owner# = u.user#

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    48753      1.87       1.99          0          0          0           0
Execute  48753      3.60       3.63          0          0          0           0
Fetch    48753      7.35       7.52          0     243765          0       48753
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   146259     12.84      13.14          0     243765          0       48753

The top two statements in the trace file show Oracle first trying to find the name of the constraint that has been breached, then falling back to searching for the name of the unique index that has caused the error to appear. If I had created a unique constraint rather than just a unique index then the second of these two statement would not have appeared in the trace file (and the run would have been a little quicker – hint: constraints are a good thing).

You’ll notice that the total number of calls from the two statement is roughly 292,500 – far short of the 682,000 reported in the session stats. Unfortunately there was nothing else in the trace files that could be blamed for the outstanding 400,000 missing calls. It’s not really necessary to chase up all the details, though; clearly we can see that this feature is very expensive if lots of duplicates appear – like DML error logging it has probably been created as a way of dealing with occasional errors when handling large volumes of data.

Footnote

Notice that my example uses the “index description” method for specifying the index in the hint; it’s legal with either the name or the description. Interestingly (but, perhaps, unsurprisingly) this is a hint that Oracle has to parse for correctness before optimisation. I made a mistake in my first attempt at writing the update statement leaving me with duplicates in the data so Oracle couldn’t create the unique index; as a consequence of the missing unique index the insert statement reported the Oracle error “ORA-38913: Index specified in the index hint is invalid”.

 

February 13, 2017

Band Join 12c

Filed under: 12c,Execution plans,Oracle,Performance,Upgrades — Jonathan Lewis @ 1:53 pm BST Feb 13,2017

One of the optimizer enhancements that appeared in 12.2 for SQL is the “band join”. that makes certain types of merge join much more  efficient.  Consider the following query (I’ll supply the SQL to create the demonstration at the end of the posting) which joins two tables of 10,000 rows each using a “between” predicate on a column which (just to make it easy to understand the size of the result set)  happens to be unique with sequential values though there’s no index or constraint in place:

select
        t1.v1, t2.v1
from
        t1, t2
where
        t2.id between t1.id - 1
                  and t1.id + 2
;

This query returns nearly 40,000 rows. Except for the values at the extreme ends of the range each of the 10,000 rows in t2 will join to 4 rows in t1 thanks to the simple sequential nature of the data. In 12.2 the query, with rowsource execution stats enabled, completed in 1.48 seconds. In 12.1.0.2 the query, with rowsource execution stats OFF, took a little over 14 seconds. (With rowsource execution stats enabled it took 12.1.0.2 a little over 1 minute to return the first 5% of the data – I didn’t bother to wait for the rest, though the rate would have improved over time.)

Here are the two execution plans – spot the critical difference:


12.1.0.2
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    25M|   715M|  1058  (96)| 00:00:01 |
|   1 |  MERGE JOIN          |      |    25M|   715M|  1058  (96)| 00:00:01 |
|   2 |   SORT JOIN          |      | 10000 |   146K|    29  (11)| 00:00:01 |
|   3 |    TABLE ACCESS FULL | T1   | 10000 |   146K|    27   (4)| 00:00:01 |
|*  4 |   FILTER             |      |       |       |            |          |
|*  5 |    SORT JOIN         |      | 10000 |   146K|    29  (11)| 00:00:01 |
|   6 |     TABLE ACCESS FULL| T2   | 10000 |   146K|    27   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("T2"."ID"<="T1"."ID"+2)   -- > had to add GT here to stop WordPress spoiling the format 
   5 - access("T2"."ID">="T1"."ID"-1)
       filter("T2"."ID">="T1"."ID"-1)

12.2.0.1
----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      | 40000 |  1171K|    54  (12)| 00:00:01 |
|   1 |  MERGE JOIN         |      | 40000 |  1171K|    54  (12)| 00:00:01 |
|   2 |   SORT JOIN         |      | 10000 |   146K|    27  (12)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T1   | 10000 |   146K|    25   (4)| 00:00:01 |
|*  4 |   SORT JOIN         |      | 10000 |   146K|    27  (12)| 00:00:01 |
|   5 |    TABLE ACCESS FULL| T2   | 10000 |   146K|    25   (4)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T2"."ID">="T1"."ID"-1)
       filter("T2"."ID"<="T1"."ID"+2 AND "T2"."ID">="T1"."ID"-1)

Notice how operation 4, the FILTER, that appeared in 12.1 has disappeared in 12.2 and the filter predicate that it used to hold is now part of the filter predicate of the SORT JOIN that has been promoted to operation 4 in the new plan.

As a reminder – the MERGE JOIN operates as follows: for each row returned by the SORT JOIN at operation 2 it calls operation 4. In 12.1 this example will then call operation 5 so the SORT JOIN there happens 10,000 times. It’s important to know, though, that the name of the operation is misleading; what’s really happening is that Oracle is “probing a sorted result set in local memory” 10,000 times – it’s only on the first probe that it finds it has to call operation 6 to read and move the data into local memory in sorted order.

So in 12.1 operation 5 probes (accesses) the in-memory data set starting at the point where t2.id >= t1.id – 1; I believe there’s an optimisation here because Oracle will recall where it started the probe last time and resume searching from that point; having found the first point in the in-memory set where the access predicate it true Oracle will walk through the list passing each row back to the FILTER operation as long as the access predicate is still true, and it will be true right up until the end of the list. As each row arrives at the FILTER operation Oracle checks to see if the filter predicate there is true and passes the row up to the MERGE JOIN operation if it is. We know that on each cycle the FILTER operation will start returning false after receiving 4 rows from SORT JOIN operation – Oracle doesn’t.  On average the SORT JOIN operation will send 5,000 rows to the FILTER operation (for a total of 50,000,000 values passed and discarded).

In 12.2, and for the special case here where the join predicate uses constants to define the range, Oracle has re-engineered the code to eliminate the FILTER operation and to test both parts of the between clause in the same subroutine it uses to probe and scan the rowsource. In 12.2 the SORT JOIN operation will pass 4 rows up to the MERGE JOIN operation and stop scanning on the fifth row it reaches. In my examples that’s an enormous (CPU) saving in subroutine calls and redundant tests.

Footnote:

This “band-join” mechanism only applies when the range is defined by constants (whether literal or bind variable). It doesn’t work with predicates like (e.g.):

where t2.id between t1.id - t1.step_back and t1.id + t1.step_forward

The astonishing difference in performance due to enabling rowsource execution statistics is basically due to the number of subroutine calls eliminated – I believe (subject to a hidden parameter that controls a “sampling frequency”) that Oracle will call the O/S clock twice each time it calls the second SORT JOIN operation from the FILTER operation to acquire the next row. In 12.1 we’re doing roughly 50M redundant calls to that SORT JOIN.

The dramatic difference in performance even when rowsource execution statistics isn’t enabled is probably something you won’t see very often in a production system – after all, I engineered a fairly extreme data set and query for the purposes of demonstration. Note, however, the band join does seemt to introduce a change in cost, so it’s possible that on the upgrade you may find a few cases where the optimizer will switch from a nested loop join to a merge join using a band-join.

February 9, 2017

Index bouncy scan

Filed under: Oracle,Performance,Tuning — Jonathan Lewis @ 1:05 pm BST Feb 9,2017

There’s a thread running on OTN at present about deleting huge volumes of duplicated data from a table (to reduce it from 1.1 billion to about 22 million rows). The thread isn’t what I’m going to talk about, though, other than quoting some numbers from it to explain what this post is about.

An overview of the requirement suggests that a file of about 2.2 million rows is loaded into the table every week with (historically) no attempt to delete duplicates. As a file is loaded into the table every row gets the same timestamp, which is the sysdate at load time. I thought it would be useful to know how many different timestamps there were in the whole table.  (From an averaging viewpoint, 1.1 billion rows at 2.2 million rows per week suggests about 500 dates/files/weeks – or about 9.5 years – but since the table relates to “customer accounts” it seems likely that the file was originally smaller and has grown over time, which means the history may be rather longer than that.)

Conveniently there is an index on the “input_user_date” column in the table so we might feel happy running a query that simply does:


select
        distinct input_user_date
from
        customer_account
order by
        input_user_date
;

We might then refine the query to do a count(*) aggregate, or do some analytics to find any strange gaps in the timing of the weekly loads. However, all I’m really interested in is the number of dates because I’ve suggested we could de-duplicate the data by running a PL/SQL process that does a simple job for each date in turn, and I want to get an idea of how many times that job will run so that I can estimate how long the entire process might take.

The trouble with the basic query is that the table is (as you probably noticed) rather large, and so is the index. If we assume 8 bytes (which includes the length byte) for a date, 7 bytes for the rowid, 4 bytes overhead, and 100% packing we get about 420 index entries per leaf blocks, so with 1.1 billion entries the index is about 2.6 million leaf blocks. If the index had been built with compression (which means you’d only be recording a date once per leaf block) it would still be about 1.6 million leaf blocks. Fortunately we wouldn’t have to do much “real” sorting to report just a list of distinct values, or even the count(*) for each date, if we made Oracle use an index full scan – but it’s still a lot of work to read 1.6 million blocks (possibly using single block reads) and do even something as simple as a running count as you go. So I whipped up a quick and dirty bit of PL/SQL to do the job.

declare
        m_d1 date := to_date('01-Jan-0001');
        m_d2 date := to_date('01-Jan-0001');
        m_ct number := 0;
begin
        loop
                select
                        min(input_user_date)
                into
                        m_d2
                from
                        customer_account
                where
                        input_user_date > m_d1
                ;

                exit when m_d2 is null;

                m_ct := m_ct + 1;
                dbms_output.put_line('Count: ' || m_ct || '  Date: ' || m_d2);
                m_d1 := m_d2;

        end loop;
end;
/

The code assumes that the input_user_date hasn’t gone back to a silly date in the past to represent a “null date” (which shouldn’t exist anyway; if you want to use code like this but have a problem with a special “low-value” then you would probably be safest adding a prequel SQL that selects the min(columnX) where columnX is not null to get the starting value instead of using the a constant as I have done.

The execution path for the SQL statement should be an index-only: “index range scan (min/max)” which typically requires only 3 or 4 logical I/Os to find the relevant item for each date (which compares well with the estimated 2,200,000 / 420 = 5,238 leaf blocks we would otherwise have to scan through for each date). Here’s the path you should see:


--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |       |       |     3 (100)|          |
|   1 |  SORT AGGREGATE              |       |     1 |     8 |            |          |
|   2 |   FIRST ROW                  |       |     1 |     8 |     3   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN (MIN/MAX)| CA_I1 |     1 |     8 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("INPUT_USER_DATE">:B1)

I did build a little data set as a proof of concept – and produced a wonderful example of how the scale and the preceding events makes a difference that requires you to look very closely at what has happened. I used a table t1 in my example with a column d1, but apart from the change in names the PL/SQL block was as above.Here’s the code I used to create the data and prepare for the test:


create table t1 nologging
as
select
        trunc(sysdate) + trunc((rownum - 1)/100) d1,
        rpad('x',100)   padding
from
        all_objects
where
        rownum <= 50000
;

execute dbms_stats.gather_table_stats(user,'t1')
alter table t1 modify d1 not null;

create index t1_i1 on t1(d1) nologging pctfree 95
;

select index_name, leaf_blocks from user_indexes;

alter system flush buffer_cache;

alter session set events '10046 trace name context forever, level 8';

My data set has 500 dates with 100 rows per date, and the pctfree setting for the index gives me an average of about 8 leaf blocks per date (for a total of 4,167 leaf blocks). It’s only a small index so I’m expecting to see just 2 or 3 LIOs per date, and a total of about 500 physical reads (one per date plus a handful for reading branch blocks). Here’s the output from the running tkprof against the trace file:


SELECT MIN(D1)
FROM
 T1 WHERE D1 > :B1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute    501      0.00       0.01          0          0          0           0
Fetch      501      0.08       0.18       4093       1669          0         501
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     1003      0.09       0.19       4093       1669          0         501

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=3 pr=64 pw=0 time=9131 us)
         1          1          1   FIRST ROW  (cr=3 pr=64 pw=0 time=9106 us cost=3 size=8 card=1)
         1          1          1    INDEX RANGE SCAN (MIN/MAX) T1_I1 (cr=3 pr=64 pw=0 time=9089 us cost=3 size=8 card=1)(object id 252520)

I’ve done a physical read of virtually every single block in the index; but I have done only 3 buffer gets per date – doing fewer buffer gets than physical reads.

I’ve been caught by two optimisations (which turned out to be “pessimisations” in my test): I’ve flushed the buffer cache, so the Oracle runtime engine has decided to consider “warming up” the cache by reading extra blocks from any popular-looking objects that I’m accessing, and the optimizer may have given the run-time engine enough information to allow it to recognise that this index is subject to range scans and could therefore be a suitable object to use while warming up. As you can see from the following extracts from session events and session activity stats – we’ve done a load of multiblock reads through the index.


Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
db file sequential read                               1           0           0.03        .031           6
db file scattered read                              136           0          13.54        .100           1

Name                                                                     Value
----                                                                     -----
physical reads                                                           4,095
physical reads cache                                                     4,095
physical read IO requests                                                  137
physical reads cache prefetch                                            3,958
physical reads prefetch warmup                                           3,958

This isn’t likely to happen, of course, in the production system where we’ll be starting with a fully loaded cache and the leaf blocks we need are (logically) spaced apart by several thousand intervening blocks.

Footnote

I can’t remember who first brought this strategy to my attention – though I’m fairly sure it was one of my Russian colleagues [update: Sayan Malakshinov – see comments below], who has blogged about ways to work around what is effectively a limitation of the “index skip scan”. Apologies to the originator; and if you recognise your work here please add a comment with URL below.

November 22, 2016

Delete/Insert

Filed under: Oracle,Performance,Tuning — Jonathan Lewis @ 12:59 pm BST Nov 22,2016

Many of the questions that appear on OTN are deceptively simple until you start thinking carefully about the implications; one such showed up a little while ago:

What i want to do is to delete rows from table where it matches condition upper(CATEGORY_DESCRIPTION) like ‘%BOOK%’.

At the same time i want these rows to be inserted into other table.

The first problem is this: how carefully does the requirement need to be stated before you can decide how to address it? Trying to imagine awkward scenarios, or boundary conditions, can help to clarify the issue.

If you delete before you insert, how do you find the data to insert ?

If you insert before you delete, what happens if someone updates a row you’ve copied so that it no longer matches the condition. Would it matter if the update changes the row in a way that leaves it matching the condition (what you’ve inserted is not totally consistent with what you’ve deleted).

If you insert before you delete, and someone executes some DML that makes another row match the requirement should you delete it (how do you avoid deleting it) or leave it in place.

Once you start kicking the problem about you’ll probably come to the conclusion that the requirement is for the delete and insert to be self-consistent – in other words what you delete has to be an exact match for what you insert as at the time you inserted it. You’ll ignore rows that come into scope in mid-process due to other activity, and you’ll have to stop people changing rows that are being transferred (in case there’s an audit trail that subsequently says that there was, at some point in time, a row that matched the condition but never arrived – and a row that has arrived that didn’t match the final condition of the rows that disappeared).

Somehow your code needs to lock the set of rows to be transferred and then transfer those rows and eliminate them. There are two “obvious” and simple strategies – readers are invited to propose others (or criticise the two I – or any of the comments – suggest). I’ll start with a simple data setup for testing:


create table t1
as
select  object_id, object_name, owner
from    all_objects
;

alter table t1 add constraint t1_pk primary key(object_id);

create table t2
as
select  * from t1
where   rownum = 0
;

execute dbms_stats.gather_table_stats(user,'t1')
execute dbms_stats.gather_table_stats(user,'t2')

Option 1:

The simplest approach is often the best – until, perhaps, you spot the drawbacks – do a basic delete of the data to be transferred (which handles the locking) but wrap the statement in a PL/SQL block that captures the data (using the returning clause) and then inserts it into the target table as efficiently as possible. With thanks to Andrew Sayer who prompted this blog post:

declare
        type t1_rows is table of t1%rowtype;
        t1_deleted t1_rows;

begin
        delete from t1 where owner = 'SYSTEM'
        returning object_id, object_name, owner bulk collect into t1_deleted;

        forall i in 1..t1_deleted.count
                insert into t2 values t1_deleted(i);

        commit;
end;
/

The drawback to this, of course, is that if the volume to be transferred is large (where “large” is probably a fairly subjective measure) then you might not want to risk the volume of memory (PGA) it takes to gather all the data with the bulk collect.

Option 2:

For large volumes of data we could reduce the threat to the PGA by gathering only the rowids of the rows to be transferred (locking the rows as we do so) then do the insert and delete based on the rowids:

declare
        type rid_type is table of rowid;
        r rid_type;

        cursor c1 is select rowid from t1 where owner = 'SYSTEM' for update;

begin
        open c1;
        fetch c1 bulk collect into r;
        close c1;

        forall i in 1..r.count
                insert into t2 select * from t1 where rowid = r(i);

        forall i in 1..r.count
                delete from t1 where rowid = r(i);

        commit;
end;
/

Note, particularly, the “for update” in the driving select.

Inevitably there is a drawback to this strategy as well (on top of the threat that the requirement for memory might still be very large even when the return set is restricted to just rowids). We visit the source data (possibly through a convenient index and avoid visiting the table, [silly error deleted – see comment 8]) to collect rowids; then we visit the data again by rowid (which is usually quite efficient) to copy it, then we visit it again (by rowid) to delete it. That’s potentially a significant increase in buffer cache activity (especially latching) over the simple “delete returning” strategy; moreover the first strategy gives Oracle the option to use the index-driven optimisation for maintaining indexes and this method doesn’t. You might note, by the way, that you could include an “order by rowid” clause on the select; depending on your data distribution and indexes this might reduce the volume of random I/O you have to do as Oracle re-visits the table for the inserts and deletes.

We can address the PGA threat, of course, by fetching the rowids with a limit:


declare
        type rid_type is table of rowid;
        r rid_type;

        cursor c1 is select rowid from t1 where owner = 'SYSTEM' for update;

begin
        open c1;

--      dbms_lock.sleep(60);

        loop
                fetch c1 bulk collect into r limit 5;

                forall i in 1..r.count
                        insert into t2 select * from t1 where rowid = r(i);

                forall i in 1..r.count
                        delete from t1 where rowid = r(i);

                exit when r.count != 5;
        end loop;

        close c1; 

        commit;
end;
/

One thing to be aware of is that even though we fetch the rowids in small batches we lock  all the relevant rows when we open the cursor, so we don’t run into the problem of inserting thousands of rows into t2 and then finding that the next batch we select from t1 has been changed or deleted by another session. (The commented out call to dbms_lock.sleep() was something I included as a way of checking that this claim was true.) This doesn’t stop us running into a locking (or deadlocking) problem, of course; if it takes us 10 seconds to lock 1M rows in our select for update another user might manage to lock our millionth row before we get there; if, a few seconds later, it then gets stuck in a TX/6 wait trying to lock one of our locked rows after we start waiting in a TX/6 wait for our millionth row our session will time out after 3 further seconds with an ORA-00060 deadlock error.

The limit of 5 is just for demonstration purposes, of course – there were 9 rows in all_objects that matched the select predicate; in a production system I’d probably raise the limit as high as 255 (which seems to be the limit of Oracle’s internal array-processing).

You’ll notice, of course, that we can’t use this limited fetch approach with the delete command – the entire delete would take place as we opened the equivalent cursor and, though we can use the bulk collect with the returning clause, there is no syntax that allows something like the fetch with limit to take place.

Discarded option

My first thought was to play around with the AS OF SCN clause.  Select the current SCN from v$database and then do things like delete “as of scn”, or “select for update as of scn” – there were ways of getting close, but invariably I ended up running into Oracle error: “ORA-08187: snapshot expression not allowed here”. But maybe someone else can come up with a way of doing this that doesn’t add significant overheads and doesn’t allow for inconsistent results.

Update:

Following a comment from SydOracle1 below I’ve revisited the “as of SCN” approach and discovered that I had fooled myself into discarding the option too quickly. The results are in a follow-up blog.

October 24, 2016

Anniversary OICA

Filed under: CBO,Oracle,Performance,Statistics,Troubleshooting — Jonathan Lewis @ 1:00 pm BST Oct 24,2016

Happy anniversary to me!

On this day 10 years ago I published the first article in my blog. It was about the parameter optimizer_index_cost_adj (hence OICA), a parameter that has been a  source of many performance problems and baffled DBAs over the years and, if you read my first blog posting and follow the links, a parameter that should almost certainly be left untouched.

It seems appropriate to mention it today because I recently found a blog posting (dated 3rd May 2013) on the official Oracle Blogs where the director for Primavera advises setting this parameter to 1 (and the optimizer_index_caching parameter to 90) for the Primavera P6 OLTP (PMDB) database. The recommendation is followed by a fairly typical “don’t blame me” warning, viz: “As with any changes that affect query optimization, it is paramount to TEST, TEST and TEST again. At least these settings are easily adjusted or change back to the original value”.

Here’s a thought, though: setting the optimizer_index_cost_adj to the extreme value 1 is a catastrophic change so don’t suggest it unless you are extremely confident that it’s almost certain to be the right thing to do. If you’re confident that it’s a good idea to reduce the parameter to a much smaller value than the default then suggest a range of values that varies from “ideal if it works, but high risk” to “low risk and mostly helpful”. Maybe a suggestion like: “Primavera P6 OLTP (PMDB) tends to work best with this parameter set to a value in the range of 1 to 15” would be a more appropriate comment from someone in a position of authority.

Here’s another thought: if you work for Oracle you could always contact the optimizer group to present them with your argument for the strategy and see what they think about it. Then you can include their opinion when you offer your suggestion.

For what it’s worth, here’s my opinion: as a general rule you shouldn’t be working around performance issues by fiddling with the optimizer_index_cost_adj; as a specific directive do not set it to 1. If you want to encourage Oracle to be enthusiastic about indexes in general then adjust the system statistics (preferably with a degree of truth). If you need to persuade Oracle that particular indexes are highly desirable than you can use dbms_stats.set_index_stats() to adjust the clustering_factor (and avg_data_blocks_per_key) of those indexes. If you are running 11.2.0.4 or later then you can use dbms_stats.set_table_prefs() to set the “table_cached_blocks” parameter for tables where you think Oracle should be particularly keen on using indexes but isn’t; and if your queries are suffering from bad cardinality estimates because of a pattern of multi-column filter predicates create some column group (extended) statistics.

Why am I so firmly set against setting the optimizer_index_cost_adj to 1 ? Because it doesn’t tell Oracle to “use indexes instead of doing tablescans”, it tells Oracle that every index is just about as good as every other index for almost any query. Here’s a pdf file of an article (formerly published on DBAZine and then on my old website) I wrote over twelve years ago explaining the issue. Various links in the article no longer work, and the data pattern was generated to display the problem in 8i and 9i and you would need to modify the data to display the same effect in newer versions of Oracle – but the principle remains the same.

If you would like to see a slightly newer example of how the parameter causes problems. Here’s a thread dated April 2012 from the OTN database forum where a SYS-recursive query caused a performance problem because the parameter was set 1.

 

October 6, 2016

My session workload

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 1:19 pm BST Oct 6,2016

My old website (www.jlcomp.demon.co.uk) will be disappearing in a couple of weeks – but there are a couple of timeless articles on it that are worth saving and although the popularity of this one has probably been surpassed by Tanel Poder’s Snapper script, or other offerings by Tom Kyte or Adrian Billington, it’s still one of those useful little things to have around – it’s a package to takes a snapshot of your session stats.

The package depends on a view created in the SYS schema, and the package itself has to be installed in the SYS schema – which is why other strategies for collecting the information have become more popular; but if you want to have it handy, here are the two scripts:

rem
rem     Script:         c_mystats.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2001
rem     Purpose:        Put names to v$mystat
rem
rem     Last tested
rem             12.1.0.2        -- naming issue
rem             11.2.0.4
rem             10.2.0.5
rem             10.1.0.4
rem              9.2.0.8
rem              8.1.7.4
rem
rem     Notes:
rem     Should be run by SYS - which means it has to be re-run
rem     on a full database export/import
rem
rem     It looks as if it is illegal to create a view with a
rem     name starting with v$ in the sys account as from 12c.
rem     (ORA-00999: invalid view name). Hence the JV$ name.
rem
rem     But you can create a public synonym starting "v$"
rem

create or replace view jv$my_stats
as
select
        /*+
                first_rows
                ordered
        */
        ms.sid,
        sn.statistic#,
        sn.name,
        sn.class,
        ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
;

drop public synonym v$my_stats;
create public synonym v$my_stats for jv$my_stats;
grant select on v$my_stats to public;

rem
rem	Script:		snap_myst.sql
rem	Author:		Jonathan Lewis
rem	Dated:		March 2001
rem	Purpose:	Package to get snapshot start and delta of v$mystat
rem
rem	Last tested
rem		12.1.0.2
rem		11.2.0.4
rem		10.2.0.5
rem		10.1.0.4
rem		 9.2.0.8
rem		 8.1.7.4
rem
rem	Notes
rem	Has to be run by SYS to create the package
rem	Depends on view (j)v$my_stats (see c_mystats.sql)
rem
rem	Usage:
rem		set serveroutput on size 1000000 format wrapped
rem		set linesize 120
rem		set trimspool on
rem		execute snap_my_stats.start_snap
rem		-- do something
rem		execute snap_my_stats.end_snap
rem

create or replace package snap_my_stats as
	procedure start_snap;
	procedure end_snap (i_limit in number default 0);
end;
/

create or replace package body snap_my_stats as

cursor c1 is
	select 
		statistic#, 
		name,
		value
	from 
		v$my_stats
	where
		value != 0
	;


	type w_type is table of c1%rowtype index by binary_integer;
	w_list		w_type;
	empty_list	w_type;

	m_start_time	date;
	m_start_flag	char(1);
	m_end_time	date;

procedure start_snap is
begin

	m_start_time := sysdate;
	m_start_flag := 'U';
	w_list := empty_list;

	for r in c1 loop
		w_list(r.statistic#).value := r.value;
	end loop;

end start_snap;


procedure end_snap (i_limit in number default 0) 
is
begin

	m_end_time := sysdate;

	dbms_output.put_line('---------------------------------');

	dbms_output.put_line('Session stats - ' ||
				to_char(m_end_time,'dd-Mon hh24:mi:ss')
	);

	if m_start_flag = 'U' then
		dbms_output.put_line('Interval:-  '  || 
				trunc(86400 * (m_end_time - m_start_time)) ||
				' seconds'
		);
	else
		dbms_output.put_line('Since Startup:- ' || 
				to_char(m_start_time,'dd-Mon hh24:mi:ss')
		);
	end if;

	if (i_limit != 0) then
		dbms_output.put_line('Lower limit:-  '  || i_limit);
	end if;

	dbms_output.put_line('---------------------------------');

	dbms_output.put_line(
		rpad('Name',60) ||
		lpad('Value',18)
	);

	dbms_output.put_line(
		rpad('----',60) ||
		lpad('-----',18)
	);

	for r in c1 loop
		if (not w_list.exists(r.statistic#)) then
		    w_list(r.statistic#).value := 0;
		end if;

		if (
		       (r.value > w_list(r.statistic#).value + i_limit)
		) then
			dbms_output.put(rpad(r.name,60));
			dbms_output.put(to_char(
				r.value - w_list(r.statistic#).value,
					'9,999,999,999,990')
			);
			dbms_output.new_line;
		end if;
	end loop;

end end_snap;

begin
	select
		logon_time, 'S'
	into
		m_start_time, m_start_flag
	from
		v$session
	where
		sid = 	(
				select /*+ no_unnest */ sid 
				from v$mystat 
				where rownum = 1
			);

end snap_my_stats;
/

drop public synonym snap_my_stats;
create public synonym snap_my_stats for snap_my_stats;
grant execute on snap_my_stats to public;

One point to be cautious about with this package: do not embed it inside anonymous pl/sql blocks, e.g.

begin

        snap_my_stats.start_snap;

        -- some other code

        snap_my_stats.end_snap;

end;
/

There are some statistics in v$my_stats (v$mystat / v$sesstat) which are not updated until the end of a database call – and calling an anonymous pl/sql block counts as a single database call, so some of your statistics (for example “CPU used by this session”) will report misleading values.

Next Page »

Powered by WordPress.com.