Oracle Scratchpad

February 3, 2016

Hinting

Filed under: Hints,Performance,Oracle,Troubleshooting — Jonathan Lewis @ 1:04 pm GMT Feb 3,2016

This is just a little example of thinking about hinting for short-term hacking requirements. It’s the answer to a question that came up on the Oracle-L listserver  a couple of months ago (Oct 2015) and is a convenient demonstration of a principle that can often (not ALWAYS) be applied as a response to the problem: “I can make this query work quickly once, how do I make it work quickly when I make it part of a join ?”

The question starts with this query, which returns “immediately” for any one segment:


SELECT DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM dba_extents DE
WHERE 1=1
  AND DE.OWNER           = <owner>
  AND DE.SEGMENT_NAME    = <segment_name>
  AND DE.segment_type    = <segment_type>
  AND DE.tablespace_name = <tablespace_name>
  AND DE.partition_name  = <max_partition_name>
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME
;

But the email then goes on to explain: “I’ve got a table of values that I need to use as a list of inputs for this query, but when I do the join it takes ages to complete; how do I make the join run quickly?”

Here’s the initial code:


WITH SEGMENT_LIST AS
(
  select * from (
   SELECT /*+ materialize cardinality(100) */
           owner, segment_name, segment_type, tablespace_name,
           MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
SELECT
       DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM SEGMENT_LIST SL, dba_extents DE
WHERE 1=1
  AND DE.OWNER           = SL.OWNER
  AND DE.SEGMENT_NAME    = SL.SEGMENT_NAME
  AND DE.segment_type    = SL.segment_type
  AND DE.tablespace_name = SL.tablespace_name
  AND DE.partition_name  = SL.max_partition_name
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME

What we’ve got is a query where the user’s reference table generates a list of segments (the rownum < 2 was a temporary test) and we want the detail query to run for each segment identified. The “for each segment” gives us a clue that what we want to see is a simple nested loop join, driven by the factored subquery, with the very efficient query above running as the “second table of the nested loop”.

What I failed to notice at the time is that the /*+ materialize */ hint was in the wrong place; it should have been placed after the outer (i.e. very first) select of the factored subquery, and it’s possible that if it had been in the right place the user would have got the plan they wanted – especially in the later versions of Oracle. As it was, I suggested that we merely needed to tell the optimizer to:

  • Visit the “tables” in the order (segment_list, dba_extents), and do a nested loop into dba_extents.
  • Since both segment_list and dba_extents are views, stop Oracle from trying to merge them and play silly games with distinct aggregate placement, etc. by including directives that the views should not be merged.
  • Then, because the views are not merged, explain to Oracle that it will have to push the join predicate between segment_list and dba_extents inside the dba_extents view.

In other words, a list of 4 hints, as shown below:


WITH SEGMENT_LIST AS
(
  select * from (
   SELECT /*+ materialize cardinality(100) */
           owner, segment_name, segment_type, tablespace_name,
           MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
SELECT /*+
        leading(sl de)
        no_merge(sl)
        no_merge(de)
        push_pred(de)
        */
       DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM SEGMENT_LIST SL, dba_extents DE
WHERE 1=1
  AND DE.OWNER           = SL.OWNER
  AND DE.SEGMENT_NAME    = SL.SEGMENT_NAME
  AND DE.segment_type    = SL.segment_type
  AND DE.tablespace_name = SL.tablespace_name
  AND DE.partition_name  = SL.max_partition_name
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME

According to a follow-up email, this was sufficient. The OP had actually tried variations on the leading() and use_nl() hints – but without the no_merge() hint the optimizer was probably rewriting the SQL in a way that put the hints out of context. As I mentioned above, the /*+ materialize */ hint is still in the wrong place – it should be after the first appearance of the SELECT keyword in the factored subquery – and that probably added to the difficulty of getting the desired execution plan.
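
Just to make the point explicit, here’s a sketch of where that hint would need to sit – I’ve left the cardinality() hint where it was and only moved the materialize hint:


WITH SEGMENT_LIST AS
(
  select /*+ materialize */ * from (
   SELECT /*+ cardinality(100) */
           owner, segment_name, segment_type, tablespace_name,
           MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
-- main query (unchanged) goes here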

For a production system I’d probably want to do something a little more sophisticated in terms of stability once I’d got the plan I wanted – but this looks like a DBA query used to run an ad hoc report, so perhaps this solution is good enough for the current requirement.
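
If I did want a little more stability – and this is not something the note above goes into, just a sketch of one possibility – I might capture the plan from the cursor cache as a SQL Plan Baseline once the query was behaving; the sql_id and plan_hash_value below are placeholders that you’d pick up from v$sql after a good execution:


declare
        l_plans pls_integer;
begin
        l_plans := dbms_spm.load_plans_from_cursor_cache(
                sql_id          => '&m_sql_id',
                plan_hash_value => &m_plan_hash_value
        );
        dbms_output.put_line('Plans loaded: ' || l_plans);
end;
/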

 

February 2, 2016

Partitioned Bitmap Join

Filed under: bitmaps,Bugs,Indexing,Infrastructure,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 8:32 am GMT Feb 2,2016

If you don’t want to read the story, the summary for this article is:

If you create bitmap join indexes on a partitioned table and you use partition exchanges to load data into the table then make sure you create the bitmap join indexes on the loading tables in exactly the same order as you created them on the partitioned table, or the exchange will fail with the (truthful but not quite complete) error: ORA-14098: index mismatch for tables in ALTER TABLE EXCHANGE PARTITION.

My story starts with this OTN posting from John Hall where he found after a year of successful batch loading one of his partition exchanges was raising error 14098. After an exchange of ideas, user rp0428 came up with a query against sys.jijoin$ (one of the tables behind bitmap join indexes) that allowed John Hall to see that the indexes on the exchange table had been created in a different order from that of the partitioned table. I did a quick test to see if this might be relevant (it shouldn’t be – it isn’t with “normal” indexes, function-based indexes, or virtual columns) and didn’t manage to reproduce the problem with two dimension tables and two bitmap join indexes.

Fortunately John didn’t take my word for it and tested the idea on a clone of the production system – and found that the order of creation did matter. His system, however, had 9 dimension tables and 33 bitmap join indexes – which shouldn’t have made any difference in principle, but maybe it was something to do with having several indexes on the same table, maybe it was something to do with having far more tables or far more indexes than I had. So I built a larger test case with six dimension tables and six indexes per table – and reproduced the problem.

Then I started cutting back to see where the problem appeared, and found that all it took was one dimension with two indexes, or two dimensions with one index each – whatever I had done in my “quick test” I had clearly done it too quickly and done something wrong. (Unfortunately I had overwritten most of the code from the original quick test while building the larger test, so I couldn’t go back and see where the error was.)

Here, then, is the minimal test case that I finally ran to demonstrate that switching the order of index creation on the exchange table causes the exchange to fail:


drop table pt_range purge;
drop table t1 purge;
drop table dim_1 purge;
drop table dim_2 purge;

prompt  =================
prompt  Partitioned table
prompt  =================

create table pt_range (
        id,
        grp1,
        grp2,
        padding
)
nologging
partition by range(id) (
        partition p2001 values less than (2001),
        partition p4001 values less than (4001),
        partition p6001 values less than (6001),
        partition p8001 values less than (8001)
)
as
select
        rownum                          id,
        trunc(rownum/100)               grp1,
        trunc(rownum/100)               grp2,
        rpad('x',100)                   padding
from
        all_objects
where 
        rownum <= 8000
;

prompt  ================================================
prompt  Exchange table - loaded to match partition p8001
prompt  ================================================

alter table pt_range 
add constraint pt_pk primary key (id) using index local;

create table t1 (
        id,
        grp1,
        grp2,
        padding
)
as 
select
        rownum + 6000                   id,
        trunc(rownum/100)               grp1,
        trunc(rownum/100)               grp2,
        rpad('x',100)                   padding
from
        all_objects
where 
        rownum <= 2000
;

alter table t1
add constraint t1_pk primary key (id);

execute dbms_stats.gather_table_stats(user,'pt_range')
execute dbms_stats.gather_table_stats(user,'t1')

prompt  ================
prompt  dimension tables
prompt  ================

create table dim_1 
as 
select distinct 
        grp1, 
        cast('A'||grp1 as varchar2(3)) agrp1,
        cast('B'||grp1 as varchar2(3)) bgrp1
from
        t1
;

create table dim_2 as select * from dim_1;

prompt  ===============================
prompt  Primary keys required for BMJIs
prompt  ===============================

alter table dim_1 add constraint d1_pk primary key (grp1);
alter table dim_2 add constraint d2_pk primary key (grp1);

execute dbms_stats.gather_table_stats(user,'dim_1')
execute dbms_stats.gather_table_stats(user,'dim_2')

prompt  ============================
prompt  Creating bitmap join indexes
prompt  ============================

create bitmap index pt_1a on pt_range(d1.agrp1) from pt_range pt, dim_1 d1 where d1.grp1 = pt.grp1 local ;
create bitmap index pt_2a on pt_range(d2.agrp1) from pt_range pt, dim_2 d2 where d2.grp1 = pt.grp2 local ;

prompt  ====================================================
prompt  Pick your index creation order on the exchange table
prompt  ====================================================

create bitmap index t1_1a on t1(d1.agrp1) from t1, dim_1 d1 where d1.grp1 = t1.grp1 ;
create bitmap index t1_2a on t1(d2.agrp1) from t1, dim_2 d2 where d2.grp1 = t1.grp2 ;
-- create bitmap index t1_1a on t1(d1.agrp1) from t1, dim_1 d1 where d1.grp1 = t1.grp1 ;

prompt  ==================
prompt  Exchanging (maybe)
prompt  ==================

alter table pt_range
        exchange partition p8001 with table t1
        including indexes
        without validation
;

I’ve got the same create statement twice for one of the bitmap join indexes – as it stands the indexes will be created in the right order and the exchange will work; if you comment out the first t1_1a create and uncomment the second the exchange will fail. (If you comment out the ‘including indexes’ then the exchange will succeed irrespective of the order of index creation, but that rather defeats the point of being able to exchange partitions.)
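
If you want a quick (if slightly indirect) way of checking the order in which the bitmap join indexes were created on the two tables, something along the following lines – using object_id as a proxy for creation order – is good enough for this little test case:


select  object_id, created, object_name
from    user_objects
where   object_type = 'INDEX'
and     object_name in ('PT_1A', 'PT_2A', 'T1_1A', 'T1_2A')
order by
        object_id
;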

I’ve reproduced the problem in 12.1.0.2, 11.2.0.4 and 10.2.0.5.

Footnote: running an extended trace didn’t help me work out how Oracle is detecting the mismatch; presumably it’s something that gets into the dictionary cache in a general “load the index definition” step. It did, however, show me that (in the “without validation” case) the code seems to check the correctness of the exchange table’s primary key data BEFORE checking whether the indexes match properly.

January 29, 2016

Table Scans

Filed under: Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 12:49 pm GMT Jan 29,2016

It’s amazing how easy it is to interpret a number incorrectly until the point comes where you have to look at it closely – and then you realise that there was a lot more to the number than your initial casual assumption, and you would have realised it all along if you’d ever needed to think about it before.

Here’s a little case in point. I have a simple (i.e. non-partitioned) heap table t1 which is basically a clone of the view dba_segments, and I’ve just connected to Oracle through an SQL*Plus session then run a couple of SQL statements. The following is a continuous log of my activity:


SQL> select table_name, partitioned, blocks from user_tables;

TABLE_NAME           PAR     BLOCKS
-------------------- --- ----------
T1                   NO         958

1 row selected.

SQL> select name, value from v$mystat ms, v$statname sn where ms.statistic# = sn.statistic# and name like '%table scan%';

NAME                                  VALUE
-------------------------------- ----------
table scans (short tables)                9
table scans (long tables)                 0
table scans (rowid ranges)                0
table scans (cache partitions)            0
table scans (direct read)                 0
table scan rows gotten                 4188
table scan blocks gotten                 14

7 rows selected.

SQL> select count(extents) from t1;

COUNT(EXTENTS)
--------------
          5143

1 row selected.

SQL> select name, value from v$mystat ms, v$statname sn where ms.statistic# = sn.statistic# and name like '%table scan%';

Before I show you the output of that last query, a few details about the test environment:

  • My cache size is quite small, so t1 doesn’t count as a “short” table.
  • I’ve collected stats on the table (and there are no indexes) so the optimizer doesn’t need to do any dynamic sampling to generate an execution plan.
  • This is 11g, so there are no SQL Plan Directives in place to force dynamic sampling.

So here’s the question: how many “table scans (long tables)” will Oracle record against my session when I re-run that query against v$mystat ?

Warning – this IS a trick question.

Update number 1

I said it was a trick question and, as you will have seen if you’ve had time to read the comments, the answer is going to depend on various configuration options. Here’s what I got in my test – and I’ve reported not just the session stats, but the segment statistics (v$segment_statistics):


NAME                                  VALUE
-------------------------------- ----------
table scans (short tables)                9
table scans (long tables)                26
table scans (rowid ranges)               26
table scans (cache partitions)            0
table scans (direct read)                 0
table scan rows gotten                 9331
table scan blocks gotten                954

7 rows selected.

SQL> select owner, object_name, object_type, value  from v$segment_statistics where owner = 'TEST_USER' and statistic_name = 'segment scans' and value != 0;

OWNER           OBJECT_NAME          OBJECT_TYPE             VALUE
--------------- -------------------- ------------------ ----------
TEST_USER       T1                   TABLE                      26

1 row selected.

I recorded 26 long table scans – and that 26 was echoed in the segment statistics. (Note: I don’t normally use v$segment_statistics, which is a join between a couple of in-memory structures and three real tables; I usually query v$segstat.) It’s the segment statistics that made me pause in a recent problem review; up to that moment I had been using the “Segments by Table Scans” section of the AWR Report as a useful (but badly named – since it also counts (most) index fast full scans) indicator of a potential threat; then I suddenly realised that there was yet another important detail I had to check before I could determine what the numbers were telling me.

So the next question is WHY has Oracle reported 26 tablescans ?

Update number 2

Martin Preiss (see comments) has been working hard to investigate this, and managed to produce a couple more variations in statistics for “the same” problem. In my case I had declared the table as “parallel 2”. For a table of degree N Oracle is very likely to break the table into 13*N chunks – there’s a historic reason for 13 – and that’s why I see 26 “rowid range” scans which, unfortunately, are also recorded as “long” table scans even though each is individually well short of the short table limit. This is really annoying when you start to look at the “Segments by Table Scans” report because you now don’t know how many times a table really was scanned unless you know something about the run-time degree of parallelism involved.
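
So before you trust the numbers you need to know something about the degree. A quick sanity check against the declared degree (which, of course, need not be the degree actually used at run time) looks something like the following – and 13 granules per slave multiplied by degree 2 gives the 26 scans reported above:


select  table_name, degree, instances
from    user_tables
where   table_name = 'T1'
;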

If you check the hidden parameters you will find several parameters relating to this chunking, in particular:


_px_min_granules_per_slave        minimum number of rowid range granules to generate per slave (default  13)
_px_max_granules_per_slave        maximum number of rowid range granules to generate per slave (default 100)

Technically, therefore, Oracle MIGHT get to 100 granules per parallel slave, and a tablescan at degree 2 could be recorded as 200 tablescans!

Martin asked why my 26 tablescans didn’t show up as “table scans (direct read)” – but that’s 11g for you: it allows a serial tablescan to use direct path reads, and it allows parallel tablescans to read into the cache, and the combination of cache size and table size meant that my example just happened to read into the cache.

And that introduces ANOTHER interpretation problem – what annoying things might I discover if I declare the table with the CACHE option (as Ivica suggested in his first comment below) ?

Final Update (probably)

Ivica has been busy checking the CACHE option and has shown that there are a number of cases to consider (more, in fact, than I was planning to mention – and the variations in the combinations of NOPARALLEL and CACHE should give you some pause for thought). The only point I wanted to make was the effect of enabling PARALLEL and CACHE; I don’t think that this is a combination that is very likely to appear in a production system, but boundary conditions (and accidents) do occur. With my little sample – even after I grew the table to be MUCH larger – the tablescan DIDN’T get reported: here are some session stats and segment stats (taken using my snapshot code) of a single tablescan running parallel 2 when the table was cached:


Name                                                                     Value
----                                                                     -----
table scans (rowid ranges)                                                  26
table scans (cache partitions)                                              26
table scan rows gotten                                                 164,672
table scan blocks gotten                                                29,611

  Statistic                                    Value
  ---------                             ------------

T1
  logical reads                               30,272
  physical reads                              29,614
  physical read requests                         261

I know I’ve done a tablescan in 26 pieces (parallel 2) – and scanned 29,000+ blocks doing it; but according to the segment stats AND session stats I haven’t done a tablescan. Fortunately, of course, I can see the parallel tablescan in the session stats, and in this isolated case I can see from the “cache partitions” statistics that that tablescan was on a cached table. But if I’ve been depending on the segment stats to tell me about which tablescans happen most frequently and do most work I’ve found another reason why I can’t trust the stats and have to do more work cross-checking different parts of the AWR for self-consistent results.

I don’t expect to add any more comments about this mixture of tablescans and parallelism, with the varying effects on the session and segment statistics – but there’s always the possibility that one day I’ll start to worry about how the KEEP and RECYCLE (db_keep_cache_size and db_recycle_cache_size) could confuse things further.

Footnote:

It’s probably worth pointing out that the segment statistics have never recorded SHORT tablescans; they’ve only ever captured details of LONG tablescans. There is a slight inconsistency here, though, since they capture all the “index fast full scans (full)” reported in the session stats whether they are short or long – not that the session stats record the difference; unfortunately, another flaw creeps in: parallel index fast full scans “disappear” from the stats, although they show up as “index fast full scans (rowid ranges)” and “index fast full scans (direct)” with the same multiplier of 13 that we see for parallel table scans.

January 27, 2016

Add primary key.

Filed under: Indexing,Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 9:07 am GMT Jan 27,2016

I thought I had written this note a few years ago, on OTN or Oracle-L if not on my blog, but I can’t find any sign of it so I’ve decided it’s time to write it (again) – starting as a question about the following code:


create table t1
as
with generator as (
        select  rownum  id
        from            dual
        connect by
                        rownum <= 1000
)
select
        rownum                                  id,
        trunc((rownum-1)/50)                    clustered,
        mod(rownum,20000)                       scattered,
        lpad(rownum,10)                         vc_small,
        rpad('x',100,'x')                       vc_padding
from
        generator       g1,
        generator       g2
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

alter system flush buffer_cache;

alter table t1 add constraint t1_pk primary key(id, scattered);

I’ve generated a table with 1,000,000 rows, including a column that’s guaranteed to be unique; then I’ve added a (two-column) primary key constraint to that table.

Because of the guaranteed unique column the call to add constraint will succeed. Because Oracle will automatically create a unique index to support that constraint it will have to do a tablescan of the table. So here’s the question: HOW MANY TIMES will it tablescan that table (and how many rows will it scan) ?

Space for thought …

The answer is three tablescans, 3 million rows.

Oracle will scan the table to check the validity of adding a NOT NULL definition and constraint for the id column, repeat the scan to do the same for the scattered column, then one final scan to accumulate the key data and rowids to sort and create the index.

Knowing this, you may be able to find ways to modify bulk data loading operations to minimise overheads.
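
As a sketch of the type of variation I mean (hypothetical, and dependent on your being able to control the DDL): if the primary key columns are declared NOT NULL before the data arrives then the later “add constraint” shouldn’t need the two validation scans, leaving just the scan that builds the index:


create table t1 (
        id              number          not null,
        clustered       number,
        scattered       number          not null,
        vc_small        varchar2(10),
        vc_padding      varchar2(100)
);

insert  /*+ append */ into t1
select  rownum, trunc((rownum-1)/50), mod(rownum,20000),
        lpad(rownum,10), rpad('x',100,'x')
from    dual
connect by
        level <= 1e6            -- or the two-generator approach used above
;

commit;

alter table t1 add constraint t1_pk primary key(id, scattered);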

The most recent version I’ve tested this on is 12.1.0.2.

See also: https://jonathanlewis.wordpress.com/2012/03/02/add-constraint/

My reference: pk_overhead.sql

January 26, 2016

Trace file size

Filed under: 12c,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 8:30 am GMT Jan 26,2016

Here’s a convenient enhancement for tracing that came up on Twitter a few days ago – first in a tweet that I retweeted, then in a question from Christian Antognini based on this bit of the 12c Oracle documentation (opens in separate tab). The question was – does it work for you ?

The new description for max_dump_file_size says that for large enough values Oracle will split the file into multiple chunks of a few megabytes, using a suffix to identify the sequence of the chunks, keeping only the first chunk and the most recent chunks. Unfortunately this doesn’t seem to be true. However, prompted by Chris’ question I ran a quick query against the full parameter list looking for parameters with the word “trace” in their name:


select
        /*+
                leading(nam val val2)
                full(name)
                full(val)  use_hash(val)  no_swap_join_inputs(val)
                full(val2) use_hash(val2) no_swap_join_inputs(val2)
        */
        nam.ksppinm                             name,
        val.ksppstvl                            ses_val,
        val2.ksppstvl                           sys_val,
        nam.ksppdesc                            description,
        nam.indx+1                              numb,
        nam.ksppity                             type,
        val.ksppstdf                            is_def,
        decode(bitand(nam.ksppiflg/256,1),
                1,'True',
                  'False'
        )                                       ses_mod,
        decode(bitand(nam.ksppiflg/65536,3),
                1,'Immediate',
                2,'Deferred' ,
                3,'Immediate',
                  'False'
        )                                       sys_mod,
        decode(bitand(val.ksppstvf,7),
                1,'Modified',
                4,'System Modified',
                  'False'
        )                                       is_mod,
        decode(bitand(val.ksppstvf,2),
                2,'True',
                  'False'
        )                                       is_adj,
        val.ksppstcmnt                          notes
from
        x$ksppi         nam,
        x$ksppcv        val,
        x$ksppsv        val2
where
        nam.indx = val.indx
and     val2.indx = val.indx
and     ksppinm like '%&m_search.%'
order by
        nam.ksppinm
;

Glancing through the result I spotted a couple of interesting parameters with the letters “uts” in their names, so re-ran my query looking for all the “uts” parameters, getting the following (edited) list:


NAME                           SYS_VAL         DESCRIPTION    
------------------------------ --------------- ---------------------------------------------
_diag_uts_control              0               UTS control parameter
_uts_first_segment_retain      TRUE            Should we retain the first trace segment
_uts_first_segment_size        0               Maximum size (in bytes) of first segments 
_uts_trace_disk_threshold      0               Trace disk threshold parameter
_uts_trace_segment_size        0               Maximum size (in bytes) of a trace segment
_uts_trace_segments            5               Maximum number of trace segments 

Note particularly the “first segment size” and “trace segment size” – defaulting to zero (which often means a hidden internal setting, though that doesn’t seem to be the case here, but maybe that’s what the “diag control” is for). I haven’t investigated all the effects, but after a little experimentation I found that all I needed to do to get the behaviour attributed to max_dump_file_size was to set the following two parameters – which I could do at the session level.


alter session set "_uts_first_segment_size" = 5242880;
alter session set "_uts_trace_segment_size" = 5242880;

The minimum value for these parameters is the one I’ve shown above (5120 KB) and with the default value for _uts_trace_segments you will get a maximum of 5 trace files with sequential names like the following:

ls -ltr *4901*.trc

-rw-r----- 1 oracle oinstall 5243099 Jan 26 08:15 orcl_ora_4901_1.trc
-rw-r----- 1 oracle oinstall 5243064 Jan 26 08:15 orcl_ora_4901_12.trc
-rw-r----- 1 oracle oinstall 5243058 Jan 26 08:15 orcl_ora_4901_13.trc
-rw-r----- 1 oracle oinstall 5242993 Jan 26 08:15 orcl_ora_4901_14.trc
-rw-r----- 1 oracle oinstall 1363680 Jan 26 08:15 orcl_ora_4901.trc

As you can see I’m currently generating my 15th trace file, and Oracle has kept the first one and the previous three. It’s always working on a file with no suffix to its name but as soon as that file hits its limiting size (plus or minus a few bytes) it gets its appropriate suffix, the oldest file (other than the retained first one) is deleted, and a new trace file without a suffix is started.

Apart from the usual header information the trace files start and end with lines like:

*** TRACE CONTINUED FROM FILE /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_11.trc ***
  
*** TRACE SEGMENT RENAMED TO /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_12.trc ***

There is one little trap to watch out for: if you set either of these parameters to be larger than max_dump_file_size, tracing stops as soon as one of the segments hits the max_dump_file_size and that trace file ends with the usual “overflow” message – e.g., when I changed the max_dump_file_size to 4M in mid-session:

*** DUMP FILE SIZE IS LIMITED TO 4194304 BYTES ***

In my case I had started with max_dump_file_size set to 20M, so I got lucky with my choice of 5M as the segment size.
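
In other words, if you do experiment with these parameters it’s probably worth setting all three limits at once so that the arithmetic can’t catch you out – something like the following, where the 100M is just a comfortably large margin:


alter session set max_dump_file_size = '100M';
alter session set "_uts_first_segment_size" = 5242880;
alter session set "_uts_trace_segment_size" = 5242880;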

Further investigation is left as an exercise to the interested reader.

 

November 30, 2015

Trouble-shooting

Filed under: Oracle,Troubleshooting,Uncategorized — Jonathan Lewis @ 6:03 am GMT Nov 30,2015

This is the text of the “whitepaper” I submitted to DOAG for my presentation on “Core Strategies for Troubleshooting”.

Introduction

In an ideal world, everyone who had to handle performance problems would have access to ASH and the AWR through a graphic interface – but even with these tools you still have to pick the right approach, recognise the correct targets, and acquire information at the boundary that tells you why you have a performance problem and the ways in which you should be addressing it.

There are only three ways you can waste resources on an Oracle system, and only three different types of activity that need to be investigated. If you don’t appreciate that this is the case then you can waste a lot of time following the wrong strategy and attempting to solve the wrong problems. Once you have a clear statement of what you are trying to achieve it becomes much easier to achieve those aims.

Three ways to cause problems

There really are only three symptoms you need to look out for in the database:

  • You’re doing it the hard way
  • You’re doing it too often
  • You’re queueing / locking

Inevitably you can see that there is going to be some overlap between the three (and it would be easy to argue that the third is simply a side effect of the other two). If you are executing a single statement “the hard way” you’re likely to be doing single block reads or buffer gets too often. If you’re executing a very lightweight statement too often it’s probably a step in a process that is attempting to get a job done “the hard way”, and it’s probably causing (and getting stuck in) queues for latches and mutexes. I’ve included queueing as the third option because the simple action of locking data (deliberately, or by accident) doesn’t fall neatly into the other two.

Another way of looking at this list is to reduce it to just two items with a slightly different flavour: when you spend too much time on a task it’s because you’re either doing too much work or you’re not being allowed to work.

Three classes of problems to solve

An important aspect of performance problems is the high-level classification. Labelling the class of problem properly points you to the correct strategy for investigating the problem. Again there are only three possibilities in the list, which I’ll express as typical complaints:

  • My report is taking too long to run / the screen is taking too long to refresh
  • The batch job over-ran the SLA last night
  • The “system” is slow

What’s the most significant difference between the three classes of complaint ?

  • “My XXX takes too long”: XXX is a repeatable event that can be isolated and watched – just do it again, and again, and again, and again while I watch every step of what’s going on.
  • The batch job over-ran last night: It’s not a repeatable process, so you’ve got to infer what the problem was from historical evidence; it’s (usually) made up of a number of concurrent processes, which may interfere with each other to varying degrees depending on their relative start and finish times.
  • The “system” is slow: possibly all the time, possibly intermittently – if there’s no specific complaint then the only option is to keep an eye open for resource-intensive activity to see if you can reduce the resource usage of specific individual tasks (for which read SQL or PL/SQL statements) or reduce the number of times that those tasks are executed.

The common source

Despite the fact that we can classify performance problems in three ways, it’s worth remembering that ALL the information we might use to drive our diagnosis comes from one place – the Oracle database. At any moment we have sessions that are active, operating statements (cursors), and using a resource such as a file, a latch, a buffer, and so on. We could almost represent each moment by a cube with sessions along one side, cursors along another, and resources along the third – the cube changes moment by moment, allowing us to visualise time as the fourth dimension in a hypercube of activity.

Instant by instant the Oracle kernel code knows which session is using what resource to operate which cursor and, although the total volume of all that information is far more than could reasonably be recorded, Oracle has many different ways of slicing, dicing and capturing parts of that hypercube – running totals, rolling aggregates, snapshots by session, by cursor, by resource and so on – that are made visible as the dynamic performance views (v$ objects). Trouble-shooting is largely a case of deciding which dynamic performance views are the most appropriate to use for our three classes of task.

Active Session History

Before reviewing the three classes, it’s worth breaking off for a moment to say a few things about one of the most important and useful views that we have of the wealth of information available; this is the active session history (v$active_session_history / ASH) which Oracle uses to capture a snapshot once every second of what each currently active session is doing; every 10th snapshot is then echoed down into the Automatic Workload Repository (AWR) by a process that runs every hour to copy the dynamic view to a table that can be accessed through the database view dba_hist_active_sess_history.

The capture rates can all be adjusted: I have never seen anyone change from one snapshot per second, or every 10th snapshot in a production system, but I’ve often seen the dump to the AWR taken every 30 minutes, occasionally 20 or even 15 minutes. On occasion I have asked clients to do a CTAS (create table as select – nologging) to capture the whole of the v$active_session_history to a table that can be exported and reviewed at a later date. Oracle tries to keep a minimum of at least an hour’s ASH in memory but, with a large enough SGA, you may find that this will stretch out to 3 or 4 hours.
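
That capture really is as simple as it sounds – a sketch (the table name is just an example):


create table ash_copy nologging
as
select  *
from    v$active_session_history
;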

Unfortunately, although (or perhaps because) ASH and its AWR history are extremely helpful, you have to pay extra licence fees to use the information, and the technology can only be licensed with the Enterprise Edition of Oracle. (Which is why there are some ASH emulators on the Internet.)

My report is slow

The special feature of someone complaining about a specific task is that it’s likely to be repeatable – so we can run it again and again and watch every single detail to see where the time goes. Our slice through the hypercube could take a single session over a period of time and report every action along that path. This, of course, is the 10046 trace – a.k.a. the extended SQL trace event. We can enable it in many ways, perhaps through a logon trigger, perhaps through a call to dbms_monitor:


begin
        dbms_monitor.session_trace_enable( 
                session_id => &m_sid, 
                serial_num => &m_serial,
                waits      => true, 
                bind       => true, 
                plan_stat  => 'all_executions'
        );
end; 
/

In this example I’ve requested all wait states and bind variables to be dumped into the trace file; I’ve also requested that the execution plan (with rowsource execution stats) be dumped for every single execution of every single statement. Sometimes a problem arises because a particular set of bind variables represents a special case that causes a “reasonable” plan to behave very badly. If we’re going to look closely we may as well get as much detail as possible.
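
For completeness, here’s the matching call to switch the tracing off again once you’ve captured what you need:


begin
        dbms_monitor.session_trace_disable(
                session_id => &m_sid,
                serial_num => &m_serial
        );
end;
/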

The entire “trace” interface was upgraded dramatically in 11g, and one of the useful variants on this theme is particularly relevant to a commonly used Web-based implementation. If you know that a specific screen task corresponds to a particular PL/SQL package you can enable tracing of a cursor (across the system, if necessary) by SQL_ID. So, for example, you might issue the following two commands, with a couple of minutes gap between the two:


alter system
        set events '
                sql_trace[SQL:1wthpj7as7urp]
                plan_stat=all_executions,
                wait=true, 
                bind=true
        '
;

-- wait a few minutes

alter system
        set events 'sql_trace[SQL:1wthpj7as7urp] off'
;

Every time the statement with SQL_ID =‘1wthpj7as7urp’ is executed, the session executing it will start adding information to the session trace file, and when the statement ends the tracing will end. This is particularly nice if the “statement” is a top-level call to a PL/SQL procedure because all the SQL inside the procedure will be traced as the package executes.

For a highly focused, highly repeatable task, the 10046 trace event is almost always all you need to do.

The batch over-ran

The big difference between this case and the previous one is that “the batch” is not something you can simply repeat and watch. Moreover, “the batch” is likely to be a large number of separate sections of code that are scheduled to run with a fairly fluid timetable that can result in changes from day to day (or, more likely, night to night) in the set of jobs that might be running concurrently. This means that even if you could re-run the batch job (perhaps on the previous night’s backup) you might not see the same problem appear because a small change in timing could result in a large change in contention.

One of the most important steps of dealing with the batch is pre-emptive: instrument your code and make it possible to compare the run on one night with the run on another. At the very least you need to have something capturing the start and end times of each “significant component” of the batch so you can quickly answer questions like: “which jobs took much longer than usual”, “which job was the first job that took longer than usual”, “which jobs were running concurrently with job X last night when they never usually overlap?”
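
A minimal sketch of that sort of instrumentation – the table and procedures here are purely hypothetical examples, and most sites will want something richer:


create table batch_log (
        job_name        varchar2(64),
        start_time      timestamp,
        end_time        timestamp
);

create or replace procedure log_job_start(p_job_name in varchar2) as
        pragma autonomous_transaction;
begin
        insert into batch_log(job_name, start_time)
        values (p_job_name, systimestamp);
        commit;
end;
/

create or replace procedure log_job_end(p_job_name in varchar2) as
        pragma autonomous_transaction;
begin
        update  batch_log
        set     end_time = systimestamp
        where   job_name = p_job_name
        and     end_time is null;
        commit;
end;
/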

Ideally you should have much more information than this about each job – basically a report from Oracle which says “how much work did I do, how much time did I spend”: for a session this is simply a report of v$mystat or v$sesstat (joined to v$statname) and v$session_event for the session (v$mystat is a less well-known view that is v$sesstat restricted to “my” session); if you classify each job as “connect to the database, do something, disconnect” then this critical log is simply a pair of select statements spooled out somewhere convenient, or written to the database; if you want to break a single connection into significant pieces then a simple pl/sql procedure could read the statistics into a pl/sql array as the piece starts, then re-read the stats and calculate the differences as the piece ends.
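
The simplest version of that critical log is just a pair of queries captured at the start and at the end of the job – a sketch, run from the job’s own session:


select  sn.name, ms.value
from    v$mystat ms, v$statname sn
where   sn.statistic# = ms.statistic#
and     ms.value != 0
order by
        sn.name
;

select  event, total_waits, time_waited_micro
from    v$session_event
where   sid = sys_context('userenv','sid')
order by
        time_waited_micro desc
;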

Knowing where the time went, and knowing how tasks have behaved differently from previous runs is a big step forward to identifying the problem.

If you don’t have the instrumentation you need then the AWR (if you’re licensed) or Statspack (if you’re not licensed) is a step in the right direction. Apart from the typical hourly snapshots and reports you can take a snapshot as the first and last steps of the batch (it’s a one-liner – see the sketch after the following list) so that you’ve got “the whole batch” in a single AWR/Statspack report. If you’ve got that you can then do comparisons for things like:

  • Which event consumed much more time than usual
  • Which SQL took much more time than usual
  • Which segment(s) saw much more activity than usual
  • Was there some unusual enqueue activity
  • Can we see some unusual outliers in the event histograms
  • Can we see some unusual memory demands in the pga histogram
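
Taking those boundary snapshots is a one-liner in each case – something like:


-- AWR (requires the diagnostic pack licence)
execute dbms_workload_repository.create_snapshot

-- Statspack
execute statspack.snap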

Although system-wide summaries rarely point us at exact causes, they can often give us strong clues about the areas (and times) where the problem originated.

In this respect the “Top Activity” screen from Enterprise Manager (Grid Control / Cloud Control) can be very helpful as it produces a nice graphical presentation of “working”. Where, in the picture of last night’s activity, does the graph start to ramp up, what colour is the bit that’s growing, and how does that picture compare to the same picture the previous night? (Having two windows open with two different nights makes it wonderfully easy to switch between displays and spot the differences.) Since the “historical” top activity screen is created from the dba_hist_active_sess_history, which contains about 100 different details per session of each captured moment, it’s very easy to drill through the spikes to answer questions like: “which object”, “what event”, “which SQL”, “what was the execution plan”, “how much work did that take”, to follow the chain of time back to the cause.

The system is slow

If no-one is going to tell you about specific tasks, and if you don’t have any sort of boundary that allows you to focus on tasks or time-ranges, then the simplest thing to do is look for anything expensive (i.e. time-consuming) and see if you can make it cheaper.

Again, the graphic “Top Activity” screen is very helpful, and I often tell people to arrange to have a system that shows the top activity screens for the most important 2 or 3 databases on a large screen on the wall where any passing DBA might notice a brief spike in workload. There are systems that can be improved by constant monitoring – so long as the monitoring doesn’t take out 100% of an individual’s time but is driven as an informal glance at a picture.

If you’re not licensed to take advantage of the AWR then Statspack can help – but with the smallest time interval (though 15 minutes is as low as I’ve ever gone) between snapshots so that “anomalies” that are short-lived don’t fall out of memory before they can be captured.

An important feature of reading Statspack is that you need to check for missing information – if the headline figure for physical reads is 25M but the “SQL ordered by reads” is 12M then you know that there must be 13M reads that didn’t get captured in the report and that might be the 13M that is causing the problem. Similarly if the “Segments by physical reads” reports 16M reads that’s 4M more than the SQL – but is the 12M a subset of the 16M, or is there only a 3M overlap between the two figures so that between them the 12M and 16M cover the entire 25M? There’s more information in the Statspack report than immediately meets the eye, and a careful drilldown into other areas of the report (typically the Instance Activity) may be needed to clarify the meaning of what you’re seeing as a headline figure.

The other thing you can do with “the slow system” when you don’t have ASH to help is take snapshots (or get some freeware to do the same sort of thing). If the system is slow “right now” you could, for example, take a snapshot of v$sess_io (session I/O), wait 30 seconds then take another snapshot, find the difference and see who is doing most of the I/O work – then chase that session; or take snapshots of v$sesstat limited to (say) statistics like “%redo%” and find out who is generating lots of redo.
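
Here’s a quick and dirty sketch of that approach for v$sess_io – good enough for a 30 second check, not something you’d leave lying around:


create table sess_io_snap as
select  sid, block_gets, consistent_gets, physical_reads
from    v$sess_io
;

-- wait 30 seconds or so, then:

select  io.sid,
        io.physical_reads - snap.physical_reads                 phys_read_delta,
        (io.block_gets + io.consistent_gets)
          - (snap.block_gets + snap.consistent_gets)            logical_io_delta
from    v$sess_io io, sess_io_snap snap
where   snap.sid = io.sid
order by
        phys_read_delta desc
;

drop table sess_io_snap purge;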

Oracle allows you to take this approach back into recent history – there are a number of “metric” views which give you things like the rolling average, min, and max I/O volumes for the last few intervals of 5 minutes or 1 minute each – telling you, in other words, whether there were any interesting bursts of extreme activity in the recent past. For example, a query against v$sysmetric_summary might give you an output like the following:

METRIC_NAME                     MAXVAL     AVERAGE STANDARD_DEV METRIC_UNIT
------------------------ ------------- ----------- ------------ -----------------------
Physical Reads Per Sec        1,618.95      105.92       358.16 Reads Per Second
Physical Reads Per Txn       97,202.00    5,539.19    20,811.56 Reads Per Txn
Redo Generated Per Sec    6,773,108.94  218,132.86 1,023,458.57 Bytes Per Second
User Calls Per Txn              395.00       43.39        79.85 Calls Per Txn
Total Parse Count Per Sec        31.14        1.88         4.25 Parses Per Second
Host CPU Utilization (%)         64.51        3.93         9.07 % Busy/(Idle+Busy)
Database Time Per Sec            82.96        6.65        15.37 CentiSeconds Per Second
I/O Megabytes per Second         35.58        2.62         5.73 Megabtyes per Second

This summarises the last 12 intervals of 5 minutes. If we look at “Physical Reads per Txn” we can see that there were some extreme swings in activity over that period, so we could drill down into v$sysmetric_history for “Physical Reads per txn”, looking at the 1 minute granularity and see:

METRIC_UNIT                    BEGIN_TIME                VALUE
Physical Reads Per Txn         05-feb 12:45:55          421.00
                               05-feb 12:44:55          477.00
                               05-feb 12:43:55          351.00
                               05-feb 12:42:55          406.84
                               05-feb 12:41:55        1,550.00
                               05-feb 12:40:55       93,984.00
                               05-feb 12:39:55       97,202.00
                               05-feb 12:38:55       93,323.00
                               05-feb 12:37:55          391.00
                               05-feb 12:36:55          504.00
                               05-feb 12:35:55          504.00
                               05-feb 12:34:55          252.00

Yes, a few minutes ago something exploded onto the system doing a huge amount of I/O for about 3 minutes. If we’re lucky we might now drill into the v$sesstat, or v$sess_io, or v$session_event to see if we can find a session that is responsible for a large amount of I/O; and then check v$open_cursor to see if it still has some open cursors that might (if we check v$sql) show us what caused the I/O.
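
The tail end of that drill-down might look something like the following sketch – the SID being whichever session the earlier snapshots pointed at:


select  distinct
        oc.sql_id, sq.disk_reads, sq.buffer_gets, sq.executions, sq.sql_text
from    v$open_cursor oc, v$sql sq
where   oc.sid          = &m_suspect_sid
and     sq.sql_id       = oc.sql_id
and     sq.child_number = 0
order by
        sq.disk_reads desc
;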

When there are no specific complaints, we just keep an eye open for spikes in activity and try to track them down as quickly and cheaply as possible to see if they’re worth addressing.

Summary

Oracle gives you a huge amount of information about the work that’s going on and the time that’s being used in the database. Unfortunately the most useful repository of that information is in a dynamic performance view that can only be viewed in the Enterprise Edition after purchasing additional licences. However, the information is summarised, in many different ways in literally hundreds of other dynamic performance views, and it’s easy to pick out helpful information from those views in a variety of ways.

Key to making the best use of those views, though, is recognising that different classes of performance problems require different strategies – and there are only three different classes of problems to worry about.

November 3, 2015

Nul points

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 8:16 am GMT Nov 3,2015

(To understand the title, see this Wikipedia entry)

The title could also be: “Do as I say, don’t do as I do”, because I want to remind you of an error that I regularly commit in my demonstrations. Here’s an example:

 
SQL> create table t (n number); 

Table created 

Have you spotted the error yet ? Perhaps this will help:

SQL> insert into t select 1 - 1/3 * 3 from dual; 

1 row created. 

SQL> insert into t select 1 - 3 * 1/3 from dual; 

1 row created. 

SQL> column n format 9.99999999999999999999999999999999999999999 
SQL> select * from t; 

                                           N
--------------------------------------------
  .00000000000000000000000000000000000000010
  .00000000000000000000000000000000000000000

2 rows selected. 

Spotted the error yet ? If not then perhaps this will help:

SQL> select * from dual where 3 * 1/3 = 1/3 * 3;

no rows selected 

SQL> select * from dual where 3 * (1/3) = (1/3) * 3; 

D
-
X

1 row selected. 

Computers work in binary, people (tend to) work in decimal. 10 = 2 * 5, and 5 (more precisely, dividing by 5) is something that a computer cannot do accurately. So when you do arbitrary arithmetic you should use some method to deal with tiny rounding errors.

In Oracle this means you ought to define all numbers with a precision and scale. Look on it as another form of constraint that helps to ensure the correctness of your data as well as improving performance and reducing wasted storage space.
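
A quick sketch of the difference the declaration makes to the stored data (the scale here is just for demonstration):


create table t2 (n number(38,35));

insert into t2 select 1 - 1/3 * 3 from dual;
insert into t2 select 1 - 3 * 1/3 from dual;

select count(distinct n) from t2;

-- with the declared scale both inserts should now store exactly zero,
-- so the count of distinct values should be 1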

September 23, 2015

Forget-me-nots

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 8:17 am GMT Sep 23,2015

Here’s a little note that I drafted (according to its date stamp) in January 2013 and then forgot to post. (Which adds a little irony to the title.)

============================================================

Here’s an object lesson in (a) looking at what’s in front of you, and (b) how hard it is to remember all the details.

I ran a script today [ED: i.e. some time early Jan 2013] that I’ve had no problems with in earlier versions of Oracle, but today I was running it against 11.2.0.3 for the first time, and hit a problem with autotrace:

SQL> set autotrace on
ERROR:
ORA-28002: the password will expire within 5 days

SP2-0619: Error while connecting
SP2-0611: Error enabling STATISTICS report

Anyone who reads the preceding text closely will see immediately what the problem is – but I saw the bottom line and immediately decided that I had forgotten to set up the plustrace role in this database. So I logged on as SYS and ran the script to create it ($ORACLE_HOME/sqlplus/admin/plustrce.sql) – and discovered that the role already existed. So I decided that I hadn’t granted plustrace to my test_user role, but that didn’t help; so I decided that this was clearly a case of a role that had to be assigned directly to a user, and that didn’t help.

THEN I read the error message properly, changed my password, and everything worked as expected.

That’s part (a) of the lesson; here’s part (b): I wrote about running into the same issue nearly two years ago (only that time I had a little excuse for not spotting the problem instantly).

============================================================

August 3, 2015

Demo data

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 1:26 pm GMT Aug 3,2015

One of the articles I wrote for redgate’s AllthingsOracle site some time ago included a listing of the data distribution for some client data which I had camouflaged. A recent comment on the article asked how I had generated the data – of course the answer was that I hadn’t generated it, but I had done something to take advantage of its existence without revealing the actual values.  This article is just a little note showing what I did; it’s not intended as an elegant and stylish display of perfectly optimised SQL, it’s an example of a quick and dirty one-off  hack that wasn’t (in my case) a disaster to run.

I’ve based the demonstration on the view all_objects. We start with a simple query showing the distribution of the values of column object_type:


break on report
compute sum of count(*) on report

select
        object_type, count(*)
from
        all_objects
group by object_type
order by
        count(*) desc
;

OBJECT_TYPE           COUNT(*)
------------------- ----------
SYNONYM                  30889
JAVA CLASS               26447
...
JAVA RESOURCE              865
TRIGGER                    509
JAVA DATA                  312
...
JAVA SOURCE                  2
DESTINATION                  2
LOB PARTITION                1
EDITION                      1
MATERIALIZED VIEW            1
RULE                         1
                    ----------
sum                      76085

44 rows selected.

Starting from this data set I want 44 randomly generated strings and an easy way to translate the actual object type into one of those strings. There are various ways to do this but the code I hacked out put the original query into an inline view, surrounded it with a query that added a rownum to the result set to give each row a unique id, then used the well-known and much-loved  “connect by level” query against  dual to generate a numbered list of randomly generated strings as an inline view that I could use in a join to do the translation.


execute dbms_random.seed(0)

column random_string format a6

select
        generator.id,
        dbms_random.string('U',6)       random_string,
        sum_view.specifier,
        sum_view.ct                     "COUNT(*)"
from
        (
        select
                rownum  id
                from    dual
                connect by
                        level <= 100
        )       generator,
        (
        select
                rownum          id,
                specifier,
                ct
        from
                (
                select
                        object_type specifier, count(*) ct
                from
                        all_objects
                group by
                        object_type
                order by
                        count(*) desc
                )
        )       sum_view
where
        sum_view.id = generator.id
order by
        ct desc
;

        ID RANDOM SPECIFIER             COUNT(*)
---------- ------ ------------------- ----------
         1 BVGFJB SYNONYM                  30889
         2 LYYVLH JAVA CLASS               26447
...
         9 DNRYKC JAVA RESOURCE              865
        10 BEWPEQ TRIGGER                    509
        11 UMVYVP JAVA DATA                  312
...
        39 EYYFUJ JAVA SOURCE                  2
        40 SVWKRC DESTINATION                  2
        41 CFKBRX LOB PARTITION                1
        42 ZWVEVH EDITION                      1
        43 DDAHZX MATERIALIZED VIEW            1
        44 HFWZBX RULE                         1
                                      ----------
sum                                        76085

44 rows selected.

I’ve selected the id and original value here to show the correspondence, but didn’t need to show them in the original posting. I’ve also left the original (now redundant) “order by” clause in the main inline view, and you’ll notice that even though I needed only 44 distinct strings for the instance I produced the results on, I generated 100 values as a safety margin for testing the code on a couple of other versions of Oracle.

A quick check for efficiency – a brief glance at the execution plan, which might have prompted me to add a couple of /*+ no_merge */ hints if they’d been necessary – showed that the work done was basically the work of the original query plus a tiny increment for adding the rownum and doing the “translation join”. Of course, if I’d then wanted to translate the full 76,000 row data set and save it as a table I’d have to join the result set above back to a second copy of all_objects – and it’s when translating full data sets, while trying to deal with problems of referential integrity and correlation, that the time disappears when masking data.

It is a minor detail of this code that it produced fixed length strings (which matched the structure of the original client data). Had I felt the urge I might have used something like: dbms_random.string(‘U’,trunc(dbms_random.value(4,21))) to give me a random distribution of string lengths between 4 and 20. Getting fussier I might have extracted the distinct values for object_type and then generated a random string that matched the length of the value it was due to replace. Fussier still I might have generated the right number of random strings matching the length of the longest value, sorted the original and random values into alphabetical order to align them, then trimmed each random value to the length of the corresponding original value.

It’s extraordinary how complicated it can be to mask data realistically – even when you’re looking at just one column in one table. And here’s a related thought – if an important type of predicate in the original application with the original data is where object_type like ‘PACK%’ how do you ensure that your masked data is consistent with the data that would be returned by this query and how do you determine the value to use instead of “PACK” as the critical input when you run the critical queries against the masked data ? (Being privileged may give you part of the answer, but bear in mind that the people doing the testing with that data shouldn’t be able to see the unmasked data or any translation tables.)

 

 

 

July 27, 2015

Subquery Factoring (10)

Filed under: Bugs,CBO,Oracle,Subquery Factoring,Troubleshooting — Jonathan Lewis @ 1:26 pm GMT Jul 27,2015

What prompted me to write my previous note about subquerying was an upgrade to 12c, and a check that a few critical queries would not do something nasty on the upgrade. As ever, it’s always interesting how many little oddities you can discover while looking closely at some little detail of how the optimizer works. Here’s an oddity that came up in the course of my investigation of 12.1.0.2 – first some sample data:


create table t1
nologging
as
select * from all_objects;

create index t1_i1 on t1(owner) compress nologging;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1 for columns owner size 254'
        );
end;
/

The all_objects view is convenient as a tool for modelling what I wanted to do since it has a column with a small number of distinct values and an extreme skew across those values. Here’s a slightly weird query that shows an odd costing effect:


with v1 as (
        select /*+ inline */ owner from t1 where owner > 'A'
)
select count(*) from v1 where owner = 'SYS'
union all
select count(*) from v1 where owner = 'SYSTEM'
;

Since the query uses the factored subquery twice and there’s a predicate on the subquery definition, I expect to see materialization as the default, and that’s what happened (even though I’ve engineered the query so that materialization is more expensive than executing inline). Here are the two plans from 12.1.0.2 (the same pattern appears in 11.2.0.4, though the costs are a little less across the board):


=======================
Unhinted (materializes)
=======================

---------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                            |     2 |   132 |    25  (20)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION |                            |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D661B_876C2CB |       |       |            |          |
|*  3 |    INDEX FAST FULL SCAN    | T1_I1                      | 85084 |   498K|    21  (15)| 00:00:01 |
|   4 |   UNION-ALL                |                            |       |       |            |          |
|   5 |    SORT AGGREGATE          |                            |     1 |    66 |            |          |
|*  6 |     VIEW                   |                            | 85084 |  5483K|    13  (24)| 00:00:01 |
|   7 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D661B_876C2CB | 85084 |   498K|    13  (24)| 00:00:01 |
|   8 |    SORT AGGREGATE          |                            |     1 |    66 |            |          |
|*  9 |     VIEW                   |                            | 85084 |  5483K|    13  (24)| 00:00:01 |
|  10 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D661B_876C2CB | 85084 |   498K|    13  (24)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

=============
Forced inline
=============

--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |     2 |    12 |    22  (14)| 00:00:01 |
|   1 |  UNION-ALL             |       |       |       |            |          |
|   2 |   SORT AGGREGATE       |       |     1 |     6 |            |          |
|*  3 |    INDEX FAST FULL SCAN| T1_I1 | 38784 |   227K|    21  (15)| 00:00:01 |
|   4 |   SORT AGGREGATE       |       |     1 |     6 |            |          |
|*  5 |    INDEX RANGE SCAN    | T1_I1 |   551 |  3306 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------
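
For reference, the “unhinted” version that produced the first of these plans is simply the same query with the /*+ inline */ hint removed:

with v1 as (
        select owner from t1 where owner > 'A'
)
select count(*) from v1 where owner = 'SYS'
union all
select count(*) from v1 where owner = 'SYSTEM'
;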

I’m not surprised that the optimizer materialized the subquery – as I pointed out in my previous article, the choice seems to be rule-based (heuristic) rather than cost-based. What surprises me is that the cost for the default plan is not self-consistent – the optimizer seems to have lost the cost of generating the temporary table. The cost of the materialized query plan looks as if it ought to be 21 + 13 + 13 = 47. Even if the optimizer were smart enough to assume that the temporary table would be in the cache for the second scan (and therefore virtually free to access) we ought to see a cost of 21 + 13 = 34. As it is we have a cost of 25, which is 13 + 13 (or, if you check the 10053 trace file, 12.65 + 12.65, rounded).

Since the choice to materialize doesn’t seem to be cost-based (at present) this doesn’t really matter – but it’s always nice to see, and be able to understand, self-consistent figures in an execution plan.

Footnote

It is worth pointing out as a side note that materialization can actually be more expensive than running in-line, even for very simple examples. Subquery factoring seems to have become more robust over recent releases in terms of the consistency of execution plans when the subqueries are put back inline, but you still need to think a little bit before rewriting a query for cosmetic (i.e. totally valid “readability”) reasons, just to check whether the resulting query is going to produce an unexpected, and unexpectedly expensive, materialization.

July 17, 2015

Descending Indexes

Filed under: CBO,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 8:42 am GMT Jul 17,2015

I’ve written about optimizer defects with descending indexes before now but a problem came up on the OTN database forum a few days ago that made me decide to look very closely at an example where the arithmetic was clearly defective. The problem revolves around a table with two indexes, one on a date column (TH_UPDATE_TIMESTAMP) and the other a compound index which starts with the same column in descending order (TH_UPDATE_TIMESTAMP DESC, TH_TXN_CODE). The optimizer was picking the “descending” index in cases where it was clearly the wrong index (even after the statistics had been refreshed and all the usual errors relating to date-based indexes had been discounted). Here’s an execution plan from the system which shows that there’s something wrong with the optimizer:


SELECT COUNT(*) FROM TXN_HEADER WHERE TH_UPDATE_TIMESTAMP BETWEEN SYSDATE-30 AND SYSDATE;

---------------------------------------------------------------------------------------
| Id  | Operation          | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                  |       |       |     4 (100)|          |
|   1 |  SORT AGGREGATE    |                  |     1 |     8 |            |          |
|*  2 |   FILTER           |                  |       |       |            |          |
|*  3 |    INDEX RANGE SCAN| TXN_HEADER_IDX17 |  1083K|  8462K|     4   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

There are two clues here: first, Oracle has used the (larger) two-column index when the single column is (almost guaranteed to be) the better choice simply because it will be smaller; secondly, we have a cost of 4 to acquire 1M rowids through an (implicitly b-tree) index range scan, which would require at least 250,000 index entries per leaf block to keep the cost that low (or 2,500 if you set the optimizer_index_cost_adj to 1; so it might just be possible if you had a 32KB block size).

The OP worked hard to supply information we asked for, and to follow up any suggestions we made; in the course of this we got the information that the table had about 90M rows and this “timestamp” column had about 45M distinct values ranging from 6th Sept 2012 to 2nd July 2015 with no nulls.

Based on this information I modelled the problem in an instance of 11.2.0.4 (the OP was using 11.2.0.3).


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                                          id,
        to_date('06-SEP-2012 15:13:00','dd-mon-yyyy hh24:mi:ss') +
                trunc((rownum - 1 )/4) / (24 * 60)                      upd_timestamp,
        mod(rownum - 1,10)                                              txn_code,
        rpad('x',50)                                                    padding
from
        generator       v1,
        generator       v2
where
        rownum <= 4 * 1030  *  24 * 60
;

create index t1_asc on t1(upd_timestamp) nologging;
create index t1_desc on t1(upd_timestamp desc) nologging;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1',
                cascade          => false
        );
end;
/

My data set has 4 rows per minute from 6th Sept 2012 to 3rd July 2015, with both an ascending and descending index on the upd_timestamp column. For reference, here are the statistics about the two indexes:


INDEX_NAME               SAMPLE     BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- ---------- ---------- ----------- -----------------
T1_DESC                 5932800          2       17379            154559
T1_ASC                  5932800          2       15737             59825

The ascending index is smaller with a significantly better clustering_factor, so for queries where either index would be a viable option the ascending index is the one (we think) that the optimizer should choose. In passing, the 5.9M index entries is exactly the number of rows in the table – these stats were computed automatically as the indexes were created.

Here’s a simple query with execution plan (with rowsource execution stats):

select max(id) from t1 where upd_timestamp between
to_date('01-jun-2015','dd-mon-yyyy')                and
to_date('30-jun-2015','dd-mon-yyyy')

---------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |      1 |        |    29 (100)|      1 |00:00:01.29 |    4710 |
|   1 |  SORT AGGREGATE              |         |      1 |      1 |            |      1 |00:00:01.29 |    4710 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1      |      1 |    167K|    29   (0)|    167K|00:00:00.98 |    4710 |
|*  3 |    INDEX RANGE SCAN          | T1_DESC |      1 |    885 |     5   (0)|    167K|00:00:00.35 |     492 |
---------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."SYS_NC00005$">=HEXTORAW('878CF9E1FEF8FEFAFF')  AND
              "T1"."SYS_NC00005$"<=HEXTORAW('878CF9FEF8FEF8FF') )
       filter((SYS_OP_UNDESCEND("T1"."SYS_NC00005$")>=TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND SYS_OP_UNDESCEND("T1"."SYS_NC00005$")<=TO_DATE(' 2015-06-30 00:00:00',
              'syyyy-mm-dd hh24:mi:ss')))

The optimizer’s index estimate of 885 rowids is a long way off the actual rowcount of 167,000 – even though I have perfect stats describing uniform data and the query is simple enough that the optimizer should be able to get a good estimate. Mind you, the predicate section looks a lot messier than you might expect, and the table estimate is correct (though not consistent with the index estimate, of course).

Here’s the plan I get when I make the descending index invisible:


--------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |      1 |        |  2146 (100)|      1 |00:00:01.25 |    2134 |
|   1 |  SORT AGGREGATE              |        |      1 |      1 |            |      1 |00:00:01.25 |    2134 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1     |      1 |    167K|  2146   (1)|    167K|00:00:00.94 |    2134 |
|*  3 |    INDEX RANGE SCAN          | T1_ASC |      1 |    167K|   453   (2)|    167K|00:00:00.31 |     446 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("UPD_TIMESTAMP">=TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "UPD_TIMESTAMP"<=TO_DATE(' 2015-06-30 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

With the descending index invisible we can see that the cardinality estimate for the ascending index is correct, and notice how much higher this makes the cost. It’s not surprising that the optimizer picked the wrong index with such a difference in cost. The question now is why the optimizer got the index cardinality (i.e. selectivity, hence cost) so badly wrong.

The answer is that the optimizer has made the same mistake that applications make when they store dates as character strings. It’s using the normal range-based calculation for the sys_op_descend() values recorded against the virtual column and has lost all understanding of the fact that these values represent dates. I can demonstrate this most easily by creating one more table, inserting a couple of rows, gathering stats, and showing you what the internal storage of a couple of “descending” dates looks like.


drop table t2;

create table t2 (d1 date);
create index t2_i1 on t2(d1 desc);

insert into t2 values('01-Jun-2015');
insert into t2 values('30-Jun-2015');
commit;

select d1, sys_nc00002$ from t2;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T2',
                method_opt       => 'for all columns size 1',
                cascade          => false
        );
end;
/

column endpoint_value format 999,999,999,999,999,999,999,999,999,999,999,999
break on table_name skip 1

select
        table_name, column_name, endpoint_number, endpoint_value, to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') end_hex
from
        user_tab_histograms
where
        table_name in ('T1','T2')
and     column_name like 'SYS%'
order by
        table_name, column_name, endpoint_number
;

I’ve put the dates 1st June 2015 and 30th June 2015 into the table because those were the values I used in the where clause of my query. Here are the results showing the internal representation of the (descending) index column and the stored low and high values for the virtual columns in both t1 and t2.


D1        SYS_NC00002$
--------- ------------------------
01-JUN-15 878CF9FEF8FEF8FF
30-JUN-15 878CF9E1FEF8FEFAFF

TABLE_NAME           COLUMN_NAME          ENDPOINT_NUMBER                                   ENDPOINT_VALUE END_HEX
-------------------- -------------------- --------------- ------------------------------------------------ ---------------------------------
T1                   SYS_NC00005$                       0  703,819,340,111,320,000,000,000,000,000,000,000    878CF8FCEFF30BCEBC8823F7000000
                     SYS_NC00005$                       1  703,880,027,955,346,000,000,000,000,000,000,000    878FF6F9EFF20256F3B404F7400000

T2                   SYS_NC00002$                       0  703,819,411,001,549,000,000,000,000,000,000,000    878CF9E1FEF8F24C1C7CED46200000
                     SYS_NC00002$                       1  703,819,419,969,389,000,000,000,000,000,000,000    878CF9FEF8FEEED28AC5B22A200000

If you check the t2 data (sys_nc00002$) values against the end_hex values in the histogram data you’ll see they match up to the first 6 bytes (12 digits). Oracle has done its standard processing – take the first 6 bytes of the column, convert to decimal, round to the most significant 15 digits (technically round(value, -21)), convert and store the result as hex.
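
If you want to see that rounding in action, here’s a little sketch that tries to reproduce the stored high value for t2 from the raw column value – with the assumption (mine, not stated above) that the raw value is padded to 15 bytes and treated as a base-256 number before the round(value,-21) is applied:

select
        round(
                to_number(
                        rpad('878CF9FEF8FEF8FF', 30, '0'),     -- the 30-Jun value, padded to 15 bytes
                        'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
                ),
                -21
        )       approx_endpoint_value
from
        dual
;

If that assumption about the mechanism is right, the result should be the 703,819,419,969,389,000,000,000,000,000,000,000 reported as the t2 high value above.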

So let’s do some arithmetic. The selectivity of a range scan that is in-bounds is (informally): “range we want” / “total range”. I’ve set up t2 to show us the values we will need to calculate the range we want, and I’ve reported the t1 values to allow us to calculate the total range, we just have to subtract the lower value (endpoint number 0) from the higher value for the two sys_nc0000N$ columns. So (ignoring the 21 zeros everywhere) our selectivity is:

  • (703,819,419,969,389 – 703,819,411,001,549) / ( 703,880,027,955,346 – 703,819,340,111,320) = 0.00014777
  • We have 5.9M rows in the table, so the cardinality estimate should be about: 5,932,800 * 0.00014777 = 876.69

The actual cardinality estimate was 885 – but we haven’t allowed for the exact form of the range-based predicate: we should add 1/num_distinct at both ends because the range is a closed range (greater than or EQUAL to, less than or EQUAL to) – which takes the cardinality estimate up to 884.69 which rounds to the 885 that the optimizer produced.
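
Just to confirm the arithmetic, here’s the same calculation done in SQL – the only number I’ve introduced is the 1,483,200 for num_distinct, which is my inference from the data definition (5,932,800 rows at 4 rows per timestamp):

select
        round(
                5932800 * (
                          (703819419969389 - 703819411001549)
                        / (703880027955346 - 703819340111320)
                        + 2 / 1483200
                ),
                2
        )       index_cardinality_estimate
from
        dual
;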

Conclusion

This note shows that Oracle is (at least for dates) losing information about the underlying data when dealing with the virtual columns associated with descending columns in indexes. As demonstrated that can lead to extremely bad selectivity estimates for predicates based on the original columns, and these selectivity estimates make it possible for Oracle to choose the wrong index and introduce a spuriously low cost into the calculation of the full execution plan.

Footnote

This note seems to match bug note 11072246 “Wrong Cardinality estimations for columns with DESC indexes”, but that bug is reported as fixed in 12.1, while this test case reproduces in 12.1.0.2.

July 6, 2015

SQL vs. PL/SQL

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 10:23 am GMT Jul 6,2015

Which piece of code will be faster (clue – the table in question has no indexes):

Option 1 – pure SQL


update join1 set
        data = data||'#'
where   key_1=500
and     key_2=23851
and     key_3=57012
and     key_4=521
and     key_6=1
and     key_7=23352
;

Option 2 – a silly PL/SQL row by row approach:


declare
        type rowid_type is table of urowid index by binary_integer;
        tab_rowid           rowid_type;  

        lv_rows_updated     number :=0;  

        cursor my_cursor is
                select  rowid rid
                from    join1
                where   key_1=500
                and     key_2=23851
                and     key_3=57012
                and     key_4=521
                and     key_6=1
                and     key_7=23352
        ;

begin
        open my_cursor;

        -- We know that the number of rows to be updated is very small
        fetch my_cursor bulk collect into tab_rowid limit 10000;

        forall lv_row in tab_rowid.first .. tab_rowid.last
             update join1 set data = data||'#' where  rowid = tab_rowid(lv_row);

        lv_rows_updated := sql%rowcount;
        close my_cursor;
end;
/

It’s a trick question, of course, and although the automatic response from any DBA-type is likely to be “the SQL”, the correct answer is (as so often) “it depends”.

This question appeared as a problem on the OTN database forum a few days ago. In its original form it asked why a select statement should be much faster than a select for update or an update – even though the volume identified and updated was very small (just one row in 100M). The note then went on to show that using PL/SQL to select the rowids of the target rows then doing the bulk update by rowid was faster than the basic SQL update. The answer didn’t spring to mind immediately; but fortunately someone asked for some run-time statistics (v$sesstat) and the supplied statistics told me what was going on.

Conveniently the OP gave us the code to recreate the test case – all 100M rows of it; I cut this back to 16M rows (ca. 1.5GB of disc space), and then ran the tests with my db_cache_size set to 256MB (another clue). I got similar results to the OP – not so dramatic, but the PL/SQL ran faster than the SQL and the difference was due to an obvious change in the CPU usage.

If you haven’t guessed from the clue in the 256MB db_cache_size (which means the table is more than 5 times the size of the cache), the answer is “serial direct path reads”. For a sufficiently large table (and that’s not easy to define – start here and follow a few links) it’s fairly common knowledge that from 11g a tablescan can use a serial direct path read, and that’s what the PL/SQL was doing to select the required rowids. However, here’s a detail that’s not often mentioned: an update has to take place in public where everyone can see it so when Oracle executed the simple SQL update or select for update statement it had to scan the table through the buffer cache. Pulling all those blocks into the buffer cache, grabbing latches to link them to the right cache buffers chains, pinning them, then unpinning them uses a lot of CPU – which isn’t needed for the direct path read. The PL/SQL with its pure select used far less CPU than the basic SQL with its update/select for update, and because the OP had a very high-powered machine with plenty of CPU and loads of (disc-)caching effects all over the place the difference in CPU time was extremely visible as a fraction of the total DB time.
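
If you want to check for the same effect on your own system, a quick before-and-after snapshot of a few session statistics will show whether a tablescan went through the buffer cache or used direct path reads – a minimal sketch (the statistic names are the standard ones you’ll find in v$statname):

select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in (
                'CPU used by this session',
                'session logical reads',
                'physical reads',
                'physical reads direct'
        )
order by
        sn.name
;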

This was, inevitably, a very special case where a little detail became a significant fraction of the workload. The OP scanned 100M rows to update 1 row (in 7 – 13 seconds!). This doesn’t sound like a strategy you would normally want to adopt for frequent use; and for occasional use we might be happy to use the slower (13 second) approach to avoid the coding requirement of the fast (7 second) solution.

Footnote:

It’s worth pointing out that the PL/SQL strategy is not safe. In the few seconds between the select statement starting and the row being identified and updated by rowid it’s possible that another session could have updated (or deleted) the row. In the former case the update statement is now updating a row which doesn’t match the specification; in the latter case the code will raise an exception.

We can make the PL/SQL safer by including the original predicates in the update statement – but that still leaves the question of what the code should do if the select statement finds a row and the update fails to update it. Should it, perhaps, assume that there is still a row in the table that needs an update and re-run (using up all the time you saved by adopting a PL/SQL solution)?
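
As a minimal sketch (not a recommendation) of that “safer” variant, the rowid-based update would simply repeat the original predicates, so a row that has been changed by another session between the fetch and the update no longer qualifies:

declare
        type rowid_type is table of urowid index by binary_integer;
        tab_rowid           rowid_type;

        lv_rows_updated     number := 0;

        cursor my_cursor is
                select  rowid rid
                from    join1
                where   key_1=500
                and     key_2=23851
                and     key_3=57012
                and     key_4=521
                and     key_6=1
                and     key_7=23352
        ;

begin
        open my_cursor;

        -- We know that the number of rows to be updated is very small
        fetch my_cursor bulk collect into tab_rowid limit 10000;

        -- Repeat the original predicates so that a row changed by another
        -- session between the fetch and the update no longer qualifies.
        forall lv_row in tab_rowid.first .. tab_rowid.last
                update  join1
                set     data = data||'#'
                where   rowid = tab_rowid(lv_row)
                and     key_1=500
                and     key_2=23851
                and     key_3=57012
                and     key_4=521
                and     key_6=1
                and     key_7=23352
                ;

        -- If this is lower than tab_rowid.count then at least one target row
        -- changed (or disappeared) under our feet - and we have to decide
        -- whether or not to re-run.
        lv_rows_updated := sql%rowcount;
        close my_cursor;
end;
/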

June 17, 2015

Reverse Key

Filed under: Indexing,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:11 pm GMT Jun 17,2015

A question came up on the OTN database forum recently asking if you could have a partitioned index on a non-partitioned table.

(Aside: I’m not sure whether it would be quicker to read the manuals or try the experiment – either would probably be quicker than posing the question to the forum. As so often happens in these RTFM questions the OP didn’t bother to acknowledge any of the responses)

The answer to the question is yes – you can create a globally partitioned index, though if it uses range partitioning you have to specify a MAXVALUE partition. The interesting thing about the question, though, is that several people tried to guess why it had been asked and then made suggestions based on the most likely guess (and wouldn’t it have been nice to see some response from the OP). The common guess was that there was a performance problem with the high-value block of a sequence-based (or time-based) index – a frequent source of “buffer busy wait” events and other nasty side effects.

Unfortunately too many people suggested reverse key indexes as a solution to this “right-hand” problem. If you’re licensed for partitioning it’s almost certain that a better option would simply be to use global hash partitioning (with 2^N partitions for some N). Using reverse keys can result in a bigger performance problem than the one you’re trying to avoid – you may end up turning a little time spent on buffer busy waits into a large amount of time spent on db file sequential reads. To demonstrate the issue I’ve created a sample script – and adjusted my buffer cache down to the appropriate scale:

create table t1(
	id	not null
)
nologging
as
with generator as (
	select	--+ materialize
		rownum id 
	from dual 
	connect by 
		rownum <= 1e4
)
select
	1e7 + rownum	id
from
	generator	v1,
	generator	v2
where
	rownum <= 1e7 
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 => 'T1'
	);
end;
/

alter table t1 add constraint t1_pk primary key(id) 
using index 
	reverse 
	nologging 
;

alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 8';

begin
	for i in 20000001..20010000 loop
		insert into t1 values(i);
	end loop;
end;
/

I’ve created a table with 10,000,000 rows using a sequential value as the primary key, then inserted “the next” 10,000 rows into the table in order. The index occupied about 22,000 blocks, so to make my demonstration show you the type of effect you could get from a busy production system with more tables and many indexes I ran my test with the buffer cache limited to 6,000 blocks – a fair fraction of the total index size. Here’s a small section of the trace file from the test running 10.2.0.3 on an elderly machine:


WAIT #43: nam='db file sequential read' ela= 13238 file#=6 block#=12653 blocks=1 obj#=63623 tim=3271125590
WAIT #43: nam='db file sequential read' ela=  7360 file#=6 block#=12749 blocks=1 obj#=63623 tim=3271133150
WAIT #43: nam='db file sequential read' ela=  5793 file#=6 block#=12844 blocks=1 obj#=63623 tim=3271139110
WAIT #43: nam='db file sequential read' ela=  5672 file#=6 block#=12940 blocks=1 obj#=63623 tim=3271145028
WAIT #43: nam='db file sequential read' ela= 15748 file#=5 block#=13037 blocks=1 obj#=63623 tim=3271160998
WAIT #43: nam='db file sequential read' ela=  8080 file#=5 block#=13133 blocks=1 obj#=63623 tim=3271169314
WAIT #43: nam='db file sequential read' ela=  8706 file#=5 block#=13228 blocks=1 obj#=63623 tim=3271178240
WAIT #43: nam='db file sequential read' ela=  7919 file#=5 block#=13325 blocks=1 obj#=63623 tim=3271186372
WAIT #43: nam='db file sequential read' ela= 15553 file#=6 block#=13549 blocks=1 obj#=63623 tim=3271202115
WAIT #43: nam='db file sequential read' ela=  7044 file#=6 block#=13644 blocks=1 obj#=63623 tim=3271209420
WAIT #43: nam='db file sequential read' ela=  6062 file#=6 block#=13741 blocks=1 obj#=63623 tim=3271215648
WAIT #43: nam='db file sequential read' ela=  6067 file#=6 block#=13837 blocks=1 obj#=63623 tim=3271221887
WAIT #43: nam='db file sequential read' ela= 11516 file#=5 block#=13932 blocks=1 obj#=63623 tim=3271234852
WAIT #43: nam='db file sequential read' ela=  9295 file#=5 block#=14028 blocks=1 obj#=63623 tim=3271244368
WAIT #43: nam='db file sequential read' ela=  9466 file#=5 block#=14125 blocks=1 obj#=63623 tim=3271254002
WAIT #43: nam='db file sequential read' ela=  7704 file#=5 block#=14221 blocks=1 obj#=63623 tim=3271261991
WAIT #43: nam='db file sequential read' ela= 16319 file#=6 block#=14444 blocks=1 obj#=63623 tim=3271278492
WAIT #43: nam='db file sequential read' ela=  7416 file#=6 block#=14541 blocks=1 obj#=63623 tim=3271286129
WAIT #43: nam='db file sequential read' ela=  5748 file#=6 block#=14637 blocks=1 obj#=63623 tim=3271292163
WAIT #43: nam='db file sequential read' ela=  7131 file#=6 block#=14732 blocks=1 obj#=63623 tim=3271299489
WAIT #43: nam='db file sequential read' ela= 16126 file#=5 block#=14829 blocks=1 obj#=63623 tim=3271315883
WAIT #43: nam='db file sequential read' ela=  7746 file#=5 block#=14925 blocks=1 obj#=63623 tim=3271323845
WAIT #43: nam='db file sequential read' ela=  9208 file#=5 block#=15020 blocks=1 obj#=63623 tim=3271333239
WAIT #43: nam='db file sequential read' ela=  7708 file#=5 block#=15116 blocks=1 obj#=63623 tim=3271341141
WAIT #43: nam='db file sequential read' ela= 15484 file#=6 block#=15341 blocks=1 obj#=63623 tim=3271356807
WAIT #43: nam='db file sequential read' ela=  5488 file#=6 block#=15437 blocks=1 obj#=63623 tim=3271362623
WAIT #43: nam='db file sequential read' ela= 10447 file#=6 block#=15532 blocks=1 obj#=63623 tim=3271373342
WAIT #43: nam='db file sequential read' ela= 12565 file#=6 block#=15629 blocks=1 obj#=63623 tim=3271386741
WAIT #43: nam='db file sequential read' ela= 17168 file#=5 block#=15725 blocks=1 obj#=63623 tim=3271404135
WAIT #43: nam='db file sequential read' ela=  7542 file#=5 block#=15820 blocks=1 obj#=63623 tim=3271411882
WAIT #43: nam='db file sequential read' ela=  9400 file#=5 block#=15917 blocks=1 obj#=63623 tim=3271421514
WAIT #43: nam='db file sequential read' ela=  7804 file#=5 block#=16013 blocks=1 obj#=63623 tim=3271429519
WAIT #43: nam='db file sequential read' ela= 14470 file#=6 block#=16237 blocks=1 obj#=63623 tim=3271444168
WAIT #43: nam='db file sequential read' ela=  5788 file#=6 block#=16333 blocks=1 obj#=63623 tim=3271450154
WAIT #43: nam='db file sequential read' ela=  9630 file#=6 block#=16429 blocks=1 obj#=63623 tim=3271460008
WAIT #43: nam='db file sequential read' ela= 10910 file#=6 block#=16525 blocks=1 obj#=63623 tim=3271471174
WAIT #43: nam='db file sequential read' ela= 15683 file#=5 block#=16620 blocks=1 obj#=63623 tim=3271487065
WAIT #43: nam='db file sequential read' ela=  8094 file#=5 block#=16717 blocks=1 obj#=63623 tim=3271495454
WAIT #43: nam='db file sequential read' ela=  6670 file#=5 block#=16813 blocks=1 obj#=63623 tim=3271502293
WAIT #43: nam='db file sequential read' ela=  7852 file#=5 block#=16908 blocks=1 obj#=63623 tim=3271510360
WAIT #43: nam='db file sequential read' ela= 10500 file#=6 block#=17133 blocks=1 obj#=63623 tim=3271521039
WAIT #43: nam='db file sequential read' ela= 11038 file#=6 block#=17229 blocks=1 obj#=63623 tim=3271532275
WAIT #43: nam='db file sequential read' ela= 12432 file#=6 block#=17325 blocks=1 obj#=63623 tim=3271544974
WAIT #43: nam='db file sequential read' ela=  7784 file#=6 block#=17421 blocks=1 obj#=63623 tim=3271553331
WAIT #43: nam='db file sequential read' ela=  7774 file#=5 block#=17517 blocks=1 obj#=63623 tim=3271561346
WAIT #43: nam='db file sequential read' ela=  6583 file#=5 block#=17613 blocks=1 obj#=63623 tim=3271568146
WAIT #43: nam='db file sequential read' ela=  7901 file#=5 block#=17708 blocks=1 obj#=63623 tim=3271576231
WAIT #43: nam='db file sequential read' ela=  6667 file#=5 block#=17805 blocks=1 obj#=63623 tim=3271583259
WAIT #43: nam='db file sequential read' ela=  9427 file#=6 block#=18029 blocks=1 obj#=63623 tim=3271592988
WAIT #43: nam='db file sequential read' ela= 52334 file#=6 block#=18125 blocks=1 obj#=63623 tim=3271646055
WAIT #43: nam='db file sequential read' ela= 50512 file#=6 block#=18221 blocks=1 obj#=63623 tim=3271697284
WAIT #43: nam='db file sequential read' ela= 10095 file#=6 block#=18317 blocks=1 obj#=63623 tim=3271708095

Check the block number for this list of single block reads – we’re jumping through the index about 100 blocks at a time to read the next block where an index entry has to go. The jumps are the expected (and designed) effect of reverse key indexes: the fact that the jumps turn into physical disc reads is the (possibly unexpected) side effect. Reversing an index makes adjacent values look very different (by reversing the bytes) and go to different index leaf blocks: the purpose of the exercise is to scatter concurrent similar inserts across multiple blocks, but if you scatter the index entries you need to buffer a lot more of the index to keep the most recently used values in memory. Reversing the index may eliminate buffer busy waits, but it may dramatically increase the time lost on db file sequential reads.

Here’s a short list of interesting statistics from this test (this time running on 11.2.0.4 on a machine with SSDs) comparing the effects of reversing the index with those of not reversing the index – normal index first:


Normal index
------------
CPU used by this session               83
DB time                                97
db block gets                      40,732
physical reads                         51
db block changes                   40,657
redo entries                       20,174
redo size                       5,091,436
undo change vector size         1,649,648

Repeat with reverse key index
-----------------------------
CPU used by this session              115
DB time                               121
db block gets                      40,504
physical reads                     10,006
db block changes                   40,295
redo entries                       19,973
redo size                       4,974,820
undo change vector size         1,639,232

Because of the SSDs there’s little difference in timing between the two sets of data and, in fact, all the other measures of work done are very similar except for the physical reads, and the increase in reads is probably the cause of the extra CPU time thanks to both the LRU manipulation and the interaction with the operating system.

If you want to check the effect of index reversal you can take advantage of the sys_op_lbid() function to sample a little of your data – in my case I’ve queried the last 10,000 rows (values) in the table:


select 
	/*+ 
		cursor_sharing_exact 
		dynamic_sampling(0) 
		no_monitoring 
		no_expand 
		index_ffs(t1,t1_i1) 
		noparallel_index(t,t1_i1) 
	*/ 
	count (distinct sys_op_lbid( &m_ind_id ,'L',t1.rowid)) as leaf_blocks
from 
	t1
where 
	id between 2e7 + 1 and 2e7 + 1e4
;

The &m_ind_id substitution variable is the object_id of the index t1_i1.
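
If you want to run the query yourself, something like the following will populate the substitution variable in SQL*Plus – a sketch only: in this test the primary key index will have been given the constraint name t1_pk, so adjust the object_name (and the index name in the hints above) to match whichever index you’re checking:

column m_ind_id new_value m_ind_id

select
        object_id       m_ind_id
from
        user_objects
where
        object_type = 'INDEX'
and     object_name = 'T1_PK'
;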

In my case, with an index of 22,300 leaf blocks, my 10,000 consecutive values were scattered over 9,923 leaf blocks. If I want access to “recent data” to be as efficient as possible I need to keep that many blocks of the index cached, compared to an (absolute) worst case for my data of 100 leaf blocks. When you reverse key an index you have to think about how much bigger you have to make your buffer cache to keep the performance constant.
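
For comparison, here’s a hypothetical sketch (not something I ran as part of this test) of the global hash partitioning alternative mentioned at the start of this note – the same primary key, but with the index hash partitioned into 16 (i.e. 2^4) partitions rather than reversed; remember that this needs the partitioning licence:

alter table t1 add constraint t1_pk primary key(id)
using index (
        create unique index t1_pk on t1(id)
        global partition by hash (id) partitions 16
)
;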

June 2, 2015

Predicate Order

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 7:10 pm GMT Jun 2,2015

A recent OTN post demonstrated a very important point about looking at execution plans – especially when you don’t use the right data types. The question was:

We’ve this query which throws invalid number

SELECT * FROM table A
WHERE A.corporate_id IN (59375,54387) AND TRUNC(created_dt) BETWEEN '19-DEC-14' AND '25-DEC-14';

However it works fine if we use not in instead of in

SELECT * FROM table A  
WHERE A.corporate_id  NOT IN (59375,54387) AND TRUNC(created_dt) BETWEEN '19-DEC-14' AND '25-DEC-14';

Please assist.

A follow-up post told us that corporate_id was a varchar() type – so the root cause of the ORA-01722: invalid number error is simply that you shouldn’t be mixing data types. Either the corporate_id should have been defined as numeric or the in-list should have been a list of varchar2() values. (And, of course, the character strings that look like dates should have been converted explicitly to date data types using either the to_date() function with a 4-digit year or the date ‘yyyy-mm-dd’ syntax; and using “created_dt >=  19th Dec and created_dt < 26th Dec” would have given the optimizer a chance to get a better cardinality estimate)
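
Putting those corrections together, the query the OP probably wanted looks something like this – a sketch only, since “table A” is the OP’s placeholder and I’ve had to invent a table name for it:

select  *
from    table_a a
where   a.corporate_id in ('59375', '54387')
and     a.created_dt >= date '2014-12-19'
and     a.created_dt <  date '2014-12-26'
;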

The answer to the slightly more specific problem – why does changing NOT IN to IN allow the query to run rather than crashing – is (probably) one that I first addressed in an article in Oracle Magazine just over eleven years ago: with CPU costing enabled Oracle can change the order in which it applies filter predicates to a table. It’s also a question that can easily be answered by my commonest response to many of the optimizer questions that appear on OTN – look at the execution plan.

In this example it’s a fairly safe bet that there’s a reasonably small volume of data (according to the optimizer’s estimate) where to_number(corporate_id) is one of the required values, and a much larger volume of data where it is not; with some intermediate volume of data where the created_dt falls in the required date range. With CPU costing enabled (optional in 9i, enabled by default in 10g) the optimizer would then do some arithmetic to calculate the most cost-effective order of applying the filter predicates, based on things like: the number of CPU cycles it takes to walk along a row to find a particular column; the number of CPU cycles it takes to convert a character column to a number and compare it with a number; the number of CPU cycles it takes to truncate a date column and compare it with a string; and the number of rows that would pass the numeric test (hence requiring the first-applied date test) compared with the number of rows that would survive the first-applied date test (hence requiring either the second date test or the numeric test to take place).

Here’s some code to demonstrate the point. It may require the system stats to be set to particular values to ensure that it is repeatable, but there’s probably some flexibility in the range, which is why I’ve called dbms_stats.set_system_stats() in the first few lines:

drop table t1 purge;

create table t1 (
        v1      varchar2(10),
        d1      date
)
;

insert into t1 values(1,'01-Jan-2015');

insert into t1 values('x','02-Jan-2015');

insert into t1 values(3,'03-Jan-2015');
insert into t1 values(4,'04-Jan-2015');
insert into t1 values(5,'05-Jan-2015');
insert into t1 values(6,'06-Jan-2015');
insert into t1 values(7,'07-Jan-2015');
insert into t1 values(8,'08-Jan-2015');
insert into t1 values(9,'09-Jan-2015');
insert into t1 values(10,'10-Jan-2015');

execute dbms_stats.gather_table_stats(user,'t1');

First we create a table, load some data, and gather stats. You’ll notice that I’ve got a varchar2(10) column into which I’ve inserted numbers for all rows except one where it holds the value ‘x’. Now we just run some code to check the execution plans for a couple of queries.


explain plan for
select
        *
from    t1
where   v1 in (4,6)
and     d1 between '03-Jan-2015' and '09-Jan-2015'
;

select * from table(dbms_xplan.display);

explain plan for
select
        *
from    t1
where   v1 not in (4,6)
and     d1 between '03-Jan-2015' and '&1-Jan-2015'
;

select * from table(dbms_xplan.display);

As with the original question I’ve taken a query with an IN operator and changed it to NOT IN. The in-list is numeric even though the relevant column is varchar2(10). The first query crashes with ORA-01722: invalid number, the second one runs and returns the correct result. You’ll notice, of course, that the “bad” value for v1 is not in the set of rows
where d1 is between 3rd and 9th Jan 2015. You’ll also notice that in my code I’ve used &1 for the end day in the query with the NOT IN clause so that I can re-run the query a few times to show the effects of changing the date range. Here are the execution plans – first with the IN clause:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     2 |    20 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     2 |    20 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter((TO_NUMBER("V1")=4 OR TO_NUMBER("V1")=6) AND
              "D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "D1"<=TO_DATE(' 2015-01-09 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

The optimizer arithmetic predicts 2 rows returned using a full tablescan – it doesn’t know the query is going to crash. Notice the predicate information, though. The first predicate says Oracle will attempt to convert v1 to a number and compare it with 4 and then (if the first test fails) with 6. The query will crash as soon as it hits a row with a non-numeric value for v1. In outline, the optimizer has decided that the numeric conversion and test is very cheap (on CPU) and only a few rows will survive to take the more expensive date comparison; whereas either of the (expensive) date comparisons would leave a lot of rows that would still have to be checked with the numeric test. It makes sense to do the numeric comparison first.

Here’s the plan for the query with the NOT IN clause when I set the date range to be 3rd Jan to 7th Jan.


Execution plan for NOT IN:  7th Jan
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     5 |    50 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     5 |    50 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("D1"<=TO_DATE(' 2015-01-07 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND TO_NUMBER("V1")<>4 AND TO_NUMBER("V1")<>6)

The plan is still a full tablescan – there are no indexes available – and the estimated number of rows has gone up to 5. The important thing, though, is the predicate section. In this case the optimizer has decided that the first thing it will apply is the (relatively expensive) predicate “d1 >= 3rd Jan” before worrying about the “NOT IN” numeric predicate. The optimizer has worked out that almost all the data will survive the NOT IN predicate, so it’s not efficient to apply it before using other predicates that eliminate more data.

By a stroke of luck my simple example happened to be a very good example. Here’s what happened when I set the end date to 8th Jan:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     6 |    60 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     6 |    60 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "D1"<=TO_DATE(' 2015-01-08 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND TO_NUMBER("V1")<>4 AND TO_NUMBER("V1")<>6)

The estimated rows has gone up to 6 – but the interesting thing, as before, is the predicate section: in the previous example Oracle did the tests in the order “upper bound”, “lower bound”, “numeric”; in this test it has done “lower bound”, “upper bound”, “numeric”.

And this is what I got when I ran the test with 9th Jan:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     7 |    70 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     7 |    70 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND TO_NUMBER("V1")<>4 AND TO_NUMBER("V1")<>6 AND
              "D1"<=TO_DATE(' 2015-01-09 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Again the estimated rows has gone up by one, but the ever-interesting predicate section now shows the evaluation order as: “lower bound”, “numeric”, “upper bound”.

There are only 6 possible orders for the predicate evaluation for the query with the NOT IN clause, and we’ve seen three of them. Three will fail, three will succeed – and I got all three of the successful orders. It wouldn’t take much fiddling around with the data (careful choice of duplicate values, variation in ranges of low and high values, and so on) and I could find a data set where small changes in the requested date range would allow me to reproduce all six variations. In fact when I changed my nls_date_format to “dd-mon-yy” and used a 2 digit year for testing I got two of the three possible failing predicate evaluation orders – “numeric”, “lower bound”, “higher bound” and “higher bound”, “numeric”, “lower bound” – without changing the data set. (To be able to get all six orders with a single data set I’d probably need a data set where the “bad” v1 value corresponded to a d1 value somewhere near the middle of the d1 range.)

The bottom line – use the correct data types; make sure your date literals are dates with a 4-digit year; check the predicate section to see if the optimizer did any implicit conversions with your predicates and what order it used them in. If you don’t do this you may find that a query can work perfectly for months, then crash because you finally got unlucky with the arithmetic.

May 21, 2015

Understanding SQL

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 6:12 pm GMT May 21,2015

From time to time someone publishes a query on the OTN database forum and asks how to make it go faster, and you look at it and think it’s a nice example to explain a couple of principles because it’s short, easy to understand, obvious what sort of things might be wrong, and easy to fix. Then, after you’ve made a couple of suggestions and explained a couple of ideas the provider simply fades into the distance and doesn’t tell you any more about the query, or whether they’ve taken advantage of your advice, or found some other way to address the problem.

Such a query, with its execution plan, appeared a couple of weeks ago:

UPDATE FACETS_CUSTOM.MMR_DTL
SET
	CAPITN_PRCS_IND = 2,
	FIL_RUN_DT = Current_fil_run_dt,
	ROW_UPDT_DT = dta_cltn_end_dttm
WHERE
	CAPITN_PRCS_IND = 5
AND	HSPC_IND ='Y'
AND	EXISTS (
		SELECT	1
		FROM	FACETS_STAGE.CRME_FUND_DTL_STG STG_CRME
		WHERE	STG_CRME.MBR_CK = MMR_DTL.MBRSHP_CK
		AND	MMR_DTL.PMT_MSA_STRT_DT BETWEEN STG_CRME.ERN_FROM_DT AND STG_CRME.ERN_THRU_DT
		AND	STG_CRME.FUND_ID IN ('AAB1', '1AA2', '1BA2', 'AAB2', '1AA3', '1BA3', '1B80', '1A80')
	)
AND	EXISTS (
		SELECT	1
		FROM	FACETS_CUSTOM.FCTS_TMS_MBRID_XWLK XWLK
		WHERE	XWLK.MBR_CK = MMR_DTL.MBRSHP_CK
		AND	MMR_DTL.PMT_MSA_STRT_DT BETWEEN XWLK.HSPC_EVNT_EFF_DT AND XWLK.HSPC_EVNT_TERM_DT
	)
;

-------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT              |                       |     1 |   148 | 12431   (2)| 00:02:30 |
|   1 |  UPDATE                       | MMR_DTL               |       |       |            |          |
|   2 |   NESTED LOOPS SEMI           |                       |     1 |   148 | 12431   (2)| 00:02:30 |
|*  3 |    HASH JOIN RIGHT SEMI       |                       |    49 |  5488 | 12375   (2)| 00:02:29 |
|   4 |     TABLE ACCESS FULL         | FCTS_TMS_MBRID_XWLK   |  6494 | 64940 |    24   (0)| 00:00:01 |
|*  5 |     TABLE ACCESS FULL         | MMR_DTL               |   304K|    29M| 12347   (2)| 00:02:29 |
|*  6 |    TABLE ACCESS BY INDEX ROWID| CRME_FUND_DTL_STG     |     1 |    36 |     5   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN          | IE1_CRME_FUND_DTL_STG |     8 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("XWLK"."MBR_CK"="MMR_DTL"."MBRSHP_CK")
       filter("XWLK"."HSPC_EVNT_EFF_DT"<=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT") AND
              "XWLK"."HSPC_EVNT_TERM_DT">=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT"))
   5 - filter("CAPITN_PRCS_IND"=5 AND "HSPC_IND"='Y')
   6 - filter(("STG_CRME"."FUND_ID"='1A80' OR "STG_CRME"."FUND_ID"='1AA2' OR
              "STG_CRME"."FUND_ID"='1AA3' OR "STG_CRME"."FUND_ID"='1B80' OR "STG_CRME"."FUND_ID"='1BA2' OR
              "STG_CRME"."FUND_ID"='1BA3' OR "STG_CRME"."FUND_ID"='AAB1' OR "STG_CRME"."FUND_ID"='AAB2') AND
              "STG_CRME"."ERN_FROM_DT"<=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT") AND
              "STG_CRME"."ERN_THRU_DT">=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT"))
   7 - access("STG_CRME"."MBR_CK"="MMR_DTL"."MBRSHP_CK")

The most informative bit of narrative that went with this query said:

“The table MMR_DTL doesnt have index on these columns CAPITN_PRCS_IND , HSPC_IND .. Since this is an update stmt which will update 85k records, In a dilema whether to index these columns or not .. And this table MMR_DTL is an ever growing table. Worried about the update performance. “

This was in response to an observation that there was a full tablescan on MMR_DTL at operation 5 despite the predicate “CAPITN_PRCS_IND”=5 AND “HSPC_IND”=’Y’. You’ll note that the predicted cardinality for that scan is 304K and the update statement is going to change CAPITN_PRCS_IND from the value 5 to the value 2 – so it’s not entirely unreasonable to be worried about the impact of creating an index that included the column capitn_prcs_ind.

What more can we say about this query, given the limited information? Quite a lot – unfortunately the owner of the query isn’t giving anything else away.

I’m going to leave this note unfinished to give people a little chance to think about the clues in the request, the questions they might ask, reasons why there might be a performance problem, and strategies they might investigate, then I’ll update the posting with a few ideas some time in the next 24 hours.

Update 1 – 24th May

There are so many ideas that spring up from a small amount of information that it’s very hard to write a concise and coherent description of what you’ve noticed, when and how far you pursued it, and how relevant the ideas might be to the problem in hand – especially when most of the thoughts require you to ask for more information. Unfortunately the time I had available to write this note has just disappeared, so I’m just going to have to complete it in rapidly written installments. The first bit is an outline of the immediate response I had to the initial presentation of the problem and the execution plan that went with it.

The only comment from the OP on this statement and plan was: “I couldn’t optimize this query for better performance and optimized cost.. Can some one guide me on this.”

We have no idea how many rows would be updated, how long it took, or how long the OP thinks it ought to take; it’s not until a subsequent post that we learn that the number of rows targeted for update is 85,000 – which tells us that the optimizer has run into some problems with its cardinality estimates. This suggests that IF there’s a serious performance problem then POSSIBLY there’s a better execution plan and we might get the optimizer to find it automatically if we could tell it how to adjust its cardinality estimates. It would be nice, however, to know where the query spent its time (i.e. can we re-run it with rowsource execution stats or monitoring enabled, and see the actual run-time work in the plan).

If it took a couple of minutes to update that 85,000 rows, I probably wouldn’t want to spend time making it go faster; if it took 2 hours, of which 1 hour 50 minutes was spent waiting for a transaction (row) lock then I’d want to look at why the update collision could happen and see if that problem could be avoided – it might then be the case that the last 10 minutes was spent rolling back and restarting an update that ought to have taken 2 minutes “in vacuo”. Underlying all this, I would want to be sure (as I’ve implicitly, and I think reasonably, assumed) that it’s an update that runs only occasionally, perhaps once per day or once per week.

In the absence of hard information – let’s resort to a few hypotheticals; looking at the plan itself (and knowing the target 85,000 rows) I am prepared to make a few guesses about the run-time activity.

  1. We build an in-memory hash table from the whole of FCTS_TMS_MBRID_XWLK, a step for which the optimizer ought to be able to give a reasonable cost and cardinality – assuming (as I will from now on) that the basic stats are reasonably accurate.
  2. We scan the (fairly large) MMR_DTL table, applying a couple of filters; again the optimizer ought to do a reasonable job of estimating the cost of such a tablescan, and we might expect a significant fraction of the time to be spent on multiblock (possibly direct path) reads of the table. The cardinality reported is 304,000 but we note there are two predicates and both are for columns which we might guess have a small number of distinct values – one of which we are changing. Perhaps there’s a bad cardinality error there and maybe a couple of single column histograms would help, but maybe column group stats with a histogram on the pair would be even better. I also wonder when (if) HSPC_IND ever changes from Y to N, and contemplate the possibility of creating a function-based index that records ONLY the rows that match this predicate pair (see the note on indexing that will appear some time over the next week further down the page). It’s at this point that we might ask whether the number of rows returned by this scan should be very similar to the number of rows updated, or whether the scan identifies far too many rows and the two existence tests do a lot of work to eliminate the excess and, if the latter, which test should we apply first and how should we apply it.
  3. Having scanned the MMR_DTL we probe the in-memory hash table copy of FCTS_TMS_MBRID_XWLK for the first match, using an equality predicate (which will be the access predicate) and a range-based (filter) predicate which looks as if it is checking that some “start date” is between an “effective date” and a “termination date”. The estimated size of the result set is FAR too small at 49 rows when we know we have to have at least 85,000 rows survive this test; moreover, this tiny estimate comes from inputs of 6,500 and 304,000 rows being joined so we ought to wonder how such a tiny estimate could appear. A possible explanation is that the application has used some extreme dates to represent NULL values. If that’s the case then it’s possible that suitable histograms might help the optimizer recognise the extreme distribution; alternatively virtual columns that change the extreme values back to NULL and a predicate referencing the virtual columns may help.
  4. After estimating the cardinality of the intermediate result so badly, the optimizer decides that the second existence test can be performed as a semi-join using a nested loop. The thing to note here is that the optimizer “knows” that this is an expensive step – the cost of each table access operation is 5 (4 + 1) – but it’s a step that the optimizer “thinks” shouldn’t happen very frequently so the cost is considered acceptable. We know, however, that this step has to execute at least 85,000 times, so the optimizer’s prediction of visiting 4 blocks in the table to identify (on average) 8 rows and discard (on average) 7 of them looks nasty. Again we note that one of the predicates is range-based on a pair of dates – and in this case we might wonder whether or not most of the rows we discard are outside the date range, and whether we ought to consider (as a general point, and not just for this query) whether or not we should add one, the other, or both of the ERN_FROM_DT and ERN_THRU_DT columns to the IE1_CRME_FUND_DTL_STG index. It’s at this point in the query that we are most ignorant of time spent with the present data set and how that time will change in the future as the data in the MMR_DTL table grows – on one hand it’s possible that the rows for each MMR_DTL are widely scattered across the CRME_FUND_DTL_STG and this step could do a lot of random I/O, on the other hand the (assumed) time-dependent nature of the data may mean that the only MMR_DTL rows we look at are recently entered and the associated CRME_FUND_DTL_STG rows are therefore also recently entered and closely clustered – leading to a beneficial “self-caching” effect at the “high” end of the table as the query runs, which introduces an efficiency that the optimizer won’t notice. There is one numerical highlight in this join – we have a cost of 5 for each probe and 49 rows to test, so we might expect the incremental cost of the query to be around 250 (i.e. 49 * 5), but the difference between operations 3 and 2 is only 56 – suggesting that the optimizer does have some “self-caching” information, possibly based on there being a significant difference between the two tables for the number of distinct values of the join column. (See, for example: http://oracle-randolf.blogspot.co.uk/2012/05/nested-loop-join-costing.html )

Update 2 – 25th May

Scalability is a question that should always be considered – and there’s a scalability threat in the information we have so far. The plan shows a full tablescan of the MMR_DTL table, and while tablescans are not necessarily a bad thing we’ve been told that “this table MMR_DTL is an ever growing table”. It’s possible that Oracle can be very quick and efficient when doing the existence tests on the rows it selects from the table – but it is inevitable that the tablescan will take longer to complete as time passes. Whether or not this is likely to matter is something we can’t decide from the information given: we don’t know what fraction of the total time is due to the tablescan, and we don’t know how much larger the table will grow each day (although, taking a wild guess, if we assume that the query runs once per day to manipulate recently arrived data the table would appear to be growing fairly rapidly).

Another scalability detail we ought to ask about is the volume of data that we expect to update each time we run this statement. As time passes do we expect to see the same number of rows waiting to be updated, or are we expecting the business (whatever that may be) to grow steadily, with the number of rows to be updated on each execution increasing by a few percent each month? Our coding strategy may vary depending on the answer to that question – we might, for example, try to pre-empt a future problem by introducing some partitioning now.

The final scalability issue is one I’ve raised already and comes from the CRME_FUND_DTL_STG table. According to the plan there are about 8 rows (the number of rowids reported by the index range scan in operation 7) in this table for each distinct value of MMR_DTL.MBRSHP_CK; if MMR_DTL is large and growing, is CRME_FUND_DTL_STG very large and growing at 8 times the speed – or worse, will there be more rows for each distinct value of MMR_DTL.MBRSHP_CK as time passes? Answers to these questions will help us decide whether we should use a hash join or a nested loop in the join to this table, and how to index the table to minimise random I/O.

Update 3 – 1st June

Since we know that the optimizer has come up with some very bad estimates, and given that the statement is short and simple, I’d be happy to pass the statement to the Tuning Advisor to see what suggestions it makes. At the least it ought to come up with a suggestion for an SQL profile (i.e. a set of opt_estimate() hints) to address the extremely poor cardinality estimates; it’s also likely to make some suggestions about potentially helpful indexes (there’s a sketch of calling the advisor after the list of warnings). I can think of three basic warnings, though:

  • Make sure you are licensed to use the advisor
  • Don’t run the advisor right after you’ve just done the big update – wait until the next big batch of data is ready for update
  • Be cautious about the indexes – the indexing suggestions may point you at the problem, but you may do better by using function-based indexes and modified SQL
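
Subject to those warnings, here’s a minimal sketch of handing the statement to the advisor through dbms_sqltune – the sql_id and task name below are made up purely for illustration:


declare
        l_task  varchar2(64);
begin
        l_task := dbms_sqltune.create_tuning_task(
                sql_id     => 'f1xzkq2w3abcd',          -- hypothetical sql_id of the update
                task_name  => 'mmr_dtl_update_tune',    -- hypothetical task name
                time_limit => 600
        );
        dbms_sqltune.execute_tuning_task(task_name => l_task);
end;
/

select dbms_sqltune.report_tuning_task('mmr_dtl_update_tune') from dual;
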

The most obvious “simple” index to create (assuming a lot of time goes on the tablescan) is one of:


create index dtl_cpi on mmr_dtl(capitn_prcs_ind) compress;
create index dtl_cpi on mmr_dtl(capitn_prcs_ind, hspc_ind) compress;

If almost all the rows with capitn_prcs_ind = 5 also had hspc_ind = ‘Y’ then I’d probably choose the former; if a large number of rows (actually index rowids) could be eliminated before visiting the table I’d choose the latter. I would also make sure I had a frequency – or Top-N frequency in 12c – histogram on capitn_prcs_ind, and may even “fake” it to ensure that an automatic call to gather table stats didn’t do something that made the histogram a threat instead of a benefit (one way of protecting the histograms is sketched below). If I created the two-column index I would also create a frequency histogram on hspc_ind. The benefit of the histograms is based on my assumption that both columns have a very small number of values and a highly skewed distribution. I might go one step further and create a column group on the pair of columns – keeping an eye open for an “out of range” anomaly – if there was a strong degree of correlation between the two columns (in particular if there were relatively few rows with hspc_ind = ‘Y’, but most of those rows also had capitn_prcs_ind = 5).
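
One way of reducing the risk of an automatic stats job wrecking a carefully crafted histogram – my suggestion, not part of the original advice – is to fix the table-level method_opt preference so that any subsequent gather re-creates the histograms we want. A sketch, with assumed bucket sizes:


begin
        dbms_stats.set_table_prefs(
                ownname => user,
                tabname => 'MMR_DTL',
                pname   => 'METHOD_OPT',
                pvalue  => 'for all columns size 1 for columns capitn_prcs_ind size 254, hspc_ind size 254'
        );
end;
/
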

A simple index, though, might get used for other statements where it was not appropriate; there’s also the extra overhead of maintenance as the capitn_prcs_ind value changes from 5 to 2 – to update the index we delete the old entry and insert the new entry. Just to make matters a little worse, when an index has a large number of rows for the same key value it’s fairly common to find that its space utilisation averages about 50% per leaf block. On top of everything else, if 2 is a “final state” value for most of the rows in this very large table then a very large fraction of the index we’re building is probably a total waste of space. So we might look at maximising efficiency and minimising space (and undo and redo wastage) with a function-based index – again with one or two columns as appropriate.


create index dtl_cpi_1 on mmr_dtl(case when capitn_prcs_ind = 5 then 0 end) compress;

create index dtl_cpi_2 on mmr_dtl(case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then 0 end) compress;

execute dbms_stats.gather_table_stats(user, 'mmr_dtl',method_opt=>'for all hidden columns size 1')

Again my choice of index might be affected by the extra benefit of having two columns to eliminate the maximum number of rows from the table, but since the index would be very small either way I’d probably go for the two-column index. Note that once you’ve created a function-based index you have to create stats on the underlying hidden column, which I’ve done a little lazily in this case by simply gathering “for all hidden columns”. I don’t need a histogram since there’s only a single value in the index. Apart from the size benefit, the other efficiency benefit is that when I update the table I only have to worry about deleting a row from the index, I don’t have to insert a replacement row – getting rid of the third undo record and redo vector is likely to save about 30% on the background costs of the index maintenance.

I now have to change the original SQL to match the index definition, of course, so my driving where clause would become one of:

WHERE
	(case when CAPITN_PRCS_IND = 5 then 0 end) = 0
AND	HSPC_IND ='Y'
AND	EXISTS (...

WHERE
	(case when CAPITN_PRCS_IND = 5 AND HSPC_IND ='Y' then 0 end ) = 0
AND	EXISTS (...

Update 4 – 2nd June 2015

At this point we really do need better information to make sensible decisions about what to do next; however, as I indicated earlier on, the hash semi-join to the small table FCTS_TMS_MBRID_XWLK does look like a sensible choice: it’s small, so it’s a good candidate for the build table, and small tables joined to big tables tend to be used to eliminate data early (at least in the case of existence tests).

The final big question is what to do about the table CRME_FUND_DTL_STG, which looks as if it might be quite big (several rows for each MMR_DTL row) and therefore be contributing a lot of random I/O. We don’t really have many options here – and we’ve seen what they are in a previous post about “NOT EXISTS” subqueries. We could try to make the nested loop semi-join more efficient by adding columns to the index so that we can do more filtering in the index; we could consider forcing Oracle to stick with a filter subquery and see if we can introduce some indexing that makes the subquery run a very small number of times; or we could consider the impact of switching to a hash semi-join (possibly switching the build and probe data sets to make it a “hash join right semi”).
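
The first of those options might look something like the following – the column list is my assumption, based on the join and date predicates discussed above, and the index name is made up:


create index ie1_crme_fund_dtl_stg_2 on crme_fund_dtl_stg(
        mbrshp_ck, ern_from_dt, ern_thru_dt
) compress 1;
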

A full tablescan on the CRME_FUND_DTL_STG to do the hash join might be rather expensive, though, so we might look at the significance of the predicate on STG_CRME.FUND_ID which lists 8 different literal values. Perhaps there’s some scope for creating an index on that column if those fund_ids represent a small fraction of the total data; maybe even consider list-partitioning the table (not necessarily one fund per partition) on the FUND_ID.
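
If those eight fund_ids really do represent a small fraction of the data the index itself could be very simple – the name is made up, and the compress is only worth having if the fund_id values repeat heavily:


create index crme_fund_dtl_stg_fund_id on crme_fund_dtl_stg(fund_id) compress;
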

The only other significant “pre-processing” predicate on the CRME_FUND_DTL_STG table is the one that requires MMR_DTL.PMT_MSA_STRT_DT to be between STG_CRME.ERN_FROM_DT and STG_CRME.ERN_THRU_DT; so we might look at how many rows in the table could be eliminated by finding the maximum and minimum values of MMR_DTL.PMT_MSA_STRT_DT and eliminating any rows where STG_CRME.ERN_FROM_DT was greater than the maximum PMT_MSA_STRT_DT or STG_CRME.ERN_THRU_DT was less than the minimum PMT_MSA_STRT_DT.

We might play around with CTEs (subquery factoring / common table expressions) to make this happen – I’m not going to work out the exact SQL, but it might start something like:


with starting_view as (
       select from mmr_detail
       where  ...
       and exists (
                select from FACETS_CUSTOM.FCTS_TMS_MBRID_XWLK XWLK
                where ...
       )
),
minmax_view as (
        select  /*+ materialize */
                min(PMT_MSA_STRT_DT) min_date,
                max(PMT_MSA_STRT_DT) max_date
        from    starting_view
),
crme_fund_view as (
        select
        from
                minmax_view,
                CRME_FUND_DTL_STG
        where   ...
)
select
from
        starting_view   sv,
        crme_fund_view  cv
where
        ...

An alternative strategy to get the minimum and maximum dates more efficiently might be to include them in the function-based index, and then include TWO inline views, or maybe scalar subqueries in the where clause, that depend on the min/max range scan to find the values that can be used to eliminate as much data from CRME_FUND_DTL_STG as early as possible.
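
A very rough sketch of that idea follows. The extended index and the literal values are my guesses (the flag and date expressions simply repeat the predicate pair used earlier), and the scalar subqueries are there to discard CRME_FUND_DTL_STG rows whose date range cannot overlap any of the interesting MMR_DTL rows:


-- Hypothetical two-column function-based index carrying the date, so that the
-- min and max of the interesting PMT_MSA_STRT_DT values can come from an
-- index min/max range scan rather than a tablescan.
create index dtl_cpi_3 on mmr_dtl(
        case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then 0 end,
        case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then pmt_msa_strt_dt end
) compress 1;

-- Scalar subqueries against that index eliminating non-overlapping rows early.
select  stg_crme.*
from    crme_fund_dtl_stg stg_crme
where   stg_crme.ern_from_dt <= (
                select  max(case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then pmt_msa_strt_dt end)
                from    mmr_dtl
                where   (case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then 0 end) = 0
        )
and     stg_crme.ern_thru_dt >= (
                select  min(case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then pmt_msa_strt_dt end)
                from    mmr_dtl
                where   (case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then 0 end) = 0
        )
;
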

Footnote

I think it took me about 10 minutes looking at the original posting to come up with a few thoughts that might be appropriate – but it’s taken five sessions spread over nearly two weeks to write those ideas down. The same difference can appear between theoretical strategy and final action – everything I’ve said revolves around the two simple ideas of “how much data” (absolute and relative) and “how hard do we work to collect it”, so it’s easy to come up with ideas; but it may take time to learn what the data looks like and to find a SAFE way of collecting it efficiently.
