Oracle Scratchpad

February 3, 2016

Hinting

Filed under: Hints,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 1:04 pm BST Feb 3,2016

This is just a little example of thinking about hinting for short-term hacking requirements. It’s the answer to a question that came up on the Oracle-L listserver a couple of months ago (Oct 2015) and is a convenient demonstration of a principle that can often (not ALWAYS) be applied as a response to the problem: “I can make this query work quickly once, how do I make it work quickly when I make it part of a join ?”

The question starts with this query, which returns “immediately” for any one segment:


SELECT DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM dba_extents DE
WHERE 1=1
  AND DE.OWNER           = <owner>
  AND DE.SEGMENT_NAME    = <segment_name>
  AND DE.segment_type    = <segment_type>
  AND DE.tablespace_name = <tablespace_name>
  AND DE.partition_name  = <max_partition_name>
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME
;

But the email then goes on to explain: “I’ve got a table of values that I need to use as a list of inputs for this query, but when I do the join it takes ages to complete; how do I make the join run quickly?”

Here’s the initial code:


WITH SEGMENT_LIST AS
(
  select * from (
   SELECT /*+ materialize cardinality(100) */
           owner, segment_name, segment_type, tablespace_name,
           MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
SELECT
       DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM SEGMENT_LIST SL, dba_extents DE
WHERE 1=1
  AND DE.OWNER           = SL.OWNER
  AND DE.SEGMENT_NAME    = SL.SEGMENT_NAME
  AND DE.segment_type    = SL.segment_type
  AND DE.tablespace_name = SL.tablespace_name
  AND DE.partition_name  = SL.max_partition_name
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME

What we’ve got is a query where the user’s reference table generates a list of segments (the rownum < 2 was a temporary test) and we want the detail query to run for each segment identified. The “for each segment” gives us a clue that what we want to see is a simple nested loop join, driven by the factored subquery, with the very efficient query above running as the “second table of the nested loop”.

What I failed to notice at the time is that the /*+ materialize */ hint was in the wrong place: it should have been placed after the outer (i.e. very first) select, and it’s possible that if it had been in the right place the user would have got the plan they wanted – especially in later versions of Oracle. As it was I suggested that we merely needed to tell the optimizer to:

Visit the “tables” in the order (segment_list, dba_extents) and do a nested loop into dba_extents. Since both segment_list and dba_extents were views we needed to stop Oracle from merging them (and playing silly games with distinct aggregate placement, etc.) by including directives that the views should not be merged; and then we might also need to explain to Oracle that it would have to push the join predicate between segment_list and dba_extents inside the dba_extents view.

In other words, a list of 4 hints, as shown below:


WITH SEGMENT_LIST AS
(
  select * from (
   SELECT /*+ materialize cardinality(100) */
           owner, segment_name, segment_type, tablespace_name,
           MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
SELECT /*+
        leading(sl de)
        no_merge(sl)
        no_merge(de)
        push_pred(de)
        */
       DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM SEGMENT_LIST SL, dba_extents DE
WHERE 1=1
  AND DE.OWNER           = SL.OWNER
  AND DE.SEGMENT_NAME    = SL.SEGMENT_NAME
  AND DE.segment_type    = SL.segment_type
  AND DE.tablespace_name = SL.tablespace_name
  AND DE.partition_name  = SL.max_partition_name
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME

According to a follow-up email, this was sufficient.  The OP had actually tried variations on the leading() and use_nl() hints – but without the no_merge() hint the optimizer was probably rewriting the SQL in a way that put the hints out of context. It’s worth noting that the /*+ materialize */ hint is in the wrong place – it should be after the first appearance of the SELECT keyword in the factored subquery – and that probably added to the difficulty of getting the desired execution plan.
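
For the record, with the hint moved to the position just described, the factored subquery would start like this (a sketch of the placement only; I haven’t re-run the OP’s query to check whether this change on its own would have been enough):

WITH SEGMENT_LIST AS
(
  select /*+ materialize cardinality(100) */ * from (
   SELECT owner, segment_name, segment_type, tablespace_name,
          MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
-- ... rest of the query (including the leading/no_merge/push_pred hint set) unchanged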

For a production system I’d probably want to do something a little more sophisticated in terms of stability once I’d got the plan I wanted – but this looks like a DBA query used to run an ad hoc report, so perhaps this solution is good enough for the current requirement.
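
If this had been more than an ad hoc report, an obvious next step (a sketch only; the sql_id and plan_hash_value are placeholders you would have to supply after finding the hinted plan in the cursor cache) would have been to capture the plan as a SQL plan baseline:

declare
        m_loaded        pls_integer;
begin
        m_loaded := dbms_spm.load_plans_from_cursor_cache(
                sql_id          => '&m_sql_id',
                plan_hash_value => &m_plan_hash_value
        );
        dbms_output.put_line('Plans loaded: ' || m_loaded);
end;
/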

 

February 2, 2016

Partitioned Bitmap Join

Filed under: bitmaps,Bugs,Indexing,Infrastructure,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 8:32 am BST Feb 2,2016

If you don’t want to read the story, the summary for this article is:

If you create bitmap join indexes on a partitioned table and you use partition exchanges to load data into the table then make sure you create the bitmap join indexes on the loading tables in exactly the same order as you created them on the partitioned table or the exchange will fail with the (truthful but not quite complete) error: ORA-14098: index mismatch for tables in ALTER TABLE EXCHANGE PARTITION.

My story starts with this OTN posting from John Hall where he found after a year of successful batch loading one of his partition exchanges was raising error 14098. After an exchange of ideas, user rp0428 came up with a query against sys.jijoin$ (one of the tables behind bitmap join indexes) that allowed John Hall to see that the indexes on the exchange table had been created in a different order from that of the partitioned table. I did a quick test to see if this might be relevant (it shouldn’t be, it isn’t with “normal” indexes or function-based indexes, or virtual columns) and didn’t manage to reproduce the problem with two dimension tables and two bitmap join indexes.

Fortunately John didn’t take my word for it and tested the idea on a clone of the production system – and found that the order of creation did matter. His system, however, had 9 dimension tables and 33 bitmap join indexes – which shouldn’t have made any difference in principle, but maybe it was something to do with having several indexes on the same table, maybe it was something to do with having far more tables or far more indexes than I had. So I built a larger test case with six dimension tables and six indexes per table – and reproduced the problem.

Then I started cutting back to see where the problem appeared, and found that all it took was one dimension with two indexes, or two dimensions with one index each – whatever I had done in my “quick test” I had clearly done it too quickly and done something wrong. (Unfortunately I had overwritten most of the code from the original quick test while building the larger test, so I couldn’t go back and see where the error was.)

Here, then, is the minimal test case that I finally ran to demonstrate that switching the order of index creation on the exchange table causes the exchange to fail:


drop table pt_range purge;
drop table t1 purge;
drop table dim_1 purge;
drop table dim_2 purge;

prompt  =================
prompt  Partitioned table
prompt  =================

create table pt_range (
        id,
        grp1,
        grp2,
        padding
)
nologging
partition by range(id) (
        partition p2001 values less than (2001),
        partition p4001 values less than (4001),
        partition p6001 values less than (6001),
        partition p8001 values less than (8001)
)
as
select
        rownum                          id,
        trunc(rownum/100)               grp1,
        trunc(rownum/100)               grp2,
        rpad('x',100)                   padding
from
        all_objects
where 
        rownum <= 8000
;

prompt  ================================================
prompt  Exchange table - loaded to match partition p8001
prompt  ================================================

alter table pt_range 
add constraint pt_pk primary key (id) using index local;

create table t1 (
        id,
        grp1,
        grp2,
        padding
)
as 
select
        rownum + 6000                   id,
        trunc(rownum/100)               grp1,
        trunc(rownum/100)               grp2,
        rpad('x',100)                   padding
from
        all_objects
where 
        rownum <= 2000
;

alter table t1
add constraint t1_pk primary key (id);

execute dbms_stats.gather_table_stats(user,'pt_range')
execute dbms_stats.gather_table_stats(user,'t1')

prompt  ================
prompt  dimension tables
prompt  ================

create table dim_1 
as 
select distinct 
        grp1, 
        cast('A'||grp1 as varchar2(3)) agrp1,
        cast('B'||grp1 as varchar2(3)) bgrp1
from
        t1
;

create table dim_2 as select * from dim_1;

prompt  ===============================
prompt  Primary keys required for BMJIs
prompt  ===============================

alter table dim_1 add constraint d1_pk primary key (grp1);
alter table dim_2 add constraint d2_pk primary key (grp1);

execute dbms_stats.gather_table_stats(user,'dim_1')
execute dbms_stats.gather_table_stats(user,'dim_2')

prompt  ============================
prompt  Creating bitmap join indexes
prompt  ============================

create bitmap index pt_1a on pt_range(d1.agrp1) from pt_range pt, dim_1 d1 where d1.grp1 = pt.grp1 local ;
create bitmap index pt_2a on pt_range(d2.agrp1) from pt_range pt, dim_2 d2 where d2.grp1 = pt.grp2 local ;

prompt  ====================================================
prompt  Pick your index creation order on the exchange table
prompt  ====================================================

create bitmap index t1_1a on t1(d1.agrp1) from t1, dim_1 d1 where d1.grp1 = t1.grp1 ;
create bitmap index t1_2a on t1(d2.agrp1) from t1, dim_2 d2 where d2.grp1 = t1.grp2 ;
-- create bitmap index t1_1a on t1(d1.agrp1) from t1, dim_1 d1 where d1.grp1 = t1.grp1 ;

prompt  ==================
prompt  Exchanging (maybe)
prompt  ==================

alter table pt_range
        exchange partition p8001 with table t1
        including indexes
        without validation
;

I’ve got the same create statement twice for one of the bitmap join indexes – as it stands the indexes will be created in the right order and the exchange will work; if you comment out the first t1_1a create and uncomment the second the exchange will fail. (If you comment out the ‘including indexes’ then the exchange will succeed irrespective of the order of index creation, but that rather defeats the point of being able to exchange partitions.)
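
If you want to check the situation before attempting the exchange you don’t need to go anywhere near sys.jijoin$; comparing the creation order of the bitmap join indexes on the two tables is probably sufficient. Here’s a sketch (I’m assuming that ordering by object_id is a good enough proxy for creation order):

select
        ind.table_name, ind.index_name, obj.object_id, obj.created
from
        dba_indexes     ind,
        dba_objects     obj
where
        ind.owner       = user
and     ind.table_name  in ('PT_RANGE','T1')
and     ind.join_index  = 'YES'
and     obj.owner       = ind.owner
and     obj.object_name = ind.index_name
and     obj.object_type = 'INDEX'
order by
        ind.table_name, obj.object_id
;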

I’ve reproduced the problem in 12.1.0.2, 11.2.0.4 and 10.2.0.5.

Footnote: running an extended trace didn’t help me work out how Oracle is detecting the mismatch, presumably it’s something that gets into the dictionary cache in a general “load the index definition” step; but it did show me that (in the “without validation” case) the code seems to check the correctness of the exchange table’s primary key data BEFORE checking whether the indexes match properly.

January 29, 2016

Table Scans

Filed under: Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 12:49 pm BST Jan 29,2016

It’s amazing how easy it is to interpret a number incorrectly until the point comes where you have to look at it closely – and then you realise that there was a lot more to the number than your initial casual assumption, and you would have realised it all along if you’d ever needed to think about it before.

Here’s a little case in point. I have a simple (i.e. non-partitioned) heap table t1 which is basically a clone of the view dba_segments, and I’ve just connected to Oracle through an SQL*Plus session then run a couple of SQL statements. The following is a continuous log of my activity:


SQL> select table_name, partitioned, blocks from user_tables;

TABLE_NAME           PAR     BLOCKS
-------------------- --- ----------
T1                   NO         958

1 row selected.

SQL> select name, value from v$mystat ms, v$statname sn where ms.statistic# = sn.statistic# and name like '%table scan%';

NAME                                  VALUE
-------------------------------- ----------
table scans (short tables)                9
table scans (long tables)                 0
table scans (rowid ranges)                0
table scans (cache partitions)            0
table scans (direct read)                 0
table scan rows gotten                 4188
table scan blocks gotten                 14

7 rows selected.

SQL> select count(extents) from t1;

COUNT(EXTENTS)
--------------
          5143

1 row selected.

SQL> select name, value from v$mystat ms, v$statname sn where ms.statistic# = sn.statistic# and name like '%table scan%';

A few details about the setup before I ask the question:

  • My cache size is quite small, so t1 doesn’t count as a “short” table.
  • I’ve collected stats on the table (and there are no indexes) so the optimizer doesn’t need to do any dynamic sampling to generate an execution plan.
  • This is 11g, so there are no SQL Plan Directives in place to force dynamic sampling.

So here’s the question: how many “table scans (long tables)” will Oracle record against my session when I re-run that query against v$mystat ?

Warning – this IS a trick question.

Update number 1

I said it was a trick question and, as you will have seen if you’ve had time to read the comments, the answer is going to depend on various configuration options. Here’s what I got in my test – and I’ve reported not just the session stats, but the segment statistics (v$segment_statistics):


NAME                                  VALUE
-------------------------------- ----------
table scans (short tables)                9
table scans (long tables)                26
table scans (rowid ranges)               26
table scans (cache partitions)            0
table scans (direct read)                 0
table scan rows gotten                 9331
table scan blocks gotten                954

7 rows selected.

SQL> select owner, object_name, object_type, value  from v$segment_statistics where owner = 'TEST_USER' and statistic_name = 'segment scans' and value != 0;

OWNER           OBJECT_NAME          OBJECT_TYPE             VALUE
--------------- -------------------- ------------------ ----------
TEST_USER       T1                   TABLE                      26

1 row selected.

I recorded 26 long table scans – and that 26 was echoed in the segment statistics. (Note: I don’t normally use v$segment_statistics, which is a join between a couple of in-memory structures and three real tables; I usually query v$segstat). It’s the segment statistics that made me pause in a recent problem review; up to that moment I had been using the “Segments by Table Scans” section of the AWR Report as a useful (but badly named – since it also counts (most) index fast full scans) indicator of a potential threat, but then I suddenly realised that there was yet another important detail I had to check before I could determine what the numbers were telling me.
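
For reference, the v$segstat version of that query looks something like the following sketch (I’ve joined through data_object_id, and ‘TEST_USER’ is just the schema from my test):

select
        obj.owner, obj.object_name, obj.object_type,
        seg.statistic_name, seg.value
from
        v$segstat       seg,
        dba_objects     obj
where
        seg.statistic_name = 'segment scans'
and     seg.value         != 0
and     obj.data_object_id = seg.dataobj#
and     obj.owner          = 'TEST_USER'
order by
        seg.value desc
;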

So the next question is WHY has Oracle reported 26 tablescans ?

Update number 2

Martin Preiss (see comments) has been working hard to investigate this, and managed to produce a couple more variations in statistics for “the same” problem. In my case I had declared the table as “parallel 2”. For a table of degree N Oracle is very likely to break the table into 13*N chunks – there’s a historic reason for 13 – and that’s why I see 26 “rowid range” scans which, unfortunately, are also recorded as “long” table scans even though each is individually well short of the short table limit. This is really annoying when you start to look at the “Segments by Table Scans” report because you now don’t know how many times a table really was scanned unless you know something about the run-time degree of parallelism involved.

If you check the hidden parameters you will find several parameters relating to this chunking, in particular:


_px_min_granules_per_slave        minimum number of rowid range granules to generate per slave (default  13)
_px_max_granules_per_slave        maximum number of rowid range granules to generate per slave (default 100)

Technically, therefore, Oracle MIGHT get to 100 granules per parallel slave, and a tablescan at degree 2 could be recorded as 200 tablescans!

Martin asked why my 26 tablescans didn’t show up as “table scans (direct read)” – but that’s 11g for you: it allows a serial tablescan to use direct path reads and it allows parallel tablescans to read into the cache, and the combination of cache size and table size meant that my example just happened to read into the cache.

And that introduces ANOTHER interpretation problem – what annoying things might I discover if I declare the table with the CACHE option (as Ivica suggested in his first comment below) ?

Final Update (probably)

Ivica has been busy checking the CACHE option and has shown that there are a number of cases to consider (more, in fact, than I was planning to mention – and the variations in the combination of NOPARALLEL and CACHE should give you some pause for thought). The only point I wanted to make was the effect of enabling PARALLEL and CACHE; I don’t think that this is a combination that is very likely to appear in a production system, but boundary conditions (and accidents) do occur. With my little sample – even after I grew the table to be MUCH larger – the tablescan DIDN’T get reported: here are some session stats and segment stats (taken using my snapshot code) of a single tablescan running parallel 2 when the table was cached:


Name                                                                     Value
----                                                                     -----
table scans (rowid ranges)                                                  26
table scans (cache partitions)                                              26
table scan rows gotten                                                 164,672
table scan blocks gotten                                                29,611

  Statistic                                    Value
  ---------                             ------------

T1
  logical reads                               30,272
  physical reads                              29,614
  physical read requests                         261

I know I’ve done a tablescan in 26 pieces (parallel 2) – and scanned 29,000+ blocks doing it; but according to the segment stats AND session stats I haven’t done a tablescan. Fortunately, of course, I can see the parallel tablescan in the session stats, and in this isolated case I can see from the “cache partitions” statistics that that tablescan was on a cached table. But if I’ve been depending on the segment stats to tell me about which tablescans happen most frequently and do most work I’ve found another reason why I can’t trust the stats and have to do more work cross-checking different parts of the AWR for self-consistent results.

I don’t expect to add any more comments about this mixture of tablescans and parallelism, with the varying effects on the session and segment statistics – but there’s always the possibility that one day I’ll start to worry about how the KEEP and RECYCLE (db_keep_cache_size and db_recycle_cache_size) could confuse things further.

Footnote:

It’s probably worth pointing out that the segment statistics have never recorded SHORT tablescans, they’ve only ever captured details of LONG tablescans. There is a slight inconsistency here, though, since they capture all the “index fast full scans (full)” reported in the session stats whether they are short or long – not that the session stats record the difference; unfortunately, another flaw creeps in: parallel index fast full scans “disappear” from the segment stats, although they show up in the session stats as “index fast full scans (rowid ranges)” and “index fast full scans (direct)” with the same multiplier of 13 that we see for parallel table scans.

January 28, 2016

Bitmap Efficiency

Filed under: bitmaps,Indexing,Infrastructure,Oracle — Jonathan Lewis @ 1:02 pm BST Jan 28,2016

An interesting observation came up on the Oracle-L list server a few days ago that demonstrated how clever the Oracle software is at minimising run-time work, and how easy it is to think you know what an execution plan means when you haven’t actually thought through the details – and the details might make a difference to performance.

The original question was about a very large table with several bitmap indexes, and an anomaly that appeared as a query changed its execution plan.  Here are the critical sections from the plans (extracted from memory with rowsource execution statistics enabled):

--------------------------------------------------------------------------------------------------------
|  Id |Operation                        | Name       | Starts | E-Rows | A-Rows |     A-Time | Buffers |
--------------------------------------------------------------------------------------------------------
|   6 |    TABLE ACCESS BY INDEX ROWID  |       FACT |      1 |      1 |     24 |00:00:00.01 |      31 |
|   7 |     BITMAP CONVERSION TO ROWIDS |            |      1 |        |     24 |00:00:00.01 |       7 |
|   8 |      BITMAP AND                 |            |      1 |        |      1 |00:00:00.01 |       7 |
|*  9 |       BITMAP INDEX SINGLE VALUE |     FACT_0 |      1 |        |      1 |00:00:00.01 |       3 |
|* 10 |       BITMAP INDEX SINGLE VALUE |  FACT_DIM1 |      1 |        |      4 |00:00:00.01 |       4 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
     9 - access("FACT"."C0"=243001)
    10 - access("FACT"."C1"="DIMENSION1"."ID")


-------------------------------------------------------------------------------------------------------
|  Id | Operation                      | Name       | Starts | E-Rows | A-Rows |     A-Time | Buffers |
-------------------------------------------------------------------------------------------------------
|   7 |    BITMAP CONVERSION TO ROWIDS |            |      5 |        |      8 |00:00:00.01 |     119 |
|   8 |     BITMAP AND                 |            |      5 |        |      1 |00:00:00.01 |     119 |
|*  9 |      BITMAP INDEX SINGLE VALUE |  FACT_DIM1 |      5 |        |     20 |00:00:00.01 |      28 |
|* 10 |      BITMAP INDEX SINGLE VALUE |  FACT_DIM2 |      5 |        |    140 |00:00:00.01 |      78 |
|* 11 |      BITMAP INDEX SINGLE VALUE |     FACT_0 |      5 |        |      5 |00:00:00.01 |      13 |
|  12 |   TABLE ACCESS BY INDEX ROWID  |       FACT |      8 |      1 |      8 |00:00:00.01 |       8 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
     9 - access("FACT"."C1"="DIMENSION1"."ID")
    10 - access("FACT"."C2"="DIMENSION2"."ID")
    11 - access("FACT"."C0"=243001)

The first plan shows the steps leading to a single access (Starts = 1) to the FACT table after combining two bitmap indexes; the second shows the second child of a nested loop join where Oracle has combined three bitmap indexes to access the FACT table – operation 7 (and its descendants) execute 5 times in this case. I’ve included the related parts of the predicate section so that you can see that the predicates at operations 9 and 10 of the first plan are the same as the predicates at operations 9 and 11 of the second plan.

So here’s the question – if one access to fact_dim1 requires 4 buffer visits, why does it take 28 buffer visits to do the same thing 5 times (and it is with the same value every time); conversely, if one access to fact_0 requires 3 buffer visits, why do 5 visits to do the same thing take only 13 buffer visits ? (Note: the arithmetic is made a little more obscure by the way in which index branch blocks may be pinned during nested loop joins.)

Then there’s a further question – not visible in the plan – the A-Rows in the “BITMAP INDEX SINGLE VALUE” operation is the number of bitmap sections in the rowsource, and we can see that the key values for index fact_dim2 have a significant number of bitmap chunks for a single key (5 executions returned 140 bitmap chunks). This scale, though, is true of all three indexes – in fact a follow-up email pointed out that a typical key value in EVERY ONE of the three indexes consisted of about 100 bitmap chunks, so why can’t we see those hundreds in the execution plan ?

So this is where we’re at: we have an execution plan where we haven’t visited all the bitmap chunks for a bitmap key, and the order in which the bitmap indexes are used in the plan seems to have some effect on the choice of leaf-blocks you visit when accessing the chunks. So (a) could a change in the order of indexes make a significant difference to the number of bitmap chunks you visit and the resulting performance, and (b) is there a way to control the order in which you visit the indexes. That’s where the note starts to get a bit technical – if you don’t want to read any more the answers are: (a) yes but probably not significantly and (b) yes.

Demo

To investigate what goes on inside a “BITMAP AND” I created a table with two bitmap indexes and used a very large setting for pctfree for the indexes so that they had to be stored with a large number of bitmap chunks per key. Here’s the code that I used, with some results from an instance of 12.1.0.2:


create table people
nologging
as
with generator as (
        select  --+ materialize 
                rownum id 
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum-1, 1e2)      id_town_home,
        trunc((rownum-1)/1e4)   id_town_work,
        rpad('x',10,'x')        small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'PEOPLE',
                method_opt       => 'for all columns size 1'
        );
end;
/

create bitmap index pe_home on people(id_town_home) nologging pctfree 95;
create bitmap index pe_work on people(id_town_work) nologging pctfree 95;

select
        index_name, distinct_keys, num_rows, leaf_blocks, avg_leaf_blocks_per_key
from
        user_indexes
where
        table_name = 'PEOPLE'
order by
        index_name
;


INDEX_NAME           DISTINCT_KEYS   NUM_ROWS LEAF_BLOCKS AVG_LEAF_BLOCKS_PER_KEY
-------------------- ------------- ---------- ----------- -----------------------
PE_HOME                        100      30399       15200                     152
PE_WORK                        100       1800         907                       9

As you can see I’ve generated two columns (id_town_home, id_town_work) with 100 distinct values and 10,000 rows each, but with very different data distributions – the rows for any given value for id_town_home are uniformly spread across the entire table, every hundredth row; while the rows for any given value of id_town_work are very tightly clustered as a group of 10,000 consecutive rows. As a consequence the index entry (bitmap string) for a typical key value for id_town_home is enormous and has to be broken into 304 chunks spread across 152 leaf blocks (2 index entries per leaf block), while the index entry for a typical key value for id_town_work is much shorter, but still requires 18 chunks spread across 9 leaf blocks.

So what will I see if I run the following query, and force it to use a BITMAP AND of the two indexes, in the two different orders:

select
        /*+ index_combine(pe) */
        max(small_vc)
from
        people pe
where
        id_town_home = 50
and     id_town_work = 50
;

Based on a very simple interpretation of the typical execution plan and using the index stats shown above we might expect to see roughly A-Rows = 18 with 9 buffer gets (plus a few more for segment headers and branch blocks) on the id_town_work index and A-Rows = 304 with 152 buffer gets on the id_town_home index to allow Oracle to generate and compare the two bit strings – but here are the two plans with their execution stats, generated in 12.1.0.2, and each run after flushing the buffer cache:

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |         |      1 |        |      1 |00:00:00.01 |     118 |    117 |
|   1 |  SORT AGGREGATE                      |         |      1 |      1 |      1 |00:00:00.01 |     118 |    117 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| PEOPLE  |      1 |    100 |    100 |00:00:00.01 |     118 |    117 |
|   3 |    BITMAP CONVERSION TO ROWIDS       |         |      1 |        |    100 |00:00:00.01 |      18 |     17 |
|   4 |     BITMAP AND                       |         |      1 |        |      1 |00:00:00.01 |      18 |     17 |
|*  5 |      BITMAP INDEX SINGLE VALUE       | PE_WORK |      1 |        |     18 |00:00:00.01 |      14 |     13 |
|*  6 |      BITMAP INDEX SINGLE VALUE       | PE_HOME |      1 |        |      4 |00:00:00.01 |       4 |      4 |
-------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |         |      1 |        |      1 |00:00:00.01 |     122 |    120 |
|   1 |  SORT AGGREGATE                      |         |      1 |      1 |      1 |00:00:00.01 |     122 |    120 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| PEOPLE  |      1 |    100 |    100 |00:00:00.01 |     122 |    120 |
|   3 |    BITMAP CONVERSION TO ROWIDS       |         |      1 |        |    100 |00:00:00.01 |      22 |     20 |
|   4 |     BITMAP AND                       |         |      1 |        |      1 |00:00:00.01 |      22 |     20 |
|*  5 |      BITMAP INDEX SINGLE VALUE       | PE_HOME |      1 |        |      5 |00:00:00.01 |       8 |      7 |
|*  6 |      BITMAP INDEX SINGLE VALUE       | PE_WORK |      1 |        |     18 |00:00:00.01 |      14 |     13 |
-------------------------------------------------------------------------------------------------------------------

We have NOT touched anything like the entire bit-string for the id_town_home index – a bit-string that spans 152 leaf blocks! Clearly Oracle is doing something clever to minimise the work, and it’s so clever that switching the order of these two extremely different indexes in the plan has made virtually no difference to the work done. Obviously I can’t tell you exactly what the code is doing, but I think I can produce a reasonable guess about what’s going on.

The pe_work index has the smaller number of leaf blocks per key, which makes it the better starting choice for the AND in this case, so the optimizer’s default starting action was to pick the first couple of chunks of that index key value; and Oracle immediately sees that the first rowid that it could possibly need in its result set is roughly in the middle of the table – remember that the “key” columns of a bitmap index are (real_key, first_rowid_of_chunk, last_rowid_of_chunk, compressed_bitstring).

Since it now knows the lowest possible rowid that it could need Oracle can now probe the pe_home index by (id_town_home=50, {target_rowid}) – which will let it go to a bitmap index chunk that’s roughly in the middle of the full range of 152. Then Oracle can expand the bitstrings from the chunks it has, reading new chunks as needed from each of the indexes until the 18 chunks / 9 leaf blocks from the pe_work index have been used up (and that range would have aligned with just two or three chunks from the pe_home index), at which point Oracle can see there are no more rows in the table that could match both predicates and it doesn’t need to read the next 75 chunks of the pe_home index.

Conversely, when I forced Oracle to use the (inappropriate) pe_home index first, it read the first couple of chunks, then read the first couple of chunks of the pe_work index, at which point it discovered that it didn’t need any of the pe_home index prior to (roughly) chunk 75, so it jumped straight to the right chunk to align with pe_work and carried on from there. That’s why the forced, less efficient, plan that visited pe_home first visited just a couple more leaf blocks than the plan the optimizer selected for itself.

Bottom line on performance (tl;dr) – Oracle is sufficiently smart about checking the start and end ranges on bitmap indexes (rather than arbitrarily expanding the entire bitmap for each key) that even for very large bitmap index entries it will probably only access a couple of “redundant” leaf blocks per index even if it picks the worst possible order for using the indexes. You’re far more likely to notice Oracle picking the wrong indexes (because you know the data better) than you are to spot it using the right indexes in the wrong order – and given that bitmap indexes tend to be relatively small and well buffered (compared to the tables), and given the relatively large number of rows we pick by random I/O from fact tables, a little extra work in the bitmap indexes is unlikely to make a significant difference to the performance of most queries.

Closing fact: in the unlikely circumstances that you do spot the special case where it will make a difference (and it will probably be a difference in CPU usage) then you can dictate the order of the indexes with the undocumented bitmap_tree() hint.  I may get round to writing up the variations one day but, for this simple case, the index_combine() hint that I used to force the BITMAP AND turned into the following bitmap_tree() hint in the outline:

bitmap_tree(@sel$1 pe@sel$1 and((people.id_town_work) (people.id_town_home)))

bitmap_tree( @query_block     table_name@query_block     and( ({first index definition}) ({second index definition}) ) )

Obviously not suitable to throw into production code casually – check with Oracle support if you think it’s really necessary – but if you wanted to reverse the order of index usage in this case you could just swap the order of the index definitions. If you thought there was a third index that should be used you could include its definition (note that it’s table_name.column_name – the index definition – in the brackets).
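
As a sketch, then (this is simply the hint from the outline with the two index definitions swapped, and sel$1 is the default query block name since I haven’t added a qb_name() hint), forcing Oracle to use the pe_home index first might look like this:

select
        /*+
                bitmap_tree(@sel$1 pe@sel$1 and((people.id_town_home) (people.id_town_work)))
        */
        max(small_vc)
from
        people pe
where
        id_town_home = 50
and     id_town_work = 50
;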

My reference: bitmap_control_02.sql

January 27, 2016

Add primary key.

Filed under: Indexing,Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 9:07 am BST Jan 27,2016

I thought I had written this note a few years ago, on OTN or Oracle-L if not on my blog, but I can’t find any sign of it so I’ve decided it’s time to write it (again) – starting as a question about the following code:


rem
rem     Script: pk_overhead.sql
rem     Author: J.P.Lewis
rem     Dated:  Feb 2012
rem

create table t1
as
with generator as (
        select  rownum  id
        from            dual
        connect by
                        rownum <= 1000
)
select
        rownum                                  id,
        trunc((rownum-1)/50)                    clustered,
        mod(rownum,20000)                       scattered,
        lpad(rownum,10)                         vc_small,
        rpad('x',100,'x')                       vc_padding
from
        generator       g1,
        generator       g2
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

alter system flush buffer_cache;

alter table t1 add constraint t1_pk primary key(id, scattered);

I’ve generated a table with 1,000,000 rows, including a column that’s guaranteed to be unique; then I’ve added a (two-column) primary key constraint to that table.

Because of the guaranteed unique column the call to add constraint will succeed. Because Oracle will automatically create a unique index to support that constraint it will have to do a tablescan of the table. So here’s the question: HOW MANY TIMES will it tablescan that table (and how many rows will it scan) ?

Space for thought …

The answer is three tablescans, 3 million rows.

Oracle will scan the table to check the validity of adding a NOT NULL definition and constraint for the id column, repeat the scan to do the same for the scattered column, then one final scan to accumulate the key data and rowids to sort and create the index.
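
If you want to check the numbers on your own system you can wrap the “add constraint” from the script above in a before/after snapshot of the relevant session statistics; the delta should show 3 long tablescans and roughly 3,000,000 rows. A sketch:

select  sn.name, ms.value
from    v$mystat ms, v$statname sn
where   ms.statistic# = sn.statistic#
and     sn.name in ('table scans (long tables)', 'table scan rows gotten');

alter table t1 add constraint t1_pk primary key(id, scattered);

select  sn.name, ms.value
from    v$mystat ms, v$statname sn
where   ms.statistic# = sn.statistic#
and     sn.name in ('table scans (long tables)', 'table scan rows gotten');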

Knowing this, you may be able to find ways to modify bulk data loading operations to minimise overheads.
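
One approach (a sketch of the idea rather than a tested recipe): if the columns have already been declared NOT NULL as part of the load (e.g. in the CTAS column list) and you build the unique index yourself, the “add constraint” step should have very little left to do:

create unique index t1_pk on t1(id, scattered) nologging;       -- one tablescan to build the index

alter table t1
        add constraint t1_pk primary key(id, scattered)
        using index t1_pk                                       -- re-use the existing index
;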

The most recent version I’ve tested this on is 12.1.0.2.

See also: https://jonathanlewis.wordpress.com/2012/03/02/add-constraint/

Update – May 2016

The extra tablescans occur even if you have pre-existing check constraints (not declarations) on the columns to ensure that they are not null (i.e. things like: “alter table t1 add constraint t1_nn_id check (id is not null)”).

January 26, 2016

Trace file size

Filed under: 12c,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 8:30 am BST Jan 26,2016

Here’s a convenient enhancement for tracing that came up on Twitter a few days ago – first in a tweet that I retweeted, then in a question from Christian Antognini based on this bit of the 12c Oracle documentation. The question was – does it work for you ?

The new description for max_dump_file_size says that for large enough values Oracle will split the file into multiple chunks of a few megabytes, using a suffix to identify the sequence of the chunks, keeping only the first chunk and the most recent chunks. Unfortunately this doesn’t seem to be true. However, prompted by Chris’ question I ran a quick query against the full parameter list looking for parameters with the word “trace” in their name:


select
        /*+
                leading(nam val val2)
                full(nam)
                full(val)  use_hash(val)  no_swap_join_inputs(val)
                full(val2) use_hash(val2) no_swap_join_inputs(val2)
        */
        nam.ksppinm                             name,
        val.ksppstvl                            ses_val,
        val2.ksppstvl                           sys_val,
        nam.ksppdesc                            description,
        nam.indx+1                              numb,
        nam.ksppity                             type,
        val.ksppstdf                            is_def,
        decode(bitand(nam.ksppiflg/256,1),
                1,'True',
                  'False'
        )                                       ses_mod,
        decode(bitand(nam.ksppiflg/65536,3),
                1,'Immediate',
                2,'Deferred' ,
                3,'Immediate',
                  'False'
        )                                       sys_mod,
        decode(bitand(val.ksppstvf,7),
                1,'Modified',
                4,'System Modified',
                  'False'
        )                                       is_mod,
        decode(bitand(val.ksppstvf,2),
                2,'True',
                  'False'
        )                                       is_adj,
        val.ksppstcmnt                          notes
from
        x$ksppi         nam,
        x$ksppcv        val,
        x$ksppsv        val2
where
        nam.indx = val.indx
and     val2.indx = val.indx
and     ksppinm like '%&m_search.%'
order by
        nam.ksppinm
;

Glancing through the result I spotted a couple of interesting parameters with the letters “uts” in their names, so I re-ran my query looking for all the “uts” parameters, getting the following (edited) list:


NAME                           SYS_VAL         DESCRIPTION    
------------------------------ --------------- ---------------------------------------------
_diag_uts_control              0               UTS control parameter
_uts_first_segment_retain      TRUE            Should we retain the first trace segment
_uts_first_segment_size        0               Maximum size (in bytes) of first segments 
_uts_trace_disk_threshold      0               Trace disk threshold parameter
_uts_trace_segment_size        0               Maximum size (in bytes) of a trace segment
_uts_trace_segments            5               Maximum number of trace segments 

Note particularly the “first segment size” and “trace segment size” – defaulting to zero (which often means a hidden internal setting, though that doesn’t seem to be the case here, but maybe that’s what the “diag control” is for). I haven’t investigated all the effects, but after a little experimentation I found that all I needed to do to get the behaviour attributed to max_dump_file_size was to set the following two parameters – which I could do at the session level.


alter session set "_uts_first_segment_size" = 5242880;
alter session set "_uts_trace_segment_size" = 5242880;

The minimum value for these parameters is the one I’ve shown above (5120 KB) and with the default value for _uts_trace_segments you will get a maximum of 5 trace files with sequential names like the following:

ls -ltr *4901*.trc

-rw-r----- 1 oracle oinstall 5243099 Jan 26 08:15 orcl_ora_4901_1.trc
-rw-r----- 1 oracle oinstall 5243064 Jan 26 08:15 orcl_ora_4901_12.trc
-rw-r----- 1 oracle oinstall 5243058 Jan 26 08:15 orcl_ora_4901_13.trc
-rw-r----- 1 oracle oinstall 5242993 Jan 26 08:15 orcl_ora_4901_14.trc
-rw-r----- 1 oracle oinstall 1363680 Jan 26 08:15 orcl_ora_4901.trc

As you can see I’m currently generating my 15th trace, and Oracle has kept the first one and the previous three. It’s always working on a file with no suffix to its name but as soon as that file hits its limiting size (plus or minus a few bytes) it gets its appropriate suffix, the oldest file is deleted, and a new trace file without a suffix is started.

Apart from the usual header information the trace files start and end with lines like:

*** TRACE CONTINUED FROM FILE /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_11.trc ***
  
*** TRACE SEGMENT RENAMED TO /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_12.trc ***

There is one little trap to watch out for: if you set either of these parameters to be larger than max_dump_file_size tracing stops as soon as one of the segments hits the max_dump_file_size and that trace file ends with the usual “overflow” message – e.g. when I changed the max_dump_file_size to 4M in mid-session:

*** DUMP FILE SIZE IS LIMITED TO 4194304 BYTES ***

In my case I had started with max_dump_file_size set to 20M, so I got lucky with my choice of 5M as the segment size.
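
For reference, the complete set of session-level settings I was experimenting with amounted to something like the following (the 10046 event at the end is just one convenient way of generating a lot of trace; use whatever tracing mechanism you prefer):

alter session set max_dump_file_size = '20M';
alter session set "_uts_first_segment_size" = 5242880;
alter session set "_uts_trace_segment_size" = 5242880;

alter session set events '10046 trace name context forever, level 12';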

Further investigation is left as an exercise to the interested reader.

 

January 24, 2016

Semijoin_driver

Filed under: bitmaps,CBO,Execution plans,Oracle,subqueries — Jonathan Lewis @ 11:42 am BST Jan 24,2016

Here’s one of those odd little tricks that (a) may help in a couple of very special cases and (b) may show up at some future date – or maybe it already does – in the optimizer if it is recognised as a solution to a more popular problem. It’s about an apparent restriction on how the optimizer uses the BITMAP MERGE operation, and to demonstrate a very simple case I’ll start with a data set with just one bitmap index:


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum,1000)        n1,
        rpad('x',10,'x')        small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

create bitmap index t1_b1 on t1(n1);

select index_name, leaf_blocks, num_rows from user_indexes;

/*
INDEX_NAME           LEAF_BLOCKS   NUM_ROWS
-------------------- ----------- ----------
T1_B1                        500       1000
*/

Realistically we don’t expect to use a single bitmap index to access data from a large table; usually we expect to have queries that give the optimizer the option to choose and combine several bitmap indexes (possibly driving through dimension tables first) to reduce the target row set in the table to a cost-effective level.

In this example, though, I’ve created a column data set that many people might view as “inappropriate” as the target for a bitmap index – in one million rows I have one thousand distinct values, it’s not a “low cardinality” column – but, as Richard Foote (among others) has often had to point out, it’s wrong to think that bitmap indexes are only suitable for columns with a very small number of distinct values. Moreover, it’s the only index on the table, so no chance of combining bitmaps.

Another thing to notice about my data set is that the n1 column has been generated by the mod() function; because of this the column cycles through the 1,000 values I’ve allowed for it, and this means that the rows for any given value are scattered widely across the table, but it also means that if I find a row with the value X in it then there could well be a row with the value X+4 (say) in the same block.

I’ve reported the statistics from user_indexes at the end of the sample code. This shows you that the index holds 1,000 “rows” – i.e. each key value requires only one bitmap entry to cover the whole table, with two rows per leaf block. (By comparison, a B-tree index on the column was 2,077 leaf blocks uncompressed, or 1,538 leaf blocks when compressed.)
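
If you want to check the B-tree comparison on your own system the arithmetic is easy to reproduce; a sketch (drop the bitmap index first since, in general, you can’t have a second visible index on exactly the same column list):

drop index t1_b1;

create index t1_i1 on t1(n1) nologging;
select index_name, leaf_blocks from user_indexes where index_name = 'T1_I1';

drop index t1_i1;
create index t1_i1 on t1(n1) nologging compress 1;
select index_name, leaf_blocks from user_indexes where index_name = 'T1_I1';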

So here’s the query I want to play with, followed by the run-time execution plan with stats (in this case from a 12.1.0.2 instance):


alter session set statistics_level = all;

select
        /*+
                qb_name(main)
        */
        max(small_vc)
from
        t1
where
        n1 in (1,5)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last outline'));

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |       |      1 |        |      1 |00:00:00.03 |    2006 |      4 |
|   1 |  SORT AGGREGATE                       |       |      1 |      1 |      1 |00:00:00.03 |    2006 |      4 |
|   2 |   INLIST ITERATOR                     |       |      1 |        |   2000 |00:00:00.03 |    2006 |      4 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      2 |   2000 |   2000 |00:00:00.02 |    2006 |      4 |
|   4 |     BITMAP CONVERSION TO ROWIDS       |       |      2 |        |   2000 |00:00:00.01 |       6 |      4 |
|*  5 |      BITMAP INDEX SINGLE VALUE        | T1_B1 |      2 |        |      2 |00:00:00.01 |       6 |      4 |
------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access(("N1"=1 OR "N1"=5))

The query is selecting 2,000 rows from the table, for n1 = 1 and n1 = 5, and the plan shows us that the optimizer probes the bitmap index twice (operation 5), once for each value, fetching all the rows for n1 = 1, then fetching all the rows for n1 = 5. This entails 2,000 buffer gets. However, we know that for every row where n1 = 1 there is another row nearby (probably in the same block) where n1 = 5 – it would be nice if we could pick up the 1 and the 5 at the same time and do less work.

Technically the optimizer has the necessary facility to do this – it’s known as the BITMAP MERGE – Oracle can read two or more entries from a bitmap index, superimpose the bits (effectively a BITMAP OR), then convert to rowids and visit the table. Unfortunately there are cases (and it seems to be only the simple cases) where this doesn’t appear to be allowed even when we – the users – can see that it might be a very effective strategy. So can we make it happen – and since I’ve asked the question you know that the answer is almost sure to be yes.

Here’s an alternate (messier) SQL statement that achieves the same result:


select
        /*+
                qb_name(main)
                semijoin_driver(@subq)
        */
        max(small_vc)
from
        t1
where
        n1 in (
                select /*+ qb_name(subq) */
                        *
                from    (
                        select 1 from dual
                        union all
                        select 5 from dual
                        )
        )
;

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |       |      1 |        |      1 |00:00:00.02 |    1074 |       |       |          |
|   1 |  SORT AGGREGATE                      |       |      1 |      1 |      1 |00:00:00.02 |    1074 |       |       |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |   2000 |   2000 |00:00:00.02 |    1074 |       |       |          |
|   3 |    BITMAP CONVERSION TO ROWIDS       |       |      1 |        |   2000 |00:00:00.01 |       6 |       |       |          |
|   4 |     BITMAP MERGE                     |       |      1 |        |      1 |00:00:00.01 |       6 |  1024K|   512K| 8192  (0)|
|   5 |      BITMAP KEY ITERATION            |       |      1 |        |      2 |00:00:00.01 |       6 |       |       |          |
|   6 |       VIEW                           |       |      1 |      2 |      2 |00:00:00.01 |       0 |       |       |          |
|   7 |        UNION-ALL                     |       |      1 |        |      2 |00:00:00.01 |       0 |       |       |          |
|   8 |         FAST DUAL                    |       |      1 |      1 |      1 |00:00:00.01 |       0 |       |       |          |
|   9 |         FAST DUAL                    |       |      1 |      1 |      1 |00:00:00.01 |       0 |       |       |          |
|* 10 |       BITMAP INDEX RANGE SCAN        | T1_B1 |      2 |        |      2 |00:00:00.01 |       6 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
  10 - access("N1"="from$_subquery$_002"."1")

Key points from this plan – and I’ll comment on the SQL in a moment: The number of buffer visits is roughly halved (In many cases we picked up two rows as we visited each buffer); operation 4 shows us that we did a BITMAP MERGE, and we can see in operations 5 to 10 that we did a BITMAP KEY ITERATION (which is a bit like a nested loop join – “for each row returned by child 1 (operation 6) we executed child 2 (operation 10)”) to probe the index twice and get two strings of bits that operation 4 could merge before operation 3 converted to rowids.

For a clearer picture of how we visit the table, here are the first few rows and last few rows from a version of the two queries where we simply select the ID column rather than aggregating on the small_vc column:

select  id from ...

Original query structure
         1
      1001
      2001
      3001
...
    997005
    998005
    999005

2000 rows selected.

Modified query structure:

         1
         5
      1001
      1005
      2001
      2005
...
    998001
    998005
    999001
    999005
    
2000 rows selected.

As you can see, one query returns all the n1 = 1 rows then all the n1 = 5 rows while the other query alternates as it walks through the merged bitmap. You may recall the Exadata indexing problem (now addressed, of course) from a few years back where the order in which rows were visited after a (B-tree) index range scan made a big difference to performance. This is the same type of issue – when the optimizer’s default plan gets the right data in the wrong order we may be able to find ways of modifying the SQL to visit the data in a more efficient order. In this case we save only fractions of a second because all the data is buffered, but it’s possible that in a production environment with much larger tables many, or all, of the re-visits could turn into physical reads.

Coming back to the SQL, the key to the re-write is to turn my IN-list into a subquery, and then tell the optimizer to use that subquery as a “semijoin driver”. This is essentially the mechanism used by the Star Transformation, where the optimizer rewrites a simple join so that each dimension table (typically) appears twice, first as an IN subquery driving the bitmap selection then as a “joinback”. But (according to the manuals) a star transformation requires at least two dimension tables to be involved in a join to the central fact table – and that may be why the semi-join approach is not considered in this (and slightly more complex) case.

 

 

My reference: bitmap_merge.sql, star_hack3.sql

January 18, 2016

Drop Column

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 8:14 am BST Jan 18,2016

I published a note on AllthingsOracle a few days ago discussing the options for dropping a column from an existing table. In a little teaser to a future article I pointed out that dropping columns DOESN’T reclaim space; or rather, probably doesn’t, and even if it did you probably won’t like the way it does it.

I will be writing about “massive deletes” for AllthingsOracle in the near future, but I thought I’d expand on the comment about not reclaiming space straight away. The key point is this – when you drop a column you are probably dropping a small fraction of each row. (Obviously there are some extreme variants on the idea – for example, you might have decided to move a large varchar2() to a separate table with a shared primary key.)

If you’ve dropped a small fraction of each row you’ve freed up a small fraction of each block, which probably means the block hasn’t been identified as having available free space for inserts. In many cases this is probably a good thing – because it’s quite likely that if every block in your table is suddenly labelled as having sufficient free space for new rows then you could end up with a difficult and ongoing performance problem.

Many large tables have a “time-based” component to their usage – as time passes the most recently entered rows are the ones that get most usage, and older rows are no longer accessed; this means you get a performance benefit from caching because the most useful fractions of such tables are often well cached and the “interesting” data is fairly well clustered.

In a case like this, imagine what will happen if EVERY block in your table suddenly acquires enough free space to accept a couple of new rows – over the next few days the incoming data will be spread across the entire length of the table, and for the next couple of months, or years, you will have to keep the entire table cached in memory if the performance is to stay constant; moreover the clustering_factor of the most useful indexes is likely to jump from “quite small” to “absolutely massive”, and the optimizer will start changing lots of plans because it will decide that your favourite indexes are probably much too expensive to use.

I am, of course, painting a very grim picture – but it is a possible scenario that should be considered before you drop a column from a table. Combined with my observations about the locking and overheads of dropping a column you might (probably ought to) decide that you should never drop a column: you should only mark it as unused or (better still, if you’re on 12c) mark it invisible for a while before marking it unused. You can worry about space reclamation at a later date, when you’ve considered all the ramifications of how it might impact on performance.
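For reference, the gentler options look something like this – a minimal sketch, with a made-up table and column name:

rem     12c onwards: hide the column first, decide about dropping it later
alter table big_table modify (old_col invisible);

rem     any recent version: mark the column unused rather than dropping it
alter table big_table set unused column old_col;

rem     only when you have considered the performance side effects:
alter table big_table drop unused columns;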

Footnote: If you’re still using freelist management then dropping a column won’t put a block on the freelist until the total used space in the block falls below the value dictated by pctused (default 40%); if you’re using ASSM then the block doesn’t become available for reuse until (by default) the free space exceeds 25% of the block’s usable space.

 

 

January 11, 2016

Subquery Effects

Filed under: Bugs,CBO,Execution plans,Oracle,subqueries — Jonathan Lewis @ 12:50 pm BST Jan 11,2016

Towards the end of last year I used a query with a couple of “constant” subqueries as a focal point for a blog note on reading parallel execution plans. One of the comments on that note raised a question about cardinality estimates and, coincidentally, I received an email about the cost calculations for a similar query a few days later.

Unfortunately there are all sorts of anomalies, special cases, and changes that show up across versions when subqueries come into play – it’s only in recent versions of 11.2, for example, that a very simple example I’ve got of three equivalent statements producing the same execution plan reports the same costs and cardinalities. (The queries are: table with IN subquery, table with EXISTS subquery, table joined to a “manually unnested” subquery – all three plans take the unnested subquery shape.)
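Just to make that concrete, here’s a sketch of the three shapes – not the actual test script, and with hypothetical table and column names:

-- 1: IN subquery
select * from t1 where t1.n1 in (select t2.n1 from t2);

-- 2: EXISTS subquery
select * from t1 where exists (select null from t2 where t2.n1 = t1.n1);

-- 3: "manually unnested" - join to a distinct inline view
select t1.* from (select distinct n1 from t2) v, t1 where t1.n1 = v.n1;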

I’m just going to pick out one particular anomaly, which is a costing error with multiple subqueries when “OR-ed”. Here’s my sample data set:


create table t1
nologging
as
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from dual
connect by
        level <= 20000
;


create table t2
nologging
as
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from dual
connect by
        level <= 25000
;

create table t3
nologging
as
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from dual
connect by
        level <= 30000
;
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'t1',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'t2',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'t3',
                method_opt       => 'for all columns size 1'
        );
end;
/

The three tables are slightly different sizes so that it will be easy to see different costs of tablescans, and there are no indexes so everything I do in the queries will be tablescans. Here are six queries I’m going to test – they all scan t1, with “constant” subqueries against t2 and/or t3. The first pair is just to show you the basic cost of the query with a single subquery, the second pair shows you the default action with two subqueries in two different orders, and the final pair shows you what happens with two subqueries when you block subquery pushing.


select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     t1.n2 > (select avg(t2.n2) from t2)
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     t1.n3 > (select avg(t3.n3) from t3)
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n2 > (select avg(t2.n2) from t2)
         or t1.n3 > (select avg(t3.n3) from t3)
        )
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n3 > (select avg(t3.n3) from t3)
         or t1.n2 > (select avg(t2.n2) from t2)
        )
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n2 > (select /*+ no_push_subq */ avg(t2.n2) from t2)
         or t1.n3 > (select /*+ no_push_subq */ avg(t3.n3) from t3)
        )
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n3 > (select /*+ no_push_subq */ avg(t3.n3) from t3)
         or t1.n2 > (select /*+ no_push_subq */ avg(t2.n2) from t2)
        )
;

Here are the first two plans, pulled from memory (which you might have guessed thanks to the “disappearing subquery predicate” in the predicate section). These examples came from 12.1.0.2, but the same happens in 11.2.0.4:


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   111 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    10 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   500 |  5000 |    49   (3)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND "T1"."N2">))

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   123 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    10 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   500 |  5000 |    49   (3)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND "T1"."N3">))

As you can see, the cost of the query is the cost of the t1 tablescan plus the cost of running the t2 or t3 subquery once: 111 = 49 + 62, and 123 = 49 + 74.

(As a general guideline, recent versions of the optimizer tend to allow for subqueries by including “cost of subquery” * “number of times the optimizer thinks it will execute” – in this case the optimizer knows that the subquery will run exactly once).

But what happens when we test the query that applies BOTH subqueries to the tablescan ?


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |    50 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   975 | 14625 |    50   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND ("T1"."N2"> OR "T1"."N3">)))


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |    50 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   975 | 14625 |    50   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND ("T1"."N3"> OR "T1"."N2">)))

The cost of the query in both cases is just the cost of the tablescan of t1 – the subqueries are, apparently, free. You can check from the predicate section, by the way, that the subqueries are applied in the order they appear in the original statement.

Does anything change if the subqueries are not pushed ?


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   111 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   FILTER             |      |       |       |            |          |
|*  3 |    TABLE ACCESS FULL | T1   | 10001 |   146K|    50   (4)| 00:00:01 |
|   4 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   5 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
|   6 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   7 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N2"> OR "T1"."N3">))
   3 - filter("T1"."N1">10000)

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   124 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   FILTER             |      |       |       |            |          |
|*  3 |    TABLE ACCESS FULL | T1   | 10001 |   146K|    50   (4)| 00:00:01 |
|   4 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   5 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
|   6 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   7 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N3"> OR "T1"."N2">))
   3 - filter("T1"."N1">10000)

The two plans have different costs – and the cost is the cost of the tablescan of t1 plus the cost of just the first subquery in the filter predicate list.

The non-pushed subqueries show up another anomaly: you’ll notice that the t1 tablescan reports 10,001 rows cardinality, but the FILTER operation doesn’t have an associated cardinality so we can’t see how many rows the optimizer thinks will survive the subqueries. So let’s run a query that allows us to see the surviving row estimate:


select
        max(n1)
from
        (
        select
                /*+ no_eliminate_oby */
                t1.n1
        from
                t1
        where
                t1.n1 > 10000
        and     (
                   t1.n3 > (select /*+ no_push_subq */ avg(t3.n3) from t3)
                or t1.n2 > (select /*+ no_push_subq */ avg(t2.n2) from t2)
                )
        order by
                n1
        )
;

-------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |       |       |   126 (100)|          |
|   1 |  SORT AGGREGATE        |      |     1 |    13 |            |          |
|   2 |   VIEW                 |      | 10001 |   126K|   126   (5)| 00:00:01 |
|   3 |    SORT ORDER BY       |      | 10001 |   146K|   126   (5)| 00:00:01 |
|*  4 |     FILTER             |      |       |       |            |          |
|*  5 |      TABLE ACCESS FULL | T1   | 10001 |   146K|    50   (4)| 00:00:01 |
|   6 |      SORT AGGREGATE    |      |     1 |     5 |            |          |
|   7 |       TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
|   8 |      SORT AGGREGATE    |      |     1 |     5 |            |          |
|   9 |       TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(("T1"."N3"> OR "T1"."N2">))
   5 - filter("T1"."N1">10000)

As you can see, the SORT ORDER BY operation thinks it’s going to handle 10,001 rows – it looks as if the optimizer arithmetic hasn’t applied the usual subquery guess of 5% for the two subqueries. (When the subqueries were automatically pushed you saw a cardinality of 975 – which is 5% for subquery t2 plus (due to OR) 5% for subquery t3 minus 5% of 5% (= 25) for the overlap – this is the standard OR arithmetic.)
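Putting the standard OR arithmetic into numbers, using the 10,001 rows that survive the n1 predicate:

selectivity(n2 or n3) = 0.05 + 0.05 - (0.05 * 0.05) = 0.0975
0.0975 * 10,001       = 975 (rounded)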

tl;dr

Although the optimizer code has been enhanced in many places for dealing with subquery estimates, there are still some odd errors and inconsistencies that you need to be aware of. The examples I’ve shown may not be particularly significant in terms of what they do, but the pattern is one that you may recognise in more complex queries.

 

Reference script: subq_cost_anomaly_2.sql

 

January 8, 2016

CTEs and Updates

Filed under: Execution plans,Oracle,Subquery Factoring,Tuning — Jonathan Lewis @ 1:01 pm BST Jan 8,2016

An important target of trouble-shooting, particularly when addressing performance problems, is to minimise the time and effort you have to spend to get a “good enough” result. A recent question on the OTN database forum struck me as a good demonstration of following this strategy; the problem featured a correlated update that had to access a view 84 times to update a small table; but the view was a complex view (apparently non-mergeable) and the update took several hours to complete even though the view, when instantiated, held only 63 rows.

The OP told us that the query “select * from view” took seven minutes to return those 63 rows, and wanted to know if we could find a nice way to perform the update in (approximately) that seven minutes, rather than using the correlated update approach that seemed to take something in the ballpark of 7 minutes per row updated.

Of course the OP could have given us all the details of the view definition, all the table and index definitions, with stats etc. and asked us if we could make the update run faster – but that could lead to a long and frustrating period of experimentation and testing, and a solution that might increase the general maintenance costs of the system (because a subsequent modification to the view might then have to be echoed into the code that did the update). Setting a strictly limited target that clearly ought to be achievable is (if nothing else) a very good starting point for improving the current situation.

I don’t know (as at the time of writing) if the OP implemented the strategy I suggested, but from his description it looked as if it should have been simple to use subquery factoring with materialization to achieve the required result in the most elegant way possible (meaning, in this case, simple SQL and no change to any surrounding code).

The OP has responded to my suggestion with a comment that “it didn’t work”, but it appeared to me that they were looking at and mis-interpreting the output from a call to “Explain Plan” rather than testing the query and pulling the plan from memory – so I thought I’d build a simple model to demonstrate the principle and show you how you could confirm (beyond just checking the clock) that the strategy had worked.

We start with a table to update, a non-mergeable view, and two tables to make up the non-mergeable view:


create table t1
as
select
        trunc((rownum-1)/15)    n1,
        trunc((rownum-1)/15)    n2,
        rpad(rownum,180)        v1
from
        dual
connect by
        level <= 3000
;


create table t2
as
select
        mod(rownum,200)         n1,
        mod(rownum,200)         n2,
        rpad(rownum,180)        v1
from
        dual
connect by
        level <= 3000;
create index t1_i1 on t1(n1);
create index t2_i1 on t2(n1);

begin
        dbms_stats.gather_table_stats(
                user,
                't1',
                method_opt => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                user,
                't2',
                method_opt => 'for all columns size 1'
        );
end;
/

create or replace view v1
as
select distinct
        t1.n1 t1n1, t1.n2 t1n2, t2.n2 t2n2
from
        t1, t2
where
        t1.n1 = t2.n1
;

create table t3
as
select * from v1
;

begin
        dbms_stats.gather_table_stats(
                user,
                't3',
                method_opt => 'for all columns size 1'
        );
end;
/

I’ve created the table t3 by copying the content of the view v1 and I’m going to update every row in t3 from v1; I gathered stats on t1 and t2 before creating the view and table simply to avoid the need for Oracle to do dynamic sampling as it created t3. Depending on your version of Oracle, of course, the stats collections might be redundant.

Having set the scene with the data, here’s the “original” code for doing the required update, followed by its execution plan (pulled from the memory of a 12.1.0.2 instance):


set serveroutput off
set linesize 180
set trimspool on

alter session set statistics_level = all;

spool cte_update

update t3
        set t2n2 = (
                select  v1.t2n2
                from    v1
                where   v1.t1n1 = t3.t1n1
                and     v1.t1n2 = t3.t1n2
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT                         |       |      1 |        |      0 |00:00:01.22 |   46745 |       |       |          |
|   1 |  UPDATE                                  | T3    |      1 |        |      0 |00:00:01.22 |   46745 |       |       |          |
|   2 |   TABLE ACCESS FULL                      | T3    |      1 |    200 |    200 |00:00:00.01 |       3 |       |       |          |
|   3 |   VIEW                                   | V1    |    200 |      1 |    200 |00:00:01.22 |   46332 |       |       |          |
|   4 |    SORT UNIQUE                           |       |    200 |      1 |    200 |00:00:01.21 |   46332 |  2048 |  2048 | 2048  (0)|
|   5 |     NESTED LOOPS                         |       |    200 |      1 |  45000 |00:00:01.11 |   46332 |       |       |          |
|   6 |      NESTED LOOPS                        |       |    200 |      1 |  45000 |00:00:00.34 |    1332 |       |       |          |
|*  7 |       TABLE ACCESS BY INDEX ROWID BATCHED| T1    |    200 |      1 |   3000 |00:00:00.02 |     684 |       |       |          |
|*  8 |        INDEX RANGE SCAN                  | T1_I1 |    200 |     15 |   3000 |00:00:00.01 |     408 |       |       |          |
|*  9 |       INDEX RANGE SCAN                   | T2_I1 |   3000 |      1 |  45000 |00:00:00.11 |     648 |       |       |          |
|  10 |      TABLE ACCESS BY INDEX ROWID         | T2    |  45000 |      1 |  45000 |00:00:00.31 |   45000 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - filter("T1"."N2"=:B1)
   8 - access("T1"."N1"=:B1)
   9 - access("T2"."N1"=:B1)
       filter("T1"."N1"="T2"."N1")

Points to note from this execution plan: the VIEW operation at line 3 has started 200 times (there are 200 rows in table t3, the subquery runs once per row); and a simple measure of work done is the 46,745 buffer visits (of which, I can tell you, roughly 400 are current block gets) reported under Buffers in the top line of the plan.

It’s an interesting detail that although Oracle has pushed the correlation predicates inside the view (as shown by the predicate section for operations 7,8 and 9) it doesn’t report the operation at line 3 as “VIEW PUSHED PREDICATE”. It would be nice to see the explicit announcement of predicate pushing here, but that seems to be an expression reserved for pushing join predicates into views – fortunately we always check the predicate section, don’t we!

Now let’s see what the SQL and plan look like if we want Oracle to create the entire v1 result set and use that to update the t3 table.

update t3 
        set t2n2 = (
                with v0 as (
                        select
                                /*+ materialize */
                                t1n1, t1n2, t2n2
                        from v1
                )
                select
                        t2n2
                from
                        v0
                where   v0.t1n1 = t3.t1n1
                and     v0.t1n2 = t3.t1n2
        )
;

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT            |                            |      1 |        |      0 |00:00:00.19 |    1185 |      1 |      1 |       |       |          |
|   1 |  UPDATE                     | T3                         |      1 |        |      0 |00:00:00.19 |    1185 |      1 |      1 |       |       |          |
|   2 |   TABLE ACCESS FULL         | T3                         |      1 |    200 |    200 |00:00:00.01 |       3 |      0 |      0 |       |       |          |
|   3 |   TEMP TABLE TRANSFORMATION |                            |    200 |        |    200 |00:00:00.18 |     778 |      1 |      1 |       |       |          |
|   4 |    LOAD AS SELECT           |                            |      1 |        |      0 |00:00:00.01 |     171 |      0 |      1 |  1040K|  1040K|          |
|   5 |     VIEW                    | V1                         |      1 |  45000 |    200 |00:00:00.01 |     168 |      0 |      0 |       |       |          |
|   6 |      HASH UNIQUE            |                            |      1 |  45000 |    200 |00:00:00.01 |     168 |      0 |      0 |  1558K|  1558K| 3034K (0)|
|*  7 |       HASH JOIN             |                            |      1 |  45000 |  45000 |00:00:00.01 |     168 |      0 |      0 |  1969K|  1969K| 1642K (0)|
|   8 |        TABLE ACCESS FULL    | T1                         |      1 |   3000 |   3000 |00:00:00.01 |      84 |      0 |      0 |       |       |          |
|   9 |        TABLE ACCESS FULL    | T2                         |      1 |   3000 |   3000 |00:00:00.01 |      84 |      0 |      0 |       |       |          |
|* 10 |    VIEW                     |                            |    200 |  45000 |    200 |00:00:00.17 |     603 |      1 |      0 |       |       |          |
|  11 |     TABLE ACCESS FULL       | SYS_TEMP_0FD9D6618_911FB4C |    200 |  45000 |  40000 |00:00:00.08 |     603 |      1 |      0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - access("T1"."N1"="T2"."N1")
  10 - filter(("V0"."T1N1"=:B1 AND "V0"."T1N2"=:B2))

The headline figure to note is the 1,185 buffer visits – clearly we’ve done something very different (and possibly cheaper and faster, even in this tiny demonstration). Looking at operation 3 we see the “TEMP TABLE TRANSFORMATION”, which tells us that we’ve materialized our factored subquery. There is scope, though, for a little ambiguity and uncertainty – the Starts column for this operation says we started it 200 times, once for each row in t3. We might worry that we’ve actually recreated the result and written it to disc 200 times – until we notice that lines 4 – 9 tell us that we loaded the temporary table just once (Starts = 1).

You could take my word for it that we didn’t “do” the temp table transformation 200 times, we merely used the result of the temp table transformation 200 times; but I wasn’t prepared to make this assumption until I had done a little more checking, so there’s no reason why you shouldn’t still be a little suspicious. Lines 4 – 9 do seem to tell us (consistently) that we only load the data once, but there have been occasional bugs where counters have been reset to zero when they shouldn’t have been, so the fact that we see (for example, at operation 8) “1 full tablescan of t1 returning 3,000 rows after visiting 84 buffers” may mean that Oracle counted the work once and “forgot” to count it the other 199 times.

It’s easy enough to do a quick cross-check. Take a snapshot of v$mystat joined to v$statname before and after running the query, and check the difference in buffer visits, tablescans, and tablescan rows gotten – if those figures are broadly consistent with the figures in the execution plan I think we can be reasonably confident that the plan is telling us the truth.
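If you don’t have a snapshot package to hand, a very crude version of the idea looks something like this – just a sketch, with a made-up table name, rather than the code I actually used:

create table stats_before as
select  sn.name, ms.value
from    v$mystat ms, v$statname sn
where   sn.statistic# = ms.statistic#
;

rem     run the test statement, then report the deltas

select  sn.name, ms.value - sb.value  delta
from    v$mystat ms, v$statname sn, stats_before sb
where   sn.statistic# = ms.statistic#
and     sb.name       = sn.name
and     ms.value - sb.value != 0
order by
        sn.name
;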

Here’s what we get for a few key figures:

Name                                       Value
----                                       -----
session logical reads                      1,472
db block gets                                412
consistent gets                            1,060
consistent gets from cache                 1,060
db block changes                             410
table scans (short tables)                   205
table scan rows gotten                    46,213
table scan blocks gotten                     366

There are a number of oddities – not to mention version and feature dependent variations – in the numbers and a couple of discrepancies introduced by the code I was using to take the snapshot, but the “table scan rows gotten” figure is particularly easy to see in the execution plan:

46,213 ≈ 46,200 = 3,000 (t1) + 3,000 (t2) + 200 (t3) + 200 * 200 (temp table)

Allowing for a small error, the number of “table scans (short tables)” is also consistent with the plan Starts – and that’s perhaps the most important indicator: we scan t1 and t2 just once, and the temp table result 200 times. If we were creating the temp table 200 times we’d have to have done over 400 table scans (200 each for t1 and t2).

I won’t go into the details of how to compare the session logical I/O to the total Buffer gets for the plan – but the figures are in the right ballpark as far as matching is concerned – if the plan was deceiving us about the number of times the temporary table was created (rather than used) the session stats would have to report a figure more like 33,600 (200 * (84 + 84)) consistent gets.

Conclusion

We have managed to reduce the workload from “one view instantiation per row” to “one view instantiation” with a very small change to the SQL. In the case of the OP this should result in a small, easily comprehensible, change in the SQL statement leading to a drop in run-time from several hours to seven minutes – and maybe that’s good enough for the present.

Reference Script: cte_update.sql

 

January 6, 2016

NLS Mess

Filed under: Bugs,CBO,Execution plans,Function based indexes,Indexing,Oracle — Jonathan Lewis @ 1:18 pm BST Jan 6,2016

The Oracle database has all sorts of little details built into it to help it deal with multi-national companies, but since they’re not commonly used you can find all sorts of odd “buggy” bits of behaviour when you start to look closely. I have to put “buggy” in quotes because some of the reported oddities are the inevitable consequences of (for example) how multi-byte character sets have to work; but some of the oddities look as if they simply wouldn’t be there if the programmer writing the relevant bit of code had remembered that they also had to cater for some NLS feature.

Here’s an example of the type of unexpected behaviour that can appear. There probably are some bugs in the area I’m going to demonstrate but, at first glance, I thought I was looking at an acceptable limitation imposed by a generic requirement. The example came from AskTom, which is why the data set isn’t my usual “t1” generation (and the formatting and capitalisation isn’t according to my usual standards).

The problem involves Case Insensitive indexing.


ALTER session SET nls_sort=binary_ci;
ALTER session SET nls_comp=linguistic;

CREATE TABLE log_data(
  account_id NUMBER,
  log_type NUMBER,
  sys_name VARCHAR2(30),
  log_time TIMESTAMP,
  msg varchar2(4000)
)
nologging
;

insert /*+ append */ into log_data(
  account_id,
  log_type,
  sys_name,
  log_time,
  msg
)
select
        5,
        2,
        dbms_random.string('a',1),
        sysdate + dbms_random.value,
        rpad('x',200)
from
        dual
connect by
        level <= 26000
;


create index log_date on log_data (
        account_id, 
        log_type, 
--      sys_name,
        NLSSORT(sys_name,'NLS_SORT=BINARY_CI'),
        log_time
)
nologging
;
  
rem     ======================================================================
rem     Need to gather stats AFTER index creation because of the hidden column
rem     ======================================================================
  
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'LOG_DATA',
                method_opt       => 'for all columns size 1'
        );
end;
/

And here’s the query I want to optimize:


SELECT 
        *
FROM
  (
    SELECT
        sys_name, log_time,  substr(msg,1,40) msg
    FROM log_data
    WHERE
      account_id=5
      AND log_type=2
      AND sys_name='a'
    ORDER BY
      log_time  desc
  )
WHERE
  rownum <= 10
;

The requirement of the query is that we see the ten most recent entries for a given combination of account_id, log_type and sys_name (ignoring case in sys_name). The original table has tens of millions of rows, of course, with many combinations, and some of the combinations have a very large number of entries – hence the desire to find an access path that gets just the 10 rows we want without getting all the rows for a combination and sorting them before returning the ten.

Normally we would just create an index starting with the 3 columns used in the equality predicates and ending with the column in the order by clause, and that would be enough for the optimizer to see the option for a “sort order by nosort” operation to get the required data through an index range scan; so that’s the index the code sample creates, except that since we’ve enabled case-insensitive sorting we need to use a function-based index to hold the case-insensitive version of sys_name.
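In other words, if case-insensitivity weren’t a requirement the index would simply be something like this (shown only for comparison, it’s not part of the test script):

create index log_date on log_data(account_id, log_type, sys_name, log_time) nologging;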

Here’s the execution plan we would get if we DIDN’T use the nlssort() function in the index – I’ve run the query in 11.2.0.4 and pulled the plan from memory with rowsource execution stats enabled:


---------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |      1 |        |   605 (100)|     10 |00:00:00.02 |    1065 |       |       |          |
|*  1 |  COUNT STOPKEY                 |          |      1 |        |            |     10 |00:00:00.02 |    1065 |       |       |          |
|   2 |   VIEW                         |          |      1 |    500 |   605   (1)|     10 |00:00:00.02 |    1065 |       |       |          |
|*  3 |    SORT ORDER BY STOPKEY       |          |      1 |    500 |   605   (1)|     10 |00:00:00.02 |    1065 |  2048 |  2048 | 2048  (0)|
|   4 |     TABLE ACCESS BY INDEX ROWID| LOG_DATA |      1 |    500 |   603   (1)|    966 |00:00:00.01 |    1065 |       |       |          |
|*  5 |      INDEX RANGE SCAN          | LOG_DATE |      1 |    500 |   103   (3)|    966 |00:00:00.01 |     100 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   3 - filter(ROWNUM<=10)
   5 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2)
       filter(NLSSORT("SYS_NAME",'nls_sort=''BINARY_CI''')=HEXTORAW('6100') )

Notice particularly the filter predicate at operation 5: that’s the thing we need to get into the index before we can avoid picking up excess data and sorting it. Notice also in the A-Rows column that we acquired 966 rows from the table before sorting and discarding all but 10 of them at operation 3.

Notice especially how important it is to look at the predicate section of an execution plan to gain a full understanding of what’s happening.

So here’s the execution plan we get by default with the function-based index in place:


----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |      1 |        |    13 (100)|     10 |00:00:00.01 |     969 |       |       |          |
|*  1 |  COUNT STOPKEY                  |          |      1 |        |            |     10 |00:00:00.01 |     969 |       |       |          |
|   2 |   VIEW                          |          |      1 |     11 |    13   (0)|     10 |00:00:00.01 |     969 |       |       |          |
|*  3 |    SORT ORDER BY STOPKEY        |          |      1 |     11 |    13   (0)|     10 |00:00:00.01 |     969 |  2048 |  2048 | 2048  (0)|
|   4 |     TABLE ACCESS BY INDEX ROWID | LOG_DATA |      1 |   1000 |    13   (0)|    966 |00:00:00.01 |     969 |       |       |          |
|*  5 |      INDEX RANGE SCAN DESCENDING| LOG_DATE |      1 |     11 |     2   (0)|    966 |00:00:00.01 |       5 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   3 - filter(ROWNUM<=10)
   5 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2 AND "LOG_DATA"."SYS_NC00006$"=HEXTORAW('6100') )

It didn’t work ! (Check the A-Rows at operations 4 and 5, and the sort that we didn’t want at operation 3 where the data is finally reduced to 10 rows.)

But there’s something odd going on here – look at the predicate section: our three predicates are all access predicates for the index range scan descending. We are doing exactly what we want to do with the index, but we’re not stopping after the 10 rows that we need, we’re getting all of them (in the order we want) and then doing a trivial sort and discard. Look at the Cost column – the cost at operation 4 is exactly what we might expect for the 10 rows we want to see, and the E-rows at line 5 is clearly based on our “first 10 rows” requirement.

This raises two questions:

  1. What’s gone wrong ?
  2. Can we work around the problem ?

The answer to (1) is, I think, that there’s a bug in the code. Looking at the 10053 trace file I can see the optimizer correctly handling the arithmetic of the virtual column (the sys_nc00006$) representing the function in the index and then getting to the point where it goes into a code section relating to “Recost for ORDER BY”, and brings back the original function as a filter predicate – I think that in the recosting it may be losing track of the fact that sys_nc00006$ and nlssort(sys_name, ‘nls_sort=binary_ci’) are the same thing and therefore can’t apply the rule about “Equality on 1st N columns, order by on the remainder”.

There are several answers to (2).

Workarounds

The honest hack

The first one is simply to fall back to the old (probably version 7, possibly version 8) requirement for getting the “sort order by nosort” operation – put all the index columns into the order by clause. Unfortunately the optimizer then did a tablescan rather than an index range scan because my data set was so small, so I had to hack the system stats temporarily to make the tablescan very expensive:


begin
        dbms_stats.set_system_stats('MBRC',2);
        dbms_stats.set_system_stats('MREADTIM',20); 
        dbms_stats.set_system_stats('SREADTIM',5);
        dbms_stats.set_system_stats('CPUSPEED',1000); 
end;
/

... order by account_id desc, log_type desc, sys_name desc, log_time desc

Unfortunately the optimizer still went wrong – it did an ASCENDING index range scan sorting all the data. I actually had to hint the code to use the index in descending order to get the following execution plan:


------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |      1 |        |  1215 (100)|     10 |00:00:00.01 |      13 |
|*  1 |  COUNT STOPKEY                 |          |      1 |        |            |     10 |00:00:00.01 |      13 |
|   2 |   VIEW                         |          |      1 |   1000 |  1215   (1)|     10 |00:00:00.01 |      13 |
|   3 |    TABLE ACCESS BY INDEX ROWID | LOG_DATA |      1 |   1000 |  1006   (1)|     10 |00:00:00.01 |      13 |
|*  4 |     INDEX RANGE SCAN DESCENDING| LOG_DATE |      1 |   1000 |     5   (0)|     10 |00:00:00.01 |       3 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   4 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2 AND "LOG_DATA"."SYS_NC00006$"=HEXTORAW('6100') )

The A-Rows tells us we’ve accessed the minimum data set, and the absence of the SORT ORDER BY STOPKEY operation tells us that we’ve avoided doing the sort. Notice, though, that the cost is the cost that would have been appropriate if we had accessed all 1,000 rows that matched the equality predicates. This is an example of a plan that you couldn’t really trust if all you had done was an “explain plan” rather than running the query and checking the rowsource execution stats. If you ignore the A-Rows it looks as if the plan WOULD get all the data in order and only eliminate the redundant rows at operation 1.

The silly surprise

The original author of the problem came up with this one. Put in two predicates which, between them are equivalent to the original requirement:


where ...
and     sys_name >= 'a'
and     sys_name <= 'a'

Clearly this is totally silly – the optimizer can fold this pair of predicates into the single predicate “sys_name = ‘a'”, so it shouldn’t make any difference. But here’s the execution plan:

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |      1 |        |    13 (100)|     10 |00:00:00.01 |      13 |
|*  1 |  COUNT STOPKEY                 |          |      1 |        |            |     10 |00:00:00.01 |      13 |
|   2 |   VIEW                         |          |      1 |     11 |    13   (0)|     10 |00:00:00.01 |      13 |
|   3 |    TABLE ACCESS BY INDEX ROWID | LOG_DATA |      1 |   1000 |    13   (0)|     10 |00:00:00.01 |      13 |
|*  4 |     INDEX RANGE SCAN DESCENDING| LOG_DATE |      1 |     11 |     2   (0)|     10 |00:00:00.01 |       3 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   4 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2 AND "LOG_DATA"."SYS_NC00006$"=HEXTORAW('6100') )

Yes, it’s (structurally) exactly the same plan, with exactly the same predicate section except that (a) it gets there without being hinted, (b) the Cost column looks appropriate all down the line, and (c) the E-Rows value for the VIEW operator would have helped us appreciate that the correct elimination was (probably) going to happen if all we had done was the Explain Plan.

The dirty hack

I know the name of the hidden column that’s causing the problem, and I know how to generate the value it has to be – so let’s give Oracle exactly what it needs to see rather than allowing its internal transformation to rewrite the SQL:

...
AND sys_nc00006$ = nlssort('a','nls_sort=binary_ci')
...


------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |      1 |        |    13 (100)|     10 |00:00:00.01 |      13 |
|*  1 |  COUNT STOPKEY                 |          |      1 |        |            |     10 |00:00:00.01 |      13 |
|   2 |   VIEW                         |          |      1 |     11 |    13   (0)|     10 |00:00:00.01 |      13 |
|   3 |    TABLE ACCESS BY INDEX ROWID | LOG_DATA |      1 |   1000 |    13   (0)|     10 |00:00:00.01 |      13 |
|*  4 |     INDEX RANGE SCAN DESCENDING| LOG_DATE |      1 |     11 |     2   (0)|     10 |00:00:00.01 |       3 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   4 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2 AND "SYS_NC00006$"=HEXTORAW('6100') )

We get exactly the plan we need – and the silly thing about this example is that it’s a case where we get the plan we want by EXPLICITLY transforming the SQL to reproduce the transformation that Oracle had done IMPLICITLY and then messed up !

Final Choice
Of the three options – the dirty hack is definitely a no-no in production; the “double the predicate” trick is undesirable because it may depend in some unexpected way on a particular optimizer bug or on some statistical detail that could change; so I’d choose the hinted path with the (nominally) redundant columns.

One final point about this solution, we actually needed to include only the sys_name in the order by clause to use the descending range scan and early stop – which is basically another indication that it’s something about the function-based column that is breaking the normal code path.
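Presumably that means an order by clause along the following lines – again just a sketch, echoing the fragment shown earlier:

... order by sys_name desc, log_time desc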

Reference Script: nls_sort_anomaly.sql

January 4, 2016

ANSI bug

Filed under: ANSI Standard,Bugs,Oracle — Jonathan Lewis @ 1:12 pm BST Jan 4,2016

In almost all cases the SQL you write using the ANSI standard syntax is transformed into a statement using Oracle’s original syntax before being optimised – and there are still odd cases where the translation is not ideal. This can result in poor performance; it can also result in wrong results. The following examples arrived in my in-tray a couple of weeks ago:

with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select *
    from (
        select sum(table3.table2.my_number) the_answer
            from table1
            left join table2 on table1.my_number = table2.my_number
            group by table1.my_number
        );


with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select sum(table3.table2.my_number) the_answer
    from table1
    left join table2 on table1.my_number = table2.my_number
    group by table1.my_number;

Notice the reference to table3.table2.my_number in the select list of both queries – where does the “table3” bit come from ? These queries should result in Oracle error ORA-00904: “TABLE3″.”TABLE2″.”MY_NUMBER”: invalid identifier.

If you’re running 11.2.0.4 (and, probably, earlier versions) both queries produce the following result:


THE_ANSWER
----------
         1

1 row selected.

If you’re running 12.1.0.2 the first query produces the ORA-00904 error that it should do, but the second query still survives to produce the same result as 11.2.0.4.

January 1, 2016

Expert

Filed under: Oracle,Philosophy — Jonathan Lewis @ 1:02 pm BST Jan 1,2016

I was sent the following email a few years ago. It’s a question that comes up fairly frequently and there’s no good answer to it but, unusually, I made an attempt to produce a response; and I’ve decided that I’d start this year by presenting the question and quoting the answer I gave. So here, with no editing, is the question:

I’m disturbing you for some help about becoming an Oracle master expert. Probably you are getting this kind of emails a lot but I would be appreciate if you give a small answer to me at least.

First, shortly I want to introduce my self. I’m an *Oracle Trainer* in Turkey Oracle University for 2 years. Almost for 4 years, I worked as software engineer and meet with Oracle on these days. After a while I decided to develop myself in Oracle database technologies and become trainer as i said. I also give consultancy services about SQL / PLSQL development and especially* SQL / PLSQL tuning*. I really dedicate myself to these subjects. As a trainer I also give DBA workshop lectures but in fact I didnt actually did dba job in a production system. I have the concept and even read everything I found about it but always feel inadequate because didnt worked as a DBA on a production system. So many DBA’s has taken my class and they were really satisfied (they have got all answers for their questions) but I did not. I’m a good trainger (with more that 97 average points in oracle evaluations) but I want to be best.

Even in SQL / PLSQL tuning, I know that I am really good at it but I also aware that there are some levels and I can not pass through the next level. for ex: I can examine execution plan (index structures, access paths etc), find cpu and io consumption using hierarchical profiler and solve the problem but can’t understand yet how to understand how much IO consumed by query and understand slow segments. if you remember, for a few days ago, on OTN you answered a question that I involved about sequence caching and Log file sync event. There, I said that sequence can cause to log file sync event (and as you said that was true) but when someone else write a simple code and couldnt see this event, I couldnt answer to him, you did (you said that it was because optimizing).

that is the level what i want to be. I am really working on this and age on 29. but whatever I do I cant get higher. I need a guideness about that. I even worked free for a while (extra times after my job here). I need your guideness, as I said I can work with you if you want to test and I want to learn more advanced topics while working. In Turkey, I couldn’t find people who can answer my questions so I can not ask for guideness to them.

And my (impromptu, and unedited) reply:

Thank you for your email. You are correct, I do get a lot of email like this, and most of it gets a stock response; but yours was one of the most intelligently written so I’ve decided to spend a little time giving you a personal answer.

Even if you were to spend a few years as a DBA, you would probably not become the sort of expert you want to be. Most DBAs end up dealing with databases that, for want of a better word, we could call “boring”; for a database to be interesting and show you the sorts of problems where you have to be able to answer the types of question I regularly answer you probably need to be the DBA for a large banking or telecoms system – preferably one that hasn’t been designed very well – that has to handle a very large volume of data very quickly. On these extreme systems you might find that you keep running into boundary conditions in Oracle that force you to investigate problems in great detail and learn all sorts of strange things very quickly. On most other systems you might run into a strange problem very occasionally and spend several years on the job without once being forced to solve any difficult problems very quickly.

If you want to become an expert, you need to be a consultant so you get to see a lot of problems on lots of different systems in a very short time; but you can’t really become a consultant until you’re an expert. As a substitute, then, you need to take advantage of the problems that people report on the OTN database forum – but that doesn’t mean just answering questions on OTN. Look for the problems which people have described reasonably well that make you think “why would that happen”, then try to build a model of the problem that has been described and look very closely at all the statistics and wait events that change as you modify the model. Creating models, and experimenting with models, is how you learn more.

Take, for example, the business of the sequences and pl/sql – you might run the test as supplied with SQL_trace enabled to see what that showed you, you could look very carefully at the session stats for the test and note the number of redo entries, user commits, and transactions reported; you could look at the statistics of enqueue gets and enqueue releases, ultimately you might dump the redo log file to see what’s going into it. Many of the tiny little details I casually report come from one or two days of intense effort studying an unexpected phenomenon.  (The log file sync one was the result of such a study about 15 years ago.)

Happy new year to all my readers.

December 29, 2015

Column Groups

Filed under: extended stats,Oracle,Statistics,Tuning — Jonathan Lewis @ 1:13 pm BST Dec 29,2015

I think the “column group” variant of extended stats is a wonderful addition to the Oracle code base, but there’s a very important detail about using the feature that I hadn’t really noticed until a question came up on the OTN database forum recently about a very bad join cardinality estimate.

The point is this: if you have a multi-column equality join and the optimizer needs some help to get a better estimate of join cardinality then column group statistics may help if you create matching stats at both ends of the join. There is a variation on this directive that helps to explain why I hadn’t noticed it before – multi-column indexes (with exactly the correct columns) have the same effect and, most significantly, the combination of one column group and a matching multi-column index will do the trick.

Here’s some code to demonstrate the effect:

create table t8
as
select
        trunc((rownum-1)/125)   n1,
        trunc((rownum-1)/125)   n2,
        rpad(rownum,180)        v1
from
        all_objects
where
        rownum <= 1000
;

create table t10
as
select
        trunc((rownum-1)/100)   n1,
        trunc((rownum-1)/100)   n2,
        rpad(rownum,180)        v1
from
        all_objects
where
        rownum <= 1000
;
begin
        dbms_stats.gather_table_stats(
                user,
                't8',
                method_opt => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                user,
                't10',
                method_opt => 'for all columns size 1'
        );
end;
/

set autotrace traceonly

select
        t8.v1, t10.v1
from
        t8,t10
where
        t10.n1 = t8.n1
and     t10.n2 = t8.n2
/

set autotrace off

Table t8 has eight distinct values for n1 and n2, and 8 combinations (though the optimizer will assume there are 64 combinations); table t10 has ten distinct values for n1 and n2, and ten combinations (though the optimizer will assume there are 100 combinations). In the absence of any column group stats (or histograms, or indexes) and with no filter predicates on either table, the join cardinality will be “{Cartesian Join cardinality} * {join selectivity}”, and in the absence of any nulls the join selectivity – thanks to the “multi-column sanity check” – will be 1/(greater number of distinct combinations). So we get 1,000,000 / 100 = 10,000.
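As a worked calculation, using the numbers the optimizer believes:

join cardinality = (1,000 * 1,000) * 1 / greatest(64, 100)
                 = 1,000,000 / 100
                 = 10,000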

Here’s the output from autotrace in 11.2.0.4 to prove the point:


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 10000 |  3652K|    11  (10)| 00:00:01 |
|*  1 |  HASH JOIN         |      | 10000 |  3652K|    11  (10)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T8   |  1000 |   182K|     5   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T10  |  1000 |   182K|     5   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T10"."N1"="T8"."N1" AND "T10"."N2"="T8"."N2")


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        835  consistent gets
          0  physical reads
          0  redo size
   19965481  bytes sent via SQL*Net to client
      73849  bytes received via SQL*Net from client
       6668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     100000  rows processed

As you can see, the query actually returns 100,000 rows. The estimate of 10,000 is badly wrong thanks to the correlation between the n1 and n2 columns. So let’s check the effect of creating a column group on t10:


begin
        dbms_stats.gather_table_stats(
                user,
                't10',
                method_opt => 'for all columns size 1 for columns (n1,n2) size 1'
        );
end;
/
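
As a quick check that the call has created the column group you can query the dictionary. Again, this is only a sketch – it assumes the default behaviour, where the extension shows up as a hidden virtual column with a system-generated “SYS_STU…” name, and after the gather its num_distinct should be 10:

select
        extension_name, extension
from
        user_stat_extensions
where
        table_name = 'T10'
;

select
        column_name, num_distinct
from
        user_tab_cols
where
        table_name    = 'T10'
and     hidden_column = 'YES'           -- the column group appears as a hidden virtual column
;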

At this point you might expect the optimizer’s sanity check to say something like: “t8 table: 64 combinations; t10 column group: 10 combinations – so use 64, which is now the greater num_distinct”. It doesn’t – maybe it will in some future version, but at present the optimizer code doesn’t seem to recognize this as a possibility. (I won’t bother to reprint the unchanged execution plan.)

But, at this point, I could create an index on t8(n1,n2) and run the query again:


create index t8_i1 on t8(n1, n2);

select
        t8.v1, t10.v1
from
        t8,t10
where
        t10.n1 = t8.n1
and     t10.n2 = t8.n2
/

Index created.


100000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 216880280

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   100K|    35M|    12  (17)| 00:00:01 |
|*  1 |  HASH JOIN         |      |   100K|    35M|    12  (17)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T8   |  1000 |   182K|     5   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T10  |  1000 |   182K|     5   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T10"."N1"="T8"."N1" AND "T10"."N2"="T8"."N2")
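
The index can stand in for a column group at the t8 end of the join because its statistics record the number of distinct key combinations. A quick sketch of the check (assuming the index statistics were gathered as part of the create index, which has been the default since 10g) – distinct_keys should report 8:

select
        index_name, distinct_keys
from
        user_indexes
where
        table_name = 'T8'
;

With 8 distinct combinations known at t8 and 10 at t10 the join selectivity becomes 1/10, so the estimate is 1,000,000 / 10 = 100,000, matching the actual row count.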

Alternatively I could drop the index and create a column group on the t8 table instead:



drop index t8_i1;

begin
        dbms_stats.gather_table_stats(
                user,
                't8',
                method_opt => 'for all columns size 1 for columns (n1,n2) size 1'
        );
end;
/

select  
        t8.v1, t10.v1 
from
        t8,t10
where
        t10.n1 = t8.n1
and     t10.n2 = t8.n2
/

Index dropped.


PL/SQL procedure successfully completed.


100000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 216880280

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   100K|    35M|    12  (17)| 00:00:01 |
|*  1 |  HASH JOIN         |      |   100K|    35M|    12  (17)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T8   |  1000 |   182K|     5   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T10  |  1000 |   182K|     5   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T10"."N1"="T8"."N1" AND "T10"."N2"="T8"."N2")
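
With a column group at each end of the join the optimizer again knows about the 8 and 10 distinct combinations, uses 1/greatest(8, 10) as the join selectivity, and arrives at 1,000,000 / 10 = 100,000 once more. If you want to see the two extension columns side by side, something like the following will do (again just a sketch, assuming both tables are in the current schema):

select
        table_name, column_name, num_distinct
from
        user_tab_cols
where
        table_name    in ('T8','T10')
and     hidden_column = 'YES'
order by
        table_name
;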


If you’re wondering why I’ve not picked up this “both ends” detail in the past – it’s because I’ve usually been talking about replacing indexes with column groups, and my examples have probably started with indexes at both ends of the join before I replaced one index with a column group. (The other examples I’ve given of column groups are typically about single-table access rather than joins.)
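
For completeness, here’s a sketch of the set-up I would normally have started from – a multi-column index at each end of the join and no column groups at all. According to the opening comments this gives the optimizer the same information, so the estimate should again be 100,000 (the index names here are purely illustrative):

create index t8_i1  on t8(n1, n2);
create index t10_i1 on t10(n1, n2);             -- illustrative name

select
        t8.v1, t10.v1
from
        t8,t10
where
        t10.n1 = t8.n1
and     t10.n2 = t8.n2
/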

 

December 23, 2015

Testing

Filed under: Oracle,Philosophy — Jonathan Lewis @ 12:56 pm BST Dec 23,2015

This post is a 100% copy of a message that Tanel Poder sent to the Oracle-L mailing list in response to a thread about the performance of SSD. It’s not just a good answer to the question; it’s a wonderfully succinct insight into how to think about what you’re really testing, and it displays the mind-set that should be adopted by everyone.

If you measure write performance on an idle Exadata machine without any other load going on, you are not comparing flash vs disk, you are comparing flash vs the battery-backed 512MB RAM cache in the “RAID” controllers within each storage cell!

This is how the “disk” that’s supposed to have a couple of milliseconds of average latency (it still rotates and needs to seek + calibrate to next track even in sequential writes) gives you sub-millisecond write latencies… it’s not the disk write, it’s the controller’s RAM write that gets acknowledged.

And now when you run a real workload on the machine (lots of random IOs on the disk and Smart Scans hammering them too), your disk writes won’t be always acknowledged by the controller RAM cache. When comparing *busy* flash disks to *busy* spinning disks vs. *idle* flash disks vs *idle* spinning disks (with non-dirty write cache) you will get different results.

So, I’m not arguing here that flash is somehow faster for sequential writes than a bunch of disks when talking about throughput. But if you care about latency (of your commits) you need to be aware of everything else that will be going on on these disks (and account for this in your benchmarks).

Without queueing time included, a busy flash device will “seek” where needed and perform the write in under a millisecond, a busy disk device in 6-10 milliseconds. So your commits will end up having to wait for longer (yes, your throughput will be ok due to the LGWR writing multiple transactions redo out in a single write, but this doesn’t change the fact that individual commit latency suffers).

This latency issue of course will be mitigated when you are using a decent storage array with enough (well-managed) write cache.

So I’d say there are the following things you can compare (and need to be aware of which hardware are you really benchmarking):

1) Flash storage
2) Disk storage without (write) cache
3) Disk storage with crappy (write) cache
4) Disk storage with lots of well-managed & isolated (write) cache

And the second thing to be aware of:

1) Are you the single user on an idle storage array
2) Are you just one of the many users in a heavily utilized (and randomly seeking) storage array

So, as usual, run a realistic workload and test it out yourself (if you have the hardware :)
