Oracle Scratchpad

March 20, 2016

Index Speculation

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 10:32 pm BST Mar 20,2016

There’s a current question on the OTN database forum as follows (with a little cosmetic adjustment):

I have a request for 3 indices as shown below. Does the 1st index suffice for 2 and 3?  Do I need all 3?

  • create index idx_atm_em_eff_ver_current_del on atm_xxx_salary (employee_key, effective_dt, salary_version_number, is_current, is_deleted);
  • create index idx_atm_em_ver_current on atm_xxx_salary (employee_key, salary_version_number, is_current);
  • create index .idx_atm_sal_current_del on atm_xxx_salary (employee_key, is_deleted, is_current);

In the absence of any information about the data and the application the correct answer is: “How could we possibly tell?”

On the other hand there’s plenty of scope for intelligent speculation, and that’s an important skill to practise because when we’re faced with a large number of options and very little information we need to be able to make best-guess choices about which ones are most likely to be worth the effort of pursuing in detail. So if we have to make some guesses about this table and the set of indexes shown, are there any reasonable guesses we might make.

I’ve highlighted the table name and leading column for the first index. The table seems to about salary and the leading column seems to identify an employee. In fact we see that all three indexes start with the employee_key and that may be what prompted the original question. Previous (“real-world”) experience tells me that employees are, generally, paid a salary and that salaries are likely to change (usually upwards) over time, and I note that another column in one of these indexes is effective_dt (date ?), and a third column (appearing in two of the indexes) is is_current.

This looks like a table of employee salaries recording their current and historic salaries, engineered with a little redundant information to make it easy to find the current salary. (Perhaps there’s a view of current_salary defined as is_current = ‘Y’ and is_deleted = ‘N’.)

It’s harder to speculate with any confidence on the columns is_deleted and salary_version_number;  why would a salary row be marked as deleted – is this something that happens when an employee leaves or an employee is deleted (or, following the pattern, has their is_deleted flag set to ‘Y’); why does a salary have a version number – does the table contain it’s own audit trail of errors and corrections, perhaps a correction is effected by marking the incorrect entry as deleted and incrementing its version number to generate the version number for the correct entry. Possibly the notional primary key of the table is (employee_key, effective_dt, is_deleted, salary_version_number).

The level of complexity surrounding these two columns could send further speculation in completely the wrong direction, but let’s follow the line that these two columns see very little action – let’s assume that most of the data is not “deleted” and virtually none of the data needs “versioning”. How does this assumption help us with the original question.

The largest employer in the world is the America Department of Defence with 3.2 million employees (following by the People’s Liberation Army of China with only 2.3 million employees), so an “employees” table is not really likely to be very big. How often does an employee have a salary review and change ? Would once per year be a reasonable figure to pluck from the air ? How many employess stay at the same company for 40 years – how many rows per employee would you end up with, and how scattered would they be through the salary table ?

Under any reasonable estimate it seems likely that if you created the first index (5 columns) then all the salary rows for a given employee are likely to be contained in a single leaf block, so if all the searches were driven by employee then that single index would allow exactly the correct set of table rows to be identified from one index leaf block access plus a little extra CPU.

Of course it’s possible that, with different circumstances, the size and clustering factor of the first index would be so much greater than the size and clustering factors of the other two that a query that would use one of the smaller indexes won’t use the larger index -but in this case the most significant contributor to the optimizer’s cost is likely to be the clustering_factor and given our assumption of the slow appearance over time of the new salaries for an employee the clustering factor of all three indexes is likely to be the same (probably very similar to the number of rows in the salary table).

Having got this far, it’s worth considering whether or not the salary table should actually be an index-organized table – it looks like an obvious candidate; how many other columns are there likely to be in a salary table ? Of course it’s worth thinking about other queries that might access a salary table without reference to the employees table at that point, perhaps a secondary index on (is_current, employee_key) might be appropriate, but in the absence of any other information we’ve reached the point where speculation needs to be backed up by some hard facts.

Bottom Line:

I wouldn’t guarantee that the first index makes the other two indexes redundant but it seems highly likely that it should and it’s probably worth spending some time looking at the requirements and numbers a little more closely – especially if you’re the US DoD or the Chinese People’s Liberation Army.

 

 

 

 

February 15, 2016

Connect By

Filed under: Execution plans,Hints,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 2:01 pm BST Feb 15,2016

I received an email a couple of days ago that was a little different from usual – although the obvious answer was “it’s the data”. A connect by query with any one of several hundred input values ran in just a few seconds, but with one specific input it was still running 4,000 seconds later using the same execution plan – was this a bug ?

There’s nothing to suggest that it should be, with skewed data anything can happen: even a single table access by exact index could take 1/100th of a second to return a result if there was only one row matching the requirement and 1,000 seconds if there were 100,000 rows in 100,000 different table blocks (and the table was VERY big). The same scaling problem could be true of any type of query – and “connect by” queries can expose you to a massive impact because their run time can increase geometrically as the recursion takes place.

So it was easy to answer the question – no it’s (probably) not a bug, check the data for that one value.

Then I decided to build a simple model. The original email had a four table join, but I just created a single table, and used a “no filtering” connect by which I had to hint. Here’s some code I ran on 11.2.0.4:


rem
rem     script: connect_by_skew.sql
rem     dated:  Feb 2016
rem     Last tested:
rem             12.1.0.2
rem

create table t1 nologging 
as
select 
        rownum id_p, 10 * rownum id
from
        all_objects
where 
        rownum <= 50000 ; execute dbms_stats.gather_table_stats(user,'t1', method_opt=>'for all columns size 1')

alter system flush shared_pool;

set serveroutput off
alter session set statistics_level = all;

select sum(ct) 
from    (
        select
                /*+ no_connect_by_filtering */
                count(id) ct
        from
                t1
        connect by
                id = 20 * prior id_p
        start with
                id_p = 1
        group by
                id
)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));

update t1 set id_p = 0
where   id_p = 1
;

update t1 set id_p = 1
where   id_p > 45000
;

select sum(ct) 
from    (
        select
                /*+ no_connect_by_filtering */
                count(id) ct
        from
                t1
        connect by
                id = 20 * prior id_p
        start with
                id_p = 1
        group by
                id
)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));

The sum() of the inline aggregate view emulates the original code – I don’t know what it was for, possibly it was a way of demonstrating the problem without producing a large output, I just copied it.

As you can see in my script every parent id (id_p) starts out unique, and if I look at the pattern of the raw data identified by the recursion from id_p = 1 (rather than looiking at the result of the actual query) this is what I’d get:

      ID_P         ID
---------- ----------
         1         10
         2         20
         4         40
         8         80
        16        160
        32        320
        64        640
       128       1280
       256       2560
       512       5120
      1024      10240
      2048      20480
      4096      40960
      8192      81920
     16384     163840
     32768     327680

When I modify the data so that I have exactly 5,000 rows with id_p = 1 the initial data generation will be 80,000 rows of data. If you want to try setting id_p = 1 for more rows make sure you do it to rows where id_p is already greater than 32768 or you’ll run into Oracle error ORA-01436: CONNECT BY loop in user data.

Here’s the execution plan, with rowsource execution stats I got for the first query (running 11.2.0.4):


-----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |      |      1 |        |    32 (100)|      1 |00:00:00.44 |     103 |       |       |          |
|   1 |  SORT AGGREGATE                            |      |      1 |      1 |            |      1 |00:00:00.44 |     103 |       |       |          |
|   2 |   VIEW                                     |      |      1 |      2 |    32   (7)|     16 |00:00:00.44 |     103 |       |       |          |
|   3 |    HASH GROUP BY                           |      |      1 |      2 |    32   (7)|     16 |00:00:00.44 |     103 |  1519K|  1519K| 1222K (0)|
|*  4 |     CONNECT BY NO FILTERING WITH START-WITH|      |      1 |        |            |     16 |00:00:00.44 |     103 |       |       |          |
|   5 |      TABLE ACCESS FULL                     | T1   |      1 |  50000 |    31   (4)|  50000 |00:00:00.10 |     103 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, this took 0.44 seconds, generated the expected 16 rows (still visible up to operation 2) which it then counted. Oracle followed the same execution plan when I set 5,000 rows to the critical value – here’s the new run-time plan:


-----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |      |      1 |        |    32 (100)|      1 |00:05:39.25 |     103 |       |       |          |
|   1 |  SORT AGGREGATE                            |      |      1 |      1 |            |      1 |00:05:39.25 |     103 |       |       |          |
|   2 |   VIEW                                     |      |      1 |      2 |    32   (7)|   5015 |00:05:39.24 |     103 |       |       |          |
|   3 |    HASH GROUP BY                           |      |      1 |      2 |    32   (7)|   5015 |00:05:39.22 |     103 |  5312K|  2025K| 1347K (0)|
|*  4 |     CONNECT BY NO FILTERING WITH START-WITH|      |      1 |        |            |  80000 |00:05:38.56 |     103 |       |       |          |
|   5 |      TABLE ACCESS FULL                     | T1   |      1 |  50000 |    31   (4)|  50000 |00:00:00.09 |     103 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------

As expected, 80,000 rows generated (5,000 * 16), aggregated down to 5,015, then aggregated again to the one row result. Time to complete: 5 minutes 39 seconds – and it was all CPU time. It’s not entirely surprising – a single recursive descent (with startup overheads) took 0.44 seconds – presumably a fairly large fraction of that was startup, but even 0.1 seconds adds up if you do it 5,000 times.

Everybody knows that skewed data can produced extremely variable response times. With a deeper tree and more rows with the special value it wouldn’t be hard for the total run time of this query to get to the 4,000 seconds reported in the original email. (I also tried running with 10,000 rows set to 1 and the run time went up to 18 minutes – of which a large fraction was reading from the TEMPORARY tablespace because something had overflowed to disc).

Was there a solution ?

I don’t know – but I did suggest two options
a) create a histogram on the data to show that there was one particular special value; since the code seemed to include literals perhaps the optimizer would notice the special case and choose a different plan.
b) hint the code to use a different strategy – the hint would be /*+ connect_by_filtering */. Here’s the resulting execution plan:


---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |      |      1 |        |    95 (100)|      1 |00:00:06.50 |    1751 |       |       |          |
|   1 |  SORT AGGREGATE              |      |      1 |      1 |            |      1 |00:00:06.50 |    1751 |       |       |          |
|   2 |   VIEW                       |      |      1 |      2 |    95   (6)|   5015 |00:00:06.49 |    1751 |       |       |          |
|   3 |    HASH GROUP BY             |      |      1 |      2 |    95   (6)|   5015 |00:00:06.47 |    1751 |  5312K|  2025K| 1346K (0)|
|   4 |     CONNECT BY WITH FILTERING|      |      1 |        |            |  80000 |00:00:06.30 |    1751 |   337K|   337K|  299K (0)|
|*  5 |      TABLE ACCESS FULL       | T1   |      1 |      1 |    31   (4)|   5000 |00:00:00.01 |     103 |       |       |          |
|*  6 |      HASH JOIN               |      |     16 |      1 |    63   (5)|     15 |00:00:05.98 |    1648 |  1969K|  1969K|  741K (0)|
|   7 |       CONNECT BY PUMP        |      |     16 |        |            |     16 |00:00:00.01 |       0 |       |       |          |
|   8 |       TABLE ACCESS FULL      | T1   |     16 |  50000 |    31   (4)|    800K|00:00:01.49 |    1648 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------

We get the result in 6.5 seconds! [UPDATE: but there’s a nice explanation for that – most of the time comes from the work done gathering rowsource execution statistics; with statistics_level set back to typical the run time dropped to 0.19 seconds.]

February 3, 2016

Hinting

Filed under: Hints,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 1:04 pm BST Feb 3,2016

This is just a little example of thinking about hinting for short-term hacking requirements. It’s the answer to a question that came up on the Oracle-L listserver  a couple of months ago (Oct 2015) and is a convenient demonstration of a principle that can often (not ALWAYS) be applied as a response to the problem: “I can make this query work quickly once, how do I make it work quickly when I make it part of a join ?”

The question starts with this query, which returns “immediately” for any one segment:


SELECT DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM dba_extents DE
WHERE 1=1
  AND DE.OWNER           = <owner>
  AND DE.SEGMENT_NAME    = <segment_name>
  AND DE.segment_type    = <segment_type>
  AND DE.tablespace_name = <tablespace_name>
  AND DE.partition_name  = <max_partition_name>
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME
;

But the email then goes on to explain: “I’ve got a table of values that I need to use as a list of inputs for this query, but when I do the join it takes ages to complete; how do I make the join run quickly?”

Here’s the initial code:


WITH SEGMENT_LIST AS
(
  select * from (
   SELECT /*+ materialize cardinality(100) */
           owner, segment_name, segment_type, tablespace_name,
           MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
SELECT
       DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM SEGMENT_LIST SL, dba_extents DE
WHERE 1=1
  AND DE.OWNER           = SL.OWNER
  AND DE.SEGMENT_NAME    = SL.SEGMENT_NAME
  AND DE.segment_type    = SL.segment_type
  AND DE.tablespace_name = SL.tablespace_name
  AND DE.partition_name  = SL.max_partition_name
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME

What we’ve got is a query where the user’s reference table generates a list of segments (the rownum < 2 was a temporary test) and we want the detail query to run for each segment identified. The “for each segment” gives us a clue that what we want to see is a simple nested loop join, driven by the factored subquery, with the very efficient query above running as the “second table of the nested loop”.

What I failed to notice at the time is that the /*+ materialize */ hint was in the wrong place, it should have been placed after the outer (i.e. very first) select, and it’s possible that if it had been in the right place the user would have got the plan they wanted – especially in the later versions of Oracle. As it was I suggested that we merely need to tell the optimizer to:

Visit the “tables” in the order (segment_list, dba_extents), and do a nested loop into (dba_extents), but since both segment_list and dba_extents were views we needed to stop Oracle from trying to merge them and play silly games with distinct aggregate placement, etc. by including directives that the views should not be merged, but then we might need to explain to Oracle that it would have to push the join predicate between segment_list and dba_extents inside the dba_extents view.

In other words, a list of 4 hints, as shown below:


WITH SEGMENT_LIST AS
(
  select * from (
   SELECT /*+ materialize cardinality(100) */
           owner, segment_name, segment_type, tablespace_name,
           MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
SELECT /*+
        leading(sl de)
        no_merge(sl)
        no_merge(de)
        push_pred(de)
        */
       DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM SEGMENT_LIST SL, dba_extents DE
WHERE 1=1
  AND DE.OWNER           = SL.OWNER
  AND DE.SEGMENT_NAME    = SL.SEGMENT_NAME
  AND DE.segment_type    = SL.segment_type
  AND DE.tablespace_name = SL.tablespace_name
  AND DE.partition_name  = SL.max_partition_name
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME

According to a follow-up email, this was sufficient.  The OP had actually tried variations on the leading() and use_nl() hints – but without the no_merge() hint the optimizer was probably rewriting the SQL in a way that put the hints out of context. It’s worth noting that the /*+ materialize */ hint is in the wrong place – it should be after the first appearance of the SELECT keyword in the factored subquery – and that probably added to the difficulty of getting the desired execution plan.

For a production system I’d probably want to do something a little more sophisticated in terms of stability once I’d got the plan I wanted – but this looks like a DBA query used to run an ad hoc report, so perhaps this solution is good enough for the current requirement.

 

February 2, 2016

Partitioned Bitmap Join

Filed under: bitmaps,Bugs,Indexing,Infrastructure,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 8:32 am BST Feb 2,2016

If you don’t want to read the story, the summary for this article is:

If you create bitmap join indexes on a partitioned table and you use partition exchanges to load data into the table then make sure you create the bitmap join indexes on the loading tables in exactly the same order as you created them on the partitioned table or the exchange will fail with the (truthful not quite complete) error: ORA-14098: index mismatch for tables in ALTER TABLE EXCHANGE PARTITION.

My story starts with this OTN posting from John Hall where he found after a year of successful batch loading one of his partition exchanges was raising error 14098. After an exchange of ideas, user rp0428 came up with a query against sys.jijoin$ (one of the tables behind bitmap join indexes) that allowed John Hall to see that the indexes on the exchange table had been created in a different order from that of the partitioned table. I did a quick test to see if this might be relevant (it shouldn’t be, it isn’t with “normal” indexes or function-based indexes, or virtual columns) and didn’t manage to reproduce the problem with two dimension tables and two bitmap join indexes.

Fortunately John didn’t take my word for it and tested the idea on a clone of the production system – and found that the order of creation did matter. His system, however, had 9 dimension tables and 33 bitmap join indexes – which shouldn’t have made any difference in principle, but maybe it was something to do with having several indexes on the same table,  maybe it was something to do with have far more tables or far more indexes than I had. So I built a larger test case with 6 dimension tables and six indexes per table – and reproduced the problem.

Then I started cutting back to see where the problem appeared, and found that all it took was one dimension with two indexes, or two dimensions with one index each – whatever I had done in my “quick test” I had clearly done it too quickly and done something wrong. (Unfortunately I had overwritten most of the code from the original quick test while building the larger test, so I couldn’t go back and see where the error was.)

Here, then, is the minimal test case that I finally ran to demonstrate that switching the order of index creation on the exchange table causes the exchange to fail:


drop table pt_range purge;
drop table t1 purge;
drop table dim_1 purge;
drop table dim_2 purge;

prompt  =================
prompt  Partitioned table
prompt  =================

create table pt_range (
        id,
        grp1,
        grp2,
        padding
)
nologging
partition by range(id) (
        partition p2001 values less than (2001),
        partition p4001 values less than (4001),
        partition p6001 values less than (6001),
        partition p8001 values less than (8001)
)
as
select
        rownum                          id,
        trunc(rownum/100)               grp1,
        trunc(rownum/100)               grp2,
        rpad('x',100)                   padding
from
        all_objects
where 
        rownum <= 8000
;

prompt  ================================================
prompt  Exchange table - loaded to match partition p8001
prompt  ================================================

alter table pt_range 
add constraint pt_pk primary key (id) using index local;

create table t1 (
        id,
        grp1,
        grp2,
        padding
)
as 
select
        rownum + 6000                   id,
        trunc(rownum/100)               grp1,
        trunc(rownum/100)               grp2,
        rpad('x',100)                   padding
from
        all_objects
where 
        rownum <= 2000
;

alter table t1
add constraint t1_pk primary key (id);

execute dbms_stats.gather_table_stats(user,'pt_range')
execute dbms_stats.gather_table_stats(user,'t1')

prompt  ================
prompt  dimension tables
prompt  ================

create table dim_1 
as 
select distinct 
        grp1, 
        cast('A'||grp1 as varchar2(3)) agrp1,
        cast('B'||grp1 as varchar2(3)) bgrp1
from
        t1
;

create table dim_2 as select * from dim_1;

prompt  ===============================
prompt  Primary keys required for BMJIs
prompt  ===============================

alter table dim_1 add constraint d1_pk primary key (grp1);
alter table dim_2 add constraint d2_pk primary key (grp1);

execute dbms_stats.gather_table_stats(user,'dim_1')
execute dbms_stats.gather_table_stats(user,'dim_2')

prompt  ============================
prompt  Creating bitmap join indexes
prompt  ============================

create bitmap index pt_1a on pt_range(d1.agrp1) from pt_range pt, dim_1 d1 where d1.grp1 = pt.grp1 local ;
create bitmap index pt_2a on pt_range(d2.agrp1) from pt_range pt, dim_2 d2 where d2.grp1 = pt.grp2 local ;

prompt  ====================================================
prompt  Pick your index creation order on the exchange table
prompt  ====================================================

create bitmap index t1_1a on t1(d1.agrp1) from t1, dim_1 d1 where d1.grp1 = t1.grp1 ;
create bitmap index t1_2a on t1(d2.agrp1) from t1, dim_2 d2 where d2.grp1 = t1.grp2 ;
-- create bitmap index t1_1a on t1(d1.agrp1) from t1, dim_1 d1 where d1.grp1 = t1.grp1 ;

prompt  ==================
prompt  Exchanging (maybe)
prompt  ==================

alter table pt_range
        exchange partition p8001 with table t1
        including indexes
        without validation
;

I’ve got the same create statement twice for one of the bitmap join indexes – as it stands the indexes will be created in the right order and the exchange will work; if you comment out the first t1_1a create and uncomment the second the exchange will fail. (If you comment out the ‘including indexes’ then the exchange will succeed irrespective of the order of index creation, but that rather defeats the point of being able to exchange partitions.)

I’ve reproduced the problem in 12.1.0.2, 11.2.0.4 and 10.2.0.5

Footnote: running an extended trace didn’t help me work out how Oracle is detecting the mismatch, presumably it’s something that gets into the dictionary cache in a general “load the index definition” step; but it did show me that (in the “without validation” case) the code seems to check the correctness of the exchange table’s primary key data BEFORE checking whether the indexes match properly.

January 29, 2016

Table Scans

Filed under: Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 12:49 pm BST Jan 29,2016

It’s amazing how easy it is to interpret a number incorrectly until the point comes where you have to look at it closely – and then you realise that there was a lot more to the number than your initial casual assumption, and you would have realised it all along if you’d ever needed to think about it before.

Here’s a little case in point. I have a simple (i.e. non-partitioned) heap table t1 which is basically a clone of the view dba_segments, and I’ve just connected to Oracle through an SQL*Plus session then run a couple of SQL statements. The following is a continuous log of my activity:


SQL> select table_name, partitioned, blocks from user_tables;

TABLE_NAME           PAR     BLOCKS
-------------------- --- ----------
T1                   NO         958

1 row selected.

SQL> select name, value from v$mystat ms, v$statname sn where ms.statistic# = sn.statistic# and name like '%table scan%';

NAME                                  VALUE
-------------------------------- ----------
table scans (short tables)                9
table scans (long tables)                 0
table scans (rowid ranges)                0
table scans (cache partitions)            0
table scans (direct read)                 0
table scan rows gotten                 4188
table scan blocks gotten                 14

7 rows selected.

SQL> select count(extents) from t1;

COUNT(EXTENTS)
--------------
          5143

1 row selected.

SQL> select name, value from v$mystat ms, v$statname sn where ms.statistic# = sn.statistic# and name like '%table scan%';

  • My cache size is quite small, so t1 doesn’t count as a “short” table.
  • I’ve collected stats on the table (and there are no indexes) so the optimizer doesn’t need to do any dynamic sampling to generate an execution plan.
  • This is 11g, so there are no SQL Plan Directives in place to force dynamic sampling

So here’s the question: how many “table scans (long tables)” will Oracle record against my session when I re-run that query against v$mystat ?

Warning – this IS a trick question.

Update number 1

I said it was a trick question and, as you will have seen if you’ve had time to read the comments, the answer is going to depend on various configuration options. Here’s what I got in my test – and I’ve reported not just the session stats, but the segment statistics (v$segment_statistics):


NAME                                  VALUE
-------------------------------- ----------
table scans (short tables)                9
table scans (long tables)                26
table scans (rowid ranges)               26
table scans (cache partitions)            0
table scans (direct read)                 0
table scan rows gotten                 9331
table scan blocks gotten                954

7 rows selected.

SQL> select owner, object_name, object_type, value  from v$segment_statistics where owner = 'TEST_USER' and statistic_name = 'segment scans' and value != 0;

OWNER           OBJECT_NAME          OBJECT_TYPE             VALUE
--------------- -------------------- ------------------ ----------
TEST_USER       T1                   TABLE                      26

1 row selected.

I recorded 26 long table scans – and that 26 was echoed in the segment statistics. (Note: I don’t normally use v$segment_statistics, which is a join between a couple of in-memory structures and three real table, I usually query v$segstat). It’s the segment statistics that made me pause in a recent problem review;  up to that moment I had been using the “Segments by Table Scans” section of the AWR Report as a useful (but badly named – since it also counts (most) index fast full scans) indicator of a potential threat, then I suddenly realised that there was yet another important detail I had to check before I could determine what the numbers were telling me.

So the next question is WHY has Oracle reported 26 tablescans ?

Update number 2

Martin Preiss (see comments) has been working hard to investigate this, and managed to produce a couple more variations in statistics for “the same” problem. In my case I had declared the table as “parallel 2”. For a table of degree N Oracle is very likely to break the table into 13*N chunks – there’s a historic reason for 13 – and that’s why I see 26 “rowid range” scans which, unfortunately, are also recorded as “long” table scans even though each is individually well short of the short table limit. This is really annoying when you start to look at the “Segments by Table Scan” report buecause you now don’t know how many times a table really was scanned unless you know something about the run-time degree of parellelism involved.

If you check the hidden parameters you will find several parameters relating to this chunking, in particular:


_px_min_granules_per_slave        minimum number of rowid range granules to generate per slave (default  13)
_px_max_granules_per_slave        maximum number of rowid range granules to generate per slave (default 100)

Technically, therefore, Oracle MIGHT get to 100 granules per parallel slave, and a tablescan at degree 2 could be recorded as 200 tablescans!

Martin asked why my 26 tablescans didn’t show up as “table scans (direct read)” – but that’s 11g for you, it allows a serial tablescan to use direct path reads, and it allows parallel tablescans to read into the cache, and the combination of cache size and table size meant that my example just happened to read into the cache.

And that introduces ANOTHER interpretation problem – what annoying things might I discover if I declare the table the the CACHE option ? (as Ivica suggested in his first comment below) ?

Final Update (probably)

Iviva has been busy on checking the CACHE operation and shown that there are a number of cases to consider (more, in fact, than I was planning to mention – and the variation the combination in NOPARALLEL and CACHE should give you some pause for thought). The only point I wanted to make was the effect of enabling PARALLEL and CACHE; I don’t think that this is a combination that is very likely to appear in a production system, but boundary conditions (and accidents) do occur. With my little sample – even after I grew the table to be MUCH larger, the tablescan DIDN’T get reported: here are some session stats and segment stats (taken using my snapshot code) of a single tablescan running parallel 2 when the table was cached:


Name                                                                     Value
----                                                                     -----
table scans (rowid ranges)                                                  26
table scans (cache partitions)                                              26
table scan rows gotten                                                 164,672
table scan blocks gotten                                                29,611

  Statistic                                    Value
  ---------                             ------------

T1
  logical reads                               30,272
  physical reads                              29,614
  physical read requests                         261

I know I’ve done a tablescan in 26 pieces (parallel 2) – and scanned 29,000+ blocks doing it; but according to the segment stats AND session stats I haven’t done a tablescan. Fortunately, of course, I can see the parallel tablescan in the session stats, and in this isolated case I can see from the “cache partitions” statistics that that tablescan was on a cached table. But if I’ve been depending on the segment stats to tell me about which tablescans happen most frequently and do most work I’ve found another reason why I can’t trust the stats and have to do more work cross-checking different parts of the AWR for self-consistent results.

I don’t expect to add any more comments about this mixture of tablescans and parallelism, with the varying effects on the session and segment statistics – but there’s always the possibility that one day I’ll start to worry about how the KEEP and RECYCLE (db_keep_cache_size and db_recycle_cache_size) could confuse things further.

Footnote:

It’s probably worth pointing out that the segment statistics have never recorded SHORT tablescans, they’ve only ever captured details of LONG tablescans. There is a slight inconsistency here, though, since they capture all the “index fast full scans (full)” reported in the session stats whether they are short or long – not that the session stats record the difference; unfortunately, another flaw creeps in: parallel index fast full scans “disappear” from the stats, although they show up as “index fast full scans (rowid ranges)” and “index fast full scans (direct)” with the same multiplier of 13 that we see for parallel table scans.

January 27, 2016

Add primary key.

Filed under: Indexing,Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 9:07 am BST Jan 27,2016

I thought I had written this note a few years ago, on OTN or Oracle-L if not on my blog, but I can’t find any sign of it so I’ve decided it’s time to write it (again) – starting as a question about the following code:


create table t1
as
with generator as (
        select  rownum  id
        from            dual
        connect by
                        rownum <= 1000
)
select
        rownum                                  id,
        trunc((rownum-1)/50)                    clustered,
        mod(rownum,20000)                       scattered,
        lpad(rownum,10)                         vc_small,
        rpad('x',100,'x')                       vc_padding
from
        generator       g1,
        generator       g2
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

alter system flush buffer_cache;

alter table t1 add constraint t1_pk primary key(id, scattered);

I’ve generated a table with 1,000,000 rows, including a column that’s guaranteed to be unique; then I’ve added a (two-column) primary key constraint to that table.

Because of the guaranteed unique column the call to add constraint will succeed. Because Oracle will automatically create a unique index to support that constraint it will have to do a tablescan of the table. So here’s the question: HOW MANY TIMES will it tablescan that table (and how many rows will it scan) ?

Space for thought …

The answer is three tablescans, 3 million rows.

Oracle will scan the table to check the validity of adding a NOT NULL definition and constraint for the id column, repeat the scan to do the same for the scattered column, then one final scan to accumulate the key data and rowids to sort and create the index.

Knowing this, you may be able to find ways to modify bulk data loading operations to minimise overheads.

The most recent version I’ve tested this on is 12.1.0.2.

See also: https://jonathanlewis.wordpress.com/2012/03/02/add-constraint/

My reference: pk_overhead.sql

January 26, 2016

Trace file size

Filed under: 12c,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 8:30 am BST Jan 26,2016

Here’s a convenient enhancement for tracing that came up on Twitter a few days ago – first in a tweet that I retweeted, then in a question from Christian Antognini based on this bit of the 12c Oracle documentation (opens in separate tab). The question was – does it work for you ?

The new description for max_dump_file_size says that for large enough values Oracle will split the file into multiple chunks of a few megabytes, using a suffix to identify the sequence of the chunks, keeping only the first chunk and the most recent chunks. Unfortunately this doesn’t seem to be true. However, prompted by Chris’ question I ran a quick query against the full parameter list looking for parameters with the word “trace” in their name:


select
        /*+
                leading(nam val val2)
                full(name)
                full(val)  use_hash(val)  no_swap_join_inputs(val)
                full(val2) use_hash(val2) no_swap_join_inputs(val2)
        */
        nam.ksppinm                             name,
        val.ksppstvl                            ses_val,
        val2.ksppstvl                           sys_val,
        nam.ksppdesc                            description,
        nam.indx+1                              numb,
        nam.ksppity                             type,
        val.ksppstdf                            is_def,
        decode(bitand(nam.ksppiflg/256,1),
                1,'True',
                  'False'
        )                                       ses_mod,
        decode(bitand(nam.ksppiflg/65536,3),
                1,'Immediate',
                2,'Deferred' ,
                3,'Immediate',
                  'False'
        )                                       sys_mod,
        decode(bitand(val.ksppstvf,7),
                1,'Modified',
                4,'System Modified',
                  'False'
        )                                       is_mod,
        decode(bitand(val.ksppstvf,2),
                2,'True',
                  'False'
        )                                       is_adj,
        val.ksppstcmnt                          notes
from
        x$ksppi         nam,
        x$ksppcv        val,
        x$ksppsv        val2
where
        nam.indx = val.indx
and     val2.indx = val.indx
and     ksppinm like '%&m_search.%'
order by
        nam.ksppinm
;

Glancing through the result I spotted a couple of interesting parameters with the letters “uts” in their names, so re-ran my query looking for all the “uts” parameters, getting the following (edited) list:


NAME                           SYS_VAL         DESCRIPTION    
------------------------------ --------------- ---------------------------------------------
_diag_uts_control              0               UTS control parameter
_uts_first_segment_retain      TRUE            Should we retain the first trace segment
_uts_first_segment_size        0               Maximum size (in bytes) of first segments 
_uts_trace_disk_threshold      0               Trace disk threshold parameter
_uts_trace_segment_size        0               Maximum size (in bytes) of a trace segment
_uts_trace_segments            5               Maximum number of trace segments 

Note particularly the “first segment size” and “trace segment size” – defaulting to zero (which often means a hidden internal setting, though that doesn’t seem to be the case here, but maybe that’s what the “diag control” is for). I haven’t investigated all the effects, but after a little experimentation I found that all I needed to do to get the behaviour attributed to max_dump_file_size was to set the following two parameters – which I could do at the session level.


alter session set "_uts_first_segment_size" = 5242880;
alter session set "_uts_trace_segment_size" = 5242880;

The minimum value for these parameters is the one I’ve shown above (5120 KB) and with the default value for _uts_trace_segments you will get a maximum of 5 trace files with sequential names like the following:

ls -ltr *4901*.trc

-rw-r----- 1 oracle oinstall 5243099 Jan 26 08:15 orcl_ora_4901_1.trc
-rw-r----- 1 oracle oinstall 5243064 Jan 26 08:15 orcl_ora_4901_12.trc
-rw-r----- 1 oracle oinstall 5243058 Jan 26 08:15 orcl_ora_4901_13.trc
-rw-r----- 1 oracle oinstall 5242993 Jan 26 08:15 orcl_ora_4901_14.trc
-rw-r----- 1 oracle oinstall 1363680 Jan 26 08:15 orcl_ora_4901.trc

As you can see I’m currently generating my 15th trace, and Oracle has kept the first one and the previous three. It’s always working on a file with no suffix to its name but as soon as that file hits its limiting size (plus or minus a few bytes) it gets its appropriate suffix, the oldest file is deleted, and a new trace file without a suffix is started.

Apart from the usual header information the trace files start and end with lines like:

*** TRACE CONTINUED FROM FILE /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_11.trc ***
  
*** TRACE SEGMENT RENAMED TO /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_12.trc ***

There is one little trap to watch out for: if you set either of these parameters to be larger than max_dump_file_size tracing stops as soon as one of the segments hits the max_dump_file_size and that trace file ends with the usual “overflow” message – e.g, when I changed the max_dump_file_size to 4M in mid-session:

*** DUMP FILE SIZE IS LIMITED TO 4194304 BYTES ***

In my case I had started with max_dump_file_size set to 20M, so I got lucky with my choice of 5M as the segment size.

Further investigation is left as an exercise to the interested reader.

 

November 30, 2015

Trouble-shooting

Filed under: Oracle,Troubleshooting,Uncategorized — Jonathan Lewis @ 6:03 am BST Nov 30,2015

This is the text of the “whitepaper” I submitted to DOAG for my presentation on “Core Strategies for Troubleshooting”.

Introduction

In an ideal world, everyone who had to handle performance problems would have access to ASH and the AWR through a graphic interface – but even with these tools you still have to pick the right approach, recognise the correct targets, and acquire information at the boundary that tells you why you have a performance problem and the ways in which you should be addressing it.

There are only three ways you can waste resources on an Oracle system, and only three different types of activity that need to be investigated. If you don’t appreciate that this is the case then you can waste a lot of time following the wrong strategy and attempting to solve the wrong problems. Once you have a clear statement of what you are trying to achieve it becomes much easier to achieve those aims.

Three ways to cause problems

There really are only three symptoms you need to look out for in the database

  • You’re doing it the hard way
  • You’re doing it too often
  • You’re queueing / locking

Inevitably you can see that there is going to be some overlap between the three (and it would be easy to argue that the third is simply a side effect of the other two). If you are executing a single statement “the hard way” you’re likely to be doing single block reads or buffer gets too often. If you’re executing a very lightweight statement too often it’s probably a step in a process that is attempting get a job done “the hard way”, and it’s probably causing (and getting stuck in) queues for latches and mutexes. I’ve included queueing as the third option because the simple action of locking data (deliberately, or by accident) doesn’t fall neatly into the other two.

Another way of looking at this list is to reduce it to just two items with a slightly different flavour: when you spend too much time on a task it’s because you’re either doing too much work or you’re not being allowed to work.

Three classes of problems to solve

An important aspect of performance problems is the high-level classification. Labelling the class of problem properly points you to the correct strategy for investigating the problem. Again there are only three possibilities in the list, which I’ll express as typical complaints:

  • My report is taking too long to run / screen is taking to long to refresh
  • The batch job over-ran the SLA last night
  • The “system” is slow

What’s the most significant difference between three classes of complaint ?

  • “My XXX takes too long”: XXX is a repeatable event that can be isolated and watched – just do it again, and again, and again, and again while I watch every step of what’s going on.
  • The batch job over-ran last night: It’s not a repeatable process, so you’ve got to infer what the problem was from historical evidence; it’s (usually) made up of a number of concurrent processes, which may interfere with each other to varying degrees depending on when their relative start and finish times were.
  • The “system” is slow: possibly all the time, possibly intermittently – if there’s no specific complaint then the only option is to keep an eye open for resource-intensive activity to see if you can reduce the resource usage of specific individual tasks (for which read SQL or PL/SQL statements) or reduce the number of times that those tasks are executed.

The common source

Despite the fact that we can classify performance problems in three ways, it’s worth remembering that ALL the information we might use to drive our diagnosis comes from one place – the Oracle database. At any moment we have sessions that are active, operating statements (cursors), and using a resource such as a file, a latch, a buffer, and so on. We could almost represent each moment by a cube with sessions along one side, cursors along another, and resources along the third – the cube changes moment by moment, allowing us to visualise time as the fourth dimension in a hypercube of activity.

Instant by instant the Oracle kernel code knows which session is using what resource to operate which cursor and, although the total volume of all that information is far more than could reasonably be recorded, Oracle has many different ways of slicing, dicing and capturing parts of that hypercube – running totals, rolling aggregates, snapshots by session, by cursor, by resource and so on – that are made visible as the dynamic performance views (v$ objects). Trouble-shooting is largely a case of deciding which dynamic performance views are the most appropriate to use for our three classes of task.

Active Session History

Before reviewing the three classes, it’s worth breaking off for a moment to say a few things about one of the most important and useful views that we have of the wealth of information available; this is the active session history (v$active_session_history / ASH) which Oracle uses to capture a snapshot once every second of what each currently active session is doing; every 10th snapshot is then echoed down into the Automatic Workload Repository (AWR) by a process that runs every hour to copy the dynamic view to a table that can be accessed through the database view dba_hist_active_sess_history.

The capture rates can all be adjusted: I have never seen anyone change from one snapshot per second, or every 10th snapshot in a production system, but I’ve often seen the dump to the AWR taken every 30 minutes, occasionally 20 or even 15 minutes. On occasion I have asked clients to do a CTAS (create table as select – nologging) to capture the whole of the v$active_session_history to a table that can be exported and reviewed at a later date. Oracle tries to keep a minimum of at least an hour’s ASH in memory but, with a large enough SGA, you may find that this will stretch out to 3 or 4 hours.

Unfortunately although (or perhaps because) ASH and its AWR history are extremely helpful, you have to pay extra licence fees to use the information, and the technology can only be licensed with the Enterprise Edition of Oracle. (Which is why there are some ASH emulators on the Internet)

My report is slow

The special feature of someone complaining about a specific task is that it’s likely to be repeatable – so we can run it again and again and watch every single detail to see where the time goes. Our slice through the hypercube could take a single session over a period of time and report every action along that path. This, of course, is the 10046 – a.k.a extended SQL trace event. We can enable it in many ways, perhaps through a logon trigger, perhaps through a call to dbms_monitor:


begin
        dbms_monitor.session_trace_enable( 
                session_id => &m_sid, 
                serial_num => &m_serial,
                waits      => true, 
                bind       => true, 
                plan_stat  => 'all_executions'
        );
end; 
/

In this example I’ve request all wait states and bind variable to be dumped into the trace file, I’ve also requested that the execution plan (with rowsource execution stats) be dumped for every single execution of every single statement. Sometimes a problem arises because a particular set of bind variables represents a special case that causes a “reasonable” plan to behave very badly. If we’re going to look closely we may as well get as much detail as possible.

The entire “trace” interface was upgraded dramatically in 11g, and one of the useful variants on this theme is particularly relevant to a commonly used Web-based implementation. If you know that a specific screen task corresponds to a particular PL/SQL package you can enable tracing of a cursor (across the system, if necessary) by SQL_ID. So, for example, you might issue the following two commands, with a couple of minutes gap between the two:


alter system
        set events '
                sql_trace[SQL:1wthpj7as7urp]
                plan_stat=all_executions,
                wait=true, 
                bind=true
        '
;

-- wait a few minutes

alter system
        set events 'sql_trace[SQL:1wthpj7as7urp] off'
;

Every time the statement with SQL_ID =‘1wthpj7as7urp’ is executed, the session executing it will start adding information to the session trace file, and when the statement ends the tracing will end. This is particularly nice if the “statement” is a top-level call to a PL/SQL procedure because all the SQL inside the procedure will be traced as the package executes.

For a highly focused, highly repeatable task, the 10046 trace event is almost always all you need to do.

The batch over-ran

The big difference between this case and the previous one is that “the batch” is not something you can simply repeat and watch. Moreover, “the batch” is likely to be a large number of separate sections of code that are scheduled to run with a fairly fluid timetable that can result in changes from day to day (or, more likely, night to night) in the set of jobs that might be running concurrently. This means that even if you could re-run the batch job (perhaps on the previous night’s backup) you might not see the same problem appear because a small change in timing could result in a large change in contention).

One of the most important steps of dealing with the batch is pre-emptive: instrument your code and make it possible to compare the run on one night with the run on another. At the very least you need to have something capturing the start and end times of each “significant component” of the batch so you can quickly answer questions like: “which jobs took much longer than usual”, “which job was the first job that took longer than usual”, “which jobs were running concurrently with job X last night when they never usually overlap?”

Ideally you should have much more information than this about each job – basically a report from Oracle which says “how much work did I do, how much time did I spend”: for a session this is simply a report of v$mystat or v$sesstat (joined to v$statname) and v$session_event for the session (v$mystat is a less well- known view that is v$sesstat restricted to “my” session) ; if you classify each job as “connect to the database, do something, disconnect” then this critical log is simply a pair of select statements spooled out somewhere convenient, or written to the database; if you want to break a single connection into significant pieces then a simple pl/sql procedure could read the statistics into a pl/sql array as the piece starts, then re-read the stats and calculate the differences as the piece ends.

Knowing where the time went, and knowing how tasks have behaved differently from previous runs is a big step forward to identifying the problem.

If you don’t have the instrumentation you need then the AWR (if you’re licensed) or Statspack (if you’re not licensed) is a step in the right direction. Apart from the typical hourly snapshots and reports you can take a snapshot as the first and last steps of the batch so that you’ve got “the whole batch” in a single AWR/Statspack report. If you’ve got that you can then do comparisons for things like:

  • Which event consumed much more time than usual
  • Which SQL took much more time than usual
  • Which segment(s) saw much more activity than usual
  • Was there some unusual enqueue activity
  • Can we see some unusual outliers in the event histograms
  • Can we see some unusual memory demands in the pga histogram

Although system-wide summaries rarely point us at exact causes, they can often give us strong clues of areas (and times) where problem originated.

In this respect the “Top Activity” screen from Enterprise Manager (Grid Control / Cloud Control) can be very helpful as it produces a nice graphical presentation of “working”. Where, in the picture of last night’s activity, does the graph start to ramp up, and what colour is the bit that’s growing, and how does that picture compare to the same picture the previous night. (Having two windows open with two different nights makes it wonderfully easy to switch between displays and spot the differences.) Since the “historical” top activity screen is created from the dba_hist_active_sess_history, which contains about 100 different details per session of each captured moment, it’s very easy to drill though the spikes to answer questions like: “which object”, “what event”, “which SQL”, “what was the execution plan”, “how much work did that take”, to follow the chain of time back to the cause.

The system is slow

If no-one is going to tell you about specific tasks, and if you don’t have any sort of boundary that allows you to focus on tasks or time-ranges, then the simplest thing to do is look for anything expensive (i.e. time-consuming) and see if you can make it cheaper.

Again, the graphic “Top Activity” screen is very helpful, and I often tell people to arrange to have a system that shows the top activity screens for the most important 2 or 3 databases on a large screen on the wall where any passing DBA might notice a brief spike in workload. There are systems that can be improved by constant monitoring – so long as the monitoring doesn’t take out 100% of an individual’s time but is driven as an informal glance at a picture.

If you’re not licensed to take advantage of the AWR then Statspack can help – but with the smallest time interval (though 15 minutes is as low as I’ve ever gone) between snapshots so that “anomlies” that are short-lived don’t fall out of memory before they can be captured.

An important feature of reading Statspack is that you need to check for missing information – if the headline figure for physical reads is 25M but the “SQL ordered by reads” is 12M then you know that there must be 13M reads that didn’t get captured in the report and that might be the 13M that is causing the problem. Similarly if the “Segments by physical reads” reports 16M reads that’s 4M more than the SQL – but is the 12M a subset of the 16M, or is there only a 3M overlap between the two figures so that between them the 12M and 16M cover the entire 25M. There’s more information in the Statspack report than immediately meets the eye, and a careful drilldown into other areas of the report (typically the Instance Activity) may be needed to clarify the meaning of what you’re seeing as a headline figure.

The other thing you can do with “the slow system” when you don’t have ASH to help is take snapshots (or get some freeware to do the same sort of thing). If the system is slow “right now” you could, for example, take a snapshot of v$sess_io (session I/O), wait 30 seconds then take another snapshot, find the difference and see who is doing most of the I/O work – then chase that session; or take snapshots of v$sesstat limited to (say) statistics like “%redo%” and find out who is generating lots of redo.

Oracle allows you to take this approach back into recent history – there are a number of “metric” views which give you thing like the rolling average, min, and max I/O volumes for the last few intervals of 5 minutes or 1 minute each – telling you, in other words, whether there were any interesting bursts of extreme activity in the recent past. For example, a query against v$sysmetric_summary might give you an output like the following:

METRIC_NAME                     MAXVAL     AVERAGE STANDARD_DEV METRIC_UNIT
------------------------ ------------- ----------- ------------ -----------------------
Physical Reads Per Sec        1,618.95      105.92       358.16 Reads Per Second
Physical Reads Per Txn       97,202.00    5,539.19    20,811.56 Reads Per Txn
Redo Generated Per Sec    6,773,108.94  218,132.86 1,023,458.57 Bytes Per Second
User Calls Per Txn              395.00       43.39        79.85 Calls Per Txn
Total Parse Count Per Sec        31.14        1.88         4.25 Parses Per Second
Host CPU Utilization (%)         64.51        3.93         9.07 % Busy/(Idle+Busy)
Database Time Per Sec            82.96        6.65        15.37 CentiSeconds Per Second
I/O Megabytes per Second         35.58        2.62         5.73 Megabtyes per Second

This summarises the last 12 intervals of 5 minutes. If we look at “Physical Reads per Txn” we can see that there were some extreme swings in activity over that period, so we could drill down into v$sysmetric_history for “Physical Reads per txn”, looking at the 1 minute granularity and see:

METRIC_UNIT                    BEGIN_TIME                VALUE
Physical Reads Per Txn         05-feb 12:45:55          421.00
                               05-feb 12:44:55          477.00
                               05-feb 12:43:55          351.00
                               05-feb 12:42:55          406.84
                               05-feb 12:41:55        1,550.00
                               05-feb 12:40:55       93,984.00
                               05-feb 12:39:55       97,202.00
                               05-feb 12:38:55       93,323.00
                               05-feb 12:37:55          391.00
                               05-feb 12:36:55          504.00
                               05-feb 12:35:55          504.00
                               05-feb 12:34:55          252.00

Yes, a few minutes ago something exploded onto the system doing a huge amount of I/O for about 3 minutes. If we’re lucky we might now drill into the v$sesstat, or v$sess_io, or v$session_event to see if we can find a session that is responsible for a large amount of I/O; and then check v$open_cursor to see if it still has some open cursors that might (if we check v$sql) show us what caused the I/O.

When there are no specific complaints, we just keep an eye open for spikes in activity and try track them down as quickly and cheaply as possible to see if they’re worth addressing.

Summary

Oracle gives you a huge amount of information about the work that’s going on and the time that’s being used in the database. Unfortunately the most useful repository of that information is in a dynamic performance view that can only be viewed in the Enterprise Edition after purchasing additional licences. However, the information is summarised, in many different ways in literally hundreds of other dynamic performance views, and it’s easy to pick out helpful information from those views in a variety of ways.

Key to making the best use of those views, though, is recognising that different classes of performance problems require different strategies – and there are only three different classes of problems to worry about.

November 3, 2015

Nul points

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 8:16 am BST Nov 3,2015

(To understand the title, see this Wikipedia entry)

The title could also be: “Do as I say, don’t do as I do”, because I want to remind you of an error that I regularly commit in my demonstrations. Here’s an example:

 
SQL> create table t (n number); 

Table created 

Have you spotted the error yet ? Perhaps this will help:

SQL> insert into t select 1 - 1/3 * 3 from dual; 

1 row created. 

SQL> insert into t select 1 - 3 * 1/3 from dual; 

1 row created. 

SQL> column n format 9.99999999999999999999999999999999999999999 
SQL> select * from t; 

                                           N
--------------------------------------------
  .00000000000000000000000000000000000000010
  .00000000000000000000000000000000000000000

2 rows selected. 

Spotted the error yet ? If not then perhaps this will help:

SQL> select * from dual where 3 * 1/3 = 1/3 * 3;

no rows selected 

SQL> select * from dual where 3 * (1/3) = (1/3) * 3; 

D
-
X

1 row selected. 

Computers work in binary, people (tend to) work in decimal. 10 = 2 * 5, and 5 (more precisely, dividing by 5) is something that a computer cannot do accurately. So when you do arbitrary arithmetic you should use some method to deal with tiny rounding errors.

In Oracle this means you ought to define all numbers with a precision and scale. Look on it as another form of constraint that helps to ensure the correctness of your data as well as improving performance and reducing wasted storage space.

September 23, 2015

Forget-me-nots

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 8:17 am BST Sep 23,2015

Here’s a little note that I drafted (according to its date stamp) in January 2013 and then forgot to post. (Which adds a little irony to the title.)

============================================================

Here’s an object lesson in (a) looking at what’s in front of you, and (b) how hard it is to remember all the details.

I ran a script today [ED: i.e. some time early Jan 2013] that I’ve have no problems with in earlier versions of Oracle, but today I was running it against 11.2.0.3 for the first time, and hit a problem with autotrace:

SQL> set autotrace on
ERROR:
ORA-28002: the password will expire within 5 days

SP2-0619: Error while connecting
SP2-0611: Error enabling STATISTICS report

Anyone who reads the preceding text closely will see immediately what the problem is – but I saw the bottom line and immediately decided that I had forgotten to set up the plustrace role in this database. So I logged on as SYS and ran the script to create it ($ORACLE_HOME/sqlplus/admin/plustrce.sql) – and discovered that the role already existed. So I decided that I hadn’t granted plustrace to my test_user role, but that didn’t help; so I decided that this was clearly a case of a role that had to be assigned directly to a user, and that didn’t help.

THEN I read the error message properly, changed my password, and everything worked the way as expected.

That’s part (a) of the lesson, here’s part (b): I wrote about running int the same issue nearly two years ago (only that time I had a little excuse for not spotting the problem instantly).

============================================================

August 3, 2015

Demo data

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 1:26 pm BST Aug 3,2015

One of the articles I wrote for redgate’s AllthingsOracle site some time ago included a listing of the data distribution for some client data which I had camouflaged. A recent comment on the article asked how I had generated the data – of course the answer was that I hadn’t generated it, but I had done something to take advantage of its existence without revealing the actual values.  This article is just a little note showing what I did; it’s not intended as an elegant and stylish display of perfectly optimised SQL, it’s an example of a quick and dirty one-off  hack that wasn’t (in my case) a disaster to run.

I’ve based the demonstration on the view all_objects. We start with a simple query showing the distribution of the values of column object_type:


break on report
compute sum of count(*) on report

select
        object_type, count(*)
from
        all_objects
group by object_type
order by
        count(*) desc
;

OBJECT_TYPE           COUNT(*)
------------------- ----------
SYNONYM                  30889
JAVA CLASS               26447
...
JAVA RESOURCE              865
TRIGGER                    509
JAVA DATA                  312
...
JAVA SOURCE                  2
DESTINATION                  2
LOB PARTITION                1
EDITION                      1
MATERIALIZED VIEW            1
RULE                         1
                    ----------
sum                      76085

44 rows selected.

Starting from this data set I want 44 randomly generated strings and an easy way to translate the actual object type into one of those strings. There are various ways to do this but the code I hacked out put the original query into an inline view, surrounded it with a query that added a rownum to the result set to give each row a unique id, then used the well-known and much-loved  “connect by level” query against  dual to generate a numbered list of randomly generated strings as an inline view that I could use in a join to do the translation.


execute dbms_random.seed(0)

column random_string format a6

select
        generator.id,
        dbms_random.string('U',6)       random_string,
        sum_view.specifier,
        sum_view.ct                     "COUNT(*)"
from
        (
        select
                rownum  id
                from    dual
                connect by
                        level <= 100
        )       generator,
        (
        select
                rownum          id,
                specifier,
                ct
        from
                (
                select
                        object_type specifier, count(*) ct
                from
                        all_objects
                group by
                        object_type
                order by
                        count(*) desc
                )
        )       sum_view
where
        sum_view.id = generator.id
order by
        ct desc
;

        ID RANDOM SPECIFIER             COUNT(*)
---------- ------ ------------------- ----------
         1 BVGFJB SYNONYM                  30889
         2 LYYVLH JAVA CLASS               26447
...
         9 DNRYKC JAVA RESOURCE              865
        10 BEWPEQ TRIGGER                    509
        11 UMVYVP JAVA DATA                  312
...
        39 EYYFUJ JAVA SOURCE                  2
        40 SVWKRC DESTINATION                  2
        41 CFKBRX LOB PARTITION                1
        42 ZWVEVH EDITION                      1
        43 DDAHZX MATERIALIZED VIEW            1
        44 HFWZBX RULE                         1
                                      ----------
sum                                        76085

44 rows selected.

I’ve selected the id and original value here to show the correspondance, but didn’t need to show them in the original posting. I’ve also left the original (now redundant) “order by” clause in the main inline view, and you’ll notice that even though I needed only 44 distinct strings for the instance I produced the results on I generated 100 values as a safety margin for testing the code on a couple of other versions of Oracle.

A quick check for efficiency – a brief glance at the execution plan, which might have prompted me to add a couple of /*+ no_merge */ hints if they’d been necessary – showed that the work done was basically the work of the original query plus a tiny increment for adding the rownum and doing the “translation join”. Of course, if I’d then wanted to translate the full 76,000 row data set and save it as a table I’d have to join the result set above back to a second copy of all_objects – and it’s translating full data sets , while trying to deal with problems of referential integrity and correlation, where the time disappears when masking data.

It is a minor detail of this code that it produced fixed length strings (which matched the structure of the original client data). Had I felt the urge I might have used something like: dbms_random.string(‘U’,trunc(dbms_random.value(4,21))) to give me a random distribution of string lengths between 4 and 20. Getting fussier I might have extracted the distinct values for object_type and then generated a random string that matched the length of the value it was due to replace. Fussier still I might have generated the right number of random strings matching the length of the longest value, sorted the original and random values into alphabetical order to align them, then trimmed each random value to the length of the corresponding original value.

It’s extraordinary how complicated it can be to mask data realistically – even when you’re looking at just one column in one table. And here’s a related thought – if an important type of predicate in the original application with the original data is where object_type like ‘PACK%’ how do you ensure that your masked data is consistent with the data that would be returned by this query and how do you determine the value to use instead of “PACK” as the critical input when you run the critial queries against the masked data ? (Being privileged may give you part of the answer, but bear in mind that the people doing the testing with that data shouldn’t be able to see the unmasked data or any translation tables.)

 

 

 

July 27, 2015

Subquery Factoring (10)

Filed under: Bugs,CBO,Oracle,Subquery Factoring,Troubleshooting — Jonathan Lewis @ 1:26 pm BST Jul 27,2015

What prompted me to write my previous note about subquerying was an upgrade to 12c, and a check that a few critical queries would not do something nasty on the upgrade. As ever it’s always interesting how many little oddities you can discover while looking closely as some little detail of how the optimizer works. Here’s an oddity that came up in the course of my playing around investigation in 12.1.0.2 – first some sample data:


create table t1
nologging
as
select * from all_objects;

create index t1_i1 on t1(owner) compress nologging;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1 for columns owner size 254'
        );
end;
/

The all_objects view is convenient as a tool for modelling what I wanted to do since it has a column with a small number of distinct values and an extreme skew across those values. Here’s a slightly weird query that shows an odd costing effect:


with v1 as (
        select /*+ inline */ owner from t1 where owner > 'A'
)
select count(*) from v1 where owner = 'SYS'
union all
select count(*) from v1 where owner = 'SYSTEM'
;

Since the query uses the factored subquery twice and there’s a predicate on the subquery definition, I expect to see materialization as the default, and that’s what happened (even though I’ve engineered the query so that materialization is more expensive than executing inline). Here are the two plans from 12.1.0.2 (the same pattern appears in 11.2.0.4, though the costs are a little less across the board):


=======================
Unhinted (materializes)
=======================

---------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                            |     2 |   132 |    25  (20)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION |                            |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D661B_876C2CB |       |       |            |          |
|*  3 |    INDEX FAST FULL SCAN    | T1_I1                      | 85084 |   498K|    21  (15)| 00:00:01 |
|   4 |   UNION-ALL                |                            |       |       |            |          |
|   5 |    SORT AGGREGATE          |                            |     1 |    66 |            |          |
|*  6 |     VIEW                   |                            | 85084 |  5483K|    13  (24)| 00:00:01 |
|   7 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D661B_876C2CB | 85084 |   498K|    13  (24)| 00:00:01 |
|   8 |    SORT AGGREGATE          |                            |     1 |    66 |            |          |
|*  9 |     VIEW                   |                            | 85084 |  5483K|    13  (24)| 00:00:01 |
|  10 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D661B_876C2CB | 85084 |   498K|    13  (24)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

=============
Forced inline
=============

--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |     2 |    12 |    22  (14)| 00:00:01 |
|   1 |  UNION-ALL             |       |       |       |            |          |
|   2 |   SORT AGGREGATE       |       |     1 |     6 |            |          |
|*  3 |    INDEX FAST FULL SCAN| T1_I1 | 38784 |   227K|    21  (15)| 00:00:01 |
|   4 |   SORT AGGREGATE       |       |     1 |     6 |            |          |
|*  5 |    INDEX RANGE SCAN    | T1_I1 |   551 |  3306 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------

I’m not surprised that the optimizer materialized the subquery – as I pointed out in my previous article, the choice seems to be rule-based (heuristic) rather than cost-based. What surprises me is that the cost for the default plan is not self-consistent – the optimizer seems to have lost the cost of generating the temporary table. The cost of the materialized query plan looks as if it ought to be 21 + 13 + 13 = 47. Even if the optimizer were smart enough to assume that the temporary table would be in the cache for the second scan (and therefore virtually free to access) we ought to see a cost of 21 + 13 = 34. As it is we have a cost of 25, which is 13 + 13 (or, if you check the 10053 trace file, 12.65 + 12.65, rounded).

Since the choice to materialize doesn’t seem to be cost-based (at present) this doesn’t really matter – but it’s always nice to see, and be able to understand, self-consistent figures in an execution plan.

Footnote

It is worth pointing out as a side note that materialization can actually be more expensive than running in-line, even for very simple examples. Subquery factoring seems to have become more robust and consistent over recent releases in terms of consistency of execution plans when the subqueries are put back inline, but you still need to think a little bit before rewriting a query for cosmetic (i.e. totally valid “readability”) reasons just to check whether the resulting query is going to produce an unexpected, and unexpectedly expensive, materialization.

July 17, 2015

Descending Indexes

Filed under: CBO,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 8:42 am BST Jul 17,2015

I’ve written about optimizer defects with descending indexes before now but a problem came up on the OTN database forum a few days ago that made me decide to look very closely at an example where the arithmetic was clearly defective. The problem revolves around a table with two indexes, one on a date column (TH_UPDATE_TIMESTAMP) and the other a compound index which starts with the same column in descending order (TH_UPDATE_TIMESTAMP DESC, TH_TXN_CODE). The optimizer was picking the “descending” index in cases where it was clearly the wrong index (even after the statistics had been refreshed and all the usual errors relating to date-based indexes had been discounted). Here’s an execution plan from the system which shows that there’s something wrong with the optimizer:


SELECT COUNT(*) FROM TXN_HEADER WHERE TH_UPDATE_TIMESTAMP BETWEEN SYSDATE-30 AND SYSDATE;

---------------------------------------------------------------------------------------
| Id  | Operation          | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                  |       |       |     4 (100)|          |
|   1 |  SORT AGGREGATE    |                  |     1 |     8 |            |          |
|*  2 |   FILTER           |                  |       |       |            |          |
|*  3 |    INDEX RANGE SCAN| TXN_HEADER_IDX17 |  1083K|  8462K|     4   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

There are two clues here: first, Oracle has used the (larger) two-column index when the single column is (almost guaranteed to be) the better choice simply because it will be smaller; secondly, we have a cost of 4 to acquire 1M rowids through an (implicitly b-tree) index range scan, which would require at least 250,000 index entries per leaf block to keep the cost that low (or 2,500 if you set the optimizer_index_cost_adj to 1; so it might just be possible if you had a 32KB block size).

The OP worked hard to supply information we asked for, and to follow up any suggestions we made; in the course of this we got the information that the table had about 90M rows and this “timestamp” column had about 45M distinct values ranging from 6th Sept 2012 to 2nd July 2015 with no nulls.

Based on this information I modelled the problem in an instance of 11.2.0.4 (the OP was using 11.2.0.3).


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                                          id,
        to_date('06-SEP-2012 15:13:00','dd-mon-yyyy hh24:mi:ss') +
                trunc((rownum - 1 )/4) / (24 * 60)                      upd_timestamp,
        mod(rownum - 1,10)                                              txn_code,
        rpad('x',50)                                                    padding
from
        generator       v1,
        generator       v2
where
        rownum <= 4 * 1030  *  24 * 60
;

create index t1_asc on t1(upd_timestamp) nologging;
create index t1_desc on t1(upd_timestamp desc) nologging;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1',
                cascade          => false
        );
end;
/

My data set has 4 rows per minute from 6th Sept 2012 to 3rd July 2015, with both an ascending and descending index on the upd_timestamp column. For reference, here are the statistics about the two indexes:


INDEX_NAME               SAMPLE     BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- ---------- ---------- ----------- -----------------
T1_DESC                 5932800          2       17379            154559
T1_ASC                  5932800          2       15737             59825

The ascending index is smaller with a significantly better clustering_factor, so for queries where either index would be a viable option the ascending index is the one (we think) that the optimizer should choose. In passing, the 5.9M index entries is exactly the number of rows in the table – these stats were computed automatically as the indexes were created.

Here’s a simple query with execution plan (with rowsource execution stats):

select max(id) from t1 where upd_timestamp between
to_date('01-jun-2015','dd-mon-yyyy')                and
to_date('30-jun-2015','dd-mon-yyyy')

---------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |      1 |        |    29 (100)|      1 |00:00:01.29 |    4710 |
|   1 |  SORT AGGREGATE              |         |      1 |      1 |            |      1 |00:00:01.29 |    4710 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1      |      1 |    167K|    29   (0)|    167K|00:00:00.98 |    4710 |
|*  3 |    INDEX RANGE SCAN          | T1_DESC |      1 |    885 |     5   (0)|    167K|00:00:00.35 |     492 |
---------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."SYS_NC00005$">=HEXTORAW('878CF9E1FEF8FEFAFF')  AND
              "T1"."SYS_NC00005$"<=HEXTORAW('878CF9FEF8FEF8FF') )
       filter((SYS_OP_UNDESCEND("T1"."SYS_NC00005$")>=TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND SYS_OP_UNDESCEND("T1"."SYS_NC00005$")<=TO_DATE(' 2015-06-30 00:00:00',
              'syyyy-mm-dd hh24:mi:ss')))

The optimizer’s index estimate of 885 rowids is a long way off the actual rowcount of 167,000 – even though I have perfect stats describing uniform data and the query is simple enough that the optimizer should be able to get a good estimate. Mind you, the predicate section looks a lot messier than you might expect, and the table estimate is correct (though not consistent with the index estimate, of course).

Here’s the plan I get when I make the descending index invisible:


--------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |      1 |        |  2146 (100)|      1 |00:00:01.25 |    2134 |
|   1 |  SORT AGGREGATE              |        |      1 |      1 |            |      1 |00:00:01.25 |    2134 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1     |      1 |    167K|  2146   (1)|    167K|00:00:00.94 |    2134 |
|*  3 |    INDEX RANGE SCAN          | T1_ASC |      1 |    167K|   453   (2)|    167K|00:00:00.31 |     446 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("UPD_TIMESTAMP">=TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "UPD_TIMESTAMP"<=TO_DATE(' 2015-06-30 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

With the descending index invisible we can see that the cardinality estimate for the ascending index correct and notice how much higher this makes the cost. It’s not surprising that the optimizer picked the wrong index with such a difference in cost. The question now is why did the optimizer get the index cardinality (i.e. selectivity, hence cost) so badly wrong.

The answer is that the optimizer has made the same mistake that applications make by storing dates as character strings. It’s using the normal range-based calculation for the sys_op_descend() values recorded against the virtual column and has lost all understanding of the fact that these values represent dates. I can demonstrate this most easily by creating one more table, inserting a couple of rows, gathering stats, and showing you what the internal storage of a couple of “descendomg” dates looks like.


drop table t2;

create table t2 (d1 date);
create index t2_i1 on t2(d1 desc);

insert into t2 values('01-Jun-2015');
insert into t2 values('30-Jun-2015');
commit;

select d1, sys_nc00002$ from t2;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T2',
                method_opt       => 'for all columns size 1',
                cascade          => false
        );
end;
/

column endpoint_value format 999,999,999,999,999,999,999,999,999,999,999,999
break on table_name skip 1

select
        table_name, column_name, endpoint_number, endpoint_value, to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') end_hex
from
        user_tab_histograms
where
        table_name in ('T1','T2')
and     column_name like 'SYS%'
order by
        table_name, column_name, endpoint_number
;

I’ve put the dates 1st June 2015 and 30th June 2015 into the table because those were the values I used in the where clause of my query. Here are the results showing the internal representation of the (descending) index column and the stored low and high values for the virtual columns in both t1 and t2.


D1        SYS_NC00002$
--------- ------------------------
01-JUN-15 878CF9FEF8FEF8FF
30-JUN-15 878CF9E1FEF8FEFAFF

TABLE_NAME           COLUMN_NAME          ENDPOINT_NUMBER                                   ENDPOINT_VALUE END_HEX
-------------------- -------------------- --------------- ------------------------------------------------ ---------------------------------
T1                   SYS_NC00005$                       0  703,819,340,111,320,000,000,000,000,000,000,000    878CF8FCEFF30BCEBC8823F7000000
                     SYS_NC00005$                       1  703,880,027,955,346,000,000,000,000,000,000,000    878FF6F9EFF20256F3B404F7400000

T2                   SYS_NC00002$                       0  703,819,411,001,549,000,000,000,000,000,000,000    878CF9E1FEF8F24C1C7CED46200000
                     SYS_NC00002$                       1  703,819,419,969,389,000,000,000,000,000,000,000    878CF9FEF8FEEED28AC5B22A200000

If you check the t2 data (sys_nc00002$) values against the end_hex values in the histogram data you’ll see they match up to the first 6 bytes (12 digits). Oracle has done its standard processing – take the first 6 bytes of the column, covert to decimal, round to the most significant 15 digits (technically round(value, -21)), convert and store the result as hex.

So let’s do some arithmetic. The selectivity of a range scan that is in-bounds is (informally): “range we want” / “total range”. I’ve set up t2 to show us the values we will need to calculate the range we want, and I’ve reported the t1 values to allow us to calculate the total range, we just have to subtract the lower value (endpoint number 0) from the higher value for the two sys_nc0000N$ columns. So (ignoring the 21 zeros everywhere) our selectivity is:

  • (703,819,419,969,389 – 703,819,411,001,549) / ( 703,880,027,955,346 – 703,819,340,111,320) = 0.00014777
  • We have 5.9M rows in the table, so the cardinality estimate should be about: 5,932,800 * 0.00014777 = 876.69

The actual cardinality estimate was 885 – but we haven’t allowed for the exact form of the range-based predicate: we should add 1/num_distinct at both ends because the range is a closed range (greater than or EQUAL to, less than or EQUAL to) – which takes the cardinality estimate up to 884.69 which rounds to the 885 that the optimizer produced.

Conclusion

This note shows that Oracle is (at least for dates) losing information about the underlying data when dealing with the virtual columns associated with descending columns in indexes. As demonstrated that can lead to extremely bad selectivity estimates for predicates based on the original columns, and these selectivity estimates make it possible for Oracle to choose the wrong index and introduce a spuriously low cost into the calculation of the full execution plan.

Footnote

This note seems to match bug note 11072246 “Wrong Cardinality estimations for columns with DESC indexes”, but that bug is reported as fixed in 12.1, while this test case reproduces in 12.1.0.2

 

July 6, 2015

SQL vs. PL/SQL

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 10:23 am BST Jul 6,2015

Which piece of code will be faster (clue – the table in question has no indexes):

Option 1 – pure SQL


update join1 set
        data = data||'#'
where   key_1=500
and     key_2=23851
and     key_3=57012
and     key_4=521
and     key_6=1
and     key_7=23352
;

Option 2 – a silly PL/SQL row by row approach:


declare
        type rowid_type is table of urowid index by binary_integer;
        tab_rowid           rowid_type;  

        lv_rows_updated     number :=0;  

        cursor my_cursor is
                select  rowid rid
                from    join1
                where   key_1=500
                and     key_2=23851
                and     key_3=57012
                and     key_4=521
                and     key_6=1
                and     key_7=23352
        ;

begin
        open my_cursor;

        -- We know that the number of rows to be updated is very small
        fetch my_cursor bulk collect into tab_rowid limit 10000;

        forall lv_row in tab_rowid.first .. tab_rowid.last
             update join1 set data = data||'#' where  rowid = tab_rowid(lv_row);

        lv_rows_updated := sql%rowcount;
        close my_cursor;
end;
/

It’s a trick question, of course, and although the automatic response from any DBA-type is likely to be “the SQL”, the correct answer is (as so often) “it depends”.

This question appeared as a problem on the OTN database forum a few days ago. In it’s original form it asked why a select statement should be much faster than a select for update or an update – even though the volume identified and updated was very small (just one row in 100M).The note then went on to show that using PL/SQL to select the rowids of the target rows then doing the bulk update by rowid was faster than the basic SQL update. The answer didn’t spring to mind immediately; but fortunately someone asked for some run-time statistics (v$sesstat) and the supplied statistics told me what was going on.

Conveniently the OP gave us the code to recreate the test case – all 100M rows of it; I cut this back to 16M rows (ca. 1.5GB of disc space), and then ran the tests with ny db_cache_size set to 256MB (another clue). I got similar results to the OP – not so dramatic, but the PL/SQL ran faster than the SQL and the difference was due to an obvious change in the CPU usage.

If you haven’t guess from the clue in the 256MB db_cache_size (which means the table is more than 5 times the size of the cache), the answer is “serial direct path reads”. For a sufficiently large table (and that’s not easy to define – start here and follow a few links) it’s fairly common knowledge that from 11g a tablescan can use a serial direct path read, and that’s what the PL/SQL was doing to select the required rowids. However, here’s a detail that’s not often mentioned: an update has to take place in public where everyone can see it so when Oracle executed the simple SQL update or select for update statement it had to scan the table through the buffer cache. Pulling all those blocks into the buffer cache, grabbing latches to link them to the right cache buffers chains, pinning them, then unpinning them uses a lot of CPU – which isn’t needed for the direct path read. The PL/SQL with its pure select used far less CPU than the basic SQL with its update/select for update, and because the OP had a very high-powered machine with plenty of CPU and loads of (disc-)caching effects all over the place the difference in CPU time was exremely visible as a fraction of the total DB time.

This was, inevitably, a very special case where a little detail became a significant fraction of the workload. The OP scanned 100M rows to update 1 row (in 7 – 13 seconds!). This doesn’t sound like a strategy you would normally want to adopt for frequent use; and for occasional use we might be happy to use the slower (13 second) approach to avoid the coding requirement of the fast (7 second) solution.

Footnote:

It’s worth pointing out that the PL/SQL strategy is not safe. In the few seconds between the select statement starting and the row being identified and updated by rowid it’s possible that another session could have updated (or deleted) the row. In the former case the update statement is now updating a row which doesn’t match the specification; in the latter case the code will raise an exception.

We can make the PL/SQL safer by including the original predicates in the update statement – but that still leaves the question of what the code should do if the select statement finds a row and the update fails to update it. Should it, perhaps, assume that there is still a row in the table that needs an update and re-run (using up all the time you saved by adopting a PL/SQL solution).

 

 

 

June 17, 2015

Reverse Key

Filed under: Indexing,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:11 pm BST Jun 17,2015

A question came up on the OTN database forum recently asking if you could have a partitioned index on a non-partitioned table.

(Aside: I’m not sure whether it would be quicker to read the manuals or try the experiment – either would probably be quicker than posing the question to the forum. As so often happens in these RTFM questions the OP didn’t bother to acknowledge any of the responses)

The answer to the question is yes – you can create a globally partitioned index, though if it uses range partitioning you have to specify a MAXVALUE partition. The interesting thing about the question, though is that several people tried to guess why it had been asked and then made suggestions based on the most likely guess (and wouldn’t it have been nice to see some response from the OP ). The common guess was that there was a performance problem with the high-value block of a sequence-based (or time-based) index – a frequent source of “buffer busy wait” events and other nasty side effects.

Unfortunately too many people suggesting reverse key as a solution to this “right-hand” problem. If you’re licensed for partitioning it’s almost certain that a better option would simple be to use global hash partitioning (with 2^N for some N) partitions. Using reverse keys can result in a bigger performance problem than the one you’re trying to avoid – you may end up turning a little time spent on buffer busy waits into a large amount of time spent on db file sequential reads. To demonstrate the issue I’ve created a sample script – and adjusted my buffer cache down to the appropriate scale:

create table t1(
	id	not null
)
nologging
as
with generator as (
	select	--+ materialize
		rownum id 
	from dual 
	connect by 
		rownum <= 1e4
)
select
	1e7 + rownum	id
from
	generator	v1,
	generator	v2
where
	rownum <= 1e7 
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 => 'T1'
	);
end;
/

alter table t1 add constraint t1_pk primary key(id) 
using index 
	reverse 
	nologging 
;

alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 8';

begin
	for i in 20000001..20010000 loop
		insert into t1 values(i);
	end loop;
end;
/

I’ve created a table with 10,000,000 rows using a sequential value as the primary key, then inserted “the next” 10,000 rows into the table in order. The index occupied about about 22,000 blocks, so to make my demonstration show you the type of effect you could get from a busy production system with more tables and many indexes I ran my test with the buffer cache limited to 6,000 blocks – a fair fraction of the total index size. Here’s a small section of the trace file from the test running 10.2.0.3 on an elderly machine:


WAIT #43: nam='db file sequential read' ela= 13238 file#=6 block#=12653 blocks=1 obj#=63623 tim=3271125590
WAIT #43: nam='db file sequential read' ela=  7360 file#=6 block#=12749 blocks=1 obj#=63623 tim=3271133150
WAIT #43: nam='db file sequential read' ela=  5793 file#=6 block#=12844 blocks=1 obj#=63623 tim=3271139110
WAIT #43: nam='db file sequential read' ela=  5672 file#=6 block#=12940 blocks=1 obj#=63623 tim=3271145028
WAIT #43: nam='db file sequential read' ela= 15748 file#=5 block#=13037 blocks=1 obj#=63623 tim=3271160998
WAIT #43: nam='db file sequential read' ela=  8080 file#=5 block#=13133 blocks=1 obj#=63623 tim=3271169314
WAIT #43: nam='db file sequential read' ela=  8706 file#=5 block#=13228 blocks=1 obj#=63623 tim=3271178240
WAIT #43: nam='db file sequential read' ela=  7919 file#=5 block#=13325 blocks=1 obj#=63623 tim=3271186372
WAIT #43: nam='db file sequential read' ela= 15553 file#=6 block#=13549 blocks=1 obj#=63623 tim=3271202115
WAIT #43: nam='db file sequential read' ela=  7044 file#=6 block#=13644 blocks=1 obj#=63623 tim=3271209420
WAIT #43: nam='db file sequential read' ela=  6062 file#=6 block#=13741 blocks=1 obj#=63623 tim=3271215648
WAIT #43: nam='db file sequential read' ela=  6067 file#=6 block#=13837 blocks=1 obj#=63623 tim=3271221887
WAIT #43: nam='db file sequential read' ela= 11516 file#=5 block#=13932 blocks=1 obj#=63623 tim=3271234852
WAIT #43: nam='db file sequential read' ela=  9295 file#=5 block#=14028 blocks=1 obj#=63623 tim=3271244368
WAIT #43: nam='db file sequential read' ela=  9466 file#=5 block#=14125 blocks=1 obj#=63623 tim=3271254002
WAIT #43: nam='db file sequential read' ela=  7704 file#=5 block#=14221 blocks=1 obj#=63623 tim=3271261991
WAIT #43: nam='db file sequential read' ela= 16319 file#=6 block#=14444 blocks=1 obj#=63623 tim=3271278492
WAIT #43: nam='db file sequential read' ela=  7416 file#=6 block#=14541 blocks=1 obj#=63623 tim=3271286129
WAIT #43: nam='db file sequential read' ela=  5748 file#=6 block#=14637 blocks=1 obj#=63623 tim=3271292163
WAIT #43: nam='db file sequential read' ela=  7131 file#=6 block#=14732 blocks=1 obj#=63623 tim=3271299489
WAIT #43: nam='db file sequential read' ela= 16126 file#=5 block#=14829 blocks=1 obj#=63623 tim=3271315883
WAIT #43: nam='db file sequential read' ela=  7746 file#=5 block#=14925 blocks=1 obj#=63623 tim=3271323845
WAIT #43: nam='db file sequential read' ela=  9208 file#=5 block#=15020 blocks=1 obj#=63623 tim=3271333239
WAIT #43: nam='db file sequential read' ela=  7708 file#=5 block#=15116 blocks=1 obj#=63623 tim=3271341141
WAIT #43: nam='db file sequential read' ela= 15484 file#=6 block#=15341 blocks=1 obj#=63623 tim=3271356807
WAIT #43: nam='db file sequential read' ela=  5488 file#=6 block#=15437 blocks=1 obj#=63623 tim=3271362623
WAIT #43: nam='db file sequential read' ela= 10447 file#=6 block#=15532 blocks=1 obj#=63623 tim=3271373342
WAIT #43: nam='db file sequential read' ela= 12565 file#=6 block#=15629 blocks=1 obj#=63623 tim=3271386741
WAIT #43: nam='db file sequential read' ela= 17168 file#=5 block#=15725 blocks=1 obj#=63623 tim=3271404135
WAIT #43: nam='db file sequential read' ela=  7542 file#=5 block#=15820 blocks=1 obj#=63623 tim=3271411882
WAIT #43: nam='db file sequential read' ela=  9400 file#=5 block#=15917 blocks=1 obj#=63623 tim=3271421514
WAIT #43: nam='db file sequential read' ela=  7804 file#=5 block#=16013 blocks=1 obj#=63623 tim=3271429519
WAIT #43: nam='db file sequential read' ela= 14470 file#=6 block#=16237 blocks=1 obj#=63623 tim=3271444168
WAIT #43: nam='db file sequential read' ela=  5788 file#=6 block#=16333 blocks=1 obj#=63623 tim=3271450154
WAIT #43: nam='db file sequential read' ela=  9630 file#=6 block#=16429 blocks=1 obj#=63623 tim=3271460008
WAIT #43: nam='db file sequential read' ela= 10910 file#=6 block#=16525 blocks=1 obj#=63623 tim=3271471174
WAIT #43: nam='db file sequential read' ela= 15683 file#=5 block#=16620 blocks=1 obj#=63623 tim=3271487065
WAIT #43: nam='db file sequential read' ela=  8094 file#=5 block#=16717 blocks=1 obj#=63623 tim=3271495454
WAIT #43: nam='db file sequential read' ela=  6670 file#=5 block#=16813 blocks=1 obj#=63623 tim=3271502293
WAIT #43: nam='db file sequential read' ela=  7852 file#=5 block#=16908 blocks=1 obj#=63623 tim=3271510360
WAIT #43: nam='db file sequential read' ela= 10500 file#=6 block#=17133 blocks=1 obj#=63623 tim=3271521039
WAIT #43: nam='db file sequential read' ela= 11038 file#=6 block#=17229 blocks=1 obj#=63623 tim=3271532275
WAIT #43: nam='db file sequential read' ela= 12432 file#=6 block#=17325 blocks=1 obj#=63623 tim=3271544974
WAIT #43: nam='db file sequential read' ela=  7784 file#=6 block#=17421 blocks=1 obj#=63623 tim=3271553331
WAIT #43: nam='db file sequential read' ela=  7774 file#=5 block#=17517 blocks=1 obj#=63623 tim=3271561346
WAIT #43: nam='db file sequential read' ela=  6583 file#=5 block#=17613 blocks=1 obj#=63623 tim=3271568146
WAIT #43: nam='db file sequential read' ela=  7901 file#=5 block#=17708 blocks=1 obj#=63623 tim=3271576231
WAIT #43: nam='db file sequential read' ela=  6667 file#=5 block#=17805 blocks=1 obj#=63623 tim=3271583259
WAIT #43: nam='db file sequential read' ela=  9427 file#=6 block#=18029 blocks=1 obj#=63623 tim=3271592988
WAIT #43: nam='db file sequential read' ela= 52334 file#=6 block#=18125 blocks=1 obj#=63623 tim=3271646055
WAIT #43: nam='db file sequential read' ela= 50512 file#=6 block#=18221 blocks=1 obj#=63623 tim=3271697284
WAIT #43: nam='db file sequential read' ela= 10095 file#=6 block#=18317 blocks=1 obj#=63623 tim=3271708095

Check the block number for this list of single block reads – we’re jumping through the index about 100 blocks at a time to read the next block where an index entry has to go. The jumps are the expected (and designed) effect of reverse key indexes: the fact that the jumps turn into physical disc reads is the (possibly unexpected) side effect. Reversing an index makes adjacent values look very different (by reversing the bytes) and go to different index leaf blocks: the purpose of the exercise is to scatter concurrent similar inserts across multiple blocks, but if you scatter the index entries you need to buffer a lot more of the index to keep the most recently used values in memory. Reversing the index may eliminate buffer busy waits, but it may increase time lost of db file sequential reads dramatically.

Here’s a short list of interesting statistics from this test – this time running on 11.2.0.4 on a machine with SSDs) comparing the effects of reversing the index with those of not reversing the index – normal index first:


Normal index
------------
CPU used by this session               83
DB time                                97
db block gets                      40,732
physical reads                         51
db block changes                   40,657
redo entries                       20,174
redo size                       5,091,436
undo change vector size         1,649,648

Repeat with reverse key index
-----------------------------
CPU used by this session              115
DB time                               121
db block gets                      40,504
physical reads                     10,006
db block changes                   40,295
redo entries                       19,973
redo size                       4,974,820
undo change vector size         1,639,232

Because of the SSDs there’s little difference in timing between the two sets of data and, in fact, all the other measures of work done are very similar except for the physical read, and the increase in reads is probably the cause of the extra CPU time thanks to both the LRU manipulation and the interaction with the operating system.

If you want to check the effect of index reversal you can take advantage of the sys_op_lbid() function to sample a little of your data – in my case I’ve queried the last 10,000 rows (values) in the table:


select 
	/*+ 
		cursor_sharing_exact 
		dynamic_sampling(0) 
		no_monitoring 
		no_expand 
		index_ffs(t1,t1_i1) 
		noparallel_index(t,t1_i1) 
	*/ 
	count (distinct sys_op_lbid( &m_ind_id ,'L',t1.rowid)) as leaf_blocks
from 
	t1
where 
	id between 2e7 + 1 and 2e7 + 1e4
;

The &m_ind_id substition variable is the object_id of the index t1_i1.

In my case, with an index of 22,300 leaf blocks, my 10,000 consecutive values were scattered over 9,923 leaf blocks. If I want access to “recent data” to be as efficient as possible I need to keep that many blocks of the index cached, compared to (absolute) worst case for my data 100 leaf blocks. When you reverse key an index you have to think about how much bigger you have to make your buffer cache to keep the performance constant.

Next Page »

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 6,206 other followers