Oracle Scratchpad

March 22, 2019

Stats advisor

Filed under: 12c,Oracle,Statistics — Jonathan Lewis @ 1:10 pm GMT Mar 22,2019

This is just a little shout-out about the Stats Advisor – if you decide to give it a go, what sort of things is it likely to tell you? The answer is in a dynamic performance view called v$stats_advisor_rules – which I’ve listed below from an instance running 18.3.0.0.


SQL> set linesize 180
SQL> set trimspool on
SQL> set pagesize 40
SQL> column description format a75
SQL> column name format a32
SQL> break on rule_type duplicate skip 1
SQL> select * from v$stats_advisor_rules;

  RULE_ID NAME                             RULE_TYPE DESCRIPTION                                                                     CON_ID
---------- -------------------------------- --------- --------------------------------------------------------------------------- ----------
         0                                  SYSTEM                                                                                         0
         1 UseAutoJob                       SYSTEM    Use Auto Job for Statistics Collection                                               0
         2 CompleteAutoJob                  SYSTEM    Auto Statistics Gather Job should complete successfully                              0
         3 MaintainStatsHistory             SYSTEM    Maintain Statistics History                                                          0
         4 UseConcurrent                    SYSTEM    Use Concurrent preference for Statistics Collection                                  0
         5 UseDefaultPreference             SYSTEM    Use Default Preference for Stats Collection                                          0
         6 TurnOnSQLPlanDirective           SYSTEM    SQL Plan Directives should not be disabled                                           0

         7 AvoidSetProcedures               OPERATION Avoid Set Statistics Procedures                                                      0
         8 UseDefaultParams                 OPERATION Use Default Parameters in Statistics Collection Procedures                           0
         9 UseGatherSchemaStats             OPERATION Use gather_schema_stats procedure                                                    0
        10 AvoidInefficientStatsOprSeq      OPERATION Avoid inefficient statistics operation sequences                                     0

        11 AvoidUnnecessaryStatsCollection  OBJECT    Avoid unnecessary statistics collection                                              0
        12 AvoidStaleStats                  OBJECT    Avoid objects with stale or no statistics                                            0
        13 GatherStatsAfterBulkDML          OBJECT    Do not gather statistics right before bulk DML                                       0
        14 LockVolatileTable                OBJECT    Statistics for objects with volatile data should be locked                           0
        15 UnlockNonVolatileTable           OBJECT    Statistics for objects with non-volatile should not be locked                        0
        16 MaintainStatsConsistency         OBJECT    Statistics of dependent objects should be consistent                                 0
        17 AvoidDropRecreate                OBJECT    Avoid drop and recreate object seqauences                                            0
        18 UseIncremental                   OBJECT    Statistics should be maintained incrementally when it is beneficial                  0
        19 NotUseIncremental                OBJECT    Statistics should not be maintained incrementally when it is not beneficial          0
        20 AvoidOutOfRange                  OBJECT    Avoid Out of Range Histogram endpoints                                               0
        21 UseAutoDegree                    OBJECT    Use Auto Degree for statistics collection                                            0
        22 UseDefaultObjectPreference       OBJECT    Use Default Object Preference for statistics collection                              0
        23 AvoidAnalyzeTable                OBJECT    Avoid using analyze table commands for statistics collection                         0

24 rows selected.

As you can see the rules fall into three groups: system, operation, and object – and at all three levels you can’t help noticing how common the theme is: “just stick with the defaults!”.
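If you want to see what the advisor actually says about your system, the API is in the dbms_stats package (from 12.2 onwards). The following is just a minimal sketch – the task name is my own choice and you’ll need suitable privileges (I’d run it from a DBA account) – create a task, execute it, then report it:

set long 1000000
set longchunksize 1000000

variable m_task_name    varchar2(128)
variable m_exec_name    varchar2(128)

begin
        -- both calls are functions that return the task/execution name
        :m_task_name := dbms_stats.create_advisor_task('MY_STATS_ADVISOR_TASK');
        :m_exec_name := dbms_stats.execute_advisor_task('MY_STATS_ADVISOR_TASK');
end;
/

select  dbms_stats.report_advisor_task('MY_STATS_ADVISOR_TASK')
from    dual
;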

As so often happens when I start writing a catch-up or “remind myself” note, I found that Tim Hall has already written all about it.

March 21, 2019

Lost time

Filed under: Execution plans,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 12:51 pm GMT Mar 21,2019

Here’s a little puzzle that came up in the ODC database forum yesterday – I’ve got a query that has been captured by SQL Monitor, and it’s taking much longer to run than it should but the monitoring report isn’t telling me what I need to know about the time.

Here’s a little model to demonstrate the problem – I’m going to join a table to itself (the self join isn’t a necessary feature of the demonstration, I’ve just been a bit lazy in preparing data). Here’s a (completely truthful) description of the table:

SQL> desc t1
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ID                                     NUMBER
 MOD_1000                               NUMBER
 V1                                     VARCHAR2(40)
 V2                                     VARCHAR2(40)
 PADDING                                VARCHAR2(100)

SQL> select num_rows, blocks from user_tables where table_name = 'T1';

  NUM_ROWS     BLOCKS
---------- ----------
    400000       7798

1 row selected.


And here’s the output I get from dbms_sql_monitor.report_sql_monitor() for the query running on 18.3.0.0:


Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TEST_USER (261:36685)
 SQL ID              :  g6j671u7zc9mn
 SQL Execution ID    :  16777218
 Execution Started   :  03/21/2019 08:54:56
 First Refresh Time  :  03/21/2019 08:54:56
 Last Refresh Time   :  03/21/2019 08:55:17
 Duration            :  21s
 Module/Action       :  MyModule/MyAction
 Service             :  SYS$USERS
 Program             :  sqlplus@linux183.localdomain (TNS V1-V3)
 Fetch Calls         :  2

Global Stats
==========================================================================
| Elapsed |   Cpu   |    IO    | PL/SQL  | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Time(s) | Calls |  Gets  | Reqs | Bytes |
==========================================================================
|      21 |      21 |     0.04 |    0.07 |     2 |     3M |  215 | 180MB |
==========================================================================

SQL Plan Monitoring Details (Plan Hash Value=83896840)
==================================================================================================================================================
| Id |       Operation       | Name |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                       |      | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT      |      |         |      |        21 |     +1 |     1 |        1 |      |       |     . |          |                 |
|  1 |   FILTER              |      |         |      |        21 |     +1 |     1 |        1 |      |       |     . |          |                 |
|  2 |    HASH JOIN OUTER    |      |     400 | 2014 |        21 |     +1 |     1 |      400 |      |       |   2MB |          |                 |
|  3 |     TABLE ACCESS FULL | T1   |     400 | 1005 |         1 |     +1 |     1 |      400 |   62 |  60MB |     . |          |                 |
|  4 |     TABLE ACCESS FULL | T1   |    400K | 1001 |        21 |     +1 |     1 |     400K |      |       |     . |          |                 |
==================================================================================================================================================

As you can see, the total elapsed time is 21 seconds, of which the CPU time is the whole 21 seconds.

This seems a little large for a simple hash join, so we should look at the “Activity Detail” section of the plan because that will report any ASH samples that can be found for this execution of this query. And that’s where the problem lies: despite using 21 CPU seconds in 21 seconds there are no ASH samples for the execution! Moreover – looking for other numerical oddities – the plan says we read 60MB in 62 read requests (that’s the first tablescan of the 7,798 blocks of t1), but the summary says we read 180MB – where did we lose (or gain) 120MB?

It might help to see the query (which is why I didn’t show it in the output above) and it might help to see the predicate section (which SQL Monitor doesn’t report). So here’s the query and its plan, with the resulting predicate section, pulled from memory:


SQL_ID  g6j671u7zc9mn, child number 0
-------------------------------------
select  /*+ monitor */  t1a.mod_1000, t1b.mod_1000 from  t1 t1a,  t1
t1b where  t1a.mod_1000 = 500 and t1b.id(+) = t1a.id and
nvl(t1b.mod_1000,0) + f2(t1a.id) + f3(t1a.id) > 0

Plan hash value: 83896840

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |       |       |  2014 (100)|          |
|*  1 |  FILTER             |      |       |       |            |          |
|*  2 |   HASH JOIN OUTER   |      |   400 |  7200 |  2014   (4)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| T1   |   400 |  3600 |  1005   (3)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T1   |   400K|  3515K|  1001   (3)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NVL("T1B"."MOD_1000",0)+"F2"("T1A"."ID")+"F3"("T1A"."ID")>0)
   2 - access("T1B"."ID"="T1A"."ID")
   3 - filter("T1A"."MOD_1000"=500)


Notice the two function calls that are part of the FILTER operation. That’s where the time is going – but I don’t think you can infer that from the SQL Monitor report (which rather suggests that the problem might be with the second full tablescan of t1 at operation 4).

Functions f2() and f3() both do a table scan of another table that is as large as t1 – and it looks as if they’re both going to be called 400 times, which is a lot of work. The 120MB of “lost” reads is the first table scan of each of the two tables, and the 3M buffer gets (in the summary, if you didn’t notice it) come from the repeated tablescans as each row comes out of the hash join and the filter is applied.

Why doesn’t the report show us the CPU (and disk) samples? Because it’s querying ASH (v$active_session_history) by the SQL_ID of the principal SQL statement – and the work is being done by two other statements with different SQL_IDs.

Since it’s PL/SQL functions doing all the work, why isn’t the 21 CPU seconds showing up in the “PL/SQL time(s)” summary figure? Because the time isn’t being spent in PL/SQL, it’s being spent in the SQL being run by the PL/SQL.
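One way to confirm this (assuming you’re suitably licensed for the Diagnostics Pack) is to query ASH by the top-level SQL_ID instead of the SQL_ID – here’s a minimal sketch using the SQL_ID from the Monitor report above; the recursive statements doing the work should show up under their own sql_id values:

select
        sql_id, count(*) sample_count
from
        v$active_session_history
where
        top_level_sql_id = 'g6j671u7zc9mn'
and     sample_time > sysdate - 10/1440         -- last 10 minutes
group by
        sql_id
order by
        count(*) desc
;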

Just to finish off, let’s try to find the SQL being run by the PL/SQL. I’ll post the script to recreate the example at the end of the posting but for the moment I’ll just show you the query against v$sql that I ran to extract the SQL that’s embedded in the functions:


select  sql_id, executions, disk_reads, buffer_gets, sql_text
from    V$sql 
where   sql_text like 'SELECT%T2%' 
or      sql_text like 'SELECT%T3%'
;

SQL_ID        EXECUTIONS DISK_READS BUFFER_GETS SQL_TEXT
------------- ---------- ---------- ----------- --------------------------------------------------
12ytf1rry45d9        400       7683     3072817 SELECT MAX(ID) FROM T2 WHERE MOD_1000 = :B1
85dmsgqg3bh4w          1       7680        7698 SELECT MAX(ID) FROM T3 WHERE MOD_1000 = :B1

As you can see, there are roughly 120MB of disk I/O and 3M buffer gets due to these two statements – and one of them has run the 400 times we expected. It looks as if Oracle has done a cute little optimisation with the other function, though. If you look at the predicate it says:

  • NVL("T1B"."MOD_1000",0)+"F2"("T1A"."ID")+"F3"("T1A"."ID")>0

It looks as if Oracle has either cached the result of the f3() function call (in which case why didn’t it also cache the f2() result) or it’s stopped evaluating the predicate as soon as the running total exceeded zero (but that might be a problem since f3() could return a negative number!). I don’t know exactly why Oracle is doing what it’s doing – but if I reverse the order of the functions in the predicate the f3() query will run 400 times and the f2() query will run once.
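If you wanted to check the call counts directly rather than inferring them from v$sql you could add a little instrumentation to the functions. Here’s a minimal sketch using a package of session-level counters – the package is my own invention, and I’ve only shown f2(), with f3() getting the same treatment:

create or replace package call_counter as
        f2_calls        number := 0;
        f3_calls        number := 0;
end call_counter;
/

create or replace function f2 (i_in number) return number
as
        m_ret number;
begin
        call_counter.f2_calls := call_counter.f2_calls + 1;     -- count every call

        select  max(id)
        into    m_ret
        from    t2
        where   mod_1000 = i_in
        ;

        return m_ret;
end;
/

-- after running the test query in the same session:

set serveroutput on

begin
        dbms_output.put_line('f2() calls: ' || call_counter.f2_calls);
        dbms_output.put_line('f3() calls: ' || call_counter.f3_calls);
end;
/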

Appendix

If you want to re-run some tests here’s the code I used to create the demo:


create table t1
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,1000)              mod_1000,
        lpad(rownum,10,'0')             v1,
        lpad(rownum,10,'0')             v2,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 4e5 -- > comment to avoid WordPress format issue
;

-- alter table t1 add constraint t1_pk primary key(id);

create table t2 nologging as select * from t1;
create table t3 nologging as select * from t1;

create function f2 (i_in number) return number
as
        m_ret number;
begin
        select max(id)
        into    m_ret
        from    t2
        where   mod_1000 = i_in
        ;

        return m_ret;
end;
/

create function f3 (i_in number) return number
as
        m_ret number;
begin
        select  max(id)
        into    m_ret
        from    t3
        where   mod_1000 = i_in
        ;

        return m_ret;
end;
/


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T3',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter system flush buffer_cache;

variable b1 number
variable b2 number

set timing on

select
        /*+ monitor */
        t1a.mod_1000, t1b.mod_1000
from
        t1      t1a,
        t1      t1b
where
        t1a.mod_1000 = 500
and     t1b.id(+) = t1a.id
and     nvl(t1b.mod_1000,0) + f3(t1a.id) + f2(t1a.id) > 0
;


One interesting little detail, if you try the tests, is that the join has to be an outer join for the FILTER operation to appear.

March 19, 2019

IM_DOMAIN$

Filed under: 18c,Oracle,Troubleshooting — Jonathan Lewis @ 12:05 pm GMT Mar 19,2019

A few months ago Franck Pachot wrote about a recursive SQL statement that kept appearing in the library cache. I discovered the note today because I had just found a client site where the following statement suddenly appeared near the top of the “SQL ordered by Executions” section of their AWR reports after they had upgraded to 18c.


select domain# from sys.im_domain$ where objn = :1 and col# = :2

I found Franck’s article by the simple expedient of typing the entire query into a Google search – his note was the first hit on the list, and he had a convenient example (based on the SCOTT schema) to demonstrate the effect, so I built the tables from the schema and ran a simple test with extended SQL tracing (event 10046) enabled.

Here’s an extract (with no deletions) from the resulting trace file:

PARSING IN CURSOR #139819795591784 len=110 dep=0 uid=104 oct=3 lid=104 tim=31306461773 hv=3572295767 ad='6bf8b8a0' sqlid='8n2bcc3aftu2r'
select /*+ leading(EMP DEPT) USE_HASH(DEPT) USE_HASH(BONUS) */ * from DEPT natural join EMP natural join BONUS
END OF STMT
PARSE #139819795591784:c=59,e=59,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4262704031,tim=31306461772

PARSING IN CURSOR #139819795585328 len=64 dep=1 uid=0 oct=3 lid=0 tim=31306461966 hv=1240924087 ad='69a8b760' sqlid='0b639nx4zdzxr'
select domain# from sys.im_domain$ where objn = :1 and col# = :2
END OF STMT
PARSE #139819795585328:c=37,e=37,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=2321277860,tim=31306461965
EXEC #139819795585328:c=32,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=2321277860,tim=31306462058
FETCH #139819795585328:c=17,e=17,p=0,cr=1,cu=0,mis=0,r=0,dep=1,og=4,plh=2321277860,tim=31306462098
STAT #139819795585328 id=1 cnt=0 pid=0 pos=1 obj=10422 op='TABLE ACCESS BY INDEX ROWID IM_DOMAIN$ (cr=1 pr=0 pw=0 str=1 time=21 us cost=0 size=39 card=1)'
STAT #139819795585328 id=2 cnt=0 pid=1 pos=1 obj=10423 op='INDEX UNIQUE SCAN IM_DOMAIN_UK (cr=1 pr=0 pw=0 str=1 time=18 us cost=0 size=0 card=1)'
CLOSE #139819795585328:c=5,e=5,dep=1,type=1,tim=31306462287

EXEC #139819795591784:c=484,e=484,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=4262704031,tim=31306462316
FETCH #139819795591784:c=0,e=804,p=0,cr=44,cu=0,mis=0,r=0,dep=0,og=1,plh=4262704031,tim=31306463191
STAT #139819795591784 id=1 cnt=0 pid=0 pos=1 obj=0 op='HASH JOIN  (cr=45 pr=0 pw=0 str=1 time=1222 us cost=72 size=97 card=1)'
STAT #139819795591784 id=2 cnt=4 pid=1 pos=1 obj=0 op='HASH JOIN  (cr=45 pr=0 pw=0 str=1 time=1001 us cost=70 size=232 card=4)'
STAT #139819795591784 id=3 cnt=4 pid=2 pos=1 obj=117764 op='TABLE ACCESS FULL EMP (cr=22 pr=0 pw=0 str=1 time=259 us cost=35 size=152 card=4)'
STAT #139819795591784 id=4 cnt=4 pid=2 pos=2 obj=117765 op='TABLE ACCESS FULL DEPT (cr=22 pr=0 pw=0 str=1 time=81 us cost=35 size=80 card=4)'
STAT #139819795591784 id=5 cnt=0 pid=1 pos=2 obj=117766 op='TABLE ACCESS FULL BONUS (cr=0 pr=0 pw=0 str=1 time=4 us cost=2 size=39 card=1)'
CLOSE #139819795591784:c=24,e=24,dep=0,type=1,tim=31306508552

PARSE #139819795591784:c=41,e=42,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4262704031,tim=31306508798
PARSE #139819795585328:c=21,e=22,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=2321277860,tim=31306509010
EXEC #139819795585328:c=132,e=132,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=2321277860,tim=31306509220
FETCH #139819795585328:c=20,e=19,p=0,cr=1,cu=0,mis=0,r=0,dep=1,og=4,plh=2321277860,tim=31306509415
CLOSE #139819795585328:c=8,e=8,dep=1,type=3,tim=31306509494
EXEC #139819795591784:c=682,e=704,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=4262704031,tim=31306509558
FETCH #139819795591784:c=588,e=1246,p=0,cr=44,cu=0,mis=0,r=0,dep=0,og=1,plh=4262704031,tim=31306511014
CLOSE #139819795591784:c=23,e=22,dep=0,type=3,ti

As you can see, every time I do a parse call for the query against the SCOTT schema (PARSE #139819795591784), Oracle does a parse/exec/fetch/close for the query against im_domain$ (PARSE #139819795585328) – and this happens even when the SCOTT query is in the session cursor cache!

As Franck points out, this looks like something to do with the In Memory option even though the option wasn’t enabled in his database, and wasn’t enabled in my client’s database. Once you’ve got a reproducible example of a problem, though, you can start fiddling to see if you can bypass it. In this case I decided to check all the parameters to do with the in-memory option – which is a bit like hard work because there are 208 parameters that include the word “inmemory”. After checking the descriptions of the first twenty or so I decided there was an easier option – if Oracle is asking about “domains” for columns possibly it’s something to do with the relatively new “join group” feature for in-memory columnar compression, so I ran a query to list the parameter names and descriptions for parameters with the words “join” and “group” in their names – there are two:
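For reference, this is the sort of query I mean – it’s against an x$ structure so it has to be run as SYS, and you should treat it as a sketch since the x$ksppi columns aren’t documented:

select
        ksppinm         name,
        ksppdesc        description
from
        x$ksppi
where
        ksppinm like '%join%group%'
order by
        ksppinm
;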


_sqlexec_join_group_aware_hj_enabled              enable/disable join group aware hash join
_sqlexec_join_group_aware_hj_unencoded_rowset     minimum number of unencoded rowsets processed before adaptation 

The first one looks rather promising – it has a default value of TRUE, and it can be changed by “alter session” or “alter system”. So I executed:


alter session set "_sqlexec_join_group_aware_hj_enabled" = false;
alter system flush shared_pool;

Then I ran my test again and voila! there it wasn’t. No more queries against im_domain$.

Problem solved (until the client decides they want to use the in-memory option, perhaps).

There may be other reasons why this recursive query appears which aren’t to do with hash joins, of course, but join groups are specifically intended to allow efficient hash joins with the in-memory option (it’s a trick to do with common encoding for compression that allows Bloom filtering to eliminate CUs without decoding), so I’m hoping I won’t have to track down and eliminate another source of the query.

 

March 13, 2019

Hash Partitions

Filed under: Oracle,Partitioning,Statistics — Jonathan Lewis @ 1:13 pm GMT Mar 13,2019

Here’s an important thought if you’ve got any large tables which are purely hash partitioned. As a general guideline you should not need partition level stats on those tables. The principle of hash partitioned tables is that the rows are distributed uniformly and randomly based on the hash key so, with the assumption that the number of different hash keys is “large” compared to the number of partitions, any one partition should look the same as any other partition.

Consider, as a thought experiment (and as a warning), a table of product_deliveries which is hash partitioned by product_id with ca. 65,000 distinct products that have been hashed across 64 partitions. (Hash partitioning should always use a power of 2 for the partition count if you want the number of rows per partition to be roughly the same across all partitions – if you don’t pick a power of two then some of the partitions will be roughly twice the size of others.)
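Purely as a sketch (the column list is imaginary), the sort of table definition I have in mind is:

create table product_deliveries (
        delivery_id             number          not null,
        product_id              number          not null,
        delivery_location       varchar2(40),
        delivery_date           date,
        padding                 varchar2(100)
)
partition by hash (product_id)
partitions 64
;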

Consider a query for “deliveries to Basingstoke” – in the absence of a histogram on the delivery location the optimizer will produce a cardinality estimate that is:

  • total rows in table / number of distinct delivery locations in table

Now consider a query for: “deliveries of product X to Basingstoke” – again in the absence of histograms. The optimizer could have two ways of calculating this cardinality:

  • total rows in table / (number of distinct products in table * number of distinct delivery locations in table)
  • total rows in relevant partition / (number of distinct products in relevant partition * number of distinct delivery locations in relevant partition)

But given the intent of hash partitioning to distribute data evenly we can make three further observations:

  1. the number of rows in any one partition should be very similar to the number of rows in the table divided by the number of partitions
  2. the number of distinct products in any one partition should be very similar to the number of products in the table divided by the number of partitions
  3. the number of distinct locations in any one partition should be very similar to the number of distinct locations in the whole table.

The second condition holds because product is the partition key; the third holds because location is not the partition key.

So we can rewrite the second, partition-oriented, formula as:

  • (total rows in table / number of partitions) / ((number of distinct products in table / number of partitions) * number of distinct locations in table)

which, re-arranging parentheses and cancelling common factors, reduces to:

  • total rows in table / (number of distinct products in table * number of distinct locations in table)

which matches the first formula. (Q.E.D.) In the absence of any statistics on hash partitions the optimizer can (ought to be able to) produce reasonable cardinality estimates based purely on table-level stats.
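If you want to act on this observation and stop gathering partition-level stats on such a table, one option (a sketch, using the hypothetical product_deliveries table above) is to set the table’s granularity preference so that dbms_stats collects only global stats:

begin
        dbms_stats.set_table_prefs(
                ownname => user,
                tabname => 'PRODUCT_DELIVERIES',
                pname   => 'GRANULARITY',
                pvalue  => 'GLOBAL'
        );
end;
/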

In fact if you look back into the history of partitioning this observation is implicit in the early days of composite partitioning when the only option was for range/hash composite partitions – the optimizer never used sub-partition stats to calculate costs or cardinality it used only partition-level statistics. (And it was several years before the optimizer caught up to the fact that (e.g.) range/list composite partitioning might actually need to do arithmetic based on subpartition stats.)

I did say that the example was also a warning. Hash partitioning is “expected” to have a large number of distinct key values compared to the number of partitions. (If you don’t meet this requirement then possibly you should be using list partitioning.) There’s also a “uniformity” assumption built into the arithmetic (both the basic arithmetic and the hand-waving discussion I produced above). Just imagine that your company supplies a handful of products that for some strange reason are incredibly popular in Basingstoke. If this is the case then the assumption that “all partitions look alike” is weakened and you may have to produce a workaround to address the poor cardinality estimates that the variation produces.

A pattern of this type has two generic effects on the optimizer, of course. First is the simple skew in the data – to have a significant impact the number of rows for the problem products would have to be much larger than average, which suggests the need for a suitably crafted histogram; secondly there’s an implied correlation between a few products and Basingstoke, so you might even end up creating a column group and manually coding a histogram on it to capture the correlation.
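Here’s a sketch of what the column group option might look like for the hypothetical product_deliveries table – create the extension, then gather a histogram on it (hand-crafting the histogram with dbms_stats.set_column_stats is a further refinement I won’t go into here):

declare
        m_extension_name        varchar2(4000);
begin
        -- create the (product_id, delivery_location) column group
        m_extension_name := dbms_stats.create_extended_stats(
                ownname   => user,
                tabname   => 'PRODUCT_DELIVERIES',
                extension => '(product_id, delivery_location)'
        );

        -- gather a histogram on the column group
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'PRODUCT_DELIVERIES',
                method_opt  => 'for columns (product_id, delivery_location) size 254'
        );
end;
/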

 

March 11, 2019

sys_op_lbid

Filed under: Indexing,Infrastructure,IOT,Oracle,Statistics — Jonathan Lewis @ 1:23 pm GMT Mar 11,2019

I’ve made use of the sys_op_lbid() function a few times in the past, for example in this posting on the dangers of using reverse key indexes, but every time I’ve mentioned it I’ve only been interested in the “leaf blocks per key” option. There are actually four different variations of the function, relevant to different types of index and controlled by setting a flag parameter to one of four different values.

The call to sys_op_lbid() takes 3 parameters: an index (or index [sub]partition) object id, a flag value, and a table “rowid”, where the flag value can be one of L, R, O, or G. The variations of the call are as follows:

  • L – the function will return the row directory address (i.e. something that looks like a rowid) of the first index entry in the leaf block that holds the index entry for the referenced table rowid. The effect of this is that the number of distinct values returned by calling the function for every row in the table is equal to the number of index leaf blocks which currently hold an active entry. (See the sketch after this list for a stand-alone example.)
  • R – Relevant only to bitmap indexes; the function will return the row directory address of the bitmap index entry for the referenced table rowid. The effect of this is that the number of distinct values returned by calling the function for every row in the table is equal to the number of index entries in the bitmap index.
  • O – Relevant only to the primary key index of an index organized table with an overflow. The function is used with a non-key column instead of a rowid and returns a rowid that corresponds to the row directory entry in the overflow segment. An interesting detail of the overflow entries is that there is an “nrid” (next rowid) pointer in the primary key index entry that does not get deleted when all the columns in the related overflow entry are set null – so you can delete all the data from the overflow (set every overflow column in every row to null) and the primary key clustering factor would not change.
  • G – Relevant only to secondary indexes on an index organized table. Like the L and R options this function takes a rowid (which is a special case for IOTs) as one of its inputs and uses the block guess from the secondary index to construct a row directory entry for the first entry in the primary key leaf block that corresponds to that block guess. This serves two purposes – it allows Oracle to calculate the clustering factor of the secondary index (as you walk the secondary index in order how much do you jump around the leaf blocks of the primary key), and it allows Oracle to produce the pct_direct_access figure for the secondary index by joining the secondary index to the primary key index on primary key, and comparing the ‘G’ result for the secondary with the ‘L’ result from the primary, which gives a count of the number of times the guess is correct.
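For example, here’s a minimal sketch of the ‘L’ option used stand-alone to count the populated leaf blocks of an index – you have to supply the object_id of the index (I’ve used the t1_pk index and the object_id that appears in the script further down; substitute your own) and hint the query to walk that index:

rem     set this to the object_id of the index you want to examine
define m_index_object_id = 351334

select
        /*+ index(t,t1_pk) */
        count(distinct sys_op_lbid(&m_index_object_id,'L',t.rowid))    leaf_block_count
from
        t1 t
where
        id is not null
;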

These observations can be confirmed by gathering stats on different structures with trace enabled, and doing a couple of block dumps. For reference the following is just a simple script to create an index organized table with overflow and secondary index:


rem
rem     Script:         sys_op_lbid_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2018
rem

create table t1(
        id      constraint t1_pk primary key,
        v1      ,
        v2      ,
        v3      ,
        padding 
)
organization index
pctthreshold 2
overflow
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum,
        lpad(rownum,30),
        lpad(rownum,30),
        lpad(rownum,40),
        rpad('x',100,'x')
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

create index t1_i1 on t1(v3);

alter session set sql_trace true;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter session set sql_trace false;

select
        object_id, object_name
from
        user_objects
order by
        object_id
;

The significance of the query for object_id and data_object_id shows up in the trace file (and subsequent dumps) when Oracle uses one or other of the values in its SQL and rowid construction.

Here are the interesting SQL statements generated as the stats are gathered – but cosmetically altered to be reader-friendly. In order they are:

  1. Stats for primary key of IOT: using the ‘L’ option for counting leaf blocks and the ‘O’ option for the clustering factor into the overflow segment.
  2. Stats for secondary index of IOT: using the ‘L’ option for counting leaf blocks and the ‘G’ option for the clustering factor into the primary key index
  3. Calculate pct_direct_access: the ‘L’ option gives the actual leaf block in the primary key index, the ‘G’ option gives the leaf block guessed by the secondary index

select 
        /*+ index(t,t1_pk) */ 
        count(*) as nrw,
        count(distinct sys_op_lbid(351334,'L',t.rowid)) as nlb,
        null as ndk,
        sys_op_countchg(sys_op_lbid(351334,'O',V1),1) as clf
from
        t1 t 
where 
        id is not null
;


select 
        /*+ index(t,t1_i1) */ 
        count(*) as nrw,
        count(distinct sys_op_lbid(351335,'L',t.rowid)) as nlb,
        null as ndk,
        sys_op_countchg(sys_op_lbid(351335,'G',t.rowid),1) as clf
from
        t1 t 
where 
        v3 is not null
;


select
        case when count(*) = 0
                then 100
                else round(
                        count(
                                case when substr(gdba,7,9)=substr(lbid,7,9)
                                        then 1
                                        else null
                                end
                        )/count(*)*100
                )
        end
from    (
        select
                /*+
                        ordered
                        use_hash(i.t1 t2)
                        index_ffs(t2,t1_pk)
                */
                sys_op_lbid(351334,'L',t2.rowid) lbid,
                gdba
        from (
                select
                        /*+ index_ffs(t1,t1_i1) */
                        sys_op_lbid(351335,'G',t1.rowid) gdba,
                        t1.ID
                from
                        t1 t1
                ) i,
                t1 t2
        where
                i.id = t2.id
        )
;

The strange substr(,7,9) that appears in the join between the primary key index and the secondary index is needed because the ‘G’ option uses the object_id of the table to turn an absolute block guess into a rowid while the ‘L’ option is using the data_object_id of the primary key index to turn its block address into a rowid. (This means there may be variants of this SQL for IOTs using partitioning.)

 

March 8, 2019

Append hint

Filed under: Hints,Ignoring Hints,Infrastructure,Oracle — Jonathan Lewis @ 1:13 pm GMT Mar 8,2019

One of the questions that came up on the CBO Panel Session at the UKOUG Tech2018 conference was about the /*+ append */ hint – specifically how to make sure it was ignored when it came from a 3rd party tool that was used to load data into the database. The presence of the hint resulted in increasing amounts of space in the table being “lost”: as older data was deleted by the application the space was never reused because the inserts always went above the table’s highwater mark; and it wasn’t possible to change the application code.

The first suggestion aired was to create an SQL Patch to associate the hint /*+ ignore_optim_embedded_hints */ with the SQL in the hope that this would make Oracle ignore the append hint. This won’t work, of course, because the append hint is not an optimizer hint, it’s a “behaviour” hint.

There are, however, various things that will make the append hint invalid – sometimes to the great surprise of the people using it. The three things I can think of at present are:

  • row-level triggers
  • enabled foreign key constraints
  • non-unique indexes enforcing unique constraints

It’s been some time since I last ran a detailed check so I’m not going to guarantee that the following claims are still true – if it matters to you then it’s easy enough to create a little test of (say) 10,000 rows inserted into a non-empty, indexed table.

  • Row level triggers turn array processing into single row processing, so there’s a significant increase in redo generation.
  • Non-unique indexes enforcing unique constraints have (small but) potentially undesirable effects on the optimizer and on run-time and may turn array processing into single row processing.
  • Foreign key constraints require checking, which may have some impact but doesn’t turn array processing into single row processing.

Of the three options the foreign key constraint seemed to me to be the best strategy to disable the hint with minimal side effects, so my answer was:

“Create a new table with no data and a primary key constraint; add an invisible column to the target table, index it (online) and add a foreign key constraint from the column to the new table”.

My thinking on this was that the foreign key will disable the append hint. The column will always be null which means it will always satisfy the foreign key constraint without being checked and it won’t introduce any index maintenance overheads. Net effect: /*+ append */ is disabled with no extra overheads.
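As a sketch (all the object and column names here are invented, and the invisible column needs 12c):

create table append_blocker (
        id      number(1)
                constraint append_blocker_pk primary key
);

alter table target_table add (append_fk_col number invisible);

create index target_fk_idx on target_table(append_fk_col) online;

alter table target_table
        add constraint target_append_fk
        foreign key (append_fk_col)
        references append_blocker(id)
;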

Footnote

The append hint is also ignored if the table is an Index Organized Table (IOT), but changing a heap table to an IOT is probably not a sensible choice if all you want to do is disable the hint – the potential for unexpected client side anomalies is too high, and if the table is heavily indexed the processing overhead for the secondary indexes could be fairly significant.

If I recall correctly the person asking the question said that the “do nothing” trigger option sounded like something they would be most comfortable with and they’d live with the overheads. I understand the feeling – an invisible column with an invisible index and extra foreign key constraint sounds too close to the boundary where mixing and matching simple features ends up hitting some unexpected behaviour (i.e. a bug).

Update 14th March 2019

A note on Twitter has reminded me that distributed transactions introduce another limitation. An append insert across a database link will work only if the target table is local; the first insert below will append, the second will not.


insert /*+ append */ into local_table select * from table@remote_location;

insert /*+ append */ into table@remote_location select * from local_table;

There are a couple more limitations given in the comments – disabling table locking, and adding a security policy (VPD / RLS / FGAC / OLS).

March 6, 2019

12c Snapshots

Filed under: 12c,Oracle,Partitioning,Performance — Jonathan Lewis @ 10:35 am GMT Mar 6,2019

I published a note a few years ago about using the 12c “with function” mechanism for writing simple SQL statements to take deltas of dynamic performance views. The example I supplied was for v$event_histogram but I’ve just been prompted by a question on ODC to supply a couple more – v$session_event and v$sesstat (joined to v$statname) so that you can use one session to get an idea of the work done and time spent by another session – the first script reports wait time:


rem
rem     Program:        12c_with_function_2.sql
rem     Dated:          July 2013
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session WAIT time
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 121, 127'

set timing on
set sqlterminator off

set linesize 180

break on sid skip 1

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        sum(total_waits),
        sum(total_timeouts), 
        sum(time_waited), 
        event
from    (
        select
                sid, event_id, 
                -total_waits total_waits, 
                -total_timeouts total_timeouts, 
                -time_waited time_waited, 
                -time_waited_micro time_waited_micro, 
                event
        from    v$session_event
        where   sid in ( &m_sid_list )
        union all
        select
                null, null, null, null, null, wait_row(&m_snap_time, 0), null
        from    dual
        union all
        select
                sid, event_id, total_waits, total_timeouts, time_waited, time_waited_micro, event
        from    v$session_event
        where   sid in ( &m_sid_list )
        )
where
        time_waited_micro != 0
group by
        sid, event_id, event
having
        sum(time_waited) != 0
order by
        sid, sum(time_waited) desc
/


And this one reports session activity:

rem
rem     Program:        12c_with_function_3.sql
rem     Dated:          July 2013
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session stats
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 13, 357'


set timing on
set sqlterminator off

set linesize 180

break on sid skip 1
column name format a64

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        name,
        sum(value)
from    (
        select
                ss.sid, 
                ss.statistic#,
                sn.name,
                -ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        union all
        select
                null, null, null, wait_row(&m_snap_time, 0)
        from    dual
        union all
        select
                ss.sid, ss.statistic#, sn.name, ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        )
where
        value != 0
group by
        sid, statistic#, name
having
        sum(value) != 0
order by
        sid, statistic#
/


You’ll notice that I’ve used dbms_lock.sleep() in my wait function – and the session running the SQL can be granted the execute privilege on the package through a role to make this work – but if you’re running Oracle 18 then you’ve probably noticed that the sleep() function and procedure have been copied to the dbms_session package.
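So if you’re on 18c you could remove the dependency on dbms_lock altogether – here’s a sketch of the modified function declaration (dbms_session.sleep() takes the same seconds parameter):

with
        function wait_row (
                i_secs  number,
                i_return        number
        ) return number
        is
        begin
                dbms_session.sleep(i_secs);     -- 18c copy of dbms_lock.sleep()
                return i_return;
        end;
select
        wait_row(1, 0)
from
        dual
/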

 

March 4, 2019

Cartesian Join

Filed under: Execution plans,Oracle — Jonathan Lewis @ 1:37 pm GMT Mar 4,2019

I wrote this note a little over 4 years ago (Jan 2015) but failed to publish it for some reason. I’ve just rediscovered it and it’s got a couple of details that are worth mentioning, so I’ve decided to go ahead and publish it now.

A recent [ed: 4 year old] question on the OTN SQL forum asked for help in “multiplying up” data – producing multiple rows from a single row source. This is something I’ve done fairly often when modelling a problem, for example by generating an orders table and then generating an order_lines table from the orders table, and there are a couple of traps to consider.

The problem the OP had was that their base data was the result set from a complex query – which ran “fine”, but took 10 minutes to complete when a Cartesian join to a table holding just three rows was included. Unfortunately the OP didn’t supply, or even comment on, the execution plans. The obvious guess, of course, is that the extra table resulted in a completely different execution plan rather than the expected “do the original query then multiply by 3” plan, in which case the solution to the problem is (probably) simple – stick the original query into a non-mergeable view before doing the join.

Assume we have the following tables, t1 has 100,000 rows (generated from the SQL in this article), t2 has 4 rows where column id2 has the values from 1 to 4, t3 is empty – we can model the basic requirement with the query shown below:


SQL> desc t1
 Name                    Null?    Type
 ----------------------- -------- ----------------
 ID                               NUMBER
 C1                               CHAR(2)
 C2                               CHAR(2)
 C3                               CHAR(2)
 C4                               CHAR(2)
 PADDING                          VARCHAR2(100)

SQL> desc t2
 Name                    Null?    Type
 ----------------------- -------- ----------------
 ID2                              NUMBER

SQL> desc t3
 Name                    Null?    Type
 ----------------------- -------- ----------------
 ID                               NUMBER
 ID2                              NUMBER
 C1                               CHAR(2)
 C2                               CHAR(2)
 C3                               CHAR(2)
 C4                               CHAR(2)
 PADDING                          VARCHAR2(100)


insert into t3
select
        t1.id, t2.id2, t1.c1, t1.c2, c3, t1.c4, t1.padding
from
       (select * from t1) t1,
        t2
;

If we “lose” the plan for the original “select * from t1” (assuming t1 was really a complicated view) when we extend to the Cartesian join all we need to do is the following:


insert into t3
select
        /*+ leading(t1 t2) */
        t1.id, t2.id2, t1.c1, t1.c2, c3, t1.c4, t1.padding
from
        (select /*+ no_merge */ * from t1) t1,
        t2
;

This is where the problem starts to get a little interesting. The /*+ no_merge */ hint is (usually) a winner in situations like this – but why have I included a /*+ leading() */ hint choosing to put t2 (the small table) second in the join order? It’s because of the way that Cartesian Merge Joins work, combined with an eye to where my most important resource bottleneck is likely to be. Here’s the execution plan taken from memory after executing this statement with statistics_level set to all. (11.2.0.4):


SQL_ID  azu8ntfjg9pwj, child number 0
-------------------------------------
insert into t3 select   /*+ leading(t1 t2) */  t1.id, t2.id2, t1.c1,
t1.c2, c3, t1.c4, t1.padding from  (select /*+ no_merge */ * from t1)
t1,   t2

Plan hash value: 1055157753

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |      1 |        |      0 |00:00:10.28 |   48255 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL |      |      1 |        |      0 |00:00:10.28 |   48255 |       |       |          |
|   2 |   MERGE JOIN CARTESIAN   |      |      1 |    400K|    400K|00:00:06.30 |    1727 |       |       |          |
|   3 |    VIEW                  |      |      1 |    100K|    100K|00:00:00.94 |    1725 |       |       |          |
|   4 |     TABLE ACCESS FULL    | T1   |      1 |    100K|    100K|00:00:00.38 |    1725 |       |       |          |
|   5 |    BUFFER SORT           |      |    100K|      4 |    400K|00:00:01.78 |       2 |  3072 |  3072 | 2048  (0)|
|   6 |     TABLE ACCESS FULL    | T2   |      1 |      4 |      4 |00:00:00.01 |       2 |       |       |          |
----------------------------------------------------------------------------------------------------------------------

Let’s try that again (building from scratch, of course) with the table order reversed in the leading() hint:


SQL_ID  52qaydutssvn5, child number 0
-------------------------------------
insert into t3 select   /*+ leading(t2 t1) */  t1.id, t2.id2, t1.c1,
t1.c2, c3, t1.c4, t1.padding from  (select /*+ no_merge */ * from t1)
t1,   t2

Plan hash value: 2126214450

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |      1 |        |      0 |00:00:12.29 |   48311 |   6352 |   1588 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL |      |      1 |        |      0 |00:00:12.29 |   48311 |   6352 |   1588 |       |       |          |
|   2 |   MERGE JOIN CARTESIAN   |      |      1 |    400K|    400K|00:00:06.64 |    1729 |   6352 |   1588 |       |       |          |
|   3 |    TABLE ACCESS FULL     | T2   |      1 |      4 |      4 |00:00:00.01 |       2 |      0 |      0 |       |       |          |
|   4 |    BUFFER SORT           |      |      4 |    100K|    400K|00:00:04.45 |    1727 |   6352 |   1588 |    13M|  1416K| 9244K (0)|
|   5 |     VIEW                 |      |      1 |    100K|    100K|00:00:00.80 |    1725 |      0 |      0 |       |       |          |
|   6 |      TABLE ACCESS FULL   | T1   |      1 |    100K|    100K|00:00:00.28 |    1725 |      0 |      0 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

There’s one important detail that’s not explicit in the execution plans – I’ve set the workarea_size_policy to manual and the sort_area_size to 10MB to demonstrate the impact of having a dataset that is too large for the session’s workarea limit.

The results, in terms of timing, are border-line. With the “correct” choice of order the completion time is 10.28 seconds compared to 12.29 seconds, though if you look at the time for the Merge Join Cartesian operation the difference is much less significant. The critical point, though, appears at operation 4 – the Buffer Sort. I set my sort_area_size to something that I knew was smaller than the data set I needed to buffer – so the operation had to spill to disc. Ignoring overheads and rounding errors the data from the 1,727 blocks I read from the table at pctfree = 10 were dumped to the temporary space in 1,588 packed blocks (sanity check: 1,727 * 0.9 = 1,554); and then those blocks were read back once for each row from the driving t2 table (sanity check: 1,588 * 4 = 6,352).

With my setup I had a choice of bottlenecks:  scan a very small data set in memory 100,000 times to burn CPU, or scan a large data set from disc 4 times. There wasn’t much difference in my case: but the difference could be significant on a full-scale production system.  By default the optimizer happened to pick the “wrong” path with my data sets.

But there’s something even more important than this difference in resource usage to generate the data: what does the data look like after it’s been generated? Here’s a simple query to show you the first few rows of the stored result sets in the two different tests:


SQL> select id, id2, c1, c2, c3, c4 from t3 where rownum <= 8;

Data from leading (t1 t2)
=========================
        ID        ID2 C1 C2 C3 C4
---------- ---------- -- -- -- --
         1          1 BV GF JB LY
         1          2 BV GF JB LY
         1          3 BV GF JB LY
         1          4 BV GF JB LY
         2          1 YV LH MT VM
         2          2 YV LH MT VM
         2          3 YV LH MT VM
         2          4 YV LH MT VM


Data from leading (t2 t1)
=========================
        ID        ID2 C1 C2 C3 C4
---------- ---------- -- -- -- --
         1          1 BV GF JB LY
         2          1 YV LH MT VM
         3          1 IE YE TS DP
         4          1 DA JY GS AW
         5          1 ZA DC KD CF
         6          1 VJ JI TC RI
         7          1 DN RY KC BE
         8          1 WP EQ UM VY

If we had been using code like this to generate an order_lines table from an orders table, with leading(orders t2) we would have “order lines” clustered around the “order number” – which is a realistic model; when we have leading(t2 orders) the clustering disappears (technically the order numbers are clustered around the order lines). It’s this aspect of the data that might have a much more important impact on the suitability (and timing) of any testing you may be doing than a little time lost or gained in generating the raw data.

Footnote

If you try to repeat this test on your own system don’t expect my timing to match yours. Bear in mind, also, that with statistics_level set to all there’s going to be a CPU overhead that varies between the two options for the leading() hint – the CPU usage on rowsource execution stats could be much higher for the case where one of the operations starts 100,000 times.

 

March 1, 2019

Hash Optimisation-

Filed under: Oracle,Parallel Execution,Performance,trace files — Jonathan Lewis @ 4:39 pm GMT Mar 1,2019

Franck Pachot did an interesting presentation at the OBUG (Belgium user group) Tech Days showing how to use one of the O/S debug/trace tools to step through the function calls that Oracle made during different types of joins. This prompted me to ask him a question about a possible optimisation of hash joins as follows:

The hash join operation creates an in-memory hash table from the rowsource produced by its first child operation then probes the hash table with rows from the row source produced by the second child operation; but if there are no rows in the first row source then there’s no need to acquire rows from the second row source, so Oracle doesn’t call the second child operation.

Does the hash join adopt a “symmetrical” strategy – starting the second child operation prematurely to find the first couple of rows before executing the first child operation and building the in-memory hash table ?

Franck got back to me before the end of the event with the answer – which was “no”. (This makes sense, of course, because if there’s no data in the second rowsource – which is supposed to be the “big” rowsource – you’ve probably done a lot of work which you may not have needed to do if you’d checked the first rowsource properly first.) (Thought – might Oracle look for the first row from the first row source, then check the second row source for a first row, then finish the first rowsource and build? How difficult do you want to make your life?)

So Franck got me an answer by looking at function calls – but I’d only asked him because I thought he might have done the experiment already and might be able to give me an answer immediately rather than having to waste his valuable Belgian Beer Drinking time to do some tests for me. (And I’ve never got around to playing with strace, ptrace, dtrace, truss, et al. so I didn’t fancy doing the tracing for myself, of course). But this particular question doesn’t need any examination of function calls to get an answer: an extended trace file would be sufficient. Here’s how to start:


rem
rem     Script:         hash_join_opt.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2019
rem
rem     Last tested
rem             11.2.0.4
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum,20)                  n20,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

create table t2
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum,10)                  n10,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1'
        );
end;
/

set serveroutput off

alter system flush buffer_cache;
execute dbms_lock.sleep(2)

alter session set statistics_level = all;
alter session set events '10046 trace name context forever, level 8';

select
        t1.v1, t2.v1
from
        t1, t2
where
        t1.n20 = 19
and     t2.id = t1.id
and     t2.n10 = 7.5
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));
alter session set events '10046 trace name context off';
alter session set statistics_level = typical;

I’ve created a pair of tables with similar volumes of data and then executed a query that should do a hash join between the tables. The t2 table will return no rows since the predicate on n10 is asking for a value that doesn’t exist but the optimizer will think it will supply 10% of the table since I’ve avoided creating a histogram on the column and the num_distinct for the n10 column is 10. You’ll notice that I’ve flushed the buffer_cache (with a couple of seconds pause, just in case) before executing the query. Here’s the execution plan with rowsource execution stats:


SQL_ID  bscu1r7cu36ur, child number 0
-------------------------------------
select  t1.v1, t2.v1 from  t1, t2 where  t1.n20 = 19 and t2.id = t1.id
and t2.n10 = 7.5

Plan hash value: 1838229974

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      0 |00:00:00.04 |     352 |    348 |       |       |          |
|*  1 |  HASH JOIN         |      |      1 |    500 |      0 |00:00:00.04 |     352 |    348 |  1519K|  1519K| 1487K (0)|
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    500 |    500 |00:00:00.03 |     176 |    174 |       |       |          |
|*  3 |   TABLE ACCESS FULL| T2   |      1 |   1000 |      0 |00:00:00.01 |     176 |    174 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."ID"="T1"."ID")
   2 - filter("T1"."N20"=19)
   3 - filter("T2"."N10"=7.5)

As you can see, the tablescan of t2 returned no rows so the hash join returned no rows, but the stats show 174 blocks read from each of t1 and t2. The fact that we read 174 blocks from t2 doesn’t mean we didn’t do a pre-emptive check, of course, as those reads could (by the hypothesis) have been the reads required by the pre-emptive check. The fact that we did a complete scan of both tables, though, tells us that the reads on t2 (probably) weren’t from a pre-emptive check (if they had been then Oracle shouldn’t have scanned the whole of t1).

We can get extra corroborative evidence though when we check the trace file – which shows us the following waits:

PARSING IN CURSOR #139927916187824 len=88 dep=0 uid=62 oct=3 lid=62 tim=1551445144496821 hv=3651246935 ad='97353e10' sqlid='bscu1r7cu36ur'
select
        t1.v1, t2.v1
from
        t1, t2
where
        t1.n20 = 19
and     t2.id = t1.id
and     t2.n10 = 7.5
END OF STMT

PARSE #139927916187824:c=3000,e=2760,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=1838229974,tim=1551445144496811
EXEC #139927916187824:c=0,e=78,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1838229974,tim=1551445144497155
WAIT #139927916187824: nam='SQL*Net message to client' ela= 8 driver id=1650815232 #bytes=1 p3=0 obj#=351199 tim=1551445144497224

WAIT #139927916187824: nam='db file sequential read' ela= 634 file#=5 block#=130 blocks=1 obj#=351198 tim=1551445144498386
WAIT #139927916187824: nam='db file scattered read' ela= 397 file#=5 block#=131 blocks=5 obj#=351198 tim=1551445144498971
WAIT #139927916187824: nam='db file scattered read' ela= 1752 file#=5 block#=136 blocks=8 obj#=351198 tim=1551445144502411
WAIT #139927916187824: nam='db file scattered read' ela= 426 file#=5 block#=145 blocks=7 obj#=351198 tim=1551445144504056
WAIT #139927916187824: nam='db file scattered read' ela= 590 file#=5 block#=152 blocks=8 obj#=351198 tim=1551445144505655
WAIT #139927916187824: nam='db file scattered read' ela= 8063 file#=5 block#=161 blocks=7 obj#=351198 tim=1551445144514574
WAIT #139927916187824: nam='db file scattered read' ela= 402 file#=5 block#=168 blocks=8 obj#=351198 tim=1551445144515741
WAIT #139927916187824: nam='db file scattered read' ela= 803 file#=5 block#=177 blocks=7 obj#=351198 tim=1551445144517495
WAIT #139927916187824: nam='db file scattered read' ela= 423 file#=5 block#=184 blocks=8 obj#=351198 tim=1551445144518743
WAIT #139927916187824: nam='db file scattered read' ela= 422 file#=5 block#=193 blocks=7 obj#=351198 tim=1551445144520696
WAIT #139927916187824: nam='db file scattered read' ela= 266 file#=5 block#=200 blocks=8 obj#=351198 tim=1551445144521460
WAIT #139927916187824: nam='db file scattered read' ela= 307 file#=5 block#=209 blocks=7 obj#=351198 tim=1551445144522249
WAIT #139927916187824: nam='db file scattered read' ela= 313 file#=5 block#=216 blocks=8 obj#=351198 tim=1551445144523166
WAIT #139927916187824: nam='db file scattered read' ela= 247 file#=5 block#=225 blocks=7 obj#=351198 tim=1551445144523927
WAIT #139927916187824: nam='db file scattered read' ela= 315 file#=5 block#=232 blocks=8 obj#=351198 tim=1551445144524646
WAIT #139927916187824: nam='db file scattered read' ela= 326 file#=5 block#=241 blocks=7 obj#=351198 tim=1551445144525565
WAIT #139927916187824: nam='db file scattered read' ela= 186 file#=5 block#=248 blocks=8 obj#=351198 tim=1551445144526255
WAIT #139927916187824: nam='db file scattered read' ela= 341 file#=5 block#=258 blocks=55 obj#=351198 tim=1551445144527363
----------------
WAIT #139927916187824: nam='db file sequential read' ela= 201 file#=5 block#=386 blocks=1 obj#=351199 tim=1551445144531165
WAIT #139927916187824: nam='db file scattered read' ela= 221 file#=5 block#=387 blocks=5 obj#=351199 tim=1551445144531492
WAIT #139927916187824: nam='db file scattered read' ela= 194 file#=5 block#=392 blocks=8 obj#=351199 tim=1551445144531750
WAIT #139927916187824: nam='db file scattered read' ela= 301 file#=5 block#=401 blocks=7 obj#=351199 tim=1551445144532271
WAIT #139927916187824: nam='db file scattered read' ela= 272 file#=5 block#=408 blocks=8 obj#=351199 tim=1551445144532756
WAIT #139927916187824: nam='db file scattered read' ela= 258 file#=5 block#=417 blocks=7 obj#=351199 tim=1551445144533218
WAIT #139927916187824: nam='db file scattered read' ela= 242 file#=5 block#=424 blocks=8 obj#=351199 tim=1551445144533704
WAIT #139927916187824: nam='db file scattered read' ela= 232 file#=5 block#=433 blocks=7 obj#=351199 tim=1551445144534125
WAIT #139927916187824: nam='db file scattered read' ela= 213 file#=5 block#=440 blocks=8 obj#=351199 tim=1551445144534506
WAIT #139927916187824: nam='db file scattered read' ela= 241 file#=5 block#=449 blocks=7 obj#=351199 tim=1551445144534914
WAIT #139927916187824: nam='db file scattered read' ela= 221 file#=5 block#=456 blocks=8 obj#=351199 tim=1551445144535258
WAIT #139927916187824: nam='db file scattered read' ela= 311 file#=5 block#=465 blocks=7 obj#=351199 tim=1551445144536518
WAIT #139927916187824: nam='db file scattered read' ela= 215 file#=5 block#=472 blocks=8 obj#=351199 tim=1551445144536906
WAIT #139927916187824: nam='db file scattered read' ela= 164 file#=5 block#=481 blocks=7 obj#=351199 tim=1551445144537343
WAIT #139927916187824: nam='db file scattered read' ela= 160 file#=5 block#=488 blocks=8 obj#=351199 tim=1551445144537624
WAIT #139927916187824: nam='db file scattered read' ela= 253 file#=5 block#=497 blocks=7 obj#=351199 tim=1551445144538041
WAIT #139927916187824: nam='db file scattered read' ela= 252 file#=5 block#=504 blocks=8 obj#=351199 tim=1551445144538456
WAIT #139927916187824: nam='db file scattered read' ela= 947 file#=5 block#=514 blocks=55 obj#=351199 tim=1551445144539690

FETCH #139927916187824:c=25996,e=43227,p=348,cr=352,cu=0,mis=0,r=0,dep=0,og=1,plh=1838229974,tim=1551445144540502

I’ve inserted a few gaps into this section of the trace file but haven’t deleted any lines. As you can see there’s a series of reads for obj# 351198 (table t1), followed by a series of reads of obj# 351199. We don’t do anything subtle like peeking at t1, switching to peek at t2, then continuing with t1: it’s a simple end to end brute force scan of each table in turn.
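(If you want to confirm which obj# is which table you can, of course, just query the data dictionary – the object ids below are the ones from my trace file, so substitute your own:)


select  object_id, object_name
from    user_objects
where   object_id in (351198, 351199)
order by
        object_id
;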

But there’s more …

Before I created the test above I decided to check whether I had already done the test some time in the past and written about it. A search through my blog didn’t turn up any notes about this question, but it did uncover the following footnote to an earlier posting about hash joins:

taking a closer look at the set of trace files generated in the broadcast test I discovered that the first set of slaves start their parallel tablescan of t1 but stop after just one read from each slave, then the second set of slaves scans and builds the hash table from t2 before calling for further data from t1.

Naturally I included a link to the footnote in the draft notes for this blog entry – and then promptly forgot about writing the note until a couple of days ago, when Chinar Aliyev produced a note on the hash join buffered operation that included an execution plan similar to the broadcast plan from my earlier blog note. So I thought I’d take another look at it, because it looks as if some versions of the parallel hash join can do exactly the pre-emptive test that the serial execution plan doesn’t.

So, here’s a query to run against the same data set, and the resulting execution plan – pulled from memory after executing the query:


select
        /*+
                parallel(2)
                leading(t1 t2)
                use_hash(t2)
                pq_distribute(t2 none broadcast)
        */
        t1.v1, t2.v1
--      t1.v1, count(*)
from
        t1, t2
where
        t1.n20 = 19
and     t2.id = t1.id
and     t2.n10 = 7
;

----------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Rows |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
----------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |        |        |      |            |      0 |00:00:00.22 |      10 |      2 |       |       |          |
|   1 |  PX COORDINATOR          |          |      1 |        |        |      |            |      0 |00:00:00.22 |      10 |      2 |       |       |          |
|   2 |   PX SEND QC (RANDOM)    | :TQ10001 |      0 |    500 |  Q1,01 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN             |          |      1 |    500 |  Q1,01 | PCWP |            |      0 |00:00:00.12 |     128 |    173 |  1483K|  1483K|     2/0/0|
|   4 |     PX BLOCK ITERATOR    |          |      2 |    500 |  Q1,01 | PCWC |            |    500 |00:00:00.03 |     248 |    173 |       |       |          |
|*  5 |      TABLE ACCESS FULL   | T1       |     25 |    500 |  Q1,01 | PCWP |            |    500 |00:00:00.02 |     248 |    173 |       |       |          |
|   6 |     BUFFER SORT          |          |      2 |        |  Q1,01 | PCWC |            |   2000 |00:00:00.07 |       0 |      0 |   108K|   108K|     2/0/0|
|   7 |      PX RECEIVE          |          |      2 |   1000 |  Q1,01 | PCWP |            |   2000 |00:00:00.06 |       0 |      0 |       |       |          |
|   8 |       PX SEND BROADCAST  | :TQ10000 |      0 |   1000 |  Q1,00 | P->P | BROADCAST  |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   9 |        PX BLOCK ITERATOR |          |      2 |   1000 |  Q1,00 | PCWC |            |   1000 |00:00:00.05 |     248 |    173 |       |       |          |
|* 10 |         TABLE ACCESS FULL| T2       |     25 |   1000 |  Q1,00 | PCWP |            |   1000 |00:00:00.04 |     248 |    173 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------

Now the rule for reading a DFO tree in a parallel plan is to follow the table queues (a.k.a. DFO numbers / TQ numbers) – but in this case we have a bit of a mixture as the table scan of t1 isn’t used as the input to a distribution so it’s hard to know whether the first physical activity is supposed to be the tablescan of t2 (feeding the zeroth table queue) or the tablescan of t1 which doesn’t feed a table queue until after the hash join.

Given that t2 is being broadcast we should probably expect to see the first set of parallel query slaves starting first with a scan of “random” sections of t1 to build an in-memory hash table, and then see the second set of parallel query slaves scanning t2 and broadcasting the results to the first set of slaves (every slave gets a copy of every row) to do the probe. We can try to confirm this by looking at the trace files. First we extract “db file scattered read” lines from the four trace files and then we sort them by timestamp.

test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 677 file#=5 block#=220 blocks=4 obj#=351218 tim=1551450113700076
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 214 file#=5 block#=225 blocks=3 obj#=351218 tim=1551450113701131
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 238 file#=5 block#=161 blocks=7 obj#=351218 tim=1551450113702026
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 318 file#=5 block#=177 blocks=6 obj#=351218 tim=1551450113703464
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 371 file#=5 block#=190 blocks=2 obj#=351218 tim=1551450113705168
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 227 file#=5 block#=193 blocks=5 obj#=351218 tim=1551450113705690
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 225 file#=5 block#=138 blocks=6 obj#=351218 tim=1551450113706850
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 487 file#=5 block#=131 blocks=5 obj#=351218 tim=1551450113708512
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 228 file#=5 block#=136 blocks=2 obj#=351218 tim=1551450113709161
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 219 file#=5 block#=294 blocks=7 obj#=351218 tim=1551450113710099
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 150 file#=5 block#=184 blocks=6 obj#=351218 tim=1551450113711398
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 229 file#=5 block#=273 blocks=7 obj#=351218 tim=1551450113712456
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 223 file#=5 block#=280 blocks=7 obj#=351218 tim=1551450113713829
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 253 file#=5 block#=168 blocks=7 obj#=351218 tim=1551450113715124
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 1014 file#=5 block#=308 blocks=5 obj#=351218 tim=1551450113717046
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 266 file#=5 block#=235 blocks=5 obj#=351218 tim=1551450113718684
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 208 file#=5 block#=241 blocks=2 obj#=351218 tim=1551450113719321
----
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 326 file#=5 block#=287 blocks=7 obj#=351218 tim=1551450113700003
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 147 file#=5 block#=198 blocks=2 obj#=351218 tim=1551450113701394
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 72 file#=5 block#=200 blocks=5 obj#=351218 tim=1551450113701840
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 459 file#=5 block#=153 blocks=7 obj#=351218 tim=1551450113703442
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 308 file#=5 block#=259 blocks=7 obj#=351218 tim=1551450113705132
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 313 file#=5 block#=266 blocks=7 obj#=351218 tim=1551450113706540
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 304 file#=5 block#=243 blocks=5 obj#=351218 tim=1551450113707925
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 219 file#=5 block#=248 blocks=2 obj#=351218 tim=1551450113708505
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 590 file#=5 block#=228 blocks=4 obj#=351218 tim=1551450113709705
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 178 file#=5 block#=232 blocks=3 obj#=351218 tim=1551450113710270
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 142 file#=5 block#=205 blocks=3 obj#=351218 tim=1551450113711046
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 98 file#=5 block#=209 blocks=4 obj#=351218 tim=1551450113711457
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 179 file#=5 block#=213 blocks=3 obj#=351218 tim=1551450113712308
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 296 file#=5 block#=216 blocks=4 obj#=351218 tim=1551450113712967
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 188 file#=5 block#=301 blocks=7 obj#=351218 tim=1551450113714238
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 207 file#=5 block#=250 blocks=6 obj#=351218 tim=1551450113715393
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 567 file#=5 block#=146 blocks=6 obj#=351218 tim=1551450113717686
----
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 260 file#=5 block#=529 blocks=7 obj#=351219 tim=1551450113698677
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 196 file#=5 block#=522 blocks=7 obj#=351219 tim=1551450113722013
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 145 file#=5 block#=564 blocks=5 obj#=351219 tim=1551450113723239
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 317 file#=5 block#=491 blocks=5 obj#=351219 tim=1551450113724578
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 128 file#=5 block#=497 blocks=2 obj#=351219 tim=1551450113725217
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 495 file#=5 block#=402 blocks=6 obj#=351219 tim=1551450113726250
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 168 file#=5 block#=394 blocks=6 obj#=351219 tim=1551450113727928
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 249 file#=5 block#=424 blocks=7 obj#=351219 tim=1551450113730384
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 120 file#=5 block#=476 blocks=4 obj#=351219 tim=1551450113731225
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 485 file#=5 block#=481 blocks=3 obj#=351219 tim=1551450113732064
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 594 file#=5 block#=387 blocks=5 obj#=351219 tim=1551450113734168
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 323 file#=5 block#=392 blocks=2 obj#=351219 tim=1551450113735040
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 121 file#=5 block#=506 blocks=6 obj#=351219 tim=1551450113736253
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 285 file#=5 block#=440 blocks=6 obj#=351219 tim=1551450113739084
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 345 file#=5 block#=433 blocks=6 obj#=351219 tim=1551450113740940
----
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 261 file#=5 block#=550 blocks=7 obj#=351219 tim=1551450113698687
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 155 file#=5 block#=484 blocks=4 obj#=351219 tim=1551450113721738
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 131 file#=5 block#=488 blocks=3 obj#=351219 tim=1551450113722260
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 143 file#=5 block#=557 blocks=7 obj#=351219 tim=1551450113723245
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 138 file#=5 block#=454 blocks=2 obj#=351219 tim=1551450113724545
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 124 file#=5 block#=456 blocks=5 obj#=351219 tim=1551450113724954
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 243 file#=5 block#=469 blocks=3 obj#=351219 tim=1551450113725947
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 523 file#=5 block#=472 blocks=4 obj#=351219 tim=1551450113726745
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 178 file#=5 block#=409 blocks=7 obj#=351219 tim=1551450113727781
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 230 file#=5 block#=543 blocks=7 obj#=351219 tim=1551450113729470
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 488 file#=5 block#=515 blocks=7 obj#=351219 tim=1551450113731618
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 356 file#=5 block#=461 blocks=3 obj#=351219 tim=1551450113733840
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 185 file#=5 block#=465 blocks=4 obj#=351219 tim=1551450113735195
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 284 file#=5 block#=536 blocks=7 obj#=351219 tim=1551450113736172
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 204 file#=5 block#=499 blocks=5 obj#=351219 tim=1551450113737957
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 306 file#=5 block#=504 blocks=2 obj#=351219 tim=1551450113738871
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 281 file#=5 block#=446 blocks=2 obj#=351219 tim=1551450113739710
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 158 file#=5 block#=449 blocks=5 obj#=351219 tim=1551450113740159
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 99 file#=5 block#=417 blocks=7 obj#=351219 tim=1551450113741709


tim=1551450113698677 obj#=351219
tim=1551450113698687 obj#=351219
----
tim=1551450113700003 obj#=351218
tim=1551450113700076 obj#=351218
tim=1551450113701131 obj#=351218
tim=1551450113701394 obj#=351218
tim=1551450113701840 obj#=351218
tim=1551450113702026 obj#=351218
tim=1551450113703442 obj#=351218
tim=1551450113703464 obj#=351218
tim=1551450113705132 obj#=351218
tim=1551450113705168 obj#=351218
tim=1551450113705690 obj#=351218
tim=1551450113706540 obj#=351218
tim=1551450113706850 obj#=351218
tim=1551450113707925 obj#=351218
tim=1551450113708505 obj#=351218
tim=1551450113708512 obj#=351218
tim=1551450113709161 obj#=351218
tim=1551450113709705 obj#=351218
tim=1551450113710099 obj#=351218
tim=1551450113710270 obj#=351218
tim=1551450113711046 obj#=351218
tim=1551450113711398 obj#=351218
tim=1551450113711457 obj#=351218
tim=1551450113712308 obj#=351218
tim=1551450113712456 obj#=351218
tim=1551450113712967 obj#=351218
tim=1551450113713829 obj#=351218
tim=1551450113714238 obj#=351218
tim=1551450113715124 obj#=351218
tim=1551450113715393 obj#=351218
tim=1551450113717046 obj#=351218
tim=1551450113717686 obj#=351218
tim=1551450113718684 obj#=351218
tim=1551450113719321 obj#=351218
--
tim=1551450113721738 obj#=351219
tim=1551450113722013 obj#=351219
tim=1551450113722260 obj#=351219
tim=1551450113723239 obj#=351219
tim=1551450113723245 obj#=351219
tim=1551450113724545 obj#=351219
tim=1551450113724578 obj#=351219
tim=1551450113724954 obj#=351219
tim=1551450113725217 obj#=351219
tim=1551450113725947 obj#=351219
tim=1551450113726250 obj#=351219
tim=1551450113726745 obj#=351219
tim=1551450113727781 obj#=351219
tim=1551450113727928 obj#=351219
tim=1551450113729470 obj#=351219
tim=1551450113730384 obj#=351219
tim=1551450113731225 obj#=351219
tim=1551450113731618 obj#=351219
tim=1551450113732064 obj#=351219
tim=1551450113733840 obj#=351219
tim=1551450113734168 obj#=351219
tim=1551450113735040 obj#=351219
tim=1551450113735195 obj#=351219
tim=1551450113736172 obj#=351219
tim=1551450113736253 obj#=351219
tim=1551450113737957 obj#=351219
tim=1551450113738871 obj#=351219
tim=1551450113739084 obj#=351219
tim=1551450113739710 obj#=351219
tim=1551450113740159 obj#=351219
tim=1551450113740940 obj#=351219
tim=1551450113741709 obj#=351219

I’ve got four slave processes, p000 through to p003, and in the top part of the output you can see that p000/p001 scan object 351218 (after several experiments the object_id for t1 has climbed a bit), while p002/p003 scan object 351219 (t2). Sorting by the timestamps (tim=), though, we can see that there are two reads of t2 (351219) followed by many reads of t1 (351218), finishing with a load of reads of t2. Cross-checking the timestamps on the sorted output with the original extract we can see that the two initial reads of t2 are the first reads by p002 and p003 respectively.

tim=1551450113698677 obj#=351219
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 260 file#=5 block#=529 blocks=7 obj#=351219 tim=1551450113698677

tim=1551450113698687 obj#=351219
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 261 file#=5 block#=550 blocks=7 obj#=351219 tim=1551450113698687

It really does look as if at run time Oracle does the scan and build for t1 first but starts with a quick peek at t2 to check that it’s going to return some data. We can be a little more fussy with this test – let’s change some of the data in t2 so that Oracle doesn’t find any rows in the first two scattered reads and see if anything changes.

This is going to take a little fiddling, but in my case I can see that the first reads by slaves p002/p003 are looking at file 5, blocks 529 – 535 and blocks 550 – 556; and broadly speaking the scattered reads are working backwards down the table. So before the next run of the test I’m going to update a load of data that appears in the second half of the table, using a simple command to put the n10 column value out of range:


update t2 set n10 = 0 where n10 = 7 and id > 750;
commit;

-- test the query

update t2 set n10 = 0 where n10 = 7;
commit;

-- test the query

After the first update the pattern of reads (in time order) changed to:

  • 22 scattered reads of t2
  • 34 scattered reads of t1
  • 12 scattered reads of t2

This really does look like Oracle checking whether t2 will have any data before doing the full scan and build from t1.

Unfortunately when I did the final update that eliminated all the interesting data from t2 the pattern (in time order) turned into:

  • 34 scattered reads of t2
  • 34 scattered reads of t1

In other words – having discovered NO data in t2 that could join, Oracle still scanned and built the hash table from t1. So maybe there’s a reason other than a sanity check for peeking at the second table, or maybe there’s a bug in the extreme case where the second table produces no data at all. (For reference you might also want to see an older note on how a parallel hash join doesn’t skip the probe table when the build table is empty.)

Just as a quick check that I wasn’t chasing an old bug at this point I repeated the tests on 18.3.0.0 – the same patterns emerged.

 

January 31, 2019

Descending Problem

Filed under: Execution plans,Indexing,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 3:34 pm GMT Jan 31,2019

I’ve written in the past about oddities with descending indexes (here, here, and here, for example) but I’ve just come across a case where I may have to introduce a descending index that really shouldn’t need to exist. As so often happens it’s at the boundary where two Oracle features collide. I have a table that handles data for a large number of customers, who record a reasonable number of transactions per year, and I have a query that displays the most recent transactions for a customer. Conveniently the table is partitioned by hash on the customer ID, and I have an index that starts with the customer_id and transaction_date columns. So here’s my query or, to be a little more accurate, the client’s query – simplified and camouflaged:


select  /*+ gather_plan_statistics */
        *
from    (
             select
                    v1.*,
                    rownum rn
             from   (
                             select   /*
                                         no_eliminate_oby
                                         index_rs_desc(t1 (customer_id, transaction_date))
                                      */
                                      t1.*
                             from     t1
                             where    customer_id = 50
                             and      transaction_date >= to_date('1900-01-01','yyyy-mm-dd')
                             order by transaction_date DESC
                ) v1
                where  rownum <= 10 -- > comment to avoid WordPress format issue
         )
where    rn >= 1
;

You’ll notice some hinting – the /*+ gather_plan_statistics */ will allow me to report the rowsource execution stats when I pull the plan from memory, and the hints in the inline view (which I’ve commented out in the above) will force a particular execution plan – walking through the index on (customer_id, transaction_date) in descending order.

If I create t1 as a simple (non-partitioned) heap table I get the following plan unhinted (I’ve had to edit a “less than or equal to” symbol to avoid a WordPress format issue):

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |       |      1 |        |    14 (100)|     10 |00:00:00.01 |      14 |
|*  1 |  VIEW                           |       |      1 |     10 |    14   (0)|     10 |00:00:00.01 |      14 |
|*  2 |   COUNT STOPKEY                 |       |      1 |        |            |     10 |00:00:00.01 |      14 |
|   3 |    VIEW                         |       |      1 |     10 |    14   (0)|     10 |00:00:00.01 |      14 |
|   4 |     TABLE ACCESS BY INDEX ROWID | T1    |      1 |    340 |    14   (0)|     10 |00:00:00.01 |      14 |
|*  5 |      INDEX RANGE SCAN DESCENDING| T1_I1 |      1 |     10 |     3   (0)|     10 |00:00:00.01 |       4 |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("RN">=1)
   2 - filter(ROWNUM .LE. 10)
   5 - access("CUSTOMER_ID"=50 AND "TRANSACTION_DATE" IS NOT NULL AND "TRANSACTION_DATE">=TO_DATE('
              1900-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))


Notice the descending range scan of the index – just as I wanted it – the minimal number of buffer visits, and only 10 rows (and rowids) examined from the table. But what happens if I recreate t1 as a hash-partitioned table with local index – here’s the new plan, again without hinting the SQL:


----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                      | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                               |       |      1 |        |   207 (100)|     10 |00:00:00.01 |     138 |       |       |          |
|*  1 |  VIEW                                          |       |      1 |     10 |   207   (1)|     10 |00:00:00.01 |     138 |       |       |          |
|*  2 |   COUNT STOPKEY                                |       |      1 |        |            |     10 |00:00:00.01 |     138 |       |       |          |
|   3 |    VIEW                                        |       |      1 |    340 |   207   (1)|     10 |00:00:00.01 |     138 |       |       |          |
|*  4 |     SORT ORDER BY STOPKEY                      |       |      1 |    340 |   207   (1)|     10 |00:00:00.01 |     138 |  2048 |  2048 | 2048  (0)|
|   5 |      PARTITION HASH SINGLE                     |       |      1 |    340 |   206   (0)|    340 |00:00:00.01 |     138 |       |       |          |
|   6 |       TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1    |      1 |    340 |   206   (0)|    340 |00:00:00.01 |     138 |       |       |          |
|*  7 |        INDEX RANGE SCAN                        | T1_I1 |      1 |    340 |     4   (0)|    340 |00:00:00.01 |       3 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("RN">=1)
   2 - filter(ROWNUM .LE. 10)
   4 - filter(ROWNUM .LE. 10)
   7 - access("CUSTOMER_ID"=50 AND "TRANSACTION_DATE">=TO_DATE(' 1900-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "TRANSACTION_DATE" IS NOT NULL)

Even though the optimizer has recognised that it will be visiting a single partition through a local index it has not chosen a descending index range scan, though it has used the appropriate index; so it’s fetched all the relevant rows from the table in the wrong order then sorted them, discarding all but the top 10. We’ve done 138 buffer visits (which would turn into disk I/Os, and far more of them, in the production system).

Does this mean that the optimizer can’t use the descending index when the table is partitioned – or that somehow the costing has gone wrong? Here’s the plan with the hints in place to see what happens when we demand a descending range scan:
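For reference, this is what the inner block of the query looks like with the hints re-enabled – it’s simply the commented text from the original statement turned back into a hint, with the rest of the query unchanged:


select  /*+
                no_eliminate_oby
                index_rs_desc(t1 (customer_id, transaction_date))
        */
        t1.*
from    t1
where   customer_id = 50
and     transaction_date >= to_date('1900-01-01','yyyy-mm-dd')
order by
        transaction_date DESC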


----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |       |      1 |        |   207 (100)|     10 |00:00:00.01 |       8 |
|*  1 |  VIEW                                 |       |      1 |     10 |   207   (1)|     10 |00:00:00.01 |       8 |
|*  2 |   COUNT STOPKEY                       |       |      1 |        |            |     10 |00:00:00.01 |       8 |
|   3 |    VIEW                               |       |      1 |    340 |   207   (1)|     10 |00:00:00.01 |       8 |
|   4 |     PARTITION HASH SINGLE             |       |      1 |    340 |   206   (0)|     10 |00:00:00.01 |       8 |
|   5 |      TABLE ACCESS BY LOCAL INDEX ROWID| T1    |      1 |    340 |   206   (0)|     10 |00:00:00.01 |       8 |
|*  6 |       INDEX RANGE SCAN DESCENDING     | T1_I1 |      1 |    340 |     4   (0)|     16 |00:00:00.01 |       3 |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("RN">=1)
   2 - filter(ROWNUM .LE. 10)
   6 - access("CUSTOMER_ID"=50 AND "TRANSACTION_DATE" IS NOT NULL AND "TRANSACTION_DATE">=TO_DATE('
              1900-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

The optimizer is happy to oblige with the descending range scan – we can see that we’ve visited only 8 buffers and fetched only 10 rows from the table. The cost, however, hasn’t made any allowance for the limited range scan. Check back to the plan for the simple (non-partitioned) table and you’ll see that the optimizer did allow for the reduced range scan. So the problem here is a costing one – we have to hint the index range scan if we want Oracle to limit the work it does.

You might notice, by the way, that the number of rowids returned by the index range scan descending operation is 16 rather than 10 – a little variation that didn’t show up when the table wasn’t partitioned. I don’t know why this happened, but when I changed the requirement to 20 rows the range scan returned 31 rowids, when I changed it to 34 rows the range scan returned 46 rowids, and a request for 47 rows returned 61 rowids. You can see the pattern: the number of rowids returned by the index range scan seems to be 1 + 15*N.

Footnote:

If you want to avoid hinting the code (or adding an SQL patch) you need only re-create the index with the transaction_date column declared as descending (“desc”), at which point the optimizer automatically chooses the correct strategy and the run-time engine returns exactly 10 rowids and doesn’t need to do any sorting. But who wants to create a descending index when they don’t really need it!
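For example – a sketch of the alternative index definition, based on the create index statement in the test script below, with transaction_date declared as descending:


drop index t1_i1;

create index t1_i1 on t1(customer_id, transaction_date desc)
local
nologging
;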

If you want to reproduce the experiments, here’s the script to create my test data.


rem
rem     Script:         pt_ind_desc_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2018
rem     Purpose:        
rem
rem     Last tested 
rem             18.3.0.0
rem             12.2.0.1
rem             12.1.0.2
rem

create table t1 (
        customer_id,
        transaction_date,
        small_vc,
        padding 
)
partition by hash(customer_id) partitions 4
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        mod(rownum,128)                         customer_id,
        (trunc(sysdate) - 1e6) + rownum         transaction_date,
        lpad(rownum,10,'0')                     v1,
        lpad('x',100,'x')                       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
;

create index t1_i1 on t1(customer_id, transaction_date) 
local 
nologging
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

I’ve run this test on 12.1.0.2, 12.2.0.1, and 18.3.0.0 – the behaviour is the same in all three versions.

Update (1st Feb 2019)

As the client reminded me after reading the post, it’s worth pointing out that for more complex SQL you still have to worry about the errors in the cardinality and cost calculations that could easily push the optimizer into the wrong join order and/or join method – whether you choose to hint the ascending index or create a descending index.  Getting the plan you want for this type of “pagination” query can be a messy process.

January 18, 2019

DML Tablescans

Filed under: deadlocks,Infrastructure,Locks,Oracle,Parallel Execution,Performance — Jonathan Lewis @ 1:25 pm GMT Jan 18,2019

This note is a follow-up to a recent comment on a blog note about Row Migration:

So I wonder what is the difference between the two, parallel dml and serial dml with parallel scan, which makes them behave differently while working with migrated rows. Why might the strategy of serial dml with parallel scan case not work in parallel dml case? I am going to make a service request to get some clarifications but maybe I miss something obvious?

The comment also referenced a couple of MoS notes:

  • Bug 17264297 “Serial DML with Parallel scan performs single block reads during full table scan when table has chained rows in 11.2”
  • Doc ID 1514011.1 “Performance decrease for parallel DML on compressed tables or regular tables after 11.2 Upgrade”

The latter document included a comment to the effect that 11.2 uses a “Head Piece Scan” while 11.1 uses a “First Piece scan”, which is a rather helpful comment. Conveniently the blog note itself referenced an earlier note on the potential for differentiating between migrated and chained rows through a “flag” byte associated with each row piece. The flag byte has an H bit for the row head piece, an F bit for the row first piece, an L bit for the row last piece, and {no bits set} for a row piece in the middle of a chained row.

Side note: A “typical” simple row will be a single row-piece with the H, F and L bits all set; a simple migrated row will start with an “empty” row-piece in one block with the H bit set and a pointer (nrid – next rowid) to a row in another block that will have the F and L bits set and a pointer (hrid – head rowid) back to the head piece. A chained row could start with a row piece holding a few columns and the H and F bits set and a pointer to the next row piece, which might lead to a long chain of row pieces with no bits set, each pointing to the next row piece, until you get to a row piece with the L bit set. Alternatively you might have a row which has migrated and chained – which means it could start with an empty row piece with just the H bit and a pointer to the next row piece, then a row piece with the F bit set, a back pointer to the header, and a next pointer to the next row piece, which could lead to a long chain of row pieces with no bits set until you reach a row piece with the L bit set.

Combining the comments about “head piece” and “first piece” scans with the general principles of DML and locking, it’s now possible to start making some guesses about why the Oracle developers might want updates through tablescans to behave differently for serial and parallel tablescans. There are two performance targets to consider:

  • How to minimise random (single block) I/O requests
  • How to minimise the risk of deadlock between PX server processes.

Assume you’re doing a serial tablescan to find rows to update – assume for simplicity that there are no chained rows in the table. When you hit a migrated row (H bit only) you could follow the next rowid pointer (nrid) to find and examine the row. If you find that it’s a row that doesn’t need to be updated you’ve just done a completely redundant single block read; so it makes sense to ignore row pieces which are “H”-only row pieces and do a table scan based on “F” pieces (which will be FL “whole row” pieces thanks to our assumption of no chained rows). If you find a row which is an F row and it needs to be updated then you can do a single block read using the head rowid pointer (hrid) to lock the head row piece then lock the current row piece and update it; you only do the extra single block read for rows that need updates, not for all migrated rows. So this is (I guess) the “First Piece Scan” referenced in Doc ID 1514011.1. (And, conversely, if you scan the table looking only for row pieces with the H flag set this is probably the “Head Piece Scan”).

But there’s a potential problem with this strategy if the update is a parallel update. Imagine parallel server process p000 is scanning the first megabyte of a table and process p001 is scanning the second megabyte using the “first piece” algorithm.  What happens if p001 finds a migrated row (flags = FL) that needs to be updated and follows its head pointer back into a block in the megabyte being scanned by p000?  What if p000 has been busy updating rows in that block and there are no free ITLs for p001 to acquire to lock the head row piece? You have the potential for an indefinite deadlock.

On the other hand, if the scan is using the “head piece” algorithm p000 would have found the migrated row’s head piece and followed the next rowid pointer into a block in the megabyte being scanned by p001. If the row needs to be updated p000 can lock the head piece and the migrated piece.

At this point you might think that the two situations are symmetrical – aren’t you just as likely to get a deadlock because p000 now wants an ITL entry in a block that p001 might have been updating? Statistically the answer is “probably not”. When you do lots of updates it is possible for many rows to migrate OUT of a block; it is much less likely that you will see many rows migrate INTO a specific block. This means that in a parallel environment you’re more likely to see several PX servers all trying to acquire ITL entries in the same originating block than you are  to see several PX servers trying to acquire ITL entries in the same destination block. There’s also the feature that when a row (piece) migrates into a block Oracle adds an entry to the ITL list if the number of inwards migrated pieces is more than the current number of ITL entries.

Conclusion

It’s all guesswork of course, but I’d say that for a serial update by tablescan Oracle uses the “first piece scan” to minimise random I/O requests while for a parallel update by tablescan Oracle uses the “head piece scan” to minimise the risk of deadlocks – even though this is likely to increase the number of random (single block) reads.

Finally (to avoid ambiguity) if you’ve done an update which does a parallel tablescan but a serial update (by passing rowids to the query co-ordinator) then I’d hope that Oracle would use the “first piece scan” for the parallel tablescan, because there’s no risk of deadlock when the query co-ordinator is the only process doing the locking and updating, which makes it safe to use the minimum I/O strategy. (And a parallel query with serial update happens quite frequently because people forget to enable parallel dml.)

Footnote

While messing around to see what happened with updates and rows that were both migrated and chained I ran the following script to create one nasty row, so that I could dump a few table blocks to check for ITLs, pointers, and locks. The aim was to get a row with a head-only piece (“H” bit), an F-only piece, a piece with no bits set, then an L-only piece. With an 8KB block size and a 4,000 byte maximum for varchar2() this is what I did:


rem
rem     Script:         migrated_lock.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2019
rem     Purpose:
rem
rem     Last tested
rem             18.3.0.0
rem

create table t1 (
        n1 number,
        l1 varchar2(4000),
        s1 varchar2(200),
        l2 varchar2(4000),
        s2 varchar2(200),
        l3 varchar2(4000),
        s3 varchar2(200)
);

insert into t1 (n1,l1,s1) values(0,rpad('X',4000,'X'),rpad('X',200,'X'));
commit;

insert into t1 (n1,l1) values(1,null);
commit;

update t1 set
        l1 = rpad('A',4000),
        s1 = rpad('A',200),
        l2 = rpad('B',4000),
        s2 = rpad('B',200),
        l3 = rpad('C',4000),
        s3 = rpad('C',200)
where
        n1 = 1
;

commit;

execute dbms_stats.gather_table_stats(user,'t1');

update t1 set
        s1 = lower(s1),
        s2 = lower(s2),
        s3 = lower(s3)
where
        n1 = 1
;

alter system flush buffer_cache;

select
        dbms_rowid.rowid_relative_fno(rowid)    rel_file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        count(*)                                rows_starting_in_block
from
        t1
group by
        dbms_rowid.rowid_relative_fno(rowid),
        dbms_rowid.rowid_block_number(rowid)
order by
        dbms_rowid.rowid_relative_fno(rowid),
        dbms_rowid.rowid_block_number(rowid)
;

The query with all the calls to dbms_rowid gave me the file and block number of the row I was interested in, so I dumped the block, then read the trace file to find the next block in the chain, and so on. The first block held just the head piece, the second block held the n1 and l1 columns (which didn’t get modified by the update), the third block held the s1 and l2 columns, and the last block held the s2, l3 and s3 columns. I had been expecting to see the split as (head-piece), (n1, l1, s1), (l2, s2), (l3, s3) – but as it turned out the unexpected split was a bonus.
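(For reference, the block dumps came from the standard dump command – the &file and &block substitution variables here are just placeholders for the values reported by the rowid query:)


alter system dump datafile &file block &block;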

Here are extracts from each of the blocks (in the order they appeared in the chain), showing the ITL information and the “row overhead” information. If you scan through the list you’ll see that three of the four blocks have an ITL entry for transaction id (xid) 8.1e.df3, using three consecutive undo records in undo block 0x0100043d. My update has locked three of the four row pieces – the header and the two that have changed. It didn’t need to “lock” the piece that didn’t change. (This little detail was the bonus of the unexpected split.)


Block 184
---------
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.00b.00000ee1  0x01000bc0.036a.36  C---    0  scn  0x00000000005beb39
0x02   0x0008.01e.00000df3  0x0100043d.0356.2e  ----    1  fsc 0x0000.00000000

...

tab 0, row 1, @0xf18
tl: 9 fb: --H----- lb: 0x2  cc: 0
nrid:  0x00800089.0



Block 137       (columns n1, l1 - DID NOT CHANGE so no ITL entry acquired)
---------       (the lock byte relates to the previous, not cleaned, update) 
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.00b.00000ee1  0x01000bc0.036a.35  --U-    1  fsc 0x0000.005beb39
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn  0x0000000000000000

...

tab 0, row 0, @0xfcb
tl: 4021 fb: ----F--- lb: 0x1  cc: 2
hrid: 0x008000b8.1
nrid:  0x00800085.0



Block 133 (columns s1, l2)
--------------------------
Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.00b.00000ee1  0x01000bc0.036a.34  C---    0  scn  0x00000000005beb39
0x02   0x0008.01e.00000df3  0x0100043d.0356.2f  ----    1  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn  0x0000000000000000

...

tab 0, row 0, @0xf0b
tl: 4213 fb: -------- lb: 0x2  cc: 2
nrid:  0x008000bc.0



Block 188 (columns s2, l3, s3)
------------------------------
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.00b.00000ee1  0x01000bc0.036a.33  C---    0  scn  0x00000000005beb39
0x02   0x0008.01e.00000df3  0x0100043d.0356.30  ----    1  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn  0x0000000000000000

...

tab 0, row 0, @0xe48
tl: 4408 fb: -----L-- lb: 0x2  cc: 3

Note, by the way, how there are nrid (next rowid) entries pointing forward in every row piece (except the last), but it’s only the “F” (First) row-piece that has the hrid (head rowid) pointer pointing backwards.

 

January 17, 2019

Hint Reports

Filed under: dbms_xplan,Execution plans,Hints,Oracle — Jonathan Lewis @ 9:59 am GMT Jan 17,2019

Nigel Bayliss has posted a note about a frequently requested feature that has now appeared in Oracle 19c – a mechanism to help people understand what has happened to their hints. It’s very easy to use – it’s just another format option to the “display_xxx()” calls in dbms_xplan – so I thought I’d run up a little demonstration (using an example I first generated 18 years and 11 versions ago) to make three points: first, to show the sort of report you get; second, to show you that the report may tell you what has happened, but that doesn’t necessarily tell you why it has happened; and third, to remind you that you should have stopped using the /*+ ordered */ hint 18 years ago.

I’ve run the following code on livesql:


rem
rem     Script:         c_ignorehint.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2001
rem


drop table ignore_1;
drop table ignore_2;

create table ignore_1
nologging
as
select
        rownum          id,
        rownum          val,
        rpad('x',500)   padding
from    all_objects
where   rownum <= 3000
;

create table ignore_2
nologging
as
select
        rownum          id,
        rownum          val,
        rpad('x',500)   padding
from    all_objects
where   rownum <= 500
;

alter table ignore_2
add constraint ig2_pk primary key (id);


explain plan for
update
        (
                select
                        /*+
                                ordered
                                use_nl(i2)
                                index(i2,ig2_pk)
                        */
                        i1.val  val1,
                        i2.val  val2
                from
                        ignore_1        i1,
                        ignore_2        i2
                where
                        i2.id = i1.id
                and     i1.val <= 10
        )
set     val1 = val2
;

select * from table(dbms_xplan.display(null,null,'hint_report'));

explain plan for
update
        (
                select
                        /*+
                                use_nl(i2)
                                index(i2,ig2_pk)
                        */
                        i1.val  val1,
                        i2.val  val2
                from
                        ignore_1        i1,
                        ignore_2        i2
                where
                        i2.id = i1.id
                and     i1.val <= 10
        )
set     val1 = val2
;

select * from table(dbms_xplan.display(null,null,'hint_report'));

As you can see I’ve simply added the format option “hint_report” to the call to dbms_xplan.display(). Before showing you the output I’ll just say a few words about the plans we might expect from the two versions of the update statement.

Given the /*+ ordered */ hint in the first statement we might expect Oracle to do a full tablescan of ignore_1 then do a nested loop into ignore_2 (obeying the use_nl() hint) using the (hinted) ig2_pk index. In the second version of the statement, and in the absence of the ordered hint, it’s possible that the optimizer will still use the same path but, in principle, it might find some other path.

So what do we get? Here, in order, are the two execution plans:


Plan hash value: 3679612214
 
--------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT                      |          |    10 |   160 |   111   (0)| 00:00:01 |
|   1 |  UPDATE                               | IGNORE_1 |       |       |            |          |
|*  2 |   HASH JOIN                           |          |    10 |   160 |   111   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| IGNORE_2 |   500 |  4000 |    37   (0)| 00:00:01 |
|   4 |     INDEX FULL SCAN                   | IG2_PK   |   500 |       |     1   (0)| 00:00:01 |
|*  5 |    TABLE ACCESS STORAGE FULL          | IGNORE_1 |    10 |    80 |    74   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("I2"."ID"="I1"."ID")
   5 - storage("I1"."VAL"<=10)
       filter("I1"."VAL"<=10)
 
Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 3 (U - Unused (1))
---------------------------------------------------------------------------
   1 -  SEL$DA9F4B51
           -  ordered
 
   3 -  SEL$DA9F4B51 / I2@SEL$1
         U -  use_nl(i2)
           -  index(i2,ig2_pk)




Plan hash value: 1232653668
 
------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT              |          |    10 |   160 |    76   (0)| 00:00:01 |
|   1 |  UPDATE                       | IGNORE_1 |       |       |            |          |
|   2 |   NESTED LOOPS                |          |    10 |   160 |    76   (0)| 00:00:01 |
|   3 |    NESTED LOOPS               |          |    10 |   160 |    76   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS STORAGE FULL | IGNORE_1 |    10 |    80 |    74   (0)| 00:00:01 |
|*  5 |     INDEX UNIQUE SCAN         | IG2_PK   |     1 |       |     0   (0)| 00:00:01 |
|   6 |    TABLE ACCESS BY INDEX ROWID| IGNORE_2 |     1 |     8 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - storage("I1"."VAL"<=10)
       filter("I1"."VAL"<=10)
   5 - access("I2"."ID"="I1"."ID")
 
Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   5 -  SEL$DA9F4B51 / I2@SEL$1
           -  index(i2,ig2_pk)
           -  use_nl(i2)

As you can see, the “Hint Report” shows us how many hints have been seen in the SQL text, then the body of the report shows us which query block, operation and table (where relevant) each hint has been associated with, and whether it has been used or not.

The second query has followed exactly the plan I predicted for the first query, and the report shows us that Oracle noted, and used, the use_nl() and index() hints to access table ignore_2, deciding for itself to visit the tables in the order ignore_1 -> ignore_2 and doing a full tablescan on ignore_1.

The first query reports three hints, but flags the use_nl() hint as unused. (There is at least one other flag that could appear against a hint – “E” for error, probably a syntax error – so the “U” tells us that this hint is not being ignored because there’s something wrong with it.) Strangely, the report tells us that the optimizer has used the ordered hint, yet we can see from the plan that the tables appear in the opposite order to the order we specified in the from clause, and the chosen order has forced the optimizer into using an index full scan on ig2_pk because it had to obey our index() hint.  Bottom line – the optimizer has managed to find a more costly plan by “using but apparently ignoring” a hint that described the cheaper plan we would have got if we hadn’t used the hint.

Explanation

Query transformation can really mess things up and you shouldn’t be using the ordered hint.

I’ve explained many times over the years that the optimizer evaluates the cost of an update statement by calculating the cost of selecting the rowids of the rows to be updated. In this case, which uses an updatable join view, the steps taken to follow this mechanism are slightly more complex.  Here are two small but critical extracts from the 10053 trace file (taken from an 18c instance):


CVM:   Merging SPJ view SEL$1 (#0) into UPD$1 (#0)
Registered qb: SEL$DA9F4B51 0x9c9966e8 (VIEW MERGE UPD$1; SEL$1; UPD$1)

...

SQE: Trying SQ elimination.
Query after View Removal
******* UNPARSED QUERY IS *******
SELECT
        /*+ ORDERED INDEX ("I2" "IG2_PK") USE_NL ("I2") */
        0
FROM    "TEST_USER"."IGNORE_2" "I2",
        "TEST_USER"."IGNORE_1" "I1"
WHERE   "I2"."ID"="I1"."ID"
AND     "I1"."VAL"<=10


The optimizer has merged the UPDATE query block with the SELECT query block to produce a select statement that will generate the necessary plan (I had thought that i1.rowid would appear in the select list, but the ‘0’ will do for costing purposes). Notice that the hints have been preserved as the update and select were merged but, unfortunately, the merge mechanism has reversed the order of the tables in the from clause. So the optimizer has messed up our select statement, then obeyed the original ordered hint!

Bottom line – the hint report is likely to be very helpful in most cases but you will still have to think about what it is telling you, and you may still have to look at the occasional 10053 to understand why the report is showing you puzzling results. You should also stop using a hint that was replaced by a far superior hint more than 18 years ago – the ordered hint in my example should have been changed to /*+ leading(i1 i2) */ in Oracle 9i.
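For this example the fix is nothing more than a change to the first hint – the rest of the hint set stays exactly as it was:

/*+ ordered        use_nl(i2) index(i2, ig2_pk) */      -- the original hint set
/*+ leading(i1 i2) use_nl(i2) index(i2, ig2_pk) */      -- the modern equivalent, naming the join order explicitly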

December 21, 2018

QC vs. PX

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 12:26 pm GMT Dec 21,2018

One last post before closing down for the Christmas break.

Here’s a little puzzle with a remarkably easy and obvious solution that Ivica Arsov presented at the UKOUG Tech2018 conference. It’s a brilliant little puzzle that makes a very important point, because it reminded me that most problems are easy and obvious only after you’ve seen them at least once. If you’ve done a load of testing and investigation into something it’s easy to forget that there may be many scenarios you haven’t even thought of testing – so when you see the next puzzle your mind follows all the things you’ve done previously and doesn’t think that you might be looking at something new.

In this case I had to wait until the end of the presentation to discover how “easy and obvious” the solution was. Here’s a query with its results: all I’m going to do is join a session (from v$session) with all its parallel execution slaves by looking for the matching qcsid in v$px_session:


break on server_group skip 1 duplicate

select
        px.sid, px.qcsid,
        px.server_group, px.server_set, px.server#,
        ss.sql_id
from
        V$px_session px,
        v$session ss
where
        ss.username = 'TEST_USER'
and     ss.sid = px.sid
order by
        px.server_group nulls first, px.server_set, px.server#
;

     QCSID        SID SERVER_GROUP SERVER_SET    SERVER# SQL_ID
---------- ---------- ------------ ---------- ---------- -------------
       357        357                                    b4wg6286xn324

       357        125            1          1          1 bppfad1y1auhj
       357        246                       1          2 bppfad1y1auhj
       357        364                       1          3 bppfad1y1auhj

       357          7            2          1          1 5vdbyjy0c7dam
       357        133                       1          2 5vdbyjy0c7dam
       357        253                       1          3 5vdbyjy0c7dam

As you can see, session 357 is reported as a query coordinator session with two parallel server groups of 3 slave processes each. Strangely, though, the coordinator and the two groups of parallel query slaves are reporting different SQL_IDs, which is probably contrary to the experience that most of us have had. When a parallel query (or DML or DDL statement) is executing, the query co-ordinator and all its slave processes should report the same SQL_ID – so what’s happening here?

Little pause for thought …
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
… and no doubt some of you were immediately aware of the probable explanation. It’s very simple if you’ve come across the phenomenon before. Here’s the SQL that allowed me (from another session) to capture this result:


rem
rem     Script: px_qc_joke_2.sql
rem     Author: Jonathan Lewis
rem     Dated:  Dec 2018
rem

create table t1 nologging
as
select ao.*
from
        all_objects ao,
        (
         select rownum id
         from   dual
         connect by level  comment to avoid wordpress format issue
;

create table t2 nologging as select * from t1;
create table t3 nologging as select * from t1;

begin
        dbms_stats.gather_table_stats(user,'t1', method_opt=>'for all columns size 1');
        dbms_stats.gather_table_stats(user,'t2', method_opt=>'for all columns size 1');
        dbms_stats.gather_table_stats(user,'t3', method_opt=>'for all columns size 1');
end;
/

prompt  =====================
prompt  Starting PL/SQL block
prompt  Prepare to query v$
prompt  =====================

declare
        cursor c1 is select /*+ parallel (t1 3) */ object_id from t1;
        cursor c2 is select /*+ parallel (t2 3) */ object_id from t2;
        m_id1 number;
        m_id2 number;
begin
        open c1;
        fetch c1 into  m_id1;

        open c2;
        fetch c2 into  m_id2;

        for r in (select owner from t3 where object_id = least(m_id1,m_id2)) loop
                dbms_output.put_line(r.owner);
                dbms_lock.sleep(3);
        end loop;

        dbms_output.put_line(m_id1 || ' - ' || m_id2); 

        close c2;
        close c1;
end;
/

I’ve opened two cursors on parallel queries inside an anonymous PL/SQL block. The driving session is the query co-ordinator for two different parallel queries at the same time because it’s keeping two cursors open – and it’s also executing the query in the driving cursor for loop. If we check v$sql for the three SQL_IDs reported from v$px_session (the lookup is sketched below) this is what we see:
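The lookup needs nothing more sophisticated than this (add whatever column formatting you prefer):

select  sql_id, sql_text
from    v$sql
where   sql_id in ('b4wg6286xn324', 'bppfad1y1auhj', '5vdbyjy0c7dam')
;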


SQL_ID
-------------
SQL_TEXT
--------------------------------------------------------------------------------
b4wg6286xn324
declare  cursor c1 is select /*+ parallel (t1 3) */ object_id from t1;  cursor c
2 is select /*+ parallel (t2 3) */ object_id from t2;  m_id1 number;  m_id2 numb
er; begin  open c1;  fetch c1 into  m_id1;   open c2;  fetch c2 into  m_id2;   f
or r in (select owner from t3 where object_id = least(m_id1,m_id2)) loop   dbms_
output.put_line(r.owner);   dbms_lock.sleep(1);  end loop;   dbms_output.put_lin
e(m_id1 || ' - ' || m_id2);   close c2;  close c1; end;

bppfad1y1auhj
SELECT /*+ parallel (t1 3) */ OBJECT_ID FROM T1

5vdbyjy0c7dam
SELECT /*+ parallel (t2 3) */ OBJECT_ID FROM T2

Apart from the warning that it’s easy to be misled by a problem because you keep thinking of all the cases you’ve seen before, there’s another important point behind this little quiz. It’s often said that when you run a parallel query you may actually use “2 * DOP” parallel query slaves – and that’s true (though for more complicated queries you may get multiple DFO trees at once, each with its own “2 * DOP(n)” slaves) – but it’s worth remembering that even with very simple queries a single session can have many cursors open at once, holding “2 * DOP” slaves for each one, and ruining the response time for every other session because every other session ends up running serial queries.
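If you want a quick picture of how many slaves each co-ordinator on the system is holding at any moment, a simple aggregate of v$px_session (excluding the co-ordinators’ own rows) is enough – a sketch:

select
        qcsid, server_group, count(*) slaves
from
        v$px_session
where
        sid != qcsid
group by
        qcsid, server_group
order by
        qcsid, server_group
;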

Update (Feb 2019)

On a re-read of this note it crossed my mind to wonder what would happen if parallel query queueing were enabled and you had two sessions executing PL/SQL blocks that ended up colliding in the following way:

  • Session 1: open parallel cursor 1a – using half the available PX slaves
  • Session 2: open parallel cursor 2a – using the rest of the PX slaves
  • Session 1: open parallel cursor 1b – demanding the rest of the slaves, and joining the parallel queue.
  • Session 2: open parallel cursor 2b – demanding the rest of the slaves, and joining the parallel queue.

If this scenario is possible you now have a “parallel queue deadlock” between the sessions, with no obvious scope for one of them to “rollback” and allow the other to continue. One of the PL/SQL blocks probably has to fail and terminate (which could be very bad news if you’ve done some DML and commits in the middle of the PL/SQL, prior to opening both cursors).

It’s possible, of course, that there are configuration details that mean this collision can’t happen. One day I may find time to test the idea – but maybe an interested (or well-informed) reader will supply the answer (or a reference to the bit of the manual I should have read before writing this update).
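If anyone wants to experiment, a starting point might be two sessions each running a variant of the original PL/SQL block, with a pause between the two opens so that the sessions can interleave. This is purely an untested sketch of the idea – it assumes parallel statement queueing is active (parallel_degree_policy = auto) and that parallel_servers_target is low enough (say 12) that each cursor’s requested degree of 6 accounts for half the available slaves:

declare
        cursor c1 is select /*+ parallel (t1 6) */ object_id from t1;
        cursor c2 is select /*+ parallel (t2 6) */ object_id from t2;
        m_id1 number;
        m_id2 number;
begin
        open c1;
        fetch c1 into m_id1;            -- grab (roughly) half the available slaves

        dbms_lock.sleep(5);             -- give the other session time to open its first cursor

        open c2;                        -- this open should now have to join the parallel statement queue
        fetch c2 into m_id2;

        close c2;
        close c1;
end;
/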

December 20, 2018

Transitive Closure

Filed under: CBO,Execution plans,Oracle — Jonathan Lewis @ 1:19 pm GMT Dec 20,2018

This is a follow-up to a note I wrote nearly 12 years ago, looking at the problems of transitive closure (or absence thereof) from the opposite direction. Transitive closure gives the optimizer one way of generating new predicates from the predicates you supply in your where clause (or, in some cases, your constraints); but it’s a mechanism with some limitations. Consider the following pairs of predicates:


    t1.col1 = t2.col2
and t2.col2 = t3.col3

    t1.col1 = t2.col2
and t2.col2 = 'X'

A person can see that the first pair of predicates allows us to infer that “t1.col1 = t3.col3” and the second pair of predicates allows us to infer that “t1.col1 = ‘X’”. The optimizer is coded to recognize only the second inference. This has an important side effect that can have a dramatic impact on performance, in a way that’s far more likely to appear if your SQL is generated by code. Consider this sample data set (reproduced from the 2006 article):

rem
rem     Script:         transitive_loop.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2006
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem

create table t1 
as
select
        mod(rownum,100) col1,
        rpad('x',200)   v1
from
        all_objects
where   
        rownum <= 2000
;

create table t2
as
select
        mod(rownum,100) col2,
        rpad('x',200)   v2
from
        all_objects
where   
        rownum <= 2000
;

create table t3
as
select
        mod(rownum,100) col3,
        rpad('x',200)   v3
from
        all_objects
where   
        rownum <= 2000
;

-- gather stats if necessary

set autotrace traceonly explain

prompt  =========================
prompt  Baseline - two hash joins
prompt  =========================

select 
        t1.*, t2.*, t3.*
from
        t1, t2, t3
where
        t2.col2 = t1.col1
and     t3.col3 = t2.col2
;

prompt  ================================================
prompt  Force mismatch between predicates and join order
prompt  ================================================

select 
        /*+
                leading(t1 t3 t2)
        */
        t1.*, t2.*, t3.*
from
        t1, t2, t3
where
        t2.col2 = t1.col1
and     t3.col3 = t2.col2
;

The first query simply joins the tables in the from clause order on a column we know will have 20 rows for each distinct value, so the result sets will grow from 2,000 rows to 40,000 rows to 800,000 rows. Looking at the second query we would like to think that when we force Oracle to use the join order t1 -> t3 -> t2 it would be able to use the existing predicates to generate the predicate “t3.col3 = t1.col1” and therefore be able to do the same amount of work as the first query (and, perhaps, manage to produce the same final cardinality estimate).

Here are the two plans, taken from an instance of 12.2.0.1:


=========================
Baseline - two hash joins
=========================

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |   800K|   466M|    48  (38)| 00:00:01 |
|*  1 |  HASH JOIN          |      |   800K|   466M|    48  (38)| 00:00:01 |
|   2 |   TABLE ACCESS FULL | T3   |  2000 |   398K|    10   (0)| 00:00:01 |
|*  3 |   HASH JOIN         |      | 40000 |    15M|    21   (5)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T1   |  2000 |   398K|    10   (0)| 00:00:01 |
|   5 |    TABLE ACCESS FULL| T2   |  2000 |   398K|    10   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T3"."COL3"="T2"."COL2")
   3 - access("T2"."COL2"="T1"."COL1")

================================================
Force mismatch between predicates and join order
================================================

------------------------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |   800K|   466M| 16926   (3)| 00:00:01 |
|*  1 |  HASH JOIN            |      |   800K|   466M| 16926   (3)| 00:00:01 |
|   2 |   TABLE ACCESS FULL   | T2   |  2000 |   398K|    10   (0)| 00:00:01 |
|   3 |   MERGE JOIN CARTESIAN|      |  4000K|  1556M| 16835   (2)| 00:00:01 |
|   4 |    TABLE ACCESS FULL  | T1   |  2000 |   398K|    10   (0)| 00:00:01 |
|   5 |    BUFFER SORT        |      |  2000 |   398K| 16825   (2)| 00:00:01 |
|   6 |     TABLE ACCESS FULL | T3   |  2000 |   398K|     8   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."COL2"="T1"."COL1" AND "T3"."COL3"="T2"."COL2")

As you can see, there’s a dramatic difference between the two plans, and a huge difference in cost (though the predicted time for both is still no more than 1 second).

The first plan, where we leave Oracle to choose the join order, builds an in-memory hash table from t3, then joins t1 to t2 with a second hash join and probes the t3 hash table with the rows produced by that join.

The second plan, where we force Oracle to use a join order that (I am pretending) we believe to be better, results in Oracle doing a Cartesian merge join between t1 and t3 that explodes the intermediate result set to 4 million rows (and the optimizer’s estimate is correct) before the final join to t2 eliminates a huge amount of redundant data.

As far as performance is concerned, the first query took 0.81 seconds to generate its result set, the second query took 8.81 seconds. In both cases CPU time was close to 100% of the total time.

As a follow-up demo I added the extra predicate “t3.col3 = t1.col1” to the second query, allowing the optimizer to use a hash join with the join order t1 -> t3 -> t2, and this brought the run time back down (with a slight increase due to the extra predicate check on the second join).
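The modified query would look something like this – the only change is the extra (logically redundant) predicate at the end of the where clause:

select 
        /*+
                leading(t1 t3 t2)
        */
        t1.*, t2.*, t3.*
from
        t1, t2, t3
where
        t2.col2 = t1.col1
and     t3.col3 = t2.col2
and     t3.col3 = t1.col1
;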

Summary

The choice of columns in join predicates may stop Oracle from choosing the best join order because it is not able to use transitive closure to generate all the extra predicates that the human eye can see. If you are using programs to generate SQL rather than writing it by hand, you are more likely to see this limitation result in execution plans that are less efficient than they could be.

December 18, 2018

NULL predicate

Filed under: CBO,Execution plans,Indexing,Oracle — Jonathan Lewis @ 1:13 pm GMT Dec 18,2018

People ask me from time to time if I’m going to write another book on the Cost Based Optimizer – and I think the answer has to be no because the product keeps growing so fast it’s not possible to keep up and because there are always more and more little details that might have been around for years and finally show up when someone asks me a question about some little oddity I’ve never noticed before.

The difficulty with the “little oddities” is the amount of time you could spend trying to work out whether or not they matter and whether it’s worth writing about them. Here’s a little example to show what I mean – first the data set:


rem
rem     Script:         null_filter.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2018
rem     Purpose:
rem
rem     Last tested
rem             18.3.0.0
rem             12.1.0.2
rem

create table t1
nologging
as
select  *
from    all_objects
where   rownum <= 50000 -- > comment to avoid wordpress format issue
;

insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
commit;

create index t1_i1 on t1(object_type, data_object_id, object_id, created);

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );
end;
/

It’s a simple data set with a single index. The only significant thing about the index is that the second column (data_object_id) is frequently null, and this leads to a little quirk in the execution plans for a very similar pair of statements, which I’ll show after a quick check of the data.
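The check is just a pair of counts – the exact figures will depend on your version of Oracle and the contents of your all_objects:

select
        count(*)                total_rows,
        count(data_object_id)   non_null_ids
from
        t1
;

And now the pair of statements: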


set serveroutput off
alter session set statistics_level = all;

select
        object_name, owner
from
        t1
where
        object_type = 'TABLE'
and     data_object_id = 20002
and     object_id = 20002
and     created > trunc(sysdate - 90)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

select
        object_name, owner
from
        t1
where
        object_type = 'TABLE'
and     data_object_id is null
and     object_id = 20002
and     created > trunc(sysdate - 90)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

How much difference would you expect in the execution plans for these two queries? There is, of course, the side effect of the “is null” predicate disabling the “implicit column group” that is the index distinct_keys value, but in this case I’ve got a range-based predicate on one of the columns so Oracle won’t be using the distinct_keys anyway.

Of course there’s the point that you can’t use the equality operator with null, you have to use “is null” – and that might make a difference, but how? Here are the two execution plans:


----------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |      0 |00:00:00.01 |       3 |      1 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |      0 |00:00:00.01 |       3 |      1 |
|*  2 |   INDEX RANGE SCAN                  | T1_I1 |      1 |      1 |      0 |00:00:00.01 |       3 |      1 |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("OBJECT_TYPE"='TABLE' AND "DATA_OBJECT_ID"=20002 AND "OBJECT_ID"=20002 AND
              "CREATED">TRUNC(SYSDATE@!-90))

-------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |      0 |00:00:00.01 |       3 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |      0 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN                  | T1_I1 |      1 |      1 |      0 |00:00:00.01 |       3 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("OBJECT_TYPE"='TABLE' AND "DATA_OBJECT_ID" IS NULL AND "OBJECT_ID"=20002 AND
              "CREATED">TRUNC(SYSDATE@!-90))
       filter(("OBJECT_ID"=20002 AND "CREATED">TRUNC(SYSDATE@!-90)))

The query with the predicate “data_object_id is null” repeats the object_id and sysdate predicates as both access predicates and filter predicates. This seems a little surprising, and a potential performance threat. In the first query the run-time engine will hit the correct index leaf block in exactly the right place very efficiently and then walk along the index supplying every rowid to the parent operator until it hits the end of the range.

With the “is null” plan the run-time engine will be checking the actual values of object_id and created for every index entry on the way – how much extra CPU will this use and, more importantly, might Oracle start with the first index entry where object_type = ‘TABLE’ and data_object_id is null and walk through every index entry that has that null, checking for the correct object_id as it goes?

That last question is the reason for running the query with rowsource execution stats enabled. The first query did a single physical read while the second didn’t have to, but the more important detail is that both queries did the same number of buffer gets – and there is, by the way, a set of eight rows where the object_id and data_object_id are 20,002, but they were created several years ago so the index range scan returns no rows in both cases.
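(A quick check of that claim, if you want one – the count and dates will depend on your copy of all_objects:)

select  count(*), min(created), max(created)
from    t1
where   object_id      = 20002
and     data_object_id = 20002
;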

Based on that comparison, how do we show that Oracle has not walked all the way from the first index entry where object_type = ‘TABLE’ and data_object_id is null, checking every entry on the way? Or, to put it another way, has Oracle really managed to prune the index range scan down to the minimum “wedge” indicated by the presence of the predicates “OBJECT_ID”=20002 AND “CREATED”>TRUNC(SYSDATE@!-90) as access predicates?

Let’s just count the number of leaf blocks that might be relevant, using the sys_op_lbid() function (last seen here) that Oracle uses internally to count the number of leaf blocks in an index. First we get the index object_id, then we scan it to see how many leaf blocks hold entries that match our object_type and data_object_id predicates but appear in the index before our target value of 20,002:


column object_id new_value m_index_id

select
        object_id
from
        user_objects
where
        object_type = 'INDEX'
and     object_name = 'T1_I1'
;

select  distinct sys_op_lbid(&m_index_id, 'L', rowid)
from    t1
where   object_type    = 'TABLE'
and     data_object_id is null
and     object_id      < 20002
;


SYS_OP_LBID(159271
------------------
AAAm4nAAFAAACGDAAA
AAAm4nAAFAAACF9AAA
AAAm4nAAFAAACGCAAA
AAAm4nAAFAAACF/AAA
AAAm4nAAFAAACF+AAA
AAAm4nAAFAAACGFAAA
AAAm4nAAFAAACGEAAA
AAAm4nAAFAAACGGAAA

8 rows selected.


This tells us that there are 8 leaf blocks in the index that we would have to range through before we found object_id 20,002 – so we would have seen at least 8 buffer gets, not 3, in the rowsource execution stats if Oracle had not been clever with its access predicates and narrowed down the wedge of the index it was probing.

Bottom line: for a multi-column index there seems to be a difference in execution plans between “column is null” and “column = constant” when the column is one of the earlier columns in the index – but even though the “is null” option results in some access predicates re-appearing as filter predicates in the index range scan the extra workload is probably not significant – Oracle still uses the minimum number of index leaf blocks in the index range scan.

 
