Oracle Scratchpad

March 21, 2019

Lost time

Filed under: Execution plans,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 12:51 pm BST Mar 21,2019

Here’s a little puzzle that came up in the ODC database forum yesterday – I’ve got a query that has been captured by SQL Monitor, and it’s taking much longer to run than it should, but the monitoring report isn’t telling me what I need to know about the time.

Here’s a little model to demonstrate the problem – I’m going to join a table to itself (the self join isn’t a necessary feature of the demonstration; I’ve just been a bit lazy in preparing data). Here’s a (completely truthful) description of the table:

SQL> desc t1
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ID                                     NUMBER
 MOD_1000                               NUMBER
 V1                                     VARCHAR2(40)
 V2                                     VARCHAR2(40)
 PADDING                                VARCHAR2(100)

SQL> select num_rows, blocks from user_tables where table_name = 'T1';

  NUM_ROWS     BLOCKS
---------- ----------
    400000       7798

1 row selected.


And here’s the text version of the output I get from dbms_monitor.report_sql_monitor() for the query running on 18.3.0.0:


Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TEST_USER (261:36685)
 SQL ID              :  g6j671u7zc9mn
 SQL Execution ID    :  16777218
 Execution Started   :  03/21/2019 08:54:56
 First Refresh Time  :  03/21/2019 08:54:56
 Last Refresh Time   :  03/21/2019 08:55:17
 Duration            :  21s
 Module/Action       :  MyModule/MyAction
 Service             :  SYS$USERS
 Program             :  sqlplus@linux183.localdomain (TNS V1-V3)
 Fetch Calls         :  2

Global Stats
==========================================================================
| Elapsed |   Cpu   |    IO    | PL/SQL  | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Time(s) | Calls |  Gets  | Reqs | Bytes |
==========================================================================
|      21 |      21 |     0.04 |    0.07 |     2 |     3M |  215 | 180MB |
==========================================================================

SQL Plan Monitoring Details (Plan Hash Value=83896840)
==================================================================================================================================================
| Id |       Operation       | Name |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                       |      | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT      |      |         |      |        21 |     +1 |     1 |        1 |      |       |     . |          |                 |
|  1 |   FILTER              |      |         |      |        21 |     +1 |     1 |        1 |      |       |     . |          |                 |
|  2 |    HASH JOIN OUTER    |      |     400 | 2014 |        21 |     +1 |     1 |      400 |      |       |   2MB |          |                 |
|  3 |     TABLE ACCESS FULL | T1   |     400 | 1005 |         1 |     +1 |     1 |      400 |   62 |  60MB |     . |          |                 |
|  4 |     TABLE ACCESS FULL | T1   |    400K | 1001 |        21 |     +1 |     1 |     400K |      |       |     . |          |                 |
==================================================================================================================================================
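
For reference, the sort of call that produces a text report like the one above looks something like this. It’s just a sketch – I’ve written it against dbms_sqltune.report_sql_monitor(), which takes the same sort of parameters (use whichever package your version supplies) – with the sql_id taken from the Global Information section:

set long 1000000
set longchunksize 1000000
set linesize 255
set pagesize 0

select
        dbms_sqltune.report_sql_monitor(
                sql_id       => 'g6j671u7zc9mn',
                type         => 'TEXT',
                report_level => 'ALL'
        )
from    dual
;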

As you can see from the report, the total elapsed time is 21 seconds, and the CPU time accounts for the whole 21 seconds.

This seems a little large for a simple hash join, so we should look at the “Activity Detail” section of the plan because that will report any ASH samples that can be found for this execution of this query. And that’s where the problem lies: despite using 21 CPU seconds in 21 seconds there are no ASH samples for the execution! Moreover – looking for other numerical oddities – the plan says we read 60MB in 62 read requests (that’s the first tablescan of the 7,798 blocks of t1), but the summary says we read 180MB – so where did we lose (or gain) 120MB?
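
If you want to check the missing-samples claim for yourself, a quick query like the following should come back with a count of zero for this sql_id. This is just a sketch – run it shortly after the query completes, and adjust the time window to taste:

select  count(*)
from    v$active_session_history
where   sql_id       = 'g6j671u7zc9mn'
and     sample_time  > sysdate - 5/1440        -- the last five minutes
;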

It might help to see the query (which is exactly why I didn’t show it in the output above) and it might help to see the predicate section (which SQL Monitor doesn’t report). So here’s the query and its plan, with the resulting predicate section, pulled from memory:


SQL_ID  g6j671u7zc9mn, child number 0
-------------------------------------
select  /*+ monitor */  t1a.mod_1000, t1b.mod_1000 from  t1 t1a,  t1
t1b where  t1a.mod_1000 = 500 and t1b.id(+) = t1a.id and
nvl(t1b.mod_1000,0) + f2(t1a.id) + f3(t1a.id) > 0

Plan hash value: 83896840

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |       |       |  2014 (100)|          |
|*  1 |  FILTER             |      |       |       |            |          |
|*  2 |   HASH JOIN OUTER   |      |   400 |  7200 |  2014   (4)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| T1   |   400 |  3600 |  1005   (3)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T1   |   400K|  3515K|  1001   (3)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NVL("T1B"."MOD_1000",0)+"F2"("T1A"."ID")+"F3"("T1A"."ID")>0)
   2 - access("T1B"."ID"="T1A"."ID")
   3 - filter("T1A"."MOD_1000"=500)


Notice the two function calls that are part of the FILTER operation. That’s where the time is going – but I don’t think you can infer that from the SQL Monitor report (which rather suggests that the problem might be with the second full tablescan of t1 at operation 4).

Functions f2() and f3() each do a tablescan of another table that is the same size as t1 – and it looks as if they’re both going to be called 400 times, which is a lot of work. The 120MB of “lost” reads is the first tablescan of each of those two tables (two scans of roughly 7,800 blocks at 8KB per block is about 2 x 60MB); the 3M buffer gets (in the summary, if you didn’t notice it) come from the repeated tablescans as each row comes out of the hash join and the filter is applied.

Why doesn’t the report show us the CPU (and disk) samples? Because it’s querying ASH (v$active_session_history) by the SQL_ID of the principal SQL statement – and the work is being done by two other statements with different SQL_IDs.

Since it’s PL/SQL functions doing all the work, why isn’t the 21 seconds of CPU showing up in the “PL/SQL time(s)” summary figure? Because the time isn’t being spent in PL/SQL, it’s being spent in the SQL being run by the PL/SQL.
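
One way to see where the time really went is to go back to ASH for the session and group by the sql_id of whatever was actually executing, reporting the top-level sql_id alongside it. Here’s a sketch – the session id comes from the “Session” line of the Global Information section, and the time window is illustrative:

select
        sql_id, top_level_sql_id, count(*)      samples
from
        v$active_session_history
where
        session_id   = 261
and     sample_time  > sysdate - 5/1440         -- the last five minutes
group by
        sql_id, top_level_sql_id
order by
        count(*) desc
;

You’d expect the statements run by the two functions to account for virtually all the samples, with the top_level_sql_id column pointing back at the sql_id of the main query.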

Just to finish off, let’s try to find the SQL being run by the PL/SQL. I’ll post the script to recreate the example at the end of the posting but for the moment I’ll just show you the query against v$sql that I ran to extract the SQL that’s embedded in the functions:


select  sql_id, executions, disk_reads, buffer_gets, sql_text
from    V$sql 
where   sql_text like 'SELECT%T2%' 
or      sql_text like 'SELECT%T3%'
;

SQL_ID        EXECUTIONS DISK_READS BUFFER_GETS SQL_TEXT
------------- ---------- ---------- ----------- --------------------------------------------------
12ytf1rry45d9        400       7683     3072817 SELECT MAX(ID) FROM T2 WHERE MOD_1000 = :B1
85dmsgqg3bh4w          1       7680        7698 SELECT MAX(ID) FROM T3 WHERE MOD_1000 = :B1

As you can see, there are roughly 120MB of disk I/O and 3M buffer gets due to these two statements – and one of them has run the 400 times we expected. It looks as if Oracle has done a cute little optimisation with the other function, though. If you look at the predicate it says:

  • NVL("T1B"."MOD_1000",0)+"F2"("T1A"."ID")+"F3"("T1A"."ID")>0

It looks as if Oracle has either cached the result of the f3() function call (in which case why didn’t it also cache the f2() result?) or it has stopped evaluating the predicate as soon as the running total exceeded zero (but that might be a problem since f3() could return a negative number!). I don’t know exactly why Oracle is doing what it’s doing – but if I reverse the order of the functions in the predicate the f3() query will run 400 times and the f2() query will run once.

Update

Following the comment below from “anonymous”, pointing out that the “Active” SQL Monitor report (and, of course, the equivalent OEM screen) has a section showing the impact of everything the session has been doing while the query ran, I re-ran my call to dbms_monitor.report_sql_monitor() with the “type” parameter set to ‘ACTIVE’ rather than ‘TEXT’. After spooling this to a text file (set linesize 255, set pagesize 0) with the suffix “html” and trimming the top and bottom off so that the file started and ended with the opening and closing html tags, I opened it in Firefox.
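
If you want to reproduce that step, the SQL*Plus session looks something like the following sketch (again I’ve written the call against dbms_sqltune.report_sql_monitor(); remember to trim any stray lines so that the spooled file really does start and end with the html tags):

set linesize 255
set pagesize 0
set long 1000000
set longchunksize 1000000
set trimspool on

spool sql_monitor_active.html

select
        dbms_sqltune.report_sql_monitor(
                sql_id => 'g6j671u7zc9mn',
                type   => 'ACTIVE'
        )
from    dual
;

spool off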

You have to have Adobe Flash Player for this to work, and you have to be connected to the Internet as the file contains content that will call to Oracle’s home domain. After getting Adobe Flash to work, here’s a snapshot of the resulting display:

The print may be a little small, but in the lower half of the screen (which usually shows the plan with execution statistics) I’ve selected the “Activity” tab, and this has produced a graphic display of all the samples captured for the session while the query was running. I took this report after re-running the test case, but I had swapped the order of the two functions in the interim, so this output is showing that (from Oracle’s perspective) all the work done during the execution of the query was done by a statement with SQL_ID 85dmsgqg3bh4w.

Appendix

If you want to re-run some tests here’s the code I used to create the demo:


create table t1
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,1000)              mod_1000,
        lpad(rownum,10,'0')             v1,
        lpad(rownum,10,'0')             v2,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 4e5 -- > comment to avoid WordPress format issue
;

-- alter table t1 add constraint t1_pk primary key(id);

create table t2 nologging as select * from t1;
create table t3 nologging as select * from t1;

create function f2 (i_in number) return number
as
        m_ret number;
begin
        select max(id)
        into    m_ret
        from    t2
        where   mod_1000 = i_in
        ;

        return m_ret;
end;
/

create function f3 (i_in number) return number
as
        m_ret number;
begin
        select  max(id)
        into    m_ret
        from    t3
        where   mod_1000 = i_in
        ;

        return m_ret;
end;
/


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T3',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter system flush buffer_cache;

variable b1 number
variable b2 number

set timing on

select
        /*+ monitor */
        t1a.mod_1000, t1b.mod_1000
from
        t1      t1a,
        t1      t1b
where
        t1a.mod_1000 = 500
and     t1b.id(+) = t1a.id
and     nvl(t1b.mod_1000,0) + f3(t1a.id) + f2(t1a.id) > 0
;


One interesting little detail, if you try the tests, is that the join has to be an outer join for the FILTER operation to appear.
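
If you want to see that for yourself, the variant to try is simply the same query with the (+) removed so that the join becomes an inner join – a sketch:

select
        /*+ monitor */
        t1a.mod_1000, t1b.mod_1000
from
        t1      t1a,
        t1      t1b
where
        t1a.mod_1000 = 500
and     t1b.id = t1a.id
and     nvl(t1b.mod_1000,0) + f3(t1a.id) + f2(t1a.id) > 0
;

With the inner join you should find that the predicate holding the function calls is applied as a filter predicate on the join itself rather than as a separate FILTER operation above it.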

9 Comments

  1. Hi Jonathan,
    If you use the ACTIVE sql monitor, it will show you the “recursive” sql that is consuming your time; in this case, it gave me 85dmsgqg3bh4w.

    Comment by Anonymous — March 21, 2019 @ 6:25 pm BST Mar 21,2019

  2. Anonymous,

    Thanks for that. I’ll have to run up a few tests and check what gets in and what goes missing with ACTIVE.

    Okay – re-ran in 12.1.0.2 and generated ACTIVE to get an HTML file.
    The sql_id appeared under “details -> activity”, so that’s nice because virtually all the time was spent on that statement in the version shown in the blog post.

    (I discovered, though, that in the code I ran and discussed I had the f2() and f3() function calls the opposite way round from the way they were in the full script at the end of the article.)

    Comment by Jonathan Lewis — March 21, 2019 @ 9:38 pm BST Mar 21,2019

  3. I have retested this scenario too, and with the active report I also got the sql_id of this recursive query. But this is rather inconvenient and easy to miss. Maybe there is some easier way to find those recursive queries?

    Comment by sopenas — March 22, 2019 @ 11:17 am BST Mar 22,2019

    • sopenas,

      Randolf Geist has produced a tool called xplan_ash which enhances the presentation of in-memory execution plans to include as much data as possible. It’s been some time since I last looked at where he’d got to, but possibly he’s got something that includes information about the recursive SQL. A fairly recent link to start at would be: https://oracle-randolf.blogspot.com/2016/06/new-version-of-xplanash-utility.html

      Comment by Jonathan Lewis — March 22, 2019 @ 12:08 pm BST Mar 22,2019

      • Jonathan,

        thanks for mentioning XPLAN_ASH and the link.

        I, too, wanted to add that the “active” version of the SQL Monitoring report also covers recursive activity (and any other kind of activity / SQL_IDs – think, for example, of a session having multiple cursors open at the same time and fetching from different ones alternately).

        XPLAN_ASH covers and summarises that information in its section “Summary of this sessions’ other activity” along with SQL texts of the other SQL_IDs found if available – which can easily be missed with all that information XPLAN_ASH displays. It’s rather at the top of the report…

        Kind regards,
        Randolf

        Comment by Randolf Geist — March 22, 2019 @ 6:49 pm BST Mar 22,2019

      • And I’ve just tried out your example – since it takes things to an extreme and XPLAN_ASH is supposed to use ASH as its primary data source, it fails miserably in this case: there is simply not a single ASH sample for the main query, so XPLAN_ASH can’t detect anything useful from ASH.

        If I change the query to have at least some activity of the main query recorded in ASH, like this, which adds some execution time at the start and the end relating to the main query:

        select
                /*+ leading(t1a) use_nl(t1b) */
                max(t1a.mod_1000), max(t1b.mod_1000)
        from
                t1      t1a,
                t1      t1b
        where
                t1a.mod_1000 = 500
        and     t1b.id = t1a.id
        union all
        select
                /*+ monitor */
                t1a.mod_1000, t1b.mod_1000
        from
                t1      t1a,
                t1      t1b
        where
                t1a.mod_1000 = 500
        and     t1b.id = t1a.id
        and     nvl(t1b.mod_1000,0) + f3(t1a.id) + f2(t1a.id) > 0
        union all
        select
                /*+ leading(t1a) use_nl(t1b) */
                max(t1a.mod_1000), max(t1b.mod_1000)
        from
                t1      t1a,
                t1      t1b
        where
                t1a.mod_1000 = 500
        and     t1b.id = t1a.id
        ;
        

        Then XPLAN_ASH will show for the middle part (your original query) something like this in the output:

        Summary of this sessions' other activity
        -----------------------------------------------
        
                     |            |             |            |            |                |              |
        DURATION SECS|DURATION    |DURATION SECS|DURATION    |            |                |              |
                TOTAL|TOTAL       |       ACTIVE|ACTIVE      |SAMPLE_COUNT|CPU_SAMPLE_COUNT|PERCENTAGE_CPU|
        -------------|------------|-------------|------------|------------|----------------|--------------|
                    9|+0 00:00:09 |            9|+0 00:00:09 |           9|               9|           100|
        
        
        Other activity details
        -----------------------------------------------
        
                     |                                                                                |                |
                     |                                                                                |                |
        SQL_ID       |SQL_TEXT                                                                        |TOP_LEVEL_SQL_ID|
        -------------|--------------------------------------------------------------------------------|----------------|
        85dmsgqg3bh4w|SELECT MAX(ID) FROM T3 WHERE MOD_1000 = :B1                                     |9xtsjz8ufw721   |
        

        There are actually many more details shown to the right of the output that I’ve omitted for brevity here…

        I hope the formatting makes it through…

        Kind regards,
        Randolf

        Comment by Randolf Geist — March 22, 2019 @ 7:11 pm BST Mar 22,2019

  4. > Adobe Flash Player
    Does this produce anything useful for your example? :
    SELECT DBMS_AUTO_REPORT.REPORT_REPOSITORY_DETAIL(RID => :Report_ID, TYPE => 'HTML') FROM Dual;
    It should be static html without requiring flash.
    Maybe dbms_monitor.report_sql_monitor() also accepts HTML for static display and still has more info than TEXT?

    Comment by Racer I. — April 2, 2019 @ 8:52 am BST Apr 2,2019

    • Racer I.

      Thanks for the thought.
      I’ve just done a quick re-run and check.
      The 18.3 HTML output from report_sql_monitor is prettier than it used to be with older versions (which were basically unreadable) – but it’s very sparse in its content.
      The dbms_auto_report call (using the 'TEXT' option) simply replays the sql_monitor html which it has been capturing every few minutes.

      Comment by Jonathan Lewis — April 4, 2019 @ 9:38 am BST Apr 4,2019

