Oracle Scratchpad

March 29, 2016

Index Usage

Filed under: 12c,Exadata,HCC,in-memory,Indexing,Oracle,Performance — Jonathan Lewis @ 10:53 am BST Mar 29,2016

There are some questions about Oracle that are like the mythical Hydra – you think you’ve killed it, but for every head you cut off another two grow. The claim that “the optimizer will switch between using an index and doing a tablescan when you access more than X% of the data” re-appeared on the OTN database forum a little while ago – it doesn’t really matter what the specific value of X was – and it’s a statement that needs to be refuted very firmly because it’s more likely to cause problems than it is to help anyone understand what’s going on.

At a very informal level we may have an intuitive feeling that for a “precise” query accessing a “small” amount of data an indexed access path should make sense while for a “big” query accessing a “large” amount of data we might expect to see a tablescan, but any attempt to give a meaning to “small” and “large” that is both general purpose and strictly quantified will be wrong: there are too many variables involved.

Just as a quick demonstration of how badly we can be misled by a simple numeric claim here’s a quick test I created on a newly installed instance of 11.2.0.4, which I happened to set up with a locally defined tablespace using uniform extents of 1MB using the default 8KB blocksize but with manual (freelist) space management:


rem
rem     Script:   index_usage_pct.sql
rem     Dated:    March 2016
rem     Author:   J P Lewis
rem

drop table t1;

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id 
        from dual 
        connect by 
                level <= 1e4
)
select
        cast(rownum as number(8,0))                              id,
        cast(trunc(dbms_random.value(0,1e6)) as number(8,0))     n1,
        lpad(rownum,6,'0')              v1,
        rpad('x',10,'x')                small_vc,
        rpad('x',151,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;
begin dbms_stats.gather_table_stats( ownname => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

create index t1_i1 on t1(id);

spool index_usage_pct.lst

select  num_rows, blocks, avg_row_len, round(num_rows/blocks) rows_per_block
from    user_tables
where   table_name = 'T1'
;

set autotrace on explain
select count(v1) from t1 where id between 1 and 245000;
set autotrace off

spool off

I’ve created a table with 1 million rows; the rows are about 180 bytes long (you’ll see the sizes a few lines further down the page), so it’s not an unreasonable model for lots of tables in typical systems – if you want to experiment further you can adjust the rpad() in the padding column; and I’ve created an index on a sequentially  (rownum) generated column. My call to autotrace will produce a truthful execution plan for the query supplied – there’s no risk of unexpected type conversion and no problems from bind variable peeking. As you can easilky infer, my query will access 245,000 rows in the table of 1,000,000 – nearly a quarter of the table. Would you expect to see Oracle use the index ?

Here’s the output from the script on MY brand new database, instance, and tablespace:


  NUM_ROWS     BLOCKS AVG_ROW_LEN ROWS_PER_BLOCK
---------- ---------- ----------- --------------
   1000000      25642         180             39

1 row selected.


 COUNT(N1)
----------
    245000

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 269862921

--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     1 |    10 |  6843   (1)| 00:01:23 |
|   1 |  SORT AGGREGATE              |       |     1 |    10 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |   245K|  2392K|  6843   (1)| 00:01:23 |
|*  3 |    INDEX RANGE SCAN          | T1_I1 |   245K|       |   553   (1)| 00:00:07 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("ID">=1 AND "ID"<=245000)


There are no tricks involved here, no cunning fiddling with data structures or parameters – this is just a simple, straightforward, test.

Of course the result is probably a little counter-intuitive; 24.5% of the data seems a lot for the optimizer to pick an index. There are many reasons for this, the first being that the data is very well clustered relative to the index – the index’s clustering_factor is the smallest it could be for a B-tree indexing every row in this table.

Another important feature, though, is that I haven’t done anything with the system statistics so the optimizer was using various default values which tell it that a multiblock read will be quite small (eight blocks) and a lot slower than a single block read (26 ms vs. 12 ms). One simple change that many people might have made during or shortly after installation (though it shouldn’t really be done in any modern version of Oracle) is to set the db_file_multiblock_read_count parameter to 128 – with just this change the optimizer would assume that a multiblock read really would be 128 blocks, but that it would now take 266 ms. That means the optimizer will assume that the read will be ten times slower than it was, but will read 32 times as much data – a fairly significant relative improvement thanks to which the access path for my initial query will switch to a full tablescan and won’t switch back to an index range scan until I reduce the range from 245,000 to something like 160,000.

I can go further, of course. With a few background queries running to exercise the database I executed the dbms_stats.gather_system_stats() procedure with the ‘start’ and ‘stop’ options to collect some figures about the hardware and expected workload. This gave me the following results,  interpreted from the sys.aux_stats$ table:


MBRC       :126
MREADTIM   :0.902
SREADTIM   :0.386
CPUSPEED   :976

With the optmizer using these figures to compare the relative speed and size of single and multiblock reads I had to reduce my selected range to roughly 51,000 before the optimizer would choose the index range scan.

I could go on to demonstrate the effects of the dbms_resource_manager.calibrate_io procedure and the effects of allowing different degrees of parallelism with different system stats, but I think I’ve probably made the point that there’s a lot of variation in the break point between index range scans and tablescans EVEN when you don’t change the data. With this very well ordered (perfect clustering_factor) data I’ve seen the break point vary between 51,000 rows and 245,000 rows (5% and 25%).

And finally …

Let’s just finish with a last (and probably the most important) variation:  changing the pattern in the data we want from perfectly clustered to extremely scattered. If you check the query that generated the data you’ll see that we can do this by creating an index on column n1 instead of column id, and changing the where clause in the test query to n1 between 1 and 4500 (which, in my case, returned slightly more that 4,500 rows thanks to a small amount of duplication generated by the call to dbms_random.value()). At slightly under 0.5% of the data (and with my most recent settings for the system statistics) the optimizer chose to use a tablescan.

Remember, there are many factors involved in the optimizer choosing between a tablescan and an index range scan and one of the most significant factors in the choice is the (apparent) clustering of the data so, if you haven’t come across it before, you should examine the “table_cached_blocks” option that appeared in 11.2.0.4 for the procedure dbms_stats.set_table_prefs() as this allows you to give the optimizer a better idea of how well your data really is clustered.

Addendum (April 2016)

Following on from the demonstration of how changes in parameters, patterns and statistics can make a difference in what we (or the optimizer) might consider a “small” amount of data and whether an indexed access path would be appropriate, it’s worth mentioning that the Exadata technologies of smart scans and hybrid columnar compression and Oracle’s latest technology of In-Memory Colum Store do not change the way you think about indexes – they only change the (unspecifiable) volume at which an index ceases to be the better option to use.

 

March 14, 2016

Quiz

Filed under: Oracle,Performance — Jonathan Lewis @ 9:38 pm BST Mar 14,2016

Can you spot anything that might appear to be a little surprising about this (continuous) extract from a trace file ? The example is from 10.2.0.5, but 11g and 12c could produce similar results (or direct path read equivalents):


PARSING IN CURSOR #3 len=23 dep=0 uid=30 oct=3 lid=30 tim=112607775392 hv=4235652837 ad='2f647320'
select count(*) from t1
END OF STMT
PARSE #3:c=0,e=53,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=112607775385
EXEC #3:c=0,e=99,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=112607787370
WAIT #3: nam='SQL*Net message to client' ela= 9 driver id=1111838976 #bytes=1 p3=0 obj#=10443 tim=112607789931
WAIT #3: nam='db file sequential read' ela= 415 file#=5 block#=9 blocks=1 obj#=21580 tim=112607795682
WAIT #3: nam='db file scattered read' ela= 2785 file#=5 block#=905 blocks=98 obj#=21580 tim=112607801263
WAIT #3: nam='db file scattered read' ela= 2919 file#=5 block#=777 blocks=128 obj#=21580 tim=112607808280
WAIT #3: nam='db file scattered read' ela= 2066 file#=5 block#=649 blocks=128 obj#=21580 tim=112607813300
WAIT #3: nam='db file scattered read' ela= 1817 file#=5 block#=521 blocks=128 obj#=21580 tim=112607817243
WAIT #3: nam='db file scattered read' ela= 1563 file#=5 block#=393 blocks=128 obj#=21580 tim=112607820899
WAIT #3: nam='db file scattered read' ela= 1605 file#=5 block#=265 blocks=128 obj#=21580 tim=112607824710
WAIT #3: nam='db file scattered read' ela= 1529 file#=5 block#=137 blocks=128 obj#=21580 tim=112607828296
WAIT #3: nam='db file scattered read' ela= 1652 file#=5 block#=10 blocks=127 obj#=21580 tim=112607831946
FETCH #3:c=15625,e=41568,p=994,cr=996,cu=0,mis=0,r=1,dep=0,og=1,tim=112607834004
WAIT #3: nam='SQL*Net message from client' ela= 254 driver id=1111838976 #bytes=1 p3=0 obj#=21580 tim=112607835527
FETCH #3:c=0,e=3,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,tim=112607836780
WAIT #3: nam='SQL*Net message to client' ela= 2 driver id=1111838976 #bytes=1 p3=0 obj#=21580 tim=112607837935
WAIT #3: nam='SQL*Net message from client' ela= 14371 driver id=1111838976 #bytes=1 p3=0 obj#=21580 tim=112607853526
=====================
PARSING IN CURSOR #2 len=52 dep=0 uid=30 oct=47 lid=30 tim=112607855239 hv=1029988163 ad='2f6c5ec0'
BEGIN DBMS_OUTPUT.GET_LINES(:LINES, :NUMLINES); END;
END OF STMT
PARSE #2:c=0,e=51,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=112607855228
WAIT #2: nam='SQL*Net message to client' ela= 4 driver id=1111838976 #bytes=1 p3=0 obj#=21580 tim=112607861803
EXEC #2:c=0,e=1271,p=0,cr=0,cu=0,mis=0,r=1,dep=0,og=1,tim=112607862976
WAIT #2: nam='SQL*Net message from client' ela= 1093883 driver id=1111838976 #bytes=1 p3=0 obj#=21580 tim=112608958078
STAT #3 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=996 pr=994 pw=0 time=41588 us)'
STAT #3 id=2 cnt=8332 pid=1 pos=1 obj=21580 op='TABLE ACCESS FULL T1 (cr=996 pr=994 pw=0 time=144816 us)'

Update

If you look at the values for block# in the “db file scattered read” waits you’ll notice that they appear in descending order. This looks like a tablescan running backwards – and that’s not a coincidence, because that’s what it is.

It’s obviously a good strategy to have because if you do a big tablescan it’s the blocks at the END of the table which are mostly likely to be subject to change by other sessions [unless, see comment 4, you’ve done a purge of historic data] and the longer it takes you to get there the more work you’ll have to do to get consistent read versions of the last blocks in the table, so reading the last blocks first should, generally, reduce the workload – and the risk of ORA-01555: snapshot too old. Strangely it’s not documented – but it’s been around for years – at least since 10.2.0.5, if not earlier releases of 10g, through event 10460.

The topic came up in a conversation on the Oracle-L list server a few years ago, with Tanel Poder supplying the event number, but I had forgotten about it until I rediscovered the thread by accident a little while ago.

It’s not a supported feature, of course – but if you run into serious performance problems with tablescans doing lots of work with the undo tablespace (physical reads, lots of undo records applied for consistent read, etc.) while a lot of update activity is going on, then have a chat with Oracle support to see if it’s an allowed workaround.

 

 

February 15, 2016

Connect By

Filed under: Execution plans,Hints,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 2:01 pm BST Feb 15,2016

I received an email a couple of days ago that was a little different from usual – although the obvious answer was “it’s the data”. A connect by query with any one of several hundred input values ran in just a few seconds, but with one specific input it was still running 4,000 seconds later using the same execution plan – was this a bug ?

There’s nothing to suggest that it should be, with skewed data anything can happen: even a single table access by exact index could take 1/100th of a second to return a result if there was only one row matching the requirement and 1,000 seconds if there were 100,000 rows in 100,000 different table blocks (and the table was VERY big). The same scaling problem could be true of any type of query – and “connect by” queries can expose you to a massive impact because their run time can increase geometrically as the recursion takes place.

So it was easy to answer the question – no it’s (probably) not a bug, check the data for that one value.

Then I decided to build a simple model. The original email had a four table join, but I just created a single table, and used a “no filtering” connect by which I had to hint. Here’s some code I ran on 11.2.0.4:


rem
rem     script: connect_by_skew.sql
rem     dated:  Feb 2016
rem     Last tested:
rem             12.1.0.2
rem

create table t1 nologging 
as
select 
        rownum id_p, 10 * rownum id
from
        all_objects
where 
        rownum <= 50000 ; execute dbms_stats.gather_table_stats(user,'t1', method_opt=>'for all columns size 1')

alter system flush shared_pool;

set serveroutput off
alter session set statistics_level = all;

select sum(ct) 
from    (
        select
                /*+ no_connect_by_filtering */
                count(id) ct
        from
                t1
        connect by
                id = 20 * prior id_p
        start with
                id_p = 1
        group by
                id
)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));

update t1 set id_p = 0
where   id_p = 1
;

update t1 set id_p = 1
where   id_p > 45000
;

select sum(ct) 
from    (
        select
                /*+ no_connect_by_filtering */
                count(id) ct
        from
                t1
        connect by
                id = 20 * prior id_p
        start with
                id_p = 1
        group by
                id
)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));

The sum() of the inline aggregate view emulates the original code – I don’t know what it was for, possibly it was a way of demonstrating the problem without producing a large output, I just copied it.

As you can see in my script every parent id (id_p) starts out unique, and if I look at the pattern of the raw data identified by the recursion from id_p = 1 (rather than looiking at the result of the actual query) this is what I’d get:

      ID_P         ID
---------- ----------
         1         10
         2         20
         4         40
         8         80
        16        160
        32        320
        64        640
       128       1280
       256       2560
       512       5120
      1024      10240
      2048      20480
      4096      40960
      8192      81920
     16384     163840
     32768     327680

When I modify the data so that I have exactly 5,000 rows with id_p = 1 the initial data generation will be 80,000 rows of data. If you want to try setting id_p = 1 for more rows make sure you do it to rows where id_p is already greater than 32768 or you’ll run into Oracle error ORA-01436: CONNECT BY loop in user data.

Here’s the execution plan, with rowsource execution stats I got for the first query (running 11.2.0.4):


-----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |      |      1 |        |    32 (100)|      1 |00:00:00.44 |     103 |       |       |          |
|   1 |  SORT AGGREGATE                            |      |      1 |      1 |            |      1 |00:00:00.44 |     103 |       |       |          |
|   2 |   VIEW                                     |      |      1 |      2 |    32   (7)|     16 |00:00:00.44 |     103 |       |       |          |
|   3 |    HASH GROUP BY                           |      |      1 |      2 |    32   (7)|     16 |00:00:00.44 |     103 |  1519K|  1519K| 1222K (0)|
|*  4 |     CONNECT BY NO FILTERING WITH START-WITH|      |      1 |        |            |     16 |00:00:00.44 |     103 |       |       |          |
|   5 |      TABLE ACCESS FULL                     | T1   |      1 |  50000 |    31   (4)|  50000 |00:00:00.10 |     103 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, this took 0.44 seconds, generated the expected 16 rows (still visible up to operation 2) which it then counted. Oracle followed the same execution plan when I set 5,000 rows to the critical value – here’s the new run-time plan:


-----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |      |      1 |        |    32 (100)|      1 |00:05:39.25 |     103 |       |       |          |
|   1 |  SORT AGGREGATE                            |      |      1 |      1 |            |      1 |00:05:39.25 |     103 |       |       |          |
|   2 |   VIEW                                     |      |      1 |      2 |    32   (7)|   5015 |00:05:39.24 |     103 |       |       |          |
|   3 |    HASH GROUP BY                           |      |      1 |      2 |    32   (7)|   5015 |00:05:39.22 |     103 |  5312K|  2025K| 1347K (0)|
|*  4 |     CONNECT BY NO FILTERING WITH START-WITH|      |      1 |        |            |  80000 |00:05:38.56 |     103 |       |       |          |
|   5 |      TABLE ACCESS FULL                     | T1   |      1 |  50000 |    31   (4)|  50000 |00:00:00.09 |     103 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------

As expected, 80,000 rows generated (5,000 * 16), aggregated down to 5,015, then aggregated again to the one row result. Time to complete: 5 minutes 39 seconds – and it was all CPU time. It’s not entirely surprising – a single recursive descent (with startup overheads) took 0.44 seconds – presumably a fairly large fraction of that was startup, but even 0.1 seconds adds up if you do it 5,000 times.

Everybody knows that skewed data can produced extremely variable response times. With a deeper tree and more rows with the special value it wouldn’t be hard for the total run time of this query to get to the 4,000 seconds reported in the original email. (I also tried running with 10,000 rows set to 1 and the run time went up to 18 minutes – of which a large fraction was reading from the TEMPORARY tablespace because something had overflowed to disc).

Was there a solution ?

I don’t know – but I did suggest two options
a) create a histogram on the data to show that there was one particular special value; since the code seemed to include literals perhaps the optimizer would notice the special case and choose a different plan.
b) hint the code to use a different strategy – the hint would be /*+ connect_by_filtering */. Here’s the resulting execution plan:


---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |      |      1 |        |    95 (100)|      1 |00:00:06.50 |    1751 |       |       |          |
|   1 |  SORT AGGREGATE              |      |      1 |      1 |            |      1 |00:00:06.50 |    1751 |       |       |          |
|   2 |   VIEW                       |      |      1 |      2 |    95   (6)|   5015 |00:00:06.49 |    1751 |       |       |          |
|   3 |    HASH GROUP BY             |      |      1 |      2 |    95   (6)|   5015 |00:00:06.47 |    1751 |  5312K|  2025K| 1346K (0)|
|   4 |     CONNECT BY WITH FILTERING|      |      1 |        |            |  80000 |00:00:06.30 |    1751 |   337K|   337K|  299K (0)|
|*  5 |      TABLE ACCESS FULL       | T1   |      1 |      1 |    31   (4)|   5000 |00:00:00.01 |     103 |       |       |          |
|*  6 |      HASH JOIN               |      |     16 |      1 |    63   (5)|     15 |00:00:05.98 |    1648 |  1969K|  1969K|  741K (0)|
|   7 |       CONNECT BY PUMP        |      |     16 |        |            |     16 |00:00:00.01 |       0 |       |       |          |
|   8 |       TABLE ACCESS FULL      | T1   |     16 |  50000 |    31   (4)|    800K|00:00:01.49 |    1648 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------

We get the result in 6.5 seconds! [UPDATE: but there’s a nice explanation for that – most of the time comes from the work done gathering rowsource execution statistics; with statistics_level set back to typical the run time dropped to 0.19 seconds.]

February 5, 2016

Parallel DML

Filed under: Execution plans,Oracle,Parallel Execution,Performance — Jonathan Lewis @ 1:02 pm BST Feb 5,2016

A recent posting on OTN presented a performance anomaly when comparing a parallel “insert /*+ append */” with a parallel “create table as select”.  The CTAS statement took about 4 minutes, the insert about 45 minutes. Since the process of getting the data into the data blocks would be the same in both cases something was clearly not working properly. Following Occam’s razor, the first check had to be the execution plans – when two statements that “ought” to do the same amount of work take very different times it’s probably something to do with the execution plans – so here are the two statements with their plans:

First the insert, which took 45 minutes:

insert  /*+ append parallel(a,16) */ into    
        dg.tiz_irdm_g02_cc  a
select
        /*+ parallel (a,16) parallel (b,16) */ 
        *
from    tgarstg.tst_irdm_g02_f01 a, 
        tgarstg.tst_irdm_g02_f02 b
where   a.ip_id = b.ip_id
;

------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |        |      |            |
|   1 |  LOAD AS SELECT                  | TIZ_IRDM_G02_CC  |       |       |       |            |          |        |      |            |
|   2 |   PX COORDINATOR                 |                  |       |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10002         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,02 | P->S | QC (RAND)  |
|*  4 |     HASH JOIN BUFFERED           |                  |    13M|    36G|   921M|   127K  (1)| 00:00:05 |  Q1,02 | PCWP |            |
|   5 |      PX RECEIVE                  |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,02 | PCWP |            |
|   6 |       PX SEND HASH               | :TQ10000         |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   7 |        PX BLOCK ITERATOR         |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F02 |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |      PX RECEIVE                  |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,02 | PCWP |            |
|  10 |       PX SEND HASH               | :TQ10001         |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  11 |        PX BLOCK ITERATOR         |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWC |            |
|  12 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F01 |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------

And here’s the ‘create table’ at 4:00 minutes:

create table dg.tiz_irdm_g02_cc 
nologging 
parallel 16 
compress for query high 
as
select
        /*+ parallel (a,16) parallel (b,16) */ 
        *
from    tgarstg.tst_irdm_g02_f01 a , 
        tgarstg.tst_irdm_g02_f02 b 
where
        a.ip_id = b.ip_id

------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |                  |    13M|    36G|       |   397K  (1)| 00:00:14 |        |      |            |
|   1 |  PX COORDINATOR                  |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10002         |    13M|    36G|       |   255K  (1)| 00:00:09 |  Q1,02 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT                | TIZ_IRDM_G02_CC  |       |       |       |            |          |  Q1,02 | PCWP |            |
|*  4 |     HASH JOIN                    |                  |    13M|    36G|  1842M|   255K  (1)| 00:00:09 |  Q1,02 | PCWP |            |
|   5 |      PX RECEIVE                  |                  |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,02 | PCWP |            |
|   6 |       PX SEND HASH               | :TQ10000         |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   7 |        PX BLOCK ITERATOR         |                  |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F02 |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |      PX RECEIVE                  |                  |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,02 | PCWP |            |
|  10 |       PX SEND HASH               | :TQ10001         |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,01 | P->P | HASH       |
|  11 |        PX BLOCK ITERATOR         |                  |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,01 | PCWC |            |
|  12 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F01 |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------

As you can see, the statements are supposed to operate with degree of parallelism 16, and we were assured that the pre-existing table had been declared as nologging with the same level of compression as that given in the CTAS so, assuming the queries did run with the degree expected, they should take virtually the same amount of time.

But there’s an important clue in the plan about why there was a difference, and why the difference could be so great. The first statement is DML, the second is DDL. Parallel DDL is automatically enabled, parallel DML has to be enabled explicitly otherwise the select will run in parallel but the insert will be serialized. Look at operations 1 – 4 of the insert – the query co-ordinator does the “load as select” of the rowsource sent to it by the parallel execution slaves. Not only does this mean that one process (rather than 16) does the insert, you also have all the extra time for all the messaging and the hash join (at line 4) has to be buffered – which means a HUGE amount of data could have been dumped to disc by each slave prior to the join actually taking place and then been read back from disc, joined, and forwarded.

Note that the hash join in the CTAS is not buffered – each slave does the join as the data arrives and writes the result directly to its local segment. Basically the insert could be doing something like twice the I/O of the CTAS (and this is Exadata, so reads from temp can be MUCH slower than the tablescans that supply the data to be joined).

So the OP checked, and found that (although he thought he had enabled parallel DML) he hadn’t actually done so. And after enabling parallel DML the timing was … just as bad. Ooops!! Something else must have gone wrong. Here’s the plan after enabling parallel DML:


--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                   |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |        |      |            |
|   1 |  PX COORDINATOR                    |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10003         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT                  | TIZ_IRDM_G02_CC  |       |       |       |            |          |  Q1,03 | PCWP |            |
|   4 |     PX RECEIVE                     |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,03 | PCWP |            |
|   5 |      PX SEND RANDOM LOCAL          | :TQ10002         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,02 | P->P | RANDOM LOCA|
|*  6 |       HASH JOIN BUFFERED           |                  |    13M|    36G|   921M|   127K  (1)| 00:00:05 |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE                  |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX SEND HASH               | :TQ10000         |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR         |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F02 |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |        PX RECEIVE                  |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,02 | PCWP |            |
|  12 |         PX SEND HASH               | :TQ10001         |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  13 |          PX BLOCK ITERATOR         |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWC |            |
|  14 |           TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F01 |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------------------------

As you can see, line 3 has the LOAD AS SELECT after which the slaves message the query co-ordinator – so the DML certainly was parallel even though it wasn’t any faster. But why is the hash join (line 6) still buffered, and why is there an extra data flow (lines 5 and 4 – PX SEND RANDOM LOCAL / PX RECEIVE). The hash join has to be buffered because of that extra data flow (which suggests that the buffering and messaging could still be the big problem) – but WHY is the data flow there at all, it shouldn’t be.

At this point I remembered that the first message in the thread had mentioned testing partitioned tables as well as non-partitioned tables – and if you do a parallel insert to a partitioned table and the data is going to be spread across several partitions, and the number of partitions is not a good match for the degree of parallelism then you’re likely to an extra stage of data distribution as Oracle tries to share the data and the partitions as efficiently as possible across slaves. One of the possible distribution methods is “local random” – which is fairly likely to appear if the number of slaves is larger than the number of partitions. This behaviour can be modified with the newer “single distribution” version of the pq_distribute hint. So I asked the OP if their latest test was on a partitioned table, and suggested they insert the hint /*+ pq_distribute(a none) */ just after the parallel hint.

The answer was yes, and the hint had the effect of dropping the run time down to 7 minutes – still not as good as the CTAS, but then the CTAS wasn’t creating a partitioned table so it’s still not a completely fair test. Here’s the (start of the) final plan:

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                   |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |        |      |            |
|   1 |  PX COORDINATOR                    |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10002         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,02 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT                  | TIZ_IRDM_G02_CC  |       |       |       |            |          |  Q1,02 | PCWP |            |
|*  4 |     HASH JOIN                      |                  |    13M|    36G|   921M|   127K  (1)| 00:00:05 |  Q1,02 | PCWP |            |

As you can see, we have a hash join that is NOT buffered; we don’t have a third distribution, and the slaves do the data load and then message the query co-ordinator.

It would be interesting to know if there was a significant skew in the data volumes that went into each partition of the partitioned table, and check where the time was spent for both the partitioned insert and the non-partitioned CTAS (and compare with a non-partitioned insert) – but real-world DBAs don’t necessarily have all the time for investigations that I do.

My reference: parallel_dml.sql

February 3, 2016

Hinting

Filed under: Hints,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 1:04 pm BST Feb 3,2016

This is just a little example of thinking about hinting for short-term hacking requirements. It’s the answer to a question that came up on the Oracle-L listserver  a couple of months ago (Oct 2015) and is a convenient demonstration of a principle that can often (not ALWAYS) be applied as a response to the problem: “I can make this query work quickly once, how do I make it work quickly when I make it part of a join ?”

The question starts with this query, which returns “immediately” for any one segment:


SELECT DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM dba_extents DE
WHERE 1=1
  AND DE.OWNER           = <owner>
  AND DE.SEGMENT_NAME    = <segment_name>
  AND DE.segment_type    = <segment_type>
  AND DE.tablespace_name = <tablespace_name>
  AND DE.partition_name  = <max_partition_name>
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME
;

But the email then goes on to explain: “I’ve got a table of values that I need to use as a list of inputs for this query, but when I do the join it takes ages to complete; how do I make the join run quickly?”

Here’s the initial code:


WITH SEGMENT_LIST AS
(
  select * from (
   SELECT /*+ materialize cardinality(100) */
           owner, segment_name, segment_type, tablespace_name,
           MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
SELECT
       DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM SEGMENT_LIST SL, dba_extents DE
WHERE 1=1
  AND DE.OWNER           = SL.OWNER
  AND DE.SEGMENT_NAME    = SL.SEGMENT_NAME
  AND DE.segment_type    = SL.segment_type
  AND DE.tablespace_name = SL.tablespace_name
  AND DE.partition_name  = SL.max_partition_name
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME

What we’ve got is a query where the user’s reference table generates a list of segments (the rownum < 2 was a temporary test) and we want the detail query to run for each segment identified. The “for each segment” gives us a clue that what we want to see is a simple nested loop join, driven by the factored subquery, with the very efficient query above running as the “second table of the nested loop”.

What I failed to notice at the time is that the /*+ materialize */ hint was in the wrong place, it should have been placed after the outer (i.e. very first) select, and it’s possible that if it had been in the right place the user would have got the plan they wanted – especially in the later versions of Oracle. As it was I suggested that we merely need to tell the optimizer to:

Visit the “tables” in the order (segment_list, dba_extents), and do a nested loop into (dba_extents), but since both segment_list and dba_extents were views we needed to stop Oracle from trying to merge them and play silly games with distinct aggregate placement, etc. by including directives that the views should not be merged, but then we might need to explain to Oracle that it would have to push the join predicate between segment_list and dba_extents inside the dba_extents view.

In other words, a list of 4 hints, as shown below:


WITH SEGMENT_LIST AS
(
  select * from (
   SELECT /*+ materialize cardinality(100) */
           owner, segment_name, segment_type, tablespace_name,
           MAX(partition_name) MAX_PARTITION_NAME
   FROM my_custom_table
   GROUP BY owner, segment_name, segment_type, tablespace_name
  ) where rownum < 2
)
SELECT /*+
        leading(sl de)
        no_merge(sl)
        no_merge(de)
        push_pred(de)
        */
       DE.TABLESPACE_NAME, DE.OWNER,DE.SEGMENT_NAME,
       MAX(DE.BYTES) LARGEST_EXTENT_BYTES
FROM SEGMENT_LIST SL, dba_extents DE
WHERE 1=1
  AND DE.OWNER           = SL.OWNER
  AND DE.SEGMENT_NAME    = SL.SEGMENT_NAME
  AND DE.segment_type    = SL.segment_type
  AND DE.tablespace_name = SL.tablespace_name
  AND DE.partition_name  = SL.max_partition_name
GROUP BY DE.TABLESPACE_NAME, DE.OWNER, DE.SEGMENT_NAME

According to a follow-up email, this was sufficient.  The OP had actually tried variations on the leading() and use_nl() hints – but without the no_merge() hint the optimizer was probably rewriting the SQL in a way that put the hints out of context. It’s worth noting that the /*+ materialize */ hint is in the wrong place – it should be after the first appearance of the SELECT keyword in the factored subquery – and that probably added to the difficulty of getting the desired execution plan.

For a production system I’d probably want to do something a little more sophisticated in terms of stability once I’d got the plan I wanted – but this looks like a DBA query used to run an ad hoc report, so perhaps this solution is good enough for the current requirement.

 

January 18, 2016

Drop Column

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 8:14 am BST Jan 18,2016

I published a note on AllthingsOracle a few days ago discussing the options for dropping a column from an existing table. In a little teaser to a future article I pointed out that dropping columns DOESN’T reclaim space; or rather, probably doesn’t, and even if it did you probably won’t like the way it does it.

I will  be writing about “massive deletes” for AllthingsOracle in the near future, but I thought I’d expand on the comment about not reclaiming space straight away. The key point is this – when you drop a column you are probably dropping a small fraction of each row. (Obviously there are some extreme variants on the idea – for example, you might have decided to move a large varchar2() to a separate table with shared primary key).

If you’ve dropped a small fraction of each row you’ve freed up a small fraction of each block, which probably means the block hasn’t been identified as having available free space for inserts. In many cases this is probably  a good thing – because it’s quite likely the if every block in your table is suddenly labelled as having sufficient free space for new row then you could end up with a difficult and ongoing performance problem.

Many large tables have a “time-based” component to their usage – as time passes the most recently entered rows are the ones that get most usage, and older rows are no longer accessed; this means you get a performance benefit from caching because the most useful fractions of such tables are often well cached and the “interesting” data is fairly well clustered.

In a case like this, imagine what will happen if EVERY block in your table suddenly acquires enough free space to accept a couple of new rows – over the next few days the incoming data will be spread across the entire length of the table, and for the next couple of months, or years, you will have to keep the entire table cached in memory if the performance is to stay constant; moreover the clustering_factor of the most useful indexes is likely to jump from “quite small” to “absolutely massive”, and the optimizer will start changing lots of plans because it will decide that your favourite indexes are probably much to expensive to user.

I am, of course, painting a very grim picture – but it is a possible scenario that should be considered before you drop a column from a table. Combined with my observations about the locking and overheads of dropping a column you might (probably ought to) decide that you should never drop a column you should only mark it as unused or (better still if you’re on 12c) mark it invisible for a while before marking it unused. You can worry about space reclamation at a later date when you considered all the ramifications of how it might impact on performance.

Footnote: If you’re still using freelist management then dropping a column won’t put a block on the freelist until the total used space in the block falls below the value dictated by pctused (default 40%); if you’re using ASSM then the block doesn’t become available for reuse until (by default) the free space exceeds 25% of the block’s usable space.

 

 

October 9, 2015

PL/SQL Functions

Filed under: 12c,Oracle,Performance — Jonathan Lewis @ 6:17 pm BST Oct 9,2015

Assuming everything else has been tuned to perfection, what’s the best you can do while calling PL/SQL functions from SQL ? Here’s a little code to create a table with some data, and a function that we can use to start an investigation:


create table t1
nologging
pctfree 0
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(1,1001))        n1
from
        generator       v1,
        generator       v2
where
        rownum <= 3e5
;

create or replace function func_normal(i_in number)
return number
as
begin
        return round(exp(ln(i_in)));
end;
/

That’s 300,000 rows in the table and a silly little function to use up some CPU to get nowhere. There are 1,000 distinct integer values scattered uniformly through the table, and the function returns the value it’s called with – but it does it a hard way.

Here’s some test code – with “set timing on” – followed by the results:


select count(distinct(n1)) from t1;

select 
        count(distinct(func_normal(n1)))
from    t1
;



COUNT(DISTINCT(N1))
-------------------
               1000

1 row selected.

Elapsed: 00:00:00.03

COUNT(DISTINCT(FUNC_NORMAL(N1)))
--------------------------------
                            1000

1 row selected.

Elapsed: 00:00:11.39

This is running on 12.1.0.2 on a machine with a CPU speed of aboaut 3 MHz – you may want to adjust the number of rows in the table for your own testing.

The question is, what options can we use to improve the efficiency of the calls to the PL/SQL. One option, of course, is to use the new 12c “with PL/SQL” clause – we could embed the function in the SQL like this:


with
        function func_with (
                i_in    number
        ) return number
        is
        begin
                return round(exp(ln(i_in)));
        end;
select
        count(distinct(func_with(n1)))
from    t1
/

COUNT(DISTINCT(FUNC_WITH(N1)))
------------------------------
                          1000

1 row selected.

Elapsed: 00:00:09.77

So, for the cost of copying the function into the SQL we get a 10% improvement in performance – which we could reasonably attribute to an improved efficiency in the call mechanism. There are arguments for and against copying code like this, of course, and my improvement was only 1 second for 300,000 calls, but you may decide that the benefit of the “With-PL/SQL” method is sufficient to justify the effort.

If you don’t want to copy a PL/SQL function into the SQL, though, there is another alternative – the pragma UDF (for user-defined function) – which has the effect of reducing the cost of the so-called “context switch” as you call from SQL to PL/SQL or vice versa.

Technically I think what’s happening is that the stack formats for SQL and PL/SQL are different and the context switch is the work needed to reformat the stack as you pass from one environment to the other – by declaring the function as UDF you probably set it up to deal with the incoming SQL stack itself.

Here’s how to add the UDF pragma to the function, and the results showing the effects:


create or replace function func_udf(i_in number)
return number
as
pragma UDF;
begin
        return round(exp(ln(i_in)));
end;
/

COUNT(DISTINCT(FUNC_UDF(N1)))
-----------------------------
                         1000

1 row selected.

Elapsed: 00:00:09.55

With pragma UDF the standalone function call is slightly faster than the in-line “WITH” function. I did wonder whether the pragma UDF would make the function call slower if I simply called it from a loop in a PL/SQL block, but there didn’t seem to be any significant difference between the normal function and the UDF function.

Initially, then, it looks like UDF is faster than WITH, which is faster than basic; but there are other considerations. My sample data has only 1,000 possible inputs to the function – and Oracle has three different methods for caching that I might be able to benefit from:

  • Declaring the function as deterministic
  • Putting the function into the PL/SQL result cache
  • Modifying the SQL to take advantage of scalar subquery caching

Here’s what the function declaration looks like if I want to use the PL/SQL function cache:


create or replace function func_cached(i_in number)
return number
result_cache
as
begin
        return round(exp(ln(i_in)));
end;
/

Changing my query to use func_cached() the query completed in 1.65 seconds – a clear winner, but can anything else get close.

To make the functions deterministic, I just have to add the word “deterministic” after the declaration of the return type:


create or replace function func_normal(i_in number) 
return number
deterministic
as
begin
        return round(exp(ln(i_in)));
end;
/

We can also add the deterministic keyword to the function defined in the WITH clause. Before reporting the results of testing the functions with determinism, there’s one more strategy to consider. Remove the deterministic key word from the functions, and introduce a scalar subquery to the test query, e.g.:


select
        count(distinct(select func_normal(n1) from dual))
from    t1

;

Here’s a table of results:

Method Initial Deterministic Scalar Subquery
Basic 11.39 4.30 4.32
With 9.77 9.72 3.60
UDF 9.55 9.57 3.57
Cached 1.65 0.71 0.72

Summary

Before stating any conclusions it’s worth remembering that the sample data is based on a relatively small number of distinct input values. It is the repetition that allows us to benefit from things like caching and determinism. On top of that we need to consider the scale of the time-saving in light of the number of calls made (or not made).

Having said that, the PL/SQL function cache is clearly the thing that gives us the most benefit in return for a simple implementation. We should remember that the cache is “public” – i.e. stored in the SGA – and each input value takes up another bit of the public result cache: on the plus side this means that everyone else calling this function gets the benefit of our prior calculation; on the minus side this means if we use the result cache for the wrong function then we could take up a lot of space in the cache for very little benefit. Since the relevant result cache latch has not child latches it’s also possible to end up with latch contention if the too many sessions are taking advantage of the result cache too frequently.

Although we might expect a deterministic function to give us a big benefit (in the “very repetitive inputs” case), we find that the deterministic keyword has no effect (as at 12.1.0.2) in functions declared in a WITH clause or declared as standalone with pragma UDF. Fortunately scalar subquery caching (which seems to use the same hashing algorithm as the deterministic caching algorithm) still works with WITH functions or UDF functions and (slightly surprisingly) standalone functions declared with pragma UDF seem to have a very small edge over WITH functions.

Both of the latter two approaches use a local cache with a strictly limited size. The benefit is that the locality means they won’t interfere with other sessions or end up hogging an extreme amount of a public memory area; the drawback is that the size limit (which can be adjusted with a hidden parameter) means that you can get unlucky with hash collisions and end up with extremely variable performance from day to day because of a small change in the data being processed, or even a change in the order in which an unchanged data set is processed.

Footnote

I’ll leave you with one thought. Without declaring the original function as deterministic, I ran the following query to maximise the benefit from the scalar subquery caching algorithm:


select
        count(distinct(select func_normal(n1) from dual))
from    (
        select /*+ no_eliminate_oby */  n1 from t1 order by n1
        )
;

The query completed in 0.13 seconds: the subquery was called 1,000 times (once for each distinct value – see this ancient URL), and the benefit of eliminating the function calls outweighed the cost of having to sort the data set.

Ref: 12c_function_options.sql

September 28, 2015

Result Cache 2

Filed under: 12c,Flashback,Oracle,Performance — Jonathan Lewis @ 8:50 am BST Sep 28,2015

Following on from my earlier posting of problems with temporary table and the PL/SQL result cache (a combination which the manuals warn you against) here’s another problem – again, to a large degree, self-inflicted.

Imagine you have a complex report involving a large number of financial transactions with a need to include calculations about current exchange rates. Unfortunately the rules about calculating the appropriate exchange rate for any transaction are complex and you find you have a choice between adding 6 tables with outer joins and a couple of aggregate (max) subqueries to the base query or calling a PL/SQL function to calculate the exchange rate for each row. I’m going to create an extremely simplified model of this requirement:

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        case mod(rownum,2) when 0 then 'USD' else 'GBP' end v1,
        case mod(rownum,2) when 0 then 'GBP' else 'USD' end v2
from
        generator       v1
where
        rownum <= 1e3
;

create table t2 (
        v1      varchar2(3),
        v2      varchar2(3),
        cvt     number(10,6),
        constraint t2_pk primary key(v1,v2)
)
organization index
;

insert into t2 values('USD','GBP',0);
insert into t2 values('GBP','USD',1);

commit;

create or replace function plsql_ordinary(
        from_cur        in varchar2,
        to_cur          in varchar2
)
return number is
        m_ret number;
begin
        select /*+ ordinary trace_this */
                cvt
        into    m_ret
        from    t2
        where   v1 = from_cur
        and     v2 = to_cur
        ;

        return m_ret;

end plsql_ordinary;
/

execute dbms_stats.gather_table_stats(user,'t2')

My t1 table represents the set of transactions but only has to worry about two exchange rates, the dollar/sterling and its inverse. My t2 table is an exchange rate table and I’ve loaded it with the two exchange rates I’m interested in. My function plsql_ordinary() takes two currency codes as inputs and returns the exchange rate.

Here’s the modelling query, with a little infrastructure to examine the workload. Note that this will only run on 12c because of the inline PL/SQL function I’ve used for slowing the query down.

set pagesize 20
set arraysize 6

set serveroutput off
alter system flush shared_pool;
alter session set statistics_level = all;

with
        function wait_row_n (
                i_secs          number,
                i_return        number default -1
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end wait_row_n;
select
        /*+ driver trace_this */
        wait_row_n(0.3,id),
        plsql_ordinary(v1,v2),
        (select /*+ scalar trace_this */ t2.cvt from t2 where t2.v1 = t1.v1 and t2.v2 = t1.v2) scalar_sub
from
        t1
where
        rownum <= 50
;

set pagesize 40

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

select  sql_id, executions, fetches, rows_processed, sql_text
from    v$sql
where   lower(sql_text) like 'select%trace_this%'
and     lower(sql_text) not like '%v$sql%'
;

The query includes a scalar subquery in the select list to get the same data as the PL/SQL function, and you’ll see the point of that in a while. Because of the arraysize and input parameters to wait_row_n() the query will produce output in batches of 6 rows roughly every two seconds for a total of about 18 seconds – which will give me plenty of time to play around in another session. Before I try to do any damage, though, let’s check the execution plan of the report and the statistics of the queries with “trace_this” in their text:


--------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |     50 |00:00:00.01 |      11 |
|*  1 |  INDEX UNIQUE SCAN | T2_PK |      2 |      1 |      2 |00:00:00.01 |       2 |
|*  2 |  COUNT STOPKEY     |       |      1 |        |     50 |00:00:00.01 |      11 |
|   3 |   TABLE ACCESS FULL| T1    |      1 |     50 |     50 |00:00:00.01 |      11 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."V1"=:B1 AND "T2"."V2"=:B2)
   2 - filter(ROWNUM<=50)


SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------
f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

Looking at operation 1 in the plan you can see that the scalar subquery has started just twice (once for each distinct combination of currency codes).
Looking at the results from v$sql you can see that the query in the PL/SQL function was executed 50 times – once for each row. Functions like this can be a massive drain of resources (typically CPU, but also through latch contention on buffer cache and shared pool latches).

Conveniently we realise that in our system the derived exchange rates change very slowly – so how about telling Oracle that the exchange rate function is a deterministic function (it’s nearly true), or better still, perhaps, experiment with the PL/SQL Function Result Cache.

(Note very carefully, however, that the Database Administrators’ Manual for 11.2 makes the following comment about using the deterministic keyword with PL/SQL functions)

DETERMINISTIC

Tells the optimizer that the function returns the same value whenever it is invoked with the same parameter values (if this is not true, then specifying DETERMINISTIC causes unpredictable results).

Look carefully at that “unpredictable” – it’s true … but you might not realise it until too late. Our PL/SQL function is NOT deterministic – after all a function that queries the database to produce a result may produce a different result every time it executes if someone keeps changing the underlying data – but we might wave our hands a bit and say that the table isn’t going to change while we’re running our report so it’s okay to pretend it’s deterministic, we might even make it a read-only table for the duration. Similar thoughts should cross our minds about declaring a PL/SQL function to the result cache – even though the manual doesn’t say anything quite so explicitly threatening about the result cache. But what the heck, let’s just do it and see what happens:

create or replace function plsql_result_cache(
        from_cur        in varchar2,
        to_cur          in varchar2
)
return number
result_cache
is
        m_ret number;
begin
        select /*+ result cache trace_this */
                cvt
        into    m_ret
        from    t2
        where   v1 = from_cur
        and     v2 = to_cur
        ;

        return m_ret;

end plsql_result_cache;
/

create or replace function plsql_deterministic(
        from_cur        in varchar2,
        to_cur          in varchar2
)
return number
deterministic
is
        m_ret number;
begin
        select /*+ det trace_this */
                cvt
        into    m_ret
        from    t2
        where   v1 = from_cur
        and     v2 = to_cur
        ;

        return m_ret;

end plsql_deterministic;
/

...
select
        /*+ driver trace_this */
        wait_row_n(0.3,id),
        plsql_ordinary(v1,v2),
        plsql_deterministic(v1,v2),
        plsql_result_cache(v1,v2),
        (select /*+ scalar trace_this */ t2.cvt from t2 where t2.v1 = t1.v1 and t2.v2 = t1.v2) scalar_sub
from
...

All three functions returned the same set of results as the scalar subquery – and here’s the output from v$sql showing the impact of declaring a deteministic function and a result cache function (note that “result cache” is not a hint in the first statement, it’s just a convenient label):


SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------------
49buxp3gba3cg          2          2              2
SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

2sh7bm59dkwhw         18         18             18
SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1


The simple function ran 50 times, the deteministic function ran 18 times, and the result cache function ran twice. It required just two executions to get the two distinct results needed, after which the session used the result cache rather than calling the function again.

The deterministic function only remembers its results for the duration of a single database call – in this case the fetch – so on each fetch the session has to re-populate the session’s “deterministic cache”, which takes 2 calls for each fetch, a total of 9 fetches * 2 calls = 18 calls.

Clearly, if I can get away with it safely, the PL/SQL function result cache looks like a real winner, with the deterministic option coming a close second (given that real life ought to be using a significantly larger fetch arraysize). So what could possibly go wrong ? Two things – first, the results … and if the results can go wrong there’s not much point in talking about the second thing at the moment.

My query runs for 18 seconds, I’m going to start another session while it runs and update one of the rows in the t2 table a few seconds after my long-running query starts. Here’s the SQL I’ve run, an extract from the query output, and the results from v$sql:


update  t2 set
        cvt = 2
where   v1 = 'USD' 
;

commit;


WAIT_ROW_N(0.3,ID) PLSQL_ORDINARY(V1,V2) PLSQL_DETERMINISTIC(V1,V2) PLSQL_RESULT_CACHE(V1,V2) SCALAR_SUB
------------------ --------------------- -------------------------- ------------------------- ----------
                 1                     1                          1                         1          1
                 2                     0                          0                         0          0
                 3                     1                          1                         1          1
                 4                     0                          0                         0          0
                 5                     1                          1                         1          1
                 6                     0                          0                         0          0
                 7                     1                          1                         1          1
                 8                     0                          0                         0          0
                 9                     1                          1                         1          1
                10                     2                          0                         2          0
                11                     1                          1                         1          1
                12                     2                          0                         2          0
                13                     1                          1                         1          1
                14                     2                          2                         2          0
                15                     1                          1                         1          1
                16                     2                          2                         2          0


SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------------
49buxp3gba3cg          4          4              4
SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

2sh7bm59dkwhw         18         18             18
SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

The most important point is that we’ve got results that are not self-consistent – except for the scalar subquery results.

The SQL statements that are executed inside the PL/SQL functions do not run at the SCN of the driving query, each individual statement executes at its own starting SCN. This is an important point that is often overlooked when people write PL/SQL functions that are then called from SQL. The inline scalar subquery, on the other hand, always runs as at the start SCN of the driving query no matter how many times or how late in the lifetime of the driving query it runs.

If we examine the results we can see that the ordinary PL/SQL function and the result cache PL/SQL function changed their output the moment the commit took place (you’ll have to take my word on that, of course), but the deterministic function seemed to delay slightly. We can also see that the number of executions for the ordinary and deterministic functions didn’t change, but the result cache function doubled its executions.

Because of the way I’ve created my data and defined the function, the ordinary function executes its SQL once every row while the deterministic function executes its SQL twice every fetch (once for each pair of input values (though the first fetch from SQL*Plus is a special case) and then remembers the outputs for the rest of the fetch – this is why there is a delay before the deterministic function changes its output and doesn’t introduce any extra calls to the SQL – it was going to re-execute for both values on the fetch starting at id 13 whatever went on around it; the result cache function gets an invalidation warning the moment the other session commits, so re-executes its SQL as many times as needed to repopulate the bits of the cache that have been discarded – and that’s why we see the number of SQL calls doubling, the relevant part of the cache was identified by some sort of hash value for the statement with SQL_ID = ’49buxp3gba3cg’ so both results were discarded and reloaded even though only one of them actually had to change.

Critically every execution of the recursive statements runs at the then-current SCN – so when the underlying data changes our report sees those changes, the report is not self-consistent.

Fortunately there’s something we can do about this – if we want the whole report to operate at the same SCN all we need to do is freeze our session at a specific point in time with the command “set transaction read only;”. This has to be executed as the first statement of a transaction but if we can slip it in just before we start running our report we can ensure that all our SQL statements (including the recursive ones) run as at the same SCN until we issue a commit. When I repeated the data-changing experiment after setting the transaction read only the report ran to completion showing the correct results.

But there was a significant change in the output from v$sql:


SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------------
49buxp3gba3cg         44         44             44
SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

2sh7bm59dkwhw         18         18             18
SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

I did my update just after the first batch of rows had appeared on screen – notice how the result cache SQL has executed 44 times instead of 2 (or 4) times. When I set my transaction to read only it looks as if my session stopped using the result cache the moment the other session commited – and that’s a GOOD THING. If my session were able to continue using the result cache that would mean one of two things, either I would be seeing a result created by another user – which would be too new for me, or every other session would be seeing the results from my session – which would (typically) be out of date for them. The session seems to have protected itself from the risk of a result cached PL/SQL function producing inconsistent results – but the workload changed the moment another session committed a change to the data we were interested in.

At that point I stopped investigating “set transaction read only” until a couple of days later when I realised that there was one more thing I had to test – when I changed the data from another session I didn’t check to see what that session would see when it executed the cached function, so I modified the code run by the other session to do the following:


update  t2 set 
        cvt = 2 
where   v1 = 'USD' 
;

commit;

execute dbms_lock.sleep(6)
execute dbms_output.put_line(plsql_result_cache('USD','GBP'))

The other session goes to sleep for a while (to let the reporting session get through a little more work) and then calls the function. I was very pleased to see that the session returned the correct result – the value 2 that it had just written to the table. But what I got from the reporting session wasn’t so good:

WAIT_ROW_N(0.3,ID) PLSQL_ORDINARY(V1,V2) PLSQL_DETERMINISTIC(V1,V2) PLSQL_RESULT_CACHE(V1,V2) SCALAR_SUB
------------------ --------------------- -------------------------- ------------------------- ----------
                 1                     1                          1                         1          1
                 2                     0                          0                         0          0
                 3                     1                          1                         1          1
                 4                     0                          0                         0          0
                 5                     1                          1                         1          1
                 6                     0                          0                         0          0
                 7                     1                          1                         1          1
                 8                     0                          0                         0          0
...
                24                     0                          0                         0          0
                25                     1                          1                         1          1
                26                     0                          0                         0          0
                27                     1                          1                         1          1
                28                     0                          0                         0          0
                29                     1                          1                         1          1
                30                     0                          0                         2          0
                31                     1                          1                         1          1
                32                     0                          0                         2          0

SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED SQL_TEXT
------------- ---------- ---------- -------------- --------------------------------
49buxp3gba3cg         32         32             32 SELECT /*+ result cache trace_th
                                                   is */ CVT FROM T2 WHERE V1 = :B2
                                                    AND V2 = :B1

49buxp3gba3cg          1          1              1 SELECT /*+ result cache trace_th
                                                   is */ CVT FROM T2 WHERE V1 = :B2
                                                    AND V2 = :B1

2sh7bm59dkwhw         18         18             18 SELECT /*+ det trace_this */ CVT
                                                    FROM T2 WHERE V1 = :B2 AND V2 = 
                                                    :B1    

f1bz07bk5rbth         50         50             50 SELECT /*+ ordinary trace_this * 
                                                   / CVT FROM T2 WHERE V1 = :B2 AND
                                                    V2 = :B1

I changed t2 just after the first batch of rows had appeared (just after id 6), then called the function a few seconds later – and as I called the function from the other session it queried the data (the second child to 49buxp3gba3cg, executed just once above) and reloaded the result cache. At that moment (visible at id 30) the first session found data in the result cache and stopped re-executing its queries. When my session went read only it protected other sessions from the past by NOT re-populating the result cache as it ran its queries – but if it found something in the result cache it used it (notice how it has recorded 32 executions of the query rather than 44 – it found something in the result cache on the last 12 calls of the function). The protection only goes one way.

Observations

Using PL/SQL functions in the select list to encapsulate complex calculations that query the database is not something you can do casually. You have no guarantee that you will end up with a self-consistent result unless you take action to deal with the threats introduced by concurrent activity – ideally all tables involved should be set to read-only (which is only possible in 12c [Ed: see comment below] became possible from 11.1 onwards, though you can set a whole tablespace readonly in earlier versions: neither strategy is commonly viable). If you decide that you can work around those threats you still have the performance threat implied by the need to do some complex work for each driving row of your result set. For a report the simple solution to consistency is to “freeze” the session SCN by setting your session (transaction) into read only mode.

Once you’ve dealt with the consistency problem you can start to address the performance problen by claiming that you were using deterministic functions. You might be tempted to use the PL/SQL Result Cache to give you an even larger performance boost, but if you do you really have to make the table (or tablespace) read-only to be protected from read-consistency problems. The deterministic strategy may not be as dramatic in its effects as the result cache strategy but, given a reasonably sized fetch array, the performance benefit you get may be all you really need.

Whatever else you do, there’s an important underlying threat to remember. The moment you create a PL/SQL function that uses the result cache or deterministic option you have to ensure that nobody uses that function without ensuring that their code has handled the consistency threat properly. It’s too easy to forget, with the passing of time, that certain procedures have to be adopted when particular coding strategies are used.

Left as Exercise

I was planning to write a further article going into some detail about using dbms_flashback.enable_at_time(systimestamp) instead of “set transaction read only” – a mechanism that might be used to achieve the same read-consistency requirement though, perhaps, used less frequently than the older, less flexible option. I was also planning to include notes in the same araticle about the effect of “alter session set isolation_level = serializable” which some systems probably use to get read-consistency across multiple statements while writing results back to the database.

Both strategies run into the same issue as “set transaction read only”, with the added effect that your session (the one that has tried to “fix” its SCN) will repopulate the cache, so not only could you see newer results from other sessions in the cache; other sessions could see older results because you put them into the cache.

I’m not going to write up these demonstrations (which require fairly simple modifications to the code supplied above) as all the phenomena have been recorded as bugs on MoS (along with the GTT problem from my previous post, and a documentation bug for the PL/SQL manual to clarify the various threats):

  • Bug 21905592 : PL/SQL RESULT CACHE MUST BE BYPASSSED WHEN SESSION SCN NOT CURRENT
  • Bug 21905695 : BAD X-SESSION RESULTS W/ RESULT-CACHED PL/SQL FN AND “SET TXN READ ONLY”
  • Bug 21907155 : PL/SQL RC FN SHOWS ONE SESSION THE CONTENTS OF ANOTHER SESSION’S GTT
  • Bug 21885173 : THE LNPLS SECTION “RESTRICTIONS ON RESULT-CACHED FUNCTIONS” NEEDS RE-WRITE

Footnote

I haven’t mentioned it here, but another strategy for reducing the number of PL/SQL calls is simply to wrap the function call inside a scalar subquery, along the lines of:

select
       (select plsql_ordinary(v1, v2) from dual),
       ...

Provided you don’t have many combinations of (v1,v2) to handle, and with a little luck with Oracle’s internal hashing function, you could find that scalar subquery caching reduces your execution count from once per row to once per combination. Note that the function is the “ordinary” function, not the “fake deterministic” one; internally Oracle uses the same hashing strategy for remembering the results, but the duration of the scalar subquery cache is the statement rather than the fetch.

 

September 22, 2015

Result Cache

Filed under: 12c,Bugs,Infrastructure,Oracle,Performance — Jonathan Lewis @ 10:11 am BST Sep 22,2015

Yesterday I thought I’d spend half an hour before breakfast creating a little demonstration of a feature; some time about midnight I felt it was time to stop because I’d spent enough time chasing around a couple of bugs that produced wrong results in a variety of ways. Today’s short post is just little warning: be VERY careful what you do with the PL/SQL result cache – if you use the results of database queries in the cache you may end up with inconsistent results in your application. Here’s one very simple example of what can go wrong, starting with a little script:


alter system flush shared_pool;

truncate table gtt1;
drop table gtt1;
create global temporary table gtt1 (n1 number)
on commit preserve rows
;

insert into gtt1 values(1);
commit;

execute dbms_stats.gather_table_stats(user,'gtt1');


create or replace function f_cache
return number 
result_cache 
-- relies_on (gtt1)
is
        m_ret number;
begin
        select  max(n1)
        into    f_cache.m_ret
        from    gtt1
        ;

        return f_cache.m_ret;
end;
/

execute dbms_output.put_line(f_cache)

Here’s the output from a session that’s just connected and run this script (the table already existed from an earlier run):

SQL> @temp

System altered.


Table truncated.


Table dropped.


Table created.


1 row created.


Commit complete.


PL/SQL procedure successfully completed.


Function created.

1

PL/SQL procedure successfully completed.

SQL> 

Let’s call this session A, and I’m going to connect with another session which I’ll call session B. The following extracts are cut-and-pastes as I alternate between sessions:

Session B:

SQL> execute dbms_output.put_line(f_cache);
1

PL/SQL procedure successfully completed.

SQL> insert into gtt1 values(0);

1 row created.

SQL> execute dbms_output.put_line(f_cache);
0

PL/SQL procedure successfully completed.


Session B has just seen the data inserted into a global temporary table by session A; but after inserting a row into the table it now sees what it perceives to be the correct answer.

Session A:

SQL> truncate table gtt1;

Table truncated.

SQL> execute dbms_output.put_line(f_cache);
1

PL/SQL procedure successfully completed.

Despite truncating the table, session A still sees the data which has been eliminated unrecoverably.

Session B (where I hadn’t yet committed):


SQL> commit;

Commit complete.

SQL> 

Session A (where I’ve done nothing new):

SQL> execute dbms_output.put_line(f_cache);


PL/SQL procedure successfully completed.

SQL> 

The row has finally “disappeared” because session B committed.

Session B (where I haven’t done anything since committing):


SQL> execute dbms_output.put_line(f_cache);


PL/SQL procedure successfully completed.

SQL> 

Session B no longer sees the data because it’s now seeing what session A has just seen.

Warning.

This is just one of several ways I managed to get surprising inconsistencies when using the (cross-session) PL/SQL Result Cache. Oracle (in 12c) is supposed to know what the cache relies on and react accordingly, but it doesn’t manage to do it properly (even if you include the relies_on clause) – if you query the database in your PL/SQL you may find strange things happen.

The most important point to note in this example is that a session that wasn’t necessarily doing anything wrong got the wrong results because of the actions of another session that Oracle should have protected it from.

The good thing about this is example that it’s documented (sort of) – as it says in the manual: “to be result cached … does not reference … temporary tables ..”. So that should warn people off copying my example; on the other hand the problem I’m seeing arises because Oracle seems to be trying to use the result cache when the manuals suggests it shouldn’t be.

Conclusion

Do not mix the pl/sql result cache with database queries. The cache is public but (unlike the buffer cache) it is not guaranteed to give you read-consistency.

If you think this was an unrealistic example and you don’t need to worry about it – I’ll post a couple more examples in the next couple of weeks. They’ll be slightly special cases again, but I find the Oracle world is full of special cases.

Update

This behaviour is now visible on MoS as “Bug 21907155 : PL/SQL RC FN SHOWS ONE SESSION THE CONTENTS OF ANOTHER SESSION’S GTT”

July 6, 2015

SQL vs. PL/SQL

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 10:23 am BST Jul 6,2015

Which piece of code will be faster (clue – the table in question has no indexes):

Option 1 – pure SQL


update join1 set
        data = data||'#'
where   key_1=500
and     key_2=23851
and     key_3=57012
and     key_4=521
and     key_6=1
and     key_7=23352
;

Option 2 – a silly PL/SQL row by row approach:


declare
        type rowid_type is table of urowid index by binary_integer;
        tab_rowid           rowid_type;  

        lv_rows_updated     number :=0;  

        cursor my_cursor is
                select  rowid rid
                from    join1
                where   key_1=500
                and     key_2=23851
                and     key_3=57012
                and     key_4=521
                and     key_6=1
                and     key_7=23352
        ;

begin
        open my_cursor;

        -- We know that the number of rows to be updated is very small
        fetch my_cursor bulk collect into tab_rowid limit 10000;

        forall lv_row in tab_rowid.first .. tab_rowid.last
             update join1 set data = data||'#' where  rowid = tab_rowid(lv_row);

        lv_rows_updated := sql%rowcount;
        close my_cursor;
end;
/

It’s a trick question, of course, and although the automatic response from any DBA-type is likely to be “the SQL”, the correct answer is (as so often) “it depends”.

This question appeared as a problem on the OTN database forum a few days ago. In it’s original form it asked why a select statement should be much faster than a select for update or an update – even though the volume identified and updated was very small (just one row in 100M).The note then went on to show that using PL/SQL to select the rowids of the target rows then doing the bulk update by rowid was faster than the basic SQL update. The answer didn’t spring to mind immediately; but fortunately someone asked for some run-time statistics (v$sesstat) and the supplied statistics told me what was going on.

Conveniently the OP gave us the code to recreate the test case – all 100M rows of it; I cut this back to 16M rows (ca. 1.5GB of disc space), and then ran the tests with ny db_cache_size set to 256MB (another clue). I got similar results to the OP – not so dramatic, but the PL/SQL ran faster than the SQL and the difference was due to an obvious change in the CPU usage.

If you haven’t guess from the clue in the 256MB db_cache_size (which means the table is more than 5 times the size of the cache), the answer is “serial direct path reads”. For a sufficiently large table (and that’s not easy to define – start here and follow a few links) it’s fairly common knowledge that from 11g a tablescan can use a serial direct path read, and that’s what the PL/SQL was doing to select the required rowids. However, here’s a detail that’s not often mentioned: an update has to take place in public where everyone can see it so when Oracle executed the simple SQL update or select for update statement it had to scan the table through the buffer cache. Pulling all those blocks into the buffer cache, grabbing latches to link them to the right cache buffers chains, pinning them, then unpinning them uses a lot of CPU – which isn’t needed for the direct path read. The PL/SQL with its pure select used far less CPU than the basic SQL with its update/select for update, and because the OP had a very high-powered machine with plenty of CPU and loads of (disc-)caching effects all over the place the difference in CPU time was exremely visible as a fraction of the total DB time.

This was, inevitably, a very special case where a little detail became a significant fraction of the workload. The OP scanned 100M rows to update 1 row (in 7 – 13 seconds!). This doesn’t sound like a strategy you would normally want to adopt for frequent use; and for occasional use we might be happy to use the slower (13 second) approach to avoid the coding requirement of the fast (7 second) solution.

Footnote:

It’s worth pointing out that the PL/SQL strategy is not safe. In the few seconds between the select statement starting and the row being identified and updated by rowid it’s possible that another session could have updated (or deleted) the row. In the former case the update statement is now updating a row which doesn’t match the specification; in the latter case the code will raise an exception.

We can make the PL/SQL safer by including the original predicates in the update statement – but that still leaves the question of what the code should do if the select statement finds a row and the update fails to update it. Should it, perhaps, assume that there is still a row in the table that needs an update and re-run (using up all the time you saved by adopting a PL/SQL solution).

 

 

 

May 11, 2015

Efficiency

Filed under: Oracle,Performance — Jonathan Lewis @ 9:24 pm BST May 11,2015

Here’s a question to which I don’t know the answer, and for which I don’t think anyone is likely to need the answer; but it might be an entertaining little puzzle for thr curious.

Assume you’re doing a full tablescan against a simple heap table of no more than 255 columns (and not using RAC, Exadata, In-memory, etc. etc. etc.), and the query is something like:


select  {columns 200 to 250}
from    t1
where   column_255 = {constant}
;

To test the predicate Oracle has to count its way along each row column by column to find column 255. Will it:

  1. transfer columns 200 to 250 to local memory as it goes, then check column_255 — which seems to be a waste of CPU if the predicate doesn’t evaluate to TRUE
  2. evaluate the predicate, then walk the row again to transfer columns 200 to 250 to local memory if the predicate evaluates to TRUE — which also seems to be a waste of CPU
  3. one or other of the above depending on which columns are needed, how long they are, and the selectivity of the predicate

How would you test your hypothesis ?

April 15, 2015

Cartesian join

Filed under: Oracle,Performance,Tuning — Jonathan Lewis @ 6:40 pm BST Apr 15,2015

Some time ago I pulled off the apocryphal “from 2 hours to 10 seconds” trick for a client using a technique that is conceptually very simple but, like my example from last week, falls outside the pattern of generic SQL. The problem (with some camouflage) is as follows: we have a data set with 8 “type” attributes which are all mandatory columns. We have a “types” table with the same 8 columns together with two more columns that are used to translate a combination of attributes into a specific category and “level of relevance”. The “type” columns in the types table are, however, allowed to be null although each row must have at least one column that is not null – i.e. there is no row where every “type” column is null.

The task is to match each row in the big data set with all “sufficiently similar” rows in the types table and then pick the most appropriate of the matches – i.e. the match with the largest “level of relevance”. The data table had 500,000 rows in it, the types table has 900 rows. Here’s a very small data set representing the problem client data (cut down from 8 type columns to just 4 type columns):


create table big_table(
	id		number(10,0)	primary key,
	v1		varchar2(30),
	att1		number(6,0),
	att2		number(6,0),
	att3		number(6,0),
	att4		number(6,0),
	padding		varchar2(4000)
);

create table types(
	att1		number(6,0),
	att2		number(6,0),
	att3		number(6,0),
	att4		number(6,0),
	category	varchar2(12)	not null,
	relevance	number(4,0)	not null
);

insert into big_table values(1, 'asdfllkj', 1, 1, 2, 1, rpad('x',4000));
insert into big_table values(2, 'rirweute', 1, 3, 1, 4, rpad('x',4000));

insert into types values(   1, null, null, null, 'XX',  10);
insert into types values(   1, null, null,    1, 'YY',  20);
insert into types values(   1, null,    1, null, 'ZZ',  20);

commit;

A row from the types table is similar to a source row if it matches on all the non-null columns. So if we look at the first row in big_table, it matches the first row in types because att1 = 1 and all the other attN columns are null; it matches the second row because att1 = 1 and att4 = 1 and the other attN columns are null, but it doesn’t match the third row because types.att3 = 1 and big_table.att3 = 2.

Similarly, if we look at the second row in big_table, it matches the first row in types, doesn’t match the second row because types.att4 = 1 and big_table.att4 = 4, but does match the third row. Here’s how we can express the matching requirement in SQL:


select
	bt.id, bt.v1,
	ty.category,
	ty.relevance
from
	big_table	bt,
	types		ty
where
	nvl(ty.att1(+), bt.att1) = bt.att1
and	nvl(ty.att2(+), bt.att2) = bt.att2
and	nvl(ty.att3(+), bt.att3) = bt.att3
and	nvl(ty.att4(+), bt.att4) = bt.att4
;

You’ll realise, of course, that essentially we have to do a Cartesian merge join between the two tables. Since there’s no guaranteed matching column that we could use to join the two tables we have to look at every row in types for every row in big_table … and we have 500,000 rows in big_table and 900 in types, leading to an intermediate workload of 450,000,000 rows (with, in the client case, 8 checks for each of those rows). Runtime for the client was about 2 hours, at 100% CPU.

When you have to do a Cartesian merge join there doesn’t seem to be much scope for reducing the workload, however I didn’t actually know what the data really looked like so I ran a couple of queries to analyse it . The first was a simple “select count (distinct)” query to see how many different combinations of the 8 attributes existed in the client’s data set. It turned out to be slightly less than 400.

Problem solved – get a list of the distinct combinations, join that to the types table to translate to categories, then join the intermediate result set back to the original table. This, of course, is just applying two principles that I’ve discussed before: (a) be selective about using a table twice to reduce the workload, (b) aggregate early if you can reduce the scale of the problem.

Here’s my solution:


with main_data as (
	select
		/*+ materialize */
		id, v1, att1, att2, att3, att4
	from
		big_table
),
distinct_data as (
	select
		/*+ materialize */
		distinct att1, att2, att3, att4
	from	main_data
)
select
	md.id, md.v1, ty.category, ty.relevance
from
	distinct_data	dd,
	types		ty,
	main_data	md
where
	nvl(ty.att1(+), dd.att1) = dd.att1
and	nvl(ty.att2(+), dd.att2) = dd.att2
and	nvl(ty.att3(+), dd.att3) = dd.att3
and	nvl(ty.att4(+), dd.att4) = dd.att4
and	md.att1 = dd.att1
and	md.att2 = dd.att2
and	md.att3 = dd.att3
and	md.att4 = dd.att4
;

And here’s the execution plan.


---------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                            |    12 |  2484 |    11  (10)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION |                            |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D6619_8FE93F1 |       |       |            |          |
|   3 |    TABLE ACCESS FULL       | BIG_TABLE                  |     2 |   164 |     2   (0)| 00:00:01 |
|   4 |   LOAD AS SELECT           | SYS_TEMP_0FD9D661A_8FE93F1 |       |       |            |          |
|   5 |    HASH UNIQUE             |                            |     2 |   104 |     3  (34)| 00:00:01 |
|   6 |     VIEW                   |                            |     2 |   104 |     2   (0)| 00:00:01 |
|   7 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D6619_8FE93F1 |     2 |   164 |     2   (0)| 00:00:01 |
|*  8 |   HASH JOIN                |                            |    12 |  2484 |     6   (0)| 00:00:01 |
|   9 |    NESTED LOOPS OUTER      |                            |     6 |   750 |     4   (0)| 00:00:01 |
|  10 |     VIEW                   |                            |     2 |   104 |     2   (0)| 00:00:01 |
|  11 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D661A_8FE93F1 |     2 |   104 |     2   (0)| 00:00:01 |
|* 12 |     TABLE ACCESS FULL      | TYPES                      |     3 |   219 |     1   (0)| 00:00:01 |
|  13 |    VIEW                    |                            |     2 |   164 |     2   (0)| 00:00:01 |
|  14 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9D6619_8FE93F1 |     2 |   164 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - access("MD"."ATT1"="DD"."ATT1" AND "MD"."ATT2"="DD"."ATT2" AND
              "MD"."ATT3"="DD"."ATT3" AND "MD"."ATT4"="DD"."ATT4")
  12 - filter("DD"."ATT1"=NVL("TY"."ATT1"(+),"DD"."ATT1") AND
              "DD"."ATT2"=NVL("TY"."ATT2"(+),"DD"."ATT2") AND
              "DD"."ATT3"=NVL("TY"."ATT3"(+),"DD"."ATT3") AND
              "DD"."ATT4"=NVL("TY"."ATT4"(+),"DD"."ATT4"))

Critically I’ve taken a Cartesian join that had a source of 500,000 and a target of 900 possible matches, and reduced it to a join between the 400 distinct combinations and the 900 possible matches. Clearly we can expect this to to take something like one twelve-hundredth (400/500,000) of the work of the original join – bringing 7,200 seconds down to roughly 6 seconds. Once this step is complete we have an intermediate result set which is the 4 non-null type columns combined with the matching category and relevance columns – and can use this in a simple and efficient hash join with the original data set.

Logic dictated that the old and new results would be the same – but we did run the two hour query to check that the results matched.

Footnote: I was a little surprised that the optimizer produced a nested loops outer join rather than a Cartesian merge in the plan above – but that’s probably an arterfact of the very small data sizes in my test.There’s presumably little point in transferring the data into the PGA when the volume is so small.

Footnote 2: I haven’t included the extra steps in the SQL to eliminate the reduce the intermediate result to just “the most relevant” – but that’s just an inline view with an analytic function. (The original code actually selected the data with an order by clause and used a client-side filter to eliminate the excess!).

Footnote 3: The application was a multi-company application – and one of the other companies had not yet gone live on the system because they had a data set of 5 million rows to process and this query had never managed to run to completion in the available time window.  I’ll have to get back to the client some day and see if the larger data set also collapsed to a very small number of distinct combinations and how long the rewrite took with that data set.

 

April 13, 2015

Not Exists

Filed under: CBO,Execution plans,Oracle,Performance — Jonathan Lewis @ 12:51 pm BST Apr 13,2015

The following requirement appeared recently on OTN:


=========================================================================================================
I have a following query and want to get rid of the "NOT EXISTS' clause without changing the end results.

SELECT   A.c,
         A.d,
         A.e,
         A.f
  FROM   A
WHERE   NOT EXISTS (SELECT   1
                       FROM   B
                      WHERE   B.c = A.c AND B.d = A.d AND B.e = A.e);
===========================================================================================================

Inevitably this wasn’t the problem query, and almost inevitably the OP was asking us how to implement a solution which wasn’t appropriate for a problem that shouldn’t have existed. Despite this it’s worth spending a little time to take the request at its face value and look at the sort of thing that could be going on.

First, of course, you cannot get rid of the “not exists” clause, although you may be able to make it look different. If you want “all the rows in A that are not referenced in B” then you HAVE to examine all the rows in A, and you have to do some sort of check for each row to see whether or not it exists in B. The only option you’ve got for doing something about the “not exists” clause is to find a way of making it as a cheap as possible to implement.

A couple of people came up with suggestions for rewriting the query to make it more efficient. One suggested writing it as a “NOT IN” subquery, but it’s worth remembering that the optimizer may cheerfully transform a “NOT IN” subquery to a “NOT EXISTS” subquery if it’s legal and a manual rewrite may overlook the problem of NULLs; another suggested rewriting the query as an outer join, but again it’s worth remembering that the optimimzer may transform a “NOT EXISTS” subquery to an “ANTI-JOIN” – which is a bit like an outer join with filter, only more efficient. So, before suggesting a rewrite, it’s worth looking at the execution plan to see what the optimizer is doing just in case it’s doing something silly. There are two options – anti-join or filter subquery.

Here, with code I’ve run under 10.2.0.5 to match the OP, is a demonstration data set, with the two plans you might expect to see – first, some the data:


execute dbms_random.seed(0)

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(0,4))           c,
        trunc(dbms_random.value(0,5))           d,
        trunc(dbms_random.value(0,300))         e,
        rownum                                  f,
        rpad('x',100)                   padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

create table t2
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(0,4))           c,
        trunc(dbms_random.value(0,5))           d,
        trunc(dbms_random.value(0,300))         e,
        rownum                                  f,
        rpad('x',100)                   padding
from
        generator       v1,
        generator       v2
where
        rownum <= 24000
;

create index t1_i1 on t1(c,d,e);
create index t2_i1 on t2(c,d,e);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

The OP had followed up their original query with a claim that “Table A holds 100 million rows and table B holds 24,000” – that’s a lot of checks (if true) and you ought to be asking how quickly the OP expects the query to run and how many of the 100 M rows are going to survive the check. I’ve set up just 1M rows with 6,000 distinct values for the column combination (c,d,e), and a reference table with 24,000 rows which are likely to include most, but not all, of those 6,000 combinations.

Rather than generate a very large output, I’ve written a query that generates the required data set, then counts it:


select
        max(f), count(*)
from (
        SELECT   /*+ no_merge */
                 A.c,
                 A.d,
                 A.e,
                 A.f
          FROM   t1 A
        WHERE   NOT EXISTS (SELECT   /* no_unnest */
                                      1
                               FROM   t2 B
                              WHERE   B.c = A.c AND B.d = A.d AND B.e = A.e)
)
;

This took about 0.35 seconds to run – aggregating roughly 14,500 rows from 1M. The plan was (as I had expected) based on a (right) hash anti join:


---------------------------------------------------------------------------------
| Id  | Operation               | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |       |     1 |    13 |  2183   (5)| 00:00:11 |
|   1 |  SORT AGGREGATE         |       |     1 |    13 |            |          |
|   2 |   VIEW                  |       |   999K|    12M|  2183   (5)| 00:00:11 |
|*  3 |    HASH JOIN RIGHT ANTI |       |   999K|    23M|  2183   (5)| 00:00:11 |
|   4 |     INDEX FAST FULL SCAN| T2_I1 | 24000 |   234K|    11  (10)| 00:00:01 |
|   5 |     TABLE ACCESS FULL   | T1    |  1000K|    14M|  2151   (4)| 00:00:11 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("B"."C"="A"."C" AND "B"."D"="A"."D" AND "B"."E"="A"."E")

Oracle has built an in-memory hash table from the 24,000 rows in t2, then scanned the t1 table, probing the hash table with each row in turn. That’s 1M probe in less than 0.35 seconds. You ought to infer from this that most of the time spent in the original query should have been spent scanning the 100M rows, and only a relatively small increment appear due to the “not exists” clause.

You’ll notice, though that there was a comment in my subquery with the /* no_unnest */ hint embedded – if I change this from a comment to a hint (/*+ */) I should get a plan with a filter subquery, and maybe that’s what’s happening to the OP for some odd reason. Here’s the plan:


------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |     1 |    13 | 15166   (1)| 00:01:16 |
|   1 |  SORT AGGREGATE      |       |     1 |    13 |            |          |
|   2 |   VIEW               |       |   999K|    12M| 15166   (1)| 00:01:16 |
|*  3 |    FILTER            |       |       |       |            |          |
|   4 |     TABLE ACCESS FULL| T1    |  1000K|    14M|  2155   (4)| 00:00:11 |
|*  5 |     INDEX RANGE SCAN | T2_I1 |     4 |    40 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "B"
              WHERE "B"."E"=:B1 AND "B"."D"=:B2 AND "B"."C"=:B3))
   5 - access("B"."C"=:B1 AND "B"."D"=:B2 AND "B"."E"=:B3)

The query took 1.65 seconds to complete. (And re-running with rowsource execution statistics enabled, I found that the subquery had executed roughly 914,000 times in that 1.65 seconds). Even if the original query had used the filter subquery plan the subquery shouldn’t have made much difference to the overall performance. Of course if T2 didn’t have that index on (c,d,e) then the filter subquery plan would have been much more expensive – but then, we would really have expected to see the hash anti-join.

If you’re wondering why the subquery ran 914,000 times instead of 1M times, you’ve forgotten “scalar subquery caching”.  The session caches a limited number of results from subquery execution as a query runs and may be able to use cached results (or simply a special “previous-execution” result) to minimise the number of executions of the subquery.

Did you notice the index I created on t1(c,d,e) ? If I drive the query through this index I’ll access all the rows for a given combination of (c,d,e) one after the other and only have to run the subquery once for the set. To make this happen, though, I’ll have to declare one of the columns to be NOT NULL, or add a suitable “column is not null” predicate to the query; and then I’ll probably have to hint the query anyway:


select
        max(f)
from (
        SELECT   /*+ no_merge index(a) */
                 A.c,
                 A.d,
                 A.e,
                 A.f
          FROM   t1 A
        WHERE   NOT EXISTS (SELECT   /*+ no_unnest */
                                      1
                               FROM   t2 B
                              WHERE   B.c = A.c AND B.d = A.d AND B.e = A.e)
        and     c is not null
)
;

---------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |     1 |    13 | 65706   (1)| 00:05:29 |
|   1 |  SORT AGGREGATE               |       |     1 |    13 |            |          |
|   2 |   VIEW                        |       |   999K|    12M| 65706   (1)| 00:05:29 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1    | 50000 |   732K| 52694   (1)| 00:04:24 |
|*  4 |     INDEX FULL SCAN           | T1_I1 | 50000 |       |  2869   (2)| 00:00:15 |
|*  5 |      INDEX RANGE SCAN         | T2_I1 |     4 |    40 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("C" IS NOT NULL AND  NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM
              "T2" "B" WHERE "B"."E"=:B1 AND "B"."D"=:B2 AND "B"."C"=:B3))
   5 - access("B"."C"=:B1 AND "B"."D"=:B2 AND "B"."E"=:B3)

Re-running this code with rowsource execution statistics enabled showed that the subquery ran just 6,000 times (as expected) – for a total run time that was slightly faster than the hash anti-join method (0.17 seconds – but I do have a new laptop using SSD only, with a 3.5GHz CPU and lots of memory).

Every which way, if we can get reasonable performance from the underlying table access there’s no way that introducing a “NOT EXISTS” ought to be a disaster. The worst case scenario – for some reason Oracle chooses to run a filter subquery plan and the appropriate index hasn’t been created to support it.

Footnote:

Of course, table A didn’t really exist, it was a three table join; and it didn’t produce 100M rows, it produced anything between zero and 5 million rows, and the effect of the subquery (which correlated back to two of the joined tables) was to leave anything between 0 and 5 million rows. And (apparently) the query was quick enough in the absence of the subquery (producing, for example, 1 million rows in only 5 minutes), but too slow with the subquery in place.

But that’s okay. Because of our tests we know that once we’ve produced a few million rows it takes fractions of a second more to pass them through a hash table with an anti-join to deal with the “not exists” subquery; and I doubt if we have to play silly games to push the data through a filter subquery plan in the right order to squeeze a few extra hundredths of a second from the query.

If the OP is happy with the basic select statement before the “not exists” subquery, all he has to do is take advantage of a no_merge hint:


select  {list of columns}
from
        (
        select /*+ no_merge */ .... rest of original query
        )    v1
where
        not exists (
                select  null
                from    b
                where   b.c = v1.c and b.d = v1.d and b.e = v1.e
        )
;

You’re probably wondering why the OP currently sees a performance problem as the subquery is added. The best guess is that the subquery has introduce a “magic 5% fudge factor” to the arithmetic (did you notice the cardinality of t1 dropping to 50,000 from 1M in the plan above) and made it pick a worse execution plan for the rest of the query. We can’t tell, though, since the OP hasn’t yet given us the information that would allow us to see what’s going wrong.

April 10, 2015

Counting

Filed under: Execution plans,Indexing,Oracle,Performance — Jonathan Lewis @ 5:27 pm BST Apr 10,2015

There’s a live example on OTN at the moment of an interesting class of problem that can require some imaginative thinking. It revolves around a design that uses a row in one table to hold the low and high values for a range of values in another table. The problem is then simply to count the number of rows in the second table that fall into the range given by the first table. There’s an obvious query you can write (a join with inequality) but if you have to join each row in the first table to several million rows in the second table then aggregate to count them that’s an expensive strategy.  Here’s the query (with numbers of rows involved) that showed up on OTN; it’s an insert statement, and the problem is that it takes 7 hours to insert 37,600 rows:


    INSERT INTO SA_REPORT_DATA
    (REPORT_ID, CUTOFF_DATE, COL_1, COL_2, COL_3)
    (
    SELECT 'ISRP-734', to_date('&DateTo', 'YYYY-MM-DD'),
           SNE.ID AS HLR
    ,      SNR.FROM_NUMBER||' - '||SNR.TO_NUMBER AS NUMBER_RANGE
    ,      COUNT(M.MSISDN) AS AVAILABLE_MSISDNS
    FROM
           SA_NUMBER_RANGES SNR          -- 10,000 rows
    ,      SA_SERVICE_SYSTEMS SSS        --  1,643 rows
    ,      SA_NETWORK_ELEMENTS SNE       --    200 rows
    ,      SA_MSISDNS M                  --    72M rows
    WHERE
           SSS.SEQ = SNR.SRVSYS_SEQ
    AND    SSS.SYSTYP_ID = 'OMC HLR'
    AND    SNE.SEQ = SSS.NE_SEQ
    AND    SNR.ID_TYPE = 'M'
    AND    M.MSISDN  >= SNR.FROM_NUMBER
    AND    M.MSISDN  <= SNR.TO_NUMBER
    AND    M.STATE  = 'AVL'
    GROUP BY
           SNE.ID,SNR.FROM_NUMBER||' - '||SNR.TO_NUMBER
    )  

The feature here is that we are counting ranges of msisdn: we take 10,000 number ranges (snr) and join with inequality to a 72M row table. It’s perfectly conceivable that at some point the data set expands (not necessarily all at once) to literally tens of billions of rows that are then aggregated down to the 37,600 that are finally inserted.

The execution plan shows the optimizer joining the first three tables before doing a merge join between that result set and the relevant subset of the MSISDNs table – which means the MSISDNs have to be sorted and buffered (with a probable spill to disc) before they can be used. It would be interesting to see the rowsource execution stats for the query – partly to see how large the generated set became but also to see if the ranges involved were so large that most of the time went in constantly re-reading the sorted MSISDNs from the temporary tablespace.

As far as optimisation is concerned there are a couple of trivial things around the edges we can examine: we have 10,000 number ranges but insert 37,600 results, and the last stages of the plan generated those results so we’ve scanned and aggregated the sorted MSISDNs 37,600 times. Clearly we could look for a better table ordering that eliminated any number ranges early, then did the minimal number of joins to MSISDN, aggregated, then scaled up to 37,600. With the best join order we might reduce the run time by a factor of around 3.76 or more. (But that’s still a couple of hours run time.)

What we really have to do to make a significant difference is change the infrastructure in some way – preferably invisibly to the rest of the application. There are a number of specific details relating to workload, read-consistency, timing, concurrency, etc. that will need to be considered but, broadly speaking, we need to take advantage of a table that effectively holds the “interesting” MSISDNs in sorted order. When considering the options it’s worth remembering that currently the result is “wrong” because by the time the 7 hour run is complete some (or even many) of the MSISDNs are probably no longer available – so how accurate does the report have to be ?

I’ve kept the approach simple here, and it would need a few modifications for a production system. The important bit of the report is the bit that produces the count so I’m only going to worry about a two-table join – number ranges and msidn; here’s some model data:


execute dbms_random.seed(0)

create table msisdns
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(1e9,1e10))      msisdn
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

create table number_ranges
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(1e9,1e10))      from_number,
        trunc(dbms_random.value(1e9,1e10))      to_number
from
        generator       v1
where
        rownum  <= 1000
;

update number_ranges set
        from_number = to_number,
        to_number = from_number
where
        to_number < from_number
;

commit;

I’ve created a table of numbers with values between 10e9 and 10e10 to represent 1 million MSISDNs, and a list of 1,000 number ranges – making sure that the FROM number is not greater than the TO number. Now I need a “summary” table of the MSISDNs, which I’m going to create as an index-organized table:


create table tmp_msisdns (
        msisdn,
        counter,
        constraint tmp_pk primary key (msisdn, counter)
)
organization index
as
select
        msisdn,
        row_number() over(order by msisdn)      counter
from
        msisdns
;

This is only a demonstration so I’ve haven’t bothered with production-like code to check that the MSISDNs I had generated were unique (they were); and I’ve casually included the row_number() as part of the primary key as a performance fiddle even though it’s something that could, technically, allow some other program to introduce bad data if I made the table available for public use rather than keeping it task specific.

Finally we get down to the report. To find out how many MSISDN values there are between the FROM and TO number in a range I just have to find the lowest and highest MSISDNs from in that range and find the difference between their counter values, and add 1. And there’s a very fast way to find the lowest or highest values when you have the appropriate index – the index range scan (min/max) – but you have to access the table twice, once for the low, once for the high. Here’s the necessary SQL, with execution plan from 12.1.0.2:


select
        nr.from_number, nr.to_number,
--      fr1.msisdn, fr1.counter,
--      to1.msisdn, to1.counter,
        1 + to1.counter - fr1.counter range_count
from
        number_ranges   nr,
        tmp_msisdns     fr1,
        tmp_msisdns     to1
where
        fr1.msisdn = (
                select min(msisdn) from tmp_msisdns where tmp_msisdns.msisdn >= nr.from_number
        )
and     to1.msisdn = (
                select max(msisdn) from tmp_msisdns where tmp_msisdns.msisdn <= nr.to_number
        )
;

-------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |               |       |       |  4008 (100)|          |
|   1 |  NESTED LOOPS                   |               |  1000 | 38000 |  4008   (1)| 00:00:01 |
|   2 |   NESTED LOOPS                  |               |  1000 | 26000 |  2005   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL            | NUMBER_RANGES |  1000 | 14000 |     2   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN             | TMP_PK        |     1 |    12 |     2   (0)| 00:00:01 |
|   5 |     SORT AGGREGATE              |               |     1 |     7 |            |          |
|   6 |      FIRST ROW                  |               |     1 |     7 |     3   (0)| 00:00:01 |
|*  7 |       INDEX RANGE SCAN (MIN/MAX)| TMP_PK        |     1 |     7 |     3   (0)| 00:00:01 |
|*  8 |   INDEX RANGE SCAN              | TMP_PK        |     1 |    12 |     2   (0)| 00:00:01 |
|   9 |    SORT AGGREGATE               |               |     1 |     7 |            |          |
|  10 |     FIRST ROW                   |               |     1 |     7 |     3   (0)| 00:00:01 |
|* 11 |      INDEX RANGE SCAN (MIN/MAX) | TMP_PK        |     1 |     7 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("FR1"."MSISDN"=)
   7 - access("TMP_MSISDNS"."MSISDN">=:B1)
   8 - access("TO1"."MSISDN"=)
  11 - access("TMP_MSISDNS"."MSISDN"<=:B1)

Execution time – with 1 million MSISDNs and 1,000 ranges: 0.11 seconds.

For comparative purposes, and to check that the code is producing the right answers, here’s the basic inequality join method:


select
        nr.from_number, nr.to_number, count(*) range_count
from
        number_ranges   nr,
        msisdns         ms
where
        ms.msisdn >= nr.from_number
and     ms.msisdn <= nr.to_number
group by
        nr.from_number, nr.to_number
order by
        nr.from_number
;

-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |               |       |       |       |   472K(100)|          |
|   1 |  HASH GROUP BY        |               |   707K|    14M|  6847M|   472K (17)| 00:00:19 |
|   2 |   MERGE JOIN          |               |   255M|  5107M|       | 13492  (77)| 00:00:01 |
|   3 |    SORT JOIN          |               |  1000 | 14000 |       |     3  (34)| 00:00:01 |
|   4 |     TABLE ACCESS FULL | NUMBER_RANGES |  1000 | 14000 |       |     2   (0)| 00:00:01 |
|*  5 |    FILTER             |               |       |       |       |            |          |
|*  6 |     SORT JOIN         |               |  1000K|  6835K|    30M|  3451   (7)| 00:00:01 |
|   7 |      TABLE ACCESS FULL| MSISDNS       |  1000K|  6835K|       |   245  (14)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - filter("MS"."MSISDN"<="NR"."TO_NUMBER")
   6 - access("MS"."MSISDN">="NR"."FROM_NUMBER")
       filter("MS"."MSISDN">="NR"."FROM_NUMBER")

The two queries produced the same results (apart from ordering); but the second query took 2 minutes 19.4 seconds to complete.

 

Update:

In a moment of idle curiosity I recreated the data with 40 Million rows in the MSISDNs table to get some idea of how fast the entire report process could go when re-engineered (remember the OP has 72M rows, but selects the subset flagged as ‘AVL’). It took 1 minute 46 seconds to create the IOT – after which the report for 1,000 number ranges still took less than 0.2 seconds.

Footnote:

My random generation of data doesn’t attempt to avoid duplicate MSISDNs, but they are unlikely to appear in the 1M row test; the 40M row test, however, will almost certainly produce a small percentage of duplicates. As a consequence the final result may exceed 1,000 rows by one or two, but since this is just a demonstration of the principle and a quick performance check that doesn’t worry me.

Update 12th Dec 2015

Stew Ashton was in the audience at the UKOUG Tech 15 conference when I described the problem in a presentation on “avoiding work you don’t need to do” and immediately suggested using match_recognize() as an alternative strategy for solving the problem. He has since published his solution for two different scenarios on his blog.

When I tested his “overlapping ranges” solution against my temporary table solution it produced two advantages – first it ran twice as quickly to produce the result, secondly it didn’t need an auxiliary table. Having brought the run-time down from hours to a minute or two the second advantage is rather more desirable, probably, than the first – especially since you then don’t have to mess around producing a point-in-time read-consistent result. Of course, you may still base your choice of method on the ease of comprehension of the code – and in the short term the match_recognize() mechanism is not well-known.

 

February 19, 2015

255 columns

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 12:45 am BST Feb 19,2015

You all know that having more than 255 columns in a table is a Bad Thing ™ – and surprisingly you don’t even have to get to 255 to hit the first bad thing about wide tables. If you’ve ever wondered what sorts of problems you can have, here are a few:

  • If you’re still running 10g and gather stats on a table with more than roughly 165 columns then the query Oracle uses to collect the stats will only handle about 165 of them at a time; so you end up doing multiple (possibly sampled) tablescans to gather the stats. The reason why I can’t give you an exact figure for the number of columns is that it depends on the type and nullity of the columns – Oracle knows that some column types are fixed length (e.g. date types, char() types) and if any columns are declared not null then Oracle doesn’t have to worry about counting nulls – so for some of the table columns Oracle will be able to eliminate one or two of the related columns it normally includes in the stats-gathering SQL statement – which means it can gather stats on a few more table columns.  The 165-ish limit doesn’t apply in 11g – though I haven’t checked to see if there’s a larger limit before the same thing happens.
  • If you have more than 255 columns in a row Oracle will split it into multiple row pieces of 255 columns each plus one row piece for “the rest”; but the split counts from the end, so if you have a table with 256 columns the first row-piece has one column the second row-piece has 255 columns. This is bad news for all sorts of operations because Oracle will have to expend extra CPU chasing the the row pieces to make use of any column not in the first row piece. The optimists among you might have expected “the rest” to be in the last row piece. If you want to be reminded how bad row-chaining can get for wide tables, just have a look at an earlier blog note of mine (starting at this comment).
  • A particularly nasty side effect of the row split comes with direct path tablescans – and that’s what Oracle does automatically when the table is large. In many cases all the row pieces for a row will be in the same block; but they might not be, and if a continuation row-piece is in a different block Oracle will do a “db file sequential read” to read that block into the buffer cache and it won’t be cached (see 1st comment below).  As an indication of how badly this can affect performance, the results I got at a recent client site showed “select count(col1) from wide_table” taking 10  minutes while “select count(column40) from wide_table” took 22 minutes because roughly one row in a hundred required a single block read to follow the chain.
  • An important side effect of the split point is that you really need to put the columns you’re going to index near the start of the table to minimise the risk of this row chaining overhead when you create or rebuild an index.
  • On top of everything else, of course, it takes a surprisingly large amount of extra CPU to load a large table if the rows are chained. Another client test reported 140 CPU seconds to load 5M rows of 256 columns, but only 20 CPU seconds to load 255.

If you are going to have tables with more than 255 columns, think very carefully about column order – if you can get all the columns that are almost always null at the end of the row you may get lucky and find that you never need to create a secondary row piece. A recent client had about 290 columns in one table of 16M rows, and 150 columns were null for all 16M rows – unfortunately they had a mandatory “date_inserted” column at the end of the row, but with a little column re-arrangement they eliminated row chaining and saved (more than) 150 bytes storage per row.  Of course, if they have to add and back-fill a non-null column to the table they’re going to have to rebuild the table to insert the column “in the middle”, otherwise all new data will be chained and wasting 150 bytes per row, and any old data that gets updated will suffer a row migration/chain catastrophe.

Next Page »

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 6,321 other followers