Oracle Scratchpad

September 14, 2010

Exadata

Filed under: Exadata,Infrastructure — Jonathan Lewis @ 6:58 pm BST Sep 14,2010

Inevitably there’s still excitement and uncertainty in the air about Exadata – and/or the Database Machine. For ease of reference and a jolly good read I’ve started collecting a few blog postings about it:

Jason Arneil’s Blog

Pythian

Kerry Osborne

Kevin Closson

Tanel Poder

Uwe Hesse

Arup Nanda

Oracle Official Links

 

December 8, 2020

Hash Joins

Filed under: Hash Join,Joins,Oracle — Jonathan Lewis @ 11:55 am GMT Dec 8,2020

This is a follow-up to a note that’s 10 years old [opens in new tab], prompted by a thread on the Oracle Developer Community forum asking about the order of operation in an execution plan, and following up with a question about PGA memory use that I had left open in the original note.

The original note pointed out that the serial execution plan for a 4 table join that used hash joins for every join and had a leading() hint dictating a single join order could still have 8 distinct execution plans (which I then corrected to 4, because half of them were eliminated by an odd little inconsistency of the leading() hint).

The source of the number 8 was the fact that when you hint a hash join with /*+ use_hash(next_alias) */ the optimizer will consider a /*+ swap_join_inputs(next_alias) */ – in other words, it will choose which of the two inputs should be the “build” table and which the “probe” table, even though the human eye will probably assume that next_alias was clearly intended as the probe table. (The “little inconsistency” in the leading() hint is that it blocks the option for a swap for the first – and only the first – join unless it’s explicitly hinted.)

In the article I showed all 8 possible plans, and described how the two plans at the extreme ends of my list would operate at run-time. I’ve reproduced these two plans below, preceded by the hints that produced them:

leading(t1 t2 t3 t4)
use_hash(t2) no_swap_join_inputs(t2)
use_hash(t3) no_swap_join_inputs(t3)
use_hash(t4) no_swap_join_inputs(t4)
 
--------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost  |
--------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |     1 |    24 |    11 |
|   1 |  SORT AGGREGATE       |      |     1 |    24 |       |
|*  2 |   HASH JOIN           |      |    70 |  1680 |    11 |
|*  3 |    HASH JOIN          |      |    70 |  1260 |     8 |
|*  4 |     HASH JOIN         |      |    70 |   840 |     5 |
|   5 |      TABLE ACCESS FULL| T1   |    70 |   420 |     2 |
|   6 |      TABLE ACCESS FULL| T2   |    70 |   420 |     2 |
|   7 |     TABLE ACCESS FULL | T3   |    70 |   420 |     2 |
|   8 |    TABLE ACCESS FULL  | T4   |    70 |   420 |     2 |
--------------------------------------------------------------

leading(t1 t2 t3 t4)
use_hash(t2)    swap_join_inputs(t2)
use_hash(t3)    swap_join_inputs(t3)
use_hash(t4)    swap_join_inputs(t4)
 
--------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost  |
--------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |     1 |    24 |    11 |
|   1 |  SORT AGGREGATE       |      |     1 |    24 |       |
|*  2 |   HASH JOIN           |      |    70 |  1680 |    11 |
|   3 |    TABLE ACCESS FULL  | T4   |    70 |   420 |     2 |
|*  4 |    HASH JOIN          |      |    70 |  1260 |     8 |
|   5 |     TABLE ACCESS FULL | T3   |    70 |   420 |     2 |
|*  6 |     HASH JOIN         |      |    70 |   840 |     5 |
|   7 |      TABLE ACCESS FULL| T2   |    70 |   420 |     2 |
|   8 |      TABLE ACCESS FULL| T1   |    70 |   420 |     2 |
--------------------------------------------------------------

Side note: one of the comments on the original post raised the point that (e.g.) the second of the plans above could have been generated by the join order represented by the hint leading(t2 t1 t3 t4), and that’s absolutely correct. In fact, if you see the plan for a 4 table join consisting of nothing but 3 hash join you cannot be certain what join order the optimizer was examining when it produced that plan unless you look at the CBO trace file (or check to see if it’s been explicitly and completely, hinted in the code).

Having set the scene, we come to the question that prompted this note. The question related to the second plan above, and said:

“However, this one is the plan that takes the most memory in the PGA because it builds 3 hash tables before joining any table, right?”

The quick answer to the question is: “Not necessarily.”

In the original article I had pointed out that plans following the pattern of the first plan above with N tables and N – 1 joins would have at most two build tables in memory at any one moment while the bottom plan would create N – 1 build tables in memory before any join results could be created. This does rather invite the inference that the bottom plan has to be the one that is going to use most PGA memory, but what I had said in the original article was (emphasis added):

“Notice: the number of in-memory hash (build) tables we have in the first of the 8 plans at any instant after the first join starts is two and (no matter how many tables are involved in this pattern) the number of in-memory hash tables will always be two. The actual size of the two hash tables is a little unpredictable and, as a very crude guideline, you might expect the size to grow as more tables are joined into the result set.

As a thought experiment, consider 3 small dimension tables and one big fact table. If Oracle were to create in-memory hash tables from the three dimension tables and then start scanning the fact table (following the pattern of the second plan above with t1 in the role of the fact table) probing each of the dimension tables in turn, it could deliver the first result row very quickly without requiring more memory to store intermediate results.

Conversely if Oracle were to create a tiny in-memory hash table from the first dimension and probe it with the fact table (following the pattern of the first plan above with t2 in the role of the fact table) Oracle would then have to build a very large in-memory hash table before before probing it with the second dimension table, and as that second join takes place it would be generating a new result set that would become the next big in-memory hash table.

In this thought experiment we would probably find that the optimizer did the right thing without prompting and constructed three tiny in-memory hash tables – but it’s not always so clear-cut, and even the “right” decision can result in very large intermediate build tables (and if those intermediate build tables spill to disc in an Exadata environment the change in performance can be huge).

To finish off, here’s a script to turn the thought experiment into a concrete example (by careful, but deliberately silly, hinting).

rem
rem     Script:         c_treblehash_3.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem 

create table t1 
as
select
        rownum          id,
        to_char(rownum) small_vc,
        rpad('x',100)   padding
from
        all_objects
where
        rownum <= 70
;

alter table t1 
        add constraint t1_pk primary key(id)
;

create table t2
as
select
        rownum          id,
        to_char(rownum) small_vc,
        rpad('x',100)   padding
from
        all_objects
where
        rownum <= 70
;

alter table t2
        add constraint t2_pk primary key(id)
;

create table t3
as
select
        rownum          id,
        to_char(rownum) small_vc,
        rpad('x',100)   padding
from
        all_objects
where
        rownum <= 70
;

alter table t3
        add constraint t3_pk primary key(id)
;

create table t4
nologging
as
select
        t1.id                   id1,
        t2.id                   id2,
        t3.id                   id3,
        rpad(rownum,10)         small_vc,
        rpad('x',100)           padding
from
        t1, t2, t3
;

connect test_user/test
set linesize 156
set serveroutput off

select
        /*+ 
                leading(t4 t1 t2 t3)
                full(t4) 
                use_hash(t1) full(t1) swap_join_inputs(t1)
                use_hash(t2) full(t2) swap_join_inputs(t2)
                use_hash(t3) full(t3) swap_join_inputs(t3) 
        */
        count(t1.small_vc),
        count(t2.small_vc),
        count(t3.small_vc),
        count(t4.small_vc)
from
        t4,     
        t1,     
        t2,     
        t3
where
        t1.id = t4.id1
and     t2.id = t4.id2
and     t3.id = t4.id3
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

column pga_max_mem format 99,999,999

select pga_max_mem from v$process where addr = (
                select paddr from v$session where sid = (
                        select sid from V$mystat where rownum = 1
                )
        )
;

connect test_user/test
set linesize 156
set serveroutput off

select
        /*+ 
                leading(t4 t1 t2 t3)
                full(t4) 
                use_hash(t1) full(t1) no_swap_join_inputs(t1)
                use_hash(t2) full(t2) no_swap_join_inputs(t2)
                use_hash(t3) full(t3) no_swap_join_inputs(t3) 
        */
        count(t1.small_vc),
        count(t2.small_vc),
        count(t3.small_vc),
        count(t4.small_vc)
from
        t4,     
        t1,     
        t2,     
        t3
where
        t1.id = t4.id1
and     t2.id = t4.id2
and     t3.id = t4.id3
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

column pga_max_mem format 99,999,999

select pga_max_mem from v$process where addr = (
                select paddr from v$session where sid = (
                        select sid from V$mystat where rownum = 1
                )
        )
;

All I’ve done is create three small “dimension” tables of 70 rows each then created a table which is their Cartesian join, which produces a “fact” table of 343,000 rows. Then I’ve written a simple query to join the three dimension tables to the fact table.

I’ve used swap_join_inputs() for all the joins in one version of the query, and no_swap_join_inputs() in the other versions; and I’ve reconnected to the data before running each query to make it easier to see the different impact on the PGA of the two plans.

I’ve use dbms_xplan.display_cursor() to pull the execution plans from memory after running the queries, and since I’ve not set statistics_level to all, or added the hint /*+ gather_plan_statistics */ to the queries the only rowsource execution statistics I’ll get are the PGA usage: estimates and actuals.

The final step of each test reports the maximum PGA memory usage the session reached in the course of the test.

Here are the two plans, each followed by the PGA maximum memory size.

--------------------------------------------------------------------------
| Id  | Operation             | Name | E-Rows |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |        |       |       |          |
|   1 |  SORT AGGREGATE       |      |      1 |       |       |          |
|*  2 |   HASH JOIN           |      |    343K|  2171K|  2171K| 1684K (0)|
|   3 |    TABLE ACCESS FULL  | T3   |     70 |       |       |          |
|*  4 |    HASH JOIN          |      |    343K|  2171K|  2171K| 1681K (0)|
|   5 |     TABLE ACCESS FULL | T2   |     70 |       |       |          |
|*  6 |     HASH JOIN         |      |    343K|  2171K|  2171K| 1711K (0)|
|   7 |      TABLE ACCESS FULL| T1   |     70 |       |       |          |
|   8 |      TABLE ACCESS FULL| T4   |    343K|       |       |          |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T3"."ID"="T4"."ID3")
   4 - access("T2"."ID"="T4"."ID2")
   6 - access("T1"."ID"="T4"."ID1")

PGA_MAX_MEM
-----------
 13,859,925

--------------------------------------------------------------------------
| Id  | Operation             | Name | E-Rows |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |        |       |       |          |
|   1 |  SORT AGGREGATE       |      |      1 |       |       |          |
|*  2 |   HASH JOIN           |      |    343K|    23M|  3727K|   29M (0)|
|*  3 |    HASH JOIN          |      |    343K|    23M|  3727K|   29M (0)|
|*  4 |     HASH JOIN         |      |    343K|    23M|  3667K|   29M (0)|
|   5 |      TABLE ACCESS FULL| T4   |    343K|       |       |          |
|   6 |      TABLE ACCESS FULL| T1   |     70 |       |       |          |
|   7 |     TABLE ACCESS FULL | T2   |     70 |       |       |          |
|   8 |    TABLE ACCESS FULL  | T3   |     70 |       |       |          |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T3"."ID"="T4"."ID3")
   3 - access("T2"."ID"="T4"."ID2")
   4 - access("T1"."ID"="T4"."ID1")

PGA_MAX_MEM
-----------
 52,984,917

As you can see, each of the hash joins in the first plan required roughly 1.7MB of memory. All three hash tables would have been in existence at the same time, giving a total of about 5.1MB of memory for the query. The session’s maximum PGA usage shows up as 13MB, of which 5MB was my basic “startup” PGA, leaving 3MB “unaccounted”.

In comparison, each of the hash joins in the second plan required roughly 29MB, although only two of the hash tables would have been in existence at any one moment. That’s still an allocation of 58MB for the same basic 4 table join. In fact things aren’t quite as bad as they seem in this case since the maximum PGA allocated was only about 52MB (again with 5MB of startup PGA). The apparent contradiction may be due to the way that Oracle allocates PGA in increasing chunks – the 29MB reported may have been the result of the session doing something like: “I’ve reached 23MB, my next allocation will be 4MB, oops, I only needed another 128KB)”

As a final check of activity, I’ve also run a couple of queries against V$sql_workarea – a convenient memory structure you can examine to get some “recent history” of queries that have been using large amount of memory, or spilling to disc. In this case I’ve query the structure by sql_id for the two queries, reporting just a little detail about the last execution and memory usage.

SQL> select operation_id, operation_type, last_memory_used, last_execution  from V$sql_workarea where sql_id = 'dcc01q28gcbmy';

OPERATION_ID OPERATION_TYPE                   LAST_MEMORY_USED LAST_EXECUTION
------------ -------------------------------- ---------------- ----------------------------------------
           2 HASH-JOIN                                 1724416 OPTIMAL
           4 HASH-JOIN                                 1721344 OPTIMAL
           6 HASH-JOIN                                 1752064 OPTIMAL


SQL> select operation_id, operation_type, last_memory_used, last_execution  from V$sql_workarea where sql_id = 'b52uwjz07fwhk';

OPERATION_ID OPERATION_TYPE                   LAST_MEMORY_USED LAST_EXECUTION
------------ -------------------------------- ---------------- ----------------------------------------
           2 HASH-JOIN                                30930944 OPTIMAL
           3 HASH-JOIN                                30945280 OPTIMAL
           4 HASH-JOIN                                30650368 OPTIMAL

As you can see, the view reports each plan operation (with id and type) that needed memory for an SQL workarea – and the numbers confirm the 1.7MB and 29MB reported by the execution plans. (Which is not surprising since it’s exactly these figures that are used to supply the details in the plans.)

You need to be a little cautious with this view in older versions of Oracle – it used to do a “full tablescan” of the library cache, which resulted in a lot of library cache latch activity and could cause some contention if you ran it frequently, but it’s a very useful view for finding queries that are doing unexpectedly large sorts or hash joins, and one you might query occasionally if you see any space management threats in the temporary tablespace.

November 12, 2020

rowsets

Filed under: Execution plans,Infrastructure,Oracle,Performance — Jonathan Lewis @ 12:35 pm GMT Nov 12,2020

Here’s a little demonstration of the benefit of rowsets. It started with a very simple question that arrived in my in-box from Kaley Crum.

  • “Why does the query with the UNION ALL take so much longer than the query without the UNION ALL?”

Here are the two queries – surely they should take virtually the same amount of time.

select count(*)
from (select /*+ no_merge */
             object_id
      from drop_me);


select count(*)
from (select /*+ no_merge */
             object_id
      from drop_me
      --
      union all
      --
      select 1 from dual);

Look closely – the difference between the two queries is just the addition through a UNION ALL of “select 1 from dual”. The first query took about 3.5 seconds to complete (there were 70M rows in the table), and the second took nearly 69 seconds.

Here are the execution plans showing the rowsource execution statistics (and Query Block // Object Alias information) – first the baseline query:

QL_ID  0ph1vfuuxkbqb, child number 0
-------------------------------------
select count(*) from (select /*+ no_merge */              object_id    
  from drop_me)
 
Plan hash value: 3609429292
 
--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name    | Starts | E-Rows | Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |      1 |        | 26465 (100)|          |      1 |00:00:03.34 |   95896 |  95891 |
|   1 |  SORT AGGREGATE     |         |      1 |      1 |            |          |      1 |00:00:03.34 |   95896 |  95891 |
|   2 |   VIEW              |         |      1 |     70M| 26465   (2)| 00:00:02 |     70M|00:00:03.32 |   95896 |  95891 |
|   3 |    TABLE ACCESS FULL| DROP_ME |      1 |     70M| 26465   (2)| 00:00:02 |     70M|00:00:03.27 |   95896 |  95891 |
--------------------------------------------------------------------------------------------------------------------------
 
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   2 - SEL$2 / from$_subquery$_001@SEL$1
   3 - SEL$2 / DROP_ME@SEL$2

And then the UNION ALL query:

SQL_ID  0chdajr28y0ub, child number 0
-------------------------------------
select count(*) from (select /*+ no_merge */              object_id    
  from drop_me       --       union all       --       select 1 from
dual)
 
Plan hash value: 3408528233
 
---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name    | Starts | E-Rows | Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |      1 |        | 26467 (100)|          |      1 |00:01:18.58 |   95896 |  95891 |
|   1 |  SORT AGGREGATE      |         |      1 |      1 |            |          |      1 |00:01:18.58 |   95896 |  95891 |
|   2 |   VIEW               |         |      1 |     70M| 26467   (2)| 00:00:02 |     70M|00:01:10.84 |   95896 |  95891 |
|   3 |    UNION-ALL         |         |      1 |        |            |          |     70M|00:00:53.13 |   95896 |  95891 |
|   4 |     TABLE ACCESS FULL| DROP_ME |      1 |     70M| 26465   (2)| 00:00:02 |     70M|00:00:19.28 |   95896 |  95891 |
|   5 |     FAST DUAL        |         |      1 |      1 |     2   (0)| 00:00:01 |      1 |00:00:00.01 |       0 |      0 |
---------------------------------------------------------------------------------------------------------------------------
 
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   2 - SET$1 / from$_subquery$_001@SEL$1
   3 - SET$1
   4 - SEL$2 / DROP_ME@SEL$2
   5 - SEL$3 / DUAL@SEL$3

This is not a trick question – I’ve got the SQL to create the data set and run the test; and I’ve run the test through 19.3.0.0 and 12.2.0.1 with results very similar to the above. Here’s a slightly modified version of the script – you’ll notice the column name has changed because I’ve used my usual “large data set” generator rather than scaling up a clone of all_objects:

rem
rem     Script:         rowset_kaley.sql
rem     Author:         Kaley Crum / JP Lewis
rem     Dated:          Oct 2020
rem
rem     Last tested 
rem             19.3.0.0
rem

create table drop_me
pctfree 0
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum  n1
from
        generator
cross join
        generator
where rownum <= 7e7
;

alter session set statistics_level = all;

set timing on
set serveroutput off

prompt  ===========================================
prompt  Baseline: Simple query without a union all.
prompt  Runs in 3-4 seconds
prompt  ===========================================

select
         count(*)
from (select /*+ no_merge */
             n1
      from drop_me);

select *
from table(dbms_xplan.display_cursor(format => 'allstats last advanced'));

prompt  ==========================================
prompt  Add a UNION ALL for one record from dual.
Prompt  Runs in over a minute
prompt  ==========================================

pause Press return

select
         count(*)
from (select /*+ no_merge */
             n1
      from drop_me
      --
      union all
      --
      select 1 from dual);

select *
from table(dbms_xplan.display_cursor(format => 'allstats last advanced'));

There are two factors involved in the massive variation in performance. The first factor is one that the test case will report – though I didn’t show it in the output above, the second is that the test case has enabled rowsource execution statistics.

Here’s the big clue – from the Column Projection Information, which is one of the things that appears with the “advanced” format option (or when you add the “projection” format option) in the call to dbms_xplan.display_xxx()

First for the fast query:

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=0) COUNT(*)[22]
   2 - (rowset=1019)
   3 - (rowset=1019)

The rowsource passes 1,019 rows at a time from the tablescan operation to the view operation and from the view operation to the sort operation, for a total of roughly 70,000 calls for each of those two steps of the plan.

Compare this to the projection information for the slow UNION ALL query:

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=0) COUNT(*)[22]

I haven’t missed a bit in copying – this really is it. The rows move up the stack one at a time, not in a rowset array. That’s 70,000,000 subroutime calls for each of the two steps of the plan.

I’m sure most of us have heard the mantra “row by row is slow by slow” (or some variant on the theme). This is true all the way down to the internal levels of an execution plan.

Of course 70M calls vs. 70K calls shouldn’t really explain a difference of more than 60 seconds – but when the calls to the O/S for timing information for rowsource execution stats are repeated for each call (or even if it’s only a 1% sample of the calls) down the stack that’s where a lot of time can disappear.

On my laptop, running Oracle 19.3 in a VM, this is how my timing went:

  • Fast query: 1.94 seconds, dropping to 1.79 seconds when I disabled rowsource execution stats.
  • Slow query: 65.91 seconds, dropping to a far less astounding 3.26 seconds when I disabled rowsource execution stats.

So there really is a noticeable difference between row-by-row and array-processing but most of the difference in the original test came from using rowsource execution statistics to measure how much of a difference there would be.

It’s also worth mentioning that this is probably the most extreme case you could produce to show the difference – using the largest possible rowset size with the smallest possible rows when you want to mazimise the gap between (internal) array processing and single row processing – and then processing a very large number of rows. [But isn’t ridiculously large numbers of rows what you do with Exadata?].

If you want further evidence that the difference is due to the rowset size you can always alter session set “_rowsets_enabled”=false; and watch the fast query slow down. It will take about half the time of the slow query as it only has to pass 70M rows up one step of the plan rather than the two steps that are in the UNION ALL plan.

In theory it looks as if you could also restrict the size of the rowset by setting _rowsets_max_rows or _rowsets_target_maxsize, but since the default value for the former is 256 the results above suggest that the settings might be ignored, and when I tried adjusting them at the session level nothing changed.

Another test you could run is to adjust (alter session) the parameter “_rowsource_statistics_sampfreq”, When I set this to 100 the time for the slow query dropped to about 21 seconds (and the fast query – with rowsets disabled – dropped to about 11 seconds).

Footnote

It’s always easy to hit an accidental special case without realising it, so when you’re testing something it’s important to think about what’s special in the example. I came up with two possibilities in this case – dual is always going to be a very special case in any circumstances, and I’ve got a table in my query that is guaranteed to return exactly one row. So I repeated the test with:

  • a real table with one row and primary key in case the reference to dual was disabling rowsets
  • a real table with 100 rows in case this enabled a rowset size of 100

In neither case did Oracle start using array processing.

Footnote 2

One of the patterns I see occasionally follows the form of the slow query above. Some front-end tool executes a select statement that starts with a select {list of desired column headings} from dual so that the front-end code doesn’t have to be programmed to create headings as it receives the data.

Although the impact this would have on rowsets looks like a large percentage of the work done in the database in this case, in more realistic cases it would probably be irrelevant, until someone starts testing with the statistics_level set to all, or _rowsource_execution_statistics explicitly enabled.

There is a related effect, though, from the SQL Monitor feature (which activates automatically for a statement if it’s executing in parallel, or if it’s predicted to take more than 5 seconds to complete). When I added the /*+ monitor */ hint to the two queries (and didn’t fiddle with the rowsets parameter) the fast query averaged 2.26 seconds instead of 1.79 seconds, and the slow query averaged 3.51 seconds instead of 3.26 seconds. I have to say, though, that the variation between executions of the same statement was quite significant compared the variation produced by adding or eliminating the hint.

October 10, 2020

Interval Oddity

Filed under: CBO,Execution plans,Oracle,Partitioning — Jonathan Lewis @ 2:51 pm BST Oct 10,2020

Interval partitioning is a popular strategy for partitioning date-based data. It’s an enhanced variant of range partitioning that allows you to define a starting partition and an interval that should be used to derive the high values for all subsequent partitions – and Oracle doesn’t even have to create intervening partitions if you insert data that goes far beyond the current partition, it automatically creates exactly the right partition (with the correct high_value and correctly inferred lower boundary) for the incoming data and behaves as if the intervening partitions will become available when they’re needed at some later point in time. So no need for DBAs to work out actual partition high_values, no need to ensure that all the partitions you need out into the future have been pre-created, no more batch processes crashing with Oracle error ORA-14400: inserted partition key does not map to any partition.

But there’s a surprising difference between traditional range partitioning and the newer interval partitioning that will increase CPU usage in some cases and may (though I haven’t yet investigated this in sufficient detail to create an example) lead to variations in execution plans.

To demonstrate the difference I’m going to create two tables with the same structure and content then run the same query against them and show you the resulting execution plans. Here’s the code to create and populate the tables:

rem
rem     Script:         interval_or_range.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem
rem     Last tested 
rem             19.3.0.0
rem


create table t_interval(
        order_date      date not null,
        order_id        number(10,0) not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range (order_date) 
interval (numtoyminterval(1,'MONTH'))
subpartition by hash (order_id) subpartitions 4
        (
                partition start_p1 values less than (to_date('01-Jan-2020','dd-mon-yyyy'))
        )
;

create table t_range(
        order_date      date not null,
        order_id        number(10,0) not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range (order_date) 
subpartition by hash (order_id) subpartitions 4
        (
                partition start_p1 values less than (to_date('01-Jan-2020','dd-mon-yyyy')),
                partition start_p2 values less than (to_date('01-Feb-2020','dd-mon-yyyy')),
                partition start_p3 values less than (to_date('01-Mar-2020','dd-mon-yyyy')),
                partition start_p4 values less than (to_date('01-Apr-2020','dd-mon-yyyy')),
                partition start_p5 values less than (to_date('01-May-2020','dd-mon-yyyy')),
                partition start_p6 values less than (to_date('01-Jun-2020','dd-mon-yyyy')),
                partition start_p7 values less than (to_date('01-Jul-2020','dd-mon-yyyy')),
                partition start_p8 values less than (to_date('01-Aug-2020','dd-mon-yyyy')),
                partition start_p9 values less than (to_date('01-Sep-2020','dd-mon-yyyy'))
        )
;

insert into t_range select
        to_date('01-Jan-2020','dd-mon-yyyy') + rownum,
        rownum,
        lpad(rownum,10,'0'),
        lpad('x',100,'x')
from
        all_objects
where
        rownum <= 240 -- > comment to avoid wordpress format issue  
;

insert into t_interval select
        to_date('01-Jan-2020','dd-mon-yyyy') + rownum,
        rownum,
        lpad(rownum,10,'0'),
        lpad('x',100,'x')
from
        all_objects
where
        rownum <= 240 -- > comment to avoid wordpress format issue
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T_RANGE',
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T_INTERVAL',
                method_opt  => 'for all columns size 1'
        );
end;
/

I’ve used composite partitioned tables in this example but the same anomaly appears with simple partitioning – which you can test by commenting out the “subpartion by ..” lines. The two tables have partitions defined to hold a month’s data. I’ve inserted a few rows into most of the partitions, and ensured that I haven’t accidentally attempted to insert data that falls outside the legal range of the table with the predefined partitions .

To show that the tables are nominally identical here’s the list of partitions with their high values and number of rows:

column table_name       format a15
column partition_name   format a15
column high_value       format a80

break on table_name skip 1

select
        table_name, partition_name, num_rows, high_value
from
        user_tab_partitions
where
        table_name in ('T_INTERVAL','T_RANGE')
order by
        table_name, partition_name
;


ABLE_NAME      PARTITION_NAME    NUM_ROWS HIGH_VALUE
--------------- --------------- ---------- --------------------------------------------------------------------------------
T_INTERVAL      START_P1                 0 TO_DATE(' 2020-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10722              30 TO_DATE(' 2020-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10723              29 TO_DATE(' 2020-03-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10724              31 TO_DATE(' 2020-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10725              30 TO_DATE(' 2020-05-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10726              31 TO_DATE(' 2020-06-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10727              30 TO_DATE(' 2020-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10728              31 TO_DATE(' 2020-08-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10729              28 TO_DATE(' 2020-09-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA

T_RANGE         START_P1                 0 TO_DATE(' 2020-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P2                30 TO_DATE(' 2020-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P3                29 TO_DATE(' 2020-03-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P4                31 TO_DATE(' 2020-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P5                30 TO_DATE(' 2020-05-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P6                31 TO_DATE(' 2020-06-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P7                30 TO_DATE(' 2020-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P8                31 TO_DATE(' 2020-08-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P9                28 TO_DATE(' 2020-09-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA


18 rows selected.

With this setup we can now run a simple query against the two tables using a where clause that is an exact match for the partition definition for the July data:

set serveroutput off

select 
        count(*) 
from 
        t_range 
where   order_date >= date '2020-07-01'  -- to_date('01-Jul-2020','dd-mon-yyyy')
and     order_date <  date '2020-08-01'  -- to_date('01-Aug-2020','dd-mon-yyyy')
/

select * from table(dbms_xplan.display_cursor);

select 
        count(*) 
from 
        t_interval 
where   order_date >= date '2020-07-01'  -- to_date('01-Jul-2020','dd-mon-yyyy')
and     order_date <  date '2020-08-01'  -- to_date('01-Aug-2020','dd-mon-yyyy')
/

select * from table(dbms_xplan.display_cursor);

I’ve left in two versions of the date predicates – the “ANSI” style, and a strictly formatted “to_char()” style that is still the approached used most commonly in Oracle systems. The effect on the execution plans is the same for both variations of the date declaration, and here are the two execution plans – reported from an instance of 19.3.0.0:

---------------------------------------------------------------------------------------------------
| Id  | Operation               | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |         |       |       |   130 (100)|          |       |       |
|   1 |  SORT AGGREGATE         |         |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE|         |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
|   3 |    TABLE ACCESS FULL    | T_RANGE |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
---------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |            |       |       |   130 (100)|          |       |       |
|   1 |  SORT AGGREGATE         |            |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE|            |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
|*  3 |    TABLE ACCESS FULL    | T_INTERVAL |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(("ORDER_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DATE"<TO_DATE(' 2020-08-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))

It’s quite possible that you won’t notice the difference these plans, even though I’ve put them directly one below the other; and it’s very likely that most people wouldn’t think about the difference if they didn’t have the two plans available at the same time to compare.

If you can’t spot the difference it’s because I’ve deliberately arranged them in a way that might fool you into not looking carefully enough.

If you’ve spotted the difference it’s probably because you’ve examined the Predicate Information section. There isn’t one for the the t_range example but there is one for the t_interval example – and it was a little sneaky of me to make it easy for you to assume that the one set of predicates I showed was common to the two plans.

Since the plans have been pulled from memory (v$sql_plan) it seems likely that they are truthful and the query based on the interval partitioning is actually checking every row in the partition against two date values. In my example that’s not going to make much difference to performance, but in a datawarehouse with millions of rows per partition there’s scope for a noticeable increase in CPU between the two queries – especially since the block reads are likely to be direct path, bypassing much of the usual (CPU -intensive) buffer cache activity.

Interestingly when I checked the 10053 trace files for the two queries the CPU cost for the pair was identical. Normally you expect to see at least a little CPU cost (even if it’s only tens of units out of millions) for checking a predicate. This raises the question – does the optimizer allow a cost for the t_range table for an event that isn’t going to happen, or is it reporting an event that isn’t going to happen for the t_interval table. [Edit: easy enough to check by doing the test on a simple heap table that clones the data from that one partition, and checking the calculated cost with and without the predicate]

There’s a further ramification to this anomaly, relating to the question on the Oracle-L list server that prompted the investigation. What happens on Exadata where the predicates can be offloaded to storage?

The original question wasn’t actually about the appearance (or not) of the predicates, it was about an unexpected cardinality estimate for a query involving two equi-partitioned tables, so this note has gone completely off-topic from the question; but another strange detail about the predicates showed up when I suggested the creation of a column group on the join columns. A (redundant) predicate that had been offloaded to storage stopped being offloaded; here, from the original posting with the original tables, are the two different Predicate Information sections that appeared – the first without the column group, the second after the column group had been created and its stats collected:

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("ORDER_DETL"."ORDR_RQST_NUMBER"="ORDER_REQST"."ORDR_RQST_NUMBER" AND
              "ORDER_DETL"."ORDR_RQST_DATE"="ORDER_REQST"."ORDR_RQST_DATE")
   6 - storage("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   8 - storage("ORDER_DETL"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DETL"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("ORDER_DETL"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DETL"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Predicate Information (identified by operation id):
---------------------------------------------------
   4 -
access("ORDER_DETL"."ORDR_RQST_NUMBER"="ORDER_REQST"."ORDR_RQST_NUMBER" AND
              "ORDER_DETL"."ORDR_RQST_DATE"="ORDER_REQST"."ORDR_RQST_DATE")
   6 - storage("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<TO_DATE(' 2020-08-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<TO_DATE(' 2020-08-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   8 - filter("ORDER_DETL"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DETL"."ORDR_RQST_DATE"<TO_DATE(' 2020-08-01 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))

Notice how the storage() predicate that appears at operation 8 in the first set has disappeared from the second even though operation 6 manages to retain its storage() predicate throughout.

I’ve modelled a further example of odd behaviour using two pairs of tables – one pair using range/hash partitioning the other interval/hash partitioning. I won’t reproduce it here but the problem of redundant predicates appearing and then disappearing gets even stranger.

I haven’t yet produced an example where the unexpected predicate behaviour has affected the costs or cardinalities – but I’ve only spent a couple of hours playing around with well-formed examples: it’s possible that in badly formed examples (e.g. with statistical inconsistencies) the side effect could mean that two notionally identical queries produce different numbers and different plans because they end up with different predicates in the final transformed query.

Footnote

Following an email from David Kurtz, it occurred to me that I should have made it clear that the disappearance of predicates on the partition key is expected behaviour when the predicates are clearly synchronised with the partition boundaries. The behaviour for the interval partitioning is the oddity, the behaviour for the “normal” range partitioning is the standard.

October 1, 2020

Index FFS Cost

Filed under: Bugs,CBO,Execution plans,Indexing,Oracle — Jonathan Lewis @ 11:46 am BST Oct 1,2020

There are a number of unexpected issues with the optimizer’s treatment of the index fast full scan, the access path where Oracle ignores the structure of the B-tree and uses multiblock reads to do a brute-force segment scan as if the index were a “skinny table” with a few blocks of irrelevant garbage (i.e. the branch blocks) that could be ignored.

A recent comment made a few days on a blog about the optimizer’s “index-join” access path reminded me that I had a few notes to finish and publish that might help some people address performance issues relating to a couple of little-known fixes for optimizer oddities in this area. Today’s “fix” is one that appeared in 10g (or maybe very late in 9i) but never became the default optimizer behaviour (possibly because it’s one of those “not a bug” types of bug with no “always correct” strategy).

The problem revolves around the optimizer_index_cost_adj parameter (which is another reason why the fix might have been kept out of the standard code base). For index-only execution plans the parameter will apply to the index range scan, full scan, or skip scan, but not to the index fast full scan. Here’s a model to demonstrate this:


rem
rem     Script:         oica_iffs.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2012
rem
rem     Last tested 
rem             19.3.0.0
rem

create table t1
as
with generator as (
        select  --+ materialize
                rownum id 
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        lpad(trunc(dbms_random.value(0,10000)),10)      v1,
        rpad('x',20)                                    padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid wordpress format issue
;

create index t1_i1 on t1(v1);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                cascade          => true,
                method_opt       => 'for all columns size 1'
        );
end;
.

select blocks from user_tables where table_name = 'T1';
select leaf_blocks from user_indexes where index_name = 'T1_I1';

I’ve created a fairly narrow table with 1M rows, and created an index on that table on the v1 column. In my test there were 5,219 blocks in the table and 3,068 blocks in the index.

I’m now going to run a query to count the rows in the table where v1 is not null and report its execution plan. The first run will be unhinted, the second run will include a hint that forces an index full scan, and the third run will include a hint to force an index fast full scan.


prompt  ==============
prompt  Unhinted query
prompt  ==============

select 
        count(*) 
from
        t1
where
        v1 is not null
;

prompt  ==========
prompt  Index hint
prompt  ==========

select 
        /*+ index(t1) */
        count(*) 
from
        t1
where
        v1 is not null
;

prompt  ==============
prompt  index_ffs hint
prompt  ==============

select 
        /*+ index_ffs(t1) */
        count(*) 
from
        t1
where
        v1 is not null
;

Here are the three execution plans. As you might have predicted the default plan (for my Oracle 19.3) is the index fast full scan:


==============
Unhinted query
==============
-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |   450  (15)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|    10M|   450  (15)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1" IS NOT NULL)

==========
Index hint
==========
--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |     1 |    11 |  3159   (3)| 00:00:01 |
|   1 |  SORT AGGREGATE  |       |     1 |    11 |            |          |
|*  2 |   INDEX FULL SCAN| T1_I1 |  1000K|    10M|  3159   (3)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1" IS NOT NULL)

==============
index_ffs hint
==============
-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |   450  (15)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|    10M|   450  (15)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1" IS NOT NULL)

Since the size of the index is less than the size of the table the cost of the index fast full scan will be lower than the cost of a tablescan; and since an index fast full scan does multiblock reads while the index full scan is assumed to use single block reads the cost of the index fast full scan will be lower than the cost of the index full scan.

Now we set the optimizer_index_cost_adj to 1 (as happens in a number of “legacy” commercial products) and repeat the experiment. Here are the three plans (reduced to just the body of the plan):

==================
OICA = 1, Unhinted
==================
--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |     1 |    11 |    32   (4)| 00:00:01 |
|   1 |  SORT AGGREGATE  |       |     1 |    11 |            |          |
|*  2 |   INDEX FULL SCAN| T1_I1 |  1000K|    10M|    32   (4)| 00:00:01 |
--------------------------------------------------------------------------

====================
OICA = 1, index hint
====================
--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |     1 |    11 |    32   (4)| 00:00:01 |
|   1 |  SORT AGGREGATE  |       |     1 |    11 |            |          |
|*  2 |   INDEX FULL SCAN| T1_I1 |  1000K|    10M|    32   (4)| 00:00:01 |
--------------------------------------------------------------------------

========================
OICA = 1, index_ffs hint
========================
-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |   450  (15)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|    10M|   450  (15)| 00:00:01 |
-------------------------------------------------------------------------------

The default plan is now the index full scan, not the index fast full scan; and when we look at the two hinted plans we can see why. The full scan cost has been reduced to 1% of the original cost (as dictated by the setting for optimizer_index_cost_adj); the fast full scan cost has not been adjusted.

So is this a good thing or a bad thing? Different people may answer that question differently. There are clearly likely to be cases where changing the setting for the optimizer_index_cost_adj (which you shouldn’t have been doing anyway for the last 20 years or so) will result in some plans switching from fast full scans to full scans with unfortunate consequences;

On the other hand when you do a fast full scan it may do direct path reads, while the full scan will do cached reads, and the cache may be holding a lot of the index leaf blocks already, so some people might prefer the current behaviour;

On the other, other hand an Exadata system can use storage indexes when doing direct path segment scans (even when the segment is an index) so an Exadata fast full scan might be far more desirable than a partially cached full scan.

So no generic right answer, only specific answers for particular sets of circumstances.

Solutions

This issue was addressed some time in the 10g timeline- my notes say the earliest successful tests I did of the fix were on 10.2.0.4, and that it wasn’t available in 9.2.0.8. There are two possibilities – a fix control, and an event.

  • Fix control: 4483286 – descibed as “Discount FFS cost using optimizer_index_cost_adj”
  • Event: 38085 at level one. I don’t think this event has ever appeared in the oraus.msg file, but it is referenced by bug 4483286

Of the two, setting the fix control is probably the safer option since the event is (apparently) non-existent and there’s always the possibility that the number will be hi-jacked in a future release to do something completely different. It’s also worth noting that the fix control can be applied to an individual statement through the /*+ opt_param() */ hint (there’s an example of the syntax towards the end of another article of mine).

A little demonstration:


-- alter session set events '38085 trace name context forever, level 1';
alter session set "_fix_control"='4483286:1';
alter session set optimizer_index_cost_adj = 1;

set autotrace traceonly explain

prompt  ==============
prompt  Unhinted query
prompt  ==============

select 
        count(*) 
from
        t1
where
        v1 is not null
;

-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|    10M|     4   (0)| 00:00:01 |
-------------------------------------------------------------------------------

With the event or _fix_control set the cost of the index fast full scan drops to 1%, and the path reappears without being hinted.

 

September 30, 2020

Serial Bloom

Filed under: Execution plans,Oracle,Parallel Execution,Performance — Jonathan Lewis @ 1:35 pm BST Sep 30,2020

Following the recent note I wrote about an enhancement to the optimizer’s use of Bloom filters, I received a question by email asking about the use of Bloom filters in serial execution plans:

I’m having difficulty understanding the point of a Bloom filter when used in conjunction with a hash join where everything happens within the same process.

I believe you mentioned in your book (Cost Based Oracle) that hash joins have a mechanism similar to a Bloom filter where a row from the probe table is checked against a bitmap, where each hash table bucket is indicated by a single bit. (You have a picture on page 327 of the hash join and bitmap, etc).

The way that bitmap looks and operates appears to be similar to a Bloom filter to me…. So it looks (to me) like hash joins have a sort of “Bloom filter” already built into them.

My question is… What is the advantage of adding a Bloom filter to a hash join if you already have a type of Bloom filter mechanism thingy built in to hash joins?

I can understand where it would make sense with parallel queries having to pass data from one process to another, but if everything happens within the same process I’m just curious where the benefit is.

 

The picture on page 327 of CBO-F is a variation on the following, which is the penultimate snapshot of the sequence of events in a multi-pass hash join. The key feature is the in-memory bitmap at the top of the image describing which buckets in the (partitioned and spilled) hash table hold rows from the build table. I believe that it is exactly this bitmap that is used as the Bloom filter.

The question of why it might be worth creating and using a Bloom filter in a simple serial hash join is really a question of scale. What is the marginal benefit of the Bloom filter when the basic hash join mechanism is doing all the hash arithmetic and comparing with a bitmap anyway?

If the hash join is running on an Exadata machine then the bitmap can be passed as a predicate to the cell servers and the hash function can be used at the cell server to minimise the volume of data that has to be passed back to the database server – with various optimisations dependent on the version of the Exadata software. Clearly minimising traffic through the interconnect is going to have some benefit.

Similarly, as the email suggests, for a parallel query where (typically) one set of parallel processes will read the probe table and distribute the data to the second set of parallel processes which then do the hash join it’s clearly sensible to allow the first set of procsses to apply the hash function and discard as many rows as possible before distributing the survivors – minimising inter-process communication.

In both these cases, of course, there’s a break point to consider of how effective the Bloom filter needs to be before it’s worth taking advantage of the technology. If the Bloom filter allows 99 rows out of every hundred to be passed to the database server / second set of parallel processes then Oracle has executed the hash function and checked the bitmap 100 times to avoid sending one row (and it will (may) have to do the same hash function and bitmap check again to perform the hash join); on the other hand if the Bloom filter discards 99 rows and leaves only one row surviving then that’s a lot of traffic eliminated – and that’s likely to be a good thing. This is why there are a few hidden parameters defining the boundaries of when Bloom filters should be used – in particular there’s a parameter “_bloom_filter_ratio” which defaults to 35 and is, I suspect, a figure which says something like “use Bloom filtering only if it’s expected to reduce the probe data to 35% of the unfiltered size”.

So the question then becomes: “how could you benefit from a serial Bloom filter when it’s the same process doing everything and there’s no “long distance” traffic going on between processes?” The answer is simply that we’re operating at a much smaller scale. I’ve written blog notes in the past where the performance of a query depends largely on the number of rows that are passed up a query plan before being eliminated (for example here, where the volume of data moving results in a significant fraction of the total time).

If you consider a very simple hash join its plan is going to be shaped something like this:


-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |   720 |    31 |
|*  1 |  HASH JOIN         |      |    45 |   720 |    31 |
|*  2 |   TABLE ACCESS FULL| T2   |    15 |   120 |    15 |
|   3 |   TABLE ACCESS FULL| T1   |  3000 | 24000 |    15 |
-----------------------------------------------------------

If you read Tanel Poder’s article on execution plans as a tree of Oracle function calls you’ll appreciate that you could translate this into informal English along the lines of:

  • Operation 1 calls a function (at operation 2) to do a tablescan of t1 and return all the relevant rows, building an in-memory hash table by applying a hashing function to the join column(s) of each row returned by the call to the tablescan. As the hash table is populated the operation also constructs a bitmap to flag buckets in the hash table that have been populated.
  • Operation 1 then calls a function (at operation 3) to start a tablescan and then makes repeated calls for it to return one row (or, in newer versions, a small rowset) at a time from table t2. For each row returned operation 1 applies the same hash function to the join column(s) and checks the bitmap to see if there’s a potential matching row in the relevant bucket of the hash table, and if there’s a potential match Oracle examines the actual contents of the bucket (which will be stored as a linked list) to see if there’s an actual match.

Taking the figures above, let’s imagine that Oracle is using a rowset size of 30 rows. Operation 1 will have to make 100 calls to Operation 3 to get all the data, and call the hashing function 3,000 times.  A key CPU component of the work done is that the function represented by operation 3 is called 100 times and (somehow) allocates and fills an array of 30 entries each time it is called.

Now assume operation 1 passes the bitmap to operation 3 as an input and it happens to be a perfect bitmap. Operation 3 starts its tablescan and will call the hash function 3,000 times, but at most 45 rows will get past the bitmap. So operation 1 will only have to call operation 3 twice.  Admittedly operation 1 will (possibly) call the hash function again for each row – but maybe operation 3 will supply the hash value in the return array. Clearly there’s scope here for a trade-off between the reduction in work due to the smaller number of calls and the extra work needed to take advantage of the bitmap technology.

Here’s an example that shows the potential for savings – if you want to recreate this test you’ll need about 800MB of free space in the database, the first table takes about 300MB and the second about 450MB.


rem
rem     Script:         bloom_filter_serial_02.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,30,'0')             v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7 -- > comment to avoid WordPress format issue
;

create table t2
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        round(rownum + 0.5,2)           id,
        mod(rownum,1e5)                 n1,
        lpad(rownum,10)                 v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7 -- > comment to avoid WordPress format issue
;


prompt  =================
prompt  With Bloom filter
prompt  =================

select 
        /*+ 
                px_join_filter(t1) 
                monitor
        */
        t1.v1, t2.v1
from 
        t2, t1
where 
        t2.n1 = 0
and 
        t1.id = t2.id
/

prompt  ===============
prompt  No Bloom filter
prompt  ===============

select 
        /*+
                monitor
        */
        t1.v1, t2.v1
from 
        t2, t1
where 
        t2.n1 = 0
and 
        t1.id = t2.id
/

I’ve created tables t1 and t2 with an id column that never quite matches, but the range of values is set so that the optimizer thinks the two tables might have a near-perfect 1 to 1 match. I’ve given t2 an extra column with 105 distinct values in its 107 rows, so it’s going to have 100 rows per distinct value. Then I’ve presented the optimizer with a query that looks as if it’s going to find 100 rows in t2 and needs to find a probable 100 rows of matches in t1. For my convenience, and to highlight a couple of details of Bloom filters, it’s not going to find any matches.

In both runs I’ve enabled the SQL Monitor feature with the /*+ monitor */ hint, and in the first run I’ve also hinted the use of a Bloom filter. Here are the resulting SQL Monitor outputs. Bear in mind we’re looking at a reasonably large scale query – volume of input data – with a small result set.

First without the Bloom filter:


Global Stats
================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
================================================================
|    3.00 |    2.24 |     0.77 |     1 |  96484 |  773 | 754MB |
================================================================

SQL Plan Monitoring Details (Plan Hash Value=2959412835)
==================================================================================================================================================
| Id |      Operation       | Name |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                      |      | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT     |      |         |       |         2 |     +2 |     1 |        0 |      |       |     . |          |                 |
|  1 |   HASH JOIN          |      |     100 | 14373 |         2 |     +2 |     1 |        0 |      |       |   2MB |          |                 |
|  2 |    TABLE ACCESS FULL | T2   |      99 |  5832 |         2 |     +1 |     1 |      100 |  310 | 301MB |     . |          |                 |
|  3 |    TABLE ACCESS FULL | T1   |     10M |  8140 |         2 |     +2 |     1 |      10M |  463 | 453MB |     . |          |                 |
==================================================================================================================================================

According to the Global Stats the query has taken 3 seconds to complete, of which 2.24 seconds is CPU. (The 750MB read in 0.77 second would be due to the fact that I’m running off SSD, and I’ve got a 1MB read size that helps). A very large fraction of the CPU appears because of the number of calls from operation 1 to operation 3 (the projection information pulled from memory reports a rowset size of 256 rows, so that’s roughly 40,000 calls to the function.

When we force the use of a Bloom filter the plan doesn’t change much (though the creation and use of the Bloom filter has to be reported) – but the numbers do change quite significantly.

Global Stats
================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
================================================================
|    1.97 |    0.99 |     0.98 |     1 |  96484 |  773 | 754MB |
================================================================

SQL Plan Monitoring Details (Plan Hash Value=4148581417)
======================================================================================================================================================
| Id |       Operation       |  Name   |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                       |         | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
======================================================================================================================================================
|  0 | SELECT STATEMENT      |         |         |       |         1 |     +1 |     1 |        0 |      |       |     . |          |                 |
|  1 |   HASH JOIN           |         |     100 | 14373 |         1 |     +1 |     1 |        0 |      |       |   1MB |          |                 |
|  2 |    JOIN FILTER CREATE | :BF0000 |      99 |  5832 |         1 |     +1 |     1 |      100 |      |       |     . |          |                 |
|  3 |     TABLE ACCESS FULL | T2      |      99 |  5832 |         1 |     +1 |     1 |      100 |  310 | 301MB |     . |          |                 |
|  4 |    JOIN FILTER USE    | :BF0000 |     10M |  8140 |         1 |     +1 |     1 |    15102 |      |       |     . |          |                 |
|  5 |     TABLE ACCESS FULL | T1      |     10M |  8140 |         1 |     +1 |     1 |    15102 |  463 | 453MB |     . |          |                 |
======================================================================================================================================================


In this case, the elapsed time dropped to 1.97 seconds (depending on your viewpoint that’s either a drop of “only 1.03 seconds” or drop of “an amazing 34.3%”; with the CPU time dropping from 2.24 seconds to 0.99 seconds (55.8% drop!)

In this case you’ll notice that the tablescan of t1 produced only 15,102 rows to pass up to the hash join at operation 1 thanks to the application of the predicate (not reported here): filter(SYS_OP_BLOOM_FILTER(:BF0000,”T1″.”ID”)). Instead of 40,000 calls for the next rowset the hash function has been applied during the tablescan and operation 5 has exhausted the tablescan after only about 60 calls. This is what has given us the (relatively) significant saving in CPU.

This example of the use of a Bloom filter highlights up the two points I referred to earlier.

  • First, although we see operations 4 and 5 as Join (Bloom) filter use and Table access full respectively I don’t think the data from the tablescan is being “passed up” from operation 5 to 4; I believe operation 4 can be views as a “placeholder” in the plan to allow us to see the Bloom filter in action, the hashing and filtering actually happening during the tablescan.
  • Secondly, we know that there are ultimately no rows in the result set, yet the application of the Bloom filter has not eliminated all the data. Remember that the bitmap that Oracle constructs of the hash table identifies used buckets, not actual values. Those 15,102 rows are rows that “might” find a match in the hash table because they belong in buckets that are flagged. A Bloom filter won’t discard any data that is needed, but it might fail to eliminate data that subsequently turns out to be unwanted.

How parallel is parallel anyway?

I’ll leave you with one other thought. Here’s an execution plan from 12c (12.2.0.1) which joins three dimension tables to a fact table. There are 343,000 rows in the fact table and the three joins individually identify about 4 percent of the data in the table. In a proper data warehouse we might have been looking at a bitmap star transformation solution for this query, but in a mixed system we might want to run warehouse queries against normalised data – this plan shows what Bloom filters can do to minimise the workload. The plan was acquired from memory after enabling rowsource execution statistics:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Starts | E-Rows |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |      1 |        |        |      |            |      1 |00:00:00.05 |      22 |      3 |       |       |          |
|   1 |  SORT AGGREGATE              |          |      1 |      1 |        |      |            |      1 |00:00:00.05 |      22 |      3 |       |       |          |
|   2 |   PX COORDINATOR             |          |      1 |        |        |      |            |      2 |00:00:00.05 |      22 |      3 | 73728 | 73728 |          |
|   3 |    PX SEND QC (RANDOM)       | :TQ10000 |      0 |      1 |  Q1,00 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   4 |     SORT AGGREGATE           |          |      2 |      1 |  Q1,00 | PCWP |            |      2 |00:00:00.09 |    6681 |   6036 |       |       |          |
|*  5 |      HASH JOIN               |          |      2 |     26 |  Q1,00 | PCWP |            |     27 |00:00:00.09 |    6681 |   6036 |  2171K|  2171K|     2/0/0|
|   6 |       JOIN FILTER CREATE     | :BF0000  |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|*  7 |        TABLE ACCESS FULL     | T3       |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|*  8 |       HASH JOIN              |          |      2 |    612 |  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6634 |   6026 |  2171K|  2171K|     2/0/0|
|   9 |        JOIN FILTER CREATE    | :BF0001  |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|* 10 |         TABLE ACCESS FULL    | T2       |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|* 11 |        HASH JOIN             |          |      2 |  14491 |  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6614 |   6022 |  2171K|  2171K|     2/0/0|
|  12 |         JOIN FILTER CREATE   | :BF0002  |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|* 13 |          TABLE ACCESS FULL   | T1       |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|  14 |         JOIN FILTER USE      | :BF0000  |      2 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|  15 |          JOIN FILTER USE     | :BF0001  |      2 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|  16 |           JOIN FILTER USE    | :BF0002  |      2 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|  17 |            PX BLOCK ITERATOR |          |      2 |    343K|  Q1,00 | PCWC |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|* 18 |             TABLE ACCESS FULL| T4       |     48 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.05 |    6594 |   6018 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------

It’s a parallel plan, but it’s used the 12c “PQ_REPLICATE” strategy. The optimizer has decided that all the dimension tables are so small that it’s going to allow every PX process to read every (dimension) table through the buffer cache and build its own hash tables from them. (In earlier versions you might have seen the query coordinator scanning and broadcasting the three small tables, or one set of PX processes scanning and broadcasting to the other set).

So every PX process has an in-memory hash table of all three dimension tables and then (operation 17) they start a tablescan of the fact table, picking non-overlapping rowid ranges to scan. But since they’ve each created three in-memory hash tables they’ve also been able to create three Bloom filters each, which can all be applied simultaneously as the tablescan takes place; so instead of 343,000 rows being passed up the plan and through the first hash join (where we see from operation 11 that the number of surviving rows would have been about 14,500 ) we see all but 27 rows discarded very early on in the processing. Like bitmap indexes part of the power of Bloom filters lies in the fact that with the right plan the optimizer can combine them and identify a very small data set very precisely, very early.

The other thing I want you to realise about this plan, though, is that it’s not really an “extreme” parallel plan. It’s effectively running as a set of concurrent, non-interfering, serial plans. Since I was running (parallel 2) Oracle started just 2 PX processes: they both built three hash tables from the three dimension tables then split the fact table in half and took half each to do all the joins, and passed the nearly complete result to the query co-ordinator at the last moment. That’s as close as you can get to two serial, non-interfering, queries and still call it a parallel query. So, if you wonder why there might be any benefit in serial Bloom filters – Oracle’s actually being benefiting from it under the covers for several years.

Summary

Bloom filters trade a decrease in messaging against an increase in preparation and hashing operations. For Exadata systems with predicate offloading it’s very easy to see the potential benefit; for general parallel execution; it’s also fairly easy to see the potential benefit for parallel query execution what inter-process message between two sets of PX processes can be resource intensive; but even for serial queries there can be some benefit though, in absolute terms, they are likely to be only a small saving in CPU.

 

September 9, 2020

Bloom Upgrade

Filed under: 18c,CBO,Joins,Oracle,Tuning — Jonathan Lewis @ 5:18 pm BST Sep 9,2020

It’s a common pattern of Oracle features that they start with various restrictions or limitations that disappear over time. This note is about an enhancement to Bloom filter processing that appeared in the 18.1 optimizer and, for some people, may be a good enough reason for upgrading to a newer version of Oracle. This enhancement came to my attention by way of the Oracle Developer forum in a thread with the title Bloom filters and view using UNION ALL asking how to get a Bloom filter pushed inside a UNION ALL view. The original requirement wasn’t a trivial one so I’ll demonstrate the problem with a very simple example – first the data set:

rem
rem     Script:         bloom_pushdown.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2020
rem
rem     Last tested:
rem             19.3.0.0
rem

create table t1 as select * from all_objects where rownum <= 50000;
create table t2 as select t1.* from t1, (select rownum n1 from dual connect by level <= 4);
create table t3 as select t1.* from t1, (select rownum n1 from dual connect by level <= 4); -- > comment to avoid wordpress format issue

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns object_type size 254'
        );
end;
/

I’ve been a bit lazy here, copying data from view all_objects. I’ve gathered stats on t1 so that I can generate a histogram on the object_type column because I’m going to query for a rare object_type and I want the optimizer to get a reasonable estimate of rows. I’m going to hint a parallel query to join t1 to t2 (aliased, trivially, as v1 for reasons that will become apparent soon):

select
        /*+ 
                parallel(2) 
        */
        t1.object_name, v1.object_name
from
        t1,
        t2 v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/

In my case the optimizer chooses to do a hash join between these two table, and creates a Bloom filter to try and minimise the data passing through the data flow operation. The result set in my 12.2.0.1 database is only 16 rows, so it would be nice if the parallel scan could eliminate most of the 200,000 rows in t2 early – here’s the execution plan pulled from memory after running the query with rowsource execution stats enabled:


----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |      1 |        |   371 (100)|     16 |00:00:00.06 |      20 |      0 |       |       |          |
|   1 |  PX COORDINATOR        |          |      1 |        |            |     16 |00:00:00.06 |      20 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)  | :TQ10000 |      0 |     16 |   371   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN           |          |      2 |     16 |   371   (5)|     16 |00:00:00.05 |    6278 |   3988 |  1250K|  1250K|     2/0/0|
|   4 |     JOIN FILTER CREATE | :BF0000  |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    2034 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL | T1       |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    2034 |      0 |       |       |          |
|   6 |     JOIN FILTER USE    | :BF0000  |      2 |    200K|   292   (4)|     16 |00:00:00.04 |    4244 |   3988 |       |       |          |
|   7 |      PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|     16 |00:00:00.04 |    4244 |   3988 |       |       |          |
|*  8 |       TABLE ACCESS FULL| T2       |     32 |    200K|   292   (4)|     16 |00:00:00.03 |    4244 |   3988 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   5 - filter("T1"."OBJECT_TYPE"='SCHEDULE')
   8 - access(:Z>=:Z AND :Z<=:Z)
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"V1"."OBJECT_ID"))

We see that Oracle has generated a Bloom filter at operation 4 from the data returned from t1 at operation 5, and then used that Bloom filter at operation 6 to eliminate most of the data from t2 before passing the remaining few rows up to the hash join.

Let’s make the query more interesting – what if you want to use a UNION ALL of t2 and t3 in the query (for example one might be “current data” while the other is “historic data”. Here’s the query and plan from 12.2.0.1:

select
        /*+ 
                parallel(2) 
        */
        t1.object_name, v1.object_name
from
        t1,
        (select * from t2 union all select * from t3) v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/

-----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
-----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |        |   667 (100)|     32 |00:00:00.37 |      40 |      0 |       |       |          |
|   1 |  PX COORDINATOR         |          |      1 |        |            |     32 |00:00:00.37 |      40 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)   | :TQ10000 |      0 |     32 |   667   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN            |          |      1 |     32 |   667   (5)|     32 |00:00:00.34 |    5125 |   3860 |  1250K|  1250K|     2/0/0|
|*  4 |     TABLE ACCESS FULL   | T1       |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    2034 |      0 |       |       |          |
|   5 |     VIEW                |          |      2 |    400K|   584   (4)|    400K|00:00:00.52 |    8488 |   7976 |       |       |          |
|   6 |      UNION-ALL          |          |      2 |        |            |    400K|00:00:00.24 |    8488 |   7976 |       |       |          |
|   7 |       PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|    200K|00:00:00.11 |    4244 |   3988 |       |       |          |
|*  8 |        TABLE ACCESS FULL| T2       |     32 |    200K|   292   (4)|    200K|00:00:00.07 |    4244 |   3988 |       |       |          |
|   9 |       PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|    200K|00:00:00.11 |    4244 |   3988 |       |       |          |
|* 10 |        TABLE ACCESS FULL| T3       |     32 |    200K|   292   (4)|    200K|00:00:00.03 |    4244 |   3988 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   4 - filter("T1"."OBJECT_TYPE"='SCHEDULE')
   8 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue
  10 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue

No Bloom filter – so all 400,000 rows feed up the plan and through the hash join. This won’t matter too much for my sub-second tiny data set but on a pair of 50GB tables, with the potential to offload the Bloom filter to storage in Exadata and, perhaps, eliminate 99% of the data at the cell servers, this could make a huge difference to performance.

Since Bloom filters are all about hashing data (in Oracle the standard Bloom filter is the bitmap summarising the build table in a hash join) let’s trying pushing the optimizer into a hash distribution for the parallel join to see if that had any effect:


select
        /*+ 
                parallel(2) 
                gather_plan_statistics
                leading(@sel$1 t1@sel$1 v1@sel$1)
                use_hash(@sel$1 v1@sel$1)
                pq_distribute(@sel$1 v1@sel$1 hash hash)
        */
        t1.object_name, v1.object_name
from
        t1,
        (select * from t2 union all select * from t3) v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/

---------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
---------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |      1 |        |   667 (100)|     32 |00:00:00.43 |      60 |      0 |       |       |          |
|   1 |  PX COORDINATOR             |          |      1 |        |            |     32 |00:00:00.43 |      60 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)       | :TQ10002 |      0 |     32 |   667   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN BUFFERED       |          |      1 |     32 |   667   (5)|     32 |00:00:00.38 |    4000 |   3752 |  2290K|  2082K|     2/0/0|
|   4 |     JOIN FILTER CREATE      | :BF0000  |      2 |      4 |    75   (4)|      8 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      PX RECEIVE             |          |      2 |      4 |    75   (4)|      8 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       PX SEND HYBRID HASH   | :TQ10000 |      0 |      4 |    75   (4)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   7 |        STATISTICS COLLECTOR |          |      2 |        |            |      4 |00:00:00.01 |    1517 |      0 |       |       |          |
|   8 |         PX BLOCK ITERATOR   |          |      2 |      4 |    75   (4)|      4 |00:00:00.01 |    1517 |      0 |       |       |          |
|*  9 |          TABLE ACCESS FULL  | T1       |     26 |      4 |    75   (4)|      4 |00:00:00.01 |    1517 |      0 |       |       |          |
|  10 |     PX RECEIVE              |          |      2 |    400K|   584   (4)|     66 |00:00:00.77 |    8488 |   7976 |       |       |          |
|  11 |      PX SEND HYBRID HASH    | :TQ10001 |      2 |    400K|   584   (4)|     66 |00:00:00.77 |    8488 |   7976 |       |       |          |
|  12 |       JOIN FILTER USE       | :BF0000  |      2 |    400K|   584   (4)|     66 |00:00:00.77 |    8488 |   7976 |       |       |          |
|  13 |        VIEW                 |          |      2 |    400K|   584   (4)|    400K|00:00:00.68 |    8488 |   7976 |       |       |          |
|  14 |         UNION-ALL           |          |      2 |        |            |    400K|00:00:00.59 |    8488 |   7976 |       |       |          |
|  15 |          PX BLOCK ITERATOR  |          |      2 |    200K|   292   (4)|    200K|00:00:00.18 |    4244 |   3988 |       |       |          |
|* 16 |           TABLE ACCESS FULL | T2       |     32 |    200K|   292   (4)|    200K|00:00:00.06 |    4244 |   3988 |       |       |          |
|  17 |          PX BLOCK ITERATOR  |          |      2 |    200K|   292   (4)|    200K|00:00:00.12 |    4244 |   3988 |       |       |          |
|* 18 |           TABLE ACCESS FULL | T3       |     32 |    200K|   292   (4)|    200K|00:00:00.08 |    4244 |   3988 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   9 - access(:Z>=:Z AND :Z<=:Z)   -- > edit to avoid wordpress format issue 
       filter("T1"."OBJECT_TYPE"='SCHEDULE') 
  16 - access(:Z>=:Z AND :Z<=:Z)   -- > edit to avoid wordpress format issue
  18 - access(:Z>=:Z AND :Z<=:Z)   -- > edit to avoid wordpress format issue

We’ve managed to introduce a Bloom filter (which is visible as :BF0000 in the plan, even through there’s no reference to sys_op_bloom_filter() in the predicate information) but there’s a problem, we’re still passing 400,000 rows up the plan and the Bloom filter is only being applied at (or just after) the VIEW operator, discarding all but 66 rows before doing the hash join. It’s an improvement but not ideal; we’d like to see the Bloom filter applied to each of the two tables separately to eliminate rows as early as possible.

This can’t be done in 12.2, and you’d have to rewrite the query, changing a “join with union” into a “union of joins”, and that’s not really a desirable strategy.

Next Steps

Searching MOS, though you will be able to find the following note:

Doc ID 18849313.8 – ENH : bloom filters/pruning are pushed through union-all view

There’s an enhancement request to do what we want in 18.1, and the enhancement has got into the software. Here’s the (unhinted) plan from 19.3 (the plan stays the same when optimizer_features_enable is set back to 18.1.0, but drops back to the 12.1. plan when OFE is set to 12.2.0.1):

------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |        |   666 (100)|     32 |00:00:00.11 |      10 |      0 |       |       |          |
|   1 |  PX COORDINATOR          |          |      1 |        |            |     32 |00:00:00.11 |      10 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)    | :TQ10000 |      0 |     32 |   666   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN             |          |      2 |     32 |   666   (5)|     32 |00:00:00.05 |   10020 |   7958 |  1250K|  1250K|     2/0/0|
|   4 |     JOIN FILTER CREATE   | :BF0000  |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    1998 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL   | T1       |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    1998 |      0 |       |       |          |
|   6 |     VIEW                 |          |      2 |    400K|   583   (4)|     32 |00:00:00.04 |    8022 |   7958 |       |       |          |
|   7 |      UNION-ALL           |          |      1 |        |            |     12 |00:00:00.02 |    4011 |   3979 |       |       |          |
|   8 |       JOIN FILTER USE    | :BF0000  |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|   9 |        PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|* 10 |         TABLE ACCESS FULL| T2       |     32 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|  11 |       JOIN FILTER USE    | :BF0000  |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|  12 |        PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|* 13 |         TABLE ACCESS FULL| T3       |     32 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   5 - filter("T1"."OBJECT_TYPE"='SCHEDULE')
  10 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T2"."OBJECT_ID")) 
  13 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T3"."OBJECT_ID"))

As you can see, we create a Bloom filter at operation 4, and use it twice at operations 8 and 11 – with the sys_op_bloom_filter() functions clearly visible in the predicate information showing us that the Bloom filter is applied to the object_id column in both cases.

If you want to disable this enhancement for some reasons there are two hidden parameters available (which you might set for a single query using the opt_param() hint):

  • _bloom_filter_setops_enabled = true
  • _bloom_pruning_setops_enabled = true

The first is for Bloom filters in the situation shown, I assume the second deals with Bloom filters for partition pruning.

Summary

In versions prior to 18.1 the optimizer is unable to push Bloom filters down to the individual tables in a UNION ALL view, but this limitation was removed in the 18.1 code set.

 

February 18, 2020

Interval Partition(s)

Filed under: Oracle,Parallel Execution,Partitioning — Jonathan Lewis @ 1:45 pm GMT Feb 18,2020

A quirky little feature of interval partitioning showed up on Twitter today – a parallel insert that would only use a single PX slave to do the inserting. With 1.1 billion rows and the option for running parallel 32 this made the loading process rather slower than it ought to have been.

Fortunately it’s quite easy to model (and work around) the oddity. So here’s a small data set and an empty partitioned table to work with:


rem
rem     Script:         pt_int_load_anomaly.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2020
rem

create table t1 
nologging 
as
select 
        ao.* 
from 
        all_Objects ao, 
        (select rownum id from dual connect by level <= 20)
;

create table pt1
partition  by range (object_id) interval (1000000) (
        partition p1 values less than (1)
)
as
select  * 
from    all_Objects
where   rownum = 0
/

I’ve created a table by copying all_objects 20 times which, for my little sandbox, has given me a total of about 1.2M rows. Then I’ve created an empty interval-partitioned clone of all_objects, with the first partition defined to hold all rows where the object_id is less than 1 (and there’s no object in the database that could match that criterion). I’ve defined the interval to be 1,000,000 and since the highest object_id in my database is about 90,000 the first partition that gets added to this table will be able to hold all the data from t1.

So now we try to do a parallel insert from t1 into pt1, and check the execution plan and parallel execution statistics:


set serveroutput off

insert /*+ append enable_parallel_dml parallel(6) */ into pt1 select * from t1;

select * from table(dbms_xplan.display_cursor);

start pq_tqstat

Note how I’ve used the hint /*+ enable_parallel_dml */ (possible a 12c hint back-ported to 11.2.0.4) rather than using an “alter session”, it’s just a little convenience to be able to embed the directive in the SQL. The pq_tqstat script is one I published some time ago to report the contents of the session-specific dynamic performance view v$pq_tqstat immediately after running a parallel statement.

Here’s the plan:


SQL_ID  25hub68pf1z1s, child number 0
-------------------------------------
insert /*+ append enable_parallel_dml parallel(6) */ into pt1 select *
from t1

Plan hash value: 2888707464

-------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                            |          |       |       |   631 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                             |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)                       | :TQ10001 |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT (HIGH WATER MARK BROKERED)| PT1      |       |       |            |          |  Q1,01 | PCWP |            |
|   4 |     OPTIMIZER STATISTICS GATHERING          |          |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE                             |          |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,01 | PCWP |            |
|   6 |       PX SEND RANDOM LOCAL                  | :TQ10000 |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,00 | P->P | RANDOM LOCA|
|   7 |        PX BLOCK ITERATOR                    |          |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,00 | PCWC |            |
|*  8 |         TABLE ACCESS FULL                   | T1       |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,00 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access(:Z>=:Z AND :Z<=:Z)

Note
-----
   - Degree of Parallelism is 6 because of hint

The most important detail of this plan is that the PX slaves do the load as select (operation 3), then send a message to the query coordinator (PX send QC, operation 2) to tell it about the data load. They do not send their data to the QC for the QC to do the load.

So the plan says we will be doing parallel DM, but here’s what v$pq_tqstat tells us:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P006                215880   34785363     17.47      16.86         16          0           0
                                             1 P007                202561   34436325     16.39      16.69         17          0           0
                                             1 P008                207519   34564496     16.79      16.75         17          0           0
                                             1 P009                208408   34594770     16.86      16.77         17          0           0
                                             1 P00A                198915   33529627     16.10      16.25         16          0           0
                                             1 P00B                202537   34430603     16.39      16.69         16          0           0
                      Consumer               1 P000                     0        144      0.00       0.00         51         47           0
                                             1 P001                     0        144      0.00       0.00         51         47           0
                                             1 P002               1235820  206340464    100.00     100.00         75         47           0
                                             1 P003                     0        144      0.00       0.00         51         47           0
                                             1 P004                     0        144      0.00       0.00       1138       1134           0
                                             1 P005                     0        144      0.00       0.00       1137       1133           0

                    1 Producer               1 P000                     0         24      0.00       5.91         51         42           0
                                             1 P001                     0         24      0.00       5.91         50         41           0
                                             1 P002                     2        286    100.00      70.44         58         14           0
                                             1 P003                     0         24      0.00       5.91         51         43           0
                                             1 P004                     0         24      0.00       5.91         51         42           0
                                             1 P005                     0         24      0.00       5.91         51         43           0
                      Consumer               1 QC                       2        406    100.00     100.00        311        179           0

19 rows selected.

The query did run parallel 6 as hinted – and 6 PX slaves scanned the t1 table; but they all sent all their data to one PX slave in the second slave set and that one PX slave did all the inserts. The plan was parallel, but the execution was effectively serial. (You’ll note there is something a little odd about the waits and timeout for p004 and p005 when they are acting as consumers. I may worry about that later, but it could be a host-based side effect of running parallel 6 on a VM with 4 CPUs).

The serialization leads to two questions

  1. What went wrong?
  2. How do we work around this and make the insert “truly” parallel

My answer to (1) is “I don’t know – but I’ll look at it if necessary” combined with the guess – it’s something to do with the table having only one partition at the outset and this has an unexpected side effect on the randomising function for the PX distribution.

My answer to (2) is “if I’m right about (1), why not try pre-defining two partitions, and I’ll even let both of them stay empty”.

So here’s my new definition for pt1:


create table pt1
partition  by range (object_id) interval (1000000) (
        partition p0 values less than (0),
        partition p1 values less than (1)
)
as
select  * 
from    all_Objects
where   rownum = 0
/

Re-running the test with the completely redundant, and permanently empty p0 partition the plan doesn’t change but the results from v$pq_tqstat change dramatically:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P006                207897   34581153     16.82      16.76         23          4           0
                                             1 P007                215669   34786429     17.45      16.86         30          5           0
                                             1 P008                221474   36749626     17.92      17.81         28          5           0
                                             1 P009                204959   34497164     16.58      16.72         22          2           0
                                             1 P00A                177755   30141002     14.38      14.61         21          0           0
                                             1 P00B                208066   35585810     16.84      17.25         25          2           0
                      Consumer               1 P000                213129   35612973     17.25      17.26         82         57           0
                                             1 P001                200516   33570586     16.23      16.27         84         55           0
                                             1 P002                203395   33950449     16.46      16.45         83         56           0
                                             1 P003                205458   34235575     16.63      16.59         82         54           0
                                             1 P004                204111   33999932     16.52      16.48        581        555           0
                                             1 P005                209211   34971669     16.93      16.95        580        553           0

                    1 Producer               1 P000                     2        286     16.67      16.67        422        149           0
                                             1 P001                     2        286     16.67      16.67        398        130           0
                                             1 P002                     2        286     16.67      16.67        405        128           0
                                             1 P003                     2        286     16.67      16.67        437        161           0
                                             1 P004                     2        286     16.67      16.67        406        116           0
                                             1 P005                     2        286     16.67      16.67        440        148           0
                      Consumer               1 QC                      12       1716    100.00     100.00        242        111           0



19 rows selected.

Every consumer receives and inserts roughly 200,000 rows – it’s a totally fair parallel DML. Timings are pretty irrelevant for such a small data set but the excution time did drop from 7 seconds to 4 seconds when parallelism was working “properly”.

I’ve tested this script on 12.2.0.1 and 19.3.0.0 – the same anomaly appears in both versions though it might be worth noting that the strange skew in the waits and timeouts doesn’t appear in 19.3.0.0.

January 10, 2020

Collection limitation

Filed under: Execution plans,Hints,Oracle,Troubleshooting — Jonathan Lewis @ 5:43 pm GMT Jan 10,2020

The ODC SQL and PL/SQL forum came up with an example a couple of days ago that highlighted an annoying limitation in the optimizer’s handling of table functions. The requirement was for a piece of SQL that would generate “installments” information from a table of contract agreements and insert into another table any installments that were not yet recorded there.

The mechanism to turn a single row of contract data into a set of installments was a (optionally pipelined) table function that involved some business logic that (presumably) dealt with the timing and size of the installments. The final SQL to create the data that needed to be inserted was reported as follows (though it had clearly been somewhat modified):

select 
        y.*,
        trunc(sysdate,'mm'),
        user
from 
        table_a a
join 
        table(
                my_function(
                        a.loan_acct_nbr, 
                        a.start_dt,
                        a.maturity_dt,
                        a.num_of_terms
                )
        ) y
on 
        a.loan_acct_nbr = y.loan_acct_nbr
where 
        a.yr_mnth_cd = to_char(add_months(sysdate,-1),'yyyymm')       -- last month
and     a.loan_typ   = 'ABC'
and     not exists ( 
                select  1 
                from
                        final_load_table l
                where
                        l.loan_acct_nbr = y.loan_acct_nbr
                and     l.yr_mnth_cd    = y.yr_mnth_cd 
        )
;

In this query, table_a is the set of contracts, final_load_table is the set of installments, and my_function() is the pipelined function returning a table of installments derived from the start date, maturity date, and number of installments for a contract. The query needs a “not exists” subquery to eliminate any installments that are already known to the database. Once this query is operating efficiently it could be used either to drive a PL/SQL loop or (generally the better strategy) to do a simple “insert as select”.

We were told that the function would return around 60 rows per contract; that the starting size of the final_load_table would be about 60M rows, and the size of the result set from the initial join would be about 60M or 70M rows (which suggests about 1M rows acquired from table_a).

The owner of this problem seemed to be saying that the query had returned no data after 4 or 5 hours – which suggests (a) the optimizer chose a bad plan and/or (b) the PL/SQL function is working very inefficiently and/or (c) the join had generated a vast amount of data but the effect of the subquery was to discard all of it .

Clearly we needed to see an execution plan (preferably from SQL Monitor) and be given have some idea of how much of the 60M/70M rows predicted for the join would be discarded because it already existed.

The owner did send us an execution plan – which included a very obvious threat and suggested a couple of optimizer errors – but supplied it as a picture rather than a flat text.

You’ll notice, of course, that the tables and columns have changed from the original statement. More significantly, though, there are serious problems with the numbers – the estimated row count for the basic join is only 5,326 rather than 50 Million which, at first sight, is probably why the optimizer has decided that a filter subquery (operation 1) to do an index-only probe (operation 5) is a good way of handling the subquery. Perhaps if the estimates had been a little more accurate (e.g. through the addition of a couple of column groups or, if necessary, by opt_estimate() or cardinality() hints) the subquery would have been unnested and turned into a hash anti-join.

I suggested a quick test of a suitable cardinality() hint – but ran up a little model to check that I’d got the hint right – and found that I had but it wasn’t going to help. So I decided to write up the model (and a possible solution for the owner of the problem) in this blog note.

Here’s the code to create the set of objects I’m going to work with. The naming follows the naming in the original statement of the problem suggested by the owner:


rem
rem     Script:         table_function_plan.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table table_a (
        loan_acct_nbr   number,
        yr_mnth_cd      varchar2(6),
        start_dt        date,
        maturity_dt     date    ,
        number_of_terms number,
        constraint ta_pk primary key(loan_acct_nbr, yr_mnth_cd)
)
;

execute dbms_random.seed(0)

insert /*+ append */
into    table_a
with generator as (
        select rownum id
        from    dual
        connect by
                level <= 4000 -- > comment to avoid wordpress issue
)
select
        trunc(dbms_random.value(1e9, 2e9)),
        to_char(sysdate-(365-mod(rownum,365)),'yyyymm'),
        sysdate-(365-mod(rownum,365)),
        sysdate+(1500+mod(rownum,365)),
        60
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

create table final_load_table_l(
        loan_acct_nbr   number,
        yr_mnth_cd      varchar2(6),
        v1              varchar2(10),
        padding         varchar2(200),
        constraint lt_pk primary key(loan_acct_nbr, yr_mnth_cd)
)
;

insert /*+ append */ into final_load_table_l
with generator as (
        select rownum id
        from    dual
        connect by
                level <= 4000 -- > comment to avoid wordpress issue
)
select
        trunc(dbms_random.value(1e9, 2e9)),
        to_char(sysdate-(365-mod(rownum,365)),'yyyymm'),
        lpad(rownum,10),
        lpad('x',200,'x')
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'table_a',
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'final_load_table_l',
                method_opt  => 'for all columns size 1'
        );
end;
/

create type my_row_type as object (
        loan_acct_nbr   number,
        yr_mnth_cd      varchar2(6),
        v1              varchar2(10),
        padding         varchar2(200)
);
/

create type my_table_type as table of my_row_type;
/

create  or replace function my_function (
        i_loan_acct_nbr         in      number,
        i_yr_mnth_cd            in      varchar2,
        i_start_dt              in      date,
        i_maturity_dt           in      date,
        i_number_of_terms       in      number
)       return  my_table_type pipelined
as
begin
        for i in 1..i_number_of_terms loop
                pipe row (
                        my_row_type(
                                i_loan_acct_nbr,
                                to_char(i_start_dt+32*i,'yyyymm'),
                                i,
                                lpad('x',200,'x')
                        )
                );
        end loop;
        return;
end;
/

I was planning to create some large tables – hence the option to generate 16M rows from my generator CTEs – but I discovered the critical problem almost as soon as I had some data and code in place, so I didn’t need to go large.

I’ve had to create an object type and table type in order to create a pipelined function that returns the table type by piping rows of the object type. The data I’ve created, and the way the function generates data probably doesn’t bear much resemblance to the real system of course, but I don’t think it needs to be very close to make the crucial point.

Here’s the text of the select statement the OP wants to run, with the execution plan I got from my data set after running the query and pulling the plan from memory:

alter session set statistics_level = all;

select 
        /*+ find this 1 */
        y.* 
from 
        table_a a, 
        table(my_function(
                a.loan_acct_nbr,
                a.yr_mnth_cd,
                a.start_dt,
                a.maturity_dt,
                a.number_of_terms
        )) y
where 
        a.yr_mnth_cd = '202001'
and     not exists (
                select  
                        /*+ unnest */
                        null
                from    final_load_table_l l
                where   l.loan_acct_nbr = y.loan_acct_nbr
                and     l.yr_mnth_cd    = y.yr_mnth_cd
        )
;

--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |             |      1 |        |  5059K(100)|  14580 |00:00:00.15 |   16330 |
|*  1 |  FILTER                             |             |      1 |        |            |  14580 |00:00:00.15 |   16330 |
|   2 |   NESTED LOOPS                      |             |      1 |   6283K| 10421   (8)|  14580 |00:00:00.11 |     335 |
|*  3 |    TABLE ACCESS FULL                | TABLE_A     |      1 |    769 |    10  (10)|    243 |00:00:00.01 |     297 |
|   4 |    COLLECTION ITERATOR PICKLER FETCH| MY_FUNCTION |    243 |   8168 |    14   (8)|  14580 |00:00:00.10 |      38 |
|*  5 |   INDEX UNIQUE SCAN                 | LT_PK       |  14580 |      1 |     1   (0)|      0 |00:00:00.02 |   15995 |
--------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( IS NULL)
   3 - filter("A"."YR_MNTH_CD"='202001')
   5 - access("L"."LOAN_ACCT_NBR"=:B1 AND "L"."YR_MNTH_CD"=:B2)


I’ve put in a hint to tell the optimizer to unnest the subquery – and it didn’t. Oracle does not ignore hints (unless they’re illegal, or out of context, or the optimizer never got to them, or you’ve found a bug) so why did Oracle appear to ignore this hint? There’s a really nice feature in 19.3 execution plans – you can request a hint report for a plan, and here’s the hint report for this query (ignore the bits about “find this” being an error):


Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 3 (U - Unused (1), E - Syntax error (2))
---------------------------------------------------------------------------
   0 -  SEL$1
         E -  find
         E -  this

   5 -  SEL$3
         U -  unnest / Invalid correlated predicates


I put in an /*+ unnest */ hint to unnest the subquery, and I’ve been told that the predicates are not valid. The only thing about them that could be invalid is that they come from a pipelined function that has returned an object type. The pipelined function does not behave exactly like a table. But this gives me a clue about forcing the unnest to happen – hide the pipelined function inside a non-mergeable view.


select
        /*+ find this 2 */
        v.*
from    (
        select  /*+ no_merge */
                y.* 
        from 
                table_a a, 
                table(my_function(
                        a.loan_acct_nbr,
                        a.yr_mnth_cd,
                        a.start_dt,
                        a.maturity_dt,
                        a.number_of_terms
                )) y
        where 
                a.yr_mnth_cd = '202001'
        )       v
where   not exists (
                select
                        /*+ unnest */
                        null
                from    final_load_table_l l
                where   l.loan_acct_nbr = v.loan_acct_nbr
                and     l.yr_mnth_cd    = v.yr_mnth_cd
        )
/


------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |             |      1 |        | 10628 (100)|  14580 |00:00:00.12 |     387 |       |       |          |
|*  1 |  HASH JOIN RIGHT ANTI                |             |      1 |   6283K| 10628  (10)|  14580 |00:00:00.12 |     387 |  1878K|  1878K| 2156K (0)|
|   2 |   INDEX FAST FULL SCAN               | LT_PK       |      1 |  10000 |     6  (17)|  10000 |00:00:00.01 |      91 |       |       |          |
|   3 |   VIEW                               |             |      1 |   6283K| 10371   (8)|  14580 |00:00:00.11 |     296 |       |       |          |
|   4 |    NESTED LOOPS                      |             |      1 |   6283K| 10371   (8)|  14580 |00:00:00.10 |     296 |       |       |          |
|*  5 |     TABLE ACCESS FULL                | TABLE_A     |      1 |    769 |    10  (10)|    243 |00:00:00.01 |     296 |       |       |          |
|   6 |     COLLECTION ITERATOR PICKLER FETCH| MY_FUNCTION |    243 |   8168 |    13   (0)|  14580 |00:00:00.10 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("L"."LOAN_ACCT_NBR"="V"."LOAN_ACCT_NBR" AND "L"."YR_MNTH_CD"="V"."YR_MNTH_CD")
   5 - filter("A"."YR_MNTH_CD"='202001')

I’ve wrapped the basic join between table_a and the table function in an inline view called (unimaginatively) v, and I’ve added a /*+ no_merge */ hint to that inline view. So the main query becomes a select from a non-mergeable view with a “not exists” subquery applied to a couple of view columns. And Oracle thinks that that’s okay – and my unnest hint encouraged the optimizer to use a hash anti-join.

So here’s an option for the owner of the problem – but with a few outstanding questions: will a rewrite of their query in this form do the same, will the performance of the select be good enough, and will the subsequent “insert as select” keep the same driving plan.

There’s one significant detail to worry about: the build table in this hash (anti-)join is going to be constructed from 50M (load_acct_bfr, yr_mnth_cd) pairs – which means the workarea requirement is likely to be about 1.2GB for an optimial (i.e. in-memory) hash join; otherwise the join may spill to disc and do a lot of I/O – probably as a one-pass hash join.

(Did you notice,by the way, that the word storage appeared at operation 3 in the original plan?  That suggests a nice big Exadata box; however, there’s no storage predicate in the Predicate Information section for that operation and you would have thought that lease_type = ‘US_SSP’ would be pushed to storage, so maybe this is a ZFS Pillar backing a less exotic server.)

Conclusion

Some (if not all) types of correlated subqueries behave badly if the correlation predicates involve table functions. But you may be able to work around the issue by hiding part of the query, including the table function, inside a non-mergeable inline view before applying the subquery to the view.

Footnote

When I realised that the table function was causing a problem unnesting I remembered that I had written about a similar problem a few years ago – after searching for a little while I discovered a draft note that I had started in September 2015 but had not quite finished; so I’ll be finishing it off and publishing it some time in the next few days. (Update: now published)

 

 

September 28, 2019

Negative Offload

Filed under: Exadata,Execution plans,HCC,Oracle,Troubleshooting — Jonathan Lewis @ 5:38 pm BST Sep 28,2019

At the Trivadis Performance Days 2019 I did a presentation on using execution plans to understand what a query was doing. One of the examples I showed was a plan from an Exadata system (using 11.2.0.4) that needed to go faster. The plan was from the SQL Monitor report and all I want to show you is one line that’s reporting a tablescan. To fit the screen comfortably I’ve removed a number of columns from the output.

The report had been generated while the statement was still running (hence the “->” at the left hand edge) and the query had scanned 166 segments (with no partition elimination) of a table with 4,500 data segments (450 range partitions and 10 hash sub-partitions – note the design error, by the way, hash partitioning in Oracle should always hash for a powert of 2).


SQL Plan Monitoring Details (Plan Hash Value=3764612084)  
============================================================================================================================================
| Id   |           Operation            | Name  | Read  | Read  | Write | Write |   Cell   | Mem  | Activity |       Activity Detail       |  
|      |                                |       | Reqs  | Bytes | Reqs  | Bytes | Offload  |      |   (%)    |         (# samples)         |   
============================================================================================================================================
| -> 5 |      TABLE ACCESS STORAGE FULL | TXN   |  972K | 235GB |       |       | -203.03% |   7M |    63.43 | Cpu (1303)                  | 
|      |                                |       |       |       |       |       |          |      |          | cell smart table scan (175) | 
============================================================================================================================================

In the presentation I pointed out that for a “cell smart table scan” (note the Activity Detail colum) this line was using a surprisingly large amount of CPU.

We had been told that the table was using hybrid columnar compression (HCC) and had been given some figures that showed the compression factor was slightly better than 4. I had also pointed out that the typical size of a read request was 256KB. (Compare Read Reqs with Read Bytes)

To explain the excessive CPU I claimed that we were seeing “double decompression” – the cell was decompressing (uncompressing) compression units (CUs), finding that the resulting decompressed data was larger than the 1MB unit that Exadata allows and sending the original compressed CU to the database server where it was decompressed again – and the server side decompression was burning up the CPU.

This claim is (almost certainly) true – but the justification I gave for the claim was at best incomplete (though, to be brutally honest, I have to admit that I’d made a mistake): I pointed out that the Cell Offload was negative 200% and that this was what told us about the double decompression. While double decompression was probably happening the implication I had made was that a negative offload automatically indicated double decompression – and that’s was an incorrect assumption on my part. Fortunately Maurice Müller caught up with me after the session was over and pointed out the error then emailed me a link to a relevant article by Ahmed Aangour.

The Cell Offload is a measure of the difference between the volume of data read and the volume of data returned to the server. If the cell reads 256KB from disc, but the column and row selection means the cell returns 128KB the Cell Offload would be 50%; if the cell returns 64KB the Cell Offload would be 75% (100 * (1 – 64KB/256KB)). But what if you select all the rows and columns from a compressed table – the volume of data after decompression would be larger than the compressed volume the cell had read from disc – and in this case we knew that we were reading 256KB at a time and the compression factor was slightly greater than 4, so the uncompressed data would probably be around 1MB, giving us a Cell Offload of 100 * (1 – 1024KB / 256KB) = negative 300%

Key Point: Any time that decompression, combined with the row and column selection, produces more data than the volume of data read from disc the Cell Offload will go negative. A negative Cell Offload is not inherently a problem (though it might hint at a suboptimal use of compression).

Follow-up Analysis

Despite the error in my initial understanding the claim that we were seeing double decompression was still (almost certainly) true – but we need to be a little more sophisticated in the analysis. The clue is in the arithmetic a few lines further up the page. We can see that we are basically reading 256KB chunks of the table, and we know that 256KB will expand to roughly 1MB so we ought to see a Cell Offload of about -300%; but the Cell Offload is -200%. This suggests fairly strongly that on some of the reads the decompressed data is slightly less than 1MB, which allows the cell to return the decompressed data to the database server, while some of the time the decompressed data is greater than 1MB, forcing the cell to send the original (compressed) CU to the databsae server.

We may even be able work the arithmetic backwards to estimate the number of times double decompression appeared.  Assume that two-thirds of the time the cell decompressed the data and successfully sent (just less than) 1MB back to the database server and one-third of the time the cell decompressed the data and found that the result was too large and sent 256KB of compressed data back to the server, and let’s work with the 972,000 read requests reported to see what drops out of the arithmetic:

  • Total data read: 972,000 * 256KB = 243,000 MB
  • Data sent to db server:  648,000 * 1MB + 324,000 * 256KB = 729,000 MB
  • Cell Offload = 100 * (1 – 729/243) = -200%   Q.E.D.

Of course it would be nice to avoid guessing – and if we were able to check the session activity stats (v$sessstat) while the query was running (or after it had completed) we could pick up several numbers that confirmed our suspicion. For 11.2.0.4, for example, we would keep an eye on:

	cell CUs sent uncompressed
	cell CUs processed for uncompressed
	EHCC {class} CUs Decompressed

Differences between these stats allows you to work out the number of compression units that failed the 1MB test on the cell server and were sent to the database server to be decompressed. There is actually another statistic named “cell CUs sent compressed” which would make life easy for us, but I’ve not seen it populated in my tests – so maybe it doesn’t mean what it seems to say.

Here’s an example from an 11.2.0.4 system that I presented a few years ago showing some sample numbers.

cell CUs sent uncompressed              5,601
cell CUs processed for uncompressed     5,601

EHCC CUs Decompressed                  17,903
EHCC Query High CUs Decompressed       12,302 

This reveals an annoying feature of 11g (continued in 12.1) that results in double counting of the statistics, confusing the issue when you’re trying to analyze what’s going on. In this case the table consisted of 12,302 compression units, and the query was engineered to cause the performance problem to appear. The first two statistics show us how many CUs were decompressed successfully (we’ll see a change appearing there in 12.1). We then see that all 12,302 of the table’s “query high” compression units were decompressed – but the “total” of all CUs decompressed was 17.903.

It’s not a coincidence that 12,302 + 5,601 = 17,903; there’s some double counting going on. I don’t know how many of the statistics are affected in this way, but Oracle has counted the CUs that passsed decompression once as they were processed at the cell server and again as they arrived at the database server. In this example we can infer that 12,302 – 5,601 = 6,701 compression units failed decompression at the cell server and were sent to the database server in compressed form to be decompressed again.

Here’s a couple of sets of figures from some similar tests run on 12.1.0.2 – one with a table compressed to query high another compressed to query low. There is one critical difference from the 11g figures but the same double-counting seems to have happened. In both cases the “EHCC Query [Low|High] CUs Decompressed” show the correct number of CUs in each table. Note, though that the “cell CUs processed for uncompress” in 12.1 appear to report the number of attempted decompressions rather than 11g’s number of successful decompressions.


=========================

cell CUs sent uncompressed                     19,561	-- successful decompressions at cell server
cell CUs processed for uncompressed            19,564	=> 3 failures

EHCC CUs Decompressed                          39,125	=  2 * 19,561 successes + 3 db server decompression
EHCC Query High CUs Decompressed               19,564

=========================

cell CUs sent uncompressed                     80,037	-- successful decompressions at cell server
cell CUs processed for uncompressed            82,178	=> 2,141 failures

EHCC CUs Decompressed                         162,215	=  2 * 80,037 successes + 2,141 db server decompressions
EHCC Query Low CUs Decompressed                82,178

=========================

I’ve annotated the figures to explain the arithmetic.

There has been some significant renaming and separation of statistics in 12.2, as described in this post by Roger MacNicol, and the problems of double-counting should have disappeared. I haven’t yet tested my old models in the latest versions of Oracle, though, so can’t show you anyy figures to demonstrate the change.

Takeaways

There are 4 key points to note in this posting.

  • Hash (sub)partitioning should be based on powers of 2, otherwise some partitions will be twice size of others.
  • There is a 1MB limit on the “data packet” sent between the cell server and database server in Exadata.
  • If you select a large fraction of the rows and columns from an HCC compressed table you may end up decompressing a lot of your data twice if the decompressed data for a read request is larger than the 1MB unit (and the cost will be highly visible at the database server as CPU usage).
  • The Cell Offload figure for a tablescan (in particular) will go negative if the volume of data sent from the cell server to the database server is larger than the volume of data read from the disk- even if double decompression hasn’t been happening.

A little corollary to the third point: if you are writing to a staging table with the expectation of doing an unfiltered tablescan (or a select *), then you probably don’t want to use hybrid columnar compression on the table as you will probably end up using a lot of CPU at the database server to compress it, then do double-decompression using even more CPU on the database server.  It’s only if you really need to minimise disk usage and have lots of CPU capacity to spare that you have a case for using hybrid columnar compression for the table (and Oracle In-Memory features may also change the degree of desirability).

Footnote

I haven’t said anything about accessing table data by index when the table is subject to HCC compression. I haven’t tested the mechanism in recent versions of Oracle but it used to be the case that the cell server would supply the whole compression unit (CU) to the database server which would decompress it to construct the relevant row. One side effect of this was that the same CU could be decompressed (with a high CPU load) many times in the course of a single query.

 

August 14, 2019

gather_system_stats

Filed under: CBO,Exadata,Oracle,Statistics,System Stats — Jonathan Lewis @ 2:20 pm BST Aug 14,2019

What happens when you execute dbms_stats.gather_system_stats() with the ‘Exadata’ option ?

Here’s what my system stats look like (12.2.0.1 test results) after doing so. (The code to generate the two different versions is at the end of the note).


System Stats
============
Status: COMPLETED
Timed: 13-Aug-2019 15:00:00 - 13-Aug-2019 15:00:00
--------------------------------------------------
CPUSPEED        :
CPUSPEEDNW      :          918
IOSEEKTIM       :           10
IOTFRSPEED      :      204,800
MAXTHR          :
MBRC            :          128
MREADTIM        :
SLAVETHR        :
SREADTIM        :

PL/SQL procedure successfully completed.

MBRC       :          128
MREADTIM   :
SREADTIM   :
CPUSPEED   :
CPUSPEEDNW :          918
IOSEEKTIM  :           10
IOTFRSPEED :      204,800
MAXTHR     :
SLAVETHR   :

PL/SQL procedure successfully completed.

All the code does is set the MBRC, IOSEEKTIM, and IOTFRSPEED to fixed values and the only real gather is the CPUSPEEDNW. The parameters showing blanks are deliberately set null by the procedure – before I called the gather_system_stats() every parameter had a value. You could also check the SQL trace file (with bind captured enabled) to see the statements that deliberately set those parameters to null if you want more proof.

What are the consequences of this call (assuming you haven’t also done something with the calibrate_io() procedure? Essentially Oracle now has information that says a single (8KB) block read request will take marginally over 10 milliseconds, and a multiblock read request of 1MB will take just over 15 milliseconds: in other words “tablescans are great, don’t use indexes unless they’re really precisely targetted”. To give you a quantitative feel for the numbers: given the choice between doing a tablescan of 1GB to pick 1,500 randomly scattered rows and using a perfect index the optimizer would choose the index.

To explain the time calculations: Oracle has set an I/O seek time of 10 ms, and a transfer rate of 204,800 bytes per ms (200 MB/s), with the guideline that a “typical” multiblock read is going to achieve 128 blocks. So the optimizer believes a single block read will take 10 + 8192/204800 ms = 10.04ms, while a multiblock read request for 1MB will take 10 + 1048576/204800 ms = 15.12 ms.

It’s also important to note that Oracle will use the 128 MBRC value in its calculation of the cost of the tablescan – even if you’ve set the db_file_mulitblock_read_count parameter for the session or system to something smaller; and if you have set the db_file_multiblock_read_count that’s the maximum size of multiblock read that the run-time engine will use for both cached reads (db file scattered read waits) and direct path reads.

Code

Here are the two procedures I used to report the values above. You will only need the privilege to execute the dbms_stats package for the second one, but you’ll need the privilege to access the SYS table aux_stats$ to use the first. The benefit of the first one is that it can’t go out of date as versions change – and, of  course, you could just run the SQL statement implied by the procedure; though wrapping the statement in a procedure means you could grant execute privileges on the procedure to non-sys users).


rem
rem     Script:         get_system_stats.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2002
rem
rem     Last tested
rem             18.3.0.0
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

set linesize 180
set trimspool on
set pagesize 60
set serveroutput on

spool get_system_stats

declare
        m_value         number;
        m_status        varchar2(64);
        m_start         date;
        m_stop          date;
begin
        for r1 in (
                select  rownum rn, pname
                from    sys.aux_stats$
                where   sname = 'SYSSTATS_MAIN'
        ) loop
                dbms_stats.get_system_stats(m_status, m_start, m_stop, r1.pname, m_value);
                if r1.rn = 1 then
                        dbms_output.put_line('System Stats');
                        dbms_output.put_line('============');
                        dbms_output.put_line('Status: ' || m_status);
                        dbms_output.put_line(
                                'Timed: ' ||
                                to_char(m_start,'dd-Mon-yyyy hh24:mi:ss') ||
                                ' - ' ||
                                to_char(m_stop ,'dd-Mon-yyyy hh24:mi:ss')
                        );
                        dbms_output.put_line('--------------------------------------------------');
                end if;
                dbms_output.put_line(rpad(r1.pname,15) ||  ' : ' || to_char(m_value,'999,999,999'));
        end loop;
end;
/

declare
        m_value         number;
        m_status        varchar2(64);
        m_start         date;
        m_stop          date;
begin
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'MBRC', m_value);
        dbms_output.put_line('MBRC       : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'MREADTIM', m_value);
        dbms_output.put_line('MREADTIM   : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'SREADTIM', m_value);
        dbms_output.put_line('SREADTIM   : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'CPUSPEED', m_value);
        dbms_output.put_line('CPUSPEED   : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'CPUSPEEDNW', m_value);
        dbms_output.put_line('CPUSPEEDNW : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'IOSEEKTIM', m_value);
        dbms_output.put_line('IOSEEKTIM  : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'IOTFRSPEED', m_value);
        dbms_output.put_line('IOTFRSPEED : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'MAXTHR', m_value);
        dbms_output.put_line('MAXTHR     : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'SLAVETHR', m_value);
        dbms_output.put_line('SLAVETHR   : ' || to_char(m_value,'999,999,999'));
end;
/

spool off

Update (Feb 2021)

Testing some mixed effects of gather_system_stats(‘EXADATA’) and set_systemt_stats({parameter},{value}) I’ve found that on 19c the call for Exadata stats usually sets the IOSEEKTIM to 10, but sometimes sets it to 0 or 1. (As mentioned above, 12.2.0.1 always seems to set it to 10.)

 

June 11, 2019

Redo Dumps

Filed under: ASSM,Infrastructure,Oracle,redo,Troubleshooting — Jonathan Lewis @ 12:53 pm BST Jun 11,2019

A thread started on the Oracle-L list-server a few days ago asking for help analysing a problem where a simple “insert values()” (that handled millions of rows per day) was running very slowly. There are many reasons why this might happen, ranging from the trivial (someone has locked the table in exclusive mode), through the slightly subtle (we’re trying to insert a row that collides on a uniqueness constraint with an uncommitted insert from another session) to the subtle (Oracle has to read through the undo to check current versions of blocks against read-consistent versions) ending up at the esoteric (the ASSM space management blocks are completely messed up again).

A 10046 trace of a session doing an insert showed only that there was a lot of time spent on single block reads. Unfortunately, since this was on an Exadata system the waits were reported as “cell single block physical read”. Unfortunately the parameters to this wait event are “cellhash#”, “diskhash#”, and “bytes” and we don’t see the file_id, block_id which can be very helpful for a case like this. The only information we got from the trace file was that the object_id was for the table were rows were being inserted.

Before digging into exotic debugging methods, the OP supplied us with a 1-second session report from Tanel Poder’s snapper script:


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  SID @INST, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>   1070  @2, SYSADMIN, STAT, session logical reads                                     ,         13865,      7.73k,         ,             ,          ,           ,      14.1k total buffer visits
>   1070  @2, SYSADMIN, STAT, user I/O wait time                                        ,           141,      78.65,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, non-idle wait time                                        ,           141,      78.65,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, non-idle wait count                                       ,         12230,      6.82k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read total IO requests                           ,          6112,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read requests optimized                          ,          6111,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read total bytes optimized                       ,      50069504,     27.93M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read total bytes                                 ,      50069504,     27.93M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, cell physical IO interconnect bytes                       ,      50069504,     27.93M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, gcs messages sent                                         ,             3,       1.67,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, db block gets                                             ,         13860,      7.73k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, db block gets from cache                                  ,         13860,      7.73k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, db block gets from cache (fastpath)                       ,          7737,      4.32k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, consistent gets                                           ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, consistent gets from cache                                ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, consistent gets pin                                       ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, consistent gets pin (fastpath)                            ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, logical read bytes from cache                             ,     113541120,     63.34M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical reads                                            ,          6111,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical reads cache                                      ,          6111,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read IO requests                                 ,          6112,      3.41k,         ,             ,          ,           ,      8.19k bytes per request
>   1070  @2, SYSADMIN, STAT, physical read bytes                                       ,      50069504,     27.93M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, db block changes                                          ,            11,       6.14,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, free buffer requested                                     ,          6112,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, hot buffers moved to head of LRU                          ,           958,     534.39,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, free buffer inspected                                     ,          6144,      3.43k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, shared hash latch upgrades - no wait                      ,             7,        3.9,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, blocks decrypted                                          ,          6110,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, redo entries                                              ,          6120,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, redo size                                                 ,        465504,    259.67k,         ,             ,          ,           ,          ~ bytes per user commit
>   1070  @2, SYSADMIN, STAT, redo entries for lost write detection                     ,          6110,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, redo size for lost write detection                        ,        464756,    259.25k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, redo subscn max counts                                    ,             7,        3.9,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, file io wait time                                         ,       1408659,    785.78k,         ,             ,          ,           ,   230.47us bad guess of IO wait time per IO request
>   1070  @2, SYSADMIN, STAT, gc current blocks received                                ,             3,       1.67,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, gc local grants                                           ,          6116,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, ASSM cbk:blocks examined                                  ,         12366,       6.9k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, ASSM gsp:L1 bitmaps examined                              ,          2478,      1.38k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, ASSM gsp:L2 bitmaps examined                              ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, ASSM gsp:reject db                                        ,         12388,      6.91k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, buffer is pinned count                                    ,           230,      128.3,         ,             ,          ,           ,       1.63 % buffer gets avoided thanks to buffer pin caching
>   1070  @2, SYSADMIN, STAT, cell flash cache read hits                                ,          6723,      3.75k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, TIME, background cpu time                                       ,        365192,   203.71ms,    20.4%, [##        ],          ,           ,
>   1070  @2, SYSADMIN, TIME, background elapsed time                                   ,       1273623,   710.45ms,    71.0%, [########  ],          ,           ,      28.95 % unaccounted time
>   1070  @2, SYSADMIN, WAIT, gc current block busy                                     ,           629,   350.87us,      .0%, [          ],         3,       1.67,   209.67us average wait
>   1070  @2, SYSADMIN, WAIT, cell single block physical read                           ,       1557638,   868.88ms,    86.9%, [WWWWWWWWW ],      6746,      3.76k,    230.9us average wait
>  
> --  End of Stats snap 1, end=2019-05-18 12:58:58, seconds=1.8

My first step was simply to read down the list (using a very small font to get the entire width on screen without wrapping) to see if anything stood out as unusual. The report showed two things I rarely see in the session stats:


blocks decrypted                                          ,          6110
redo entries for lost write detection                     ,          6110

These stats tell me that there are two “uncommon” features enabled: db_lost_write_protect, and block level encryption. (So whatever else is going on it’s just possible that mixing in two rarely used – and therefore less frequently tested – features may be confusing the issue.

Lost write protection means Oracle writes a “block read record” (BRR) to the redo log every time it reads a block from disc, so I decided to follow up the 6,110 figure to see what other stats reported similar values.


physical read total IO requests                           ,          6112
physical read requests optimized                          ,          6111
physical reads                                            ,          6111
physical reads cache                                      ,          6111
physical read IO requests                                 ,          6112
free buffer requested                                     ,          6112
redo entries                                              ,          6120
redo entries for lost write detection                     ,          6110
gc local grants                                           ,          6116
cell flash cache read hits                                ,          6723

There’s nothing particularly surprising here – basically we see all the blocks being read as single block reads, into cache. All the necessary global cache (gc) grants are local so it’s possible the table of interest has been remastered to this node. The value for “cell flash cache read hits” look a little odd as the cache is hit more frequently than blocks are read – but dynamic performance views are not read-consistent and this session is hammering away like crazy so this might just be a side effect of the time to gather the data for the report.

We can chase the redo a little further – the number of redo entries is slightly larger than the number of blocks read, so (even though small inconsistencies are not necessarily meaningful) this might tell us something:


redo entries                                              ,          6120
redo size                                                 ,        465504
redo entries for lost write detection                     ,          6110
redo size for lost write detection                        ,        464756 
db block changes                                          ,            11

The number of “redo entries” that were NOT for lost write detection is 10, totalling 748 bytes (not a lot – so indicative of “non-user” activity). The number of “db block changes” is 11 (close enough to 10), and generally it’s changes to db blocks that require redo to be generated. The final significant number is the one that isn’t there – there’s no undo generated, so no user-change to data. This system is working like crazy achieving absolutely nothing at this point.

The next point to ponder is what sort of work it is doing – so let’s check how the physical reads turn into buffer gets.


session logical reads                                     ,         13865
db block gets                                             ,         13860
db block gets from cache                                  ,         13860
db block gets from cache (fastpath)                       ,          7737
consistent gets                                           ,             1
consistent gets from cache                                ,             1
consistent gets pin                                       ,             1
consistent gets pin (fastpath)                            ,             1
hot buffers moved to head of LRU                          ,           958

buffer is pinned count                                    ,           230 

The unusual thing you notice with these figures is that virtually every buffer get is a current get. We’ve also got a number of blocks pinned – this might just be the segment header block, or the segment header and level 2 bitmap block that we keep revisiting. Finally we can see a lot of hot buffers being moved to the head of the LRU; since our session has been doing a lot of work for a long time it seems likely that those buffers are ones that our session is keeping hot – and for a big insert that shouldn’t really be happening unless, perhaps, we were managing to do a lot of maintenance of (well-clustered) indexes.

I’ve isolated the (new in 12.2) “ASSM gsp (get space)” statistics from this output – they’re all about handling blocks, but I wanted to look at them without being distracted by other stats.


ASSM cbk:blocks examined                                  ,         12366 
ASSM gsp:L1 bitmaps examined                              ,          2478 
ASSM gsp:L2 bitmaps examined                              ,             1 
ASSM gsp:reject db                                        ,         12388

We can see that we’ve examined 2,478 “level 1” bitmap blocks. A level 1 block holds the basic “bitmap” that records the state of a number of data blocks (typically 128 blocks once the object gets very large) so our session has worked its way through 2,478 maps trying to find a data block that it could use to insert a row. The “reject db” statistic tells us about data blocks that have been examined and rejected (presumably because the row we want to insert is too large to fit, or maybe because there are no free ITL (interested transaction list) entries available in the block). So we seem to be spending all our time searching for somewhere to insert rows. This shouldn’t really be happening – it’s a type of problem that Oracle has been worrying away at for quite some time: how do you avoid “losing” space by updating bitmap blocks too soon on inserts without going to the opposite extreme and leaving bitmap blocks that claim the space is free when it’s in use by uncommitted transactions.

Note to self: I don’t know how we managed to reject more blocks (12,388) than we’ve examined (12,366) but possibly it’s just one of those timing glitches (the error is less than one fifth of one percent) , possibly it’s something to do with the reject count including some of the L1 bitmap blocks.

Clearly there’s something funny going on with space management – and we need to look at a few blocks that are exhibiting the problems. But how do we find a few suitable blocks? And that’s where, finally, we get to the title of the piece.

We are in the lucky position of having “lost write protection” enabled – so the redo log file will hold lots of “block read records”. We can’t get the file and block addresses we need from the “cell physical read” wait events so let’s ask the redo log to supply them. We just have to pick a log file (online or archived) and tell Oracle to dump some of it – and we can probably get away with a fairly small dump since we want just a single type of redo record over a short period of time. Here’s an example showing the format of two slightly different commands you could execute:


alter system dump logfile '/u01/app/oracle/oradata/orcl12c/redo03.log'
        rba min 2781    1
        rba max 2781    1000
        layer 23 opcode 2
;

alter system dump redo 
        scn min 19859991 scn max 19964263 
        layer 23 opcode 2
;

The first command is to dump a log file by name – but you may have to fiddle around a bit to find the names of an archived log file because if you choose this option you need to know the sequence number (sequence# in v$log_history) of the file if you want to restrict the size of the dump. The second command simply dumps redo for (in this example) an SCN range – and it’s easy to query v$log_history to find dates, times, and SCN ranges – Oracle will work out for itself which file it has to access. In both cases I’ve restricted the dump to just those redo records that contain change vectors of type BRR (block read records) which is what the layer 23 opcode 2 line is about.

Here’s an example of a redo record that contains nothing but a single BRR. (It’s from a single-block read, a multi-block read would produce a redo record with multiple change vectors, one vector for each block read.)


REDO RECORD - Thread:1 RBA: 0x000add.00000019.01b0 LEN: 0x004c VLD: 0x10 CON_UID: 2846920952
SCN: 0x00000000025a7c13 SUBSCN:  1 05/23/2019 10:42:51
CHANGE #1 CON_ID:3 TYP:2 CLS:6 AFN:9 DBA:0x00407930 OBJ:40 SCN:0x00000000001a1e2a SEQ:2 OP:23.2 ENC:0 RBL:0 FLG:0x0000
 Block Read - afn: 9 rdba: 0x00407930 BFT:(1024,4225328) non-BFT:(1,31024)
              scn: 0x00000000001a1e2a seq: 0x02
              flags: 0x00000006 ( dlog ckval )
              where: qeilwh05: qeilbk

If you’re wondering about the two interpretations of the rdba (relative datablock address), one is for BFTs (big file tablespaces) and one for non-BFTs. The other thing you’ll notice about the interpretations is that neither file number (1024 or 1) matches the afn (absolute file number). In smaller, non-CDB databases you will probably find that the afn matches the file number in the non-BFT interpretation, but I happen to be testing on a PDB and the first file in my SYSTEM tablespace happens to be the 9th file created in the CDB – connecting as SYS in my PDB I can compare the absolute and “relative” file number very easily:


SQL> select file#, rfile#, name from v$datafile;

     FILE#     RFILE# NAME
---------- ---------- ----------------------------------------------------------------
         9          1 /u01/app/oracle/oradata/orcl12c/orcl/system01.dbf
        10          4 /u01/app/oracle/oradata/orcl12c/orcl/sysaux01.dbf
        11          9 /u01/app/oracle/oradata/orcl12c/orcl/undotbs01.dbf
        12         12 /u01/app/oracle/oradata/orcl12c/orcl/users01.dbf
        13         13 /u01/app/oracle/oradata/orcl12c/orcl/APEX_1991375173370654.dbf
        14         14 /u01/app/oracle/oradata/orcl12c/orcl/APEX_1993195660370985.dbf
        22         22 /u01/app/oracle/oradata/orcl12c/orcl/test_8k_assm.dbf
        23         23 /u01/app/oracle/oradata/orcl12c/orcl/test_8k.dbf

8 rows selected.

For bigfile tablespaces the “relative” file number is a complete fake and simply reports 1024 – you’re only allowed one file in a bigfile tablespace, so there is no “relativity” involved. (Unless you’re working at CERN and storing data about particle collisions in the LHC.)

The key point to remember when reading BRRs then, is that you should take the file number from the afn and the block number from the (appropriate) interpretation of the rdba. For the example above I would issue: “alter database dump datafile 9 block 31024;”

Finally

The originator of the thread hasn’t yet made any public response to the suggestion of dumping and reviewing blocks – possibly they’ve started a private conversation with Stefan Koehler who had suggested a strategy that examined function calls rather than block contents – so we’re unable to do any further analysis on what’s going on behind the scenes.

What we would be looking for is any indication that Oracle is repeatedly re-reading the same bitmap blocks and the same data blocks (by a simple check of block addresses); and if that is the case we would want to get some clue about why that might be happening by examining the contents of the data blocks that are subject to repeated reads without changing their status in the bitmap from “space available” to “full”.  As it is we just have to wait for the OP to tell us if they’ve made any further progress.

 

June 14, 2017

Unpivot

Filed under: Execution plans,Oracle,Performance — Jonathan Lewis @ 3:46 pm BST Jun 14,2017

An interesting observation appeared recently as a side-channel on a question on the OTN database forum – how does Oracle execute an unpivot() operation. Here’s an example of such a query:

rem
rem     Script:         unpivot_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2017
rem

create table t1( id, col1, col2, col3, col4, col5, padding )
cache
pctfree 95 pctused 5
-- compress for query low
as
select
        1, 100 , 200 , 300 , 400 , 500,rpad('x',100)
from
        all_objects
where
        rownum <= 50000 -- > needed to bypass wordpress format glitch
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

select
        /*+ gather_plan_statistics find this */
        id, max(val) as high_val
from
        t1
unpivot include nulls (
        val for source in (col1, col2, col3, col4, col5)
)
group by id
order by id
; 

I’ve created a table with 50,000 rows (all_objects is quite large in 12.1.0.2 and 12.2.0.1), but with lots of free space per block so that I get three rows per block for a total of roughly 16,667 blocks which is going to make it fairly easy to spot any interesting session statistics. Then I’ve used an unpivot() call that has the effect of turning one row with five columns into five rows with one column.

Here’s the basic execution plan for the query (as pulled from memory):


-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        | 10695 (100)|      1 |00:00:00.18 |   16671 |       |       |          |
|   1 |  SORT GROUP BY       |      |      1 |      1 | 10695   (2)|      1 |00:00:00.18 |   16671 |  2048 |  2048 | 2048  (0)|
|   2 |   VIEW               |      |      1 |    250K| 10653   (2)|    250K|00:00:01.41 |   16671 |       |       |          |
|   3 |    UNPIVOT           |      |      1 |        |            |    250K|00:00:00.52 |   16671 |       |       |          |
|   4 |     TABLE ACCESS FULL| T1   |      1 |  50000 |  2131   (2)|  50000 |00:00:00.12 |   16671 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

According to the plan Oracle has done one full tablescan on the data (Starts = 1), but the unpivot operation has used this to produce 250,000 rows of output from 50,000 rows (A-Rows=) of input. Note, however, that the cost of the view operation is 5 times the cost of the tablescan but, on the other hand, the number of buffers visited is 16,671 (which matches the size of the table). So have we done 5 tablescans with a union all, or have we done one tablescan ?

The next step is to look at the 10053 (optimizer) trace file, specifically for the “unparsed query” which (despite the final plan table showing the plan we’ve just seen above) looked like this – after applying a few purely cosmetic changes:

Final query after transformations:******* UNPARSED QUERY IS *******
SELECT "from$_subquery$_002"."ID" "ID",MAX("from$_subquery$_002"."VAL") "HIGH_VAL"
FROM    (
                   (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL1' "SOURCE","T1"."COL1" "VAL" FROM "TEST_USER"."T1" "T1")
        UNION ALL  (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL2' "SOURCE","T1"."COL2" "VAL" FROM "TEST_USER"."T1" "T1")
        UNION ALL  (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL3' "SOURCE","T1"."COL3" "VAL" FROM "TEST_USER"."T1" "T1")
        UNION ALL  (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL4' "SOURCE","T1"."COL4" "VAL" FROM "TEST_USER"."T1" "T1")
        UNION ALL  (SELECT "T1"."ID" "ID","T1"."PADDING" "PADDING",'COL5' "SOURCE","T1"."COL5" "VAL" FROM "TEST_USER"."T1" "T1")
        ) "from$_subquery$_002"
GROUP BY "from$_subquery$_002"."ID"
ORDER BY "from$_subquery$_002"."ID"
;

And then there’s the outline (which I can see in the 10053 trace, or in the plan pulled from memory by a call to dbms_xplan.display_cursor()).


Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.1.0.2')
      DB_VERSION('12.1.0.2')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      OUTLINE_LEAF(@"SEL$2")
      OUTLINE_LEAF(@"SEL$3")
      OUTLINE_LEAF(@"SEL$4")
      OUTLINE_LEAF(@"SEL$5")
      OUTLINE_LEAF(@"SET$1")
      OUTLINE_LEAF(@"SEL$6")
      NO_ACCESS(@"SEL$6" "from$_subquery$_002"@"SEL$6")
      FULL(@"SEL$5" "T1"@"SEL$5")
      FULL(@"SEL$4" "T1"@"SEL$4")
      FULL(@"SEL$3" "T1"@"SEL$3")
      FULL(@"SEL$2" "T1"@"SEL$2")
      FULL(@"SEL$1" "T1"@"SEL$1")
      END_OUTLINE_DATA
  */

And if I also pull the alias (query block and fully qualified table name) information from memory:


Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$6
   2 - SET$1 / from$_subquery$_002@SEL$6
   3 - SET$1
   4 - SEL$5 / T1@SEL$5

Then there’s the alternative (though not necessarily correct) way to find an “unparsed” version of the query – dbms_sql2.expand_sql_text() in 11g, dbms_utility.expand_sql_text() in 12c:


variable m_sql_out clob

declare
        m_sql_in    clob :=
                'select /*+ gather_plan_statistics find this */ id, max(val) as high_val
                from    t1
                        unpivot include nulls ( val for source in (col1, col2, col3, col4, col5) )
                group by id
                order by id
                '
        ;
begin

--      dbms_sql2.expand_sql_text(        -- 11g
        dbms_utility.expand_sql_text(     -- 12c
                m_sql_in,
                :m_sql_out
        );

end;
/

set long 20000
print m_sql_out

M_SQL_OUT
--------------------------------------------------------------------------------
SELECT "A1"."ID" "ID",MAX("A1"."VAL") "HIGH_VAL" FROM  ( (SELECT "A3"."ID" "ID",
"A3"."PADDING" "PADDING",'COL1' "SOURCE","A3"."COL1" "VAL" FROM "TEST_USER"."T1"
 "A3") UNION ALL  (SELECT "A4"."ID" "ID","A4"."PADDING" "PADDING",'COL2' "SOURCE
","A4"."COL2" "VAL" FROM "TEST_USER"."T1" "A4") UNION ALL  (SELECT "A5"."ID" "ID
","A5"."PADDING" "PADDING",'COL3' "SOURCE","A5"."COL3" "VAL" FROM "TEST_USER"."T
1" "A5") UNION ALL  (SELECT "A6"."ID" "ID","A6"."PADDING" "PADDING",'COL4' "SOUR
CE","A6"."COL4" "VAL" FROM "TEST_USER"."T1" "A6") UNION ALL  (SELECT "A7"."ID" "
ID","A7"."PADDING" "PADDING",'COL5' "SOURCE","A7"."COL5" "VAL" FROM "TEST_USER".
"T1" "A7")) "A1" GROUP BY "A1"."ID" ORDER BY "A1"."ID"

So at some level the optimizer does think it’s doing 5 tablescans – and the t1 reported in the unpivot plan is from the fifth (i.e. last) copy of the table in the union all. But the execution statistics and the session statistics say I’ve done just one tablescan – and this persists even when I make the table so large that it (a) won’t fit the buffer cache, and/or (b) uses direct path reads, and/or (c) runs under Exadata with Hybrid columnar compression.

So what’s (probably) happening ?

I think Oracle has a code path that says it’s doing a union all of tablescans (at least for this example) but tells it that the union all is there as an expansion of an unpivot so (sticking with an example that does a tablescan into the buffer cache) Oracle reads the number of blocks dictated by the current multiblock read count into the cache, pins that one batch of blocks, scans the batch (or, possibly, each block in turn) 5 times (or as required), unpins the batch and then reads the next batch. So the session does five tablescans but does them in a way that lets you see only one tablescan in the statistics.

Footnote

There was a change in the results when I tested this on 12.2.0.1; the unparsed query reported only a two-part union all subquery, and the table alias information in the plan identified the referenced table as the copy of the table from the second (i.e. still last) subquery in the union all. More significantly the cost of the VIEW operation was a close match to the cost of a single tablescan, rather than being a multiple thereof:


select * from table(dbms_xplan.display_cursor('1k077bzp0t6mn',null,'outline alias cost 

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |  1117 (100)|      1 |00:00:00.13 |    8350 |   8334 |       |       |          |
|   1 |  SORT GROUP BY       |      |      1 |    250K|  1117   (6)|      1 |00:00:00.13 |    8350 |   8334 |  2048 |  2048 | 2048  (0)|
|   2 |   VIEW               |      |      1 |    250K|  1076   (2)|    250K|00:00:00.13 |    8350 |   8334 |       |       |          |
|   3 |    UNPIVOT           |      |      1 |        |            |    250K|00:00:00.09 |    8350 |   8334 |       |       |          |
|   4 |     TABLE ACCESS FULL| T1   |      1 |  50000 |  1074   (2)|  50000 |00:00:00.07 |    8350 |   8334 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$3
   2 - SET$1 / from$_subquery$_002@SEL$3
   3 - SET$1
   4 - SEL$2 / T1@SEL$2

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      DB_VERSION('12.2.0.1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      OUTLINE_LEAF(@"SEL$2")
      OUTLINE_LEAF(@"SET$1")
      OUTLINE_LEAF(@"SEL$3")
      NO_ACCESS(@"SEL$3" "from$_subquery$_002"@"SEL$3")
      FULL(@"SEL$2" "T1"@"SEL$2")
      END_OUTLINE_DATA
  */

M_SQL_OUT
--------------------------------------------------------------------------------
SELECT "A1"."ID" "ID",MAX("A1"."VAL") "HIGH_VAL" FROM  (SELECT "A3"."ID" "ID","A
3"."PADDING" "PADDING",'COL1' "SOURCE","A3"."COL1" "VAL" FROM "TEST_USER"."T1" "
A3" UNION ALL SELECT "A3"."ID" "ID","A3"."PADDING" "PADDING",'COL2' "SOURCE","A3
"."COL2" "VAL" FROM "TEST_USER"."T1" "A3" UNION ALL SELECT "A3"."ID" "ID","A3"."
PADDING" "PADDING",'COL3' "SOURCE","A3"."COL3" "VAL" FROM "TEST_USER"."T1" "A3"
UNION ALL SELECT "A3"."ID" "ID","A3"."PADDING" "PADDING",'COL4' "SOURCE","A3"."C
OL4" "VAL" FROM "TEST_USER"."T1" "A3" UNION ALL SELECT "A3"."ID" "ID","A3"."PADD
ING" "PADDING",'COL5' "SOURCE","A3"."COL5" "VAL" FROM "TEST_USER"."T1" "A3") "A1
" GROUP BY "A1"."ID" ORDER BY "A1"."ID"
allstats last'));

It used to be hard enough working out what the run-time engine was doing when the optimizer and the run-time engine claimed that they were working on the same (transformed version of the) SQL; but now we can get the impression that the optimizer is directing the run-time engine to execute a plan that couldn’t possibly match the (unparsed) query that the optimizer had been considering.

Update (May 2019)

In a response to Stew Ashton in the comment section I pointed out that the unpivot feature (unlike the merge command) behaved nicely when it came to deciding which columns to project from the table you were operating on. I made the comment that I had tested 11.2.0.4 and 12.1.0.2 but not got round to testing 12.2.0.1.  A couple of years have now passed and today I finally tested 12.2.0.1 and 18.3.0.0 – they’ve taken a retrograde step, both versions project the padding column and the plan is forced to do a full tablescan even if there’s a suitable index that could be used for an index fast full scan.

Key point:  if you’re going to unpivot a few columns from a big table then don’t supply the table name in the “from” clause,  change it to an inline view that selects only the columns you need, for example:


select 
        id, max(val) as high_val 
from 
        (select id, col1, col2, col3, col4, col5 from t1) t1
unpivot include nulls ( 
        val for source in (col1, col2, col3, col4, col5)
) 
group by id 
order by id 
; 

With my sample data the execution plan for the original query used a tablescan full, but the modified query switched to an index fast full scan as I had a covering index available. Of course, you could use this strategy even if you don’t have a covering index as there will still be some CPU overhead of projecting redundant columns and passing them up the pipeline to the parent (unpivot) operation.

Update Mar 2020

Andrew Sayer has just brought Bug 22339954 – High Parse Time for Query Using UNPIVOT Operator (Doc ID 22339954.8) to my attention: present up to version 12.1.0.2, fixed in 12.2.0.1 with back ports for 11.2.0.3 and 12.1.0.2.

The details about the bug suggest that the bug also allows for errors in expansion and column projection (ORA-00904, ORA-00942), so part of the fix to bug 22339954 may be the underlying cause of the appearance of the redundant column in the projection list for my example.

May 25, 2017

Parallelism

Filed under: 12c,CBO,Hints,Ignoring Hints,Oracle — Jonathan Lewis @ 3:48 pm BST May 25,2017

Headline – if you don’t want to read the note – the /*+ parallel(N) */ hint doesn’t mean a query will use parallel execution, even if there are enough parallel execution server processes to make it possible. The parallel(N) hint tells the optimizer to consider the cost of using parallel execution for each path that it examines, but ultimately the optimizer will still take the lowest cost path (bar the odd few special cases) and that path could turn out to be a serial path.

The likelihood of parallelism appearing for a given query changes across versions of Oracle so you can be fooled into thinking you’re seeing bugs as you test new versions but it’s (almost certainly) the same old rule being applied in different circumstances. Here’s an example – which I’ll start off on 11.2.0.4:


create table t1
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                          id,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 ; create index t1_i1 on t1(id); begin dbms_stats.gather_table_stats( ownname => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

set autotrace traceonly explain

select
        count(v1)
from    t1
where   id = 10
;

select
        /*+ parallel(4) */
        count(v1)
from    t1
where   id = 10
;

select
        /*+ parallel(4) full(t1) */
        count(v1)
from    t1
where   id = 10
;

set autotrace off

I haven’t declare the index to be unique, but it clearly could be; and it’s obvious that with 1M rows and about 120M of table a parallel full scan is probably a bad idea to acquire one row (even if you’re running Exadata!). So what do we get for the three plans – I’ll skip the predicate section – when we want to collect one row.


Base plan - unhinted
--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     1 |    16 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE              |       |     1 |    16 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |     1 |    16 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_I1 |     1 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Hinted parallel(4)
--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     1 |    16 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE              |       |     1 |    16 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |     1 |    16 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_I1 |     1 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Hinted parallel(4) and full(t1)
----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |    16 |   606   (2)| 00:00:02 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |    16 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |    16 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |    16 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          |     1 |    16 |   606   (2)| 00:00:02 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS FULL| T1       |     1 |    16 |   606   (2)| 00:00:02 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

In 11.2.0.4 the optimizer did consider the parallel hint when it appeared on its own – but it has compared the parallel(4) cost of 606 with the serial index cost of 4 and chosen the indexed access path. This is not a case of ignoring the hint, it’s an example of being fooled if you don’t know how the hint is really supposed to work.

But here’s an interesting change that appeared in 12.2 – this time just the plan with the parallel(4) hint on its own:


---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |          |     1 |    16 |     4   (0)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE                         |          |     1 |    16 |            |          |        |      |            |
|   2 |   PX COORDINATOR                        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)                  | :TQ10001 |     1 |    16 |            |          |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE                      |          |     1 |    16 |            |          |  Q1,01 | PCWP |            |
|   5 |      TABLE ACCESS BY INDEX ROWID BATCHED| T1       |     1 |    16 |     4   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE                        |          |     1 |       |     3   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX SEND HASH (BLOCK ADDRESS)     | :TQ10000 |     1 |       |     3   (0)| 00:00:01 |  Q1,00 | S->P | HASH (BLOCK|
|   8 |         PX SELECTOR                     |          |       |       |            |          |  Q1,00 | SCWC |            |
|*  9 |          INDEX RANGE SCAN               | T1_I1    |     1 |       |     3   (0)| 00:00:01 |  Q1,00 | SCWP |            |
---------------------------------------------------------------------------------------------------------------------------------

You get a parallel execution plan – although it starts with a serial index range scan which is operated for the new (12c) PX Selector operator that allocates a serial operation to one of the parallel execution slaves – which, approximately, is why the indexed access cost doesn’t change in this example – rather than running it through the query coordinator (QC). The serial range scan does a hash distribution (hashed by block address of the rowids it finds to avoid collisions between parallel execution slave as they do their table accesses.

This is just one cute little trick that makes it worth looking at the upgrade to 12c – this new path is likely to be of benefit to people who had to create global (as opposed to globally partitioned) indexes on partitioned tables.

This note was prompted by a recent twitter comment by Timur Akhmadeev followed in short order by an OTN posting that added further confusion to the problem by running Siebel – which is just one of several 3rd party products that love to configure optimizer parameters with non-standard values like: optimizer_index_cost_adj = 1, or optimizer_mode = first_rows_10. (At the last update I’ve seen on the thread, there seemed to be some other reason why parallelism was being blocked.)

Footnote

In a follow-up tweet, Timue directed me to the 11.2 SQL Language Reference manual – specifically a section on the Parallel Hint, asking if this was an example of a documentation bug.

The trouble with the manuals is that sometimes they are obviously wrong, sometimes they are wrong but it’s not obvious they are wrong, sometimes they omit important information, and sometimes they are badly written and, most specfically, the writing can be ambiguous.

Here’s an extract we could consider:

For PARALLEL, if you specify integer, then that degree of parallelism will be used for the statement.

But my example above shows a “parallel({integer})” hint where we didn’t use that degree of parallelism for the statement.

However the next two sentences read as follows:

If you omit integer, then the database computes the degree of parallelism. All the access paths that can use parallelism will use the specified or computed degree of parallelism.

So what if the optimizer uses the degree of parallelism while calculating the lowest cost plan and ends up with a serial plan ? How comfortable would you feel saying that Oracle has “used the degree of parallelism for the statement”. Or would you say that the first sentence means Oracle isn’t allowed to use a serial plan even if it finds one when doing the arithmetic with the appropriate degree of parallelism.

My call is that this is one of those ambiguous cases – the manual should say something more like:

For PARALLEL, if you specify integer, then that degree of parallelism will be used by the optimizer while calculating the best execution  plan for the statement.

Even then I’m not sure that that’s a complete statement of how the hint works because when you have a full set of system statistics, or have used the dbms_resource_manager.calibrate_io mechanism to tell Oracle about the I/O capacity of the system the optimizer may do some working that says something like: “the hint says degree 64, but the stats say the maximum effective degree will be 38 so I’ll calculate using 38” (This type of thing happens with the older usage of the parallel hint with manual parallelism – I haven’t examined what happens with an automatic policy and the newer option for the hint.)

 

May 19, 2017

255 columns

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 5:49 pm BST May 19,2017

This is one of my “black hole” articles – I drafted it six months ago, but forgot to publish it.

A recent post on OTN highlighted some of the interesting oddities that appear when you create tables with more than 255 columns. In fact this was a more subtle case than usual because it reminded us that it’s possible to have a partitioned table which appears to have less than the critical 255 columns while actually having more than 255 columns thanks to the anomaly of how Oracle handles dropping columns in a partitioned table.  (For a useful insight see this note from Dominic Brooks – and for a nice thought about preparing simple tables for an exchange with such a partitioned tables look at the 12.2 feature of “create table for exchange” in Maria Colgan’s recent article)

The thread took me down the path of trying to recreate some notes I wrote a long time ago and can no longer find and the OTN problem wasn’t the basic one I had assumed anyway, but I thought I’d publish a bit of the work I had done so that you can see another of the funny effects that appear when your table definition has too many columns (and you use them).

The OP told us about a table with more than 350 columns, so here’s a little script I wrote to generate a table with 365 columns and some data. (It turned out that the OP had more than 390 columns in the table, but 30+ had been “dropped”.)


rem
rem	Script:		wide_table_2.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Nov 2016
rem	Purpose:
rem
rem	Last tested
rem		11.2.0.4
rem

create sequence s1;

declare
	m_statement_1	varchar2(32767) :=
		'create table t1(col0001 varchar2(10),';
	m_statement_2	varchar2(32767) :=
		'insert into t1 values(lpad(s1.nextval,10),';
begin
	for i in 2 .. 365 loop
		m_statement_1 := m_statement_1 ||
			'col' || to_char(i,'FM0000') || ' varchar2(100),'
		;
		m_statement_2 := m_statement_2 ||
			case when i in (2,3,4)
			-- case when i in (122,123,124)
			-- case when i in (262,263,264)
				then 'dbms_random.string(''U'',ceil(dbms_random.value(0,100))),'
			     when i = 365
				then 'lpad(s1.currval,7))'
				else '''COL' || to_char(i,'FM0000') || ''','
			end
		;
	end loop;

	m_statement_1 := substr(m_statement_1, 1, length(m_statement_1) - 1);
	m_statement_1 := m_statement_1 || ') pctfree 25';

	execute immediate m_statement_1;

	for i in 1..10000 loop
		execute immediate m_statement_2;
	end loop;

end;
/

I’ve taken a fairly simple approach to building a string that creates a table – and it’s easy to adjust the number of columns – and a string to insert some values into that table. The insert statement will insert a row number into the first and last columns of the table and generate a random length string for a few of the columns. I’ve picked three possible sets of three columns for the random length string; one set is definitely going to be in the first row piece, one set is definitely going to be in the last row piece, and (since the row will split 110/255) one will be somewhere inside whichever is the larger row piece.

If I wanted to do something more sophisticated I’d probably have to switch to a PL/SQL array for the two statements strings – 32,767 characters doesn’t give me much freedom to play if I wanted to test a table with 1,000 columns.

Having created and populated my table, I performed the following three tests on it:


analyze table t1 compute statistics;

prompt	====
prompt	CTAS
prompt	====

create table t1a pctfree 25 as select * from t1;
analyze table t1a compute statistics;

select	table_name, num_rows, avg_row_len, blocks, chain_cnt
from	user_tables
where	table_name like 'T1%'
;

prompt	======
prompt	Insert
prompt	======

truncate table t1a;
insert into t1a select * from t1;
analyze table t1a compute statistics;

select	table_name, num_rows, avg_row_len, blocks, chain_cnt
from	user_tables
where	table_name like 'T1%'
;

prompt	=============
prompt	Insert append
prompt	=============

truncate table t1a;
insert /*+ append */ into t1a select * from t1;
analyze table t1a compute statistics;

select	table_name, num_rows, avg_row_len, blocks, chain_cnt
from	user_tables
where	table_name like 'T1%'
;

The first test creates a new table (t1a, at pctfree 25, matching the original) copying the original table with a simple “create as select”.

The second test truncates this table and does a basic “insert as select” to repopulate it.

Third test truncates the table again and does an “insert as select” with the /*+ append */ hint to repopulate it.

In all three cases (and with three variations of where the longer random strings went) I used the analyze command to gather stats on the tables so that I could get a count of the number of chained rows; and I dumped a couple of blocks from the tables to see what the inserted rows looked like.

Here’s a summary of the results from 11.2.0.4 when the random-length columns are near the start of the row (the position didn’t really affect the outcome, and the results for 12.1.0.2 and 12.2.0.1 were very similar):

====
CTAS
====

TABLE_NAME             NUM_ROWS AVG_ROW_LEN     BLOCKS  CHAIN_CNT
-------------------- ---------- ----------- ---------- ----------
T1                        10000        3062       6676       3313
T1A                       10000        3062       9504        237

======
Insert
======

TABLE_NAME             NUM_ROWS AVG_ROW_LEN     BLOCKS  CHAIN_CNT
-------------------- ---------- ----------- ---------- ----------
T1                        10000        3062       6676       3313
T1A                       10000        3062       6676       3287

=============
Insert append
=============

TABLE_NAME             NUM_ROWS AVG_ROW_LEN     BLOCKS  CHAIN_CNT
-------------------- ---------- ----------- ---------- ----------
T1                        10000        3062       6676       3313
T1A                       10000        3062       9504        237

As you can see we get two significantly different results: the CTAS and the “insert append” produce tables reporting 9,504 blocks and 237 chained rows, while the original table (single row inserts) and the regular “insert as select” produce tables with 6,676 blocks and 3,133 chained rows. It seems that the CTAS (which would also cover “alter table move”) and direct path insert have minimised the number of chained rows at a cost of a dramatically increased number of blocks. (The scale of the difference happens to be particularly extreme in this case – I didn’t do this deliberately it was simply a consequence of the way I happened to generate the data and the length of the rows.)

We know, of course, that every row in this table will consist of two row pieces, one of 110 columns and one of 255 columns; so every row is in some respects chained due to the potential for intra-block chaining of those two pieces, but the analyze command reports only inter-block chaining i.e. only those rows that start in one block and end in another block – intra-block chaining doesn’t count as “proper” chaining (at least in this version of Oracle).

There are two questions to address in these results: the first is “What’s happening?”, the second, which we ask when we get the answer to the first, is “How come the direct path method still gives us some chained rows?”

I believe the answer to the first question is that the direct path method attempts to avoid chaining unavoidable row-pieces. Even if it means leaving a huge amount of empty space in a block Oracle starts a new row in a new block if there isn’t enough space for both of the anticipated row-pieces to fit in the current block. I think this may be a feature to help Exadata and its use of direct path reads for smart scans, where a relatively small number of chained rows (which might be outside the current Exadata storage unit – and even in the disk space managed by another cell server) could have a catastrophic impact on performance because the system would have to do a single block read to pick up the extra piece – which could have a devastating impact on the performance.

So why do some rows still see chaining under this strategy – I think it’s because there’s a small error in the arithmetic somewhere (possibly visible only in ASSM tablespaces, perhaps related to row-piece headers) where Oracle thinks there’s enough space for both row pieces but there isn’t quite so it tries to insert both pieces and has to chain the second one. (I’ll corroborate this hypothesis with some analysis of block dumps in a moment).

So you have a choice – lots of wasted space and a little row-chaining, or maximum packing of data and (potentially) lots of row-chaining. But there’s more: I’ve said we get one row piece of 110 columns and one of 255 columns for each row, but the point at which the split occurs and the order in which the pieces are inserted depends on the method used.

  • Single row inserts (initial table creation, typical OLTP processing): The split occurred at column 111 – so the leading 110 columns are in one row piece and the trailing 255 columns are in the other – and the row piece with the trailing columns is inserted first.
  • Array inserts (normal): Exactly the same as the single row inserts.
  • Direct path inserts / CTAS: The split occurred at column 256, with the leading 255 column row-piece inserted first and the trailing 110 column row-piece inserted second.

I’m not sure that this particular detail matters very much in normal circumstances when you consider the dramatic difference in size that can appear in the comparison between direct path and normal inserts, but maybe there’s someone who will notice a performance (or even space) side effect because of this inconsistency. We will see in a later post, though, that this difference can have an enormous impact if you subsequently add columns to the table and populate them.

I said I’d come back to the row-chaining anomaly. One of the little details that I didn’t include in my code listing was the call to “analyze table report chained rows” that I did (after executing $ORACLE_HOME/rdbms/admin/utlchain.sql) to list the head rowids of the chained rows into the chained_rows table. After doing this I ran a simple pl/sql loop to dump all the relevant blocks to the trace file:

begin
	for r in (
		select
			dbms_rowid.rowid_relative_fno(head_rowid) file#,
			dbms_rowid.rowid_block_number(head_rowid) block#
			from	chained_rows
		) loop
			execute immediate 'alter system dump datafile ' || r.file# || ' block ' || r.block#;
	end loop;
end;
/

Here’s a little extract from the resulting trace file showing you what the start of a row piece looks like when dumped:

tab 0, row 0, @0x1765
tl: 2075 fb: --H-F--- lb: 0x0  cc: 255
nrid:  0x01401bc4.1
col  0: [10]  20 20 20 20 20 20 20 33 36 32
col  1: [19]  41 52 43 4a 4a 42 4e 55 46 4b 48 4c 45 47 4c 58 4c 4e 56
col  2: [ 8]  59 4b 51 46 4a 50 53 55
col  3: [17]  53 58 59 4e 4a 49 54 4a 41 5a 5a 51 44 44 4b 58 4d
col  4: [ 7]  43 4f 4c 30 30 30 35
col  5: [ 7]  43 4f 4c 30 30 30 36

A convenient thing to check is the cc: (“column count”) entry (end of 2nd line). You can see that this row piece has 255 columns, and if you look at the first six columns dumped you can see that it’s the row numbered 362 (33 36 32), then there are three columns of different length strings, then two columns with the values ‘COL00005’ and ‘COL0007’ respectively. It’s the “cc:” entry that’s useful though, so I’m going to do a bit of simple unix hackery:

grep " cc: " test_ora_24398.trc | sed "s/^.*cc: //"  | sort | uniq -c | sort -n
      1 1
      5 2
    112 108
    125 109
    237 110
    474 255

In my 237 blocks with chained rows I had 474 row pieces of 255 columns and 237 row pieces of 110 columns; then I had 125 row pieces that had lost (and therefore chained) one column and 112 row pieces that had lost and therefore chained 2 columns. I also had a couple of small “tail-end” pieces from earlier blocks scattered in these blocks. These figures suggest that there’s a small error (actually no more than about 20 bytes) in the calculation Oracle does to decide if it can fit a whole row into the current block or whether it has to go on to the next empty block.

Conclusions

When copying a table defined with more than 255 columns there’s the potential for a huge variation in the space usage and chain count depending on whether you do a CTAS (or insert /*+ append */) or a simple insert. You have to decide which option is the biggest threat to your available resources.

There is a little anomaly with the way in which rows are split that is also dependent on the method used for copying – this may also have some effect, though perhaps small enough to be ignored when compared with the space/chaining difference as far as ordinary OLTP processing is concerned. But there are  some important side effects we will consider in a later post.

Even though CTAS/direct path insert can eliminate a lot of row chaining it is still possible to find some row chaining in the resulting data. This may be the result of a calculation error (or possibly a deliberate space saving compromise).

Note that any comments about using CTAS to copy a table also apply to “alter table move” and to using expdp/impdp.

 

Next Page »

Website Powered by WordPress.com.