Oracle Scratchpad

July 22, 2022

Trim CPU

Filed under: Execution plans,Hash Join,Joins,Oracle,Performance,Problem Solving — Jonathan Lewis @ 6:56 am BST Jul 22,2022

Prompted by unexpectedly high CPU usage on a hash join of two large data sets, Stefan Koehler recently posted a poll on Twitter asking for opinions on the following code fragment:

FROM
    TAB1
INNER JOIN TAB2 ON
    TAB1.COL1 = TAB2.COL1
AND TRIM(TAB1.COL3) > TRIM(TAB2.COL3)

While I struggle to imagine a realistic business requirement for the second predicate, and think it’s indicative of a bad data model, the example is nevertheless quite instructive as a demonstration of how a hash join can use a lot of CPU if the join includes a predicate that isn’t an equality.

Trivia

Before examining the potential for wasting CPU, I’ll just point out two problems with using the trim() function in this way – because (while I hope that col3 is a character string in both tables) I’ve seen code that uses “to_date(to_char(date_column))” instead of trunc(date_column):

Cut-n-paste from SQL*Plus:

SQL> select 1 from dual where trim(100) > trim(20);

no rows selected

==================================================================

SQL> alter session set nls_date_format = 'dd-mon-yyyy hh24:mi:ss';

SQL> select d1, d2 from t2 where trim(d1) > trim(d2);

20-jul-2022 15:24:46 19-aug-2022 15:26:44

1 row selected.

SQL> alter session set nls_date_format = 'yyyy-mm-dd hh24:mi:ss';

SQL> select d1, d2 from t2 where trim(d1) > trim(d2);

no rows selected

The trim() function converts numerics and dates to strings using the default format for the session before the comparison takes place, so not only can you get unexpected (i.e. wrong) results, but two users can also get contradictory results from the same data at the same time because they’ve specified different session defaults!

The CPU issue

The critical point that everyone should remember is this: hash joins can only operate on equality (though, to avoid ambiguity, one should point out that “equality” does also mean “not equals”, which is why hash anti-joins can be efficient).

This means that even though the clause “where tab1.col1 = tab2.col1 and tab1.col3 > tab2.col3” might specify the matching rows for an individual tab1 row with high precision and great efficiency for a nested loop join with the right index, a hash join has a completely different workload. Every tab1 row has to have its col3 compared with every tab2 row that matches on col1. The secondary tests multiply up to “n-squared”, and if any col1 value is highly repetitive then the work done on checking col3 becomes excessive.

It’s easier to see this in a worked example, so here’s some sample data:

rem
rem     Script:         trim_cost.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2022
rem
rem     Last tested 
rem             21.3.0.0
rem             19.11.0.0
rem

create table tab1 as select * from all_Objects where owner != 'PUBLIC' and object_type != 'SYNONYM' and rownum <= 200;

create table tab2 as select * from all_Objects where owner != 'PUBLIC' and object_type != 'SYNONYM';

On a new PDB in 19.11 and 21.3 the second statement gave me roughly 46,000 rows. Checking owners and row counts I got the following results:

SQL> select owner, count(*) from tab1 group by owner;

OWNER                      COUNT(*)
------------------------ ----------
SYS                             128
SYSTEM                           65
OUTLN                             7

SQL> select owner, count(*) from tab2 group by owner;

OWNER                      COUNT(*)
------------------------ ----------
SYS                           40104
SYSTEM                          417
OUTLN                             7

... plus about 17 rows aggregating 6,000 rows

And here’s the query (with comments indicating the 4 variations) that I’m going to use to demonstrate the CPU issue, followed by its execution plan with rowsource execution statistics:

set serveroutput off
alter session set statistics_level = all;

select
        count(*)
from
        tab1
inner join 
        tab2 
on
        tab1.owner = tab2.owner
-- and  trim(tab1.object_name)  > trim(tab2.object_name)
-- and  rtrim(tab1.object_name) > rtrim(tab2.object_name)
-- and  ltrim(tab1.object_name) > ltrim(tab2.object_name)
and     tab1.object_name > tab2.object_name
;

select * from table(dbms_xplan.display_cursor(format=>'projection allstats last'));

SQL_ID  74m49y5av3mpg, child number 0
-------------------------------------
select  count(*) from  tab1 inner join  tab2 on  tab1.owner =
tab2.owner -- and trim(tab1.object_name)  > trim(tab2.object_name) -- and rtrim(tab1.object_name) > rtrim(tab2.object_name) 
-- and ltrim(tab1.object_name) > ltrim(tab2.object_name) and tab1.object_name > tab2.object_name

Plan hash value: 2043035240

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.39 |     942 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.39 |     942 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    101K|    329K|00:00:00.39 |     942 |  1335K|  1335K|  814K (0)|
|   3 |    TABLE ACCESS FULL| TAB1 |      1 |    200 |    200 |00:00:00.01 |       5 |       |       |          |
|   4 |    TABLE ACCESS FULL| TAB2 |      1 |  46014 |  46014 |00:00:00.01 |     937 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("TAB1"."OWNER"="TAB2"."OWNER")
       filter("TAB1"."OBJECT_NAME">"TAB2"."OBJECT_NAME")

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=0) COUNT(*)[22]
   2 - (#keys=1; rowset=407)
   3 - (rowset=256) "TAB1"."OWNER"[VARCHAR2,128], "TAB1"."OBJECT_NAME"[VARCHAR2,128]
   4 - (rowset=256) "TAB2"."OWNER"[VARCHAR2,128], "TAB2"."OBJECT_NAME"[VARCHAR2,128]

Comparing the basic columns, the CPU time recorded at the Hash Join operation was 0.39 seconds, of which only a tiny amount was in the feeding tablescans. There are two things to note from the plan.

The first is confirmation of my comments about the join having to be an equality and the inequality being applied later. You can see this in the Predicate Information in the way the user’s predicate list has been split at operation 2 into an access() predicate and a filter() predicate. The access predicate finds the right hash bucket and the row(s) within the bucket – the filter predicate is applied as a secondary test.

The second point to note is that the Column Projection Information shows us that the basic column values are passed up to the Hash Join, which tells us that the hash join operation has to do the trimming. The big question at that point is – how many times does the same value from the same incoming row get trimmed?

Remember that there are 128 rows in tab1 where owner = ‘SYS’, so when a ‘SYS’ row arrives from tab2 the hash join has to find the right bucket then walk through the rows in that bucket (which will probably be nothing but those SYS rows). So how many times does Oracle evaluate trim(‘SYS’)? Arguably it needs to do so once for each tab1 row in the bucket (though the hash table might have been built to include the trimmed value) but clearly it ought not to re-evaluate it 128 times for the column in the single tab2 row – and we’ll come back to that point later.

Let’s go back to the 3 variants on the first test. We were interested in comparing trim() with trim(), but since trim() is equivalent to ltrim(rtrim()) I wondered whether ltrim() (left trim) and rtrim() (right trim) took different amounts of time, and whether the trim() time would be close to the sum of the ltrim() and rtrim() times.

Without showing the plans etc. here are the times reported in my 19.11.0.0 test at the hash join operation (the 21.3 times were very similar):

  • no trim – 0.39 seconds
  • ltrim() – 1.02 seconds
  • rtrim() – 2.66 seconds
  • trim() – 2.70 seconds.

Clearly that’s a lot of extra CPU on top of the base CPU cost. This is not entirely surprising since string operations tend to be expensive; nevertheless the differences are large enough to be more than random fluctuations and operational error.

Remember that this is just two tables of 200 and 46,000 rows respectively. It turned out that the rowsources that Stefan was using were in the order of 800K and 2M rows – with CPU time increasing from 1,100 seconds to 2,970 seconds because of the trim().

So how many times was the trim() function called in total?

Faking it.

If we assume that the trim() built-in SQL function behaves in the same way as a deterministic PL/SQL function we can at least count the number of calls that take place by writing a deterministic function to put into the SQL. Something like:

create or replace package p1 as
        n1 number;
        function f1(v1 in varchar2) return varchar2 deterministic;
end;
/

create or replace package body p1 as 

        function f1 (v1 in varchar2)
        return varchar2 
        deterministic
        is
        begin
                p1.n1 := p1.n1 + 1;
                return trim(v1);
        end;

end;
/

set serveroutput off
alter session set statistics_level = all;

exec p1.n1 := 0

select
        count(*)
from
    tab1
inner join tab2 on
    tab1.owner = tab2.owner
and     p1.f1(tab1.object_name) > p1.f1(tab2.object_name)
-- and  p1.f1(tab1.object_name) > trim(tab2.object_name)
-- and  trim(tab1.object_name)  > p1.f1(tab2.object_name)
;

select * from table(dbms_xplan.display_cursor(format=>'projection allstats last'));

set serveroutput on
execute dbms_output.put_line(p1.n1);

I’ve created a package with a public variable n1 so that I can set it and read it from “outside”, then I’ve created (and lied about) a function that increments that variable and returns the trim() of its input, claiming that it’s deterministic. Once I’ve got the package in place I’ve:

  • set the variable to zero
  • run a query that does one of
    • use my function twice
    • use my function once – on the build table
    • use my function once – on the probe table
  • report the execution plan with stats
  • print the value of the variable

The timings are not really important, but here’s the execution plan when I used the function on both sides of the inequality:

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:21.15 |    1513 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:21.15 |    1513 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |  23007 |    329K|00:00:21.13 |    1513 |  1335K|  1335K|  860K (0)|
|   3 |    TABLE ACCESS FULL| TAB1 |      1 |    200 |    200 |00:00:00.01 |       5 |       |       |          |
|   4 |    TABLE ACCESS FULL| TAB2 |      1 |  46014 |  46014 |00:00:00.02 |     937 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("TAB1"."OWNER"="TAB2"."OWNER")
       filter("P1"."F1"("TAB1"."OBJECT_NAME")>"P1"."F1"("TAB2"."OBJECT_NAME"))

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=0) COUNT(*)[22]
   2 - (#keys=1)
   3 - (rowset=256) "TAB1"."OWNER"[VARCHAR2,128], "TAB1"."OBJECT_NAME"[VARCHAR2,128]
   4 - "TAB2"."OWNER"[VARCHAR2,128], "TAB2"."OBJECT_NAME"[VARCHAR2,128]

Apart from the change of function name the plan is the same – although it now takes over 21 CPU seconds to complete, of which most of the time is probably spent building and tearing down the PL/SQL stack. The important figure, though, is the number of function calls I saw recorded in p1.n1: it was a little over 10 million calls to generate the 329 thousand rows (A-Rows for the hash join).

When I ran the code with only one call to my deterministic function it was called 5 million times regardless of whether it was used for the build or probe table. Oracle did nothing to minimise the number of times the function was called.

Predictive Mode

Near the start of this note I showed you a little query to aggregate the rows of the two tables by owner; with a little enhancement I can reuse that code to show you exactly how many times the deterministic function was called:

select
        v1.owner, ct1, ct2, ct1 * ct2, sum(ct1 * ct2) over() tot_ct
from
        (select owner, count(object_name) ct1 from tab1 group by owner) v1,
        (select owner, count(object_name) ct2 from tab2 group by owner) v2
where
        v2.owner = v1.owner
/

OWNER                  CT1        CT2    CT1*CT2     TOT_CT
--------------- ---------- ---------- ---------- ----------
SYS                    128      40104    5133312    5160466
SYSTEM                  65        417      27105    5160466
OUTLN                    7          7         49    5160466

3 rows selected.

The number of comparisons done by the filter() predicate is 5,160,466: double it to get the number of function calls. For every single one of the 40,104 SYS rows in tab2 the function was called for every single one of the SYS rows in tab1, for both sides of the inequality.

It’s a shame that Oracle doesn’t calculate and project the “virtual columns” that will be used in the join predicates, because in my case that would have reduced the number of calls from 10 million to 40,232 – a factor of roughly 250. That would probably be worth a lot of CPU to Stefan.

Damage Limitation

For my silly little query that went from 0.39 seconds to 2.70 seconds you might decide there’s no point in trying to improve things – in fact many of the sites I’ve visited probably wouldn’t even notice the CPU wastage (on one call); but when the query runs for 2,970 seconds and a little fiddling around shows that it could run in 1,100 seconds you might be inclined to see if there’s something you could do to improve things.

Andrew Sayer suggested the possibility of rewriting the query with a pair of CTEs (“with” subqueries) that were forced to materialize the trim() in the CTE. The cost of physically creating the two large GTTs might well be much less than the CPU spent on the trim()ed join.
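
I don’t know exactly what Andrew had in mind, but a rough sketch against my test tables might look like the following (the CTE names are my own invention; the materialize hint forces each CTE to be instantiated once as a temporary table, so each trim() is evaluated once per row rather than once per comparison):

with
t1_trimmed as (
        select  /*+ materialize */
                owner, trim(object_name) trimmed_name
        from    tab1
),
t2_trimmed as (
        select  /*+ materialize */
                owner, trim(object_name) trimmed_name
        from    tab2
)
select
        count(*)
from
        t1_trimmed tab1
inner join
        t2_trimmed tab2
on
        tab1.owner = tab2.owner
and     tab1.trimmed_name > tab2.trimmed_name
;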

Alternatively – and dependent on the ownership and quality of the application – you could write a check constraint on each table to ensure that the column value was always equal to the trim() of the column value.
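
As a sketch against my test tables (constraint names invented, and using object_name to stand in for col3) – this assumes you can clean up any existing un-trimmed data first, and the application would still have to lose the trim() calls from the query:

alter table tab1 add constraint tab1_objname_trimmed
        check (object_name = trim(object_name));        -- hypothetical name; existing data must already be trimmed

alter table tab2 add constraint tab2_objname_trimmed
        check (object_name = trim(object_name));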

A similar option would be to add an (invisible) column to each table and use a trigger to populate the column with the trimmed value and then use the trimmed column in the query.
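
Again as a sketch against my test tables (column and trigger names invented, and existing rows would need a one-off update to populate the new column):

alter table tab1 add (object_name_trimmed varchar2(128) invisible);

create or replace trigger tab1_bi_trim
before insert or update of object_name on tab1
for each row
begin
        -- keep the invisible column in step with the base column
        :new.object_name_trimmed := trim(:new.object_name);
end;
/

-- repeat for tab2, then the join predicate becomes:
--      and tab1.object_name_trimmed > tab2.object_name_trimmed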

Conclusion

I don’t think that anything I’ve done or described in this note could be called rocket science (or telescope science as, perhaps, it should be in honour of Webb); but it has shown how much insight you can gain into what Oracle is doing and how you may be able to pin-point excess work using a few simple mechanisms that have been around for more than 15 years.

July 21, 2022

Pagination cost

Filed under: CBO,Execution plans,Oracle,Problem Solving — Jonathan Lewis @ 3:56 pm BST Jul 21,2022

There’s a thread on the MOSC database tuning forum (needs an account) at the moment asking why a “fetch first N” query to fetch the next 41 rows with an offset of 8602 rows takes longer to run than the same query when the offset is zero rows. Here’s a possible answer with a little lesson in thinking about what’s going on.

Apart from gremlins in the system there are two possible reasons:

  • nothing has changed, but it takes longer to fetch 8643 rows in order and discard 8602 of them than it takes to fetch 41 rows in order and discard none
  • the optimizer has worked out that if it has to fetch 8643 rows then it ought to use a different plan but (as often happens) it was a bad idea to change the plan.

Here’s a little script to build some demo data.

rem
rem     Script:         fetch_first_offset.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2022
rem
rem     Last tested 
rem             21.3.0.0
rem             19.11.0.0
rem

create table t1 
as
select 
        * 
from 
        all_objects
where   rownum <= 50000
order by 
        dbms_random.value
/

create index t1_i1 on t1(object_name);

alter session set statistics_level = all;
set serveroutput off
set feedback off

column owner format a20
column object_type format a12
column object_name format a32

All I’ve done is create a table of 50,000 rows with an order by clause that maximises the randomness of the data pattern so that the index on object_name will have a very high clustering_factor.
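
If you want to confirm the effect, a quick check against user_indexes (the create index call will normally have computed the index statistics for you) should show a clustering_factor close to the number of rows in the table rather than the number of blocks:

select  index_name, num_rows, leaf_blocks, clustering_factor
from    user_indexes
where   table_name = 'T1'
/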

Here’s the first query I’m going to run, followed by the execution plan, pulled from memory with rowsource execution stats enabled. I’ve queried for the first 20 rows (offset 0 next 20) ordered by object_name:

select
        owner, object_type, object_name
from
        t1
order by
        object_name
offset 
        0 rows
fetch next 
        20 rows only
/

select * from table(dbms_xplan.display_cursor(format=>'+cost allstats last'));


SQL_ID  fmdb8vuxwkp99, child number 0
-------------------------------------
select  owner, object_type, object_name from  t1 order by  object_name
offset  0 rows fetch next  20 rows only

Plan hash value: 3254925009

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |    23 (100)|     20 |00:00:00.01 |      25 |      2 |
|*  1 |  VIEW                         |       |      1 |     20 |    23   (0)|     20 |00:00:00.01 |      25 |      2 |
|*  2 |   WINDOW NOSORT STOPKEY       |       |      1 |     20 |    23   (0)|     20 |00:00:00.01 |      25 |      2 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1    |      1 |  50000 |    23   (0)|     20 |00:00:00.01 |      25 |      2 |
|   4 |     INDEX FULL SCAN           | T1_I1 |      1 |     20 |     3   (0)|     20 |00:00:00.01 |       5 |      2 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(("from$_subquery$_002"."rowlimit_$$_rownumber"<=20 AND
              "from$_subquery$_002"."rowlimit_$$_rownumber">0))
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "OBJECT_NAME")<=20)

As you can see from the execution plan Oracle has used an index full scan (because that will access the data in exactly the right order and allows the “nosort stopkey” on the “window (no)sort” operation). It has fetched (A-Rows) 20 rows and reported a cost of 23 – which basically corresponds to 3 block visits for the index and one block visit for each row from the table. In passing you’ll notice from the Predicate Information at operation 2 that Oracle has transformed our “fetch first” into an analytic query using row_number() over(). The phrase “syntactic sugar” seems appropriate.
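
If you want to see roughly what that transformation looks like, a hand-written equivalent would be something like the following – just a sketch; internally Oracle uses generated names such as “rowlimit_$$_rownumber” (visible in the Predicate Information above):

select  owner, object_type, object_name
from    (
        select
                owner, object_type, object_name,
                row_number() over (order by object_name) rn
        from    t1
        )
where
        rn <= 20
and     rn >  0
order by
        object_name
/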

How do things change if we ask for the 2nd 20 rows (offset 20, next 20)? I’ll show only the output from dbms_xplan, including its slightly mangled SQL statement but dropping the Predicate Information:

select  owner, object_type, object_name from  t1 order by  object_name
offset  20 rows fetch next  20 rows only

Plan hash value: 3254925009

--------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |    43 (100)|     20 |00:00:00.01 |      45 |
|*  1 |  VIEW                         |       |      1 |     40 |    43   (0)|     20 |00:00:00.01 |      45 |
|*  2 |   WINDOW NOSORT STOPKEY       |       |      1 |     40 |    43   (0)|     40 |00:00:00.01 |      45 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1    |      1 |  49999 |    43   (0)|     40 |00:00:00.01 |      45 |
|   4 |     INDEX FULL SCAN           | T1_I1 |      1 |     40 |     3   (0)|     40 |00:00:00.01 |       5 |
--------------------------------------------------------------------------------------------------------------

As you can see, the optimizer has still decided to use the index full scan, and this time has fetched 40 rows and passed them up the plan until at operation 1 it discards the first 20 rows. The cost (43) is again related to 3 blocks for the index, 40 blocks for 40 rows from the (randomly distributed) table.

What would we see if we added a /*+ full(t1) */ hint to the query to force a tablescan to get the 2nd 20 rows?

select /*+ full(t1) */  owner, object_type, object_name from  t1 order
by  object_name offset  20 rows fetch next  20 rows only

Plan hash value: 2433988517

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |      |      1 |        |   947 (100)|     20 |00:00:00.01 |     996 |       |       |          |
|*  1 |  VIEW                    |      |      1 |     40 |   947   (1)|     20 |00:00:00.01 |     996 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK|      |      1 |  50000 |   947   (1)|     40 |00:00:00.01 |     996 | 15360 | 15360 |14336  (0)|
|   3 |    TABLE ACCESS FULL     | T1   |      1 |  50000 |   278   (1)|  50000 |00:00:00.01 |     996 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------

Oracle has obeyed the hint, and the tablescan has fetched all 50,000 rows from the table and sorted them. Fortunately the optimizer knows that it needs only the top 40 rows so it has been discarding rows as it sorts, hence the appearance of the “pushed rank” in the “window sort” at operation 2; we haven’t had to create a sorted list of all 50,000 rows before picking the top 40. Again, once we’ve got the top 40 we discard the top 20 to allow for the offset.

We note that the cost of the tablescan was 278 but the cost of the sort was really rather large, taking the total cost of this path to 947. So here’s a thought experiment – what’s likely to happen if we ask for an offset of 940 and next 20? Given the costs we’ve seen for the indexed access path the optimizer will calculate a cost of 3 (plus a bit, maybe) for the index and a cost of 960 for visiting the table giving a total cost of about 963 – which should make the tablescan strategy the lower cost.

select  owner, object_type, object_name from  t1 order by  object_name
offset  940 rows fetch next  20 rows only

Plan hash value: 2433988517

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |      |      1 |        |   947 (100)|     20 |00:00:00.02 |     996 |       |       |          |
|*  1 |  VIEW                    |      |      1 |    960 |   947   (1)|     20 |00:00:00.02 |     996 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK|      |      1 |  50000 |   947   (1)|    960 |00:00:00.02 |     996 |   267K|   267K|  237K (0)|
|   3 |    TABLE ACCESS FULL     | T1   |      1 |  50000 |   278   (1)|  50000 |00:00:00.01 |     996 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------

My guesswork about the cost seems to have been nearly right. Unhinted, with an offset of 940 (which you can see as the 960 rows fetched) the optimizer has decided that the tablescan path has a lower cost than the indexed access.

Of course we ought to check this by hinting the indexed access path and seeing what its cost is:

select  /*+ index(t1) */  owner, object_type, object_name from  t1
order by  object_name offset  940 rows fetch next  20 rows only

Plan hash value: 3254925009

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |   969 (100)|     20 |00:00:00.01 |     968 |      4 |
|*  1 |  VIEW                         |       |      1 |    960 |   969   (1)|     20 |00:00:00.01 |     968 |      4 |
|*  2 |   WINDOW NOSORT STOPKEY       |       |      1 |    960 |   969   (1)|    960 |00:00:00.01 |     968 |      4 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1    |      1 |    960 |   969   (1)|    960 |00:00:00.01 |     968 |      4 |
|   4 |     INDEX FULL SCAN           | T1_I1 |      1 |  50000 |     9   (0)|    960 |00:00:00.01 |       9 |      4 |
-----------------------------------------------------------------------------------------------------------------------

The cost of the indexed access path is 969 – that’s 960 for the randomly scattered table rows we need plus 9 for the index scan (because at 960 index entries we’re going to visit a few more index leaf blocks than the original 3).

Summing Up

I’ve demonstrated with a simple query using “offset N rows fetch M rows only” that the optimizer will calculate the cost of fetching “N + M” rows using whatever paths are available, then pick the lowest cost path.

As you might expect, the presence of a suitable index will encourage the optimizer to walk the index in order jumping randomly around the table to avoid the cost of acquiring all the relevant data and sorting it. So for “small” values of “offset + next” Oracle might choose an indexed access path with “window nosort stopkey”, but for “large” values of “offset + next” it might choose to do a full tablescan with “window sort pushed rank”.

The consequence of this is that – in the same way we see the optimizer switching between nested loops and hash joins at the wrong moment – we may see the optimizer switch from an indexed access path to a full tablescan either too soon, or too late.

Answering the question

Why did the query with an offset of 8602 take so much longer than the query with an offset of zero when the next was only 41 rows?

It may be that the optimizer stuck with an indexed access path and had to do physical reads of 8,643 blocks when it should have switched to a tablescan.

It may be that the optimizer switched to a tablescan and sort when it should have stuck with using an index on well-clustered, well-cached, data.

As so often happens, the first step to answering an SQL performance question is to look at the actual execution plans.

July 20, 2022

LOB space

Filed under: Infrastructure,LOBs,Oracle,Problem Solving — Jonathan Lewis @ 5:07 pm BST Jul 20,2022

Erratum: (With thanks to Bobby Durrett for highlighting the error – see comment 3 below): I’ve stated in this note that CLOBs are stored internally using a fixed-length two-byte characterset. This is only true if the “external” characterset is a variable length multi-byte characterset (such as AL32UTF8). It is not true of fixed-width charactersets (which, implicitly, covers all the single-byte character sets such as WE8ISO8859P1). See MOS document 257772.1 for further comments, and the Database Globalization Support Guide 19c Appendix A for a list identifying fixed and variable width character sets.

This note is still subject to checking and correction regarding the break point between in-row and out-of-row.

There’s a thread on the Oracle developer forum at present asking why calls to dbms_lob.getlength() and calls to dbms_space.space_usage() produce such different results for the storage used by a LOB column.

It’s a really good question to demonstrate two points. First that it’s hard to supply all the details that are needed when you ask a question; second that it’s hard to write well about even such a tiny topic. The source of the second problem answers the first problem – there’s a fairly large decision tree to consider:

  1. Is the LOB defined as enable or disable storage in row?
  2. Is it a BLOB or CLOB?
  3. Is compression enabled ?
  4. Is deduplication enabled ?
  5. Is it a basicfile or securefile LOB?
    1. Is there a chunk declaration (for basicfile LOBs)
    2. Is the LOB undo limit set by pctversion or retention
    3. Have multiple freepools been declared (for basicfile LOBs)

The answer to question 1 means the difference between a tiny LOB taking a few bytes in the table segment (enable) or a whole chunk in the LOB segment (disable).

The answer to question 2 is important for two points – (1) CLOBs are stored using a two-byte fixed width character set ([ed] if the “external” characterset is a variable-length multi-byte characterset), which means they may take much more space than you might be expecting, and (2) the getlength() function reports characters not bytes.

Questions 3 and 4 result in similar space estimation errors: getlength() will report the decompressed length for every (logical) copy of a LOB value, so summing it across the table could be over-reporting quite dramatically.

Questions 5.1 and 5.2 ought to start with a check of whether the LOB has been stored in a tablespace using (“legacy”) freelist management or (“new”) automatic segment space management because that affects whether or not it’s even possible to use securefiles, and it affects whether or not basicfiles can use the retention option.

Question 5.1: This applies only to basicfile LOBs because securefile LOBs ignore the chunk parameter. The dbms_space.space_usage() procedures have many “out” parameters with the word “blocks” in their names – but the code uses these parameters to report chunks, not blocks. So if you’ve set the LOB chunk size to something other than the default the result will need to be scaled up to get the block count (or you could just use the “bytes” values and divide by the LOB segment’s block size).

Question 5.2: This gets messy, because pctversion is silently ignored by securefiles but used by basicfiles. On the other hand retention can be used for securefiles or basicfiles (assuming automatic undo management is enabled), but its usage with basicfiles doesn’t match its usage with securefiles. Moreover, if you include both retention and pctversion in your table declaration Oracle raises “ORA-32600: RETENTION and PCTVERSION cannot be used together” for both basicfiles and securefiles (so Oracle is not quite ignoring pctversion for securefiles). (It may be worth mentioning that if you use export/import to upgrade your database you may find that this “spontaneously” changes a basicfile lob to a securefile lob.)

Question 5.3: freepools is another parameter that is silently ignored for securefiles but can have a significant impact on the way that basicfiles reuse space from old copies of LOB values hence on the amount of allocated but unused space.

Underneath all these, of course, is the question “which version of Oracle”, because that affects the default for the choice between securefile or basicfile (though no-one ought to be using a version of Oracle that still has basicfile as the default). There’s also a significant variation with version of when a blob goes “out of row”. [Ed Jan 2023: no, there isn’t]. And then there’s the question of which overloads of dbms_space.space_usage() are available and which is the correct one to be used.

I think you can appreciate from the above how easy it might be to answer a “simple” question like “why do these two numbers differ” if the question included all the relevant details, but very hard to give a useful answer if the question simply quoted the results of unspecified calls to a couple of procedures or queries.

Answering the question

This user is running 11.2.0.4, has a basicfile BLOB, stored in an ASSM tablespace, no compression, not deduplicated, with the default 8KB chunk size, and storage in row enabled. So:

  • it’s a BLOB so getlength() will report the byte count and we don’t have to cater for 2-byte characters
  • it’s set to enable storage in row, so we have to ignore BLOBs with lengths up to 3,898 bytes (for 11.2.0.4 – it would be 3,964 for basicfile blobs in 19c.)
  • it’s basicfile so will store 8,132 bytes per lob segment block. (securefile would use 8,060 – these figures don’t seem to have changed between 11g and 19c).
  • it’s stored under ASSM so there will be an overhead of just under 1% for segment space metadata when the segment is very large (one bitmap block out of each 128).
  • we can infer that the chunk size is 8KB because we have a report that shows “Full blocks” x 8KB = “Full bytes” – so we could be “losing” an average of 4KB per LOB. (Note, all blocks in a LOB segment are Full or Unformatted, and the metadata blocks aren’t reported by the dbms_space.space_usage() procedures)
  • we might guess that the LOB is using the default pctversion 10, but if it’s using the retention then the undo_retention time can make a huge difference to the amount of space holding “deleted but retained” lobs.

So, for the OP, a first stab at the code to estimate the expected number of blocks in the LOB segment would be:

with blob_lengths as (
        select  dbms_lob.getlength(b1) blob_len
        from    t1
),
blobs_ool as (
        select  blob_len
        from    blob_lengths
        where   blob_len > 3964  -- Corrected Jan 2023 (11g boundary)
)
select
        sum(ceil(blob_len/8132)) blob_blocks
from
        blobs_ool
/

You’ll notice I’ve deliberately used the label “blob” everywhere in the hope that this will be a reminder that the code won’t apply to CLOB columns with their 2-byte fixed-width character set.

Starting with this figure we then need to consider the “undo” for LOBs so (assuming the OP has created a basicfile with the default pctversion (i.e. 10) allowing 10% of the total space to be old copies of the BLOB values) the number of blocks has to be multiplied by 10/9; then, since the metadata for the segment will take about 1% of the segment’s blocks, multiply by 101/100. So scale the query result up by 101/90 to get the “expected” storage requirement.
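
Putting those scaling factors into the SQL – still only a rough estimate, resting on the same assumptions (default pctversion 10, roughly 1% metadata, and the same in-row boundary as before):

with blob_lengths as (
        select  dbms_lob.getlength(b1) blob_len
        from    t1
),
blobs_ool as (
        select  blob_len
        from    blob_lengths
        where   blob_len > 3964                         -- same in-row boundary as above
)
select
        sum(ceil(blob_len/8132))                        blob_blocks,
        round(sum(ceil(blob_len/8132)) * 101 / 90)      est_segment_blocks
from
        blobs_ool
/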

To compare this estimate with results from dbms_space.space_usage() we need to pick the right overload of the procedure; two of the overloads are for securefile lob segments only, the third (initially the only option, and the one we want now) is the generic version for ASSM segments.

I’ve published a sample of one of the securefile options here, and the generic ASSM code that we need is part of the script here. Our estimate (excluding the 1% metadata) should be a reasonable match for the “Full blocks” reported by this procedure call.
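
For reference, a call to the generic ASSM overload looks something like the following sketch – the segment name is whatever user_lobs/dba_lobs reports for the LOB column, and remember that for a LOB segment the “blocks” outputs actually report chunks:

declare
        l_unf_blocks    number;         l_unf_bytes     number;
        l_fs1_blocks    number;         l_fs1_bytes     number;
        l_fs2_blocks    number;         l_fs2_bytes     number;
        l_fs3_blocks    number;         l_fs3_bytes     number;
        l_fs4_blocks    number;         l_fs4_bytes     number;
        l_full_blocks   number;         l_full_bytes    number;
begin
        dbms_space.space_usage(
                segment_owner      => user,
                segment_name       => 'SYS_LOB0000012345C00002$$',     -- hypothetical: take it from user_lobs.segment_name
                segment_type       => 'LOB',
                unformatted_blocks => l_unf_blocks, unformatted_bytes => l_unf_bytes,
                fs1_blocks         => l_fs1_blocks, fs1_bytes         => l_fs1_bytes,
                fs2_blocks         => l_fs2_blocks, fs2_bytes         => l_fs2_bytes,
                fs3_blocks         => l_fs3_blocks, fs3_bytes         => l_fs3_bytes,
                fs4_blocks         => l_fs4_blocks, fs4_bytes         => l_fs4_bytes,
                full_blocks        => l_full_blocks, full_bytes       => l_full_bytes
        );
        dbms_output.put_line('Full blocks (chunks): '        || l_full_blocks);
        dbms_output.put_line('Unformatted blocks (chunks): ' || l_unf_blocks);
end;
/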

Since this is a basicfile LOB (and especially since it’s on 11.2.0.4) we may find a massive discrepancy due to some serious defects in the internal code which can be triggered by the way the LOB is used – in which case it’s worth reading the series I wrote 5 years ago modelling a problem (mostly about time spent, in fact) a client had with basicfile LOB deletion.

Summary Points

You shouldn’t be using basicfile LOBs on any recent version of Oracle.

Check exactly when your type of LOB goes “out of row”, and remember the difference between Blobs and Clobs when it comes to using length() or getlength() on them.

When summing lengths remember to ignore LOBs that will (probably) be stored “in row”, and don’t forget that even a couple of bytes stored “out of row” require a full “chunk” (not “block”).

The “undo data” for a LOB value is simply the previous copy of the value, left in place. The total volume of “undo” can be limited by pctversion for basicfile lobs (but is silently ignored by securefiles). Both types of lob will obey retention but the only (implicit) option for basicfiles is “auto” which means “the same as undo_retention” and requires automatic undo management to be enabled. This setting for retention can result in a very large amount of old lob data being kept.

If you’re using basicfile LOBs (especially with older versions of Oracle) the mechanism for re-using expired LOB values has some defects that can result in catastrophic behaviour, introducing performance and space problems.

July 15, 2022

Index Wildcard

Filed under: CBO,Execution plans,Oracle,Problem Solving — Jonathan Lewis @ 11:51 am BST Jul 15,2022

This is a little case study on working around a problem caused by a 3rd-party application that couldn’t be changed quickly and easily. There’s a touch of irony in this case as the problem I’ll address is a side effect of the best possible workaround to a more generic design flaw in the application, moreover the side effect appears because of a special case optimisation mechanism introduced in 12.2.

A recent post on the MOS database forum (needs an account) came up with a slightly surprising problem – a query with a predicate “columnX like ‘%XYZ%’” would perform reasonably well if hinted to use a B-tree index on columnX (doing an index full scan) but performed badly when cursor_sharing was set to force because it switched to an index range scan which did a lot more work.

Modelling

Here’s a little script to build some data to demonstrate the issue:

rem
rem     Script:         index_wildcard.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2022
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0

create table t1
as
select  obj.*
from
        (select * from all_objects where rownum <= 50000) obj,
        (select rownum from dual connect by level <= 5) gen
/

create index t1_id on t1(object_name);

alter session set cursor_sharing = force;
alter session set statistics_level = all;
alter session set events '10053 trace name context forever';

set serveroutput off

All I’ve done is create a simple heap table with 250,000 rows and an index (which happens to be on a column declared as NOT NULL, but that detail is not important). Here’s my query:

select  /*+ index(t1(object_name)) cursor_sharing_exact */
        t1.*
from
        t1 t1
where
        object_name like '%XYZ%'
/

select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

In normal circumstances the leading wildcard would mean that the optimizer would not be able to use an index driven plan (unless the query satisfied the requirements to allow an index fast full scan), so the default plan would have been a tablescan with a selectivity of 5% (estimated rows = 12,500).

Because I’ve hinted the use of an index I get the following plan from a call to dbms_xplan.display_cursor():

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------
SQL_ID  1q1p6fyubhafu, child number 0
-------------------------------------
select /*+ index(t1(object_name)) cursor_sharing_exact */  t1.* from
t1 t1 where  object_name like '%XYZ%'

Plan hash value: 2798063786

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |      0 |00:00:00.08 |    1718 |   1717 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |  12500 |      0 |00:00:00.08 |    1718 |   1717 |
|*  2 |   INDEX FULL SCAN                   | T1_ID |      1 |  12500 |      0 |00:00:00.08 |    1718 |   1717 |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("OBJECT_NAME" LIKE '%XYZ%')

Because I’ve set cursor_sharing to force I’ve included the hint /*+ cursor_sharing_exact */ in the query to make sure that Oracle sees the query with a literal rather than a system generated bind variable. The plan tells us that the optimizer has picked an index full scan, and the Predicate Information for operation 2 tells us that Oracle has checked our predicate for every single entry in the index, and the A-rows column tells us that it didn’t find any matching index entries. So the total work done was 1718 buffer gets and 250,000 index entries tested. (You’ll notice that the E-rows is still the standard 5% guess.)

Now let’s try again, allowing enforced cursor sharing to be applied (i.e. taking out the cursor_sharing_exact hint):

SQL_ID  cm4tjwuqxgbqg, child number 0
-------------------------------------
select /*+ index(t1(object_name)) */  t1.* from  t1 t1 where
object_name like :"SYS_B_0"

Plan hash value: 539998951

-------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |      0 |00:00:00.26 |     241K|
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |  12500 |      0 |00:00:00.26 |     241K|
|*  2 |   INDEX RANGE SCAN                  | T1_ID |      1 |  12500 |    250K|00:00:00.04 |    1718 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("OBJECT_NAME" LIKE :SYS_B_0)
   2 - access("OBJECT_NAME" LIKE :SYS_B_0)

The SQL text shows us that the literal has been replaced by the bind variable :SYS_B_0. Operation 2 has changed from an index full scan to an index range scan – which “feels” as if “the range” should be the whole index and the query shouldn’t, therefore, do any more work than the index full scan plan. But the Buffers column now shows 241K buffer visits at operation 1, the table access! What’s gone wrong?

In the case of the OP the unexpected visit to the table blocks didn’t do just a few hundred thousand “unnecessary” buffer visits, it did 4 million buffer gets and 127,000 disk reads.

Look at the Predicate Information (always): the filter() that matched our original predicate is no longer being applied to the index, it’s only applied at the table. The only predicate applied to the index is an access() predicate – so Oracle takes the bind value and works out that the range scan should start at the beginning of the index and stop at the end of the index – that’s what an access() predicate does, it doesn’t check the actual values in the index entries once it has derived the start and stop locations.

Trouble-shooting

You will have noticed in the initial code to set up the data I had a call to set event 10053 (the CBO trace event) – a fairly quick and easy option to look at in this case since I’m looking at such a simple query. The critical details that showed up in the trace files were:

  • in both cases the optimizer reported “User hint to use this index” in the Base Statistical Information.
  • in the case of the “literal string” SQL the optimizer did not consider an index full scan
  • in the case of the “forced bind” SQL the optimizer did not consider an index range scan

This is worth knowing as it tells us that the switch between paths was not an “accident” of circumstances or statistics, it was a deliberate coding strategy; so we need to discover whether there is something we can do to change the strategy.

Time to check the hidden parameters and fix controls, and this is what I found in a listing of the 19.11.0.0 system fix controls (listed as appearing in 12.2.x.x):

     BUGNO OPTIMIZE SQL_FEATURE                        DESCRIPTION

---------- -------- ---------------------------------- ---------------------------------------------------------------- 
  20289688 12.2.0.1 QKSFM_ACCESS_PATH_20289688         check for leading wildcard in LIKE with bind 
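
For reference, a listing like that can be produced with a query along the following lines against v$system_fix_control (the filter on the description is just one convenient way of searching – you could equally search on the bug number):

select  bugno, optimizer_feature_enable, sql_feature, description
from    v$system_fix_control
where   lower(description) like '%wildcard%'
order by
        optimizer_feature_enable
/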

Doesn’t that description sound a little bit promising? Let’s turn the fix off by adding the hint /*+ opt_param('_fix_control' '20289688:0') */ to the problem query and see what we get:

SQL_ID  75cxtymdrh8r2, child number 0
-------------------------------------
select /*+   index(t1(object_name))   opt_param('_fix_control'
'20289688:0')  */  t1.* from  t1 t1 where  object_name like :"SYS_B_0"

Plan hash value: 539998951

-------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |      0 |00:00:00.08 |    1718 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |  12500 |      0 |00:00:00.08 |    1718 |
|*  2 |   INDEX RANGE SCAN                  | T1_ID |      1 |  12500 |      0 |00:00:00.08 |    1718 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OBJECT_NAME" LIKE :SYS_B_0)
       filter("OBJECT_NAME" LIKE :SYS_B_0)

We still have an index range scan (with a range that is obviously the whole index), but our predicate is now being applied as a filter() in the index with a total of only 1,718 buffer gets even though the literal has been replaced by a system-generated bind variable.

Summary

The application is accepting a search value from the user interface and constructing “literal string” SQL that adds a leading and trailing wildcard to the value. As a result of this the optimizer will not consider using a particular index unless hinted, and does a very expensive tablescan.

The optimizer will accept and use an index() hint to produce a plan that is much more efficient than the tablescan even though it has to use an index full scan and apply a filter predicate on every index entry.

Unfortunately cursor_sharing has been set to force to deal with the general parsing overheads of applications that construct “literal string” SQL, and when the index() hint is applied to a bind-variable version of this query the optimizer applies an index range scan and postpones the filter() until it has acquired rows from the table.

A check of the system fix controls identified a fix that looked as if it might have some relevance to our situation, “leading wild card, bind variable, LIKE clause”, and a test with an opt_param() hint to switch off that optimizer “fix” had the effect we wanted – the optimizer still reported an index range scan but it also used our initial predicate as a filter() predicate on the index.

Technically, since the fix_control was introduced in 12.2.0.1 we could have got the same effect by setting the optimizer_features_enable back to 12.1.0.2 – but that’s not a good idea when the production version is so much newer although it might be an acceptable strategy when implemented as a hint for a few special case queries.

After showing that the fix control works by hinting the next step should be to create an SQL Patch (using dbms_sqldiag.create_sql_patch) attached to the (forced) bind-variable version of the query.
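
As a sketch only – the sql_id is the one from the bind-variable child cursor above, the patch name and description are invented, and the exact parameter list of dbms_sqldiag.create_sql_patch has changed a little over versions, so check the PL/SQL Packages reference for your release:

declare
        v_patch_name    varchar2(128);
begin
        v_patch_name := dbms_sqldiag.create_sql_patch(
                sql_id      => 'cm4tjwuqxgbqg',
                hint_text   => q'[opt_param('_fix_control' '20289688:0')]',
                name        => 'fix_20289688_off',              -- hypothetical patch name
                description => 'disable fix 20289688 for leading-wildcard LIKE with bind'
        );
        dbms_output.put_line('Patch created: ' || v_patch_name);
end;
/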

Update (a few hours later)

It didn’t take long to find that the fix_control and the optimizer_features_enable were not the only workarounds. There’s a hidden parameter “_optim_peek_user_binds” with the description “enable peeking of user binds” and a default value of “true”, so if you set this to false the optimizer can’t notice that the critical bind value starts with a wildcard and behaves the way the OP wants. As with the other options you could set this as a system parameter or a session parameter, or as a hint (SQL patch etc.) /*+ opt_param('_optim_peek_user_binds' 'false') */ – the last being the kindest for the system.

Update (21st July)

Bug number 33500777 (unpublished) looks like an exact match for the problem; the description reads:

FILTER FOR PREDICATE WITH LIKE AND BIND WITH LEADING WILDCARD IS NOT ALLOCATED ON INDEX SCAN AFTER FIX 20289688

June 20, 2022

ORA-29857

Filed under: Oracle,Problem Solving,trace files,Troubleshooting — Jonathan Lewis @ 3:30 pm BST Jun 20,2022

This is another little “case study” on solving a problem. It’s based on a question on the MOSC DBA Forum (needs an account) about trying to drop a tablespace and seeing the error “ORA-29857: domain indexes and/or secondary objects exist in the tablespace” when (apparently) there are no domain indexes or secondary objects in the tablespace according to the queries that have been used to try and locate the problem objects.

Not being completely familiar with all the possibilities that might be relevant to that particular error, but having a vague memory of seeing an oddity once before, I thought I’d do a quick check on a couple of possible causes.

  • Case 1, which I assumed ought to raise the error because the tablespace held some objects that met the requirements for the error: I created a table in a tablespace called users and, for the domain indexes, a couple of Context (Oracle Text) indexes in a tablespace called test_8k_assm. Then I tried to drop the tablespace holding the context indexes. Here’s the relevant cut-n-paste (19.11.0.0):
SQL> drop tablespace test_8k_assm including contents and datafiles;
drop tablespace test_8k_assm including contents and datafiles
*
ERROR at line 1:
ORA-29857: domain indexes and/or secondary objects exist in the tablespace

  • Case 2, which I thought would drop the tablespace because the model didn’t meet the requirement of the error message: I swapped the locations of the table and its indexes, creating the table in the tablespace called test_8k_assm and the Context indexes in the tablespace called users then I tried to drop the tablespace holding only the table. Here’s the relevant cut-n-paste (19.11.0.0):
SQL> drop tablespace test_8k_assm including contents and datafiles;
drop tablespace test_8k_assm including contents and datafiles
*
ERROR at line 1:
ORA-29857: domain indexes and/or secondary objects exist in the tablespace

So, the error message is misleading and, possibly, this might explain why the queries the OP had posted weren’t reporting any guilty objects. Whatever the error message says, the domain indexes that you need to deal with may not be in the tablespace you want to drop.

I also have a question-mark over “secondary objects” – I don’t know what objects (other than the bits and pieces that make up domain indexes) are deemed to be secondary.

Investigating

It’s at this point that the purpose of the blog note starts to appear – I don’t know the answer to the OP’s question and I’ve discovered a gap in my knowledge so I’m willing to spend a little time extending my knowledge and trying to give a little help to the OP.

We can start with the view dba_secondary_objects (based on secobj$ from script dcore.bsq) which tells us that there are two possible sources of secondary objects – index types and statistics types:

SQL> select dbms_metadata.get_ddl('VIEW','DBA_SECONDARY_OBJECTS') view_text from dual;

-- with a little cosmetic tidying:

VIEW_TEXT
-------------------------------------------------------------------------------------------
CREATE OR REPLACE FORCE NONEDITIONABLE VIEW "SYS"."DBA_SECONDARY_OBJECTS" ("INDEX_OWNER", "INDEX_NAME", "SECONDARY_OBJECT_OWNER", "SECONDARY_OBJECT_NAME", "SECONDARY_OBJDATA_TYPE")
AS
select u.name, o.name, u1.name, o1.name, 
       decode(s.spare1, 0, 'FROM INDEXTYPE',1, 'FROM STATISTICS TYPE')
from   sys.user$ u, sys.obj$ o, sys.user$ u1, sys.obj$ o1, sys.secobj$ s
where  s.obj# = o.obj# and o.owner# = u.user# and
       s.secobj# = o1.obj#  and  o1.owner# = u1.user#;

With this view definition I’d be tempted to do a little drilling down, with queries that return increasing amounts of data, and then pick subsets to expand to higher precision, e.g:

select  secondary_objdata_type, count(*)
from    dba_secondary_objects
group by
        secondary_objdata_type
/

Are there any secondary objects that aren’t domain indexes – if there are a few then query the view for just those to find out what they are and who owns them.

select  *
from    dba_secondary_objects
where   secondary_objdata_type = 'FROM STATISTICS TYPE'
/

In fact if the total number of secondary objects was small I might report the details about all of them and try to track them down, or if there was only one schema using that tablespace I might restrict my queries to the subset of secondary objects owned by that schema:

select  distinct index_owner, index_name
from    dba_secondary_objects
where   secondary_object_owner = 'TEST_USER'
;

The OP had a rather larger task than my little model – there were 14,499 segments in the tablespace, and a query for (segment_owner, segment_type, count(*)) returned 332 rows – so the OP had taken a similar approach to mine to try and minimise the amount of work needed, starting with the following queries (which I’ve enhanced slightly and given a cosmetic makeover):

prompt  ==================================
prompt  Domain indexes (in the tablespace)
prompt  ==================================

select  owner, index_name, table_owner, table_name, tablespace_name
from    dba_indexes 
where   index_type = 'DOMAIN'
-- and  tablespace_name = 'TEST_8K_ASSM'        -- error: domain indexes don't have tablespaces
;

prompt  ==========================================
prompt  Domain indexes on tables in the tablespace
prompt  ==========================================

select  t.table_name, t.tablespace_name, i.index_name, i.tablespace_name
from    dba_indexes i, dba_tables t
where   t.table_name = i.table_name
and     t.owner = i.owner
and     i.index_type = 'DOMAIN'
and     t.tablespace_name = 'TEST_8K_ASSM'
;

prompt  ===================================================
prompt  Secondary objects that are tables in the tablespace
prompt  ===================================================

select  index_owner, index_name, secondary_object_owner, secondary_object_name, secondary_objdata_type
from    dba_secondary_objects
where   secondary_object_name in (
                select table_name 
                from dba_tables 
                where tablespace_name = 'TEST_8K_ASSM'
        )
;

prompt  ======================================================
prompt  Indexes in the tablespace that are secondary objects ?
prompt  ======================================================

select  index_owner, index_name, secondary_object_owner, secondary_object_name, secondary_objdata_type
from    dba_secondary_objects
where   index_name in (
                select index_name 
                from dba_indexes 
                where tablespace_name = 'TEST_8K_ASSM'
        )
;

The first query had an error – domain indexes do not report a tablespace_name in dba_indexes, so with the tablespace predicate in place it would inevitably return no rows (which is why I’ve commented that line out above).

The second query would report any domain indexes on tables that were in the given tablespace – which would be useful for the simple case where everything was created in the user’s default tablespace.

The third query reports any tables in the given tablespace that are the physical tables instantiating the “logical” domain index – which is particularly important for my example of the indexes being in the given tablespace when the table itself was in some other tablespace.

The final query is trying to report domain indexes that have corresponding physical objects in the given tablespace – but this isn’t going to help (at least with my Context indexes) because the index_name in dba_secondary_objects will be a name that appears in dba_indexes with index_type = ‘DOMAIN’, and such an index won’t have a value stored in the tablespace_name column.

Here’s the output I get (preceded by a full report of the view dba_secondary_objects in my otherwise empty little database) when I had the table in test_8k_assm and the indexes in tablespace users.

SQL> select * from dba_secondary_objects;

INDEX_OWNER     INDEX_NAME           SECONDARY_OBJECT_OWNER           SECONDARY_OBJECT_NAME            SECONDARY_OBJDATA_TY
--------------- -------------------- -------------------------------- -------------------------------- --------------------
TEST_USER       SH_IX_NAME           TEST_USER                        DR$SH_IX_NAME$I                  FROM INDEXTYPE
TEST_USER       SH_IX_NAME           TEST_USER                        DR$SH_IX_NAME$K                  FROM INDEXTYPE
TEST_USER       SH_IX_NAME           TEST_USER                        DR$SH_IX_NAME$N                  FROM INDEXTYPE
TEST_USER       SH_IX_NAME           TEST_USER                        DR$SH_IX_NAME$U                  FROM INDEXTYPE
TEST_USER       SH_IX_HANDLE         TEST_USER                        DR$SH_IX_HANDLE$I                FROM INDEXTYPE
TEST_USER       SH_IX_HANDLE         TEST_USER                        DR$SH_IX_HANDLE$K                FROM INDEXTYPE
TEST_USER       SH_IX_HANDLE         TEST_USER                        DR$SH_IX_HANDLE$N                FROM INDEXTYPE
TEST_USER       SH_IX_HANDLE         TEST_USER                        DR$SH_IX_HANDLE$U                FROM INDEXTYPE

8 rows selected.

==================================
Domain indexes (in the tablespace)
==================================

OWNER           INDEX_NAME           TABLE_OWNER     TABLE_NAME                TABLESPACE_NAME
--------------- -------------------- --------------- ------------------------- ------------------------------
TEST_USER       SH_IX_NAME           TEST_USER       SCRIPT_HANDLES
TEST_USER       SH_IX_HANDLE         TEST_USER       SCRIPT_HANDLES

2 rows selected.

==========================================
Domain indexes on tables in the tablespace
==========================================

TABLE_NAME                TABLESPACE_NAME                INDEX_NAME           TABLESPACE_NAME
------------------------- ------------------------------ -------------------- ------------------------------
SCRIPT_HANDLES            TEST_8K_ASSM                   SH_IX_NAME
SCRIPT_HANDLES            TEST_8K_ASSM                   SH_IX_HANDLE

2 rows selected.

===================================================
Secondary objects that are tables in the tablespace
===================================================

no rows selected

=====================================================
Indexes in the tablespace that have secondary objects
=====================================================

no rows selected

As you can see I have two Context indexes, each represented by 4 secondary tables (and that’s not the complete story because those tables have their own indexes and a couple of them have LOB columns, and one of the secondary tables is actually an index-organized table).
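
A quick way to see the whole collection of physical pieces – assuming, as in my little demo, that the schema owns nothing else – is simply to list its segments (a sketch, not part of the original investigation):

select  segment_name, segment_type, tablespace_name
from    dba_segments
where   owner = 'TEST_USER'
order by
        segment_type, segment_name
;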

And this is the result when the base table is in tablespace users and the indexes are in test_8k_assm:

==================================
Domain indexes (in the tablespace)
==================================

OWNER           INDEX_NAME           TABLE_OWNER     TABLE_NAME                TABLESPACE_NAME
--------------- -------------------- --------------- ------------------------- ------------------------------
TEST_USER       SH_IX_NAME           TEST_USER       SCRIPT_HANDLES
TEST_USER       SH_IX_HANDLE         TEST_USER       SCRIPT_HANDLES

2 rows selected

==========================================
Domain indexes on tables in the tablespace
==========================================

no rows selected

===================================================
Secondary objects that are tables in the tablespace
===================================================

INDEX_OWNER     INDEX_NAME           SECONDARY_OBJECT_OWNER           SECONDARY_OBJECT_NAME            SECONDARY_OBJDATA_TY
--------------- -------------------- -------------------------------- -------------------------------- --------------------
TEST_USER       SH_IX_NAME           TEST_USER                        DR$SH_IX_NAME$I                  FROM INDEXTYPE
TEST_USER       SH_IX_NAME           TEST_USER                        DR$SH_IX_NAME$K                  FROM INDEXTYPE
TEST_USER       SH_IX_NAME           TEST_USER                        DR$SH_IX_NAME$U                  FROM INDEXTYPE
TEST_USER       SH_IX_HANDLE         TEST_USER                        DR$SH_IX_HANDLE$I                FROM INDEXTYPE
TEST_USER       SH_IX_HANDLE         TEST_USER                        DR$SH_IX_HANDLE$K                FROM INDEXTYPE
TEST_USER       SH_IX_HANDLE         TEST_USER                        DR$SH_IX_HANDLE$U                FROM INDEXTYPE

6 rows selected

=====================================================
Indexes in the tablespace that have secondary objects
=====================================================

no rows selected

So one of the middle two queries ought to return some rows if there are any Context indexes in the tablespace or on tables in the tablespace. (And for the final option – i.e. the table and its Context indexes being in the tablespace – both queries will return rows). The OP got no rows returned on all 4 queries.

What next?

It looks like the OP must have some secondary objects that aren’t playing by the rules if they’re “from index types”, or there are some secondary objects that are “from statistics types” and don’t get picked up by the queries. There is a view dba_ustats for “user stats types” which has interesting column names, but at this point I decided to follow a strategy that I’ve used many times in the past when Oracle raises an unexpected error on a repeatable process: repeat the process with SQL trace enabled and see what Oracle was doing just before it raised the error.
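
As a sketch of that strategy (assuming you can reproduce the failure from your own session – I’ve used the traditional 10046 event at level 4 so that bind values are captured, but dbms_monitor.session_trace_enable(binds => true) would do just as well):

alter session set tracefile_identifier = 'drop_ts';
alter session set events '10046 trace name context forever, level 4';

-- this will fail with ORA-29857, leaving the recursive SQL in the trace file
drop tablespace test_8k_assm including contents and datafiles;

alter session set events '10046 trace name context off';

-- where to find the trace file to search
select value from v$diag_info where name = 'Default Trace File';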

Here’s the critical line from the trace file I produced, along with a couple of the preceding lines:

CLOSE #140708355472384:c=1860,e=1512,dep=1,type=0,tim=31701605233
EXEC #140708354845632:c=68018,e=68006,p=0,cr=1276,cu=0,mis=0,r=0,dep=0,og=1,plh=0,tim=31701605472
ERROR #140708354845632:err=29857 tim=31701605485

A search backwards up the trace file showed that cursor #140708354845632 was for the “drop tablespace” statement. Just above the ERROR and EXEC lines we see cursor #140708355472384 closing. It’s reasonably likely (though not 100% certain) that this is the cursor for the last SQL statement that Oracle executed before raising the error, so searching backwards for “PARSING IN CURSOR #140708355472384” was my next step (and if that statement didn’t look promising I’d have searched backwards from the ERROR line to find the cursor number from the most recent “EXEC” line). In this case (with my data) the search took me to the following (cosmetically cleaned) SQL:

select null from  (
             select o.obj# from obj$ o, tab$ t      where t.ts#=:1 and t.obj#=o.obj# and            (decode(bitand(o.flags, 16), 16, 1, 0)=1)    
union all    select o.obj# from obj$ o, tabpart$ tp      where tp.ts#=:1 and tp.obj#=o.obj# and            (decode(bitand(o.flags, 16), 16, 1, 0)=1)    
union all    select o.obj# from obj$ o, tabsubpart$ tsp      where tsp.ts#=:1 and tsp.obj#=o.obj# and            (decode(bitand(o.flags, 16), 16, 1, 0)=1)    
union all    select o.obj# from obj$ o, ind$ i      where i.ts#=:1 and i.obj#=o.obj# and            (decode(bitand(o.flags, 16), 16, 1, 0)=1)    
union all    select o.obj# from obj$ o, indpart$ ip      where ip.ts#=:1 and ip.obj#=o.obj# and            (decode(bitand(o.flags, 16), 16, 1, 0)=1)    
union all    select o.obj# from obj$ o, indsubpart$ isp      where isp.ts#=:1 and isp.obj#=o.obj# and            (decode(bitand(o.flags, 16), 16, 1, 0)=1)    
union all    select o.obj# from obj$ o, clu$ c      where c.ts#=:1 and c.obj#=o.obj# and            (decode(bitand(o.flags, 16), 16, 1, 0)=1)    
union all    select i.obj# from ind$ i, tab$ t, tabpart$ tp      where tp.ts#=:1 and tp.bo# = t.obj# and t.obj#=i.bo# and i.type#=9    
union all    select i.obj# from ind$ i, tab$ t, tabcompart$ tcp,tabsubpart$ tsp      where tsp.ts#=:1 and tsp.pobj#=tcp.obj# and tcp.bo#=t.obj# and            t.obj#=i.bo# and i.type#=9    
union all    select i.obj# from ind$ i, tab$ t      where t.ts#=:1 and t.obj#=i.bo# and i.type#=9
)

Every single line of the UNION ALL is addressing the same tablespace by number (:1), looking either for objects where the 0x10 bit (decimal 16) is set in the obj$.flags column (i.e. secondary objects) or for indexes where ind$.type# is 9 (“cooperative index method”, i.e. domain indexes).

That looks like a good bet for being the code that Oracle uses to check whether there are any objects that should block the drop, so I changed the bind variable to the relevant tablespace number, changed the “select null” to “select obj#”, and stuck the resulting query block into an IN subquery against dba_objects (this may not be the most efficient of queries).

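Finding the number to substitute for the bind variable is a one-liner against v$tablespace (6 happened to be the ts# of my test tablespace):

select ts# from v$tablespace where name = 'TEST_8K_ASSM';

With that value in place of the bind variable the query, and its results, looked like this:
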
select 
        secondary, owner, object_name, object_type, object_id
from    dba_objects
where   object_id in (
                select obj# from  (
                                  select o.obj# from obj$ o, tab$ t where t.ts#=6 and t.obj#=o.obj# and (decode(bitand(o.flags, 16), 16, 1, 0)=1)
                        union all ...
                        union all select i.obj# from ind$ i, tab$ t where t.ts#=6 and t.obj#=i.bo# and i.type#=9
                )
        )
order by secondary, owner, object_name
/

S OWNER           OBJECT_NAME                      OBJECT_TYPE              OBJECT_ID
- --------------- -------------------------------- ----------------------- ----------
N TEST_USER       SH_IX_HANDLE                     INDEX                       123956
N TEST_USER       SH_IX_NAME                       INDEX                       123944
Y TEST_USER       DR$SH_IX_HANDLE$I                TABLE                       123957
Y TEST_USER       DR$SH_IX_HANDLE$K                TABLE                       123960
Y TEST_USER       DR$SH_IX_HANDLE$KD               INDEX                       123966
Y TEST_USER       DR$SH_IX_HANDLE$KR               INDEX                       123967
Y TEST_USER       DR$SH_IX_HANDLE$U                TABLE                       123963
Y TEST_USER       DR$SH_IX_HANDLE$X                INDEX                       123965
Y TEST_USER       DR$SH_IX_NAME$I                  TABLE                       123945
Y TEST_USER       DR$SH_IX_NAME$K                  TABLE                       123948
Y TEST_USER       DR$SH_IX_NAME$KD                 INDEX                       123954
Y TEST_USER       DR$SH_IX_NAME$KR                 INDEX                       123955
Y TEST_USER       DR$SH_IX_NAME$U                  TABLE                       123951
Y TEST_USER       DR$SH_IX_NAME$X                  INDEX                       123953
Y TEST_USER       SYS_C0016528                     INDEX                       123952
Y TEST_USER       SYS_C0016536                     INDEX                       123964
Y TEST_USER       SYS_IL0000123945C00006$$         INDEX                       123947
Y TEST_USER       SYS_IL0000123957C00006$$         INDEX                       123959
Y TEST_USER       SYS_IOT_TOP_123949               INDEX                       123950
Y TEST_USER       SYS_IOT_TOP_123961               INDEX                       123962

20 rows selected.

This is the list I got when the base table and the Context indexes were all created in the tablespace I wanted to drop – notice that the “logical” indexes (the first two items, flagged as secondary=’N’) are reported along with all the physical objects relating to the context indexes.

When I created the indexes in the target tablespace but the base table in an alternative tablespace the query reported only the physical objects making up the Context indexes; when I created the base table in the target tablespace and the Context indexes in an alternative tablespace the query reported only the “logical” indexes.

Conclusion

As a last resort, when the simple queries you find in the manuals or on MOS (or elsewhere on the internet) don’t seem to supply all the answers, the strategy of finding and hacking the query that Oracle runs just before raising the error is a good generic method for getting a better handle on what’s causing the problem.

In this example of Oracle error ORA-29857 you can get a list of the objects (logical or physical or both) that are blocking a “drop tablespace” command, and that makes it easier to find out what else you need to drop before you can drop the tablespace.

Bear in mind that I’ve just demonstrated the approach while experimenting with a couple of Context indexes, where the information I needed was available through a couple of ordinary data dictionary views – I only dived down into the internals because the data dictionary views weren’t giving the OP a complete answer.

I don’t know what was in the OP’s database to make this problem happen (it may simply be that the database was version 11.2.0.4 while mine was 19.11.0.0 and there’s a bug that’s been fixed since 11g), so the OP may find that after they’ve cleared all the domain indexes indicated by this query they still see ORA-29857, and that there’s another query that Oracle uses to check for other conditions (maybe those “from statistics type” secondary objects) which also needs to be hijacked and subverted to complete the job. I know that when I traced a successful “drop tablespace” there were 13 more statements after the one I’ve shown that looked as if they might be checking for the (non-)existence of other types of object before the tablespace could be dropped.

Update

The OP has reported back that following this strategy they’ve chased the problem through (for their system) three further sets of objects, with the following error messages:

ORA-23515: materialized views and/or their indices exist in the tablespace

ORA-25183: index-organized table top index segment is in a different tablespace

ORA-02449: unique/primary keys in table referenced by foreign keys

June 17, 2022

Case Study

Filed under: Execution plans,Oracle,Parallel Execution,Performance,Problem Solving,Troubleshooting,Tuning — Jonathan Lewis @ 1:25 pm BST Jun 17,2022

The question “How do you trouble-shoot a performance problem” came up in an online session I did for the AIOUG (All-India OUG) today. It’s a very broad question and the only possible answers are either extremely generic, or very specific – so here’s a specific example that I talked about to give some idea of the processes I go through. It’s an example from the Oracle-l list server asking the (paraphrased) question:

I’ve got a parallel query that takes 24 seconds to return 2,500 rows for 2018. The same query for 2019 should return a similar amount of data but consumes a lot of TEMP space before failing; it takes 45 minutes to complete if I remove the parallel hint. The only difference between the two queries is a change to a single predicate: “opclf.year_number = to_number(‘YYYY’)” and the statistics are up to date – what’s going on?

The ease of diagnosing a problem is dependent on the ease of access to all the right information, and you don’t always know initially what the right information might be. In this case the OP had started by posting to github the SQL Monitor reports for the two queries, which were automatically generated since the queries were using parallel execution.

Here are the first few sections of the 2019 output:
SQL Text
------------------------------
SELECT /*+ PARALLEL(8) */ DISTINCT rd.document_id AS doc_id ,'firm_id' AS criteria_key ,opf.ultimate_parent_firm_id AS series_id ,period_number FROM ( SELECT /*+ PARALLEL(8) */ DISTINCT rd.document_id ,rd.client_role_id ,( CASE WHEN MONTHS_BETWEEN(TO_DATE('04/28/2022', 'MM/DD/YYYY'), TO_DATE('04/01/2017', 'MM/DD/YYYY')) > 12 THEN TRUNC(dc.date_value, 'YEAR') ELSE TRUNC(dc.date_value, 'MONTH') END ) period_number FROM REPORT_ENGINE.date_code dc ,REPORT_ENGINE.lit_fact_bd rd INNER JOIN
report_engine.firm FIRM ON rd.firm_id = FIRM.firm_id WHERE dc.date_value BETWEEN TO_DATE('04/01/2017', 'MM/DD/YYYY') AND TO_DATE('04/28/2022', 'MM/DD/YYYY') AND rd.publication_date_id = dc.date_id AND rd.year_number = to_number('2019') AND (FIRM.ultimate_parent_firm_id IN (to_number('252094'))) ) rd INNER JOIN report_engine.opposing_counsel op ON rd.client_role_id = op.legal_role_id INNER JOIN report_engine.lit_fact_bd opclf ON opclf.document_id = rd.document_id AND op.opposing_counsel_role_id
= opclf.client_role_id AND opclf.year_number = to_number('2019') INNER JOIN report_engine.firm opf ON opclf.firm_id = opf.firm_id AND opf.firm_id >= 1000 WHERE 1 = 1

Global Information
------------------------------
 Status              :  EXECUTING
 Instance ID         :  1
 Session             :  SYS (647:28741)
 SQL ID              :  3cjg20q2vw41f
 SQL Execution ID    :  16777216
 Execution Started   :  06/09/2022 05:08:24
 First Refresh Time  :  06/09/2022 05:08:25
 Last Refresh Time   :  06/09/2022 05:08:27
 Duration            :  221s
 Module/Action       :  sqlplus@c111dhw (TNS V1-V3)/-
 Service             :  SYS$USERS
 Program             :  sqlplus@c111dhw (TNS V1-V3)

Global Stats
====================================================================
| Elapsed |   Cpu   |    IO    |  Other   | Buffer | Write | Write |
| Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Reqs  | Bytes |
====================================================================
|     222 |     179 |       28 |       15 |    49M | 11624 |   2GB |
====================================================================

Parallel Execution Details (DOP=8 , Servers Allocated=16)
=====================================================================================================================================
|      Name      | Type  | Server# | Elapsed |   Cpu   |    IO    |  Other   | Buffer | Write | Write |         Wait Events         |
|                |       |         | Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Reqs  | Bytes |         (sample #)          |
=====================================================================================================================================
| PX Coordinator | QC    |         |    0.26 |    0.26 |          |     0.00 |  12144 |       |     . |                             |
| p000           | Set 1 |       1 |         |         |          |          |        |       |     . |                             |
| p001           | Set 1 |       2 |         |         |          |          |        |       |     . |                             |
| p002           | Set 1 |       3 |         |         |          |          |        |       |     . |                             |
| p003           | Set 1 |       4 |         |         |          |          |        |       |     . |                             |
| p004           | Set 1 |       5 |         |         |          |          |        |       |     . |                             |
| p005           | Set 1 |       6 |         |         |          |          |        |       |     . |                             |
| p006           | Set 1 |       7 |         |         |          |          |        |       |     . |                             |
| p007           | Set 1 |       8 |         |         |          |          |        |       |     . |                             |
| p008           | Set 2 |       1 |     220 |     177 |       28 |       15 |    48M | 11624 |   2GB | direct path write temp (28) |
| p009           | Set 2 |       2 |         |         |          |          |        |       |     . |                             |
| p010           | Set 2 |       3 |         |         |          |          |        |       |     . |                             |
| p011           | Set 2 |       4 |    1.71 |    1.70 |          |     0.01 |   595K |       |     . |                             |
| p012           | Set 2 |       5 |         |         |          |          |        |       |     . |                             |
| p013           | Set 2 |       6 |         |         |          |          |        |       |     . |                             |
| p014           | Set 2 |       7 |         |         |          |          |        |       |     . |                             |
| p015           | Set 2 |       8 |         |         |          |          |        |       |     . |                             |
=====================================================================================================================================

You’ll note that I haven’t got as far as the body of the execution plan yet; the line to look at is the one for server p008, in the middle of the summary of activity for the parallel execution processes. There are 8 servers in each of two sets (we’re running parallel 8) and p008 is the first server of the second set. The reason it stands out is that it’s the one server that’s doing almost all the work – effectively the query (at some point in the plan) is running serially.

So, a first thought: maybe we’ve just been unlucky and running parallel 7 (say) would result in a better distribution of data across parallel servers and allow each of 7 processes to do whatever they had to do to a much smaller amount of data. Maybe a change to the distribution method (pq_distribute() hint) would spread the workload more evenly. In either case the N smaller workloads might still use a lot of TEMP in total, but possibly no individual process would fail, and the job would complete nearly N times as quickly.
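
For anyone unfamiliar with the syntax, here’s a minimal, self-contained sketch of the two ideas – tables t1 and t2 and their columns are hypothetical, and hash/hash is just one of the possible distribution choices, not a recommendation:

-- experiment 1: change the degree of parallelism
select  /*+ parallel(7) */
        t1.id, t2.padding
from    t1, t2
where   t2.id = t1.id
;

-- experiment 2: keep parallel 8 but dictate the distribution for the join to t2
select  /*+ parallel(8) pq_distribute(t2 hash hash) */
        t1.id, t2.padding
from    t1, t2
where   t2.id = t1.id
;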

Note: the 2018 Monitor report showed an equivalent skew in the data distribution, but the amount of work needed was much smaller with a read/write load of only 143MB compared to 2GB for the 2019 report. The OP did try running parallel 7, but with no change to the overall effect.

Let’s take a quick glance at the plan body:
SQL Plan Monitoring Details (Plan Hash Value=1862006233)
=========================================================================================================================================================================================================================
| Id    |                     Operation                      |           Name            |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Write | Write | Mem | Temp | Activity |        Activity Detail       |
|       |                                                    |                           | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs  | Bytes |     |      |   (%)     |         (# samples)         |
=========================================================================================================================================================================================================================
|     0 | SELECT STATEMENT                                   |                           |         |      |           |        |     3 |          |       |       |     |      |   |                             |
|     1 |   PX COORDINATOR                                   |                           |         |      |           |        |     3 |          |       |       |     |      |   |                             |
|     2 |    PX SEND QC (RANDOM)                             | :TQ10003                  |     781 |  153 |           |        |       |          |       |       |     |      |   |                             |
|     3 |     HASH UNIQUE                                    |                           |     781 |  153 |           |        |       |          |       |       |     |      |   |                             |
|     4 |      PX RECEIVE                                    |                           |     781 |  151 |           |        |       |          |       |       |     |      |   |                             |
|     5 |       PX SEND HASH                                 | :TQ10002                  |     781 |  151 |           |        |     2 |          |       |       |     |      |   |                             |
|  -> 6 |        BUFFER SORT                                 |                           |     781 |  153 |       219 |     +3 |     2 |        0 | 11624 |   2GB | 61M |   2G |    26.21 | Cpu (26)                     |
|       |                                                    |                           |         |      |           |        |       |          |       |       |     |      |   | direct path write temp (28) |
|  -> 7 |         NESTED LOOPS                               |                           |     781 |  151 |       219 |     +3 |     2 |      39M |       |       |     |      |   |                             |
|  -> 8 |          NESTED LOOPS                              |                           |     781 |  151 |       219 |     +3 |     2 |      39M |       |       |     |      |     0.49 | Cpu (1)                      |
|  -> 9 |           NESTED LOOPS                             |                           |     777 |   44 |       219 |     +3 |     2 |      39M |       |       |     |      |   |                             |
| -> 10 |            NESTED LOOPS                            |                           |      41 |   26 |       219 |     +3 |     2 |     6463 |       |       |     |      |   |                             |
| -> 11 |             HASH JOIN                              |                           |      41 |   21 |       219 |     +3 |     2 |     6463 |       |       |  6M |      |   |                             |
|    12 |              BUFFER SORT                           |                           |         |      |         1 |     +3 |     2 |    36855 |       |       |     |      |   |                             |
|    13 |               PX RECEIVE                           |                           |      87 |   19 |         1 |     +3 |     2 |    36855 |       |       |     |      |   |                             |
|    14 |                PX SEND HASH                        | :TQ10000                  |      87 |   19 |         1 |     +3 |     1 |    38694 |       |       |     |      |   |                             |
|    15 |                 NESTED LOOPS                       |                           |      87 |   19 |         1 |     +3 |     1 |    38694 |       |       |     |      |   |                             |
|    16 |                  NESTED LOOPS                      |                           |      87 |   19 |         1 |     +3 |     1 |    38694 |       |       |     |      |   |                             |
|    17 |                   TABLE ACCESS BY INDEX ROWID      | FIRM                      |       1 |    2 |         1 |     +3 |     1 |       43 |       |       |     |      |   |                             |
|    18 |                    INDEX RANGE SCAN                | FIRM_ULT_PARENT_FIRM_IDX1 |       1 |    1 |         1 |     +3 |     1 |       43 |       |       |     |      |   |                             |
|    19 |                   PARTITION RANGE SINGLE           |                           |         |      |         1 |     +3 |    43 |    38694 |       |       |     |      |   |                             |
|    20 |                    BITMAP CONVERSION TO ROWIDS     |                           |         |      |         1 |     +3 |    43 |    38694 |       |       |     |      |   |                             |
|    21 |                     BITMAP INDEX SINGLE VALUE      | LIT_FACT_BD_IDX09         |         |      |         1 |     +3 |    43 |       49 |       |       |     |      |   |                             |
|    22 |                  TABLE ACCESS BY LOCAL INDEX ROWID | LIT_FACT_BD               |      63 |   19 |         3 |     +1 | 38694 |    38694 |       |       |     |      |     0.49 | Cpu (1)                      |
|    23 |              PX RECEIVE                            |                           |      20 |    2 |         1 |     +3 |     2 |        2 |       |       |     |      |   |                             |
|    24 |               PX SEND HASH                         | :TQ10001                  |      20 |    2 |           |        |       |          |       |       |     |      |   |                             |
|    25 |                PX BLOCK ITERATOR                   |                           |      20 |    2 |           |        |       |          |       |       |     |      |   |                             |
|    26 |                 TABLE ACCESS FULL                  | OPPOSING_COUNSEL          |      20 |    2 |           |        |       |          |       |       |     |      |   |                             |
| -> 27 |             TABLE ACCESS BY INDEX ROWID            | DATE_CODE                 |       1 |      |       219 |     +3 |  6465 |     6463 |       |       |     |      |   |                             |
| -> 28 |              INDEX UNIQUE SCAN                     | PK_DATE_CODE              |       1 |      |       219 |     +3 |  6465 |     6465 |       |       |     |      |   |                             |
| -> 29 |            PARTITION RANGE SINGLE                  |                           |      19 |      |       219 |     +3 |  6465 |      39M |       |       |     |      |   |                             |
| -> 30 |             TABLE ACCESS BY LOCAL INDEX ROWID      | LIT_FACT_BD               |      19 |      |       220 |     +2 |  6465 |      39M |       |       |     |      |    35.92 | Cpu (74)                     |
| -> 31 |              INDEX RANGE SCAN                      | LIT_FACT_BD_IDX20         |       1 |      |       219 |     +3 |  6465 |      39M |       |       |     |      |     9.22 | Cpu (19)                     |
| -> 32 |           INDEX UNIQUE SCAN                        | PK_FIRM                   |       1 |      |       219 |     +3 |   39M |      39M |       |       |     |      |    10.68 | Cpu (22)                     |
| -> 33 |          TABLE ACCESS BY INDEX ROWID               | FIRM                      |       1 |      |       219 |     +3 |   39M |      39M |       |       |     |      |    16.99 | Cpu (35)                     |
===================================================================================================================================================================

You can see from the “->” symbols at the left hand side of the plan that this report was generated while the plan was still running. The thing that leaps out as you glance down the page is the value in the “Rows (Actual)” column at operations 7-9 (the rowsources generated by some nested loop joins), and at operations 29, 32 and 33, which tell us something about how those rowsources were generated.

Operation 29 has executed (Execs) 6,465 times so far, producing a total of 39M rows, and operations 32 and 33 have each executed 39M times, producing a total of 39M rows by index unique scan.

The plan for the 2018 data was similar though the join order for DATE_CODE, LIT_FACT_BD and FIRM was different (and it was the join to LIT_FACT_BD that increased the row count dramatically – so hinting it to be the last table in the join might help a bit), but the largest rowcount for the 2018 query was only 3M rows, not the 39M that had appeared after only 6,465 rows of a possible driving 39,855 in the 2019 query.

So it’s not surprising that the query could take much longer for 2019. It’s not the volume of output that matters, it’s the volume of input (or, more accurately, throughput or intermediate) data that matters.

Let’s think about that volume, though: the 2018 plan generated 3M rows and then crunched them down to 2,500 rows and the 2019 plan was supposed to produce a similar sized output (from 39M+ rows). Could we rewrite the query in some way that made it do some intermediate aggregation so that the volume of data to be aggregated was never enormous?

Let’s take a look at the plan from the 2018 Monitor report:
SQL Plan Monitoring Details (Plan Hash Value=472871521)
=======================================================================================================================================================================================================================================
| Id |                     Operation                      |           Name            |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  | Write | Write |  Mem  | Temp  | Activity |       Activity Detail       |
|    |                                                    |                           | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Reqs  | Bytes | (Max) | (Max) |   (%)     |        (# samples)         |
=======================================================================================================================================================================================================================================
|  0 | SELECT STATEMENT                                   |                           |         |      |         1 |    +24 |    17 |     2613 |      |       |       |       |       |       |   |                            |
|  1 |   PX COORDINATOR                                   |                           |         |      |         1 |    +24 |    17 |     2613 |      |       |       |       |       |       |   |                            |
|  2 |    PX SEND QC (RANDOM)                             | :TQ10003                  |       1 |   39 |         1 |    +24 |     8 |     2613 |      |       |       |       |       |       |   |                            |
|  3 |     HASH UNIQUE                                    |                           |       1 |   39 |         9 |    +16 |     8 |     2613 |      |       |       |       |    9M |       |     6.90 | Cpu (2)                     |
|  4 |      PX RECEIVE                                    |                           |       1 |   38 |         9 |    +16 |     8 |       3M |      |       |       |       |       |       |   |                            |
|  5 |       PX SEND HASH                                 | :TQ10002                  |       1 |   38 |        12 |    +14 |     8 |       3M |      |       |       |       |       |       |     3.45 | Cpu (1)                     |
|  6 |        BUFFER SORT                                 |                           |       1 |   39 |        23 |     +2 |     8 |       3M | 4584 | 143MB |   703 | 143MB |  151M |  151M |    34.48 | Cpu (2)                     |
|    |                                                    |                           |         |      |           |        |       |          |      |       |       |       |       |       |   | direct path read temp (6)  |
|    |                                                    |                           |         |      |           |        |       |          |      |       |       |       |       |       |   | direct path write temp (2) |
|  7 |         NESTED LOOPS                               |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |   |                            |
|  8 |          NESTED LOOPS                              |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |   |                            |
|  9 |           NESTED LOOPS                             |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |   |                            |
| 10 |            NESTED LOOPS                            |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |   |                            |
| 11 |             HASH JOIN                              |                           |      41 |   21 |        15 |     +2 |     8 |    19334 |      |       |       |       |    7M |       |   |                            |
| 12 |              BUFFER SORT                           |                           |         |      |        13 |     +2 |     8 |    19233 |      |       |       |       |    1M |       |   |                            |
| 13 |               PX RECEIVE                           |                           |      89 |   19 |        13 |     +2 |     8 |    19233 |      |       |       |       |       |       |   |                            |
| 14 |                PX SEND HASH                        | :TQ10000                  |      89 |   19 |         1 |     +1 |     1 |    19233 |      |       |       |       |       |       |   |                            |
| 15 |                 NESTED LOOPS                       |                           |      89 |   19 |         1 |     +1 |     1 |    19233 |      |       |       |       |       |       |   |                            |
| 16 |                  NESTED LOOPS                      |                           |      89 |   19 |         1 |     +1 |     1 |    19233 |      |       |       |       |       |       |   |                            |
| 17 |                   TABLE ACCESS BY INDEX ROWID      | FIRM                      |       1 |    2 |         1 |     +1 |     1 |       43 |      |       |       |       |       |       |   |                            |
| 18 |                    INDEX RANGE SCAN                | FIRM_ULT_PARENT_FIRM_IDX1 |       1 |    1 |         1 |     +1 |     1 |       43 |      |       |       |       |       |       |   |                            |
| 19 |                   PARTITION RANGE SINGLE           |                           |         |      |         1 |     +1 |    43 |    19233 |      |       |       |       |       |       |   |                            |
| 20 |                    BITMAP CONVERSION TO ROWIDS     |                           |         |      |         1 |     +1 |    43 |    19233 |      |       |       |       |       |       |   |                            |
| 21 |                     BITMAP INDEX SINGLE VALUE      | LIT_FACT_BD_IDX09         |         |      |         1 |     +1 |    43 |       51 |      |       |       |       |       |       |   |                            |
| 22 |                  TABLE ACCESS BY LOCAL INDEX ROWID | LIT_FACT_BD               |      64 |   19 |         1 |     +1 | 19233 |    19233 |      |       |       |       |       |       |   |                            |
| 23 |              PX RECEIVE                            |                           |      20 |    2 |        15 |     +2 |     8 |       20 |      |       |       |       |       |       |   |                            |
| 24 |               PX SEND HASH                         | :TQ10001                  |      20 |    2 |         1 |    +14 |     8 |       20 |      |       |       |       |       |       |   |                            |
| 25 |                PX BLOCK ITERATOR                   |                           |      20 |    2 |         1 |    +14 |     8 |       20 |      |       |       |       |       |       |   |                            |
| 26 |                 TABLE ACCESS FULL                  | OPPOSING_COUNSEL          |      20 |    2 |         1 |    +14 |     3 |       20 |      |       |       |       |       |       |   |                            |
| 27 |             PARTITION RANGE SINGLE                 |                           |       1 |      |        15 |     +2 | 19334 |       3M |      |       |       |       |       |       |   |                            |
| 28 |              TABLE ACCESS BY LOCAL INDEX ROWID     | LIT_FACT_BD               |       1 |      |        16 |     +1 | 19334 |       3M |      |       |       |       |       |       |    17.24 | Cpu (5)                     |
| 29 |               INDEX RANGE SCAN                     | LIT_FACT_BD_IDX20         |       1 |      |        15 |     +2 | 19334 |       3M |      |       |       |       |       |       |   |                            |
| 30 |            TABLE ACCESS BY INDEX ROWID             | DATE_CODE                 |       1 |      |        15 |     +2 |    3M |       3M |      |       |       |       |       |       |    10.34 | Cpu (3)                     |
| 31 |             INDEX UNIQUE SCAN                      | PK_DATE_CODE              |       1 |      |        16 |     +1 |    3M |       3M |      |       |       |       |       |       |     6.90 | Cpu (2)                     |
| 32 |           INDEX UNIQUE SCAN                        | PK_FIRM                   |       1 |      |        23 |     +2 |    3M |       3M |      |       |       |       |       |       |     6.90 | Cpu (2)                     |
| 33 |          TABLE ACCESS BY INDEX ROWID               | FIRM                      |       1 |      |        16 |     +1 |    3M |       3M |      |       |       |       |       |       |    13.79 | Cpu (4)                     |
=======================================================================================================================================================================================================================================

We see from operations 3 – 7 that the 3M rows generated from the nested loop joins pass up through a buffer sort operation before being crunched down to 2,613 rows. It’s probably the buffer sort (which is buffering but not sorting) that has mostly passed through a single server and spilled to disc in the 2019 report. We just don’t want that 39M+ rows ever to exist.

So how easy will it be to change the SQL? Here’s the statement, reformatted for legibility:
SELECT
        /*+ PARALLEL(8) */
        DISTINCT rd.document_id AS doc_id
        ,'firm_id' AS criteria_key
        ,opf.ultimate_parent_firm_id AS series_id
        ,period_number
FROM (
        SELECT
                /*+ PARALLEL(8) */
                DISTINCT rd.document_id
                ,rd.client_role_id
                ,(
                        CASE 
                                WHEN MONTHS_BETWEEN(TO_DATE('04/28/2022', 'MM/DD/YYYY'), TO_DATE('04/01/2017', 'MM/DD/YYYY')) > 12
                                        THEN TRUNC(dc.date_value, 'YEAR')
                                ELSE TRUNC(dc.date_value, 'MONTH')
                                END
                        ) period_number
        FROM REPORT_ENGINE.date_code dc
                ,REPORT_ENGINE.lit_fact_bd rd
        INNER JOIN report_engine.firm FIRM ON rd.firm_id = FIRM.firm_id
        WHERE dc.date_value BETWEEN TO_DATE('04/01/2017', 'MM/DD/YYYY')
                        AND TO_DATE('04/28/2022', 'MM/DD/YYYY')
                AND rd.publication_date_id = dc.date_id
                AND rd.year_number = to_number('2019')
                AND (FIRM.ultimate_parent_firm_id IN (to_number('252094')))
        ) rd
INNER JOIN report_engine.opposing_counsel op ON rd.client_role_id = op.legal_role_id
INNER JOIN report_engine.lit_fact_bd opclf ON opclf.document_id = rd.document_id
        AND op.opposing_counsel_role_id = opclf.client_role_id
        AND opclf.year_number = to_number('2019')
INNER JOIN report_engine.firm opf ON opclf.firm_id = opf.firm_id
        AND opf.firm_id >= 1000
WHERE 1 = 1;

Lines 7-10 and 27 of the reformatted statement tell us we already have an inline view where we’re doing a “select distinct” and, unwinding the mix of “Oracle” and “ANSI” syntax, we can see that it joins DATE_CODE, LIT_FACT_BD and FIRM, and we know that one of those tables explodes the intermediate data size to something enormous. So it looks like the original author of this code had already worked out that the query needed to aggregate early.

Checking back to the original plans we note that there’s only one “hash unique” operation, and there’s no sign of a “view” operation, so maybe the performance problem is a result of the optimizer suddenly deciding it can do complex view merging with this inline view, and perhaps all we need to do is add the hint /*+ no_merge */ to the inline view and see what happens.
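
Here’s a sketch of where the hint goes – everything else in the statement stays exactly as it was, so I’ve shown only the opening lines of the inline view:

FROM (
        SELECT
                /*+ PARALLEL(8) no_merge */     -- no_merge added: keep this query block as a separate view
                DISTINCT rd.document_id
                ,rd.client_role_id
        ...
        ) rd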

Here’s the plan after adding the hint:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -----
| Id  | Operation                                            | Name                      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -----
|   0 | SELECT STATEMENT                                     |                           |       |       |   314 (100)|          |       |       |        |      | |
|   1 |  PX COORDINATOR                                      |                           |       |       |            |          |       |       |        |      | |
|   2 |   PX SEND QC (RANDOM)                                | :TQ10005                  |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,05 | P->S | QC (RAND)   |
|   3 |    HASH UNIQUE                                       |                           |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,05 | PCWP | |
|   4 |     PX RECEIVE                                       |                           |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,05 | PCWP | |
|   5 |      PX SEND HASH                                    | :TQ10004                  |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,04 | P->P | HASH |
|   6 |       HASH UNIQUE                                    |                           |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,04 | PCWP | |
|   7 |        NESTED LOOPS                                  |                           |    23 |  2599 |   313   (1)| 00:00:06 |       |       |  Q1,04 | PCWP | |
|   8 |         NESTED LOOPS                                 |                           |    23 |  2599 |   313   (1)| 00:00:06 |       |       |  Q1,04 | PCWP | |
|   9 |          NESTED LOOPS                                |                           |    23 |  2323 |   310   (1)| 00:00:06 |       |       |  Q1,04 | PCWP | |
|* 10 |           HASH JOIN                                  |                           |   388 | 21340 |   148   (1)| 00:00:03 |       |       |  Q1,04 | PCWP | |
|  11 |            PX RECEIVE                                |                           |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,04 | PCWP | |
|  12 |             PX SEND BROADCAST                        | :TQ10002                  |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,02 | P->P | BROADCAST   |
|  13 |              PX BLOCK ITERATOR                       |                           |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,02 | PCWC | |
|* 14 |               TABLE ACCESS FULL                      | OPPOSING_COUNSEL          |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,02 | PCWP | |
|  15 |            VIEW                                      |                           |   835 | 39245 |   146   (1)| 00:00:03 |       |       |  Q1,04 | PCWP | |
|  16 |             HASH UNIQUE                              |                           |   835 | 63460 |   146   (1)| 00:00:03 |       |       |  Q1,04 | PCWP | |
|  17 |              PX RECEIVE                              |                           |   835 | 63460 |   145   (0)| 00:00:03 |       |       |  Q1,04 | PCWP | |
|  18 |               PX SEND HASH                           | :TQ10003                  |   835 | 63460 |   145   (0)| 00:00:03 |       |       |  Q1,03 | P->P | HASH |
|* 19 |                HASH JOIN BUFFERED                    |                           |   835 | 63460 |   145   (0)| 00:00:03 |       |       |  Q1,03 | PCWP | |
|  20 |                 BUFFER SORT                          |                           |       |       |            |          |       |       |  Q1,03 | PCWC | |
|  21 |                  PX RECEIVE                          |                           |   835 | 52605 |   136   (0)| 00:00:03 |       |       |  Q1,03 | PCWP | |
|  22 |                   PX SEND HASH                       | :TQ10000                  |   835 | 52605 |   136   (0)| 00:00:03 |       |       |        | S->P | HASH |
|  23 |                    NESTED LOOPS                      |                           |   835 | 52605 |   136   (0)| 00:00:03 |       |       |        |      | |
|  24 |                     NESTED LOOPS                     |                           |   835 | 52605 |   136   (0)| 00:00:03 |       |       |        |      | |
|  25 |                      TABLE ACCESS BY INDEX ROWID     | FIRM                      |     1 |    12 |     2   (0)| 00:00:01 |       |       |        |      | |
|* 26 |                       INDEX RANGE SCAN               | FIRM_ULT_PARENT_FIRM_IDX1 |     1 |       |     1   (0)| 00:00:01 |       |       |        |      | |
|  27 |                      PARTITION RANGE SINGLE          |                           |       |       |            |          |    30 |    30 |        |      | |
|  28 |                       BITMAP CONVERSION TO ROWIDS    |                           |       |       |            |          |       |       |        |      | |
|* 29 |                        BITMAP INDEX SINGLE VALUE     | LIT_FACT_BD_IDX09         |       |       |            |          |    30 |    30 |        |      | |
|* 30 |                     TABLE ACCESS BY LOCAL INDEX ROWID| LIT_FACT_BD               |   598 | 30498 |   136   (0)| 00:00:03 |    30 |    30 |        |      | |
|  31 |                 PX RECEIVE                           |                           |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,03 | PCWP | |
|  32 |                  PX SEND HASH                        | :TQ10001                  |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,01 | P->P | HASH |
|  33 |                   PX BLOCK ITERATOR                  |                           |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,01 | PCWC | |
|* 34 |                    TABLE ACCESS FULL                 | DATE_CODE                 |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,01 | PCWP | |
|  35 |           PARTITION RANGE SINGLE                     |                           |     1 |    46 |     0   (0)|          |    30 |    30 |  Q1,04 | PCWP | |
|* 36 |            TABLE ACCESS BY LOCAL INDEX ROWID         | LIT_FACT_BD               |     1 |    46 |     0   (0)|          |    30 |    30 |  Q1,04 | PCWP | |
|* 37 |             INDEX RANGE SCAN                         | LIT_FACT_BD_IDX20         |     1 |       |     0   (0)|          |    30 |    30 |  Q1,04 | PCWP | |
|* 38 |          INDEX UNIQUE SCAN                           | PK_FIRM                   |     1 |       |     0   (0)|          |       |       |  Q1,04 | PCWP | |
|  39 |         TABLE ACCESS BY INDEX ROWID                  | FIRM                      |     1 |    12 |     0   (0)|          |       |       |  Q1,04 | PCWP | |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -----

Note particularly that operations 15 and 16 tell us that we’ve forced the optimizer into treating the inline view as a separate query block and we now have two aggregation steps, one inside the view, and another after joining FIRM (again) and LIT_FACT_BD (again) to the inline view.

Unfortunately the plan shown here was pulled from memory using dbms_xplan.display_cursor() after execution, so it includes the various parallel execution columns (TQ, IN-OUT, PQ Distrib) but doesn’t have the rowsource execution stats enabled, which means we can’t see the actual workload and data volumes. In the words of the OP, though: “adding no_merge hint did the trick and now the SQL is just executing fine”.
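
As an aside, here’s a sketch of how the rowsource execution stats could have been captured on a test run (a generic technique, not something the OP supplied):

alter session set statistics_level = all;

-- re-run the statement, then immediately report its last execution
select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

-- for a parallel run you may need format 'allstats all' (or the SQL Monitor
-- report) to see the work done by the PX slaves rather than just the QC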

Summary

The steps for solving the performance problems of a specific SQL statement are very fluid. For a long-running or parallel statement the SQL Monitor report will automatically be created (though there are limits on the size of the plan that may disable the feature) and this is the easiest source of useful information, though you might also need to pull the execution plan from v$sql_plan to get details about parallel execution and partitioning at the same time.
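
For reference, pulling the text version of a SQL Monitor report on demand is a one-liner – a sketch, using the sql_id from the 2019 report above, and assuming the relevant pack licence:

select  dbms_sqltune.report_sql_monitor(
                sql_id => '3cjg20q2vw41f',
                type   => 'TEXT'
        )
from    dual
;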

If you’re not licensed for the diagnostic and performance packs, though, enabling SQL Trace to get the plan and waits gets you a lot of information, and querying (g)v$pq_tqstat immediately after running the query can fill in the parallel traffic details.
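
Something like the following, run from the same session immediately after the statement completes (the view reports only the most recent parallel statement executed by the session):

select  dfo_number, tq_id, server_type, instance, process, num_rows, bytes
from    v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, process
;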

In this example the SQL Monitor report showed a highly skewed distribution, which might have been fixable by changing the PQ distribution, or even by simply changing the degree of parallelism.

Further examination of the report showed that the query generated an enormous rowsource which it then crunched down to a small result set. Comparing the 2018 and 2019 plans (which were not quite identical but were sufficiently similar to justify comparison) the same skew and explosion of rowsource were visible in both, though the smaller data size involved in the 2018 plan made it seem that the plan was a “good” one when really it wasn’t.

The obvious target for tuning was to see if the explosion in volume could be reduced or eliminated by writing the query with some intermediate non-mergeable view(s), and it turned out that the query had been written with that intent in its original form but without a hint to block complex view merging. After adding the hint the performance was acceptable.

May 20, 2022

Quiz Night

Filed under: Infrastructure,Oracle,Problem Solving,Statistics — Jonathan Lewis @ 6:09 pm BST May 20,2022

It’s a long time since I’ve done a quiz night – but here’s something that fooled me (briefly) when it appeared on the Oracle Developers’ Forum. Here’s a table definition – and I’m not going to make it easy by giving you a “create table” statement – but it’s just a simple heap table:

sql> desc interr_skuplannparam
 name                             null?    type
 -------------------------------- -------- -------------------
 atpdur                                    number(38)
 depdmdopt                                 number(38)
 externalskusw                             number(1)
 firstreplendate                           date
 lastfrzstart                              date
 lastplanstart                             date
 plandur                                   number(38)
 planleadtime                              number(38)
 planleadtimerule                          number(38)
 planshipfrzdur                            number(38)
 restrictdur                               number(38)
 allocbatchsw                              number(1)
 cmpfirmdur                                number(38)
 custservicelevel                          float(126)
 maxchangefactor                           float(126)
 mfgleadtime                               number(38)
 recschedrcptsdur                          number(38)
 cpppriority                               number(38)
 cpplocksw                                 number(1)
 criticalmaterialsw                        number(1)
 aggexcesssupplyrule                       number(38)
 aggundersupplyrule                        number(38)
 bufferleadtime                            number(38)
 maxoh                                     float(126)
 maxcovdur                                 number(38)
 drpcovdur                                 number(38)
 drpfrzdur                                 number(38)
 drprule                                   number(38)
 drptimefencedate                          date
 drptimefencedur                           number(38)
 incdrpqty                                 float(126)
 mindrpqty                                 float(126)
 mpscovdur                                 number(38)
 mfgfrzdur                                 number(38)
 mpsrule                                   number(38)
 mpstimefencedate                          date
 mpstimefencedur                           number(38)
 incmpsqty                                 float(126)
 minmpsqty                                 float(126)
 shrinkagefactor                           number(38)
 item                                      varchar2(50 char)
 loc                                       varchar2(50 char)
 expdate                                   date
 atprule                                   number(38)
 prodcal                                   varchar2(50 char)
 prodstartdate                             date
 prodstopdate                              date
 orderingcost                              float(126)
 holdingcost                               float(126)
 eoq                                       float(126)
 ff_trigger_control                        number(38)
 workingcal                                varchar2(50 char)
 lookaheaddur                              number
 orderpointrule                            number
 orderskudetailsw                          number(1)
 supsdmindmdcovdur                         number(38)
 orderpointminrule                         number(38)
 orderpointminqty                          float(126)
 orderpointmindur                          number(38)
 orderuptolevelmaxrule                     number(38)
 orderuptolevelmaxqty                      float(126)
 orderuptolevelmaxdur                      number(38)
 aggskurule                                number(38)
 fwdbuymaxdur                              number(38)
 costuom                                   number(38)
 cumleadtimedur                            number(38)
 cumleadtimeadjdur                         number(38)
 cumleadtimerule                           number(38)
 roundingfactor                            float(126)
 limitplanarrivpublishsw                   number(1)
 limitplanarrivpublishdur                  number
 maxohrule                                 number(1)
 integration_stamp                         date
 integration_jobid                not null varchar2(32 char)
 error_str                                 varchar2(2000 char)
 error_stamp                               date

The column integration_jobid (the single “not null” column) has been defined with the default value of “INT_JOB”, which takes 7 bytes to store. What’s the result of the query at the end of this little script:

truncate table interr_skuplannparam;

insert into interr_skuplannparam (atpdur) 
select   0 
from     all_objects 
where    rownum <= 10000
/

commit;

execute dbms_stats.gather_table_stats(user,'interr_skuplannparam')

select  avg_row_len 
from    user_tables
where   table_name = 'INTERR_SKUPLANNPARAM' -- added later
;

Hint: the value zero is represented internally as a single byte holding the value 0x80 (decimal 128).
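
If you want to check that detail for yourself, a quick call to the dump() function from SQL*Plus will show it – this is the output I’d expect to see (the column alias is just my choice):

SQL> select dump(0) zero_dump from dual;

ZERO_DUMP
------------------
Typ=2 Len=1: 128

1 row selected.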

Answer

It’s fairly common knowledge that Oracle includes a “length byte” in the estimates for average column length and average row length. It’s also fairly well known that “trailing nulls take no space” in a row so don’t need a length byte. There are a couple of “less common knowledge” details to add, though:

  1. any column that is null in every row gets a zero for the avg_col_len even if it isn’t a “trailing null”.
  2. but point 1 doesn’t apply to date columns that are always null: they get a 1 for the avg_col_len even if the column is a “trailing null”. The same is true for the various timestamp and interval types.
  3. for columns that hold at least one value the avg_col_len is the average over all rows of the actual space used by that column’s data, rounded up, plus 1 for the length byte.
  4. the avg_row_len is not the sum(avg_col_len); it is based on the average of the summed column lengths for each row, plus the count of the length bytes recorded.
  5. User-defined types, LOBs, varray types etc. introduce all sorts of other strange effects. (But that’s probably “more common” knowledge.)

So what does that mean in my example where there’s a declared not null column near the end of the row, with two trailing columns and with every column except the first and the non-null column set to null for every single row in the table? The easy option is to create the model and show you the results of querying user_tab_cols.

break on report 
compute sum of avg_col_len on report

select  avg_row_len 
from    user_tables
where   table_name = 'INTERR_SKUPLANNPARAM'
;

select 
        column_id, segment_column_id, column_name, data_type, avg_col_len 
from
         user_tab_cols 
where 
        table_name = 'INTERR_SKUPLANNPARAM'
and     avg_col_len != 0
order by
        column_id
;


============

AVG_ROW_LEN
-----------
         20

1 row selected.


 COLUMN_ID SEGMENT_COLUMN_ID COLUMN_NAME              DATA_TYPE                 AVG_COL_LEN
---------- ----------------- ------------------------ ------------------------- -----------
         1                 1 ATPDUR                   NUMBER                              2
         4                 4 FIRSTREPLENDATE          DATE                                1
         5                 5 LASTFRZSTART             DATE                                1
         6                 6 LASTPLANSTART            DATE                                1
        29                29 DRPTIMEFENCEDATE         DATE                                1
        36                36 MPSTIMEFENCEDATE         DATE                                1
        43                43 EXPDATE                  DATE                                1
        46                46 PRODSTARTDATE            DATE                                1
        47                47 PRODSTOPDATE             DATE                                1
        73                73 INTEGRATION_STAMP        DATE                                1
        74                74 INTEGRATION_JOBID        VARCHAR2                            8
        76                76 ERROR_STAMP              DATE                                1
                                                                                -----------
sum                                                                                      20

12 rows selected.


My query of user_tab_cols orders by column_id; technically it should order by segment_column_id to show the physical ordering in the data segment and allow for all the strange effects you can get in more complex scenarios, but in this very simple case the two values are the same.

As you can see, every date type (including the trailing error_stamp) has an avg_col_len of 1, even though all the dates are null in every row. Column atpdur has avg_col_len = 2, which is 1 byte for storing zero plus a length byte, and integration_jobid has avg_col_len = 8, which is 7 bytes for storing ‘INT_JOB’ plus a length byte.

In this case where every single row is identical there are no rounding effects due to the calculation of average column length (the column data stored is the same in every row) so the avg_row_len = sum(avg_col_len).

Conclusion

This was an extreme (though real) production case. I wrote this note because someone was asking why there was such a massive difference between their estimate of the space needed for a table and the actual space taken by the table.

There are many reasons why this could happen, of course, but in their case their estimate was based on multiplying the avg_row_len by the number of rows in the table – and they had a large number of columns that were always null but still required space for the length byte, so the avg_row_len was about half the actual stored space for the average row. (And then the avg_row_len doesn’t allow for the row overhead and row directory entry, which total another 5 bytes per row).
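
As a rough sketch of the kind of cross-check I’d suggest – the dictionary views and columns are the standard ones, the table name is from my model, and for a production-sized segment you’d probably query dba_segments rather than user_segments:

select
        t.num_rows * t.avg_row_len / 1024       estimated_kb,
        s.bytes / 1024                          allocated_kb
from
        user_tables     t,
        user_segments   s
where
        t.table_name   = 'INTERR_SKUPLANNPARAM'
and     s.segment_name = t.table_name
;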

Lagniappe

It wasn’t part of the quiz, but once you’ve got a simple model you can always start playing with it to find out a little more, and to check how well your guesswork holds up as you make the scenario more complicated. For example, let’s update a few columns in just one row of the table, gather stats again, and report the row and column lengths.

update interr_skuplannparam
set 
        shrinkagefactor = 1234567890,
        item            = 'xxx',
        expdate         = sysdate
where
        rownum = 1
/

commit;

execute dbms_stats.gather_table_stats(user,'interr_skuplannparam')

select  avg_row_len 
from    user_tables
where   table_name = 'INTERR_SKUPLANNPARAM'
;

select 
        column_id, segment_column_id, column_name, data_type, avg_col_len 
from
        user_tab_cols 
where 
        table_name = 'INTERR_SKUPLANNPARAM'
and     avg_col_len != 0
order by
        column_id
;

============================

AVG_ROW_LEN
-----------
         22

1 row selected.


 COLUMN_ID SEGMENT_COLUMN_ID COLUMN_NAME              DATA_TYPE                 AVG_COL_LEN
---------- ----------------- ------------------------ ------------------------- -----------
         1                 1 ATPDUR                   NUMBER                              2
         4                 4 FIRSTREPLENDATE          DATE                                1
         5                 5 LASTFRZSTART             DATE                                1
         6                 6 LASTPLANSTART            DATE                                1
        29                29 DRPTIMEFENCEDATE         DATE                                1
        36                36 MPSTIMEFENCEDATE         DATE                                1
        40                40 SHRINKAGEFACTOR          NUMBER                              2
        41                41 ITEM                     VARCHAR2                            2
        43                43 EXPDATE                  DATE                                2
        46                46 PRODSTARTDATE            DATE                                1
        47                47 PRODSTOPDATE             DATE                                1
        73                73 INTEGRATION_STAMP        DATE                                1
        74                74 INTEGRATION_JOBID        VARCHAR2                            8
        76                76 ERROR_STAMP              DATE                                1
                                                                                -----------
sum                                                                                      25


The total of the avg_col_len has gone up from 20 to 25 – this is two bytes each for the shrinkagefactor and item columns (a tiny average for the stored data, plus 1 for a length byte), and one extra byte for the expdate column (a tiny average for the stored data). All three values rounded up from “1 and a bit” to 2.

The avg_row_len, however, has gone up by only 2 – which I am going to assume is the two new length bytes, with no allowance for the impact of the one row in 10,000 that is now a few bytes longer. It looks as if the rounding rules for the row length may be different from the rounding (up) rules for the column length.

May 5, 2022

Lag/Lead slow

Filed under: Bugs,Oracle,Performance,Problem Solving,Troubleshooting,Tuning — Jonathan Lewis @ 10:05 am BST May 5,2022

This note is about a surprising performance difference between the lead() and lag() analytic functions (which turns out to be due to the behaviour of the nth_value() function) when the option to “ignore nulls” is included in their use (jump to conclusion). The detail I’ll be writing about was highlighted in a thread on the Oracle developer forum about a requirement to add a number of analytic columns to a dataset of 156 million rows using a statement of the following shape:

create table tb_target_ignore
as
select
        pat_id,
        visit_date_dt,
        ed_ucc_dt,
        lag (ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
        lead(ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt,
        row_number() over (partition by pat_id order by visit_date_dt) as row_num
from
        tb_source
;

You’ll notice that I’ve introduced a row_number(), and both a lead() and a lag() of a column called ed_ucc_dt. All three analytic columns use the same partitioning and ordering, though, so Oracle will only be doing one “window sort” in the execution plan. Part of the performance problem, of course, was that with 156M rows of a couple of dozen existing columns and adding a dozen new columns, the workload due to sorting was bound to be very large, so there were various suggestions of how to minimise that part of the workload.

However Solomon Yakobson pointed out that the code was using the “ignore nulls” option and there was a bug in 11g that made lead() and lag() very slow when this option was used. He subsequently reported that this defect was still present in 19c, though restricted to just the lag() function, and cited a MOS document ID referencing the problem: LAG Function is slow when using ignore nulls (Doc ID 2811596.1). The implication of the MOS note is that we shouldn’t expect this to change.

A follow-up posting by User_H3J7U gave us a reason for the slowness of the lag() function by running a sample query through dbms_utility.expand_sql(). Oracle rewrites the query to use variants of the nth_value() function when you use “ignore nulls”, but rewrites it to use variants of first_value() when you aren’t using the “ignore nulls” option. This isn’t a complete explanation of why lag() should be slow while lead() is not – but it’s a significant pointer towards a possible implementation issue and is a good clue about working around the problem. So let’s build a model of the situation.

The basic model

rem
rem     Script:         lag_ignore_nulls.sql
rem     Author:         Jonathan Lewis / Sam P
rem     Dated:          May 2022
rem     Purpose:     
rem
rem     Last tested
rem             19.11.0.0
rem

create table tb_source (
        pat_id,
        visit_date_dt,
        ed_ucc_dt
)
as
with generator as (
        select rownum id
        from    dual
        connect by
                level <= 1e4    --> comment to avoid wordpress format issue
)
select
        g1.id,
        to_date('01-Apr-2022') + dbms_random.value(0,100),
        to_date('01-Apr-2022') + dbms_random.value(5,105)
--      to_date(null)
from
        generator g1,
        generator g2
where
        g2.id <= 20     --> comment to avoid wordpress format issue
order by
        dbms_random.value
/


spool lag_ignore_nulls.lst

set serveroutput off
set timing on

prompt  ======================
prompt  Without "ignore nulls"
prompt  (My time 0.61 seconds)
prompt  ======================

create table tb_target_no_ignore
as
select
        pat_id,
        visit_date_dt,
        ed_ucc_dt,
        lag (ed_ucc_dt, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
        lead(ed_ucc_dt, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt,
        row_number() over (partition by pat_id order by visit_date_dt) as row_num
from
        tb_source
;

prompt  ======================
prompt  With "ignore nulls"
prompt  (My time 0.88 seconds)
prompt  ======================

create table tb_target_ignore
as
select
        pat_id,
        visit_date_dt,
        ed_ucc_dt,
        lag (ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
        lead(ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt,
        row_number() over (partition by pat_id order by visit_date_dt) as row_num
from
        tb_source
;

I’ve created a source table with 200,000 rows, consisting of 10,000 pat_id values, and 20 rows per pat_id. The 20 rows for a pat_id (probably) each have a different visit_date_dt and a different ed_ucc_dt.

After creating the data set I’ve created two more tables using the lead() and lag() functions to generate a previous (lag) and next (lead) ed_ucc_dt for each row, partitioning by pat_id, ordering by visit_date_dt. One statement includes the “ignore nulls” option the other doesn’t and, as you can see, the time to create the “no ignore” table was 0.61 seconds while the time to create the “ignore null” table was 0.88 seconds.

The variation isn’t dramatic – but this is just 200,000 rows, in memory, with only a few columns and only two columns added through lead and lag.

After the baseline test I tweaked the statement that created the table with the “ignore nulls” option to get three more times.

  • With neither lead() nor lag() the time was 0.29 seconds
  • With just the lead() column the time was 0.46 seconds – an increase of 0.17 seconds
  • With just the lag() column the time was 0.71 seconds – an increase of 0.42 seconds

You might note that 0.29 + 0.17 + 0.42 = 0.88 (the time I got for adding both columns) – it’s a little lucky that it looks like a perfect match, but even matching within a couple of hundredths of a second would have been a nice detail. It certainly seems that lag() – with my test data – consumes more resources than lead() for a pair of operations that look as if they should produce the same workloads.

Internal Rewrite

The next step was to check what the internal rewrite of the code looked like, so I passed the select part of the statements (the procedure won’t accept “create as select”) through dbms_utility.expand_sql() and reformatted the results. Here are the two rewritten statements – first without “ignore nulls”:

select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
        decode(
                count(*) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 preceding and 1 preceding
                        ),
                 1,     first_value(a1.ed_ucc_dt) over (
                                partition by a1.pat_id order by a1.visit_date_dt
                                rows between 1 preceding and 1 preceding
                        ),
                        null
        ) prev_ed_ucc_dt,
        decode(
                count(*) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 following and 1 following
                        ),
                 1,     first_value(a1.ed_ucc_dt) over (
                                partition by a1.pat_id order by a1.visit_date_dt
                                rows between 1 following and 1 following
                        ),
                        null
        ) next_ed_ucc_dt,
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1


The code looks a little long and messy, but that’s mainly because just about everything it does happens twice. The lag() function (prev column) turns into a first_value() function that looks at the row preceding the current row in the partition (rows between 1 preceding and 1 preceding). However it first has to count over the same clause to see if a row exists, and then either report its value or report a null – hence the structure decode(count(), 1, first_value(), null)

Note: the full lag() function call is: “lag(expression, offset, default)” where the offset (number of rows to lag) defaults to 1 and the default is the value you want reported when there is no matching row, and defaults to null.

The call to lead() basically does the same thing, but uses (rows between 1 following and 1 following) to access the next row in the partition.

On the other hand this is the SQL that Oracle generates when we include the “ignore nulls” clause (which means Oracle can’t restrict the row range to just one preceding or following row):

select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
        nvl(
                nth_value(a1.ed_ucc_dt, 1) from last ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between unbounded preceding and 1 preceding
                ),
                 null
        ) prev_ed_ucc_dt,
        nvl(
                nth_value(a1.ed_ucc_dt, 1)           ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 following and unbounded following
                ),
                null
        ) next_ed_ucc_dt,
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1

Both lag() and lead() turn into nth_value() with a second parameter of 1 (i.e. nth == 1st … which makes you wonder why Oracle isn’t using first_value()), and we can also see the “ignore nulls” still being used.

The lag() call now uses the range (rows between unbounded preceding and 1 preceding) i.e. everything from the start of partition to the previous row, while the lead() call uses the range (rows between 1 following and unbounded following) i.e. from the next row to the end of partition.

The other important detail to note is that the translation of the lag() call also includes the clause “from last” – in other words we want the first row when reading the partition in reverse order, and that might have something to do with the extra time it takes to operate the (translated) lag() function.

Workaround

Oracle is using a generic nth_value() to translate a generic lead()/lag(), but we’re in the special case where we know n = 1, which means we (and Oracle) could use first_value()/last_value(). It’s perfectly reasonable for Oracle’s internal code to avoid special cases if it makes no difference to performance, of course, but maybe in this case we could imitate Oracle’s rewrite to get some benefit.

  • Step 1 – change nth_value() to the appropriate first/last.
  • Step 2 – get rid of the “from last” which won’t be needed with last_value()
  • Step 3 – move the “ignore nulls” to the spot that Oracle wants to see it with first/last

Here’s the resulting SQL – I’ve left the nvl(expression, null) wrapper in place, but if you wanted a null as the default return value for the original lead()/lag() calls you could simplify the code a little further.

create table tb_target 
as
select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
        nvl(
                last_value(a1.ed_ucc_dt ignore nulls) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between unbounded preceding and 1 preceding
                ),
                 null
        ) prev_ed_ucc_dt,
        nvl(
                first_value(a1.ed_ucc_dt ignore nulls) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 following and unbounded following
                ),
                null
        ) next_ed_ucc_dt,
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1
/

  • Run-time: 0.61 seconds.
  • Run-time with just last_value() / preceding: 0.47 seconds
  • Run time with just first_value() / following: 0.43 seconds

There still seems to be a little discrepancy between accessing the preceding data and accessing the following data, but there’s a much better balance than before.

One more edit – taking out the nvl() construct because the original lead()/lag() calls didn’t have a non-null default supplied:

create table tb_target 
as
select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
--
        last_value(a1.ed_ucc_dt ignore nulls) over (
                partition by a1.pat_id order by a1.visit_date_dt
                rows between unbounded preceding and 1 preceding
        ) prev_ed_ucc_dt,
--
        first_value(a1.ed_ucc_dt ignore nulls) over (
                partition by a1.pat_id order by a1.visit_date_dt
                rows between 1 following and unbounded following
        ) next_ed_ucc_dt,
--
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1
/

With this simplification the time dropped by a further couple of hundredths of a second, hovering between 0.57 and 0.58 seconds.

There was one last detail about the test code that should be mentioned – do the changes in code still produce the same results? As a quick and dirty check I ran the following query after each variant of creating the tb_target table (tb_target_ignore is the table created using the original lead()/lag() code with “ignore nulls”):

select  *
from   (
        select * from tb_target minus select * from tb_target_ignore
        union all
        select * from tb_target_ignore minus select * from tb_target
        )
/

The result was always zero rows.

Hypothesis

I raised the idea that the difference in timing for the lead() and lag() functions might have something to do with the volume of data that Oracle could be processing to find the one row it needed.

My thinking was that the lead() rewrite – the call to nth_value(ed_ucc_dt,1) – would simply be looking at the next row in the partition (if it existed) because my data has no nulls that need to be ignored, while the rewrite of the lag() function with its “from last” requirement could be making Oracle re-read the entire preceding section of the partition before starting to process it backwards.

As a possible check to see if this was a viable hypothesis I ran one more test – visible in the initial declaration of tb_source – I created the data with ed_ucc_dt set to null in every single row, so that Oracle would be forced to process from the current position to whichever end of the partition was relevant regardless of whether it was calling lead() or lag().

With this change in place the timings for the lead() only and lag() only statements were nearly identical – which is weak support for the hypothesis.

And once I’d done that test the next obvious one was to increase the size of each partition (using non-null values for ed_ucc_dt) to see if larger partitions would increase the difference between the forward and backward tests. To do this I changed the script that creates the tb_source table to produce 5,000 pat_id values with 40 rows per pat_id by changing the where clause to:

where
        g2.id <= 40     --> comment to avoid wordpress format issue
and     g1.id <= 5e3    --> comment to avoid wordpress format issue

With this change in place the timings for the original form of the lead()/lag() statement were:

  • With both lead() and lag() in place the time was 1.05 seconds
  • With neither lead() nor lag() the time was 0.25 seconds
  • With just the lead() column the time was 0.41 seconds – an increase of 0.16 seconds
  • With just the lag() column the time was 0.98 seconds – an increase of 0.73 seconds

So the lag() time (ballpark figures) nearly doubles as the partition size doubles but the lead() time stays pretty much the same.

The results of these two tests do tend to suggest that the generic nth_value() implementation can do some short-circuiting when working “forwards”, using a mechanism that isn’t available when the “from last” clause requires it to work “backwards”.

Writing the previous paragraph prompted me to do one last test – it wouldn’t produce the same results, of course, but I ought to check the performance when I moved the “from last” clause out of the “prev”/lag() column expression into the “next”/lead() column expression in Oracle’s original translation to confirm that the problem was associated with the “from last” and not with the choice of “preceding” or “following” in the row range section of the over() clause. (It was the “from last” that made the difference.)

tl;dr

If you’re using the lag() or lead() functions with “ignore nulls” on a very large dataset you may find that you can rewrite the code with first_value() or last_value() calls that use less CPU. The benefit is probably only significant on fairly large data sets, and may be particularly noticeable for cases where the same over() clause is used many times and the partition sizes are more than a couple of dozen rows each.

The potential for excess CPU usage comes from the effect of a generic internal rewrite using the nth_value() function with the “from last” clause even for the special case where your lead()/lag() have an offset (i.e. n) of 1 which would allow for the use of first_value()/last_value().

To get the correct rewrite you can use dbms_utility.expand_sql() to generate a suitable statement from which you can extract and edit the relevant pieces of text.
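
For reference, here’s a minimal sketch of the call – the documented procedure is dbms_utility.expand_sql_text(), and the bind variable name and the embedded query are just examples:

var m_sql clob

begin
        dbms_utility.expand_sql_text(
                input_sql_text  => q'{
                        select
                                lag(ed_ucc_dt ignore nulls, 1) over (
                                        partition by pat_id order by visit_date_dt
                                ) prev_ed_ucc_dt
                        from    tb_source
                        }',
                output_sql_text => :m_sql
        );
end;
/

set long 20000
print m_sql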

Footnote

A further point raised by Solomon Yakobson was that in the generic case (where you can’t use first_value() / last_value() to rewrite the SQL because you want to use an offset greater than one) it’s worth noting that the following two expressions are effectively identical:

        nvl(
                nth_value(a1.ed_ucc_dt, 2) from last ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between unbounded preceding and 1 preceding
                ),
                 null
        ) prev_ed_ucc_dt

        nvl(
                nth_value(a1.ed_ucc_dt, 2)           ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt desc
                        rows between 1 following and unbounded following
                ),
                 null
        ) prev_ed_ucc_dt
/

Note particularly that to avoid the issue of scanning the data “from last”, I’ve changed the ordering in the over() clause to descending and replaced “rows between unbounded preceding and 1 preceding” with “rows between 1 following and unbounded following”, reversing the positions of 1 and unbounded. (The same strategy can be applied to the lag() function in the original code.)

In general it’s likely that lead() will be more efficient than lag(), so this change could make a big difference in performance. It’s worth noting, however, that if you’ve been using both lead() and lag() as my example did then your execution plan will (truthfully) show two “window sort” operations and the second (i.e. lower numbered in the plan) window sort will also be sorting the columns added by the first window sort. Depending on the nature of your data the additional sort might increase the workload by more than the benefit you get from eliminating a lag().

In an example created by Solomon Yakobson the effect of using lag( ,5) was catastrophic and the change that introduced the extra sort to get a lead( ,5) made a huge difference (40 seconds down to 0.15 seconds – and I haven’t made a mistake in the decimal point there); in my original test case the improvement I got from applying the same strategy was small (0.88 seconds down to 0.86 seconds). The critical difference between the test cases was that one had a single partition of 20,000 rows, the other had 10,000 partitions of only 20 rows.

(Footnote to the footnote – it did cross my mind that if you had a performance catastrophe both with the simple lag(,n) and with the extra sort from reversing the order to use a lead(,n), Oracle might do something clever if you found a way to rewrite the query with a cascade of inline-views that all used last_value() – but I decided I wasn’t going to look at that option until someone paid me to. More seriously I did wonder about the option for using match_recognize to bypass the lag() problem, and may get around to thinking about that more carefully at some point.)

September 5, 2021

Quiz Night

Filed under: Oracle,Problem Solving — Jonathan Lewis @ 6:35 pm BST Sep 5,2021

This little observation came from running a couple of tests while looking at a problem on OTN – hence the odd bit of PL/SQL.

declare
        cursor c1
        is
        select * from t1;

        rec c1%rowtype;
        ch char(1);

begin
        open c1;
        loop
              fetch c1 into rec;
                  exit when c1%notfound;
                  select null into ch from t1 where id  = rec.id;
                  select null into ch from t1 where id  = rec.id;
                  select null into ch from t1 where id  = rec.id;
        end loop;
        close c1;
end;

There is a simple B-tree index on t1(id), and t1 was populated with the values from 1 to 10,000 in order. A query against user_tables (immediately after gathering stats) reported 20 blocks and 10,000 rows. This is NOT an attempt to trick anyone; it’s simply demonstrating a surprising result.

The trace file shows a tablescan for the driving cursor and an index(-only) path for the three identical queries inside the loop.
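
If you want to reproduce the trace, a minimal sketch of the mechanism I’d use (assuming execute privilege on dbms_monitor) is:

execute dbms_monitor.session_trace_enable(waits=>true, binds=>false)
-- run the anonymous PL/SQL block here
execute dbms_monitor.session_trace_disable

The trace file will appear in the “Diag Trace” directory reported by v$diag_info.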

Question: if you look at the session activity stats (v$mystat) from running this anonymous PL/SQL block what value (ballpark figure) would you expect to see for the following statistics:

table scans (short tables)
table scan rows gotten
table scan blocks gotten

If it’s any help: when I counted rows per block I had 15 blocks with 657 rows and one block with 145 rows. (The 20 blocks reported in the table stats is the number of formatted blocks below the highwater mark, but some may be empty and some may be space management (bitmap) blocks).
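
For anyone who wants to play along before reading on, this is a sketch of the query I’d use to read those figures – run it before and after the PL/SQL block and compare (it assumes you can see v$mystat and v$statname):

select
        sn.name, ms.value
from
        v$statname      sn,
        v$mystat        ms
where
        ms.statistic# = sn.statistic#
and     sn.name in (
                'table scans (short tables)',
                'table scan rows gotten',
                'table scan blocks gotten'
        )
order by
        sn.name
;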

Answers

For my test data, block size, and segment space management the results I got for 19.11.0.0 (and they’d be pretty much the same for any version back to 10g) were as follows:

table scans (short tables)                  1
table scan rows gotten              6,495,760
table scan blocks gotten               10,000

Sneak insight of where that 6.5M comes from – I said I had 15 blocks with 657 rows and one with 145 rows:

15 * (657 * 657) + (145 * 145) = 6,495,760

When the PL/SQL loop is doing single row fetches it releases the table block after each row, so it has to do 10,000 gets on the table – which are all counted towards the “table scan blocks gotten”. Then, to fetch one row from a block the code seems to believe that it has to visit every single row in the block, hence the bit of arithmetic that produced a row scan count of around 6.5M for this “short table” tablescan.

When I ran up this test initially the three indexed accesses in the middle were emulating the code supplied in the original statement of the problem but, as Viacheslav Andzhich noted in his follow-up to comment #3, they don’t have anything to do with this effect.

We could change from an explicit cursor loop to an implicit cursor loop as follows, of course:

declare
        cursor c1
        is
        select * from t1;

        ch char(10) := '0';

begin
        for rec in c1 loop
                  select null into ch from t1 where id  = rec.id;
                  select null into ch from t1 where id  = rec.id;
                  select null into ch from t1 where id  = rec.id;
        end loop;
end;
/

While you might think this won’t make much (or any) difference, you need to remember that quite a long time ago Oracle introduced a PL/SQL optimisation behind the scenes that turns the single row loop into an array fetch of 100 rows (see footnote), so the table scan results I got were as follows:

table scans (short tables)                 1
table scan rows gotten                74,531
table scan blocks gotten                 115

Again the numbers are higher than we might expect but rather more realistic. Again the results are due to double (and treble, and quadruple…) counting at the start of each array fetch, but I haven’t yet tried to work out the exact numbers in detail – though you can see that since I have roughly 7 arrays per block it’s not entirely surprising that the “blocks gotten” should be somewhere in the region of 7 * “data blocks below hwm”.

Addendum

It’s hard to leave out a little detail when the numbers don’t quite fit expectations – so I enabled the 10202 trace at level 4 while running the second loop and checked the resulting trace file for the buffer header dumps as the tablescan took place. (I commented out the 3 indexed accesses before I did this). Here’s a sample of the dump for one buffer access:

buffer tsn: 6 rdba: 0x03400084 (13/132)
scn: 0x82af1e seq: 0x02 flg: 0x04 tail: 0x0201002c
frmt: 0x02 chkval: 0x716f type: 0x06=trans data
Hex dump of corrupt header 4 = CORRUPT
Dump of memory from 0x000000009B0BE000 to 0x000000009B0BE014
09B0BE000 0000A206 03400084 0082AF1E 04020000  [......@.........]
09B0BE010 0000716F                             [oq..]
Consistent read complete...

Once I’d got this trace file I just extracted all the lines starting with “buffer tsn”, did a sort with count on them, and then worked out (and annotated each block address in turn) how each block had contributed to multiples of 100 rows. The results were as follows – and the answer was 115 buffer visits as reported:

grep "^buffer tsn" or21_ora_10141.trc | sort | uniq -c >temp.txt
vi temp.txt

      7 buffer tsn: 6 rdba: 0x03400084 (13/132)              600 + 57
      8 buffer tsn: 6 rdba: 0x03400085 (13/133)         43 + 600 + 14
      7 buffer tsn: 6 rdba: 0x03400086 (13/134)         86 + 500 + 71
      8 buffer tsn: 6 rdba: 0x03400087 (13/135)         29 + 600 + 28
      7 buffer tsn: 6 rdba: 0x03400088 (13/136)         72 + 500 + 85
      8 buffer tsn: 6 rdba: 0x03400089 (13/137)         15 + 600 + 42
      7 buffer tsn: 6 rdba: 0x0340008a (13/138)         58 + 500 + 99
      8 buffer tsn: 6 rdba: 0x0340008b (13/139)          1 + 600 + 56
      8 buffer tsn: 6 rdba: 0x0340008c (13/140)         44 + 600 + 13
      7 buffer tsn: 6 rdba: 0x0340008d (13/141)         87 + 500 + 70
      8 buffer tsn: 6 rdba: 0x0340008e (13/142)         30 + 600 + 27
      7 buffer tsn: 6 rdba: 0x0340008f (13/143)         73 + 500 + 84
      8 buffer tsn: 6 rdba: 0x03400090 (13/144)         16 + 600 + 41
      7 buffer tsn: 6 rdba: 0x03400091 (13/145)         59 + 500 + 98
      8 buffer tsn: 6 rdba: 0x03400092 (13/146)          2 + 600 + 55
      2 buffer tsn: 6 rdba: 0x03400093 (13/147)         45 + 100

To finish off the arithmetic, this also shows 113 visits to blocks with 657 rows and 2 visits to the block with 145 rows: (113 * 657) + (2 * 145) = 74,531 (as reported by “table scan rows gotten”).

Setup

If you want to do some experimentation on variations, here’s the code I used to create my test data.

rem
rem     Script:         tablescan_joke.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sept 2021
rem
rem     Last tested 
rem             21.3.0.0
rem             19.11.0.0
rem
rem     Notes:
rem     When you do a tablescan with single row fetches 
rem     inside a pl/sql loop, every row fetch is a block 
rem     fetch and the row count increments by the number
rem     of rows in the block.
rem
rem     In 19.3 a call to rowid_count gave me the result
rem     ROWS_STARTING_IN_BLOCK     BLOCKS
rem     ---------------------- ----------
rem                        145          1
rem                        657         15
rem     So total rows =
rem             657 * (10000 - 145) +
rem             145 * 145
rem

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum                          id
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;

create index t1_i1 on t1(id);

select  blocks, num_rows 
from    user_tables
where   table_name = 'T1'
;

The code I used to check the number of rows per block is at this URL.
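
That script isn’t reproduced here, but if you want a quick substitute the following sketch using dbms_rowid gives the same sort of summary:

select
        rows_per_block, count(*) blocks
from    (
        select
                dbms_rowid.rowid_block_number(rowid)    block_no,
                count(*)                                rows_per_block
        from
                t1
        group by
                dbms_rowid.rowid_block_number(rowid)
        )
group by
        rows_per_block
order by
        rows_per_block
;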

Footnote

In 10g Oracle introduced the plsql_optimize_level to affect the way in which the pl/sql optimizer was allowed to manipulate your code. The default level is 2, and at this level an implicit cursor for loop will use an array fetch size of 100 behind the scenes. If you want to disable this optimisation you need to set the parameter value down to 1 (or zero).
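
A minimal sketch of testing the effect – assuming you’re allowed to change the session setting:

alter session set plsql_optimize_level = 1;

-- re-run the implicit cursor loop and compare the
-- "table scan blocks gotten" figure with the results above

alter session set plsql_optimize_level = 2;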

August 5, 2021

Finding SQL

Filed under: dbms_xplan,Oracle,Problem Solving,Troubleshooting — Jonathan Lewis @ 6:24 pm BST Aug 5,2021

There are some questions about performance issues for which there is no easy answer, and sometimes the best you can do is try to generate an approximate answer then examine the results to eliminate the innocent.

Three such problems – essentially similar – appeared recently on the Oracle Developer Forum, and in this note I’ll supply a mechanism that may be a good enough step in the right direction for at least two of them. The questions were:

  • How do I find who’s been using up all the temporary space recently?
  • How do I find the execution plans for all the SQL that is called by a procedure?
  • What SQL is responsible for generating most redo?

A basic (but incomplete) strategy for attacking these questions is to think of a way to identify the sql_id and child_number for statements that might be contributing to the problem. If you can think of a suitable attack then you can get all the execution plans (or SQL Monitor reports) for those statements and examine them further.

The strategy is incomplete on two counts – first that you won’t find the perfect set of statements: you’ll get more than you need but may still miss some that are relevant; second that you’ll probably have to run secondary queries to get extra details about statements that look like good culprits for the problem you’re trying to solve.

Who’s been using the temporary space

The way this question was actually posed was more like:

I’ve got a query that started crashing with Oracle error “ORA-01652: unable to extend temp segment by 256 in tablespace TEMP”, but when I check the contents of the temporary tablespace (TEMP) there’s plenty of space available. How do I fix this problem?

This often means your query has changed execution plan and picked a very bad join order with some hash (or merge) joins that have dumped huge amounts of data to disc; but it may mean that the space was being taken up by some other activity that doesn’t usually happen when you’re running your query.
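
If you suspect the latter, a first quick check – just a sketch, using the standard columns of v$tempseg_usage (use the gv$ equivalents on RAC) – is to see which sessions are holding a lot of temp space right now:

select
        se.sid, se.username, su.sql_id, su.segtype,
        round(su.blocks * ts.block_size / 1048576)      mb_used
from
        v$tempseg_usage su,
        v$session       se,
        dba_tablespaces ts
where
        se.saddr           = su.session_addr
and     ts.tablespace_name = su.tablespace
order by
        mb_used desc
;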

So if you find that the SQL Monitor report (or a simple call to dbms_xplan.display_cursor()) makes you think that your query wasn’t doing anything differently, a step that may help is to find all the SQL in the library cache that might have been using a lot of temporary space.

For temporary space we can always check v$sql_workarea to identify operations from plans that ran as one-pass or multi-pass operations, and we can check their most recent “tempseg size” or maximum tempseg size. Hence:

rem
rem     Script:         check_full_2a.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Aug 2021
rem
rem     Columns of interest for temporary space:
rem     onepass_executions, multipasses_executions, last_tempseg_size, max_tempseg_size

select  
        distinct sql_id, child_number
from    v$sql_workarea 
where   onepass_executions != 0
or      multipasses_executions != 0
-- or      max_tempseg_size > 1e7
/


select
        distinct
        sql.sql_id, sql.child_number
from
        v$sql           sql,
        v$sql_workarea  war
where
        (   war.onepass_executions != 0
         or war.multipasses_executions != 0
        )
and     sql.sql_id = war.sql_id
and     sql.child_number = war.child_number
and     sql.last_active_time > sysdate - 15/1440
;

The first query here simply scans through the v$sql_workarea structure (which means it will actually thrash its way through the library cache) looking for operations that spilled to disk or (commented out) used at least some specified amount of temporary space.

The second variation joins v$sql_workarea to v$sql so that it can restrict the chosen statements to those that were active some time in the last 15 minutes.

Obviously you will be able to think of other ways of tweaking these statements, and once you’ve got a statement expressing a suitable set of criteria you can embed it into a query that calls dbms_xplan.display_cursor() – as I demonstrated about 10 years ago – or dbms_sql_monitor.report_sql_monitor() if you’re suitably licensed.

set linesize 230
set trimspool on
set pagesize 90
set tab off

set long 20000


select
        plan_table_output
from    (
        select  
                distinct sql_id, child_number
        from    v$sql_workarea 
        where   onepass_executions != 0
        or      multipasses_executions != 0
        ) v,
        table(dbms_xplan.display_cursor(v.sql_id, v.child_number, 'memstats'))
;


select
        dbms_sql_monitor.report_sql_monitor(
                sql_id             => v.sql_id,
                start_time_filter  => sysdate - 15/1440,
                type               =>'TEXT'
        ) text_line
from    (
        select
                distinct sql.sql_id
        from
                v$sql           sql,
                v$sql_workarea  war
        where
                (   war.onepass_executions != 0
                or war.multipasses_executions != 0
                )
        and     sql.sql_id = war.sql_id
        and     sql.child_number = war.child_number
        and     sql.last_active_time > sysdate - 15/1440
        ) v
/



A couple of points to note. I’ve included the MEMSTATS format option in the call to dbms_xplan.display_cursor() so that it shows some summary information from v$sql_workarea. However this does have a defect: it doesn’t show space used in temp by materialized “with” subqueries (CTEs) – which is where the call to dbms_sql_monitor.report_sql_monitor() helps, because if the execution was captured it will show writes to disc in the “LOAD AS SELECT” operation under the TEMPORARY TABLE TRANSFORMATION operation.

Here’s a sample of output I got from the two queries after forcing a nasty plan to do a big hash join that ultimately produced a small result set.

First the output from the query using package dbms_xplan:

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------
SQL_ID  1cwabt12zq6zb, child number 0
-------------------------------------
with ttemp as (  select /*+ materialize */ * from t1 ) select  /*+
no_partial_join(@sel$2 t1b) no_place_group_by(@sel$2) */
t1a.object_type,  max(t1a.object_name) from  ttemp t1a, ttemp t1b where
 t1a.object_id = t1b.object_id group by  t1a.object_type order by
t1a.object_type

Plan hash value: 1682228242

-----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                      | Starts | E-Rows | A-Rows |   A-Time   |  OMem |  1Mem |  O/1/M   | Max-Tmp |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                           |      1 |        |     45 |00:00:16.24 |       |       |          |         |
|   1 |  TEMP TABLE TRANSFORMATION               |                           |      1 |        |     45 |00:00:16.24 |       |       |          |         |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6645_5FF4FE |      1 |        |      0 |00:00:02.53 |  2070K|  2070K|          |         |
|   3 |    TABLE ACCESS FULL                     | T1                        |      1 |   1154K|   1154K|00:00:00.55 |       |       |          |         |
|   4 |   SORT GROUP BY                          |                           |      1 |     45 |     45 |00:00:13.72 | 11264 | 11264 |     1/0/0|         |
|*  5 |    HASH JOIN                             |                           |      1 |     18M|     18M|00:00:08.63 |    58M|    12M|          |      71M|
|   6 |     VIEW                                 |                           |      1 |   1154K|   1154K|00:00:01.03 |       |       |          |         |
|   7 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6645_5FF4FE |      1 |   1154K|   1154K|00:00:01.03 |       |       |          |         |
|   8 |     VIEW                                 |                           |      1 |   1154K|   1154K|00:00:00.49 |       |       |          |         |
|   9 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6645_5FF4FE |      1 |   1154K|   1154K|00:00:00.43 |       |       |          |         |
-----------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("T1A"."OBJECT_ID"="T1B"."OBJECT_ID")


30 rows selected.



Note the absence of any numbers in the Max-Tmp column for operation 2 (highlighted line 18).

Then compare with the results below of the query using package dbms_sql_monitor:

TEXT_LINE
-----------------------------------------------------------------------
SQL Monitoring Report

SQL Text
------------------------------
with ttemp as ( select /*+ materialize */ * from t1 ) 
select /*+ no_partial_join(@sel$2 t1b) no_place_group_by(@sel$2) */ t1a.object_type, max(t1a.object_name) 
from ttemp t1a, ttemp t1b where t1a.object_id = t1b.object_id 
group by t1a.object_type order by t1a.object_type

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  SYS (10:47180)
 SQL ID              :  1cwabt12zq6zb
 SQL Execution ID    :  16777216
 Execution Started   :  08/05/2021 17:24:56
 First Refresh Time  :  08/05/2021 17:25:00
 Last Refresh Time   :  08/05/2021 17:25:12
 Duration            :  16s
 Module/Action       :  MyModule/MyAction
 Service             :  orclpdb
 Program             :  sqlplus@linux183.localdomain (TNS V1-V3)
 Fetch Calls         :  4

Global Stats
================================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  | Write | Write |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes | Reqs  | Bytes |
================================================================================
|      16 |      14 |     2.23 |     4 |  84677 |  986 | 253MB |   866 | 219MB |
================================================================================

SQL Plan Monitoring Details (Plan Hash Value=1682228242)
=====================================================================================================================================================================================================================
| Id |                 Operation                  |           Name            |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  | Write | Write |  Mem  | Temp  | Activity | Activity Detail |
|    |                                            |                           | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes | Reqs  | Bytes | (Max) | (Max) |   (%)    |   (# samples)   |
=====================================================================================================================================================================================================================
|  0 | SELECT STATEMENT                           |                           |         |       |         1 |    +16 |     1 |       45 |      |       |       |       |     . |     . |          |                 |
|  1 |   TEMP TABLE TRANSFORMATION                |                           |         |       |         1 |    +16 |     1 |       45 |      |       |       |       |     . |     . |          |                 |
|  2 |    LOAD AS SELECT (CURSOR DURATION MEMORY) | SYS_TEMP_0FD9D6645_5FF4FE |         |       |         5 |     +0 |     1 |        2 |      |       |   351 | 176MB |     . |     . |          |                 |
|  3 |     TABLE ACCESS FULL                      | T1                        |      1M |  3166 |         1 |     +4 |     1 |       1M |      |       |       |       |     . |     . |          |                 |
|  4 |    SORT GROUP BY                           |                           |      45 | 17727 |        11 |     +6 |     1 |       45 |      |       |       |       | 10240 |     . |          |                 |
|  5 |     HASH JOIN                              |                           |     18M |  9804 |        13 |     +4 |     1 |      18M |  561 |  66MB |   561 |  66MB |  59MB |  71MB |          |                 |
|  6 |      VIEW                                  |                           |      1M |  2966 |         1 |     +4 |     1 |       1M |      |       |       |       |     . |     . |          |                 |
|  7 |       TABLE ACCESS FULL                    | SYS_TEMP_0FD9D6645_5FF4FE |      1M |  2966 |         2 |     +3 |     1 |       1M |  351 | 176MB |       |       |     . |     . |          |                 |
|  8 |      VIEW                                  |                           |      1M |  2966 |         3 |     +6 |     1 |       1M |      |       |       |       |     . |     . |          |                 |
|  9 |       TABLE ACCESS FULL                    | SYS_TEMP_0FD9D6645_5FF4FE |      1M |  2966 |         5 |     +4 |     1 |       1M |   74 |  12MB |       |       |     . |     . |          |                 |
=====================================================================================================================================================================================================================


1 row selected.

In this report you can see that the materialization resulted in 176MB of data being written to temp at operation 2 (highlighted line 43), which is actually more than was written to temp as the hash join spilled over from memory.

There are a number of defects in this call to dbms_sql_monitor – including the need for an extra cost license. Most particularly (a) the plan is captured only if the query runs for more than 6 seconds or has a parallel component and (b) we’re only passing in the SQL_ID with a time limit, so we could get several executions reported. We could refine the inputs, though, by including the sql_plan_hash_value in our query against v$sql, and we might include an end-time filter.
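
For example – this is only a sketch, and the parameter names are the ones I believe dbms_sql_monitor.report_sql_monitor() accepts, with the sql_id and plan hash value copied from the demonstration above:

select
        dbms_sql_monitor.report_sql_monitor(
                sql_id              => '1cwabt12zq6zb',
                sql_plan_hash_value => 1682228242,
                start_time_filter   => sysdate - 15/1440,
                end_time_filter     => sysdate,
                type                => 'TEXT'
        ) text_line
from
        dual
/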

Whatever we do to minimise the number of plans reported the point will almost certainly come where we have to eyeball the data to see if we can identify the queries that were most likely running and using the temp space we needed.

How do I find the execution plans for all the SQL that is called by a procedure?

To be written

What SQL is responsible for generating most redo?

To be written

April 13, 2021

CSV

Filed under: Infrastructure,Oracle,Problem Solving — Jonathan Lewis @ 3:00 pm BST Apr 13,2021

An old question with a little extra twist came up on the Oracle Developer forum recently – someone wanted to export a table into a CSV file but wanted to do a character set conversion at the same time because various “non-printable” characters were getting corrupted.

Looking at the code they had supplied they were running from SQL*Plus and were trying to use a select statement to generate a script that, for each row in a table, produced a statement of the form:

insert into tableX({list of columns}) values({list of values});

This worked reasonably well when they didn’t attempt to get a UTF8 output, but when they modified their client O/S environment to make SQL*Plus produce the desired UTF8 output the operation was much slower, increasing from about 1 minute to more than half an hour. (This may simply be the overhead of SQL*Net translating each line of output from the server character set to the client character set.)

Since the system in question was running on 12.2 I suggested a change of strategy so that they could at least take advantage of a built-in CSV generator to see how this behaved. Oracle extended the “set markup” command in 12c to include a CSV option:

set markup csv on delimiter {character} quote on

If this behaves nicely with the user’s attempt to get the output in their chosen character set, then a change in strategy to generate a “proper” CSV file and reload it through an external table, rather than generating and executing a list of “insert ….” statements, might do a lot less work. (Unfortunately they had a requirement to generate the insert statements rather than adopting this bulk strategy – but I wasn’t going to let that stop me finishing this note.)

One detail to note with this markup option is that, unlike the “delimiter” character, the “quote” character cannot be specified and will always be the double-quote (ASCII character 34). If this character appears in any of the reported strings Oracle will “escape” it by duplicating it. Here’s a little example to demonstrate the effects – first a little data with a basic output (running 12.2.0.1):

rem
rem     Script:         markup.csv.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2020
rem

create table t1 (v1 varchar2(10), n1 number, n2 number, v2 varchar2(10));

insert into t1 values('abcdef',1,1,'qwer');
insert into t1 values('abc"def',2,null,'asdf');
insert into t1 values (null,3,3,null);

commit;

prompt  ==================
prompt  markup not yet set
prompt  ==================

select * from t1;


V1                 N1         N2 V2
---------- ---------- ---------- ----------
abcdef              1          1 qwer
abc"def             2            asdf
                    3          3

3 rows selected.

I’ve included one row with a double-quote in one of the columns, and a couple of rows with NULLs in various columns. Now I just switch markup to CSV using its default options:

prompt  ==================
prompt  Default CSV markup
prompt  ==================

set markup csv on
select * from t1;


"V1","N1","N2","V2"
"abcdef",1,1,"qwer"
"abc""def",2,,"asdf"
,3,3,

3 rows selected.


As you can see this has behaved nicely – every column is present (even when empty – judging by the number of delimiters) and character strings have been quoted with the double-quote character, and the stored double-quote in the second row has been duplicated. The default option for CSV should be nearly perfect for use as an external table – though we probably want to get rid of column headings, page breaks, and feedback etc.

Before creating a suitable external table, let’s play around with a couple of variations for the options:

prompt  ===========
prompt  delimiter =
prompt  ===========

set heading off
set pagesize 0
set feedback off

set markup csv on delimiter =
select * from t1;

"abcdef"=1=1="qwer"
"abc""def"=2=="asdf"
=3=3=

prompt  ================================
prompt  quote off (delimiter remembered)
prompt  ================================

set markup csv on quote off
select * from t1;

abcdef=1=1=qwer
abc"def=2==asdf
=3=3=

Now to read back one of the outputs – I’ll pick the default output for this (but with headers etc. off to leave just three lines of data). Here’s how I can define my external table to read back the values from the file:

create or replace directory ext_tab as '/mnt/working/';
create or replace directory ext_log as '/mnt/working/';

create table ext(
        v1 varchar2(10), 
        n1 number, 
        n2 number, 
        v2 varchar2(10)
)
organization external(
        type oracle_loader
        default directory ext_tab
        access parameters (
                records delimited by newline 
                discardfile     ext_log:'exttab_dis.txt'
                badfile         ext_log:'exttab_bad.txt'
                logfile         ext_log:'exttab_log.txt'
        fields 
                csv 
                with embedded
        missing field values are null
        )
        location ('markup_csv.lst')
)
reject limit 10
;

select * from ext;

V1                 N1         N2 V2
---------- ---------- ---------- ----------
abcdef              1          1 qwer
abc"def             2            asdf
                    3          3

3 rows selected.

I’ve created a couple of directories to get started – linking a physical location to a logical name that I can use in the external table definition.

The key details that allow Oracle to read the file correctly appear in the “fields” specification: it’s a pity that the output from the default CSV markup doesn’t become the default expectation for the external table CSV option, but I needed to use “with embedded” to handle the duplicated “double-quote” characters and then add “missing field values are null” to cope with a null column at the end of the line.

Summary

With a simple call to “set markup csv on” (and a handful of other set {formatting} commands) it’s very easy to turn a table into a flat CSV file that is comma separated, (double)quote delimited.

It’s then very easy – but requires two non-default settings – to read this flat file back into an Oracle database using “fields CSV” as the basic formatting directive.

Footnote

I’ve kept the example very simple, but as soon as you start dealing with dates and timestamps – and even very large numbers – you’ll have to worry about formatting when generating the flat file (set nls_date_format, column n1 format 99999, etc.) to make sure you don’t lose precision on the output; and for dates and timestamps you’ll have to include the equivalent date_format options in the external table definition.
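For example, if the table had included a date column (call it d1 – a hypothetical addition to my demo table), the two sides of the exercise might look something like this sketch:

rem     spool side: fix the date format before selecting

alter session set nls_date_format = 'yyyy-mm-dd hh24:mi:ss';

rem     load side: matching field list in the access parameters

        fields
                csv
                with embedded
        missing field values are null
        (
                v1,
                n1,
                n2,
                v2,
                d1 char(19) date_format date mask "yyyy-mm-dd hh24:mi:ss"
        )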

I’ve only demonstrated how to reload the file back into Oracle as an external table. If you wanted to use this markup option to prepare the data for an Excel spreadsheet all you have to do is spool to a file with the “.csv” extension and Excel will be able to open the output file and parse the data correctly into a multi-column display. (But see comment #1 below for a warning from Jason Bucata.)

March 12, 2021

Distributed Sequences

Filed under: distributed,Execution plans,Oracle,Performance,Problem Solving — Jonathan Lewis @ 9:09 am GMT Mar 12,2021

A request for help came up some time ago on ODC reporting a query that was hanging when it included a sequence.nextval. In fact the initial “query” was an “insert as select” with a select that was a join of two remote tables.

Making the fairly automatic assumption that many people say “hanging” when they really mean “hasn’t finished yet” the first thought I had about the structure of the statement was that it was just a variation of the standard problem of distributed DML. I haven’t written anything previously about how using sequences can introduce the problem so here’s a note to demonstrate the issue and suggest a workaround:

We start with a database link and a few tables:


rem
rem     Script:         distributed_sequence.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2019
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             18.3.0.0
rem             12.2.0.1
rem

rem     create public database link orcl@loopback using 'orcl';

define m_target=orcl@loopback

create sequence s1 cache 10000;
select s1.nextval from dual;

create table t1
segment creation immediate
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        rownum                          n1,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1
;

create table t2
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        rownum                          n2,
        lpad(rownum,10,'0')             v2,
        lpad('x',100,'x')               padding
from
        generator       v1
;

create table t3(
        id1     number(6,0),
        id2     number(6,0),
        n0      number(6,0),
        n1      number(6,0),
        n2      number(6,0),
        v1      varchar2(10),
        v2      varchar2(10)
)
segment creation immediate
;

create or replace view v1 as
select
        t1.id id1,
        t2.id id2,
        t1.n1,
        t2.n2,
        t1.v1,
        t2.v2
from
        t1, t2
where
        t2.id = t1.id
;

The command to create a public database link (one example of the many optional commands in my original source) has to be run by a suitably privileged schema as a one-off event. The definition of the substitution variable m_target that I’ll be using as my database link (again with many possibilities in my original script) means I only have to edit my choice of database link once in my script as I change Oracle instances. You’ll notice I’ve done my usual trick of using a loopback database link to emulate a distributed system.

I’ve then created two populated tables (t1, t2) and a third empty table (t3) that will be the target of an insert. I’ve also created a view (v1) that joins the first two tables and a sequence (s1) that I’ve primed with a single call to nextval. When I get to the tests I’ll be using t1, t2 and v1 as if they had been created in the remote database (referenced through my loopback database link) while t3 and s1 will be local objects.

So let’s run a couple of statements and see what their execution plans look like:


set serveroutput off

prompt  =======================
prompt  Insert without sequence
prompt  =======================

insert into t3 (id1, id2, n0, n1, n2, v1, v2)
select
        t1.id,
        t2.id,
        0,
        t1.n1,
        t2.n2,
        t1.v1,
        t2.v2
from
        t1@&m_target    t1,
        t2@&m_target    t2
where
        t2.id = t1.id
;

select * from table(dbms_xplan.display_cursor(format=>'-plan_hash'));

prompt  =======================
prompt  Insert WITH sequence
prompt  =======================

insert into t3 (id1, id2, n0, n1, n2, v1, v2)
select
        t1.id,
        t2.id,
        s1.nextval,
        t1.n1,
        t2.n2,
        t1.v1,
        t2.v2
from
        t1@&m_target    t1,
        t2@&m_target    t2
where
        t2.id = t1.id
;

select * from table(dbms_xplan.display_cursor(format=>'-plan_hash'));
commit;

The code joins t1 and t2, selects a couple of columns and inserts them into t3, without or with a call to s1.nextval (the local sequence). Here’s the output (with minor cosmetic changes) from executing this code under 19.3.0.0 (the effects are the same in 18.3.0.0 and 12.2.0.1):


=======================
Insert without sequence
=======================

10000 rows created.


SQL_ID  373nz810u3frv, child number 0
-------------------------------------
insert into t3 (id1, id2, n0, n1, n2, v1, v2) select  t1.id,  t2.id,
0,  t1.n1,  t2.n2,  t1.v1,  t2.v2 from  t1@orclpdb@loopback t1,
t2@orclpdb@loopback t2 where  t2.id = t1.id

-----------------------------------------------------------------
| Id  | Operation                | Name | Cost  | Inst   |IN-OUT|
-----------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |     1 |        |      |
|   1 |  LOAD TABLE CONVENTIONAL | T3   |       |        |      |
|   2 |   REMOTE                 |      |       | ORCLP~ | R->S |
-----------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
   2 - SELECT "A2"."ID","A1"."ID",0,"A2"."N1","A1"."N2","A2"."V1","A1"."
       V2" FROM "T1" "A2","T2" "A1" WHERE "A1"."ID"="A2"."ID" (accessing
       'ORCLPDB.LOCALDOMAIN@LOOPBACK' )

Note
-----
   - cpu costing is off (consider enabling it)


=======================
Insert WITH sequence
=======================

10000 rows created.


SQL_ID  8jg23arujnh01, child number 0
-------------------------------------
insert into t3 (id1, id2, n0, n1, n2, v1, v2) select  t1.id,  t2.id,
s1.nextval,  t1.n1,  t2.n2,  t1.v1,  t2.v2 from  t1@orclpdb@loopback
t1,  t2@orclpdb@loopback t2 where  t2.id = t1.id

-------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |       |       |    54 (100)|          |        |      |
|   1 |  LOAD TABLE CONVENTIONAL | T3   |       |       |            |          |        |      |
|   2 |   SEQUENCE               | S1   |       |       |            |          |        |      |
|*  3 |    HASH JOIN             |      | 10000 |   937K|    54   (8)| 00:00:01 |        |      |
|   4 |     REMOTE               | T1   | 10000 |   468K|    26   (4)| 00:00:01 | ORCLP~ | R->S |
|   5 |     REMOTE               | T2   | 10000 |   468K|    26   (4)| 00:00:01 | ORCLP~ | R->S |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."ID"="T1"."ID")

Remote SQL Information (identified by operation id):
----------------------------------------------------
   4 - SELECT /*+ OPAQUE_TRANSFORM */ "ID","N1","V1" FROM "T1" "A2" (accessing
       'ORCLPDB.LOCALDOMAIN@LOOPBACK' )

   5 - SELECT /*+ OPAQUE_TRANSFORM */ "ID","N2","V2" FROM "T2" "A1" (accessing
       'ORCLPDB.LOCALDOMAIN@LOOPBACK' )

The key thing to notice is that when we want to insert the local sequence as a column in the select list Oracle breaks the hash join into two separate accesses to the remote database and pulls all the data we might need from the two tables before trying to join them locally. When the query is “fully remote” the local database can let the remote database deal with the join, when the query is distributed – which is a side effect of introducing the sequence – the local site becomes the driving site and has to work out the least worst way of handling the join, which might be much slower than the remote join.  (It’s an odd little quirk that when the select is fully remote the optimizer thinks that it’s not using CPU costing. Possibly that’s because all the arithmetic happens somewhere else and the local cost of the query never gets above zero.)

This is one of those cases where we might work around the problem by creating a remote view to handle the join – hence the creation of the view v1; here’s a suitable statement and the resulting execution plan:

prompt  ==============================
prompt  Insert using view and sequence
prompt  ==============================

set serveroutput off

insert into t3 (id1, id2, n0, n1, n2, v1, v2)
select
        v1.id1,
        v1.id2,
        s1.nextval,
        v1.n1,
        v1.n2,
        v1.v1,
        v1.v2
from
        v1@&m_target v1
;

select * from table(dbms_xplan.display_cursor(format=>'-plan_hash'));
commit;



==============================
Insert using view and sequence
==============================

10000 rows created.


SQL_ID  4tz0rrqt87nb8, child number 0
-------------------------------------
insert into t3 (id1, id2, n0, n1, n2, v1, v2) select  v1.id1,  v1.id2,
s1.nextval,  v1.n1,  v1.n2,  v1.v1,  v1.v2 from  v1@orclpdb@loopback v1

-------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |       |       |    27 (100)|          |        |      |
|   1 |  LOAD TABLE CONVENTIONAL | T3   |       |       |            |          |        |      |
|   2 |   SEQUENCE               | S1   |       |       |            |          |        |      |
|   3 |    REMOTE                | V1   | 10000 |   937K|    27   (8)| 00:00:01 | ORCLP~ | R->S |
-------------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
   3 - SELECT /*+ OPAQUE_TRANSFORM */ "ID1","ID2","N1","N2","V1","V2" FROM "V1" "V1"
       (accessing 'ORCLPDB.LOCALDOMAIN@LOOPBACK' )

As you can see, the local optimizer doesn’t know enough about the remote view to be able to split it into components and make a mess of the execution plan; it simply sees a query against a “single table” and leaves the remote database to worry about optimising it. But, of course, we don’t always have the luxury of being able to create objects on someone else’s database, so what’s the alternative?

Try rewriting the query to use an inline view with the /*+ no_merge() */ hint:


prompt  =======================================
prompt  Insert from a no-merge inline view with
prompt  a sequence.nextval in the outer query
prompt  ========================================

set serveroutput off

insert into t3 (id1, id2, n0, n1, n2, v1, v2)
select
        id1, id2, s1.nextval, n1, n2, v1, v2
from    (
        select  /*+ no_merge */
                t1.id   id1,    
                t2.id   id2,
                0       n0,
                t1.n1,
                t2.n2,
                t1.v1,
                t2.v2
        from
                t1@&m_target    t1,
                t2@&m_target    t2
        where
                t2.id = t1.id
        )       v1
;

select * from table(dbms_xplan.display_cursor(format=>'-plan_hash'));
commit;


=======================================
Insert from a no-merge inline view with
a sequence.nextval in the outer query
========================================

10000 rows created.


SQL_ID  20z81g550tbsk, child number 0
-------------------------------------
insert into t3 (id1, id2, n0, n1, n2, v1, v2) select  id1, id2,
s1.nextval, n1, n2, v1, v2 from (  select /*+ no_merge */   t1.id id1,
 t2.id id2,   0 n0,   t1.n1,   t2.n2,   t1.v1,   t2.v2  from
t1@orclpdb@loopback t1,   t2@orclpdb@loopback t2  where   t2.id = t1.id
 ) v1

-------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |       |       |    54 (100)|          |        |      |
|   1 |  LOAD TABLE CONVENTIONAL | T3   |       |       |            |          |        |      |
|   2 |   SEQUENCE               | S1   |       |       |            |          |        |      |
|   3 |    VIEW                  |      | 10000 |   937K|    54   (8)| 00:00:01 |        |      |
|   4 |     REMOTE               |      |       |       |            |          | ORCLP~ | R->S |
-------------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
   4 - SELECT /*+ NO_MERGE */ "A2"."ID","A1"."ID",0,"A2"."N1","A1"."N2","A2"."V1","A1"."V
       2" FROM "T1" "A2","T2" "A1" WHERE "A1"."ID"="A2"."ID" (accessing
       'ORCLPDB.LOCALDOMAIN@LOOPBACK' )


The hint has done exactly what we needed: it has passed the text of the inline view to the remote database for optimisation so the join happens remotely, and the sequence number is then included after the result set comes back from the remote database. The SQL is a little messier, of course, mostly thanks to the doubled appearance of the columns in the select list.

Summary

Using a local sequence in DML that accesses a remote database makes the optimizer treat the underlying query as a distributed query, and this may mean it can’t find an efficient execution path unless you do some re-engineering of the code. If you can make the insert efficient with a constant in place of the sequence, then using that version of the query as an in-line no_merge view, with one extra layer that brings the sequence into play, may be all you need to do to make the DML operate efficiently.

November 26, 2020

Row sizes 3

Filed under: Oracle,Problem Solving — Jonathan Lewis @ 12:08 pm GMT Nov 26,2020

Several years ago I published a couple of examples of scripts that I’d been using since Oracle 6 (with minor changes over time) to analyse the content of tables in some detail. The first was a breakdown of the lengths of the rows in the table, the second was a map showing the distribution of nulls in the rows. There used to be a third component of the analysis which produced a report of the number of non-null entries for each column in the table,  but by the time I published the first two parts there didn’t seem to be much point in publishing the third since you could get a sufficiently accurate picture by querying view user_tab_columns after gathering stats:


break on num_rows

select
        tab.num_rows, col.column_name , col.num_nulls
from 
        user_tables             tab
join
        user_tab_columns        col
on
        col.table_name = tab.table_name
where
        tab.table_name = 'T1' 
order by 
        col.num_nulls desc, col.column_id
/


  NUM_ROWS COLUMN_NAME           NUM_NULLS
---------- -------------------- ----------
     56777 EDITION_NAME              56777
           CREATED_APPID             56777
           CREATED_VSNID             56777
           MODIFIED_APPID            56777
           MODIFIED_VSNID            56777
           SUBOBJECT_NAME            56570
           DATA_OBJECT_ID            55353
           DEFAULT_COLLATION         51058
           EDITIONABLE               40216
           OWNER                         0
           OBJECT_NAME                   0
           OBJECT_ID                     0
           OBJECT_TYPE                   0
           CREATED                       0
           LAST_DDL_TIME                 0
           TIMESTAMP                     0
           STATUS                        0
           TEMPORARY                     0
           GENERATED                     0
           SECONDARY                     0
           NAMESPACE                     0
           SHARING                       0
           ORACLE_MAINTAINED             0
           APPLICATION                   0
           DUPLICATED                    0
           SHARDED                       0


In this example t1 is a copy of view all_objects and I’ve listed the columns in order of most frequently null with the table’s num_rows reported at the top for comparison purposes. If you were going to do something that made it a good idea to rebuild the table you could use this report as a guide to re-arranging the column order to position the most frequently null columns towards the end of the row (saving the “trailing nulls take no space” length bytes).
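As a rough sketch of what that rebuild might look like (the table name is hypothetical, and the column order is simply taken from the report above, most frequently null columns last):

create table t1_reordered
as
select
        owner, object_name, object_id, object_type, created,
        last_ddl_time, timestamp, status, temporary, generated,
        secondary, namespace, sharing, oracle_maintained, application,
        duplicated, sharded, editionable, default_collation,
        data_object_id, subobject_name, edition_name,
        created_appid, created_vsnid, modified_appid, modified_vsnid
from
        t1
;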

The user_tab_columns approach depends on the stats being up to date and accurate, of course, so if you didn’t have them, and didn’t want to collect them, another strategy would be to run a query like the following:


select
        to_char(count(OWNER),'999,999,999,999,999') OWNER,
        to_char(count(OBJECT_NAME),'999,999,999,999,999') OBJECT_NAME,
        to_char(count(SUBOBJECT_NAME),'999,999,999,999,999') SUBOBJECT_NAME,
        to_char(count(OBJECT_ID),'999,999,999,999,999') OBJECT_ID,
        to_char(count(DATA_OBJECT_ID),'999,999,999,999,999') DATA_OBJECT_ID,
        to_char(count(OBJECT_TYPE),'999,999,999,999,999') OBJECT_TYPE,
        to_char(count(CREATED),'999,999,999,999,999') CREATED,
        to_char(count(LAST_DDL_TIME),'999,999,999,999,999') LAST_DDL_TIME,
        to_char(count(TIMESTAMP),'999,999,999,999,999') TIMESTAMP,
        to_char(count(STATUS),'999,999,999,999,999') STATUS,
        to_char(count(TEMPORARY),'999,999,999,999,999') TEMPORARY,
        to_char(count(GENERATED),'999,999,999,999,999') GENERATED,
        to_char(count(SECONDARY),'999,999,999,999,999') SECONDARY,
        to_char(count(NAMESPACE),'999,999,999,999,999') NAMESPACE,
        to_char(count(EDITION_NAME),'999,999,999,999,999') EDITION_NAME,
        to_char(count(SHARING),'999,999,999,999,999') SHARING,
        to_char(count(EDITIONABLE),'999,999,999,999,999') EDITIONABLE,
        to_char(count(ORACLE_MAINTAINED),'999,999,999,999,999') ORACLE_MAINTAINED,
        to_char(count(APPLICATION),'999,999,999,999,999') APPLICATION,
        to_char(count(DEFAULT_COLLATION),'999,999,999,999,999') DEFAULT_COLLATION,
        to_char(count(DUPLICATED),'999,999,999,999,999') DUPLICATED,
        to_char(count(SHARDED),'999,999,999,999,999') SHARDED,
        to_char(count(CREATED_APPID),'999,999,999,999,999') CREATED_APPID,
        to_char(count(CREATED_VSNID),'999,999,999,999,999') CREATED_VSNID,
        to_char(count(MODIFIED_APPID),'999,999,999,999,999') MODIFIED_APPID,
        to_char(count(MODIFIED_VSNID),'999,999,999,999,999') MODIFIED_VSNID,
        to_char(count(*),'999,999,999,999,999') row_count 
from
         t1
;

You don’t really need the to_char() function, but it’s a convenience for what I’m going to do with the SQL. Obviously it would be a bit tedious to create this statement by hand so, as I did in the previous “rowsize” notes, I’ve written some code to generate it for me:


rem
rem     row_size_2b.sql
rem     
rem     Generate SQL to report counts of 
rem     non-null columns in a table.
rem

set linesize 32000
set feedback off
define m_table = '&1'

declare
        m_string        varchar2(32000) := 'select ';
        m_cursor        sys_refcursor;

begin
        for r in (
                select 
                        column_name, data_type
                from    user_tab_columns
                where   table_name = upper('&m_table')
        ) loop
                m_string := m_string ||  
                                ' to_char(count(' || trim(r.column_name) || '),''999,999,999,999,999'') ' ||
                                trim(r.column_name) || ',' || chr(10) ;
        end loop;

        m_string := m_string || ' to_char(count(*),''999,999,999,999,999'') row_count from &m_table';

--      dbms_output.put_line(m_string);
        print_table(m_string);

end;
/

set linesize 156

The script accepts a table name from the user’s schema (you could edit this to query dba_tab_columns), constructs a string (as shown above – though I’ve done a little cosmetic work on it), and then passes the resulting string to a version of Tom Kyte’s print_table() routine which produces the following output:


OWNER                         :               56,777
OBJECT_NAME                   :               56,777
SUBOBJECT_NAME                :                  207
OBJECT_ID                     :               56,777
DATA_OBJECT_ID                :                1,424
OBJECT_TYPE                   :               56,777
CREATED                       :               56,777
LAST_DDL_TIME                 :               56,777
TIMESTAMP                     :               56,777
STATUS                        :               56,777
TEMPORARY                     :               56,777
GENERATED                     :               56,777
SECONDARY                     :               56,777
NAMESPACE                     :               56,777
EDITION_NAME                  :                    0
SHARING                       :               56,777
EDITIONABLE                   :               16,561
ORACLE_MAINTAINED             :               56,777
APPLICATION                   :               56,777
DEFAULT_COLLATION             :                5,719
DUPLICATED                    :               56,777
SHARDED                       :               56,777
CREATED_APPID                 :                    0
CREATED_VSNID                 :                    0
MODIFIED_APPID                :                    0
MODIFIED_VSNID                :                    0
ROW_COUNT                     :               56,777

-----------------

1 rows selected
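For reference, the call itself is trivial – a sketch of a typical invocation (print_table writes through dbms_output, so serveroutput has to be enabled first):

set serveroutput on size unlimited

@row_size_2b t1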

As with many of the scripts I’ve published, the task is one I rarely do, so I haven’t bothered to add in all the extra little refinements that might make it foolproof and totally self-contained. In this case, for example, I’d have to go through a final couple of manual steps to edit the output by deleting the last few lines and then (since I use vi) issuing “!%!sort -k3 -n -r” which would sort the results in reverse order, numerically, according to the third field.

 

print_table()

Filed under: Oracle,Problem Solving — Jonathan Lewis @ 9:40 am GMT Nov 26,2020

Many years ago Tom Kyte published a small PL/SQL procedure to output each row in a table as a list of (column_name , value). I often use the original version, even though Tom refined it several times over the years. From time to time the links in my blog that I’ve set up to point to Tom’s code have simply stopped working so I’ve decided to publish a reference copy on my blog in case newer readers need to find a copy and the current link above stops  working.

rem
rem     Script:         print_table.sql
rem     Dated:          Nov 2001
rem     Author:         Tom Kyte - from Expert One on One Oracle.
rem
rem     Purpose:
rem     Prints the results of a query one column per line.
rem
rem     Created as a 'authid current_user' so that it can
rem     access any table that the caller is allowed to access
rem     from the SQL*Plus prompt.
rem
rem     A suitably privileged user could also make this publicly
rem     available through the public synonym and grant at the end
rem     of the script
rem
rem     Minor modifications by Jonathan Lewis to separate rows
rem     in the result set, and report the number of rows returned
rem

create or replace procedure print_table( p_query in varchar2 )
authid current_user
is
        l_theCursor     integer default dbms_sql.open_cursor;
        l_columnValue   varchar2(4000);
        l_status        integer;
        l_descTbl       dbms_sql.desc_tab;
        l_colCnt        number;
        l_rowcnt        number := 0;
begin
        execute immediate 'alter session set nls_date_format=''dd-mon-yyyy hh24:mi:ss'' ';

        dbms_sql.parse(  l_theCursor,  p_query, dbms_sql.native );
        dbms_sql.describe_columns( l_theCursor, l_colCnt, l_descTbl );

        for i in 1 .. l_colCnt loop
                dbms_sql.define_column(l_theCursor, i, l_columnValue, 4000 );
        end loop;

        l_status := dbms_sql.execute(l_theCursor);

        while ( dbms_sql.fetch_rows(l_theCursor) > 0 ) loop
                for i in 1 .. l_colCnt loop
                        dbms_sql.column_value( l_theCursor, i, l_columnValue );
                        dbms_output.put_line(
                                rpad( l_descTbl(i).col_name, 30 )
                                || ' : ' || l_columnValue
                        );
                end loop;
                dbms_output.new_line;
                dbms_output.put_line( '-----------------' );
                dbms_output.new_line;
                l_rowcnt := l_rowcnt + 1;
        end loop;

        dbms_output.put_line(l_rowcnt || ' rows selected');

        execute immediate 'alter session set nls_date_format=''dd-mon-rr'' ';

exception
        when others then
                execute immediate 'alter session set nls_date_format=''dd-mon-rr'' ';
        raise;
end;
/

-- create public synonym print_table for print_table;
-- grant execute on print_table to public;
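Here’s a quick usage sketch (assuming the procedure has been created and serveroutput is enabled):

set serveroutput on size unlimited

execute print_table('select owner, object_name, object_type from all_objects where rownum <= 2')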

The reason I’m publishing this now is that I’m about to publish the third variant of a piece of code I wrote many years ago, and first blogged about in 2012.

October 5, 2020

Direct Path

Filed under: Oracle,Performance,Problem Solving,RAC,Troubleshooting,Tuning — Jonathan Lewis @ 11:29 am BST Oct 5,2020

Here’s a note that I might have written once already – but I can’t find it and I’ve just been reminded about what it (might have) said by a posting that came up on the Oracle database forum in the last few days.

The posting in question is asking why, after setting the hidden parameter _serial_direct_read to ‘always’ a particular query is now taking hours to complete when it used to complete in a minute or so.

The answer is partly “because you’ve forced direct path serial reads”, partly “because you’re running on RAC” and (most directly) because the optimizer is using a really bad execution plan for that query and the direct path reads have had a massive impact as a consequence. (It turns out, after modelling, that the answer might also include “because you’re running 11.2.0.4”)

I’m going to demonstrate the issue by forcing a very simple query to take a very bad execution plan.

rem
rem     Script:         ko.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem 

create table t1
as
select  *
from    all_objects
where   rownum <= 10000  -- > comment to avoid wordpress format issue
;

create table t2
as
select  *
from    all_objects
where   rownum <= 10000  -- > comment to avoid wordpress format issue
;

select  table_name, blocks 
from    user_tables
where   table_name in ('T1','T2')
;

alter system flush buffer_cache;

execute snap_events.start_snap
execute snap_my_stats.start_snap
alter session set "_serial_direct_read"=always;


select
        /*+ 
                leading(t1 t2)
                use_nl(t2)
                full(t2)
        */
        t1.object_type,
        t2.object_type
from
        t1, t2
where
        t2.object_id = t1.object_id + 0.5
;

execute snap_my_stats.end_snap
execute snap_events.end_snap

alter session set "_serial_direct_read"=auto;


My query very carefully ensures that it’s not going to return any rows; but it’s going to do a lot of work finding no data because I’ve forced Oracle into doing a tablescan of t2 for every row in t1 – so 10,000 scans of a table of 140 – 190 (depending on version) data blocks.

The snap_my_stats and snap_events packages are a couple of my simple diagnostic packages that allow me to find the change in some v$ content between the start and end snapshots. In this case it’s v$mystat and v$session_event for the session.
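If you don’t have equivalent packages, a rough manual substitute (just a sketch, not a replacement for the packages) is to run the following query before and after the test and take the difference between the two sets of figures:

select  sev.event, sev.total_waits, sev.time_waited_micro
from    v$session_event sev
where   sev.sid = (select ms.sid from v$mystat ms where rownum = 1)
order by
        sev.time_waited_micro desc
;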

On a test using 11.2.0.4 the query ran for about 41 seconds with the following wait events reported:

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
db file sequential read                               2           0           0.11        .054           1
direct path read                                  6,527           0         368.30        .056           6
SQL*Net message to client                            13           0           0.00        .000           0
SQL*Net message from client                          13           0      10,689.26     822.251      10,689

The 6,500 “direct path read” waits corresponded to 1.33M “physical reads direct” reported in the session activity stats. Although the t2 table was fairly small Oracle was forced to use direct path reads for every single cycle through the nested loop. As a quick comparison, here are the figures if I don’t force direct path serial scans.

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
db file sequential read                               2           0           0.09        .047           2
db file scattered read                               34           0           2.45        .072           0
SQL*Net message to client                            11           0           0.01        .001           0
SQL*Net message from client                          11           0         174.36      15.851      82,849



We’ve waited for only 34 “db file scattered reads” and 2.45 centiseconds as we read the t2 (and the t1) tables into the cache for the first time, but then we’ve been able to revisit the blocks in the cache. We also saw a reduction in CPU usage and the total run time dropped from 41 seconds to about 22 seconds.

In this tiny example it hasn’t made a staggering difference to the overall run time, but the OP wasn’t that lucky with his “couple of minutes” to “hours”.

If you look at the fragment of the SQL Monitor report supplied by the OP you’ll see that they have an operation which shows:

=====================================================================================================================================================================================================================================
| Id    |                  Operation                    |            Name              | Rows    | Cost |  Time     |  Start | Execs |    Rows  | Read | Read  | Cell   | Mem | Activity |            Activity Detail               |
|       |                                               |                              | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload |    |    (%)   |              (# samples)                 |
=====================================================================================================================================================================================================================================
| -> 25 |       INDEX STORAGE FAST FULL SCAN            | TMRC_IX1                     |      1  |      |     10745 |     +0 |  268K |     112K | 267K |   2GB |  96.53% | 1M |    99.74 | enq: KO - fast object checkpoint (4783)  |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | Cpu (753)                                |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | gcs drm freeze in enter server mode (25) |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | latch free (1)                           |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | reliable message (3505)                  |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | cell smart index scan (1635)             |
=====================================================================================================================================================================================================================================

The “index fast full scan” is an example of a “segment scan” and will be subject to direct path reads, just like a tablescan. We don’t really know how big this index is – but we can see that we have read it 268K times (Execs column) reading approximately 2GB after 267K read requests. This suggests the index is at most 1MB in size, since it can be read in a single read request, and may consist of only one populated block (2,000,000,000/267,000 ≈ 7,490 bytes per read request). Despite this small size the total time sampled for all these scans is more than 10,600 seconds – roughly 39 milliseconds per scan. That’s not very good.

Looking more closely at the sampled time we notice 3 key features:

============================================
|            Activity Detail               |
|              (# samples)                 |
============================================
| enq: KO - fast object checkpoint (4783)  |
| Cpu (753)                                |
| gcs drm freeze in enter server mode (25) |
| latch free (1)                           |
| reliable message (3505)                  |
| cell smart index scan (1635)             |
============================================
  • A massive fraction of the time was spent on “enq: KO – fast object checkpoint”
  • A large fraction of the time was spent on “reliable message”
  • There was some time (relatively small, but large in absolute terms) for “gcs drm freeze …”

The last of these three is about global cache services and the distributed resource manager, and is a clear indication that the system is running RAC and we have some hefty competition for “object mastering” between instances. But the actual time lost there is relatively small – though the associated chatter between instances could be significant.

But what’s the “KO enqueue”? Every time an instance starts a direct path segment scan it has to get a message to the database writer (possibly via the checkpoint process) – hence the “reliable message” waits – to copy every dirty block for that segment from the buffer cache down to disc and it has to wait for the write to complete. This is necessary to ensure that the tablescan doesn’t miss any changes that have been made in memory without yet being written to disc.

The KO enqueue synchronises this activity – I haven’t worked out the complete chain of events, but the enqueue is negotiated between the session and the checkpoint process – and if you’re running RAC every instance has to write any dirty blocks it is holding for the segment, so you have to have a degree of cross-instance chatter to make this happen.

Thanks to the enforced serial direct reads the OP’s plan – which, surely, shouldn’t expect to do 267K index fast full scans – has a massive overhead thanks to the need for the repeated object checkpoints.

You may ask, at this point, why I didn’t see any KO enqueue waits in my test results – the answer is simple, I’d flushed the buffer cache before I started the test, so there were no dirty blocks for the session to worry about. Let’s see what happens if I introduce a little activity to keep dirtying a few blocks in the t2 table. Here’s a little loop that will update a few rows once per second:

begin
        for i in 1..1 loop
                update t2 set data_object_id = 0 where mod(object_id,1000) = i;
                dbms_output.put_line(sql%rowcount);
                commit;
                dbms_lock.sleep(1);
        end loop;
end;
/

You’ll notice the code runs through the loop just once – I started with a loop count of 30, and discovered it wasn’t necessary, but the option remains for further testing.

If I execute this anonymous block (which updates about 10 rows each time through the loop) from another session just after SQL*Plus reports my “alter session”, so that it runs just after the query starts, this is what the session event report looks like if I run the test against 11.2.0.4 (the version reported by the OP):

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: KO - fast object checkpoint                  7,645           0         290.34        .038           6
db file sequential read                               2           0           0.13        .066           2
direct path read                                 10,714           0         675.50        .063           6
SQL*Net message to client                            14           0           0.00        .000           0
SQL*Net message from client                          14           0       1,556.23     111.160     101,653
events in waitclass Other                         5,607           0         218.04        .039           8


Suddenly we see a lot of time spent on the KO enqueue waits and the “events in waitclass Other” (which turn out to be “reliable message” waits). Apparently the session keeps finding dirty t2 blocks in the cache and telling the database writer they need to be written to disc before the next tablescan of t2 can take place.

There’s something odd here, though, and I’ll introduce it with this comment: when I repeated the test on 19.3 (even with the constant trickle of updates once per second), we only see a tiny number of KO enqueues and reliable message waits – the fact that we see a huge number of them in 11g is a defect in the design of the code.

Think about what’s happening with the KO enqueue: when you start the first tablescan of t2 you force every dirty block for that segment to be copied from the cache to the disc.

As the tablescan proceeds you may have to apply some undo change vectors to the blocks you’re reading to take them back to the SCN as at the start of query execution, but you know that any data that had been committed before the query started is on disc (even if it has been over-written by committed changes made after the query started, or by uncommitted changes made before the query started). What’s on the disc right now will be usable to get the correct read-consistent version of the data for the duration of the query run, no matter how many newer changes are made, whether or not they over-write the disc blocks before the query ends. There is no need to force write any dirty blocks as the tablescan is repeated and, it seems, by 19.3 the code has been adjusted to accommodate that fact.

Footnote

After I had posted this comment on the forum, the OP raised the question of whether or not the fix might apply to 12c as well – so I ran up a VM of 12.1.0.2 and 12.2.0.1 and re-ran the tests. The results were initially promising – neither version reported an extreme number of KO enqueue waits or reliable message waits.

However when I changed the loop counter from 1 back to 30 I found that the waits re-appeared – though the numbers were significantly less than those from 11g – so perhaps there’s a timing element involved that might need further investigation and stress testing even for 19.3.

Footnote 2

Another snapshot I took in testing was from v$enqueue_stat – which showed that (approximately) for every KO enqueue wait my session reported, the instance reported about 10 – 12 KO enqueue requests.

When checking v$enqueue_stat it’s important to remember that session activity stats (v$sesstat) report both “enqueue requests” and “enqueue conversions”. In v$enqueue_stat the conversions aren’t reported separately; the view simply adds the two figures together under “requests”.

In the 11g test the session reported 7,645 KO enqueue waits, but the session activity stats reported 19,399 enqueue requests and 38,796 enqueue conversions, while v$enqueue_stat reported 96,990 KO enqueue requests. The remaining 38,796 KO enqueue requests were made by the checkpoint process (CKPT) – and it was only my session that reported any waits for KO enqueue requests.
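For reference, the check against the view is just a simple query (a minimal sketch):

select  eq_type, total_req#, total_wait#, succ_req#, failed_req#, cum_wait_time
from    v$enqueue_stat
where   eq_type = 'KO'
;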

Without further low-level investigation this is what leads me to believe that the session sends CKPT a message that it wants an object-level checkpoint performed and waits for the message to be acknowledged (reliable message) before trying to convert a low-level KO enqueue to an exclusive one. But CKPT has acquired and converted the same KO enqueue before acknowledging the message from the session and will only release the enqueue when the database writer acknowledges that the checkpoint request has been completed. So this mechanism ensures that the session will have to wait until the checkpoint is complete and CKPT has released the enqueue before it can continue processing.

(P.S. If you enable event 10704 you will find that there seem to be two KO enqueues involved – one with id2 = 1, one with id2 = 2.)
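If you want to try that for yourself, this is the standard way to enable a numeric event in a test session (a sketch – the level shown is an assumption, and the trace appears in the session trace file):

alter session set events '10704 trace name context forever, level 10';

-- run the test, then:

alter session set events '10704 trace name context off';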
