Oracle Scratchpad

March 12, 2020

Sequence Costs

Filed under: Bugs,Oracle,Performance,trace files,Troubleshooting — Jonathan Lewis @ 8:18 pm GMT Mar 12,2020

You’re probably aware of the “identity” construct that appeared in 12.1 and uses Oracle’s sequence mechanism to model the Autonumber or Identity columns that other databases have. A posting from Clay Jackson on the Oracle-L list server suggests that something about their code path has introduced a surprising overhead in 19c … when you’re not using them.

The following code is a slightly modified version of a test case that Clay Jackson posted to demonstrate a strange difference in performance between 12.2 and 19.3

rem
rem     Script:         19c_sequence_cost.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2020
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

drop table test_tab2;
drop table test_tab1;

drop sequence test_seq1;
drop sequence test_seq2;

create sequence test_seq1;
create sequence test_seq2;

create table test_tab1 (
   id number primary key,
   c1 varchar2(1000)
);

create table test_tab2 (
   id number,
   t1_id number,
   c2 varchar2(1000),
   constraint test_tab2_fk foreign key (t1_id) 
        references test_tab1 (id) on delete cascade
);

spool 19c_sequence_cost.lst

set feedback off
set pagesize 0
set serveroutput off

execute snap_my_stats.start_snap
alter session set events '10046 trace name context forever';

prompt  ==============================================================================
prompt  start_1000.sql cascades to start_1.sql
prompt  start_1.sql holds:
prompt
prompt  insert into test_tab1 values (test_seq1.nextval,'hello');
prompt  insert into test_tab2 values (test_seq2.nextval, test_seq1.currval, 'byebye');
prompt  commit;
prompt
prompt  ==============================================================================

start start_1000

set serveroutput on
set pagesize 60
alter session set events '10046 trace name context off';
execute snap_my_stats.end_snap

spool off

I’ve got a couple of tables with columns that I plan to generate from sequences, and I’ve got a referential integrity constraint between those tables. I’ll be using nextval from one sequence to populate the first table, then use currval from the same sequence for the foreign key and nextval from the other sequence as a unique identifier for the second table.

I’ve used my “repeater” code to run a short script from the SQL*Plus prompt 1,000 times (as suggested by Clay Jackson in his posting). I’ve also added lines to enable SQL_trace at the basic level, and taken a snapshot of the session activity stats.

Just as Clay Jackson had reported – 19.3.0.0 took significantly longer than 12.2.0.1 to complete the 1,000 calls to the script. (Though in absolute terms we’re only talking fractions of a second for a small single user test.) Examination of the stats – which prompted me to add the 10046 trace and repeat – made it very clear where the difference lay. Here are a few figures picked from the session activity stats:

Version    Statistic                        Value
========   ========================         ===== 
12.2.0.1   CPU used by this session            53  
19.3.0.0   CPU used by this session           114

12.2.0.1   recursive cpu usage                 18
19.3.0.0   recursive cpu usage                 44

12.2.0.1   recursive calls                  1,182
19.3.0.0   recursive calls                  7,215

12.2.0.1   execute count                    2,137                                     
19.3.0.0   execute count                    8,151

Clearly it’s worth finding out what was happening in those 6,000 extra recursive calls. And this is what the 19.3 trace file (after passing through tkprof … sort=execnt) showed:


select count(1)
from
 sys.idnseq$ where seqobj# = :1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute   6000      0.07       0.09          0          0          0           0
Fetch     6000      0.03       0.04          0       6000          0        6000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    12001      0.10       0.14          0       6000          0        6000

If you can’t guess what the table idnseq$ is about, the following definition appears in $ORACLE_HOME/rdbms/admin/dcore.bsq:

create table idnseq$     /* stores table identity column to sequence mapping */
( obj#         number not null,                       /* table object number */
  intcol#      number not null,                    /* identity column number */
  seqobj#      number not null,                    /* sequence object number */
  startwith    number not null,                   /* sequence starting value */
  supplemental log group idnseq$_log_grp (obj#, intcol#, seqobj#) always
)
/

We’re not using identity columns in this test, but every time we run the script we do six checks against this data dictionary table to see (presumably) if we’re using a sequence that is internally associated with an identity column. The requirement doesn’t seem completely unreasonable – but it’s a bit surprising that we do every check twice. (Note – we have to check test_seq1 for both statements but test_seq2 only for the second statement, which is how we get 6,000 executions: 2 * (2 checks for test_seq1 + 1 check for test_seq2) * 1,000 executions of the script.

The doubling looks like a bug, and you have to wonder why a suitable piece of information isn’t kept in the dictionary cache anyway to allow Oracle to avoid executing the recursive statement. I should point out that the surprisingly large impact is visible because I’ve executed 2,000 separate statements from the client side; mechanisms like array inserts and pl/sql loops might only have to do this check once for each array or pl/sql call. (Tests of this hypothesis are left as exercise for the interested reader.)

 

dense_rank

Filed under: Execution plans,Oracle,Performance,sorting,subqueries — Jonathan Lewis @ 6:42 pm GMT Mar 12,2020

I’ve just been prompted to complete and publish a draft I started a few years ago. It’s (ultimately) about a feature that appeared in 9i but doesn’t seem to show up very often at client sites or as a common solution to performance problems on the various Oracle forums – but maybe that’s not surprising given how slowly analytic functions have been taken up.

I want to work towards the feature by starting with a requirement, then examine several solutions. To supply a touch of realism I’ll create an orders table, which holds a customer id and an order date (including time), ,and then ask for a report of the most recent order for each customer. Here’s some starting data:

rem
rem     Script:         order_sample.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2006
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem             12.1.0.0        Costs are consistent
rem             11.2.0.4        Costs become consistent by 11.2.0.3
rem             11.1.0.7
rem             10.2.0.3
rem              9.2.0.8
rem

create table orders
as
with generator as (
        select
                rownum id 
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        rownum                                                                  id,
        mod(rownum-1,200)                                                       customer,
        sysdate - 20 + dbms_random.value(0,20)                                  date_ordered,
        rpad('x' || to_char(trunc(dbms_random.value(0,1000)),'FM009'),100)      padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid wordpress format issue
;

alter table orders modify customer not null;
alter table orders modify date_ordered not null;
alter table orders add constraint ord_pk primary key(id);

create index ord_cus on orders(customer);
-- create unique index ord_cus_date on orders(customer, date_ordered);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'orders',
                method_opt       => 'for all columns size 1',
                cascade          => true
        );
end;
/

I’ve got 200 customers, at 50 orders per customer dating back over the last 20 days. There’s a primary key on the table and (as it stands) an obvious “foreign key index” on the customer column, though I’ve allowed for changing this to a more useful (customer, date_ordered) combination which I’ve decided could be declared as unique.

With this data, how do I report “the most recent order for each customer”? The first question to ask in response to this request is: “do you literally mean ‘THE’ most recent; what if the customer has placed two or more orders on the same day or, in my initial case, at the same time?” There’s a special case to think about the moment you start to turn the natural language request into a formal language specification.

In this case I’m going to run with the “customer-only” index and allow for the fact that two or more orders could be placed at the same time by the same customer, and report both (all) of them if any such simultaneously placed orders appear.

Strategy number 1:

Start with a list showing the most recent order date for each customer and then report all orders that we can identify using that list of (customer, date_ordered). To do that I’ll start with a simple aggregate query and use the result it produced in an “IN” subquery:


prompt  =========================
prompt  Use IN subquery for max()
prompt  =========================

select  
        /*+ qb_name(main) */
        ord1.* 
from
        orders  ord1
where
        (ord1.customer, ord1.date_ordered) in (
                select  /*+ qb_name(subq) */
                        ord2.customer, max(ord2.date_ordered)
                from
                        orders  ord2
                group by 
                        ord2.customer
        )
order by
        ord1.customer
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

Plan hash value: 1500776991

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |      1 |        |    54 (100)|    200 |00:00:00.01 |     344 |       |       |          |
|   1 |  SORT ORDER BY        |          |      1 |      1 |    54  (15)|    200 |00:00:00.01 |     344 |   115K|   115K|  102K (0)|
|*  2 |   HASH JOIN RIGHT SEMI|          |      1 |      1 |    53  (14)|    200 |00:00:00.01 |     344 |  1695K|  1695K| 1568K (0)|
|   3 |    VIEW               | VW_NSO_1 |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |       |       |          |
|   4 |     HASH GROUP BY     |          |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |  1484K|  1484K| 1421K (0)|
|   5 |      TABLE ACCESS FULL| ORDERS   |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
|   6 |    TABLE ACCESS FULL  | ORDERS   |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ORD1"."CUSTOMER"="CUSTOMER" AND "ORD1"."DATE_ORDERED"="MAX(ORD2.DATE_ORDERED)")

I’ve included the qb_name() hint in both query blocks here – it’s always a good idea as it gives you a little extra edge in interpreting the execution plan when the queries get more complicated.

The first thing you’ll notice about the resulting execution plan is that the optimizer has “unnested” the subquery to create an inline view (which it has named VW_NSO_1) and then used a simple join to get the final result. That’s an interesting observation, and it’s something that will often happen with an “IN” subquery – and that brings us to strategy 2.

Strategy number 2:

Some people will take as gospel the claim that the optimizer “cannot handle subqueries efficiently” and will prefer to write their own inline views (possibly using the “WITH subquery” a.k.a. “Common Table Expression (CTE)” mechanism). There will be occasions, even in the latest versions of Oracle, where you may need to do this but there will also be occasions where the optimizer hasn’t done it because it would produce the wrong results – and I have seen a couple of accidents go into production code where this variant has been written incorrectly.


prompt  ==============================
prompt  Introduce max() as inline view
prompt  ==============================

select  
        /*+ qb_name(main) */
        ord1.* 
from
        (
                select  /*+ qb_name(in_line) */
                        ord2.customer, max(ord2.date_ordered) date_ordered
                from
                        orders  ord2
                group by 
                        ord2.customer
        )       ordv,
        orders  ord1
where
        ord1.customer     = ordv.customer
and     ord1.date_ordered = ordv.date_ordered
order by
        ord1.customer
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

Plan hash value: 2750501889

----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |        |      1 |        |    54 (100)|    200 |00:00:00.01 |     344 |       |       |          |
|   1 |  SORT ORDER BY        |        |      1 |    200 |    54  (15)|    200 |00:00:00.01 |     344 |   115K|   115K|  102K (0)|
|*  2 |   HASH JOIN           |        |      1 |    200 |    53  (14)|    200 |00:00:00.01 |     344 |  1695K|  1695K| 1531K (0)|
|   3 |    VIEW               |        |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |       |       |          |
|   4 |     HASH GROUP BY     |        |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |  1484K|  1484K| 1413K (0)|
|   5 |      TABLE ACCESS FULL| ORDERS |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
|   6 |    TABLE ACCESS FULL  | ORDERS |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ORD1"."CUSTOMER"="ORDV"."CUSTOMER" AND "ORD1"."DATE_ORDERED"="ORDV"."DATE_ORDERED")

You’ll notice, of course, the remarkable similarity between the previous plan and this one – the only significant difference being that the optimimzer used a “plain” hash join here rather than the “hash join right semi” that appeared in the previous plan. The “right semi” is an indication that the optimizer has first transformed your “IN” subquery to an equivalent “EXISTS” (“=ANY”) subquery. Don’t be misled by the “right”, by the way, this isn’t indicating any sort of outer join it’s just trying to let you know which table is the one where Oracle should stop its probe after finding the first row. It is, however, unfortunate that it gets a little awkward trying to sort out left from right when Oracle can do a “swap join inputs” on you.

It would have been nice if the VIEW operatio1n had reported the name of my inline view (to correspond to the generated VW_NSO_1 viewname from the previous plan) – but you if you included the ‘alias’ formatting option in the call to display_cursor() it would have reported the alias ordv@main at operation 3.

Strategy Number 3:

We might have decided to check every row in the table to see if the date in that row was the most recent date for the customer in that row, which we could do by running a correlated subquery to do the check for every row in the table.

prompt  ========================================
prompt  Orders with correlated EQUALITY subquery
prompt  ========================================

select  
        /*+ qb_name(main) */
        ord1.* 
from
        orders  ord1
where
        ord1.date_ordered = (
                select  /*+ qb_name(subq) */
                        max(ord2.date_ordered)
                from
                        orders  ord2
                where
                        ord2.customer = ord1.customer
        )
order by
        ord1.customer
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));


Plan hash value: 1152467146

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |      1 |        |    54 (100)|    200 |00:00:00.01 |     344 |       |       |          |
|   1 |  SORT ORDER BY        |         |      1 |    200 |    54  (15)|    200 |00:00:00.01 |     344 |   115K|   115K|  102K (0)|
|*  2 |   HASH JOIN           |         |      1 |    200 |    53  (14)|    200 |00:00:00.01 |     344 |  1695K|  1695K| 1622K (0)|
|   3 |    VIEW               | VW_SQ_1 |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |       |       |          |
|   4 |     HASH GROUP BY     |         |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |  1484K|  1484K| 1435K (0)|
|   5 |      TABLE ACCESS FULL| ORDERS  |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
|   6 |    TABLE ACCESS FULL  | ORDERS  |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ORD1"."DATE_ORDERED"="MAX(ORD2.DATE_ORDERED)" AND "ITEM_1"="ORD1"."CUSTOMER")

Yet again we end up with the same execution plan (barring the “right semi” issue) but with a different generated name for the unnested subquery. This is an interesting facet of Oracle (and SQL in general) – completely different ways of stating a requirement can end up doing the same work in the same way.

An important corrollary to this observation is that the first thing you should do when you start writing an SQL statement is to write it in a way that clearly expresses the requirement and is easy for others to comprehend. Don’t (at the first stage) try to do anything clever because (a) you may do it wrong and (b) the optimizer might have taken your clear, simple, code and done the clever bit behind the scenes for you.

However, we may have to move on to doing something new (-ish) and exciting.

Strategy number 4:

An “obvious” defect in the three plans so far is that we have to visit the orders table twice. Is there a way we can avoid doing this? The answer is yes. Oracle 8.1.6 gave us the analytic functions:


prompt  =======================
prompt  Analytic max() function
prompt  =======================

column padding noprint
column date_ordered noprint

select
        /*+ qb_name(main) */
        ordv.* 
from    (
        select  /*+ qb_name(inline) */
                customer, id, date_ordered, padding,
                max(date_ordered) over (
                        partition by customer
                ) max_date
        from    orders  ord2
        )       ordv
where
        ordv.date_ordered = ordv.max_date
order by
        ordv.customer
;

select * from table(dbms_xplan.display_cursor(null,null,'cost outline allstats last'));

Plan hash value: 813391662

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |        |      1 |        |   262 (100)|    200 |00:00:00.01 |     172 |       |       |          |
|*  1 |  VIEW               |        |      1 |  10000 |   262   (3)|    200 |00:00:00.01 |     172 |       |       |          |
|   2 |   WINDOW SORT       |        |      1 |  10000 |   262   (3)|  10000 |00:00:00.01 |     172 |  1612K|   624K| 1432K (0)|
|   3 |    TABLE ACCESS FULL| ORDERS |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("ORDV"."DATE_ORDERED"="ORDV"."MAX_DATE")

By adding the analytic max() function I can acquire the necessary “raw” data once and post-process it to find the max(date_ordered) for each customer before discarding all the rows where the row’s date doesn’t match the maximum date. The expression “max(date_ordered) over (partition by customer)” is like a virtual column that tells Oracle to partition the data by customer and find the maximum date within customer. Imagine copying the original data into a spreadsheet, sorting it by customer, then using one of the spreadsheet functions to add an extra column that derives it’s value by looking at the rows that have the same customer as the current row and you’ve got an exact replica of what Oracle is doing here.

So we’ve managed to produce the same result with a single tablescan of orders instead of the two tablescans we saw in every other plan. But there’s a drawback – to be able to partition by customer Oracle has had to fetch every row and column we’re interested in and sort the data before deriving values for the new column: the cost of this plan (262) is much higher than the cost of the plan (54) we got from the previous three queries.

In this case the variation in actual run-time for the two different plans was undetectable – and insignificant compared to the time spent getting the result set to the terminal and displaying. In general, though, you need to consider the trade off between the sorting that goes into the use of analytic functions and the “double visit” work of using subqueries.

Strategy number 5:

There is (at least) one more possibility that I’ve used in the past when the data structure has allowed it to produce the right answers; and it’s the one that is the ultimate target of this blog. Consider the following SQL:


select
        customer, 
        max(id)                 keep (dense_rank last order by date_ordered) as max_id,
        max(date_ordered)       keep (dense_rank last order by date_ordered) as max_date,
--      max(padding)            keep (dense_rank last order by date_ordered) as max_padding
        trim(
                max(padding)    keep (dense_rank last order by date_ordered)
        )       as max_padding
from
        orders
group by
        customer
;

(The trim() function on the padding column doesn’t change the fundamental behaviour of this query, it’s just there to avoid line-wrapping on my output.)

I’ve written a query that does an aggregate on customer, so “customer, max() group by customer”, but it’s a variant of the analytic max() function based on “keep(dense_rank last order by …)” rather then the more familiar “over(partition by … order by …)” form.

Because of the group by customer, the max() function is applied per customer (i.e. behaving like over(partition by customer)), and we’re not actually looking for the maximum value of the referenced column, we’re first ordering by the date_ordered (within customer) applying the dense_rank mechanism, keeping only the rows that have the highest (last) dense_rank, and then taking the maximum of that subset of the data.

Here’s an example applying the combination of mechanisms to a tiny data set:

Raw data
=========
   N1           V1
-----           ---
   93           'C'
   54           'Q',
   43           'A'
   13           'X'
   93           'G'
   54           'W',

Ordered by N1 and dense_rank() appended
========================================
   N1           V1              dr()
-----           ---             ----
   13           'X'             1
   43           'A'             2
   54           'W',            3
   54           'Q',            3
   93           'G'             4
   93           'C'             4

Keep(dense rank last)
=====================
   N1           V1              dr()
-----           ---             ----
   93           'G'             4
   93           'C'             4


max(v1) keep(dense rank last order by n1)
V1
---
'G'

In this tiny example we had cases where there were multiple rows for some of the rankings, but if we go back to our orders table and guarantee (by adding a unique constraint) that a customer will never have more than one row for any one value of date_ordered, then the expression max(id) keep (dense_rank last order by date_ordered) for any one customer will be the id of the row that has the maximum order date for that customer and, similarly, max(date_ordered) keep(…), and max(padding) keep (,,,) will also be the values from that same row.

Given the (absolutely critical) uniqueness constraint, we can get the data for the most recent for the customer using this dense_rank() strategy.

The question, of course, is why would we do something that may not be entirely intuitive and looks as if it could make Oracle do a lot of extra work to get the result. Here’s the answer – which is just the execution plan for the query on my orders table – with the unique constraint added:


-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |    28 (100)|    200 |00:00:00.01 |     172 |       |       |          |
|   1 |  SORT GROUP BY     |        |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |   142K|   142K|  126K (0)|
|   2 |   TABLE ACCESS FULL| ORDERS |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

The path uses a basic SORT GROUP BY, that “sorts” only 200 rows (A-rows) using only 126KB of memory. Compare that with the plan for the analytic max() over() in strategy 4 that takes 1.6MB of memory and sorts 10,000 rows and you’ll appreciate that the keep(dense_rank last) mechanism is doing something much more efficient. For cases where the drop from “num_rows” to “num_distinct” for the aggregating column(s) the benefit of using the somewhat non-intuitive dense_rank() approach may make a significant difference to memory, CPU, and even (if it avoids a spill to disk) I/O.

Footnotes

There are two major variations on how you can use the dense_rank() function, as well as this form in which dense_rank appears in the KEEP LAST (and FIRST) mechanism.

Remember the absolutely critical point that the “keep dense_rank last” strategy is only correct if there is a suitable unique constraint on the data set viz: unique on ({group by column(s)},{order by column(s)}).

There is another major option for getting the same “most recent” rows, which is to use the match_recognize() functionality, but I think I probably wrote this draft before the mechanism even existed – so it’s left as an exercise to the reader to work it out.  A key reason why I’m not going to do it myself is that (like the analytic over() in strategy 4) it will require all 10,000 rows to be sorted, and is therefore likely to be less efficient than strategy 5.

Finally – I thought I’d written a note explaining why a “sort group by” can use far less memory and CPU then a basic “sort order by”, but if I have it wasn’t on this blog.  I do have a note on how the mechanism to handle “rownum <= N” with a preceding “order by” minimises its use of memory, and that note should give you some idea of what the “sort group by” is doing to minimise memory usage. I’ll try to write a little note on the aggregate mechanism some time over the next few days.

 

 

February 25, 2020

count(*) – again

Filed under: Execution plans,Oracle,Performance — Jonathan Lewis @ 1:24 pm GMT Feb 25,2020

I’ve just received an email asking (yet again) a question about counting the number of rows in a table.

We have a large table with a CLOB column, where the CLOB occupies 85% storage space.
when we use select count(*) from , the DBA says that you should not use count(*) as it uses all columns and as this table contains CLOB it results in high CPU usage, where as if we use count(rowid), this brings us faster and same result.

Well I’ve pointed out in the past, in fact more than once, that count(*), count(1), count(declared-non-null-column) will all do the same thing … execute as count(*); I’ve also listed a few quirky anomalies, also more than once. However, count(rowid) is a little different, it doesn’t get transformed to count(*) as we can see from two pieces of evidence:

Exhibit A: fragments from a 10053 (CBO) trace file

----- Current SQL Statement for this session (sql_id=d381q70418ugs) -----
select count(rowid) from emp1

... 

Final query after transformations:******* UNPARSED QUERY IS *******
SELECT COUNT("EMP1".ROWID) "COUNT(ROWID)" FROM "TEST_USER"."EMP1" "EMP1"

Unlike the various count() calls that are converted to count(*), counting rowids doesn’t seem to go througn the CNT transformation and the final state of the query still shows count(rowid) as the critical mechanism.

Exhibit B: variations in Column Projection Information

SQL> explain plan for select count(1) from emp1;

SQL> select * from table(dbms_xplan.display(null,null,'projection'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------
Plan hash value: 2110459082

-----------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Cost (%CPU)| Time     |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |     7  (15)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| E1_PK | 20000 |     7  (15)| 00:00:01 |
-----------------------------------------------------------------------

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=0) COUNT(*)[22]

SQL> explain plan for select count(rowid) from emp1;

SQL> select * from table(dbms_xplan.display(null,null,'projection'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------
Plan hash value: 2110459082

-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    12 |     7  (15)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    12 |            |          |
|   2 |   INDEX FAST FULL SCAN| E1_PK | 20000 |   234K|     7  (15)| 00:00:01 |
-------------------------------------------------------------------------------

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=0) COUNT(ROWID)[22]
   2 - ROWID[ROWID,10]

The count(1) query transforms to count(*) and we don’t see any column information begin generated at operation 2 and passing up to the parent operation. However with count(rowid) we see Oracle at operation 2 constructing actual rowids from block addresses and row directory entries and passing them up to the the parent operation.

The CPU it takes to construct and pass the rowids is probably insignificant compared to the CPU usage of accessing data blocks in the first place so I doubt if there would be much visible difference in clock-time between count(1) (or count(*)) and count(rowid), but technically it looks as if count(rowid) would actually be slower and more resource-intensive than anything that went through the count(*) transformation.

In passing – the execution plan that appears in our count(1) example also tells us that count(*) doesn’t “use all columns” – after all, there aren’t many tables where every column is in the primary key and emp1 is no exception to the general rule, and the plan is doing an index fast full scan of the primary key index.

 

December 16, 2019

IOT Bug

Filed under: Bugs,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 3:58 pm GMT Dec 16,2019

Here’s a worrying bug that showed up a couple of days ago on the Oracle-L mailing list. It’s a problem that I’ve tested against 12.2.0.1 and 19.3.0.0 – it may be present on earlier versions of Oracle. One of the nastiest things about it is that you might not notice it until you get an “out of space” error from the operating system. You won’t get any wrong results from it, but it may well be adding an undesirable performance overhead.

Basically it seems that (under some circumstances, at least) Oracle is setting the “block guess” component of the secondary index on Index Organized Tables (IOTs) to point to blocks in the overflow segment instead of blocks in the primary key segment. As a result, when you execute a query that accesses the IOT through the secondary index and has to do reads from disc to satisfy the query – your session goes through the following steps:

  • Identify index entry from secondary index – acquire “block guess”
  • Read indicated block and discover the object number on the block is wrong, and the block type is wrong
  • Write a (silent) ORA-01410 error and do a block dump into the trace file
  • Use the “logical rowid” from the secondary index (i.e. the stored primary key value) to access the primary key index by key value

So your query runs to completion and you get the right result because Oracle eventually gets there using the primary key component stored in the secondary index, but it always starts with the guess[see sidebar] and for every block you read into the cache because of the guess you get a dump to the trace file.

Here’s a little code to demonstrate. The problem with this code is that everything appears to works perfectly, you have to be able to find the trace file for your session to see what’s gone wrong. First we create some data – this code is largely copied from the original posting on Oracle-L, with a few minor changes:


rem
rem     Script:         iot_bug_12c.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem
rem     Notes
rem     THe OP had tested on 19.5.0.0 to get the same effect, see:
rem     //www.freelists.org/post/oracle-l/IOT-cannot-get-valid-consensus-bug-or-unexplained-behavio
rem

drop table randomload purge;

create table randomload(
        roll number,
        name varchar2(40),
        mark1 number,
        mark2 number,
        mark3 number,
        mark4 number,
        mark5 number,
        mark6 number,
        primary key (roll)
) 
organization index 
including mark3 overflow
;

create index randomload_idx on randomload(mark6);

insert into randomload 
select 
        rownum, 
        dbms_random.string(0,40) name, 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,10000)) 
from 
        dual 
connect by 
        level < 1e5 -- > comment to avoid wordpress format issue
;

commit;

exec dbms_stats.gather_table_stats(null,'randomload', cascade=>true);

prompt  ==================================================
prompt  pct_direct_access should be 100 for randomload_idx
prompt  ==================================================

select 
        table_name, index_name, num_rows, pct_direct_access, iot_redundant_pkey_elim  
from 
        user_indexes
where
        table_name = 'RANDOMLOAD'
;

It should take just a few seconds to build the data set and you should check that the pct_direct_access is 100 for the index called randomload_idx.

The next step is to run a query that will do an index range scan on the secondary index.

 
column mark6 new_value m_6

select 
        mark6, count(*) 
from
        randomload 
group by 
        mark6
order by 
        count(*)
fetch first 5 rows only
;

alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 8';
set serveroutput off

select avg(mark3) 
from 
        randomload 
where 
        mark6 = &m_6
;

select * from table(dbms_xplan.display_cursor);

alter session set events '10046 trace name context off';
set serveroutput on

I’ve started by selecting one of the least frequencly occuring values of m_6 (a column I know to be in the overflow); then I’ve flushed the buffer cache so that any access I make to the data will have to start with disk reads (the original poster suggested restarting the database at this point, but that’s not necessary).

Then I’ve enabled sql_trace to show wait states (to capture details of what blocks were read and which object they belong to),, and I’ve run a query for m_3 (a column that is in the primary key (TOP) segment of the IOT) and pulled its execution plan from memory to check that the query did use a range scan of the secondary index. Here’s the plan:

----------------------------------------------------------------------------------------
| Id  | Operation          | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |       |       |    11 (100)|          |
|   1 |  SORT AGGREGATE    |                   |     1 |     7 |            |          |
|*  2 |   INDEX UNIQUE SCAN| SYS_IOT_TOP_77298 |    10 |    70 |    11   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN| RANDOMLOAD_IDX    |    10 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("MARK6"=1316)
   3 - access("MARK6"=1316)

As you can see the plan shows what we are hoping to see – an index range scan of the secondary index that let’s it follow up with a unique scan of the primary key segment. It’s just a little odd that the access predicate reported for operation 2 (unique scan of TOP) suggests that the access is on a column that isn’t in the primary key and isn’t even in the TOP section.

So the query works and gives the right answer. But what do we find in the trace directory ? If you’re running 12c (possibly only 12.2), each time the error occurs the following pattern of information will be written to the alert log (it didn’t appear in 19.3)


ORCL(3):Hex dump of (file 22, block 16747) in trace file /u01/app/oracle/diag/rdbms/orcl12c/orcl12c/trace/orcl12c_ora_7888.trc
ORCL(3):
ORCL(3):Corrupt block relative dba: 0x0580416b (file 22, block 16747)
ORCL(3):Bad header found during multiblock buffer read (logical check)
ORCL(3):Data in bad block:
ORCL(3): type: 6 format: 2 rdba: 0x0580416b
ORCL(3): last change scn: 0x0000.0b86.0e36484c seq: 0x1 flg: 0x06
ORCL(3): spare3: 0x0
ORCL(3): consistency value in tail: 0x484c0601
ORCL(3): check value in block header: 0x4408
ORCL(3): computed block checksum: 0x0
ORCL(3):

And the following pattern of information is written to the trace file [Update: a follow-up test on 11.2.0.4 suggests that the basic “wrong block address” error also happens in that version of Oracle, but doesn’t result in a dump to the trace file]:


kcbzibmlt:: encounter logical error ORA-1410, try re-reading from other mirror..
cursor valid? 1 makecr 0 line 15461 ds_blk (22, 16747) bh_blk (22, 16747)
kcbds 0x7ff1ca8c0b30: pdb 3, tsn 8, rdba 0x0580416b, afn 22, objd 135348, cls 1, tidflg 0x8 0x80 0x0
    dsflg 0x108000, dsflg2 0x0, lobid 0x0:0, cnt 0, addr 0x0, exf 0x10a60af0, dx 0x0, ctx 0
    whr: 'qeilwh03: qeilbk'
env [0x7ff1ca8e3e54]: (scn: 0x00000b860e364893   xid: 0x0000.000.00000000  uba: 0x00000000.0000.00  statement num=0  parent xid:  0x0000.000.00000000  st-scn: 0x0000000000000000  hi-scn: 0x0000000000000000  ma-scn: 0x00000b860e364879  flg: 0x00000660)
BH (0xb1fd6278) file#: 22 rdba: 0x0580416b (22/16747) class: 1 ba: 0xb1c34000
  set: 10 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 763,14
  dbwrid: 0 obj: 135348 objn: 135348 tsn: [3/8] afn: 22 hint: f
  hash: [0x9eff0528,0x77cff808] lru: [0xb1fd2578,0x9ff84658]
  ckptq: [NULL] fileq: [NULL]
  objq: [0xb6f654c0,0x9ff84680] objaq: [0xb6f654d0,0x9ff84690]
  use: [0x77b78128,0x77b78128] wait: [NULL]
  st: READING md: EXCL tch: 0
  flags: only_sequential_access
  Printing buffer operation history (latest change first):
  cnt: 5
  01. sid:10 L122:zgb:set:st          02. sid:10 L830:olq1:clr:WRT+CKT
  03. sid:10 L951:zgb:lnk:objq        04. sid:10 L372:zgb:set:MEXCL
  05. sid:10 L123:zgb:no:FEN          06. sid:10 L083:zgb:ent:fn
  07. sid:08 L192:kcbbic2:bic:FBD     08. sid:08 L191:kcbbic2:bic:FBW
  09. sid:08 L604:bic2:bis:REU        10. sid:08 L190:kcbbic2:bic:FAW
  11. sid:08 L602:bic1_int:bis:FWC    12. sid:08 L822:bic1_int:ent:rtn
  13. sid:08 L832:oswmqbg1:clr:WRT    14. sid:08 L930:kubc:sw:mq
  15. sid:08 L913:bxsv:sw:objq        16. sid:08 L608:bxsv:bis:FBW
Hex dump of (file 22, block 16747)

   ... etc.

Corrupt block relative dba: 0x0580416b (file 22, block 16747)
Bad header found during multiblock buffer read (logical check)
Data in bad block:
 type: 6 format: 2 rdba: 0x0580416b
 last change scn: 0x0000.0b86.0e36484c seq: 0x1 flg: 0x06
 spare3: 0x0
 consistency value in tail: 0x484c0601
 check value in block header: 0x4408
 computed block checksum: 0x0
TRCMIR:kcf_reread     :start:  16747:0:/u01/app/oracle/oradata/orcl12c/orcl/test_8k_assm.dbf
TRCMIR:kcf_reread     :done :  16747:0:/u01/app/oracle/oradata/orcl12c/orcl/test_8k_assm.dbf

The nasty bit, of course, is the bit I’ve removed and replaced with just “etc.”: it’s a complete block dump (raw and symbolic) which in my example was somthing like 500 lines and 35KB in size.

It’s not immediately obvious exactly what’s going on and why, but the 10046 trace helps a little. From another run of the test (on 19.3.0.0) I got the following combination of details – which is an extract showing the bit of the wait state trace leading into the start of the first block dump:

WAIT #140478118667016: nam='db file scattered read' ela= 108 file#=13 block#=256 blocks=32 obj#=77313 tim=103574529210
WAIT #140478118667016: nam='db file scattered read' ela= 2236 file#=13 block#=640 blocks=32 obj#=77313 tim=103574531549
WAIT #140478118667016: nam='db file scattered read' ela= 534 file#=13 block#=212 blocks=32 obj#=77312 tim=103574532257
kcbzibmlt: encounter logical error ORA-1410, try re-reading from other mirror..
cursor valid? 1 warm_up abort 0 makecr 0 line 16082 ds_blk (13, 212) bh_blk (13, 212)

Object 77313 is the secondary index, object 77312 is the primary key index (IOT_TOP). It may seem a little odd that Oracle is using db file scattered reads of 32 blocks to read the indexes but this is a side effect of flushing the buffer – Oracle may decide to prefeetch many extra blocks of an object to “warmup” the cache just after instance startup or a flush of the buffer cache. The thing I want to check, though, is what’s wrong with the blocks that Oracle read from object 77312:


alter system dump datafile 13 block min 212 block max 243;

BH (0xc8f68e68) file#: 13 rdba: 0x034000d4 (13/212) class: 1 ba: 0xc8266000
  set: 10 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77311 objn: 77311 tsn: [3/6] afn: 13 hint: f

BH (0xa7fd6c38) file#: 13 rdba: 0x034000d4 (13/212) class: 1 ba: 0xa7c2a000
  set: 12 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77311 objn: 77311 tsn: [3/6] afn: 13 hint: f

BH (0xa5f75780) file#: 13 rdba: 0x034000d5 (13/213) class: 0 ba: 0xa5384000
  set: 11 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77311 objn: 77311 tsn: [3/6] afn: 13 hint: f

BH (0xdafe9220) file#: 13 rdba: 0x034000d5 (13/213) class: 1 ba: 0xdadcc000
  set: 9 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77311 objn: 77311 tsn: [3/6] afn: 13 hint: f

I’ve reported the first few lines of the symbolic dump for the first few blocks of the resulting trace file. Look at the third line of each group of three BH lines: it’s reporting object 77311 (the overflow segment), not 77312 (the TOP segment). And every single block reported in the db file scattered read of 32 blocks for object 77312 reports itself, when dumped, as being part of object 77311. And that’s possibly the immediate cause of the ORA-01410.

We can take the investigation a little further by dumping a leaf block or two from the secondary index.


alter session set events 'immediate trace name treedump level 77313';

----- begin tree dump
branch: 0x3400104 54526212 (0: nrow: 542, level: 1)
   leaf: 0x340010d 54526221 (-1: row:278.278 avs:2479)
   leaf: 0x340075e 54527838 (0: row:132.132 avs:5372)
   leaf: 0x34005fb 54527483 (1: row:41.41 avs:7185)

alter system dump datafile 13 block 1886   -- leaf: 0x340075e

BH (0xd5f5d090) file#: 13 rdba: 0x0340075e (13/1886) class: 1 ba: 0xd5158000
  set: 9 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77313 objn: 77313 tsn: [3/6] afn: 13 hint: f
...
row#6[5796] flag: K------, lock: 0, len=18
col 0; len 2; (2):  c1 1d
col 1; len 4; (4):  c3 07 41 5c
tl: 8 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 4]  03 40 05 7c

I’ve done a treedump of the secondary index and picked a leaf block address from the treedump and dumped that leaf block, and from that leaf block I’ve extracted one index entry to show you the three components: the key value (c1 1d), the primary key for the row (c3 07 41 5c), and the block guess (03 40 05 75). Read the block guess as a 4 byte hex number, and it translates to file 13, block 1397 – which should belong to the TOP segment. So the exciting question is – what object does block (13, 1397) think it belongs to ?


alter system dump datafile 13 block 1397;

Block header dump:  0x03400575
 Object id on Block? Y
 seg/obj: 0x12dff  csc:  0x00000b860e308c46  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x3400501 ver: 0x01 opc: 0
     inc: 0  exflg: 0

Converting from Hex to Decimal: obj: 0x12dff = 77311 which is the overflow segment. The secondary index block guess is pointing at a block in the overflow segment.

There are two ways to handle this problem – you could simply rebuild the index (alter index rebuild) or, as the original poster did, use the “update block references” command to correct all the block guesses: “alter index randomload_idx update block references;”. Neither is desirable, but if you’re seeing a lot of large trace files following the pattern above then it may be necessary.

There was one particular inconsistency in the tests – which I ran many times – occasionally the pct_direct_access for the secondary index would be reported as zero (which, technically, should always happen given the error).  If it did, of course, Oracle wouldn’t follow the guess but would go straight to the step where it used the primary key “logical rowid” – thus bypassing the error and block dump.

tl;dr

In some circumstances the block guesses in the secondary indexes of IOTs may be pointing to the overflow segment instead of the primary key (TOP) segment. If this happens then queries will still run and give the right answers, but whenever they read a “guessed” block from disc they will report an ORA-01410 error and dump a block trace. This will affect performance and may cause space problems at the O/S level.

Sidebar

An entry in the secondary index of an Index Organized Table (IOT) consists of three parts, which intially we can think in the form:

({key-value}, {logical rowid}, {block guess})

Since IOTs don’t have real rowids the “logical rowid” is actually the primary key of the row where the {key value} will be found. As a short cut for efficient execution Oracle includes the block address (4 bytes) where that primary key value was stored when the row was inserted. Because an IOT is an index “rows” in the IOT can move as new data is inserted and leaf blocks split, so eventually any primary key may move to a different block – this is why we refer to the block address as a guess – a few days, hours, or minutes after you’ve inserted the row the block address may no longer be correct.)

To help the runtime engine do the right thing Oracle collects a statistic called pct_direct_access for secondary indexes of IOTs. This is a measure of what percentage of the block guesses are still correct at the time that the statistics are gathered. If this value is high enough the run-time engine will choose to try using the block guesses while executing a query (falling back to using the logical rowid if it turns out that the guess is invalid), but if the value drops too low the optimizer will ignore the block guesses and only use the logical rowid.

Not relevant to this note – but a final point about secondary indexes and logical rowids – if the definition of the index includes  some of the columns from the primary keys Oracle won’t store those columns twice (in more recent version, that is) – the code is clever enough to use the values stored in the (key value) component when it needs to use the (logical rowid) component.

 

Update (Jan 2020)

I passed this example on to Oracle, and there are now two bugs recorded for it:

  • Bug 30733525 – ALERT LOG ENTRIES RE BLOCK GUESSES IN THE SECONDARY INDEXES OF IOTS POINTING TO OVERFLOW SEGMENT INSTEAD OF INDEX SEGMENT
  • Bug 30733563 – WRONG GUESS DBA IN INDEX

 

 

 

December 6, 2019

Temp space

Filed under: Execution plans,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 12:18 pm GMT Dec 6,2019

A question about hunting down the source of the error “ORA-01652 unable to extend temp segment by NNN in tablespace XXX” shows up on the Oracle-L mailing list or the Oracle developer community forum from time to time. In most cases the tablespace referenced is the temporary tablespace, which means the session reporting the error was probably trying to allocate some space for sorting, or doing a hash join, or instantiating a GTT (global temporary table) or a CTE (common table expression / “with” subquery). The difficulty in cases like this is that the session reporting the error might be the victim of some other session’s greed – so looking at what the session was doing won’t necessarily point you to the real problem.

Of course you then run into a further problem tracking down the source of the problem. By the time you hear the complaint (even if it’s only seconds after the error appeared) the session that had been hogging the temporary tablespace may have finished what it was doing, leaving a huge amount of free space in the temporary tablespace and suggesting (to the less experienced and cynical user) that there’s something fundamentally wrong with the way Oracle has been accounting for space usage.

If you find yourself in this situation remember that (if you’re licensed to take advantage of it) the active session history may be able to help.  One of the columns in v$active_session_history is called temp_space_allocated with the slightly misleading description: “Amount of TEMP memory (in bytes) consumed by this session at the time this sample was taken”. A simple query against v$active_session_history may be enough to identify the session and SQL  statement that had been holding the temporary space when the error was raised, for example:


column pga_allocated        format 999,999,999,999
column temp_space_allocated format 999,999,999,999

break on session_id skip 1 on session_serial#

select
        session_id, session_serial#, 
        sample_id, 
        sql_id, 
        pga_allocated,
        temp_space_allocated
from
        v$active_session_history
where
        sample_time between sysdate - 5/1440 and sysdate
and     nvl(temp_space_allocated,0) != 0
order by
        session_id, sample_id
/

All I’ve done for this example is query v$active_session_history for the last 5 minutes reporting a minimum of information from any rows that show temp space usage. As a minor variation on the theme you can obviously change the time range, and you might want to limit the output to rows reporting more than 1MB (say) of temp space usage.

You’ll notice that I’ve also reported the pga_allocated (Description: Amount of PGA memory (in bytes) consumed by this session at the time this sample was taken) in this query; this is just a little convenience – a query that’s taking a lot of temp space will probably start by acquiring a lot of memory so it’s nice to be able to see the two figures together.

There are plenty of limitations and flaws in the usefulness of this report and I’ll say something about that after showing an example of usage. Let’s start with a script to build some data before running a space-consuming query:


rem
rem     Script:         allocate_tempspace.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table t1 as 
select * from all_objects
;

insert into t1 select * from t1;
commit;

insert into t1 select * from t1;
commit;

insert into t1 select * from t1;
commit;

insert into t1 select * from t1;
commit;

execute dbms_stats.gather_table_stats(null,'t1')

execute dbms_lock.sleep(20)

set pagesize  60
set linesize 255
set trimspool on
set serveroutput off
alter session set statistics_level = all;

with ttemp as (
        select /*+ materialize */ * from t1
)
select 
        /*+ no_partial_join(@sel$2 t1b) no_place_group_by(@sel$2) */ 
        t1a.object_type,
        max(t1a.object_name)
from
        ttemp t1a, ttemp t1b
where
        t1a.object_id = t1b.object_id
group by
        t1a.object_type
order by
        t1a.object_type
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

My working table t1 consists of 16 copies of the view all_objects – so close to 1 million rows in my case – and the query is hinted to avoid any of the clever transformations that the optimizer could use to reduce the workload so it’s going to do a massive hash join and aggregation to report a summary of a couple of dozen rows. Here’s the execution plan (in this case from 12.2.0.1, though the plan is the same for 19.3 with some variations in the numbers).


SQL_ID  1cwabt12zq6zb, child number 0
-------------------------------------

Plan hash value: 1682228242

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |      1 |        |     29 |00:00:10.03 |   47413 |  21345 |  12127 |       |       |          |         |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |      1 |        |     29 |00:00:10.03 |   47413 |  21345 |  12127 |       |       |          |         |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D665B_E2772D3 |      1 |        |      0 |00:00:01.51 |   28915 |      0 |   9217 |  2068K|  2068K|          |         |
|   3 |    TABLE ACCESS FULL                     | T1                         |      1 |    989K|    989K|00:00:00.24 |   19551 |      0 |      0 |       |       |          |         |
|   4 |   SORT GROUP BY                          |                            |      1 |     29 |     29 |00:00:08.51 |   18493 |  21345 |   2910 |  6144 |  6144 | 6144  (0)|         |
|*  5 |    HASH JOIN                             |                            |      1 |     15M|     15M|00:00:03.93 |   18493 |  21345 |   2910 |    48M|  6400K|   65M (1)|   25600 |
|   6 |     VIEW                                 |                            |      1 |    989K|    989K|00:00:00.36 |    9233 |   9218 |      0 |       |       |          |         |
|   7 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D665B_E2772D3 |      1 |    989K|    989K|00:00:00.35 |    9233 |   9218 |      0 |       |       |          |         |
|   8 |     VIEW                                 |                            |      1 |    989K|    989K|00:00:00.40 |    9257 |   9217 |      0 |       |       |          |         |
|   9 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D665B_E2772D3 |      1 |    989K|    989K|00:00:00.39 |    9257 |   9217 |      0 |       |       |          |         |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1A"."OBJECT_ID"="T1B"."OBJECT_ID")

Critically this plan shows us two uses of the temp space but only reports one of them as Used-Tmp. The “hash join” at operation 5 tells us that it reached 65MB of (tunable PGA) memory before going “1-pass”, eventually dumping 25,600 KB to disc. This space usage is corroborated by the 2,910 writes (which, at an 8KB block size, would be 23,280 KB). The missing Used-Tmp, however, is the space taken up by the materialized CTE. We can see that operation 2 is a “load as select” that writes 9,217 blocks to disc (subsequently read back twice – the tablescans shown in operations 7 and 9). That’s  74,000 KB of temp space that doesn’t get reported Used-Tmp.

If we take a look at the plan from 19.3 we see different numbers, but the same “error of omission”:

SQL_ID  1cwabt12zq6zb, child number 0
-------------------------------------

Plan hash value: 1682228242

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |      1 |        |     25 |00:00:08.15 |   34905 |  13843 |   8248 |       |       |          |         |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |      1 |        |     25 |00:00:08.15 |   34905 |  13843 |   8248 |       |       |          |         |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6624_E259E68 |      1 |        |      0 |00:00:01.26 |   23706 |      0 |   5593 |  2070K|  2070K|          |         |
|   3 |    TABLE ACCESS FULL                     | T1                         |      1 |    907K|    907K|00:00:00.21 |   18024 |      0 |      0 |       |       |          |         |
|   4 |   SORT GROUP BY                          |                            |      1 |     25 |     25 |00:00:06.89 |   11193 |  13843 |   2655 |  6144 |  6144 | 6144  (0)|         |
|*  5 |    HASH JOIN                             |                            |      1 |     14M|     14M|00:00:03.55 |   11193 |  13843 |   2655 |    44M|  6400K|   64M (1)|      23M|
|   6 |     VIEW                                 |                            |      1 |    907K|    907K|00:00:00.26 |    5598 |   5594 |      0 |       |       |          |         |
|   7 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6624_E259E68 |      1 |    907K|    907K|00:00:00.25 |    5598 |   5594 |      0 |       |       |          |         |
|   8 |     VIEW                                 |                            |      1 |    907K|    907K|00:00:00.34 |    5595 |   5594 |      0 |       |       |          |         |
|   9 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6624_E259E68 |      1 |    907K|    907K|00:00:00.33 |    5595 |   5594 |      0 |       |       |          |         |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1A"."OBJECT_ID"="T1B"."OBJECT_ID")


With slightly fewer rows in t1 (907K vs. 989K) we write 5,593 blocks for the materialized CTE  (instead of 9,217) and spill 2,655 blocks during the hash join (instead of 2,910). But again it’s only the hash join spill that is reported under Used-Tmp. Note, by the way, that the Used-Tmp in 12.2 was reported in KB when it’s reported in MB in 19.3.0.0.

Side note: comparing the number of rows created and blocks written for the CTE, it looks as if 19.3 is using the data blocks much more efficiently than 12.2. There’s no obvious reason for this (though a first guess would be that the older mechanism is to write a GTT with pctfree=10 while the new avoid any free space and transactional details) so, as ever, I now have another draft for a blog note reminding me to investigate (eventually) what differences there are in CTE storage on the upgrade. It’s something that might make a difference in a few special cases.

With the figures from the execution plans in mind we can now look at the results of the query against v$active_session_history. Conveniently the queries took a few seconds to complete, so we’re going to see several rows for each execution.

First the results from 12.2.0.1

SESSION_ID SESSION_SERIAL#  SAMPLE_ID SQL_ID           PGA_ALLOCATED TEMP_SPACE_ALLOCATED
---------- --------------- ---------- ------------- ---------------- --------------------
        14           22234   15306218 1cwabt12zq6zb       95,962,112            1,048,576
                             15306219 1cwabt12zq6zb       97,731,584           37,748,736
                             15306220 1cwabt12zq6zb      148,194,304           77,594,624
                             15306221 1cwabt12zq6zb      168,117,248           85,983,232
                             15306222 1cwabt12zq6zb      168,117,248           90,177,536
                             15306223 1cwabt12zq6zb      168,117,248           95,420,416
                             15306224 1cwabt12zq6zb      168,117,248           98,566,144
                             15306225 1cwabt12zq6zb      168,117,248          102,760,448
                             15306226 1cwabt12zq6zb      116,933,632          103,809,024
                             15306227 1cwabt12zq6zb      116,933,632          103,809,024
                             15306228 b66ycurnwpgud        8,602,624            1,048,576

I pointed out that we had 25,600 KB reported as Used-Tmp and roughly 74,000 KB unreported – a total of nearly 100,000 KB that is reasonably close to the 103,800,000 bytes reported by ASH. Moreover the timing of the plan (loading the CTE in the first 2 seconds) seems to agree with the growth to 77,590,000 of temp_space_allocated by the time we get to sample_id 15306220 in ASH. Then we have several seconds of slow growth as the hash join takes place and feeds its resulte up to the sort group by. At the end of the query we happen to have been lucky enough to catch one last sample just before the session had released all its temp space and ceased to be active.  (Note: however, that the sql_id at that sample point was not the sql_id of our big query – and that’s a clue about one of the limitations of using ASH to find the greedy SQL.)

We see the same pattern of behaviour in 19.3.0.0:


SESSION_ID SESSION_SERIAL#  SAMPLE_ID SQL_ID           PGA_ALLOCATED TEMP_SPACE_ALLOCATED
---------- --------------- ---------- ------------- ---------------- --------------------
       136           42767    2217500 1cwabt12zq6zb      143,982,592           46,137,344
                              2217501 1cwabt12zq6zb      193,527,808           54,525,952
                              2217502 1cwabt12zq6zb      193,527,808           57,671,680
                              2217503 1cwabt12zq6zb      193,527,808           61,865,984
                              2217504 1cwabt12zq6zb      197,722,112           67,108,864
                              2217505 1cwabt12zq6zb      150,601,728           70,254,592
                              2217506 1cwabt12zq6zb      150,601,728           70,254,592

We start with an almost instantaneous jump to 46MB of temp_space_allocated in the first second of the query – that’s the 5,593 blocks of the CTE being materialized, then the slow growth of temp space as the hash join runs, spills to disc, and passes its data up to the sort group by. Again we can see that the peak usage was the CTE (46MB) plus the reported spill of 23MB (plus rounding errors and odd bits).

Preliminary Observations

Queries against ASH (v$active_session_history) can show us sessions that were holding space in the temporary tablespace at the moment a sample of active sessions was taken. This may allow us to identify greedy sessions that were causing other sessions to fail with ORA-01652 (unable to allocate temp segment).

We have seen that there is at least one case where we get better information about temp space allocation from ASH than we do from the variants on v$sql_plan that include the SQL Workarea information (v$sql_workarea, v$sql_workarea_active) because the space acquired during materialization of CTEs is not reported as a “tunable SQL workarea” but does appear in the ASH temp_space_allocated.

At first sight it looks as if we may be able to use the query against ASH to identify the statement (by sql_id) that was the one being run by the greedy session when it consumed all the space. As we shall see in a further article, there are various reasons why this may over-optimistic, however in many cases there’s a fair chance that when you see the same sql_id appearing in a number of consecutive rows of the report then that statement may be the thing that is responsible for the recent growth in temp space usage – and you can query v$sql to find the text and call dbms_xplan.display_cursor() to get as much execution plan information as possible.

Further questions

  • When does a session release the temp_space_allocated? Will the space be held (locked) as long as the cursor is open, or can it be released when it is no longer needed? Will it be held, but releasable, even after the cursor has (from the client program’s perspective) been closed?
  • Could we be fooled by a report that said a session was holding a lot of space when it didn’t need it and would have released it if the demand had appeared?
  • Under what conditions might the temp_space_allocated in an ASH sample have nothing to do with the sql_id reported in the same sample?
  • Are there any other reasons why ASH might report temp_space_allocated when an execution plan doesn’t?
  • Is temp_space_allocated only about the temporary tablespace, or could it include informatiom about other (“permanent”) tablespaces ?

Stay tuned for further analysis of the limitations of using v$active_session_history.temp_space_allocated to help identify the srouce of a space management ORA-01652 issue.

 

 

October 11, 2019

v$session

Filed under: Execution plans,Infrastructure,Oracle,Performance — Jonathan Lewis @ 12:29 pm BST Oct 11,2019

Here’s an odd, and unpleasant, detail about querying v$session in the “most obvious” way. (And if you were wondering what made me resurrect and complete a draft on “my session id” a couple of days ago, this posting is the reason). Specifically if you want to select some information for your own session from v$session the query you’re likely to use in any recent version of Oracle will probably be of the form:


select {list for columns} from v$session where sid = to_number(sys_context('userenv','sid'));

Unfortunately that one little statement hides two anomalies – which you can see in the execution plan. Here’s a demonstration cut from an SQL*Plus session running under 19.3.0.0:


SQL> select * from table(dbms_xplan.display_cursor);

SQL_ID  gcfrzq9knynj3, child number 0
-------------------------------------
select program from V$session where sid = sys_context('userenv','sid')

Plan hash value: 2422122865

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  MERGE JOIN CARTESIAN     |                 |     1 |    33 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    12 |     0   (0)|
|*  3 |    FIXED TABLE FULL       | X$KSLWT         |     1 |     8 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSLED (ind:2) |     1 |     4 |     0   (0)|
|   5 |   BUFFER SORT             |                 |     1 |    21 |     0   (0)|
|*  6 |    FIXED TABLE FULL       | X$KSUSE         |     1 |    21 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------ -------------------------------------------
   3 - filter("W"."KSLWTSID"=TO_NUMBER(SYS_CONTEXT('userenv','sid')))
   4 - filter("W"."KSLWTEVT"="E"."INDX")
   6 - filter((BITAND("S"."KSUSEFLG",1)<>0 AND BITAND("S"."KSSPAFLG",1)<>0 AND 
              "S"."INDX"=TO_NUMBER(SYS_CONTEXT('userenv','sid'))
              AND INTERNAL_FUNCTION("S"."CON_ID") AND "S"."INST_ID"=USERENV('INSTANCE')))

As you can see, v$session is a join of 3 separate structures – x$kslwt (v$session_wait), x$ksled (v$event_name), and x$ksuse (the original v$session as it was some time around 8i), and the plan shows two “full tablescans” and a Cartesian merge join. Tablescans and Cartesian merge joins are not necessarily bad – especially where small tables and tiny numbers of rows are concerned – but they do merit at least a brief glance.

x$ksuse is a C structure in the fixed SGA and that structure is a segmented array (which seems to be chunks of 126 entries in 19.3, and chunks of 209 entries in 12.2 – but that’s fairly irrelevant). The SID is simply the index into the array counting from 1, so if you have a query with a predicate like ‘SID = 99’ Oracle can work out the address of the 99th entry in the array and access it very quickly – which is why the SID column is reported as a “fixed index” column in the view v$indexed_fixed_column.

But we have two problems immediately visible:

  1. the optimizer is not using the “index” to access x$ksuse despite the fact that we’re giving it exactly the value we want to use (and we can see a suitable predicate at operation 6 in the plan)
  2. the optimizer has decided to start executing the query at the x$kslwt table

Before looking at why thing’s have gone wrong, let’s check the execution plan to see what would have happened if we’d copied the value from the sys_context() call into a bind variable and queried using the bind variable – which we’ll keep as a character type to make it a fair comparison:

SQL_ID  cm3ub1tctpdyt, child number 0
-------------------------------------
select program from v$session where sid = to_number(:v1)

Plan hash value: 1627146547

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  MERGE JOIN CARTESIAN     |                 |     1 |    32 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    12 |     0   (0)|
|*  3 |    FIXED TABLE FIXED INDEX| X$KSLWT (ind:1) |     1 |     8 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSLED (ind:2) |     1 |     4 |     0   (0)|
|   5 |   BUFFER SORT             |                 |     1 |    20 |     0   (0)|
|*  6 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    20 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("W"."KSLWTSID"=TO_NUMBER(:V1))
   4 - filter("W"."KSLWTEVT"="E"."INDX")
   6 - filter(("S"."INDX"=TO_NUMBER(:V1) AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))


When we have a (character) bind variable instead of a sys_context() value the optimizer manages to use the “fixed indexes”., but it’s still started executing at x$kslwt, and still doing a Cartesian merge join. The plan would be the same if the bind variable were a numeric type, and we’d still get the same plan if we replaced the bind variable with a literal number.

So problem number 1 is that Oracle only seems able to use the fixed index path for literal values and simple bind variables (plus a few “simple” functions). It doesn’t seem to use the fixed indexes for most functions (even deterministic ones) returning a value and the sys_context() function is a particular example of this.

Transitivity

Problem number 2 comes from a side-effect of something that I first described about 13 years ago – transitive closure. Take a look at the predicates in both the execution plans above. Where’s the join condition between x$ksuse and x$kslwt ? There should be one, because the underlying SQL defining [g]v$session  has the following joins:

from
      x$ksuse s,
      x$ksled e,
      x$kslwt w
where
      bitand(s.ksspaflg,1)!=0
and   bitand(s.ksuseflg,1)!=0
and   s.indx=w.kslwtsid       -- this is the SID column for v$session and v$session_wait
and   w.kslwtevt=e.indx
 

What’s happened here is that the optimizer has used transitive closure: “if a = b and b = c then a = c” to clone the predicate “s.indx = to_number(sys_context(…))” to “w.kslwtsid = to_number(sys_context(…))”. But at the same time the optmizer has eliminated the predicate “s.indx = w.kslwtsid”, which it shouldn’t do because this is 12.2.0.1 and ever since 10g we’ve had the parameter _optimizer_transitivity_retain = true — but SYS is ignoring the parameter setting.

So we no longer have a join condition between x$ksuse and x$kslwt – which means there has to be a cartesian merge join between them and the only question is whether this should take place before or after the join between x$kslwt and x$ksled. In fact, the order doesn’t really matter because there will be only one row identified in x$kslwt and one row in x$ksuse, and the join to x$ksled is simply a lookup (by undeclarable unique key) to translate an id into a name and it will take place only once whatever we do about the other two structures.

But there is a catch – especially if your sessions parameter is 25,000 (which it shouldn’t be) and the number of connected sessions is currently 20,000 (which it shouldn’t be) – the predicate against x$ksuse does a huge amount of work as it walks the entire array testing every row (and it doesn’t even do the indx test first – it does a couple of bitand() operations). Even then this wouldn’t be a disaster – we’re only talking a couple of hundredths of a second of CPU – until you find the applications that run this query a huge number of times.

We would prefer to avoid two full tablescans since the arrays could be quite large, and of the two it’s the tablescan of x$ksuse that is going to be the greater threat; so is there a way to bypass the threat?  Once we’ve identified the optimizer anomaly we’ve got a pointer to a solution. Transitivity is going wrong, so let’s attack the transitivity. Checking the hidden parameters we can find a parameter: _optimizer_generate_transitive_pred which defaults to true, so let’s set it to false for the query and check the plan:

select  /*+   opt_param('_optimizer_generate_transitive_pred','FALSE')
*/  program from  v$session where  sid = sys_context('userenv','sid')

Plan hash value: 3425234845

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  NESTED LOOPS             |                 |     1 |    32 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    28 |     0   (0)|
|   3 |    FIXED TABLE FULL       | X$KSLWT         |    47 |   376 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    20 |     0   (0)|
|*  5 |   FIXED TABLE FIXED INDEX | X$KSLED (ind:2) |     1 |     4 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(("S"."INDX"="W"."KSLWTSID" AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND "S"."INDX"=TO_NUMBER(SYS_CONTEXT('user
              env','sid')) AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))
   5 - filter("W"."KSLWTEVT"="E"."INDX")


Although it’s not nice to insert hidden parameters into the optimizer activity we do have a result. We don’t have any filtering on x$kslwt – fortunately this seems to be limited in size (but see footnote) to the number of current sessions (unlike x$ksuse which has an array size defined by the sessions parameter or derived from the processes parameter). For each row in x$kslwt we do an access into x$ksuse using the “index” (note that we don’t see access predicates for the fixed tables, we just have to note the operation says FIXED INDEX and spot the “index-related” predicate in the filter predicate list), so this strategy has reduced the number of times we check the complex predicate on x$ksuse rows.

It’s still far from ideal, though. What we’d really like to do is access x$kslwt by index using the known value from sys_context(‘userenv’,’sid’). As it stands the path we get from using a hidden parameter which isn’t listed as legal for the opt_param() hint is a plan that we would get if we used an unhinted query that searched for audsid = sys_context(‘userenv’,’sessionid’).

SQL_ID  7f3f9b9f32u7z, child number 0
-------------------------------------
select  program from  v$session where  audsid =
sys_context('userenv','sessionid')

Plan hash value: 3425234845

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  NESTED LOOPS             |                 |     2 |    70 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     2 |    62 |     0   (0)|
|   3 |    FIXED TABLE FULL       | X$KSLWT         |    47 |   376 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    23 |     0   (0)|
|*  5 |   FIXED TABLE FIXED INDEX | X$KSLED (ind:2) |     1 |     4 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(("S"."INDX"="W"."KSLWTSID" AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND "S"."KSUUDSES"=TO_NUMBER(SYS_CONTEXT('
              userenv','sessionid')) AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))
   5 - filter("W"."KSLWTEVT"="E"."INDX")


The bottom line, then, seems to be that if you need a query by SID against v$session to be as efficient as possible then your best bet is to load a numeric variable with the sys_context(‘userenv’,’sid’) and then select where “sid = :bindvariable”.  Otherwise query by audsid, or use a hidden parameter to affect the optimizer.

Until the the SYS schema follows the _optimizer_transitivity_retain parameter or treats sys_context() the same way it treats a bind variable there is always going to be some unnecessary work when querying v$session and that excess will grow with either the number of connected sessions (if you optimize the query) or with the value of the sessions parameter.

Footnote

In (much) older versions of Oracle v$session_wait sat on top of x$ksusecst, which was part of the same C structure as x$ksuse. In newer versions of Oracle x$kslwt is a structure that is created on demand in the PGA/UGA – I hope that there’s a short cut that allows Oracle to find the waiting elements in x$ksuse[cst] efficiently, rather than requiring a walk through the whole thing, otherwise a tablescan of the (nominally smaller) x$kslwt structure will be at least as expensive as a tablescan of the x$ksuse structure.

Update (just a few minutes after posting)

Bob Bryla has pointed out in a tweet that there are many “bugs” not fixed until 19.1 for which the workaround is to set “_optimizer_transitivity_retain” to false. So maybe this isn’t an example of SYS doing something particularly strange – it may be part of a general reworking of the mechanism that still has a couple of undesirable side effects.

Bob’s comment prompted me to clone the x$ tables into real tables in a non-SYS schema and model the fixed indexes with primary keys, and I found that the resulting plan (though very efficient) still discarded the join predicate. So we may be seeing the side effects of a code enhancement relating to generating predicates that produce unique key access paths. (A “contrary” test, the one in the 2013 article I linked to, still retains the join predicate for the query that has non-unique indexes.)

 

October 2, 2019

_cursor_obsolete_threshold

Filed under: 12c,Infrastructure,Oracle,Performance — Jonathan Lewis @ 2:39 pm BST Oct 2,2019

At the recent Trivadis Performance Days in Zurich, Chris Antognini answered a question that had been bugging me for some time. Why would Oracle want to set the default value of _cursor_obsolete_threshold to a value like 8192 in 12.2 ?

In 11.2.0.3 the parameter was introduced with the default value 100; then in 11.2.0.4, continuing into 12.1, the default value increased to 1,024 – what possible reason could anyone have for thinking that 8192 was a good idea ?

The answer is PDBs – specifically the much larger number of PDBs a single CBD can (theoretically) support in 12.2.

In fact a few comments, and the following specific explanation, are available on MoS in Doc ID 2431353.1 “High Version Counts For SQL Statements (>1024) Post Upgrade To 12.2 and Above Causing Database Slow Performance”:

The default value of _cursor_obsolete_threshold is increased heavily (8192 from 1024) from 12.2 onwards in order to support 4096 PDBs which was only 252 PDBs till 12.1. This parameter value is the maximum limit for obsoleting the parent cursors in an multitenant environment and cannot be increased beyond 8192.

Having said, this is NOT applicable for non-CDB environment and hence for those databases, this parameter should be set to 12.1 default value manually i.e. 1024. The default value of 1024 holds good for non-CDB environment and the same parameter can be adjusted case-to-case basis should there be a problem.

It’s all about PDBs – more precisely, it’s all about CDBs running a huge number of PDBs, which is not necessarily the way that many companies are likely to use PDBs. So if you’re a fairly typical companyy running a handful of PDBs in a single CDB then it’s probably a good idea to set the parameter down to the 12.1 value of 1024 (and for bad applications I’d consider going even lower) – and this MOS note actually makes that an official recommendation.

Impact analysis

What’s the worst that could happen if you actually have many PDBs all executing the same application and that application has a few very popular and frequently executed statements? Chris Antognini described a model he’d constructed and some tests he’d done to show the effects. The following code is a variation onhis work. It addresses the following question:

If you have an application that repeatedly issues (explicitly or implicitly) parse calls but doesn’t take advantage of the session cursor cache it has to search the library cache by hash_value / sql_id for the parent cursor, then has to walk the chain of child cursors looking for the right child. What’s the difference in the work done if this “soft parse” has to walk the list to child number 8,191 instead of finding the right cursor at child number 0.

Here’s the complete code for the test:


create table t1
select 1 id from dual
/

alter table t1 add constraint t1_pk primary key (id)
/

spool cursor_obsolete.lst

alter system flush shared_pool;
alter system flush shared_pool;

set serveroutput off
select /*+ index(t1) */ id from t1 where id > 0;
select * from table(dbms_xplan.display_cursor);

execute snap_my_stats.start_snap
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+8192 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

column sql_text format a60
select sql_id, child_number, loaded_versions, executions, sql_text from v$sql where sql_text like 'SELECT%T1%' order by child_number;

prompt  ===============
prompt  Low child reuse
prompt  ===============

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

prompt  ================
prompt  High child reuse
prompt  ================

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 7168+1..7168+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

spool off

I’ve created a table with just one row and given it a primary key. My testing query is going to be very short and simple. A query hinted to return that one row by primary key index range scan.

I’ve flushed the shared pool (twice) to minimise fringe contention from pre-existing information, then executed the statement to populate the dictionary cache and some library cache information and to check the execution plan.

The call to the package snap_my_stats is my standard method for reporting changes in v$mystat across the test. I’ve called the start_snap procedure twice in a row to make sure that its first load doesn’t add some noise to the statistics that we’re trying to capture.

The test runs in three parts.

  • First I loop 8192 times executing the same statement, but with a different value for the optimizer_index_cost_adj for each execution – this gives me the limit of 8192 child cursors, each reporting “Optimizer Mismatch” as the reason for not sharing. I’ve run a query against v$sql after this to check that I have 8192 child cursors – you’ll need to make sure your shared pool is a few hundred megabytes if you want to be sure of keeping them all in memory.
  • The second part of the test simply repeats the loop, but only for the first 1,024 child cursors. At this point the child cursors exist, so the optimizer should be doing “soft” parses rather than hard parses.
  • The final part of the test repeats the loop again, but only for the last 1,024 child cursors. Again they should exist and be usable, so the optimizer should again be doing “soft” parses rather than hard parses.

What I’m looking for is the extra work it takes for Oracle to find the right child cursor when there’s a very long chain of child cursors. From my memory of dumping the library cache in older versions of Oracle, the parent will point to a “segmented array” of pointers to child cursors, and each segment of the array will consist of 16 pointers, plus a pointer to the next segment. So if you have to find child cursor 8191 you will have to following 512 segment pointers, and 16 pointers per segment (totalling 8708 pointers) before you find the child you want – and you’re probably holding a mutex (or latch) while doing so.

One preipheral question to ask, of course, is whether Oracle keeps appending to the segmented array, or whether it uses a “pushdown” approach when allocating a new segment so that newer child cursors are near the start of the array. (i.e. will searching for child cursor 0 be the cheapest one or the most expensive one).

And the results, limited to just the second and third parts, with just a couple of small edits are as follows:


host sdiff -w 120 -s temp1.txt temp2.txt >temp.txt

===============                                            |    ================
Low child reuse                                            |    High child reuse
===============                                            |    ================

Interval:-  0 seconds                                      |    Interval:-  6 seconds

opened cursors cumulative                      2,084       |    opened cursors cumulative                      2,054
recursive calls                                6,263       |    recursive calls                                6,151
recursive cpu usage                               33       |    recursive cpu usage                              570
session logical reads                          1,069       |    session logical reads                          1,027
CPU used when call started                        33       |    CPU used when call started                       579
CPU used by this session                          37       |    CPU used by this session                         579
DB time                                           34       |    DB time                                          580
non-idle wait count                               16       |    non-idle wait count                                5
process last non-idle time                         1       |    process last non-idle time                         6
session pga memory                           524,288       |    session pga memory                            65,536
enqueue requests                                  10       |    enqueue requests                                   3
enqueue releases                                  10       |    enqueue releases                                   3
consistent gets                                1,069       |    consistent gets                                1,027
consistent gets from cache                     1,069       |    consistent gets from cache                     1,027
consistent gets pin                            1,039       |    consistent gets pin                            1,024
consistent gets pin (fastpath)                 1,039       |    consistent gets pin (fastpath)                 1,024
consistent gets examination                       30       |    consistent gets examination                        3
consistent gets examination (fastpath)            30       |    consistent gets examination (fastpath)             3
logical read bytes from cache              8,757,248       |    logical read bytes from cache              8,413,184
calls to kcmgcs                                    5       |    calls to kcmgcs                                    3
calls to get snapshot scn: kcmgss              1,056       |    calls to get snapshot scn: kcmgss              1,026
table fetch by rowid                              13       |    table fetch by rowid                               1
rows fetched via callback                          6       |    rows fetched via callback                          1
index fetch by key                                 9       |    index fetch by key                                 1
index scans kdiixs1                            1,032       |    index scans kdiixs1                            1,024
session cursor cache hits                         14       |    session cursor cache hits                          0
cursor authentications                         1,030       |    cursor authentications                         1,025
buffer is not pinned count                     1,066       |    buffer is not pinned count                     1,026
parse time cpu                                    23       |    parse time cpu                                   558
parse time elapsed                                29       |    parse time elapsed                               556
parse count (total)                            2,076       |    parse count (total)                            2,052
parse count (hard)                                11       |    parse count (hard)                                 3
execute count                                  1,050       |    execute count                                  1,028
bytes received via SQL*Net from client         1,484       |    bytes received via SQL*Net from client         1,486

Two important points to note:

  • the CPU utilisation goes up from 0.33 seconds to 5.7 seconds.
  • the number of hard parses is zero, this is all about searching for the

You might question are the 2,048-ish parse count(total) – don’t forget that we do an “execute immediate” to change the optimizer_index_cost_adj on each pass through the loop. That’s probably why we double the parse count, although the “alter session” doesn’t then report as an “execute count”.

The third call to a statement is often an important one – it’s often the first one that doesn’t need “cursor authentication”, so I ran a similar test executing the last two loops a second time – there was no significant change in the CPU or parse activity between the 2nd and 3rd executions of each cursor. For completeness I also ran a test with the loop for the last 1,024 child cursors ran before the loop for the first child cursors. Again this made no significant difference to the results – the low number child cursors take less CPU to find than the high number child cursors.

Bottom line

The longer the chain of child cursors the more time (elapsed and CPU) you spend searching for the correct child; and when a parent is allowed 8,192 child cursors the extra time can become significant. I would claim that the ca. 5 seconds difference in CPU time appearing in this test corresponds purely to an extra 5 milliseconds walking an extra 7,000 steps down the chain.

If you have a well-behaved application that uses the session cursor cache effectively, or uses “held cursors”, then you may not be worried by very long chains of child cursors. But I have seen many applications where cursor caching is not used and every statement execution from the client turns into a parse call (usually implicit) followed by a hunt through the library cache and walk along the child chain. These applications will not scale well if they are cloned to multiple PDBs sharing the same CDB.

Footnote 1

The odd thing about this “cursor obselete” feature is that I have a distinct memory that when  PDBs were introduced at an ACE Director’s meeting a few years ago the first thought that crossed my mind was about the potential for someone running multiple copies of the same application as separate PDBs seeing a lot of library cache latch contention or cursor mutex contention because any popular statement would now be hitting the same parent cursor from multiple PDBs. I think the casual (i.e. neither formal, nor official) response I got when I raised the point was that the calculation of the sql_id in future releases would take the con_id into consideration. It seems that that idea fell by the wayside.

Footnote 2

If you do see a large number of child cursors for a single parent then you will probably end up looking at v$sql_shared_cursor for the sql_id to see if that gives you some good ideas about why a particular statement has generated so many child cursors. For a list of explainations of the different reasons captured in this view MOS Doc Id  296377.1“Troubleshooting: High Version Count Issues” is a useful reference.

September 16, 2019

Little sleeps

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 11:50 am BST Sep 16,2019

A peripheral question in a recent comment (made in response to me asking whether a loop had been written with a sleep time of 1/100th or 1/1000th of a second) asked “How do you sleep for 1/1000th of a second in pure PL/SQL?”

The answer starts with “How pure is pure ?” Here’s a “pure” PL/SQL solution that “cheats” by calling one of the routines in Oracle’s built-in Java library:


create or replace procedure milli_sleep( i_milliseconds in number) 
as 
        language java
        name 'java.lang.Thread.sleep(long)';
/


create or replace procedure nano_sleep( i_milliseconds in number, i_nanoseconds in number)
as
        language java
        name 'java.lang.Thread.sleep(long, int)';
/

prompt  Milli sleep
prompt  ===========
execute milli_sleep(18)

prompt  Nano sleep
prompt  ==========
execute  nano_sleep(0,999999)


The “nano-second” component of the nano_sleep() procedure is restricted to the ranage 0 – 999999. In both cases the “milli-second” component has to be positive.

Whether your machine is good at handling sleeps of less than 1/100th of a second is another question, of course.

Update – due to popular demand

If you want to find out what else is available in the database you can query view all_java_methods searching by partial name (which is very slow) for something you think might exist, for example:

SQL> select owner, name , method_name from all_java_methods where upper(method_name) like '%MILLI%'

OWNER           NAME                                     METHOD_NAME
--------------- ---------------------------------------- ----------------------------------------
SYS             java/util/concurrent/TimeUnit$4          toMillis
SYS             java/util/concurrent/TimeUnit$5          toMillis
SYS             java/util/concurrent/TimeUnit$6          toMillis
SYS             java/util/concurrent/TimeUnit$7          toMillis
SYS             java/util/concurrent/TimeUnit            toMillis
SYS             java/lang/System                         currentTimeMillis
SYS             javax/swing/ProgressMonitor              setMillisToDecideToPopup
SYS             javax/swing/ProgressMonitor              getMillisToDecideToPopup

There’s a lot more than the few listed above – but I just wanted to pick up currentTimeMillis. If you spot something that looks interesting the easiest next step is probably to do a google search with (for example): Oracle java.lang.system currenttimemillis (alternatively you could just keep a permanent link to Oracle’s manual pages for the Java and serarch them. In my case this link was high on the list of google hits, giving me the following method description:


static long 	currentTimeMillis​() 	Returns the current time in milliseconds.

Conveniently this is easy to embed in pl/sql (be careful with case sensitivity):


create or replace function milli_time return number
as
        language java
        name 'java.lang.System.currentTimeMillis() return long';
/

execute dbms_output.put_line(milli_time)
execute dbms_lock.sleep(1)
execute dbms_output.put_line(milli_time)

----------------

SQL> @ java_procs

Function created.

1568719117725

PL/SQL procedure successfully completed.


PL/SQL procedure successfully completed.

1568719118734

PL/SQL procedure successfully completed.

You now have a PL/SQL function that will return the number of millisecond since 1st January 1970.

August 11, 2019

Troubleshooting

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 9:28 pm BST Aug 11,2019

Here’s a question to provoke a little thought if you’ve got nothing more entertaining to do on a Sunday evening.  What threats do you think of when you see a statement like the following in (say) an AWR report, or in a query against v$sql ?

update tableX set
        col001 = :1, col002 = :2, col003 = :3, ...
        -- etc. 
        -- the names are supposed to indicate that the statement updates 302 columns
        -- etc.
        col301 = :301, col302 = :302
where
        pk_col = :303
;

I’ll be writing up some suggestions tomorrow (Monday, UK BST), possible linking to a few other articles for background reading.

Update

The first three comments have already hit the high points, but I’m going to jot down a few notes anyway.

The first two things that really (should) make an impact are:

  • There’s a table in the application with (at least) 303 columns – anything over 255 is a potential disaster area
  • An update statement that updates 302 columns is probably machine generated by a non-scalable application

A secondary detail that might be useful is recognising the pattern of the text – lower case for column names, simple “:nnn” for bind variables.  As it stands I don’t recognise the possible source for this style, but I know it’s not SQL embedded in PL/SQL (which would be all capitals with “:Bnnn” as bind variable names) and it’s not part of a distributed query from a remote database (which would be in capitals with quoted names, and table aliases like “A1, A2, …”), and it’s not “raw” Hiberbate code which produces meaningless table and column aliases based on numbers with a “t” for table and “c” for column.

So let’s think about possible problems and symptoms relating to the two obvious issues:

Wide tables

Once you have more than 255 (real) columns in a table – even if that count includes columns that have been marked unused – Oracle will have to split rows into “rowpieces” that do not exceed 255 columns and chain those pieces together. Oracle will try to be as efficient as possible – with various damage-limiting code changes appearing across versions – attempting store these row pieces together and keeping the number to a minimum, but there are a number of anomalies that can appear that have a significant impact on performance.

Simply having to visit two row pieces to pick up a column in the 2nd row piece (even if it is in the same block) adds to the cost of processing; but when you have to visit a second block to acquire a 2nd (or 3rd, or 4th) rowpiece the costs can be significant. As a quirky little extra, Oracle’s initial generation of row-pieces creates them from the end backwards – so a row with 256 columns starts with a row-piece of one column following by a rowpiece of 255 columns: so you may find that you have to fetch multiple row pieces for virtually every row you access.

It’s worth noting that a row splitting is based only on columns that have been used in the row. If your data is restricted to the first 255 column of a row then the entire row can be stored as a single row piece (following the basic rule that “trailing nulls take no space”); but as soon as you start to update such a row by populating columns past the 255 boundary Oracle will start splitting from the end – and it may create a new trailing row-piece each time you populate a column past the current “highest” column.  In an extreme case I’ve managed to show an example of a single row consisting of 746 row pieces, each in a different block (though that’s a bug/feature that’s been addressed in very recent versions of Oracle).

If rows have been split across multiple blocks then one of the nastier performance problems appears with direct path read tablescans. As Oracle follows the pointer to a secondary row piece it will do a physical read of the target block then immediately forget the target block so, for example, if you have inserted 20 (short) rows into a block then updated all of them in a way that makes them split and all their 2nd row pieces go to the same block further down the table you can find that Oracle does a single direct path read that picks up the head pieces, then 20 “db file sequential read” calls to the same block to pick up the follow-on pieces. (The same effect appears with simple migrated rows.) Contrarily, if you did the same tablescan using “db file scattered read” requests then Oracle might record a single, highly deceptive “table fetch continued row” because it managed to pin the target block and revisit it 20 times.

Often a very wide row (large number of columns) means the table is trying to hold data for multiple types of object. So a table of 750 columns may use the first 20 columns for generic data, columns 21 to 180 for data for object type A, 181 to 395 for data for object type B, and so on.  This can lead to rows with a couple of hundred used columns and several hundred null columns in the middle of each row – taking one byte per null column and burning up lots of CPU as Oracle walks a row to find a relevant column. A particularly nasty impact can appear from this type of arrangement when you upgrade an applications:  imagine you have millions of rows of the “type A” above which use only the first 180 columns.  For some reason the application adds one new “generic” column that (eventually) has to be populated for each row – as the column is populated for a type A row the row grows by 520 (null counting) bytes and splits into at least 3 pieces. The effect could be catastrophic for anyone who had been happy with their queries reporting type A data.

One of the difficulties of handling rows that are chained due to very high column counts is that the statistics can be quite confusing (and subject to change across versions). The most important clue comes from “table fetch continued row”; but this can’t tell you whether your “continued” rows are migrated or chained (or both), which table they come from, and whether you’ve been fetching the same small number multiple times or many different rows. Unfortunately the segment statistics (v$segstat / v$segment_statistics) don’t capture the number of continued fetches by segment – it would be nice if they did since it ought to be a rare (and therefore low-cost) event. The best you can do, probably, is to look at the v$sql_monitor report for queries that report tablescans against large tables but report large numbers of single block reads in the tablescan – and for repeatable cases enable SQL trace with wait tracing enabled against suspect queries to see if they show the characteristic mix of direct path reads and repeated db file sequential reads.

Update every column

The key performance threat in statements that update every column – including the ones that didn’t change – is that Oracle doesn’t compare before and after values when doing the update. Oracle’s code path assumes you know what you’re doing so it saves every “old” value to an undo record (which gets copied to the redo) and writes every “new” value to a redo change vector.  (Fortunately Oracle does check index definitions to see which index entries really have suffered changes, so it doesn’t visit index leaf blocks unnecessarily). It’s possible that some front-end tool that adopts this approach has a configuration option that switches from “one SQL statement for all update” to “construct minimal statement based on screen changes”.

The simple trade-off between these two options is the undo/redo overhead vs. parsing and optimisation overheads as the tool creates custom statements on demand. In the case of the table with more than 255 columns, of course, there’s the added benefit that an update of only the changed columns might limit the update to columns that are in the first rowpiece, eliminating the need (some of the time) to chase pointers to follow-up pieces.

Limiting the update can help with undo and redo, of course, but if the tool always pulls the entire row to the client anyway you still have network costs to consider. With the full row pulled and then updated you may find it takes several SQL*Net roundtrips to transfer the whole row between client and server.  In a quick test on a simple 1,000 column table with an update that set every column in a single row to null (using a bind variables) I found that the a default setup couldn’t even get 1,000 NULLs (let alone “real values”) across the network without resorting to one wait on “SQL*Net more data from client”


variable b1 number
exec :b1 := null;

update t1 set
        col0001 = :b1,
        col0002 = :b1,
        ...
        col1000 = :b1
;

Although “SQL*Net message to/from client” is typically (though not always correctly) seen as an ignorable wait, you need to be a little more suspicious of losing time to “SQL*Net more data to/from client”. The latter two waits mean you’re sending larger volumes of information across the network and maybe you can eliminate some of the data or make the transfer more efficient – perhaps a simple increase in the SDU (session data unit) in the tnsnames.ora, listener.ora, or sqlnet.ora (for default_sdu_size) might be helpful.

Warning

One of the features of trouble-shooting from cold is that you don’t know very much about the system you’re looking at – so it’s nice to be able to take a quick overview of a system looking for any potentially interesting anomalies and take a quick note of what they are and what thoughts they suggest before you start asking questions and digging into a system. This article is just a couple of brief notes along the lines of: “that doesn’t look nice- what questions does it prompt”.

August 9, 2019

Split Partition

Filed under: Oracle,Partitioning,Performance,Tuning — Jonathan Lewis @ 1:02 pm BST Aug 9,2019

This is a little case study on “pre-emptive trouble-shooting”, based on a recent question on the ODC database forum asking about splitting a range-based partition into two at a value above the current highest value recorded in a max_value partition.

The general principle for splitting (range-based) partitions is that if the split point is above the current high value Oracle will recognise that it can simply rename the existing partition and create a new, empty partition, leaving all the indexes (including the global and globally partitioned indexes) in a valid state. There are, however, three little wrinkles to this particular request:

  • first is that the question relates to a system running 10g
  • second is that there is a LOB column in the table
  • third is that the target is to have the new (higher value) partition(s) in a different tablespace

It’s quite possible that 10g won’t have all the capabilities of partition maintenance of newer versions, and if anything is going to go wrong LOBs are always a highly dependable point of failure, and since all the examples in the manuals tend to be very simple examples maybe any attempt to introduce complications like tablespace specification will cause problems.

So, before you risk doing the job in production, what are you going to test?

In Oracle terms we want to check the following

  • Will Oracle have silently copied/rebuilt some segments rather than simply renaming old segments and creating new, empty segments.
  • Will the segments end up where we want them
  • Will all the indexes stay valid

To get things going, the OP had supplied a framework for the table and told us about two indexes, and had then given us two possible SQL statements to do the split, stating they he (or she) had tested them and they both worked. Here’s the SQL (with a few tweaks) that creates the table and indexes. I’ve also added some data – inserting one row into each partition.

rem
rem     Script:         split_pt_lob.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2019
rem
rem     Last tested 
rem             12.2.0.1
rem             10.2.0.5
rem

define m_old_ts = 'test_8k'
define m_new_ts = 'assm_2'

drop table part_tab purge;

create table part_tab(
  pt_id            NUMBER,
  pt_name          VARCHAR2(30),
  pt_date          DATE default SYSDATE,
  pt_lob           CLOB,
  pt_status        VARCHAR2(2)
)
tablespace &m_old_ts
lob(pt_lob) store as (tablespace &m_old_ts)
partition by range (pt_date)
(
  partition PRT1 values less than (TO_DATE('2012-01-01', 'YYYY-MM-DD')),
  partition PRT2 values less than (TO_DATE('2014-09-01', 'YYYY-MM-DD')),
  partition PRT_MAX values less than (MAXVALUE)
)
/

alter table part_tab
add constraint pt_pk primary key(pt_id)
/

create index pt_i1 on part_tab(pt_date, pt_name) local
/

insert into part_tab(
    pt_id, pt_name, pt_date, pt_lob, pt_status
)
values(
    1,'one',to_date('01-Jan-2011'),rpad('x',4000),'X'
)
/

insert into part_tab(
    pt_id, pt_name, pt_date, pt_lob, pt_status
)
values(
    2,'two',to_date('01-Jan-2013'),rpad('x',4000),'X'
)cascade=>trueee
/

insert into part_tab(
    pt_id, pt_name, pt_date, pt_lob, pt_status
)
values(
    3,'three',to_date('01-Jan-2015'),rpad('x',4000),'X'
)
/

commit;

execute dbms_stats.gather_table_stats(null,'part_tab',cascade=>true,granularity=>'ALL')

We were told that

The table has
– Primary Key on pt_id column with unique index (1 Different table has FK constraint that refers to this PK)
– Composite index on pt_date and pt_name columns

This is why I’ve added a primary key constraint (which will generate a global index) and created an index on (pt_date,pt_name) – which I’ve created as a local index since it contains the partitioning column.

The description of the requirement was:

  • The Task is to split partition(PRT_MAX) to a different tablespace
  • New partition(s) won’t have data at the moment of creation

And the two “tested” strategies were:

alter table part_tab split partition PRT_MAX at(TO_DATE('2019-08-01', 'YYYY-MM-DD')) into (
        PARTITION PRT3    tablespace &m_old_ts,
        PARTITION PRT_MAX tablespace &m_new_ts
);

alter table part_tab split partition PRT_MAX at(TO_DATE('2019-08-01', 'YYYY-MM-DD')) into (
        PARTITION PRT3    tablespace &m_old_ts LOB (pt_lob) store as (TABLESPACE &m_old_ts), 
        PARTITION PRT_MAX tablespace &m_new_ts LOB (pt_lob) store as (TABLESPACE &m_new_ts)
)
;
 

If we’re going to test these strategies properly we will need queries similar to the following:


break on object_name skip 1
select object_name, subobject_name, object_id, data_object_id  from user_objects order by object_name, subobject_name;

break on index_name skip 1
select index_name, status from user_indexes;
select index_name, partition_name, status from user_ind_partitions order by index_name, partition_name;

break on segment_name skip 1
select segment_name, partition_name, tablespace_name from user_segments order by segment_name, partition_name;

First – what are the object_id and data_object_id for each object before and after the split. Have we created new “data objects” while splitting, or has an existing data (physical) object simply changed its name.

Secondly – are there any indexes or index partitions that are no longer valid

Finally – which tablespaces do physical objects reside in.

On a test run of the first, simpler, split statement here are the before and after results for the object_id and data_object_id, followed by the post-split results for index and segment details:


Before Split
============

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23677          23677
                                 PRT2                        23678          23678
                                 PRT_MAX                     23679          23679
                                                             23676

PT_I1                            PRT1                        23690          23690
                                 PRT2                        23691          23691
                                 PRT_MAX                     23692          23692
                                                             23689

PT_PK                                                        23688          23688

SYS_IL0000023676C00004$$         SYS_IL_P252                 23685          23685
                                 SYS_IL_P253                 23686          23686
                                 SYS_IL_P254                 23687          23687

SYS_LOB0000023676C00004$$        SYS_LOB_P249                23681          23681
                                 SYS_LOB_P250                23682          23682
                                 SYS_LOB_P251                23683          23683
                                                             23680          23680

After split
===========

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23677          23677
                                 PRT2                        23678          23678
                                 PRT3                        23693          23679
                                 PRT_MAX                     23679          23694
                                                             23676

PT_I1                            PRT1                        23690          23690
                                 PRT2                        23691          23691
                                 PRT3                        23700          23692
                                 PRT_MAX                     23699          23699
                                                             23689

PT_PK                                                        23688          23688

SYS_IL0000023676C00004$$         SYS_IL_P252                 23685          23685
                                 SYS_IL_P253                 23686          23686
                                 SYS_IL_P257                 23697          23687
                                 SYS_IL_P258                 23698          23698

SYS_LOB0000023676C00004$$        SYS_LOB_P249                23681          23681
                                 SYS_LOB_P250                23682          23682
                                 SYS_LOB_P255                23695          23683
                                 SYS_LOB_P256                23696          23696
                                                             23680          23680


INDEX_NAME                       STATUS
-------------------------------- --------
PT_I1                            N/A
PT_PK                            VALID
SYS_IL0000023676C00004$$         N/A


INDEX_NAME                       PARTITION_NAME         STATUS
-------------------------------- ---------------------- --------
PT_I1                            PRT1                   USABLE
                                 PRT2                   USABLE
                                 PRT3                   USABLE
                                 PRT_MAX                USABLE

SYS_IL0000023676C00004$$         SYS_IL_P252            USABLE
                                 SYS_IL_P253            USABLE
                                 SYS_IL_P257            USABLE
                                 SYS_IL_P258            USABLE


SEGMENT_NAME              PARTITION_NAME         TABLESPACE_NAME
------------------------- ---------------------- ------------------------------
PART_TAB                  PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_I1                     PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_PK                                            TEST_8K

SYS_IL0000023676C00004$$  SYS_IL_P252            TEST_8K
                          SYS_IL_P253            TEST_8K
                          SYS_IL_P257            TEST_8K
                          SYS_IL_P258            TEST_8K

SYS_LOB0000023676C00004$$ SYS_LOB_P249           TEST_8K
                          SYS_LOB_P250           TEST_8K
                          SYS_LOB_P255           TEST_8K
                          SYS_LOB_P256           TEST_8K

Before the split partition PRT_MAX – with 4 segments: table, index, LOB, LOBINDEX – has object_id = data_object_id, with the values: 23679 (table), 23692 (index), 23683 (LOB), 23687 (LOBINDEX); and after the split these reappear as the data_object_id values for partition PRT3 (though the object_id values are larger than the data_object_id values) – so we infer that Oracle has simply renamed the various PRT_MAX objects to PRT3 and created new, empty PRT_MAX objects.

We can also see that all the indexes (including the global primary key index) have remained valid. We also note that the data_object_id of the primary key index has not changed, so Oracle didn’t have to rebuild it to ensure that it stayed valid.

There is a problem, though, the LOB segment and LOBINDEX segments for the new PRT_MAX partition are not in the desired target tablespace. So we need to check the effects of the second version of the split command where we add the specification of the LOB tablespaces. This is what we get – after rerunning the entire test script from scratch:


OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23727          23727
                                 PRT2                        23728          23728
                                 PRT_MAX                     23729          23729
                                                             23726

PT_I1                            PRT1                        23740          23740
                                 PRT2                        23741          23741
                                 PRT_MAX                     23742          23742
                                                             23739

PT_PK                                                        23738          23738

SYS_IL0000023726C00004$$         SYS_IL_P272                 23735          23735
                                 SYS_IL_P273                 23736          23736
                                 SYS_IL_P274                 23737          23737

SYS_LOB0000023726C00004$$        SYS_LOB_P269                23731          23731
                                 SYS_LOB_P270                23732          23732
                                 SYS_LOB_P271                23733          23733
                                                             23730          23730


OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23727          23727
                                 PRT2                        23728          23728
                                 PRT3                        23743          23743
                                 PRT_MAX                     23729          23744
                                                             23726

PT_I1                            PRT1                        23740          23740
                                 PRT2                        23741          23741
                                 PRT3                        23750          23750
                                 PRT_MAX                     23749          23749
                                                             23739

PT_PK                                                        23738          23738

SYS_IL0000023726C00004$$         SYS_IL_P272                 23735          23735
                                 SYS_IL_P273                 23736          23736
                                 SYS_IL_P277                 23747          23747
                                 SYS_IL_P278                 23748          23748

SYS_LOB0000023726C00004$$        SYS_LOB_P269                23731          23731
                                 SYS_LOB_P270                23732          23732
                                 SYS_LOB_P275                23745          23745
                                 SYS_LOB_P276                23746          23746
                                                             23730          23730

INDEX_NAME                       STATUS
-------------------------------- --------
PT_I1                            N/A
PT_PK                            UNUSABLE
SYS_IL0000023726C00004$$         N/A

INDEX_NAME                       PARTITION_NAME         STATUS
-------------------------------- ---------------------- --------
PT_I1                            PRT1                   USABLE
                                 PRT2                   USABLE
                                 PRT3                   UNUSABLE
                                 PRT_MAX                USABLE

SYS_IL0000023726C00004$$         SYS_IL_P272            USABLE
                                 SYS_IL_P273            USABLE
                                 SYS_IL_P277            USABLE
                                 SYS_IL_P278            USABLE

SEGMENT_NAME              PARTITION_NAME         TABLESPACE_NAME
------------------------- ---------------------- ------------------------------
PART_TAB                  PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_I1                     PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_PK                                            TEST_8K

SYS_IL0000023726C00004$$  SYS_IL_P272            TEST_8K
                          SYS_IL_P273            TEST_8K
                          SYS_IL_P277            TEST_8K
                          SYS_IL_P278            ASSM_2

SYS_LOB0000023726C00004$$ SYS_LOB_P269           TEST_8K
                          SYS_LOB_P270           TEST_8K
                          SYS_LOB_P275           TEST_8K
                          SYS_LOB_P276           ASSM_2


Before looking at the more complex details the first thing that leaps out to hit the eye is the word UNUSABLE – which appears for the status of the (global) primary key index and the PRT3 subpartition. The (empty) PRT_MAX LOB and LOBINDEX partitions are where we wanted them, but by specifying the location we seem to have broken two index segments that will need to be rebuilt.

It gets worse, because if we check the data_object_id of the original PRT_MAX partition (23729) and its matching index partition (23742) we see that they don’t correspond to the (new) PRT3 data_object_id values which are 23743 and 23750 respectively – the data has been physically copied from one data object to another completely unnecessarily; moreover the same applies to the LOB and LOBINDEX segments – the data object ids for the PRT_MAX LOB and LOBINDEX partitions were 23733 and 23737, the new PRT3 data object ids are 23746 and 23747.

If you did a test with only a tiny data set you might not notice the implicit threat that these changes in data_object_id tell you about – you’re going to be copying the whole LOB segment when you don’t need to.

Happy Ending (maybe)

A quick check with 12.2 suggested that Oracle had got much better at detecting that it didn’t need to copy LOB data and invalidate indexes with the second form of the code; but the OP was on 10g – so that’s not much help. However it was the thought that Oracle might misbehave when you specifyied tablespaces that made me run up this test – in particular I had wondered if specifying a tablespace for the partition that would end up holding the existing data might trigger an accident, so here’s a third variant of the split statement I tested, with the results on the indexes, segments, and data objects. Note that I specify the tablespace only for the new (empty) segments:


alter table part_tab split partition PRT_MAX at(TO_DATE('2019-08-01', 'YYYY-MM-DD')) into (
    PARTITION PRT3,
    PARTITION PRT_MAX tablespace &m_new_ts  LOB (pt_lob) store as (TABLESPACE &m_new_ts)
)
/

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23752          23752
                                 PRT2                        23753          23753
                                 PRT_MAX                     23754          23754
                                                             23751

PT_I1                            PRT1                        23765          23765
                                 PRT2                        23766          23766
                                 PRT_MAX                     23767          23767
                                                             23764

PT_PK                                                        23763          23763

SYS_IL0000023751C00004$$         SYS_IL_P282                 23760          23760
                                 SYS_IL_P283                 23761          23761
                                 SYS_IL_P284                 23762          23762

SYS_LOB0000023751C00004$$        SYS_LOB_P279                23756          23756
                                 SYS_LOB_P280                23757          23757
                                 SYS_LOB_P281                23758          23758
                                                             23755          23755

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23752          23752
                                 PRT2                        23753          23753
                                 PRT3                        23768          23754
                                 PRT_MAX                     23754          23769
                                                             23751

PT_I1                            PRT1                        23765          23765
                                 PRT2                        23766          23766
                                 PRT3                        23775          23767
                                 PRT_MAX                     23774          23774
                                                             23764

PT_PK                                                        23763          23763

SYS_IL0000023751C00004$$         SYS_IL_P282                 23760          23760
                                 SYS_IL_P283                 23761          23761
                                 SYS_IL_P287                 23772          23762
                                 SYS_IL_P288                 23773          23773

SYS_LOB0000023751C00004$$        SYS_LOB_P279                23756          23756
                                 SYS_LOB_P280                23757          23757
                                 SYS_LOB_P285                23770          23758
                                 SYS_LOB_P286                23771          23771
                                                             23755          23755
INDEX_NAME                       STATUS
-------------------------------- --------
PT_I1                            N/A
PT_PK                            VALID
SYS_IL0000023751C00004$$         N/A

INDEX_NAME                       PARTITION_NAME         STATUS
-------------------------------- ---------------------- --------
PT_I1                            PRT1                   USABLE
                                 PRT2                   USABLE
                                 PRT3                   USABLE
                                 PRT_MAX                USABLE

SYS_IL0000023751C00004$$         SYS_IL_P282            USABLE
                                 SYS_IL_P283            USABLE
                                 SYS_IL_P287            USABLE
                                 SYS_IL_P288            USABLE

SEGMENT_NAME              PARTITION_NAME         TABLESPACE_NAME
------------------------- ---------------------- ------------------------------
PART_TAB                  PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_I1                     PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_PK                                            TEST_8K

SYS_IL0000023751C00004$$  SYS_IL_P282            TEST_8K
                          SYS_IL_P283            TEST_8K
                          SYS_IL_P287            TEST_8K
                          SYS_IL_P288            ASSM_2

SYS_LOB0000023751C00004$$ SYS_LOB_P279           TEST_8K
                          SYS_LOB_P280           TEST_8K
                          SYS_LOB_P285           TEST_8K
                          SYS_LOB_P286           ASSM_2

All the index and index partitions stay valid; the new empty segments all end up in the target tablespace, and all the data object ids for the old PRT_MAX partitions becaome the data object ids for the new PRT3 partitions. Everything we want, and no physical rebuilds of any data sets.

Moral:

When you’re testing, especially when you’re doing a small test while anticipating a big data set, don’t rely on the clock; check the data dictionary (and trace files, if necessary) carefully to find out what activity actually took place.

Footnote:

It’s possible that there are ways to fiddle around with the various default attributes of the partitioned table to get the same effect – but since 12.2 is much better behaved anyway there’s no point in me spending more time looking for alternative solutions to a 10g problem.

 

August 8, 2019

Free Space

Filed under: Execution plans,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 1:27 pm BST Aug 8,2019

Several years ago I wrote a note about reporting dba_free_space and dba_extents to produce a map of the space usage in a tablespace in anticipation of messing about with moving or rebuilding objects to try and reduce the size of the files in the tablespace.  In the related page where I published the script I pointed out that a query against dba_extents would be expensive because it makes use of structure x$ktfbue which generates the information dynamically by reading segment header blocks. I also pointed out in a footnote to the original article that if you’ve enabled the recyclebin and have “dropped” some objects then there will be some space that is reported as free but is not quite free since the extents will still be allocated. This brings me to the topic for today’s blog.

While visiting a client site recently I came across an instance that was running a regular report to monitor available space in the database. Basically this was a query against view dba_free_space. Surprisingly it was taking a rather long time to complete – and the reason for this came in two parts. First, the recyclebin was enabled and had some objects in it and secondly there were no stats on the fixed object x$ktfbue.

In the case of the client the particular query produced a plan that included the following lines:


Id  Operation             Name              Rows    Bytes  Cost (%CPU)  Time
--  --------------------- ----------------  ----   ------  -----------  --------
63  HASH JOIN                               2785     212K     46  (85)  00:00:01
64    TABLE ACCESS FULL   RECYCLEBIN$       1589    20657      7   (0)  00:00:01
65    FIXED TABLE FULL    X$KTFBUE          100K    6347K     38 (100)  00:00:01 

This is part of the view where Oracle calculates the size of all the extents of objects in the recyclebin so that they can be reported as free space. Notice that in this plan (which is dependent on version, system stats, object_stats and various optimizer parameters) the optimizer has chosen to do a hash join between the recyclebin (recyclebin$) and the x$ structure – and that has resulted in a “full tablescan” of x$ktfbue, which means Oracle reads the segment header block of every single segment in the entire database. (I don’t know where the row stats came from as there were no stats on x$ktfbue, and this plan was pulled from the AWR history tables so the query had been optimised and captured some time in the past.)

If there had been nothing in the recyclebin the hash join and two tablescans wouldn’t have mattered, unfortunately the recyclebin had been enabled and there were a few rows in recyclebin$, so the “tablescan” happened. Here’s a cut-n-paste from a much simpler query run against a fairly new (no 3rd party app) database running 12.1.0.2 to give you some idea of the impact:


SQL> execute snap_events.start_snap

PL/SQL procedure successfully completed.

SQL> select count(*) from x$ktfbue;

  COUNT(*)
----------
      8774

1 row selected.

SQL> execute snap_events.end_snap
---------------------------------------------------------
Session Events - 01-Aug 21:28:13
---------------------------------------------------------
Event                                             Waits   Time_outs        Csec    Avg Csec    Max Csec
-----                                             -----   ---------        ----    --------    --------
Disk file operations I/O                              7           0           0        .018           1
db file sequential read                           5,239           0          14        .003           6
SQL*Net message to client                             7           0           0        .000           0
SQL*Net message from client                           7           0       1,243     177.562         572
events in waitclass Other                             3           1           0        .002           0

PL/SQL procedure successfully completed.

On my little laptop, with nothing else going on, I’ve managed to get away with “only” 5,239 single block reads, and squeezed them all into just 14 centiseconds (local SSD helps). The clients wasn’t so lucky – they were seeing tens of thousands of real physical reads.

The ideal solution, of course, was to purge the recyclebin and disable the feature – it shouldn’t be necessary to enable it on a production system – but that’s something that ought to require at least some paperwork. In the short term gathering stats on the fixed table helped because the plan changed from a hash join with “tablescan” of x$ktfbue to a nested loop with an “indexed” access path, looking more like the following (from a query against just recyclebin$ and x$ktfbue)

---------------------------------------------------------------------------------------------
| Id  | Operation                | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                  |       |       |     4 (100)|          |
|   1 |  NESTED LOOPS            |                  |     7 |   182 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL      | RECYCLEBIN$      |     6 |    66 |     4   (0)| 00:00:01 |
|*  3 |   FIXED TABLE FIXED INDEX| X$KTFBUE (ind:1) |     1 |    15 |     0   (0)|          |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("BUE"."KTFBUESEGBNO"="BLOCK#" AND "BUE"."KTFBUESEGTSN"="TS#" AND
              "BUE"."KTFBUESEGFNO"="FILE#"))

This was still fairly resource-intensive for the client, but was something of an improvement – they had a lot more than 6 items in their recyclebin.

Part of the problem, of course, is that x$ktfbue is one of the objects that Oracle skips when you gather “fixed object” stats – it can be a bit expensive for exactly the reason that querying it can be expensive, all those single block segment header reads.

If you want to check the stats and gather them (as a one-off, probably) here’s some suitable SQL:


select
        table_name, num_rows, avg_row_len, sample_size, last_analyzed
from
        dba_tab_statistics
where
        owner = 'SYS'
and     table_name = 'X$KTFBUE'
;

begin
        dbms_stats.gather_table_stats('SYS','X$KTFBUE');
end;
/

Summary

You probably shouldn’t have the recyclebin enabled in a production system; but if you do, and if you also run a regular report on free space (as many sites seem to do) make sure (a) you have a regular routine to minimise the number of objects that it accumulates and (b) gather statistics (occasionally) on x$ktfbue to minimise the overhead of the necessary join between recyclebin$ and x$ktfbue.

July 3, 2019

DB links

Filed under: distributed,Oracle,Performance — Jonathan Lewis @ 1:16 pm BST Jul 3,2019

A recent thread on the ODC SQL & PL/SQL forum raised the question of how data moves across a database link – is it row by row or array based (or other)? It’s a good question as it’s fairly common knowledge that distributed queries can be very much slower than the same query when executed on an equivalent set of local tables, so it’s worth having a little poke around to see if there’s anything in the way the data moves across the link that could be tweaked to improve performance.

It’s fairly easy to get some ideas by running a couple of queries and checking session activity stats and wait events – so here’s a little test conducted between a database running 12.2.0.1 and a database running 11.2.0.4. For this type of testing it’s probably sensible to use two database rather than faking things with a loopback database link in case the loopback introduces some special features (a.k.a. dirty tricks).


rem
rem     Script:         db_link_usage.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2019
rem
rem     Last tested 
rem             12.2.0.1 -> 11.2.0.4
rem

prompt  ==================================
prompt  SQL*Plus 20,000 rows, arraysize 20
prompt  ==================================

set feedback only
set arraysize 20

execute snap_my_stats.start_snap
execute snap_events.start_snap
select * from all_objects@d11g where rownum <= 20000;

set feedback on
execute snap_events.end_snap
execute snap_my_stats.end_snap

prompt  ====================================
prompt  SQL*Plus 20,000 rows, arraysize 2000
prompt  ====================================

set feedback only
set arraysize 2000

execute snap_my_stats.start_snap
execute snap_events.start_snap
select * from all_objects@d11g where rownum <= 20000;

set feedback on
execute snap_events.end_snap
execute snap_my_stats.end_snap

All I’ve done is select 20,000 rows from view all_objects in a remote database with two different settings for the array fetch size. The “feedback only” option is a 12c option that means the session doesn’t output the data, it shows only the final “N rows selected” report. The two “snap” packages are some code I wrote a couple of decades ago to report changes in my session’s activity stats and wait events.

It turned out that there was very little of interest in the session activity stats although the impact of SQL*Net compression is always worth a quick reminder (here and here) on the other hand the wait events (specifically the wait counts) are most revealing:

================================== 
SQL*Plus 20,000 rows, arraysize 20
================================== 
Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
SQL*Net message from client                       1,004           0          15.69        .016     414,828
SQL*Net message to dblink                         1,004           0           0.04        .000           0
SQL*Net message from dblink                       1,004           0          61.02        .061          11
SQL*Net message to client                         1,004           0           0.11        .000           0

====================================
SQL*Plus 20,000 rows, arraysize 2000
====================================
Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
SQL*Net message from client                          14           0           4.88        .349     414,828
SQL*Net message to dblink                            13           0           0.00        .000           0
SQL*Net message from dblink                          13           0           7.72        .594          11
SQL*Net message to client                            14           0           0.00        .000           0

SQL*Net more data from dblink                        96           0           6.26        .065           6
SQL*Net more data to client                          96           0           0.16        .002           4

Doing a little arithmetic (and allowing for a few “metadata” messages that introduce small variations in the numbers), we can see that when we fetched the 20,000 rows with an arraysize of 20 this turned into 1,000 (fetch) calls from SQL*Plus to the server, with a matching 1,000 calls from the local server to the remote server. When the arraysize goes up to 2,000 rows, though, the SDU (session data unit) setting for SQL*Net is too small to hold the whole of a single fetch and we see a single fetch from SQL*Plus turning into one “SQL*Net message to client” accompanied by 7 or 8 “SQ:(Net more data to client” with exactly the same pattern of conversation between the local server and the remote server. You could imagine the conversation as something like:

  • Client to local server: “give me 2,000 rows”
  • local server to remote server: “give me 2,000 rows”
  • remote server to local server: “Here, I can manage 120 rows as the first installment”
  • local server to client: “Here’s a first installment”
  • Client to local server: “Got it, where’s the rest?”     *** but no message recorded in the session activity stats
  • Local server to remote server: “Where’s the rest?”
  • Remote server to local server: “You want more – here’s another 120 rows”
  • Local server to client: “Here’s a second installment”
  • Client to local server: “Got it, where’s the rest?”     *** but no message recorded in the session activity stats
  • etc.

In this example the volume of data sent back to the client in each message was limited by the SDU size negotiated between the local server and the remote server as the link opens. In my case I happened to have the default (apparently 8KB) as the SDU_SIZE in the 12c and a non-default 32KB setting in the 11g sqlnet.ora, so the negotiated SDU between servers was 8KB (the lower of the two).

Here’s what the figures looked like after I had restarted with the SDU set to 16KB in the 12c tnsnames.ora, and 16KB in the 11g listener.ora:


==================================
SQL*Plus 20,000 rows, arraysize 20
==================================
Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
SQL*Net message from client                       1,004           0          24.23        .024         588
SQL*Net message to dblink                         1,009           0           0.06        .000           0
SQL*Net message from dblink                       1,010           0          77.76        .077           4
SQL*Net message to client                         1,004           0           0.15        .000           0

====================================
SQL*Plus 20,000 rows, arraysize 2000
====================================
Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
SQL*Net message from client                          14           0           1.61        .115         588
SQL*Net message to dblink                            13           0           0.00        .000           0
SQL*Net message from dblink                          13           0           4.21        .324           4
SQL*Net message to client                            14           0           0.00        .000           0

SQL*Net more data from dblink                        45           0          13.53        .301           6
SQL*Net more data to client                          45           0           0.13        .003           0

The first set of figures (arraysize 20) don’t change. If an 8KB SDU is large enough to hold an array of 20 rows then it’s not going to make a difference when the SDU is increased. In the second set of figures we see that for each “SQL*Net message from dblink” we now have roughly 3 “SQL*Net more data from dblink” (with matching counts for “SQL*Net more data to client”). With an SDU of double the size it now takes a total of roughly 4 packets to transmit the array fetch rather than the 8 or 9 we saw with the smaller SDU size.

As a final test with SQL*Plus, what happens if we set the SDU size at 32K for the 12c database (and that means for the client calling in and the server calling out) and 8KB for the 11g database? The client negotiates a 32KB SDU with the 12c server, but the 12c server negotiates an 8KB SDU with the 11g server. Here are the stats for just the test with the large fetch size in this case:


====================================
SQL*Plus 20,000 rows, arraysize 2000
====================================
Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
SQL*Net message from client                          14           0           5.30        .379     214,570
SQL*Net message to dblink                            13           0           0.00        .000           0
SQL*Net message from dblink                          13           0           4.09        .314          13
SQL*Net message to client                            14           0           0.00        .000           0

SQL*Net more data from dblink                        96           0          14.46        .151           6
SQL*Net more data to client                          20           0           0.08        .004           0

We get the same 10 (plus a few) “message to/from client/dblink”, but now the “more data” waits are dramatically different. When the client calls for the “next fetch” the local server has to call the remote server 4 or 5 times to get enough 8KB data packets to fill a single 32KB packet to return to the client. You can confirm this (and some of my descriptions of the earlier behaviour) by enabling extended tracing for wait states. Here’s an extract of 5 consecutive lines from a suitable trace file (you have to ignore the “#bytes” parameter on the “SQL*Net more data from dblink” waits, they’ve always been garbage:


WAIT #140028098478296: nam='SQL*Net more data from dblink' ela= 2793 driver id=1413697536 #bytes=7 p3=0 obj#=-1 tim=703316387674
WAIT #140028098478296: nam='SQL*Net more data from dblink' ela= 34 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=703316388447
WAIT #140028098478296: nam='SQL*Net more data from dblink' ela= 8 driver id=1413697536 #bytes=2 p3=0 obj#=-1 tim=703316389134
WAIT #140028098478296: nam='SQL*Net more data from dblink' ela= 16 driver id=1413697536 #bytes=12 p3=0 obj#=-1 tim=703316389818
WAIT #140028098478296: nam='SQL*Net more data to client' ela= 73 driver id=1413697536 #bytes=32671 p3=0 obj#=-1 tim=703316389960

Summary Note

Oracle will use the negotiated SDU to do array fetches across a database link. Since a larger arraysize can (for large volumes of data) reduce the total work load on the remote database and on the network you may want the local server to have a specifically crafted tnsnames.ora entry and the remote server to expose a specific service with matching SDU size to help improve the efficiency of transferring data between two databases.

Lagniappe

If you want to look at other cases of how array fetching and SDU sizes interact, here are a couple of examples of using PL/SQL to execute SQL that fetches data across database links. The first is a simple, embedded “cursor for loop” that (apparently) does “row by row” procssing – although an enhancement appeared many versions ago to make this type of code use array fetching of 100 rows under the covers. The second demonstrates the effects of an explicit cursor with “fetch, bulk collect, limit”:


execute snap_events.start_snap

declare
        m_total number := 0;
        m_ct    number := 0;
begin
        for c1 in (select * from all_objects@d11g where rownum < 20000) loop
                m_total := m_total + c1.object_id;
                m_ct    := m_ct + 1;
        end loop;
end;
/

execute snap_events.end_snap

execute snap_events.start_snap

declare

        cursor c1 is  select * from all_objects@d11g where rownum <= 20000;

        type c1_array is table of c1%rowtype index by binary_integer;
        m_tab c1_array;

        m_ct number := 0;

begin
        open c1;
        loop
                fetch   c1
                bulk collect
                into    m_tab
                limit   1000
                ;

                exit when c1%notfound;
                m_ct := m_ct + 1;

        end loop;
        close c1;
end;
/

execute snap_events.end_snap

Checking the session events for these two test we see the following when the SDU_SIZE has been set at 16KB at both ends of the database link (which means the negotiated SDU will be 16KB):


Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
SQL*Net message from client                           4           0           0.66        .165         588 
SQL*Net message to dblink                           203           0           0.05        .000           0
SQL*Net message from dblink                         203           0          38.51        .190           5
SQL*Net message to client                             4           0           0.00        .000           0

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
SQL*Net message from client                           4           0           0.08        .021         588
SQL*Net message to dblink                            24           0           0.01        .000           0
SQL*Net message from dblink                          24           0           9.19        .383           5
SQL*Net more data from dblink                        40           0           8.47        .212           6
SQL*Net message to client                             4           0           0.00        .000           0

The critical information we can see in the first example is that getting 20,000 rows from the remote database “row by row” takes 200 (plus noise) round-trips – showing that Oracle really is converting our single-row processing loop in array fetches of 100 rows.

The second example shows us that (allowing for a little noise, again) it takes 20 messages to the remote database to fetch 20,000 rows – so 1,000 rows per message – but the SDU size is too small and we have to send each array in three consecutive packets, the “message from dblink” wait and two “more data from dblink” waits.

Footnote

I did point out that the session activity stats for theses tests weren’t interesting. There is, however, one detail worth mentioning since you could otherwise be fooled into thinking the number of packet travelling between the databases was smaller than it really was. There are two statistics about network roundtrips:


Name                                         Value
----                                         -----
SQL*Net roundtrips to/from client               18         
SQL*Net roundtrips to/from dblink               13

A “roundtrip” is a “message to / message from” pair. It doesn’t account for traffic due to “more data” being requested or sent.

 

April 8, 2019

Describe Upgrade

Filed under: 18c,Bugs,Infrastructure,Oracle,Performance — Jonathan Lewis @ 11:02 am BST Apr 8,2019

Here’s an odd little change between Oracle versions that could have a stunning impact on the application performance if the thing that generates your client code happens to use an unlucky selection of constructs.  It’s possible to demonstrate the effect remarkably easily – you just have to describe a table, doing it lots of times to make it easy to see what’s happening.

rem
rem     Script:         describe_18c.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2019
rem 

create table t1 as
select  *
from    all_objects
where   rownum = 1
;

set heading off
set feedback off
set pagesize 0
set linesize 156
set trimspool on
set termout off
set serveroutput on

execute snap_rowcache.start_snap
execute snap_libcache.start_snap

start start_10000
-- start_1 contains "describe t1"

set termout on
set serveroutput on

spool temp
execute snap_rowcache.end_snap
execute snap_libcache.end_snap

spool off

The start_10000 script is my mechanism for running a simple piece of code many times, and as the comment following it says, all I’m doing is repeating “describe t1”. The calls to “snap” something are examples of procedures I use to find the changes recorded in various dynamic performance views over short periods of time (there’s an example of the code for v$mystat here) In this case, as the names suggest, the snapshots record the changes in v$rowcache (the dictionary cache) and v$librarycache (the library cache). I’ve put a simple variant of the code at the end of the blog note so you don’t have to do all the preparation if you want to run a quick test for yourself.

Here are the results I get when running the test in Oracle 18.3.0.0

---------------------------------
Dictionary Cache - 05-Apr 19:00:00
Interval:-      27 seconds
---------------------------------
Parameter                 Sub# Usage Fixed    Gets  Misses   Scans  Misses    Comp    Mods Flushes
---------                ----- ----- -----    ----  ------   -----  --------------    ---- -------
dc_objects                         0     0 260,051       0       0       0       0       0       0
dc_users                           0     0 260,000       0       0       0       0       0       0
---------------------------------
Library Cache - 05-Apr 19:00:00
Interval:-      27 seconds
---------------------------------
Type      Description              Gets        Hits Ratio        Pins        Hits Ratio   Invalid    Reload
-----     -----                    ----        ---- -----        ----        ---- -----   -------    ------
NAMESPACE TABLE/PROCEDURE        10,003      10,003   1.0     280,028     280,028   1.0         0         0

Before showing you corresponding figures from 12.2.0.1 I’ll just point out that in version 18.3.0.0 of Oracle the structure of view all_objects gives me a table of 26 columns. Think about that and the 10,000 describes while looking at the number above, then look at the corresponding 12.2.0.1 results:

---------------------------------
Dictionary Cache - 05-Apr 19:00:00 
Interval:-      28 seconds
---------------------------------
Parameter                 Usage Fixed    Gets  Misses   Scans  Misses    Comp    Mods Flushes
---------                 ----- -----    ----  ------   -----  --------------    ---- -------
dc_users                      0     0       2       0       0       0       0       0       0
dc_objects                    0     0       9       0       0       0       0       0       0

---------------------------------
Library Cache - 05-Apr 19:04:17
Interval:-      28 seconds
---------------------------------
Type      Description              Gets        Hits Ratio        Pins        Hits Ratio   Invalid    Reload
-----     -----                    ----        ---- -----        ----        ---- -----   -------    ------
NAMESPACE TABLE/PROCEDURE        10,005      10,005   1.0      20,018      20,018   1.0         0         0

The internal mechanism of the “describe” call has changed between 12.2.0.1 to 18.3.0.0.

For each describe in 18.3, for each column in the table you see a “get” on dc_users and dc_objects in v$rowcache and you see one “get” on the TABLE/PROCEDURE namespace in v$librarycache, and (2 + number of columns) “pins”. In 12.2.0.1 there are no gets on the dictionary cache and only 1 get and two pins in the library cache for each describe.

As a couple of extra checks I modified the test to query from a 12c client to and 18c server, then from an 18c client to a 12c server. The big numbers appeared in the former test (i.e. when the server is 18c) and the small number for the latter (when the server is 12c). I also tried a couple of other variations on the theme:

  • If the table t1 doesn’t exist when I run the test then there are no gets on the row cache, and I see 2 gets and pins (with hits) on the library cache per describe.
  • If I run the test using “decribe other_schema.t1 the pins (and hits) on the library cache go up by 1 per describe
  • If I execute “alter session set current_schema = other_schema” so that “describe t1” is actually describing a table in another schema the pins (and hits) on the library cache go up by 1 per describe
  • If I run the test in the SYS schema, 18c behaves like 12c !! But SYS is often a little wierd compared to other schemas

Enabling the 10051 trace – I can see that both versions report an OPI call type = 119: “V8 Describe Any” for each call to “describe” (which, presumably, corresponds to the OCIDescribeAny() low-level function call). And that’s really where this blog started, and why lots of people might need to be aware (at least in the short term) of this change in behaviour across versions .

Welcome to the Real World.

My demonstration is clearly silly – no-one does hundreds of describes per second in a real application, surely. Well, not deliberately, and not necessarily with the intent to do a full describe, but sometimes n-tier development environments end up generating code that does things you might not expect. One such example is the way that JDBC can handle a requirement of the form:

insert into tableX( id, ......) values (my_sequence.nextval, ...) returning id into ?;

In the course of handling this requirement one of the coding strategies available to JDBC ends up executing the type 119 “V8 Describe Any” call. Imagine the effect this has when you have a couple of dozen concurrent sessions totalling a couple of million single row inserts per hour. The competition for library cache pins and row cache gets is massive – and the memory structures involved are all protected by mutexes. So when a a client of mine recently upgraded their system from 11.2.0.4 to 18.3.0.0 they saw “library cache: mutex X” waits change from a few hundred seconds per hour to tens of thousands of seconds, and “row cache mutex” leaping up  from nowhere in the “Top timed events” to reporting further even more thousands of seconds of wait time per hour.

The actual impact of this issue will depend very strongly on how much use you (or your ORM) makes of this construct. The problem may be particularly bad for my client because of the very large number of concurrent executions of a very small number of distinct statements that all address the same table. For low concurrency, or for a wide range of different tables and statements, you may not see so much contention.

If you are seeing contention for “row cache mutex” and “library cache: mutex X”, then a quick corroborative test (if you are licensed for the performance and dianostic packs) is to check the top_level_call# and top_level_call_name from v$active_session_history:

select
        top_level_call#, top_level_call_name, count(*)
from    v$active_session_history
group by
        top_level_call#, top_level_call_name
order by
        count(*)

If (119, ‘V8 Describe Any’) shows up as a significant fraction of the total then you’re probably looking at this issue.

Java is not my strong point – but here’s a trivial piece of standalone Java that you can use to demonstrate the issue if you’re familiar with running Java on the server. There are a few notes inline to explain necessary preparatory steps and code changes:


/*
        To create a class file, you need to execute
        javac temptest2.java

        This generates file temptest2.class
        If this is successful then execute
        java temptest {number of iterations} {commit frequency}

        e.g.
        java temptest2 10000 10

        To be able to compile, you need a CLASSPATH environment variable
        e.g. export CLASSPATH=$CLASSPATH:$ORACLE_HOME/jdbc/lib/ojdbc8.jar

        For java to see the created class the CLASSPATH must also include 
        the holding directory
        e.g. export CLASSPATH=$CLASSPATH:/mnt/working

        Example combined setting:
        export CLASSPATH=$CLASSPATH:$ORACLE_HOME/jdbc/lib/ojdbc8.jar:/mnt/working

        A schema will need to be created to hold two objects,
        And the connection string in the code below will need to be modified -
        and the listener has to be started and the database registered to it.

        Database objects:
        -----------------
        create sequence s1;

        create table test(
                id number, 
                code varchar2(32), 
                descr varchar2(32), 
                insert_user varchar2(32),
                insert_date date
        );

*/


import java.sql.*;
import oracle.jdbc.OracleDriver;
import java.util.Date;

public class temptest2
{
  public static void main (String arr[]) throws Exception
  {
    DriverManager.registerDriver(new oracle.jdbc.OracleDriver());
    Connection con = DriverManager.getConnection
          ("jdbc:oracle:thin:@localhost:1521:or18","test_user","test");

    Integer iters = new Integer(arr[0]);
    Integer commitCnt = new Integer(arr[1]);

    con.setAutoCommit(false);
    doInserts( con, iters.intValue(), commitCnt.intValue() );

    con.commit();
    con.close();
  }

  static void doInserts(Connection con, int count, int commitCount )
  throws Exception
  {

    int  rowcnt = 0;
    int  committed = 0;
    long start = new Date().getTime();

    for (int i = 0; i < count; i++ ) {
      PreparedStatement ps = con.prepareStatement(
           "insert into test (id, code, descr, insert_user, insert_date) " +
                     "values (s1.nextval,?,?, user, sysdate)",
           new String[]{"id"}
      );
      ps.setString(1,"PS - code" + i);
      ps.setString(2,"PS - desc" + i);
      ps.executeUpdate();

      ResultSet rs = ps.getGeneratedKeys();
      int x = rs.next() ? rs.getInt(1) : 0;
      System.out.println(x);
                
      rowcnt++;
      ps.close();

      if ( rowcnt == commitCount )
      {
        con.commit();
        rowcnt = 0;
        committed++;
      }
    }
    long end = new Date().getTime();
    con.commit();
    System.out.println
    ("pstatement " + count + " times in " + (end - start) + " milli seconds committed = "+committed);
  }
}

/*
 *
 * Sample from trace file after setting events 10046 and 10051:
 *
 * OPI CALL: type=119 argc= 7 cursor=  0 name=V8 Describe Any
 * OPI CALL: type=94 argc=38 cursor=  0 name=V8 Bundled Exec
 * PARSE #140693461998224:c=0,e=15,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3884345238,tim=1368184246
 * EXEC #140693461998224:c=0,e=135,p=0,cr=0,cu=4,mis=0,r=1,dep=0,og=1,plh=3884345238,tim=1368184411
 * OPI CALL: type=105 argc= 2 cursor=  0 name=Cursor close all
 * CLOSE #140693461998224:c=0,e=15,dep=0,type=3,tim=1368185231
 * OPI CALL: type=14 argc= 0 cursor=  0 name=COMMIT
 * XCTEND rlbk=0, rd_only=0, tim=1368185291
 * OPI CALL: type=119 argc= 7 cursor=  0 name=V8 Describe Any
 * OPI CALL: type=94 argc=38 cursor=  0 name=V8 Bundled Exec
 * PARSE #140693461998224:c=0,e=20,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3884345238,tim=1368187929
 * EXEC #140693461998224:c=0,e=162,p=0,cr=0,cu=4,mis=0,r=1,dep=0,og=1,plh=3884345238,tim=1368188141
 * OPI CALL: type=105 argc= 2 cursor=  0 name=Cursor close all
 * CLOSE #140693461998224:c=0,e=6,dep=0,type=3,tim=1368189336
 * OPI CALL: type=14 argc= 0 cursor=  0 name=COMMIT
 * XCTEND rlbk=0, rd_only=0, tim=1368189373
 *
*/


You’ll notice that I’ve prepared, executed and closed a statement inside a loop. The problem wouldn’t happen if I prepared the statement before the loop and closed it after the loop, doing nothing but the execute inside the loop; but the code is simply modelling the “single row processing” effect that typically appears through ORMs.

You’ll have to decide for yourself how to take snapshots of the dynamic performance views while this code is running, and how to emable tracing – but anyone who want to fiddle with the code is probably better at coding Java than I am – so it’s left as an exercise to the reader (I used a logon trigger for the traces, and snap_rowcache and snap_libcache from another session).

There is a certain cruel irony to this issue.  For years I have been telling people that

    insert into my_table(id, ...) values(my_sequence.nextval,...) returning id into :bind1;

is more efficient than:

    select my_sequence.nextval into :bind1 from dual;
    insert into my_table(id,.....) values(:bind1, ...);

If, at present, you’re using  Hibernate as your ORM it generates code that does the (inefficient, bad practice) latter and you won’t see the “describe” problem.

Footnote

If you want a very simple SQL*Plus script to see the effect – and have privileges to query v$rowcache and v$librarycache – here’s a hundred describes with a little wrapper to show the effect:

em
rem     The schema running this script must not be SYS
rem     but must be granted select on v_$rowcache and
rem     v_$librarycache. For the results to be clearly
rem     visible the test needs to be run while virtually
rem     nothing else is running on the instance.
rem
rem     In 18.3.0.0 every describe seems to access 
rem     dc_users, dc_objects, and pin the library cache
rem     once for every column in the table described
rem     (plus a fixed "overhead" of 2 pins and one get)
rem
rem     When run by SYS the counts fall back to the
rem     12.2  numbers -i.e. only two pins and one get
rem     on the libraray cache with no accesses to the 
rem     dictionary cache
rem
rem     The excess gets and pins disappear in 19.2, 
rem     thought the pin count on the library cache 
rem     goes up to 4 per describe.
rem

drop table t1 purge;
create table t1(
        n1 number,
        n2 number,
        n3 number,
        n4 number,
        n5 number,
        n6 number,
        n7 number
)
;


-- variant create table t1 as select * from all_objects where rownum = 1;


set serveroutput off
set linesize 167
set trimspool on


spool temp_desc
select namespace, gets, pins from v$librarycache where namespace = 'TABLE/PROCEDURE';
select parameter, gets from v$rowcache where parameter in ('dc_users','dc_objects') and subordinate# is null;
spool off

set termout off

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1
describe t1

set termout on
set serveroutput on

spool temp_desc append

select namespace, gets, pins from v$librarycache where namespace = 'TABLE/PROCEDURE';
select parameter, gets from v$rowcache where parameter in ('dc_users','dc_objects') and subordinate# is null;

spool off



set doc off
doc

Sample output from 18.3
=======================
NAMESPACE                                                              GETS       PINS
---------------------------------------------------------------- ---------- ----------
TABLE/PROCEDURE                                                       27449      71108


PARAMETER                              GETS
-------------------------------- ----------
dc_users                              17341
dc_objects                           115830


NAMESPACE                                                              GETS       PINS
---------------------------------------------------------------- ---------- ----------
TABLE/PROCEDURE                                                       27555      72017


PARAMETER                              GETS
-------------------------------- ----------
dc_users                              18041
dc_objects                           116533


Note change in rowcache gets - one per column per describe on each parameter.
Note change in library cache pins - (one per column + 2) per describe.

Sample output from 12.2
=======================
NAMESPACE                  GETS       PINS
-------------------- ---------- ----------
TABLE/PROCEDURE           13393      20318


PARAMETER                              GETS
-------------------------------- ----------
dc_users                               4889
dc_objects                            31413


NAMESPACE                  GETS       PINS
-------------------- ---------- ----------
TABLE/PROCEDURE           13504      20539


PARAMETER                              GETS
-------------------------------- ----------
dc_users                               4889
dc_objects                            31416


No change in v$rowcache
Only the same single get and the "+2" pins per describe in v$librarycache

#


The good news is that I sent this text to a colleague who has access to Oracle 19, and the problem goes away (almost completley) – there are just two extra pins on the library cache in Oracle 19 compared to Oracle 12, and no gets on the rowcache dc_users and dc_objects. This suggests that it’s a known issue (even though there’s no visible bug report, and the problem is still present in 18.5) so it may be possible to get a backport of the version 19 code for vesion 18 fairly quickly. If not the best temporary workaround is probably to bypass the ORM and manually code for a function call that processes an anonymous PL/SQL block – but I haven’t tested that idea yet.

There is a JDBC cursor cache available – and if this were enabled than the prepared statement that was closed by the code would be kept open by the JDBC cache (and, of course, still be present in Oracle’s v$open_cursor) and Oracle wouldn’t receive any further parse or “describe” calls. Unfortunately it seems that there’s a cursor leak (still) in the JDBC caching algorithm that will lead to sessions hitting Oracle error “ORA-01000: maximum open cursors exceeded.”

Acknowledgements.

I’d particularly like to thank Lasse Jenssen who spent a couple of hours with me (when he could have been attending some interesting sessions) at the OUG Ireland conference a few days ago, working through various tests and strategies to pin down the problem and attempt to circumvent it. (Any criticism of the Java code above should, nevertheless be aimed at me).

Update

This problem is now visible on MoS as: Bug 29628952 : INCREASED ROW CACHE GETS ON DC_OBJECTS AND DC_USERS FOR DESCRIBE IN 18C AND LATER.

It’s also visible as Bug 29628647 : INCREASED GETS FOR DC_OBJECTS AND DC_USERS FOR DESCRIBE, unfortunately this latter bug has been associated with version 19.2 – where the problem doesn’t exist so the analyst has reported back (quite corretly) with “I see no problem.”

 

March 21, 2019

Lost time

Filed under: Execution plans,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 12:51 pm GMT Mar 21,2019

Here’s a little puzzle that came up in the ODC database forum yesterday – I’ve got a query that has been captured by SQL Monitor, and it’s taking much longer to run than it should but the monitoring report isn’t telling me what I need to know about the time.

Here’s a little model to demonstrate the problem – I’m going to join a table to itself (the self join isn’t a necessary feature of the demonstration, I’ve just been a bit lazy in preparing data). Here’s a (competely truthful) description of the table:

SQL> desc t1
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ID                                     NUMBER
 MOD_1000                               NUMBER
 V1                                     VARCHAR2(40)
 V2                                     VARCHAR2(40)
 PADDING                                VARCHAR2(100)

SQL> select num_rows, blocks from user_tables where table_name = 'T1';

  NUM_ROWS     BLOCKS
---------- ----------
    400000       7798

1 row selected.


And here’s the text version of the output I get from dbms_monitor.report_sql_monitor() for the query running on 18.3.0.0


Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TEST_USER (261:36685)
 SQL ID              :  g6j671u7zc9mn
 SQL Execution ID    :  16777218
 Execution Started   :  03/21/2019 08:54:56
 First Refresh Time  :  03/21/2019 08:54:56
 Last Refresh Time   :  03/21/2019 08:55:17
 Duration            :  21s
 Module/Action       :  MyModule/MyAction
 Service             :  SYS$USERS
 Program             :  sqlplus@linux183.localdomain (TNS V1-V3)
 Fetch Calls         :  2

Global Stats
==========================================================================
| Elapsed |   Cpu   |    IO    | PL/SQL  | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Time(s) | Calls |  Gets  | Reqs | Bytes |
==========================================================================
|      21 |      21 |     0.04 |    0.07 |     2 |     3M |  215 | 180MB |
==========================================================================

SQL Plan Monitoring Details (Plan Hash Value=83896840)
==================================================================================================================================================
| Id |       Operation       | Name |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                       |      | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT      |      |         |      |        21 |     +1 |     1 |        1 |      |       |     . |          |                 |
|  1 |   FILTER              |      |         |      |        21 |     +1 |     1 |        1 |      |       |     . |          |                 |
|  2 |    HASH JOIN OUTER    |      |     400 | 2014 |        21 |     +1 |     1 |      400 |      |       |   2MB |          |                 |
|  3 |     TABLE ACCESS FULL | T1   |     400 | 1005 |         1 |     +1 |     1 |      400 |   62 |  60MB |     . |          |                 |
|  4 |     TABLE ACCESS FULL | T1   |    400K | 1001 |        21 |     +1 |     1 |     400K |      |       |     . |          |                 |
==================================================================================================================================================

As you can see the total elapsed time 21 seconds of which the CPU time is the whole 21 seconds.

This seems a little large for a simple hash join so we should look at the “Activity Detail” section of the plan because that will report any ASH samples that can be found for this execution of this query. And that’s where the problem lies: despite using 21 CPU seconds in 21 seconds there are no ASH samples for the execution! Moreover – looking for other numerical oddities – the plan says we read 60MB in 62 read requests (that’s the first tablescan of of the 7,798 blocks of t1), but the summary says we read 180MB – where did we lose (or gain) 120MB ?

It might help to see the query (which is why I didn’t show it in the output above) and it might help to see the predicate section (which SQL Monitor doesn’t report). So here’s the query and its plan, with the resulting predicate section, pulled from memory:


SQL_ID  g6j671u7zc9mn, child number 0
-------------------------------------
select  /*+ monitor */  t1a.mod_1000, t1b.mod_1000 from  t1 t1a,  t1
t1b where  t1a.mod_1000 = 500 and t1b.id(+) = t1a.id and
nvl(t1b.mod_1000,0) + f2(t1a.id) + f3(t1a.id) > 0

Plan hash value: 83896840

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |       |       |  2014 (100)|          |
|*  1 |  FILTER             |      |       |       |            |          |
|*  2 |   HASH JOIN OUTER   |      |   400 |  7200 |  2014   (4)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| T1   |   400 |  3600 |  1005   (3)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T1   |   400K|  3515K|  1001   (3)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NVL("T1B"."MOD_1000",0)+"F2"("T1A"."ID")+"F3"("T1A"."ID")>0)
   2 - access("T1B"."ID"="T1A"."ID")
   3 - filter("T1A"."MOD_1000"=500)


Notice the two function calls that are part of the FILTER operation. That’s where the time is going – but I don’t think you can infer that from the SQL Monitor report (which rather suggests that the problem might be with the second full tablescan of t1 at operation 4).

Functions f2() and f3() both do a table scan of another table that is as large as t1 – and it looks as if they’re both going to be called 400 times, which is a lot of work. The 120MB of “lost” reads is the first table scan of each of the two tables, the 3M buffer gets (in the summary, if you didn’t notice it) is from the repeated tablescans as each row comes out of the hash join and the filter is applied.

Why doesn’t the report show us the CPU (and disk) samples? Because it’s querying ASH (v$active_session_history) by the SQL_ID of the principle SQL statement – and the work is being done by two other statements with different SQL_IDs.

Since it’s PL/SQL functions doing all the work why isn’t the 21 CPU seconds showing up in the “PL/SQL time(s)” summary figure? Because the time isn’t being spent in PL/SQL, it’s being spent in the SQL being run by the PL/SQL.

Just to finish off, let’s try to find the SQL being run by the PL/SQL. I’ll post the script to recreate the example at the end of the posting but for the moment I’ll just show you the query against v$sql that I ran to extract the SQL that’s embedded in the functions:


select  sql_id, executions, disk_reads, buffer_gets, sql_text
from    V$sql 
where   sql_text like 'SELECT%T2%' 
or      sql_text like 'SELECT%T3%'
;

SQL_ID        EXECUTIONS DISK_READS BUFFER_GETS SQL_TEXT
------------- ---------- ---------- ----------- --------------------------------------------------
12ytf1rry45d9        400       7683     3072817 SELECT MAX(ID) FROM T2 WHERE MOD_1000 = :B1
85dmsgqg3bh4w          1       7680        7698 SELECT MAX(ID) FROM T3 WHERE MOD_1000 = :B1

As you can see, there are roughly 120MB of disk I/O and 3M buffer gets due to these two statement – and one of them has run the 400 times we expected. It looks as if Oracle has done a cute little optimisation with the other function, though. If you look at the predicate it says:

  • NVL(“T1B”.”MOD_1000″,0)+”F2″(“T1A”.”ID”)+”F3″(“T1A”.”ID”)>0

It looks as if Oracle has either cached the result of the f3() function call (in which case why didn’t it also cache the f2() result) or it’s stopped evaluating the predicate as soon as the running total exceeded zero (but that might be a problem since f3() could return a negative number !). I don’t know exactly why Oracle is doing what it’s doing – but if I reverse the order of the functions in the predicates the f3() query will run 400 times and the f2() query will run once.

Update

Following the comment below from “anonymous”, pointing out that the “Active” SQL Monitor (and, of course, the equivalent OEM screen) has a section showing the impact of everything the session has been doing while the query ran I re-ran my call to dbms_monitor.report_sql_monitor() with the “type” parameter set to ‘ACTIVE’ rather than ‘TEXT’. After spooling this to a text file (set linesize 255, set pagesize 0) with the suffix “html” and trimming the top and bottom off so that the file started and ended with opening and closing html tags, I opened it in Firefox.

You have to have Adobe Flash Player for this to work, and you have to be connected to the Internet as the file contains content that will call to Oracle’s home domain. After getting Adobe flash to work, here’s a snapshot of the resulting display:

The print may be a little small, but in the lower half of the screen (which usually shows the plan with execution statistics) I’ve selected the “Activity” tab, and this has produced a graphic display of all the samples captured for the session while the query was running. I took this report after re-running the test case, but I had swapped the order of the two functions in the interim, so this output is showing that (from Oracle’s perspective) all the work done during the execution of the query was done by a statement with SQL_ID 85dmsgqg3bh4w.

Appendix

If you want to re-run some tests here’s the code I used to create the demo:


create table t1
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,1000)              mod_1000,
        lpad(rownum,10,'0')             v1,
        lpad(rownum,10,'0')             v2,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 4e5 -- > comment to avoid WordPress format issue
;

-- alter table t1 add constraint t1_pk primary key(id);

create table t2 nologging as select * from t1;
create table t3 nologging as select * from t1;

create function f2 (i_in number) return number
as
        m_ret number;
begin
        select max(id)
        into    m_ret
        from    t2
        where   mod_1000 = i_in
        ;

        return m_ret;
end;
/

create function f3 (i_in number) return number
as
        m_ret number;
begin
        select  max(id)
        into    m_ret
        from    t3
        where   mod_1000 = i_in
        ;

        return m_ret;
end;
/


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T3',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter system flush buffer_cache;

variable b1 number
variable b2 number

set timing on

select
        /*+ monitor */
        t1a.mod_1000, t1b.mod_1000
from
        t1      t1a,
        t1      t1b
where
        t1a.mod_1000 = 500
and     t1b.id(+) = t1a.id
and     nvl(t1b.mod_1000,0) + f3(t1a.id) + f2(t1a.id) > 0
;


One interesting little detail, if you try the tests, is that the join has to be an outer join for the FILTER operation to appear.

March 6, 2019

12c Snapshots

Filed under: 12c,Oracle,Partitioning,Performance — Jonathan Lewis @ 10:35 am GMT Mar 6,2019

I published a note a few years ago about using the 12c “with function” mechanism for writing simple SQL statements to takes deltas of dynamic performance views. The example I supplied was for v$event_histogram but I’ve just been prompted by a question on ODC to supply a couple more – v$session_event and v$sesstat (joined to v$statname) so that you can use one session to get an idea of the work done and time spent by another session – the first script reports wait time:


rem
rem     Program:        12c_with_function_2.sql
rem     Dated:          July 2013
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session WAIT time
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 121, 127'

set timing on
set sqlterminator off

set linesize 180

break on sid skip 1

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        sum(total_waits),
        sum(total_timeouts), 
        sum(time_waited), 
        event
from    (
        select
                sid, event_id, 
                -total_waits total_waits, 
                -total_timeouts total_timeouts, 
                -time_waited time_waited, 
                -time_waited_micro time_waited_micro, 
                event
        from    v$session_event
        where   sid in ( &m_sid_list )
        union all
        select
                null, null, null, null, null, wait_row(&m_snap_time, 0), null
        from    dual
        union all
        select
                sid, event_id, total_waits, total_timeouts, time_waited, time_waited_micro, event
        from    v$session_event
        where   sid in ( &m_sid_list )
        )
where
        time_waited_micro != 0
group by
        sid, event_id, event
having
        sum(time_waited) != 0
order by
        sid, sum(time_waited) desc
/


And this one reports session activity:

rem
rem     Program:        12c_with_function_3.sql
rem     Dated:          July 2013
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session stats
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 13, 357'


set timing on
set sqlterminator off

set linesize 180

break on sid skip 1
column name format a64

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        name,
        sum(value)
from    (
        select
                ss.sid, 
                ss.statistic#,
                sn.name,
                -ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        union all
        select
                null, null, null, wait_row(&m_snap_time, 0)
        from    dual
        union all
        select
                ss.sid, ss.statistic#, sn.name, ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        )
where
        value != 0
group by
        sid, statistic#, name
having
        sum(value) != 0
order by
        sid, statistic#
/


You’ll notice that I’ve used dbms_lock.sleep() in my wait function – and the session running the SQL can be granted the execute privilege on the package through a role to make this work – but if you’re running Oracle 18 then you’ve probably noticed that the sleep() function and procedure have been copied to the dbms_session package.

 

Next Page »

Powered by WordPress.com.