Oracle Scratchpad

February 4, 2020

maxquerylen

Filed under: Bugs,Infrastructure,Oracle,undo — Jonathan Lewis @ 3:50 pm GMT Feb 4,2020

The view v$undostat holds summary information about undo activity that the automatic undo mechanism can use to optimise the undo retention time (and hence undo space allocation). The view holds one row for every ten minute interval in the last 4 days (96 hours) and includes two columns called maxquerylen and maxqueryid – which tell you something about the query that was considered to be the longest running query active in the interval.

In this note I want to explain why the contents of these two columns are sometimes (possibly often) completely irrelevant, despite there being a few notes on the internet about how you should investigate them to help you decide on a suitable setting for the undo_retention parameter.

The descriptions of these columns in the 19c reference manual are as follows:

  • maxquerylen – Identifies the length of the longest query (in seconds) executed in the instance during the period. You can use this statistic to estimate the proper setting of the UNDO_RETENTION initialization parameter. The length of a query is measured from the cursor open time to the last fetch/execute time of the cursor. Only the length of those cursors that have been fetched/executed during the period are reflected in the view.
  • maxqueryid – SQL identifier of the longest running SQL statement in the period

It would be hard to misunderstand the meaning of the second column – if the first column tells us that Oracle has spotted a “longest query” then the second column gives us the sql_id so we can check v$sql to find out what it was. But what sort of queries are going to show up as the “longest query” in the interval?

Here’s an example from a few hours of a nearly idle instance, querying the begin and end times (formatted to show just day of month + time) along with the two critical columns:


select  begin_time, end_time, maxquerylen, maxqueryid 
from    v$undostat 
order by 
    begin_time
;

BEGIN_TIME	END_TIME    MAXQUERYLEN MAXQUERYID
--------------- ----------- ----------- -------------
...
04 10:50:18	04 11:00:18	      0
04 11:00:18	04 11:10:18	    356 f3yfg50ga0r8n
04 11:10:18	04 11:20:18	    883 25u1mbkcr9rnu
04 11:20:18	04 11:30:18	   1486 25u1mbkcr9rnu
04 11:30:18	04 11:40:18	   2090 25u1mbkcr9rnu
04 11:40:18	04 11:50:18	      0
04 11:50:18	04 12:00:18	   3299 25u1mbkcr9rnu
04 12:00:18	04 12:10:18	   3903 25u1mbkcr9rnu
04 12:10:18	04 12:20:18	   4507 25u1mbkcr9rnu
04 12:20:18	04 12:30:18	      0
04 12:30:18	04 12:40:18	      0
04 12:40:18	04 12:50:18	      0
04 12:50:18	04 13:00:18	      0
04 13:00:18	04 13:10:18	      0
04 13:10:18	04 13:20:18	      0
04 13:20:18	04 13:30:18	      0
04 13:30:18	04 13:37:27	   9035 25u1mbkcr9rnu

173 rows selected.

Notice, particularly, that the SQL_ID 25u1mbkcr9rnu disappears from the 11:40 interval, then re-appears at 11:50, then disappears again from 12:20 through 13:20 (lunchtime), then reappears again at 13:30. And when it reappears after an absence the query length has increased in a way that’s consistent with the gap. So it looks as if the query wasn’t running during the gap, but turns out to have been running after the gap ended. (Quantum queries, perhaps?)

The explanation is in the detail of the definition: “from the cursor open time to the last fetch/execute time”. From an SQL*Plus session I “set pause on” then executed the query “select * from all_objects” and hit return a few times to get a few pages of output. Then, spread over the next couple of hours, I hit return a few more times to get a few more pages of output. Each time I hit return my session does another fetch call, and the code behind v$undostat notices that my query is still active.
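If you want to recreate the effect, here’s a minimal sketch of the test (the pause message and pagesize are arbitrary choices):

set pause on
set pause 'Hit return for the next page ...'
set pagesize 40

select * from all_objects;

-- hit return occasionally over the next couple of hours, and from
-- another session watch the two critical columns changing:

select  begin_time, end_time, maxquerylen, maxqueryid
from    v$undostat
order by
        begin_time
;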

I don’t know exactly how Oracle is keeping track of “active” statements because there seem to be some inconsistencies in the reporting (and I’ll comment on those later), but as a first approximation, until you close a cursor (either explicitly or implicitly) some piece of Oracle’s code registers the fact that the query might do further fetches, which means it might need to apply more undo to make current data read-consistent with the cursor’s start SCN, which means that it should take note of the time the cursor has been open because the undo retention time might need to be that long.

Inconsistencies.

I said there were some inconsistencies in the reporting. I’ve noticed three anomalies – perhaps due to the extreme idleness of the instance I was using for testing.

  1. At about 12:45 I hit return a few times to get the maxquerylen and maxqueryid refreshed – but the code didn’t seem to notice that I had done a few more fetches of a long running query. So it seems to be possible for Oracle to miss the query that should be reported.
  2. At about 11:52 I hit return a few times again, and you can see that the two critical columns were refreshed, but if you do the arithmetic Oracle has attributed 600 seconds to the query – the whole of the 10 minute refresh interval, not just the time up to the moment of the last fetch that I had done in that interval.
  3. I didn’t hit return again until about 12:45 (the moment in point 1 above when the two columns didn’t refresh the way I thought they should), but the columns kept updating through the 12:00 and 12:10 intervals anyway before disappearing from the 12:20 interval. So it looks like queries can be reported as long running even when they haven’t been executing or fetching.

There is one last anomaly – and this relates to the reason I started looking at these columns. If you check the entry for 11:00 you’ll see that the SQL Id ‘f3yfg50ga0r8n’ has been reported as “running” for 356 seconds. But here’s what I found in v$sql for that query:


select  executions, parse_calls, fetches, end_of_fetch_count, elapsed_time, cpu_time, rows_processed, sql_text 
from    v$sql 
where   sql_id = 'f3yfg50ga0r8n'
;

EXECUTIONS PARSE_CALLS	  FETCHES END_OF_FETCH_COUNT ELAPSED_TIME   CPU_TIME ROWS_PROCESSED SQL_TEXT
---------- ----------- ---------- ------------------ ------------ ---------- -------------- ------------------------------------------
        79          79         79                  0        20487      10667             79 select obj# from obj$ where dataobj# = :1


The SQL looks like a sys-recursive statement which, in 79 executions, has accumulated 20 milliseconds of elapsed time (rather than 356 seconds – but that difference could just be related to one or other of the anomalies I reported earlier). The key thing to note is that the value of column end_of_fetch_count is zero: this looks like a statement where Oracle has simply not bothered to fetch all the data and has then not bothered to close the cursor. As a result it’s possible that each time the statement is executed (note that parse_calls = executions, so we’re not looking at a “held” cursor) the code behind v$undostat looks back at the time the cursor was initially opened to measure the query length, rather than looking at the time the statement was re-executed.

This may go a long way to answering the question that came up on Oracle-l earlier on today as follows:

The following query (sql_id is 89w8y2pgn25yd) was recorded in v$undostat.maxqueryid in a 12.2 database during a period of high undo usage: select ts# from sys.ts$ where ts$.online$ != 3 and bitand(flags,2048) != 2048;

select
        undoblks,txncount,maxquerylen,maxconcurrency,activeblks
from    v$undostat u
where   maxqueryid='89w8y2pgn25yd'
;

UNDOBLKS  TXNCOUNT  MAXQUERYLEN  MAXCONCURRENCY  ACTIVEBLKS
--------  --------  -----------  --------------  ----------
   39199      4027         1378               5     2531960

What is this query, and how does it manage to report a maximum query length of 1,378 seconds? Just like the one above it’s a sys-recursive query, and this one appears when you query dba_tablespaces – it executes once for every row and takes just a fraction of a second each time. But if you trace a query like “select tablespace_name from dba_tablespaces” you’ll see that every time the query is called the trace file will show lines for: “Parse, Exec, Fetch, Close” until the last call – which doesn’t report a “Close”.
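If you want to see the call pattern for yourself, here’s a minimal sketch of the trace test – the Parse/Exec/Fetch/Close lines for the recursive statement will be visible in the raw trace file:

alter session set events '10046 trace name context forever, level 8';

select tablespace_name from dba_tablespaces;

alter session set events '10046 trace name context off';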

Just like my original “select * from all_objects” there’s a query dangling with an open cursor, leaving Oracle with the opportunity to go back to the moment it was opened and report it as a “long running query”.

tl;dr

The maxquerylen and maxqueryid in v$undostat don’t tell you about statements that have taken a long time to change a lot of data and generate a lot of undo; they tell you about statements that might need to apply a lot of undo to see read-consistent data and therefore might become victims of an “ORA-01555: snapshot too old” error.

The statements reported may be completely irrelevant because there are various reasons (some, possibly, bug-related) why a cursor may be treated as if it opened a long time in the past when it was actually a re-use of an existing “dangling” cursor. It’s also possible that some bad client code will treat cursors in a way that does no harm to the client program but hides a more interesting query that would otherwise have been reported by these two columns.

Update Feb 2020

I don’t know why I didn’t think of searching MOS before poking around at this issue, but I’ve just done a search on “maxquerylen maxqueryid” and one of the first Knowledgebase articles to show up – dated September 2019 – was Doc ID 2005931.1: Wrongly calculated MAXQUERYLEN Causing ORA-01555 or ORA-30036 Errors.  (Note – the 30036 is the LOB equivalent of 1555).

The document reports several bugs all relating to the same issue, and all closed as duplicates of unpublished bug 12314489: OCIRELEASESTMT() TO PUT THE CURSORS IN CURBOUND STATE WHEN PLACING THEM IN CACHE. That bug is itself closed and replaced by bug 26833932, which is also unpublished and doesn’t even have a description.

January 27, 2020

ANSI flashback

Filed under: ANSI Standard,Bugs,Flashback,Oracle — Jonathan Lewis @ 9:45 am GMT Jan 27,2020

I am seeing “traditional” Oracle SQL syntax being replaced by “ANSI”-style syntax far more frequently than I used to – so I thought I’d just flag up another reminder that you shouldn’t be too surprised if you see odd little glitches in ANSI style that don’t show up when you translate to traditional. If your SQL throws an unexpected error (and if it’s only a minor effort to modify the code for testing purposes) it might be a good idea to see if the problem goes away when you switch styles. Today’s little glitch is one that showed up on the Oracle-l listserver 7 years ago running 11.2.0.3, and the anomaly still exists in 19c.

As so often happens it’s a problem that appears in one of the less commonly used Oracle features – in this case flashback queries. We’ll start by creating a couple of tables, then pausing for thought (Note: this code is little more than a cosmetic rewrite of the original posting on Oracle-l):


rem
rem     Script:         ansi_flashback_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table t1 
as
select  * 
from    all_objects
where   rownum <= 10000 -- > comment to avoid wordpress format issue
;

create table t2
as
select  *
from    t1
where   rownum <= 10 -- > comment to avoid wordpress format issue
;
 
prompt  =======================
prompt  Sleeping for 10 seconds
prompt  =======================

execute dbms_lock.sleep(10)

column current_scn new_value m_scn format a15

select  to_char(current_scn,'99999999999999') current_scn 
from    v$database
/

 
select 
        v1.object_name
from 
        t1 as of scn &m_scn v1
join 
        t2 as of scn &m_scn v2
on 
        v2.object_id = v1.object_id
/

I’ve created a couple of tables then introduced a 10 second sleep before checking the database SCN. The sleep is there because I want to be able to query the tables “as of SCN” and if I don’t pause for a little while (typically about 5 seconds) the code will probably raise Oracle error ORA-01466: unable to read data – table definition has changed.

The query I want to use references both tables as of the same SCN, using “ANSI” syntax to do the join. The query behaves perfectly reasonably when run from SQL*Plus; the problem starts to appear when I try to embed the query as a cursor in a PL/SQL procedure. First I’ll copy the SQL exactly as it is (with the substitution variable) into a procedure declaration. The variable will be replaced in both places by an actual value before the procedure is created, as the subsequent check of user_source will show:


create or replace procedure p1( myscn in varchar2 ) as

        cursor c1 is 
                select  v1.object_name
                from 
                        t1 as of scn &m_scn v1
                join 
                        t2 as of scn &m_scn v2
                on 
                        v2.object_id = v1.object_id
        ;

        l_row c1%rowtype;

begin
        open c1;
        fetch c1 into l_row;
        dbms_output.put_line(l_row.object_name);
        close c1;
end;
/
 
select  text 
from    user_source
where   type = 'PROCEDURE'
and     name = 'P1'
order by 
        line
/

execute p1 ('0')

The procedure compiles successfully and the query against user_source shows it stored as follows (note, particularly, that an actual value has been stored for the SCN):


procedure p1( myscn in varchar2 ) as

        cursor c1 is
                select  v1.object_name
                from
                        t1 as of scn  12670394063090 v1
                join
                        t2 as of scn  12670394063090 v2
                on
                        v2.object_id = v1.object_id
        ;

        l_row c1%rowtype;

begin
        open c1;
        fetch c1 into l_row;
        dbms_output.put_line(l_row.object_name);
        close c1;
end;

Next we recreate the procedure but replace the substitution variable with the name of the incoming formal parameter:


create or replace procedure p1( myscn in varchar2 ) as

        cursor c1 is
                select  v1.object_name
                from 
                        t1 as of scn myscn v1
                join 
                        t2 as of scn myscn v2
                on 
                        v2.object_id = v1.object_id
        ;

        l_row c1%rowtype;

begin
        open c1;
        fetch c1 into l_row;
        dbms_output.put_line(l_row.object_name);
        close c1;
end;
/
 
show errors

You’ll notice that instead of doing a test execution of the procedure I’ve called “show errors”. This is because the procedure won’t compile and reports “Warning: Procedure created with compilation errors” with the following output from the call to show errors:


Errors for PROCEDURE P1:

LINE/COL ERROR
-------- -----------------------------------------------------------------
3/9      PLS-00341: declaration of cursor 'C1' is incomplete or malformed
4/3      PL/SQL: ORA-00984: column not allowed here
4/3      PL/SQL: SQL Statement ignored
13/8     PL/SQL: Item ignored
17/2     PL/SQL: SQL Statement ignored
17/16    PLS-00320: the declaration of the type of this expression is
         incomplete or malformed

18/2     PL/SQL: Statement ignored
18/23    PLS-00320: the declaration of the type of this expression is
         incomplete or malformed

So we check to see if the same warning and list of errors appear if I switch to “traditional” Oracle syntax:


create or replace procedure p1( myscn in varchar2 ) as

        cursor c1 is
                select  v1.object_name
                from 
                        t1 as of scn myscn v1,
                        t2 as of scn myscn v2
                where 
                        v2.object_id = v1.object_id
        ;

        l_row c1%rowtype;

begin
        open c1;
        fetch c1 into l_row;
        dbms_output.put_line(l_row.object_name);
        close c1;
end;
/
 
execute p1 (&m_scn)

The answer is no. This version of the query is accepted by the PL/SQL compiler, and the call to execute it with the SCN supplied in the substitution variable produces the expected results.

Is there anything we can do to stick with ANSI style syntax? Almost invariably the answer will be yes. Here’s a little workaround in this case:

create or replace procedure p1( myscn in varchar2 ) as

        cursor c1 is
                select  v1.object_name
                from 
                        (select * from t1 as of scn myscn) v1
                join 
                        (select * from t2 as of scn myscn) v2
                on 
                        v2.object_id = v1.object_id
        ;

        l_row c1%rowtype;

begin
        open c1;
        fetch c1 into l_row;
        dbms_output.put_line(l_row.object_name);
        close c1;
end;
/
 
execute p1 (&m_scn)

We simply embed each “as of scn” clause inside an inline view and then join the views. If you enable the CBO (10053) trace before executing this version of the procedure you’ll find that the final “unparsed” SQL produced by the optimizer has, of course, been translated back into the traditional syntax.
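For reference, this is roughly how I’d wrap the procedure call to capture that trace – just a sketch, and be warned that the 10053 trace file can get very large:

alter session set events '10053 trace name context forever, level 1';

execute p1 (&m_scn)

alter session set events '10053 trace name context off';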

Warning: it’s fairly likely that this workaround will do what you want, but it’s possible that in a few cases it may result in a different execution plan from the one you were expecting (or would get from traditional syntax).

 

January 23, 2020

WITH Subquery

Filed under: 12c,Infrastructure,Oracle — Jonathan Lewis @ 8:37 pm GMT Jan 23,2020

Here’s another anomaly that appears when you mix and match Oracle features. In this case it’s “With” subqueries (common table expressions / CTEs) and Active Dataguard (ADG) Standby databases. The problem appeared on the Oracle-l listserver and luckily for the OP another member of the list had seen it before and could point to a relevant MOS document id which explained the issue and supplied a workaround.

The OP had their standby database opened read-only for reporting and found the following oddity in the extended SQL trace file for one of their reports:


WAIT #140196872952648: nam='db file scattered read' ela= 1588 file#=4097 block#=579715946 blocks=128 obj#=-39778567 tim=17263910670242
WAIT #140196872952648: nam='db file scattered read' ela= 1495 file#=4097 block#=579715947 blocks=128 obj#=-39778567 tim=17263910672065
WAIT #140196872952648: nam='db file scattered read' ela= 1671 file#=4097 block#=579715948 blocks=128 obj#=-39778567 tim=17263910674042
WAIT #140196872952648: nam='db file scattered read' ela= 1094 file#=4097 block#=579715949 blocks=128 obj#=-39778567 tim=17263910675443

Before pointing out the oddity (if you haven’t spotted it already) I’ll just explain a few of the numbers that are a little unusual.

  • File# = 4097: the user has parameter db_files = 4096, so this is the first Temp file.
  • Block# = 579,715,946: the database is 120TB, and the temporary tablespace is a “bigfile” tablespace so it’s okay for the file to hold more than 579M blocks.
  • Obj# < 0: negative object numbers are a characteristic of materialized CTEs: if you look at the execution plan a materialized CTE will be reported as a table with a name like SYS_TEMP_FDA106F9_E259E68. If you take the first hexadecimal number and treat it as a 32-bit signed integer you get the value that would be reported as the obj# in the trace file. (Converting to decimal and subtracting power(2,32) is one way of doing the arithmetic – see the worked example just after this list.)
  • tim= nnnnnnnn:  this is the timestamp (usually in microseconds), and we can see intervals of roughly 1,400 to 2,000 microseconds between these lines.
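Here’s the worked example of that object number arithmetic, using the first hex number from the temporary table name above – the result matches the obj# in the trace lines:

select to_number('FDA106F9','XXXXXXXX') - power(2,32) obj# from dual;

      OBJ#
----------
 -39778567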

So here’s the oddity: in this set of 4 consecutive waits we’re waiting for multiblock reads of 128 blocks – but each read starts one block after the previous read. It’s as if Oracle is reading 128 blocks and forgetting everything after the first one. And the timestamps are significant because they tell us that this isn’t a case of Oracle spending so much time between reads that the other blocks fall off the end of  the buffer cache before the query reaches them.

I think I’ve seen a pattern like this once before but it would have been quite a long time ago and I can’t find any notes I might have made about it (and it turns out that my previous experience was not relevant to this case). Fortunately another member of Oracle-l had also seen the pattern and supplied the solution through a reference to a MOS document that led to: Doc ID 2251339.1 With Subquery Factorization Temp Table Does Not Cache in Standby in 12.1.0.2.

It’s not a bug – Oracle is supposed to do this if you manage to materialize a CTE in a read-only standby database. I don’t understand exactly why there’s a problem, but thanks to some detail of how consistent reads operate, and of how block SCNs are generated when you populate the blocks of the global temporary table (GTT) that holds your materialized CTE, it’s possible for Oracle to produce wrong results if it re-visits blocks that have been read into the cache from the GTT. So when Oracle does a multiblock read during a tablescan of the GTT it can use the first block it has read (presumably because it’s immediately pinned), but can’t use the remaining 127 – and so you get the odd pattern of consecutive blocks appearing at the start of consecutive multiblock reads.

This raises a couple of interesting (and nasty) questions.

  • First – does every 128-block read go to the mid-point of the buffer cache, pushing another 128 blocks out of the cache, or does Oracle automatically read the blocks to the “cold” end of the LRU, minimising the impact on the rest of the cache? We hope it’s the latter.
  • Second – if I use a small fetch size while running my query might I find that I have to re-read the same block (with its 127 neighbours) many times, because Oracle releases any pinned blocks at the end of each fetch and has to re-acquire the blocks on the next fetch?

If anyone wants to test the second question by running a query from SQL*Plus with extended trace enabled the following simple query should answer the question:

alter session set events '10046 trace name context forever, level 8';
set arraysize 2

with v1 as (select /*+ materialize */ * from all_objects)
select object_name from v1;

Workarounds

There is a workaround to the issue – you can add the hint /*+ inline */ to the query to ensure that the CTE is not materialized. There is a bit of a catch to this, though (on top of the fact that you might then need to have two slightly different versions of the code if you want to run the query on production and standby) – if Oracle places the subquery text inline the optimizer may manage to merge it into the rest of the query and come up with a bad execution plan. Again you can probably work around this threat by extending the hint to read: /*+ inline no_merge */. Even then the optimizer could decide it has better statistics about the “real” table columns that it might have lost when it materialized the subquery, so it could still produce a different execution plan from the materialized plan.
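Applied to the little test query above, the hinted version would be a sketch like this – the hints go inside the CTE’s own query block:

with v1 as (select /*+ inline no_merge */ * from all_objects)
select object_name from v1;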

As an alternative (and somewhat more brutal) workaround you could set the hidden parameter “_with_subquery” to inline either at the session or system level, or in the startup parameter file.
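For example – with the usual warning that you should check with Oracle Support before using a hidden parameter in production – something like:

alter session set "_with_subquery" = inline;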

 

December 24, 2019

Flashback Archive

Filed under: Bugs,Flashback,Oracle — Jonathan Lewis @ 8:33 pm GMT Dec 24,2019

A classic example of Oracle’s “mix and match” problem showed up on the Oracle Developer Forum a few days ago. Sometimes you see two features that are going to be really helpful in your application – and when you combine them something breaks. In this case it was the combination of Virtual Private Database (VPD/FGAC/RLS) and Flashback Data Archive (FDA/FBA) that resulted in the security predicate not being applied the way you would expect, hence allowing users to see data they were not supposed to see.

The OP supplied us with a model (based in part on Tim Hall’s FDA article) to demonstrate the issue on 11.2.0.4, and I’ve hacked it about a bit to explain it here, and to test it on 12.2.0.1 and 19.3.0.0 where the same failure occurs.

I’m going to start with just the VPD part of the setup before adding in the FDA. Most of the code has been written to run as the SYS user and it creates a new tablespace and a couple of users so you may want to do some editing before you try any tests. There’s also a short script at the end of the blog to remove the flashback data archive, tablespace, and users – again, something to be run by SYS.

You’ll note that this script assumes you already have a tablespace called test_8k_assm, and a temporary tablespace called temp. The latter may well be a valid assumption, the former probably isn’t.

rem
rem     Script:         vpd_fda_bug.sql
rem     Author:         Duncs (ODC)
rem     Dated:          Dec 2019
rem
rem     Last tested
rem             19.3.0.0
rem             12.2.0.1
rem             11.2.0.4
rem
rem     Notes
rem     Has to be run as SYS
rem
rem     See also
rem     https://community.oracle.com/thread/4307453
rem


create user vpd_test_data_owner identified by Password_1234 
        default tablespace test_8k_assm
        temporary tablespace temp 
        quota unlimited on test_8k_assm
;
 
grant 
        connect,
        resource, 
        create any context
to
        vpd_test_data_owner
;

grant
        execute on dbms_rls
to
        vpd_test_data_owner
;
 
 
create table vpd_test_data_owner.person (
        person_id       number, 
        surname         varchar2(30), 
        unit_id         number
);

insert into  vpd_test_data_owner.person values (-1, 'One',  -1);
insert into  vpd_test_data_owner.person values (-2, 'Two',  -2);
insert into  vpd_test_data_owner.person values (-3, 'Three',-3);
insert into  vpd_test_data_owner.person values (-4, 'Four', -4);
insert into  vpd_test_data_owner.person values (-5, 'Five', -5);

commit;

create user vpd_test_function_owner identified by Password_1234
        default tablespace test_8k_assm 
        temporary tablespace temp 
        quota unlimited on test_8k_assm
;
 
grant 
        connect, 
        resource
to 
        vpd_test_function_owner
;
 
prompt  ============================================
prompt  Create a packaged function to set a context
prompt  that we will use in a VPD security predicate
prompt  ============================================

create or replace package vpd_test_function_owner.context_api_pkg AS

procedure set_parameter(
        p_name  in  varchar2,
        p_value in  varchar2
);

end context_api_pkg;
/
 
create or replace package body vpd_test_function_owner.context_api_pkg IS
 
procedure set_parameter (
        p_name  in  varchar2,
        p_value in  varchar2
) is
begin
        dbms_session.set_context('my_vpd_context', p_name, p_value);
end set_parameter;

end context_api_pkg;
/

prompt  ======================================================
prompt  Allow public to set the context value.  (Not sensible)
prompt  ======================================================

grant execute on vpd_test_function_owner.context_api_pkg to public;

prompt  ===============================================================
prompt  Create a context that can only be set by our packaged procedure
prompt  ===============================================================

create or replace context my_vpd_context 
        using vpd_test_function_owner.context_api_pkg
;

prompt  =====================================================
prompt  Create a security function that generates a predicate
prompt  based on our context, then create a policy to connect
prompt  the function to the test table for select statements.
prompt  =====================================================
 
create or replace function vpd_test_function_owner.test_vpd_function (
    p_schema  in varchar2 default null
  , p_object  in varchar2 default null
)
return varchar2
as
    lv_unit_id number := nvl(sys_context('my_vpd_context','unit_id'), -1);
begin
    return 'unit_id = ' || lv_unit_id;
end test_vpd_function;
/

begin
      dbms_rls.add_policy (
               object_schema    => 'vpd_test_data_owner'
             , object_name      => 'person'
             , policy_name      => 'test_vpd_policy'
             , function_schema  => 'vpd_test_function_owner'
             , policy_function  => 'test_vpd_function'
             , statement_types  => 'select'
      );
end;
/


There are several quick and dirty bits to the script – you shouldn’t be using the connect and resource roles, for example; they exist only for backwards compatibility and don’t even manage that very well any more. Any grants made should be carefully chosen to be the minimum necessary to achieve the required functionality, and you should be defining roles of your own rather than using pre-existing ones.

Generally you don’t expect to set up a security policy that stops the owner of the data from seeing all the data – and I’ve left the policy to default to dynamic, which means the function will execute on every parse and execute of a statement accessing the table (and that’s something to avoid if you can). For convenience I’ve also allowed the owner of the data to execute the function that changes the context that is used by the predicate function – and you don’t really want to allow anyone who is constrained by a security policy to be able to modify their own access rights like this.

Since the code allows a deliberately lax setup on VPD you could at this point do something like the following to check that VPD is actually working before moving on to test the effect of FDA:

connect vpd_test_data_owner/Password_1234
select * from person;

execute vpd_test_function_owner.context_api_pkg.set_parameter('unit_id',-2)
select * from person;

The first execution of the query should show you only the row where unit_id = -1, as “unit_id = -1” is the default return value from the security function. The second execution should return only the row where unit_id = -2, as the call to set_parameter() changes the context value so that when the security function re-executes it generates a new security predicate “unit_id = -2”. (It’s worth noting that one of the options for security policies is to make them context-dependent so that they re-execute only when the relevant context is changed – but in this case the policy defaults to “re-execute the function on every parse and execute”.)  [NOTE: for some clues on the possible performance impact of badly defined VPD, check the comments made in response to this blog note]
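For example, a context-dependent version of the policy might look like the following sketch – the policy_type, namespace and attribute parameters tell Oracle that the predicate function need re-execute only when the unit_id attribute of my_vpd_context changes:

begin
      dbms_rls.add_policy (
               object_schema    => 'vpd_test_data_owner'
             , object_name      => 'person'
             , policy_name      => 'test_vpd_policy'
             , function_schema  => 'vpd_test_function_owner'
             , policy_function  => 'test_vpd_function'
             , statement_types  => 'select'
             , policy_type      => dbms_rls.context_sensitive
             , namespace        => 'my_vpd_context'
             , attribute        => 'unit_id'
      );
end;
/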

Once you’re satisfied that the security policy is working correctly you can move on to the second feature – flashback data archive. Logging on as SYS once again, execute the following code – which, amongst other things, creates a new tablespace. You’ll notice that I’ve got three lines in the “create tablespace” statement naming a datafile (though one of them doesn’t actually supply a name). The names (or absence thereof) correspond to the default naming conventions I have for my sandbox 11g, 12c, and 19c instances in that order. You will want to adjust according to your file-naming conventions.


prompt  ============================
prompt  Setting up Flashback Archive
prompt  ============================

create tablespace fda_ts 
        datafile        
--              no name needed if OMF
--              '/u01/app/oracle/oradata/orcl12c/orcl/fda_ts.dbf'
--              '/u02/data/OR19/orclpdb/fda_ts.dbf'
        size 1m autoextend on next 1m
;

alter user vpd_test_data_owner quota unlimited on fda_ts;

create flashback archive default fda_1year tablespace fda_ts
quota 1g retention 1 year;
 
grant flashback archive on fda_1year to vpd_test_data_owner;
grant flashback archive administer to vpd_test_data_owner;
grant execute on dbms_flashback_archive to vpd_test_data_owner;
 
prompt  Sleeping for 1 minute before adding table to flashback archive
execute dbms_lock.sleep(60);
alter table vpd_test_data_owner.person flashback archive fda_1year;

prompt  Sleeping for 1 minute before updating the date
execute dbms_lock.sleep(60);
update vpd_test_data_owner.person set surname = upper(surname);

commit;

prompt  Sleeping for 5 minutes to give FDA a chance to do its thing.
execute dbms_lock.sleep(300);
alter system flush shared_pool;

prompt  ==================================================
prompt  Now connect to the data owner schema and run the 
prompt  original query then a couple of flashback queries, 
prompt  pulling their plans from memory
prompt  ==================================================

connect vpd_test_data_owner/Password_1234

set linesize 120
set pagesize 50
set trimspool on
set serveroutput off

spool vpd_fda_bug.lst
-- set autotrace on explain

select * from vpd_test_data_owner.person;
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-1/1440;
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-2/1440;
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-3/1440;
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-4/1440;
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-5/1440;
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-6/1440;
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-7/1440;  
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-8/1440;  
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-9/1440;  
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-10/1440; 
select * from vpd_test_data_owner.person AS OF TIMESTAMP SYSDATE-15/1440; 
set autotrace off
spool off 

I’ve created a tablespace that I’m going to reserve for the flashback archive and given my data owner a quota on that tablespace; then I’ve created a flashback archive in that tablespace and granted various privileges relating to flashback archive to my data owner.

The next few lines of code include a couple of calls to dbms_lock.sleep() because I want to avoid the risk of getting an Oracle error ORA-01466: unable to read data – table definition has changed, but all I’ve done otherwise is modify the person table to be archiving and then made a little change that will eventually be recorded as part of the archive.

I’ve then introduced a 5 minute wait as it seems to take about 5 minutes before the flashback process takes any action to capture the original table data and copy any related undo; but after that 5 minutes is up I’ve queried the person table directly (which should show you the one row where unit_id = -1), then gradually gone backwards in time, re-querying the data.

You should see the same result being produced for a few minutes, then a version of the “pre-update” data (upper case ‘ONE’ changing to mixed case ‘One’), and then you will (I hope) see the entire original data set appearing and finally you should see Oracle raising error “ORA-01466: unable to read data – table definition has changed” when your “as of timestamp” goes back beyond the moment you created the archive. (Except that that doesn’t happen with 11.2.0.4, which manages to report the data as if it had existed long before you created it).

I’ve commented out the “set autotrace on explain” in the above, but if you leave it in, or introduce it for one of the queries, you’ll see what’s going on that allows flashback data archive to show you data that should have been hidden by the security predicate. Here’s the execution plan for one run:

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                     |     2 |    86 |    17  (12)| 00:00:01 |       |       |
|   1 |  VIEW                     |                     |     2 |    86 |    17  (12)| 00:00:01 |       |       |
|   2 |   UNION-ALL               |                     |       |       |            |          |       |       |
|*  3 |    FILTER                 |                     |       |       |            |          |       |       |
|   4 |     PARTITION RANGE SINGLE|                     |     1 |    71 |     7   (0)| 00:00:01 |   KEY |     1 |
|*  5 |      TABLE ACCESS FULL    | SYS_FBA_HIST_353151 |     1 |    71 |     7   (0)| 00:00:01 |   KEY |     1 |
|*  6 |    FILTER                 |                     |       |       |            |          |       |       |
|   7 |     MERGE JOIN OUTER      |                     |     1 |  2083 |    10  (20)| 00:00:01 |       |       |
|   8 |      SORT JOIN            |                     |     1 |    55 |     7  (15)| 00:00:01 |       |       |
|*  9 |       TABLE ACCESS FULL   | PERSON              |     1 |    55 |     6   (0)| 00:00:01 |       |       |
|* 10 |      SORT JOIN            |                     |     5 | 10140 |     3  (34)| 00:00:01 |       |       |
|* 11 |       TABLE ACCESS FULL   | SYS_FBA_TCRV_353151 |     5 | 10140 |     2   (0)| 00:00:01 |       |       |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("TIMESTAMP_TO_SCN"(SYSDATE@!-.004861111111111111111111111111111111111111)<12670390363943)
   5 - filter("ENDSCN" .le. 12670390363943 AND ("OPERATION" IS NULL OR "OPERATION"<>'D') AND
              "ENDSCN">"TIMESTAMP_TO_SCN"(SYSDATE@!-.004861111111111111111111111111111111111111) AND ("STARTSCN" IS
              NULL OR "STARTSCN" .le. "TIMESTAMP_TO_SCN"(SYSDATE@!-.004861111111111111111111111111111111111111)))
   6 - filter("STARTSCN"<="TIMESTAMP_TO_SCN"(SYSDATE@!-.004861111111111111111111111111111111111111) OR
              "STARTSCN" IS NULL)
   9 - filter("UNIT_ID"=(-1) AND ("VERSIONS_OPERATION" IS NULL OR "VERSIONS_OPERATION"<>'D') 
             AND ("VERSIONS_STARTSCN" IS NULL OR "VERSIONS_STARTSCN".le."TIMESTAMP_TO_SCN(SYSDATE@!-.004861111111111111111111111111111111111111))
             AND ("VERSIONS_ENDSCN" IS NULL OR "VERSIONS_ENDSCN">"TIMESTAMP_TO_SCN"(SYSDATE@!-.004861111111111111111111111111111111111111)))
  10 - access("RID"(+)=ROWIDTOCHAR("T".ROWID))
       filter("RID"(+)=ROWIDTOCHAR("T".ROWID))
  11 - filter(("ENDSCN"(+) IS NULL OR "ENDSCN"(+)>12670390363943) AND ("STARTSCN"(+) IS NULL OR
              "STARTSCN"(+)<12670390363943))

Note
-----
   - dynamic sampling used for this statement (level=2)

Notice that the predicate “unit_id = -1” appears on the full table scan of person at operation 9 – that’s Oracle applying the security predicate to the person table. But the flashback code has replaced the person table with a union all of (some partitions of) the SYS_FBA_HIST_353151 table and a join between the person table and the SYS_FBA_TCRV_353151 table. And the code path that attaches the security predicate fails to attach it to the history table.

tl;dr

VPD (virtual private database) does not seem to be aware of the query rewrite that takes place if a table has an associated FDA (flashback data archive), so a flashback query may report rows from the “history” table that should have been blocked by the VPD security policy.

Lagniappe

There is another little problem with FDA that might affect you if you try to optimize flashback queries by creating SQL Plan Baselines. If you create a baseline on a test system (that isn’t a backup copy of the production system) and use the export/import facility to move the baseline to production then the baseline won’t work, because the sys_fba_hist_nnnnn and sys_fba_tcrv_nnnnn table names are constructed from the object_id of the base table – which means the archive table names (and associated baseline hints) in the test system are probably going to be different from the names in the production system.
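You can check the archive table name associated with a base table by querying dba_flashback_archive_tables – here’s a quick sketch for the demo table:

select  owner_name, table_name, archive_table_name, flashback_archive_name
from    dba_flashback_archive_tables
where   owner_name = 'VPD_TEST_DATA_OWNER'
and     table_name = 'PERSON'
;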

Housekeeping

To clean up the database after you’ve done all this testing, run the following script (modified to match any changes you’ve made in the test) after logging on as SYS:


alter table vpd_test_data_owner.person no flashback archive;

drop flashback archive fda_1year;

drop USER VPD_TEST_FUNCTION_OWNER cascade;
drop USER VPD_TEST_DATA_OWNER cascade;

drop tablespace fda_ts including contents and datafiles;


November 9, 2019

Nested Tables

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 11:28 am GMT Nov 9,2019

This note is a little side effect of answering a question about the total space associated with a table, including its indexes, LOB columns and nested tables. The first suggestion I published failed to take account of the fact that nested tables can contain their own nested tables so I had to modify the code.

The basic change was easy – in the original code I had joined to the view dba_nested_tables to translate between a table_name and its parent_table_name. The only change I needed to make was to replace the reference to the view with a different view that joined a table_name to its “oldest ancestor”. To achieve this I had to do two things: first, create a table with multiple levels of nesting, then create a suitable view definition.  For reference – because it may help somebody – I’ve published the two stages here.

A revolting nested table:

The following code creates three object types with three corresponding table types: the innermost object type is a simple pair of scalar columns, while each of the other two object types includes a column of the table type below it – so the demo table we finally create has three levels of nesting:


rem
rem     Script:         nested_tables.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create or replace type jpl_item3 as object (v1 varchar2(10), n3 number);
/

create or replace type jpl_tab3_type as table of jpl_item3;
/

create or replace type jpl_item2 as object(v2 varchar2(10), jpl3 jpl_tab3_type);
/

create or replace type jpl_tab2_type as table of jpl_item2;
/

create or replace type jpl_item1 as object(v3 varchar2(10), jpl2 jpl_tab2_type);
/

create or replace type jpl_tab1_type as table of jpl_item1;
/

create table demo_nest_2 (
        id      number  constraint d2_pk primary key,
        jpl1    jpl_tab1_type
)
segment creation immediate
nested table jpl1 store as t_jpl1
        (
        nested table jpl2  store as t_jpl2
                (
                nested table jpl3 store as t_jpl3 
                return as locator
                )
        return as locator
        )
return as locator
/

I’ve never seen nested tables manually created in a production system, though I believe they are likely to appear (along with varrays and LOBs) as a side effect of some XML or JSON mechanisms; but many years ago (in Practical Oracle 8i, possibly) I discussed the pros and cons of returning them “by value” or (as I have here) “by reference”. As you can see, you need to exercise some care with brackets and the placement of the storage clauses as you define multiple levels of nesting.

The Ancestor View

We’ll take this in three steps – first, a report from user_nested_tables. (The final script for reporting space uses dba_nested_tables, with a predicate available on the owner column, but I don’t give myself DBA privileges while writing quick and dirty bits of SQL.)


select 
        parent_table_name, parent_table_column, table_name, 
        ltrim(storage_spec) storage_spec,       -- the DBA view definition includes lpad( , 30) !!
        ltrim(return_type ) return_type         -- the DBA view definition includes lpad( , 20) !!
from 
        user_nested_tables
order by
        parent_table_name, parent_table_column
/

PARENT_TABLE_NAME    PARENT_TABLE_COLUMN  TABLE_NAME           STORAGE_SPEC         RETURN_TYPE
-------------------- -------------------- -------------------- -------------------- --------------------
DEMO_NEST_2          JPL1                 T_JPL1               USER_SPECIFIED       LOCATOR
T_JPL1               JPL2                 T_JPL2               USER_SPECIFIED       LOCATOR
T_JPL2               JPL3                 T_JPL3               DEFAULT              LOCATOR

You’ll notice the odd ltrim() – I have no idea why the view defines these columns to be left-padded the way they are; possibly it dates back to the time when something like cast(… as varchar2(30)) wasn’t a possible option.

Next, a simple “connect by” query that uses the above list in a materialized “with subquery” (CTE):


with my_nested_tables as (
select
        /*+ materialize */
        parent_table_name, table_name
from
        user_nested_tables
)
select
        parent_table, child_table, pathname
from    (
        select
                level,
                sys_connect_by_path(table_name, '/')    pathname,
                connect_by_root parent_table_name parent_table,
                table_name child_table
        from
                my_nested_tables
        connect by
                parent_table_name = prior table_name
        )
order by
        parent_table, child_table, pathname
/

PARENT_TABLE         CHILD_TABLE          PATHNAME
-------------------- -------------------- ----------------------------------------
DEMO_NEST_2          T_JPL1               /T_JPL1
DEMO_NEST_2          T_JPL2               /T_JPL1/T_JPL2
DEMO_NEST_2          T_JPL3               /T_JPL1/T_JPL2/T_JPL3
T_JPL1               T_JPL2               /T_JPL2
T_JPL1               T_JPL3               /T_JPL2/T_JPL3
T_JPL2               T_JPL3               /T_JPL3


As required this shows me demo_nest_2 as the owning ancestor of t_jpl1, t_jpl2 and t_jpl3. Unfortunately it has also produced three rows that we don’t want to see in our final space-summing code. But it’s easy enough to get rid of those – the only rows we want are the rows with a parent_table that doesn’t appear as a child_table:


with my_nested_tables as (
select
        /*+ materialize */
        parent_table_name, table_name
from
        user_nested_tables
)
select  parent_table, child_table, pathname
from    (
        select
                level,
                sys_connect_by_path(table_name, '/')    pathname,
                connect_by_root parent_table_name parent_table,
                table_name child_table
        from
                my_nested_tables
        connect by
                parent_table_name = prior table_name
        )
where   (parent_table) not in (
                select table_name
                from   my_nested_tables
        )
order by
        parent_table, child_table, pathname
/

PARENT_TABLE         CHILD_TABLE          PATHNAME
-------------------- -------------------- ----------------------------------------
DEMO_NEST_2          T_JPL1               /T_JPL1
DEMO_NEST_2          T_JPL2               /T_JPL1/T_JPL2
DEMO_NEST_2          T_JPL3               /T_JPL1/T_JPL2/T_JPL3

3 rows selected.


And there’s the result we wanted to see. A first simple corroboration of the fact is that the output corresponds with the “NESTED TABLE” segments reported by user_segments.
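For example (just a sketch – your byte counts will vary):

select  segment_name, segment_type, bytes
from    user_segments
where   segment_type = 'NESTED TABLE'
order by
        segment_name
;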

Of course, having written a query that gets the right result from the table definition we used to help us define the query in the first place, we now ought to create a few more tables with different structures to see if the query continues to give the right results in more complex cases.

What happens, for example, if I create a table with two columns of nested tables, where one of the nested table types itself contains two nested tables? What happens if the base table is an index-organized table?

It’s easy to do the second test – just add “organization index” immediately after “segment creation immediate” in the table creation statement. The correct results drop out.

As for the first test – here’s the SQL to create one example (and the query still gets the right answers – even if you change the table to be index organized).


drop type jpl_tab1_type;
drop type jpl_item1;

drop type jpl_tab2_type;
drop type jpl_item2;

drop type jpl_tab3_type;
drop type jpl_item3;

purge recyclebin;

create or replace type jpl_item3 as object (v1 varchar2(10), n3 number);
/

create or replace type jpl_tab3_type as table of jpl_item3;
/

create or replace type jpl_item2 as object(v2 varchar2(10), jpl3 jpl_tab3_type, jpl3x jpl_tab3_type);
/

create or replace type jpl_tab2_type as table of jpl_item2;
/

create or replace type jpl_item1 as object(v3 varchar2(10), jpl2 jpl_tab2_type)
/

create or replace type jpl_tab1_type as table of jpl_item1;
/

create table demo_nest_3 (
        id      number  constraint d2_pk primary key,
        jpl1    jpl_tab1_type,
        jpl1a   jpl_tab1_type
)
segment creation immediate
-- organization index
nested table jpl1 store as t_jpl1
        (
        nested table jpl2  store as t_jpl2
                (
                nested table jpl3 store as t_jpl3 
                return as locator
                nested table jpl3x store as t_jpl3x 
                return as locator
                )
        return as locator
        )
return as locator
nested table jpl1a store as t_jpl1a
        (
        nested table jpl2  store as t_jpl2a
                (
                nested table jpl3 store as t_jpl3a
                return as locator
                nested table jpl3x store as t_jpl3xa
                return as locator
                )
        return as locator
        )
return as locator
/

All that remains now is to modify the code to use the equivalent DBA views, with the addition of the owner column, then slot the resulting code into the original query in place of the simple references to dba_nested_tables. If you go to the original posting you’ll see that I’ve done this by wrapping the code into a CTE so that the only changes to the rest of the code appear as two (flagged) changes where the CTE is then used.
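As a sketch – with a hypothetical substitution variable for the owner, and the materialize hint retained from the earlier version – the CTE-based replacement looks like this:

with my_nested_tables as (
select
        /*+ materialize */
        owner, parent_table_name, table_name
from
        dba_nested_tables
where
        owner = upper('&m_owner')
)
select  parent_table, child_table, pathname
from    (
        select
                level,
                sys_connect_by_path(table_name, '/')    pathname,
                connect_by_root parent_table_name parent_table,
                table_name child_table
        from
                my_nested_tables
        connect by
                parent_table_name = prior table_name
        )
where   (parent_table) not in (
                select table_name
                from   my_nested_tables
        )
order by
        parent_table, child_table, pathname
/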

October 31, 2019

IOT Hash

Filed under: Execution plans,Hash Join,Infrastructure,IOT,Joins,Oracle,Troubleshooting — Jonathan Lewis @ 2:59 pm GMT Oct 31,2019

It’s another of my double-entendre titles. The optimizer can turn a hash join involving an index-organized table into a real performance disaster (though you may have to help it along the way by using a silly definition for your primary key columns). This post was inspired by a question posted on the Oracle Developer Community forum recently so the table and column names I’ve used in my model reflect (almost, with a few corrections) the names used in the post.

We start with a simple requirement expressed through the following SQL:


rem
rem     Script:         iot_hash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

insert
        /*+
                qb_name(insert)
        */
into t_iot(
        id, inst_id, nr_time,
        o_time, st, null_col, svname
)
select
        /*+
                qb_name(main)
                unnest(@subq)
                leading(@sel$a93afaed apar@main ob@subq)
                use_hash(@sel$a93afaed ob@subq)
                swap_join_inputs(@sel$a93afaed ob@subq)
                index_ss_asc(@sel$a93afaed ob@subq (t_iot.st t_iot.inst_id t_iot.svname))
        */
        apar.id,
        'UP',
        to_date('2019-10-24','yyyy-mm-dd'),
        to_date('1969-12-31','yyyy-mm-dd'),
        'IDLE',
        null,
        'tkt007.jj.bb.com'
from
        t_base apar
where
        apar.id not in (
                select
                        /*+
                                qb_name(subq)
                        */
                        id
                from
                        t_iot ob
                where
                        inst_id = 'UP'
        )
and     nvl(apar.gp_nm,'UA') = 'UA'
and     rownum <= 5000
/

The requirement is simple – insert into table t_iot a set of values dictated by a subset of the rows in table t_base if they do not already exist in t_iot. To model the issue that appeared I’ve had to hint the SQL to get the following plan (which I pulled from memory after enabling rowsource execution stats):


---------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |             |      1 |        |   296 (100)|      0 |00:00:00.03 |     788 |    148 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL | T_IOT       |      1 |        |            |      0 |00:00:00.03 |     788 |    148 |       |       |          |
|*  2 |   COUNT STOPKEY          |             |      1 |        |            |    100 |00:00:00.03 |      99 |     91 |       |       |          |
|*  3 |    HASH JOIN RIGHT ANTI  |             |      1 |    100 |   296   (2)|    100 |00:00:00.03 |      99 |     91 |    14M|  1843K|   15M (0)|
|*  4 |     INDEX SKIP SCAN      | T_IOT_STATE |      1 |  12614 |   102   (0)|  10000 |00:00:00.01 |      92 |     91 |       |       |          |
|*  5 |     TABLE ACCESS FULL    | T_BASE      |      1 |    100 |     2   (0)|    100 |00:00:00.01 |       7 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM<=5000)
   3 - access("APAR"."ID"="ID")
   4 - access("INST_ID"='UP')
       filter("INST_ID"='UP')
   5 - filter(NVL("APAR"."GP_NM",'UA')='UA')

The optimizer has unnested (as hinted) the subquery and converted it to an anti-join using a right hash anti-join. Take a look at the Used-Mem for the hash join – would it surprise you to learn that the (not compressed in any way) IOT, all its indexes, and the t_base table together total less than 4MB? Something dramatically awful has happened in the hash join to generate a requirement of 14MB. (In the case of the OP this appeared as an unexpected 5GB written to the temporary tablespace.)

Before I address the source of the high memory usage, take a close look at the Predicate Information, particularly operation 3, and ask yourself what the definition of index t_iot_state might be. The predicate joins t_base.id to t_iot.id, and here’s the code to create both tables and all the indexes.

create table t_iot (
        nr_time         timestamp,
        id              varchar2(1024),
        inst_id         varchar2(200),
        o_time          timestamp,
        st              varchar2(200),
        null_col        varchar2(100),
        svname          varchar2(200),
        constraint t_iot_pk primary key(nr_time, id, inst_id)
)
organization index
/

insert into t_iot
select
        sysdate,
        dbms_random.string('l',10),
        'UP',
        sysdate,
        'IDLE',
        null,
        rpad('x',25,'x')
from
        all_objects
where
        rownum <= 1e4 -- > hint to avoid wordpress format issue
/

create index t_iot_state on t_iot(st, inst_id, svname); 
create index idx2        on t_iot(id, inst_id, svname);

create table t_base(
        id              varchar2(400) not null,
        gp_nm           varchar2(200)
)
/

insert into t_base
select
        dbms_random.string('l',10),
        'UA'
from
        all_objects
where
        rownum <= 100 -- > hint to avoid wordpress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_iot',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_base',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );
end;
/


The index t_iot_state that Oracle has used in the hash join is defined on the columns (st, inst_id, svname) – so the predicate is doing a comparison with a column that’s not in the index! At least, it’s not visibly declared in the index; but this is a secondary index on an IOT, and IOTs don’t have “normal” rowids: the rowid in a secondary index is the value of the primary key (plus a “block guess”). So the columns in the index (even though not declared in the index) are: (st, inst_id, svname, {nr_time, id, inst_id, blockguess}) – and this index does supply the required id column.

Side note: you’ll see in the list of columns above that inst_id appears twice. In fact (since Oracle 9, I think) the code to handle secondary indexes has been smart enough to avoid this duplication: if the secondary index contains columns from the primary key then the “rowid” doesn’t store those columns, because the code knows how to construct the primary key value by combining the stored PK columns with the columns it needs from the index entry. This can make IOTs a very nice choice of implementation for “intersection” tables that are used to represent many-to-many joins between two other tables (see the sketch below).
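A minimal sketch of that intersection-table idea – the table and column names here are purely illustrative, nothing to do with the OP’s system:

create table child_parent (
        child_id        number          not null,
        parent_id       number          not null,
        constraint cp_pk primary key (child_id, parent_id)
)
organization index
/

-- The logical "rowid" carried by this secondary index is the primary key
-- (child_id, parent_id), so the index effectively covers both directions
-- of the many-to-many join without storing child_id a second time.
create index cp_parent_idx on child_parent(parent_id);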

Unfortunately this “rowid” is the explanation for the massive memory demand. Take a look at the “Column Projection Information” for my execution plan:


Column Projection Information (identified by operation id):
-----------------------------------------------------------
   2 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200], ROWNUM[8]
   3 - (#keys=1) "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]
   4 - "OB".ROWID[ROWID,1249], "NR_TIME"[TIMESTAMP,11], "ID"[VARCHAR2,1024], "INST_ID"[VARCHAR2,200], "OB".ROWID[ROWID,1249]
   5 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]

The interesting line is operation 4. A hash join takes the rowsource from its first child (the build table) and creates an in-memory hash table (which may spill to disc, of course), so if I see an unreasonable memory allocation (or unexpected spill to disc) a good starting point is to look at what the first child is supplying. In this case the first child seems to be saying that it’s supplying (or allowing for) roughly 3,700 bytes per row (1,249 + 11 + 1,024 + 200 + 1,249 = 3,733) to be passed up to the hash join.

On closer inspection we can see it’s reporting the “rowid” twice, and also reporting the three columns that make up that rowid. I think it’s reasonable to assume that it’s only supplying the rowid once, and maybe it’s not even supplying the other three columns because they are embedded in the rowid. Doing a quick arithmetic check, let’s multiply the size of the rowid by the value of A-rows: 1,249 * 10,000 = 12,490,000. That’s pretty close to the 14MB reported by the hash join in operation 3.

Hypothesis – to get at the id column, Oracle has used this index (actually a very bad choice of those available) to extract the rowid and then passed the rowid up to the parent in a (length-padded) fixed format. Oracle has then created a hash table by extracting the id column from the rowid and building the hash table on it, while also carrying the length-padded rowid into the hash table.  Possible variants on this theme are that some or all of the other columns in the Column Projection Information are also passed upwards so that the id doesn’t have to be extracted but, if they are, they are not padded to their maximum length.

A simple test that this is broadly the right assumption is to re-run the model making the declared length of the rowid much larger to see what happens to the memory allocation. Changing the inst_id declaration from 200 bytes to 1000 bytes (note the stored value is only the 2 bytes needed for the value ‘UP’) the Used-mem jumps to 23 MB (which is an increment very similar to 800 * 10,000).  You’ll note that I chose to experiment with a column that wasn’t the column used in the join. It was a column in the secondary index definition, though, so another test would be to change the nr_time column from a timestamp (11 bytes) to a large varchar2, so I re-ran the test declaring the nr_time as a varchar2(1000) – reverting the inst_id to varchar2(200) – and the Used-mem increased to 25MB.
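For reference, the first variant was simply the original creation script with one declaration changed – a sketch of the relevant fragment:

create table t_iot (
        nr_time         timestamp,
        id              varchar2(1024),
        inst_id         varchar2(1000),         -- was varchar2(200): Used-mem went from 14MB to 23MB
        o_time          timestamp,
        st              varchar2(200),
        null_col        varchar2(100),
        svname          varchar2(200),
        constraint t_iot_pk primary key(nr_time, id, inst_id)
)
organization index
/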

Preliminary Conclusion

If Oracle uses the contents of the rowid of a secondary index on an IOT in a join then it constructs a fixed format version for the rowid by padding every column in the primary key to its maximum length and concatenating the results. This can have catastrophic side effects on performance if you’ve declared some very long columns “just in case”. Any time you use index organized tables you should remember to check the Column Projection Information in any execution plans that use secondary indexes in case they are passing a large, padded, primary key through the plan to a point where a blocking operation (such as a hash join or merge join) has to accumulate a large number of rows.

Footnote

In my test case I had to hint the query heavily to force Oracle into the path I wanted to demonstrate.

It’s surprising that the optimizer should have chosen this path in the OP’s system, given that there’s another secondary index that contains the necessary columns in its definition. (So one thought is that there’s a statistics problem to address, or possibly the “good” index is subject to updates that make it become very inefficient (large) very quickly.)

Another oddity of the OP’s system was that Oracle should have chosen to do a right hash anti-join when it looked as if joining the tables in the opposite order would produce a much smaller memory demand and lower cost – there was an explicit swap_join_inputs() hint in the Outline Information (so copying the outline into the query and changing that to no_swap_join_inputs() might have been another viable workaround). In the end the OP hinted the query to use a nested loop (anti-)join from t_base to t_iot – which is another way to avoid the hash table threat with padded rowids.

 

October 11, 2019

v$session

Filed under: Execution plans,Infrastructure,Oracle,Performance — Jonathan Lewis @ 12:29 pm BST Oct 11,2019

Here’s an odd, and unpleasant, detail about querying v$session in the “most obvious” way. (And if you were wondering what made me resurrect and complete a draft on “my session id” a couple of days ago, this posting is the reason). Specifically if you want to select some information for your own session from v$session the query you’re likely to use in any recent version of Oracle will probably be of the form:


select {list for columns} from v$session where sid = to_number(sys_context('userenv','sid'));

Unfortunately that one little statement hides two anomalies – which you can see in the execution plan. Here’s a demonstration cut from an SQL*Plus session running under 19.3.0.0:


SQL> select * from table(dbms_xplan.display_cursor);

SQL_ID  gcfrzq9knynj3, child number 0
-------------------------------------
select program from V$session where sid = sys_context('userenv','sid')

Plan hash value: 2422122865

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  MERGE JOIN CARTESIAN     |                 |     1 |    33 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    12 |     0   (0)|
|*  3 |    FIXED TABLE FULL       | X$KSLWT         |     1 |     8 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSLED (ind:2) |     1 |     4 |     0   (0)|
|   5 |   BUFFER SORT             |                 |     1 |    21 |     0   (0)|
|*  6 |    FIXED TABLE FULL       | X$KSUSE         |     1 |    21 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("W"."KSLWTSID"=TO_NUMBER(SYS_CONTEXT('userenv','sid')))
   4 - filter("W"."KSLWTEVT"="E"."INDX")
   6 - filter((BITAND("S"."KSUSEFLG",1)<>0 AND BITAND("S"."KSSPAFLG",1)<>0 AND 
              "S"."INDX"=TO_NUMBER(SYS_CONTEXT('userenv','sid'))
              AND INTERNAL_FUNCTION("S"."CON_ID") AND "S"."INST_ID"=USERENV('INSTANCE')))

As you can see, v$session is a join of 3 separate structures – x$kslwt (v$session_wait), x$ksled (v$event_name), and x$ksuse (the original v$session as it was some time around 8i), and the plan shows two “full tablescans” and a Cartesian merge join. Tablescans and Cartesian merge joins are not necessarily bad – especially where small tables and tiny numbers of rows are concerned – but they do merit at least a brief glance.

x$ksuse is a C structure in the fixed SGA and that structure is a segmented array (which seems to be chunks of 126 entries in 19.3, and chunks of 209 entries in 12.2 – but that’s fairly irrelevant). The SID is simply the index into the array counting from 1, so if you have a query with a predicate like ‘SID = 99’ Oracle can work out the address of the 99th entry in the array and access it very quickly – which is why the SID column is reported as a “fixed index” column in the view v$indexed_fixed_column.
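If you want to see which x$ columns get this treatment you can query the view directly; here’s a minimal sketch checking the three structures behind v$session:

select  table_name, index_number, column_name, column_position
from    v$indexed_fixed_column
where   table_name in ('X$KSUSE', 'X$KSLWT', 'X$KSLED')
order by
        table_name, index_number, column_position
;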

But we have two problems immediately visible:

  1. the optimizer is not using the “index” to access x$ksuse despite the fact that we’re giving it exactly the value we want to use (and we can see a suitable predicate at operation 6 in the plan)
  2. the optimizer has decided to start executing the query at the x$kslwt table

Before looking at why things have gone wrong, let’s check the execution plan to see what would have happened if we’d copied the value from the sys_context() call into a bind variable and queried using the bind variable – which we’ll keep as a character type to make it a fair comparison:

SQL_ID  cm3ub1tctpdyt, child number 0
-------------------------------------
select program from v$session where sid = to_number(:v1)

Plan hash value: 1627146547

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  MERGE JOIN CARTESIAN     |                 |     1 |    32 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    12 |     0   (0)|
|*  3 |    FIXED TABLE FIXED INDEX| X$KSLWT (ind:1) |     1 |     8 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSLED (ind:2) |     1 |     4 |     0   (0)|
|   5 |   BUFFER SORT             |                 |     1 |    20 |     0   (0)|
|*  6 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    20 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("W"."KSLWTSID"=TO_NUMBER(:V1))
   4 - filter("W"."KSLWTEVT"="E"."INDX")
   6 - filter(("S"."INDX"=TO_NUMBER(:V1) AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))


When we have a (character) bind variable instead of a sys_context() value the optimizer manages to use the “fixed indexes”, but it has still started executing at x$kslwt, and is still doing a Cartesian merge join. The plan would be the same if the bind variable were a numeric type, and we’d still get the same plan if we replaced the bind variable with a literal number.

So problem number 1 is that Oracle only seems able to use the fixed index path for literal values and simple bind variables (plus a few “simple” functions). It doesn’t seem to use the fixed indexes for most functions (even deterministic ones) returning a value and the sys_context() function is a particular example of this.

Transitivity

Problem number 2 comes from a side-effect of something that I first described about 13 years ago – transitive closure. Take a look at the predicates in both the execution plans above. Where’s the join condition between x$ksuse and x$kslwt? There should be one, because the underlying SQL defining [g]v$session has the following joins:

from
      x$ksuse s,
      x$ksled e,
      x$kslwt w
where
      bitand(s.ksspaflg,1)!=0
and   bitand(s.ksuseflg,1)!=0
and   s.indx=w.kslwtsid       -- this is the SID column for v$session and v$session_wait
and   w.kslwtevt=e.indx
 

What’s happened here is that the optimizer has used transitive closure: “if a = b and b = c then a = c” to clone the predicate “s.indx = to_number(sys_context(…))” to “w.kslwtsid = to_number(sys_context(…))”. But at the same time the optimizer has eliminated the predicate “s.indx = w.kslwtsid”, which it shouldn’t do because this is 12.2.0.1 and ever since 10g we’ve had the parameter _optimizer_transitivity_retain = true – but SYS is ignoring the parameter setting.

So we no longer have a join condition between x$ksuse and x$kslwt – which means there has to be a cartesian merge join between them and the only question is whether this should take place before or after the join between x$kslwt and x$ksled. In fact, the order doesn’t really matter because there will be only one row identified in x$kslwt and one row in x$ksuse, and the join to x$ksled is simply a lookup (by undeclarable unique key) to translate an id into a name and it will take place only once whatever we do about the other two structures.

But there is a catch – especially if your sessions parameter is 25,000 (which it shouldn’t be) and the number of connected sessions is currently 20,000 (which it shouldn’t be) – the predicate against x$ksuse does a huge amount of work as it walks the entire array testing every row (and it doesn’t even do the indx test first – it does a couple of bitand() operations). Even then this wouldn’t be a disaster – we’re only talking a couple of hundredths of a second of CPU – until you find the applications that run this query a huge number of times.

We would prefer to avoid two full tablescans since the arrays could be quite large, and of the two it’s the tablescan of x$ksuse that is going to be the greater threat; so is there a way to bypass the threat?  Once we’ve identified the optimizer anomaly we’ve got a pointer to a solution. Transitivity is going wrong, so let’s attack the transitivity. Checking the hidden parameters we can find a parameter: _optimizer_generate_transitive_pred which defaults to true, so let’s set it to false for the query and check the plan:

select  /*+   opt_param('_optimizer_generate_transitive_pred','FALSE')
*/  program from  v$session where  sid = sys_context('userenv','sid')

Plan hash value: 3425234845

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  NESTED LOOPS             |                 |     1 |    32 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    28 |     0   (0)|
|   3 |    FIXED TABLE FULL       | X$KSLWT         |    47 |   376 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    20 |     0   (0)|
|*  5 |   FIXED TABLE FIXED INDEX | X$KSLED (ind:2) |     1 |     4 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(("S"."INDX"="W"."KSLWTSID" AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND "S"."INDX"=TO_NUMBER(SYS_CONTEXT('user
              env','sid')) AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))
   5 - filter("W"."KSLWTEVT"="E"."INDX")


Although it’s not nice to insert hidden parameters into the optimizer activity we do have a result. We don’t have any filtering on x$kslwt – fortunately this seems to be limited in size (but see footnote) to the number of current sessions (unlike x$ksuse which has an array size defined by the sessions parameter or derived from the processes parameter). For each row in x$kslwt we do an access into x$ksuse using the “index” (note that we don’t see access predicates for the fixed tables, we just have to note the operation says FIXED INDEX and spot the “index-related” predicate in the filter predicate list), so this strategy has reduced the number of times we check the complex predicate on x$ksuse rows.

It’s still far from ideal, though. What we’d really like to do is access x$kslwt by index using the known value from sys_context(‘userenv’,’sid’). As it stands the path we get from using a hidden parameter which isn’t listed as legal for the opt_param() hint is a plan that we would get if we used an unhinted query that searched for audsid = sys_context(‘userenv’,’sessionid’).

SQL_ID  7f3f9b9f32u7z, child number 0
-------------------------------------
select  program from  v$session where  audsid =
sys_context('userenv','sessionid')

Plan hash value: 3425234845

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  NESTED LOOPS             |                 |     2 |    70 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     2 |    62 |     0   (0)|
|   3 |    FIXED TABLE FULL       | X$KSLWT         |    47 |   376 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    23 |     0   (0)|
|*  5 |   FIXED TABLE FIXED INDEX | X$KSLED (ind:2) |     1 |     4 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(("S"."INDX"="W"."KSLWTSID" AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND "S"."KSUUDSES"=TO_NUMBER(SYS_CONTEXT('
              userenv','sessionid')) AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))
   5 - filter("W"."KSLWTEVT"="E"."INDX")


The bottom line, then, seems to be that if you need a query by SID against v$session to be as efficient as possible then your best bet is to load a numeric bind variable with the value of sys_context(‘userenv’,’sid’) and then select where “sid = :bindvariable”. Otherwise query by audsid, or use a hidden parameter to affect the optimizer.
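In SQL*Plus, a minimal sketch of that “best bet” might look like this:

variable m_sid number

execute :m_sid := to_number(sys_context('userenv','sid'))

select program from v$session where sid = :m_sid;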

Until the SYS schema follows the _optimizer_transitivity_retain parameter, or treats sys_context() the same way it treats a bind variable, there is always going to be some unnecessary work when querying v$session, and that excess will grow with either the number of connected sessions (if you optimize the query) or with the value of the sessions parameter.

Footnote

In (much) older versions of Oracle v$session_wait sat on top of x$ksusecst, which was part of the same C structure as x$ksuse. In newer versions of Oracle x$kslwt is a structure that is created on demand in the PGA/UGA – I hope that there’s a short cut that allows Oracle to find the waiting elements in x$ksuse[cst] efficiently, rather than requiring a walk through the whole thing, otherwise a tablescan of the (nominally smaller) x$kslwt structure will be at least as expensive as a tablescan of the x$ksuse structure.

Update (just a few minutes after posting)

Bob Bryla has pointed out in a tweet that there are many “bugs” not fixed until 19.1 for which the workaround is to set “_optimizer_transitivity_retain” to false. So maybe this isn’t an example of SYS doing something particularly strange – it may be part of a general reworking of the mechanism that still has a couple of undesirable side effects.

Bob’s comment prompted me to clone the x$ tables into real tables in a non-SYS schema and model the fixed indexes with primary keys, and I found that the resulting plan (though very efficient) still discarded the join predicate. So we may be seeing the side effects of a code enhancement relating to generating predicates that produce unique key access paths. (A “contrary” test, the one in the 2013 article I linked to, still retains the join predicate for the query that has non-unique indexes.)

 

October 9, 2019

My SID

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 12:03 pm BST Oct 9,2019

Here’s a little note that’s been hanging around as a draft for more than eight years according to the OTN (as it was) posting that prompted me to start writing it. At the time there were still plenty of people using Oracle 10g, so the question didn’t seem entirely inappropriate:

On 10g R2 when I open a sqlplus session how can I know my session SID ? I’m not DBA then can not open as sysdba and query v$session.

In all fairly recent versions of Oracle, of course, we have the option to use the sys_context() function to get the SID, but this specific option didn’t appear until some time in the 10g timeline – so you might have spent years “knowing” that you could get the audsid through sys_context(‘userenv’,’sessionid’) while there was no equivalent way to get the sid. Now, of course, and even in the timeline of the original posting, the simplest solution to the requirement is to execute:


select sys_context('userenv','sid') from dual;

But there are a number of alternatives – which may occasionally do a better job (and sometimes are just plain silly). It’s also worth noting that even in 19c Oracle still doesn’t have access to v$session.serial# through sys_context() and, anyway, sys_context() behaves like an unpeekable bind variable – which can be a problem.

So here’s the first of several options:

select sid from v$mystat where rownum = 1;

You’ll need SYS to grant you select on v_$mystat to use this one, of course, but v$mystat is a very convenient view giving you the session activity stats since logon for your own session – so there ought to be some mechanism that allows you to see some form of it in place anyway (ideally including the join to v$statname).
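For reference – and assuming a user called test_user – the grants from SYS might be:

grant select on v_$mystat   to test_user;
grant select on v_$statname to test_user;       -- for the ideal case of joining to v$statname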

One of the oldest ways of getting access to your session ID without having access to any of the dynamic performance views was through the dbms_support package:

variable v1 varchar2(32)
execute :v1 := dbms_support.mysid
execute dbms_output.put_line(:v1)

Again you’ll need SYS to grant you extra privileges, in this case execute on the dbms_support package – worse still, the package is not installed by default. In fact (after installing it) if you call dbms_support.package_version it returns the value: “DBMS_SUPPORT Version 1.0 (17-Aug-1998) – Requires Oracle 7.2 – 8.0.5” – which gives you some idea of how old it is. It used to be useful for the start_trace_in_session() procedure it contains, but that procedure has been superseded by many newer mechanisms. If you enable SQL tracing to see what’s happening under the covers when you call dbms_support.mysid you’ll see that the function actually runs the query I showed above against v$mystat.

Unlike dbms_support, the dbms_session package is installed automatically with the privilege to execute granted to public, and it gives you a function to generate a “unique session id”. The notes in the script $ORACLE_HOME/rdbms/admin/dbmssess.sql that creates the package say that the return value can be up to 24 bytes long, but so far the maximum I’ve seen is 12.


select dbms_session.unique_session_id from dual;
UNIQUE_SESSION_ID
--------------------------
00FF5F980001


select
        to_number(substr(dbms_session.unique_session_id,1,4),'XXXX') sid,
        to_number(substr(dbms_session.unique_session_id,5,4),'XXXX') serial#,
        to_number(substr(dbms_session.unique_session_id,9,4),'XXXX') instance
from
        dual
;

       SID    SERIAL# INSTANCE
---------- ---------- --------
       255      24472        1

As you can see, the unique_session_id can be decoded to produce three useful bits of information, and the nice thing about this call is that it gives you the session serial# at the same time as the SID. The query above calls the function three times, though, so there’s some scope for writing a more efficient version that uses a non-mergeable in-line view to call the function once, then splits the result into three pieces.
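Here’s a sketch of that idea – I haven’t benchmarked it, and it assumes the no_merge hint is enough to stop Oracle merging the view:

select
        to_number(substr(usid, 1, 4),'XXXX') sid,
        to_number(substr(usid, 5, 4),'XXXX') serial#,
        to_number(substr(usid, 9, 4),'XXXX') instance
from    (
        select  /*+ no_merge */
                dbms_session.unique_session_id usid
        from    dual
        )
;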

While we’re on the unique_session_id, the dbms_pipe package also has a “unique identifier” function unique_session_name():

SQL> select dbms_pipe.unique_session_name from dual;

UNIQUE_SESSION_NAME
------------------------
ORA$PIPE$00FF5F980001

It doesn’t take a lot of effort to spot that the “unique session name” is the “unique session id” of dbms_session prefixed with the text “ORA$PIPE$”. It’s convenient for the dbms_pipe package to be able to generate a unique name so that one session can create a safely named pipe and tell another session about it. Anyone using pipes should take advantage of this function for its original purpose. Unlike dbms_session you’ll need to be granted the privilege to execute this package, it’s not available to public. Interestingly the script that creates dbms_pipe says that this function could return 30 bytes – since it appears to be 9 bytes prepended to the (“could be 24 bytes”) dbms_session.unique_session_id you have to wonder whether there’s something more subtle that could happen.

There may be many more mechanisms available as built-ins, but the last one I know of is in the dbms_debug_jdwp package (another package with execute privilege already granted to public and the ability to supply both the sid and serial#):

SQL> select
  2          dbms_debug_jdwp.current_session_id     sid,
  3          dbms_debug_jdwp.current_session_serial serial#
  4  from dual
  5  /

       SID    SERIAL#
---------- ----------
       255      24472

There is a reason why I’ve decided to resurrect this list of ways of getting at a session’s SID, but that’s the topic of another blog note.

 

 

October 7, 2019

Resumable

Filed under: Infrastructure,Oracle,Tuning — Jonathan Lewis @ 10:31 am BST Oct 7,2019

There are two questions about temporary space that appear fairly regularly on the various Oracle forums. One is of the form:

From time to time my temporary tablespace grows enormously (and has to be shrunk), how do I find what’s making this happen?

The other follows the more basic pattern:

My process sometimes crashes with Oracle error: “ORA-01652: unable to extend temp segment by %n in tablespace %s” how do I stop this happening?

Before moving on to the topic of the blog, it’s worth pointing out two things about the second question:

  • First, it’s too easy to get stuck at the word temp and leap to the conclusion that the problem is about the temporary tablespace without noticing that the error message includes the specific tablespace that’s raised the problem. If, for example, you rebuild an index in a nominated tablespace Oracle first creates the index as a temporary segment (with a name like {starting_file_number}.{starting_block_number}) in that tablespace then renames it to match the original index name once the rebuild is complete and drops the old index.
  • Secondly, a process that raises ORA-01652 isn’t necessarily the guilty party – it may be the victim of some other process hogging all the available space when it shouldn’t. Moreover, that other process may have completed and released its space by the time you start looking for the problem – causing extra confusion because your process seems to have crashed without a cause. Taking my example of an index rebuild – your index rebuild may fail because someone else was rebuilding a different index at the same time in the same tablespace; but when you check the tablespace all the space from their original index is now free because their rebuild completed in the interim.

So, before you start chasing something that you think is a problem with your code, pause a moment to double-check the error message and think about whether you could have been the victim of some concurrent, but now complete, activity.

I’ve listed the two questions as variants on the same theme because the workaround to one of them introduces the risk of the other – if you want to avoid ORA-01652 you could make all your data files and temp files “autoextensible”, but then there may be occasions when they extend far too much and you need to shrink them down again (and that’s not necessarily easy if it’s not the temporary tablespace). Conversely, if you think your data or temp files randomly explode to ludicrous sizes you could decide on a maximum size for your files and disable autoextension – then handle the complaints when a user reports an ORA-01652.
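One compromise between the two extremes is to allow autoextension but put a cap on it; a sketch of the command (the tempfile name here is purely illustrative):

alter database tempfile '/u01/oradata/orcl/temp01.dbf'
        autoextend on next 100m maxsize 8g
;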

There are various ways you could monitor your system in near real time to spot the threat as it builds, of course; and there are various ways to identify potentially guilty SQL after the event. You could keep an eye on various v$ dynamic performance views or dba_ administrative views to try and intercept a problem; you could set event 1652 to dump an errorstack (or even systemstate) for post-crash analysis to see what that reported. Neither is an ideal solution – one requires you to pay excessive attention to the system, the other is designed to let the problem happen then leave you to clean up afterwards.  There is, however, a strategy that may stop the problem from appearing without requiring constant monitoring. The strategy is to enable (selectively) resumable operations.

If a resumable operation needs to allocate space but is unable to do so – i.e. it would normally be about to raise ORA-01652 – it will suspend itself for a while, going into the wait state “statement suspended, wait error to be cleared”, which will show up as the event in v$session_wait, timing out every 2 seconds. The session will also be reporting its current action in the view v$resumable or, for slightly more information, dba_resumable. As it suspends, the session will write a message to the alert log, and you can create an “after suspend” database trigger to alert you that a problem has occurred.

If you set the resumable timeout to a suitable value then you may find:

  • the problem goes away of its own accord and the session resumes before the timeout is reached

or

  • you receive a warning and have some time to identify the source of the problem and take the minimum action needed to allow the session to resume

Implementation

The parameter resumable_timeout is a general control for resumable sessions if you don’t handle the feature at a more granular level than the system.

By default this parameter is set to zero, which translates into a default value of 7,200 seconds, but that default doesn’t come into effect unless a session declares itself resumable. If you set the parameter to a non-zero value all sessions will automatically be operating as resumable sessions – and you’ll soon hear why you don’t want to do that.

The second enabling feature for resumable sessions is the resumable privilege – a session can’t control its own resumability unless the schema has been granted the resumable privilege (which may be granted through a role). If a session has the privilege it may set its own resumable_timeout, even if the system value is zero.
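So, assuming the demo account is test_user (the name that appears in the alert log extracts below), SYS would need to issue something like:

grant resumable to test_user;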

Assume we have set resumable_timeout to 10 (seconds) through the instance parameter file and restarted the instance. If we now issue (for example) the following ‘create table’ statement:


create table t1 (n1, v1 ) 
pctfree 90 pctused 10
tablespace tiny
as
select 
        rownum, cast(lpad('x',800) as varchar2(1000))
from    all_objects
where   rownum <= 20000
/

This will attempt to allocate 1 row per block for 20,000 blocks (plus about 1.5% for bitmap space management blocks) – and tablespace tiny lives up (or down) to its name, consisting of a single file of only 10,000 Oracle blocks. Shortly after starting, the session will hit Oracle error “ORA-01652: unable to extend temp segment by 128 in tablespace TINY”, but it won’t report it; instead it will suspend itself for 10 seconds before failing and reporting the error. This will happen whether or not the session has the resumable privilege – in this case the behaviour is dictated by our setting the system parameter. If you look in the alert log after the session finally errors out you will find text like the following:

2019-10-04T14:01:11.847943+01:00
ORCL(3):ORA-1652: unable to extend temp segment by 128 in tablespace TINY [ORCL] 
ORCL(3):statement in resumable session 'User TEST_USER(138), Session 373, Instance 1' was suspended due to
ORCL(3):    ORA-01652: unable to extend temp segment by 128 in tablespace TINY
2019-10-04T14:01:23.957586+01:00
ORCL(3):statement in resumable session 'User TEST_USER(138), Session 373, Instance 1' was timed out

Note that there’s a 10 (plus a couple) second gap between the point where the session reports that it is suspending itself and the point where it fails with a timeout. The two extra seconds appear because the session polls every 2 seconds to see whether the problem is still present or whether it has spontaneously disappeared, thus allowing the session to resume.

Let’s change the game slightly; let’s try to create the table again, but this time execute the following statement first:

alter session enable resumable timeout 60 name 'Help I''m stuck';

The initial response to this will be Oracle error “ORA-01031: insufficient privileges” because the session doesn’t have the resumable privilege, but after granting resumable to the user (or a relevant role) we try again and find we will be allowed a little extra time before the CTAS times out. Our session now overrides the system timeout and will wait 60 seconds (plus a bit) before failing. The “timeout” clause is optional and if we omit it the session will use the system value; similarly the “name” clause is optional – there’s no default for it, it’s just a message that will get into various views and reports.

There are several things you might check in this 60 second grace period. The session wait history will confirm that your session has been timing out every two seconds (as will the active session history if you’re licensed to use it):


select seq#, event, wait_time from v$session_wait_history where sid = 373

      SEQ# EVENT							     WAIT_TIME
---------- ---------------------------------------------------------------- ----------
	 1 statement suspended, wait error to be cleared			   204
	 2 statement suspended, wait error to be cleared			   201
	 3 statement suspended, wait error to be cleared			   201
	 4 statement suspended, wait error to be cleared			   201
	 5 statement suspended, wait error to be cleared			   200
	 6 statement suspended, wait error to be cleared			   200
	 7 statement suspended, wait error to be cleared			   202
	 8 statement suspended, wait error to be cleared			   200
	 9 statement suspended, wait error to be cleared			   200
	10 statement suspended, wait error to be cleared			   200

Then there’s a special dynamic performance view, v$resumable which I’ve reported below using a print_table() procedure that Tom Kyte wrote many, many years ago to report rows in a column format:

SQL> set serveroutput on
SQL> execute print_table('select * from v$resumable where sid = 373')

ADDR                          : 0000000074515B10
SID                           : 373
ENABLED                       : YES
STATUS                        : SUSPENDED
TIMEOUT                       : 60
SUSPEND_TIME                  : 10/04/19 14:26:20
RESUME_TIME                   :
NAME                          : Help I'm stuck
ERROR_NUMBER                  : 1652
ERROR_PARAMETER1              : 128
ERROR_PARAMETER2              : TINY
ERROR_PARAMETER3              :
ERROR_PARAMETER4              :
ERROR_PARAMETER5              :
ERROR_MSG                     : ORA-01652: unable to extend temp segment by 128 in tablespace TINY
CON_ID                        : 0
-----------------
1 rows selected

Notice how the name column reports the name I supplied when I enabled the resumable session. The view also tells us when the critical statement was suspended and how long it is prepared to wait (in total) – leaving us to work out from the current time how much time we have left to work around the problem.
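Since suspend_time is reported as a character string, a quick sketch of that calculation – assuming the date format shown in the output above – might be:

select
        name,
        timeout - round((sysdate - to_date(suspend_time,'mm/dd/yy hh24:mi:ss')) * 86400)  seconds_left
from
        v$resumable
where
        status = 'SUSPENDED'
;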

There’s also a dba_resumable variant of the view which is slightly more informative (though the sample below is not consistent with the one above because I ran the CTAS several times, editing the blog as I did so):

SQL> execute print_table('select * from dba_resumable where session_id = 373')

USER_ID                       : 138
SESSION_ID                    : 373
INSTANCE_ID                   : 1
COORD_INSTANCE_ID             :
COORD_SESSION_ID              :
STATUS                        : SUSPENDED
TIMEOUT                       : 60
START_TIME                    : 10/04/19 14:21:14
SUSPEND_TIME                  : 10/04/19 14:21:16
RESUME_TIME                   :
NAME                          : Help I'm stuck
SQL_TEXT                      : create table t1 (n1, v1 ) pctfree 90 pctused 10 tablespace tiny as  select rownum, 
                                cast(lpad('x',800) as varchar2(1000)) from all_objects where rownum <= 20000
ERROR_NUMBER                  : 1652
ERROR_PARAMETER1              : 128
ERROR_PARAMETER2              : TINY
ERROR_PARAMETER3              :
ERROR_PARAMETER4              :
ERROR_PARAMETER5              :
ERROR_MSG                     : ORA-01652: unable to extend temp segment by 128 in tablespace TINY
-----------------
1 rows selected

This view includes the text of the statement that has been suspended and shows us when it started running (so that we can decide whether we really want to rescue it, or might be happy to kill it to allow some other suspended session to resume).

If you look at the alert log in this case you’ll see that the name has been reported there instead of the user, session and instance – which means you might want to think carefully about how you use the name option:


2019-10-04T14:21:16.151839+01:00
ORCL(3):statement in resumable session 'Help I'm stuck' was suspended due to
ORCL(3):    ORA-01652: unable to extend temp segment by 128 in tablespace TINY
2019-10-04T14:22:18.655808+01:00
ORCL(3):statement in resumable session 'Help I'm stuck' was timed out

Once your resumable task has completed (or timed out and failed) you can stop the session from being resumable with the command:

alter session disable resumable;

And it’s important that every time you enable resumability you should disable it as soon as the capability is no longer needed. Also, be careful about when you enable it, don’t be tempted to make every session resumable. Use it only for really important cases. Once a session is resumable virtually everything that goes on in that session is deemed to be resumable, and this has side effects.

The first side effect that may spring to mind is the impact of the view v$resumable – it’s a memory structure in the SGA so that everyone can see it and all the resumable sessions can populate and update it. That means there’s got to be some latch (or mutex) protection going on – and if you look at v$latch you’ll discover that there’s just a single (child) latch doing the job, so resumability can introduce a point of contention. Here’s a simple script (using my “start_XXX” strategy) that runs “select 1 from dual;” one thousand times, with calls to check the latch activity:

set termout off
set serveroutput off
execute snap_latch.start_snap

@start_1000

set termout on
set serveroutput on
execute snap_latch.end_snap(750)

And here are the results of running the script – reporting only the latches with more than 750 gets in the interval – first without and then with a resumable session:

---------------------------------
Latch waits:-   04-Oct 15:04:31
Lower limit:-  750
---------------------------------
Latch                              Gets      Misses     Sp_Get     Sleeps     Im_Gets   Im_Miss Holding Woken Time ms
-----                              ----      ------     ------     ------     -------   ------- ------- ----- -------
session idle bit                  6,011           0          0          0           0         0       0     0      .0
enqueue hash chains               2,453           0          0          0           0         0       0     0      .0
enqueue freelist latch                1           0          0          0       2,420         0       0     0      .0
JS queue state obj latch          1,176           0          0          0           0         0       0     0      .0

SQL> alter session enable resumable;

SQL> @test
---------------------------------
Latch waits:-   04-Oct 15:04:46
Lower limit:-  750
---------------------------------
Latch                              Gets      Misses     Sp_Get     Sleeps     Im_Gets   Im_Miss Holding Woken Time ms
-----                              ----      ------     ------     ------     -------   ------- ------- ----- -------
session idle bit                  6,011           0          0          0           0         0       0     0      .0
enqueue hash chains               2,623           0          0          0           0         0       0     0      .0
enqueue freelist latch                1           0          0          0       2,588         0       0     0      .0
resumable state object            3,005           0          0          0           0         0       0     0      .0
JS queue state obj latch          1,260           0          0          0           0         0       0     0      .0

PL/SQL procedure successfully completed.

SQL> alter session disable resumable;

That’s 1,000 selects from dual – 3,000 latch gets on a single child latch. It looks like every call to the database results in a latch get and an update to the memory structure. (Note: You wouldn’t see the same effect if you ran a loop inside an anonymous PL/SQL block since the block would be the single database call).
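For example, a sketch of the PL/SQL equivalent – 1,000 selects, but only a single database call from the client:

declare
        m_n     number;
begin
        for i in 1..1000 loop
                select 1 into m_n from dual;    -- 1,000 executions inside one call
        end loop;
end;
/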

For other side effects with resumability think about what else is going on around your session. If you allow a session to suspend for (say) 3600 seconds and it manages to resume just in time to avoid a timeout it now has 3,600 seconds of database changes to unwind if it’s trying to produce a read-consistent result; so not only do you have to allow for increasing the size of the undo tablespace and increasing the undo retention time, you have to allow for the fact that when the process resumes it may run much more slowly than usual because it spends more of its time trying to see the data as it was before it suspended, which may require far more single block reads of the undo tablespace – and the session may then crash anyway with an Oracle error ORA-01555 (which is so well-known that I won’t quote the text).

In the same vein – if a process acquires a huge amount of space in the temporary tablespace (in particular) and fails instantly because it can’t get any more space it normally crashes and releases the space. If you allow that process to suspend for an hour it’s going to hold onto that space – which means other processes that used to run safely may now crash because they find there’s no free space left for them in the temporary tablespace.

Be very cautious when you introduce resumable sessions – you need to understand the global impact, not just the potential benefit to your session.

Getting Alerts

Apart from the (passive) views telling you that a session has suspended it’s also possible to get some form of (active) alert when the event happens. There’s an “after suspend” event that you can use to create a database trigger to take some defensive action, e.g.:

create or replace trigger call_for_help
after suspend
on test_user.schema
begin
        if sysdate between trunc(sysdate) and trunc(sysdate) + 3/24 then
                null;
                -- use utl_mail, utl_smtp et. al. to page the DBA
        end if;
end;
/

This trigger is restricted to the test_user schema, and (code not included) sends a message to the DBA’s pager only between the hours of midnight and 3:00 a.m. Apart from the usual functions in dbms_standard that return error codes, names of objects and so on, you might want to take a look at the dbms_resumable package for the “helper” functions and procedures it supplies.
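For example, the documented dbms_resumable.space_error_info() function lets the trigger identify the object that hit the space problem – a sketch only, with error handling omitted:

declare
        v_error_type            varchar2(64);
        v_object_type           varchar2(64);
        v_object_owner          varchar2(128);
        v_tablespace_name       varchar2(128);
        v_object_name           varchar2(128);
        v_sub_object_name       varchar2(128);
begin
        -- returns true if the current error is a correctable space-related error
        if dbms_resumable.space_error_info(
                v_error_type, v_object_type, v_object_owner,
                v_tablespace_name, v_object_name, v_sub_object_name
        ) then
                dbms_output.put_line(
                        v_error_type || ' for ' || v_object_owner || '.' ||
                        v_object_name || ' in tablespace ' || v_tablespace_name
                );
        end if;
end;
/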

For further information on resumable sessions here’s a link to the 12.2 manual to get you started.

October 3, 2019

Trace Files

Filed under: 12c,Infrastructure,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 1:38 pm BST Oct 3,2019

A recent blog note by Martin Berger about reading trace files in 12.2 popped up in my twitter timeline yesterday and reminded me of a script I wrote a while ago to create a simple view I could query to read the tracefile generated by the current session while the session was still connected. You either have to create the view and a public synonym through the SYS schema, or you have to use the SYS schema to grant select privileges on several dynamic performance views to the user to allow the user to create the view in the user’s schema. For my scratch database I tend to create the view in the SYS schema.

Script to be run by SYS:

rem
rem     Script: read_trace_122.sql
rem     Author: Jonathan Lewis
rem     Dated:  Sept 2018
rem
rem     Last tested
rem             12.2.0.1

create or replace view my_trace_file as
select 
        *
from 
        v$diag_trace_file_contents
where
        (adr_home, trace_filename) = (
                select
                --      substr(tracefile, 1, instr(tracefile,'/',-1)-1),
                        substr(
                                substr(tracefile, 1, instr(tracefile,'/',-1)-1),
                                1,
                                instr(
                                        substr(tracefile, 1, instr(tracefile,'/',-1)),
                                        'trace'
                                ) - 2
                        ),
                        substr(tracefile, instr(tracefile,'/',-1)+1) trace_filename
                from 
                        v$process
                where   addr = (
                                select  paddr
                                from    v$session
                                where   sid = (
                                        sys_context('userenv','sid')
                                        -- select sid from v$mystat where rownum = 1
                                        -- select dbms_support.mysid from dual
                                )
                        )
        )
;


create public synonym my_trace_file for sys.my_trace_file;
grant select on my_trace_file to {some role};

Alternatively, the privileges you could grant to a user from SYS so that they could create their own view:


grant select on v_$process to some_user;
grant select on v_$session to some_user;
grant select on v_$diag_trace_file_contents to some_user;
and optionally one of:
        grant select on v_$mystat to some_user;
        grant execute on dbms_support to some_user;
                but dbms_support is no longer installed by default.

The references to package dbms_support and view v$mystat are historic ones I have lurking in various scripts from the days when the session id (SID) wasn’t available in any simpler way.

Once the view exists and is available, you can enable some sort of tracing from your session then query the view to read back the trace file. For example, here’s a simple “self-reporting” (it’s going to report the trace file that it causes) script that I’ve run from 12.2.0.1 as a demo:


alter system flush shared_pool;
alter session set sql_trace true;

set linesize 180
set trimspool on
set pagesize 60

column line_number      format  999,999
column piece            format  a150    
column plan             noprint
column cursor#          noprint

break on plan skip 1 on cursor# skip 1

select
        line_number,
        line_number - row_number() over (order by line_number) plan,
        substr(payload,1,instr(payload,' id=')) cursor#,
        substr(payload, 1,150) piece
from
        my_trace_file
where
        file_name = 'xpl.c'
order by
        line_number
/

alter session set sql_trace false;

The script flushes the shared pool to make sure that it’s going to trigger some recursive SQL, then enables a simple SQL trace. The query then picks out all the lines in the trace file generated by code in the Oracle source file xpl.c (execution plans seemed like a likely guess), which happens to pick out all the STAT lines in the trace (i.e. the ones showing the execution plans).

I’ve used the “tabibitosan” method to identify all the lines that belong to a single execution plan by assuming that they will be consecutive lines in the output starting from a line which includes the text “ id=1 ” (the surrounding spaces are important), but I’ve also extracted the bit of the line which includes the cursor number (STAT #nnnnnnnnnnnnnnn) because two plans may be dumped one after the other if multiple cursors close at the same time. There is still a little flaw in the script because sometimes Oracle will run a sys-recursive statement in the middle of dumping a plan to turn an object_id into an object_name, and this will cause a break in the output.

The result of the query is to extract all the execution plans in the trace file and print them in the order they appear – here’s a sample of the output:


LINE_NUMBER PIECE
----------- ------------------------------------------------------------------------------------------------------------------------------------------------------
         38 STAT #140392790549064 id=1 cnt=0 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID BATCHED OBJ$ (cr=3 pr=0 pw=0 str=1 time=53 us cost=4 size=113 card
         39 STAT #140392790549064 id=2 cnt=0 pid=1 pos=1 obj=37 op='INDEX RANGE SCAN I_OBJ2 (cr=3 pr=0 pw=0 str=1 time=47 us cost=3 size=0 card=1)'


         53 STAT #140392790535800 id=1 cnt=1 pid=0 pos=1 obj=0 op='MERGE JOIN OUTER (cr=5 pr=0 pw=0 str=1 time=95 us cost=2 size=178 card=1)'
         54 STAT #140392790535800 id=2 cnt=1 pid=1 pos=1 obj=4 op='TABLE ACCESS CLUSTER TAB$ (cr=3 pr=0 pw=0 str=1 time=57 us cost=2 size=138 card=1)'
         55 STAT #140392790535800 id=3 cnt=1 pid=2 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'
         56 STAT #140392790535800 id=4 cnt=0 pid=1 pos=2 obj=0 op='BUFFER SORT (cr=2 pr=0 pw=0 str=1 time=29 us cost=0 size=40 card=1)'
         57 STAT #140392790535800 id=5 cnt=0 pid=4 pos=1 obj=73 op='TABLE ACCESS BY INDEX ROWID TAB_STATS$ (cr=2 pr=0 pw=0 str=1 time=10 us cost=0 size=40 card=1)
         58 STAT #140392790535800 id=6 cnt=0 pid=5 pos=1 obj=74 op='INDEX UNIQUE SCAN I_TAB_STATS$_OBJ# (cr=2 pr=0 pw=0 str=1 time=8 us cost=0 size=0 card=1)'


         84 STAT #140392791412824 id=1 cnt=1 pid=0 pos=1 obj=20 op='TABLE ACCESS BY INDEX ROWID BATCHED ICOL$ (cr=4 pr=0 pw=0 str=1 time=25 us cost=2 size=54 card
         85 STAT #140392791412824 id=2 cnt=1 pid=1 pos=1 obj=42 op='INDEX RANGE SCAN I_ICOL1 (cr=3 pr=0 pw=0 str=1 time=23 us cost=1 size=0 card=2)'


         94 STAT #140392790504512 id=1 cnt=2 pid=0 pos=1 obj=0 op='SORT ORDER BY (cr=7 pr=0 pw=0 str=1 time=432 us cost=6 size=374 card=2)'
         95 STAT #140392790504512 id=2 cnt=2 pid=1 pos=1 obj=0 op='HASH JOIN OUTER (cr=7 pr=0 pw=0 str=1 time=375 us cost=5 size=374 card=2)'
         96 STAT #140392790504512 id=3 cnt=2 pid=2 pos=1 obj=0 op='NESTED LOOPS OUTER (cr=4 pr=0 pw=0 str=1 time=115 us cost=2 size=288 card=2)'
         97 STAT #140392790504512 id=4 cnt=2 pid=3 pos=1 obj=19 op='TABLE ACCESS CLUSTER IND$ (cr=3 pr=0 pw=0 str=1 time=100 us cost=2 size=184 card=2)'
         98 STAT #140392790504512 id=5 cnt=1 pid=4 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=85 us cost=1 size=0 card=1)'
         99 STAT #140392790504512 id=6 cnt=0 pid=3 pos=2 obj=75 op='TABLE ACCESS BY INDEX ROWID IND_STATS$ (cr=1 pr=0 pw=0 str=2 time=8 us cost=0 size=52 card=1)'
        100 STAT #140392790504512 id=7 cnt=0 pid=6 pos=1 obj=76 op='INDEX UNIQUE SCAN I_IND_STATS$_OBJ# (cr=1 pr=0 pw=0 str=2 time=7 us cost=0 size=0 card=1)'
        101 STAT #140392790504512 id=8 cnt=0 pid=2 pos=2 obj=0 op='VIEW  (cr=3 pr=0 pw=0 str=1 time=47 us cost=3 size=43 card=1)'
        102 STAT #140392790504512 id=9 cnt=0 pid=8 pos=1 obj=0 op='SORT GROUP BY (cr=3 pr=0 pw=0 str=1 time=44 us cost=3 size=15 card=1)'
        103 STAT #140392790504512 id=10 cnt=0 pid=9 pos=1 obj=31 op='TABLE ACCESS CLUSTER CDEF$ (cr=3 pr=0 pw=0 str=1 time=21 us cost=2 size=15 card=1)'
        104 STAT #140392790504512 id=11 cnt=1 pid=10 pos=1 obj=30 op='INDEX UNIQUE SCAN I_COBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'


        116 STAT #140392791480168 id=1 cnt=4 pid=0 pos=1 obj=0 op='SORT ORDER BY (cr=3 pr=0 pw=0 str=1 time=62 us cost=3 size=858 card=13)'
        117 STAT #140392791480168 id=2 cnt=4 pid=1 pos=1 obj=21 op='TABLE ACCESS CLUSTER COL$ (cr=3 pr=0 pw=0 str=1 time=24 us cost=2 size=858 card=13)'
        118 STAT #140392791480168 id=3 cnt=1 pid=2 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'


        126 STAT #140392789565328 id=1 cnt=1 pid=0 pos=1 obj=14 op='TABLE ACCESS CLUSTER SEG$ (cr=3 pr=0 pw=0 str=1 time=21 us cost=2 size=68 card=1)'
        127 STAT #140392789565328 id=2 cnt=1 pid=1 pos=1 obj=9 op='INDEX UNIQUE SCAN I_FILE#_BLOCK# (cr=2 pr=0 pw=0 str=1 time=12 us cost=1 size=0 card=1)'


        135 STAT #140392789722208 id=1 cnt=1 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID BATCHED OBJ$ (cr=3 pr=0 pw=0 str=1 time=22 us cost=3 size=51 card=
        136 STAT #140392789722208 id=2 cnt=1 pid=1 pos=1 obj=36 op='INDEX RANGE SCAN I_OBJ1 (cr=2 pr=0 pw=0 str=1 time=16 us cost=2 size=0 card=1)'


        153 STAT #140392792055264 id=1 cnt=1 pid=0 pos=1 obj=68 op='TABLE ACCESS BY INDEX ROWID HIST_HEAD$ (cr=3 pr=0 pw=0 str=1 time=25 us)'
        154 STAT #140392792055264 id=2 cnt=1 pid=1 pos=1 obj=70 op='INDEX RANGE SCAN I_HH_OBJ#_INTCOL# (cr=2 pr=0 pw=0 str=1 time=19 us)'

If you want to investigate further, the “interesting” columns in the underlying view are probably: section_name, component_name, operation_name, file_name, and function_name. The possible names of functions, files, etc. vary with the trace event you’ve enabled.
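As a starting point for that investigation, here’s a sketch of a simple profile of the current trace file by those columns:

select
        section_name, component_name, file_name, function_name,
        count(*)        line_count
from
        my_trace_file
group by
        section_name, component_name, file_name, function_name
order by
        count(*) desc
;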

 

October 2, 2019

_cursor_obsolete_threshold

Filed under: 12c,Infrastructure,Oracle,Performance — Jonathan Lewis @ 2:39 pm BST Oct 2,2019

At the recent Trivadis Performance Days in Zurich, Chris Antognini answered a question that had been bugging me for some time: why would Oracle want to set the default value of _cursor_obsolete_threshold to a value like 8192 in 12.2?

In 11.2.0.3 the parameter was introduced with the default value 100; then in 11.2.0.4, continuing into 12.1, the default value increased to 1,024 – so what possible reason could anyone have for thinking that 8,192 was a good idea?

The answer is PDBs – specifically the much larger number of PDBs a single CDB can (theoretically) support in 12.2.

In fact a few comments, and the following specific explanation, are available on MoS in Doc ID 2431353.1 “High Version Counts For SQL Statements (>1024) Post Upgrade To 12.2 and Above Causing Database Slow Performance”:

The default value of _cursor_obsolete_threshold is increased heavily (8192 from 1024) from 12.2 onwards in order to support 4096 PDBs which was only 252 PDBs till 12.1. This parameter value is the maximum limit for obsoleting the parent cursors in an multitenant environment and cannot be increased beyond 8192.

Having said, this is NOT applicable for non-CDB environment and hence for those databases, this parameter should be set to 12.1 default value manually i.e. 1024. The default value of 1024 holds good for non-CDB environment and the same parameter can be adjusted case-to-case basis should there be a problem.

It’s all about PDBs – more precisely, it’s all about CDBs running a huge number of PDBs, which is not necessarily the way that many companies are likely to use PDBs. So if you’re a fairly typical company running a handful of PDBs in a single CDB then it’s probably a good idea to set the parameter down to the 12.1 value of 1024 (and for bad applications I’d consider going even lower) – and this MOS note actually makes that an official recommendation.
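If you do follow that recommendation, remember that hidden parameters need the double-quotes and (I believe) this one takes effect only after an instance restart – a sketch of the command:

alter system set "_cursor_obsolete_threshold" = 1024 scope = spfile;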

Impact analysis

What’s the worst that could happen if you actually have many PDBs all executing the same application and that application has a few very popular and frequently executed statements? Chris Antognini described a model he’d constructed and some tests he’d done to show the effects. The following code is a variation on his work. It addresses the following question:

If you have an application that repeatedly issues (explicitly or implicitly) parse calls but doesn’t take advantage of the session cursor cache, it has to search the library cache by hash_value / sql_id for the parent cursor, then walk the chain of child cursors looking for the right child. What’s the difference in the work done if this “soft parse” has to walk the list to child number 8,191 instead of finding the right cursor at child number 0?

Here’s the complete code for the test:


create table t1
as
select 1 id from dual
/

alter table t1 add constraint t1_pk primary key (id)
/

spool cursor_obsolete.lst

alter system flush shared_pool;
alter system flush shared_pool;

set serveroutput off
select /*+ index(t1) */ id from t1 where id > 0;
select * from table(dbms_xplan.display_cursor);

execute snap_my_stats.start_snap
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+8192 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

column sql_text format a60
select sql_id, child_number, loaded_versions, executions, sql_text from v$sql where sql_text like 'SELECT%T1%' order by child_number;

prompt  ===============
prompt  Low child reuse
prompt  ===============

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

prompt  ================
prompt  High child reuse
prompt  ================

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 7168+1..7168+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

spool off

I’ve created a table with just one row and given it a primary key. My testing query is going to be very short and simple: a query hinted to return that one row by a primary key index range scan.

I’ve flushed the shared pool (twice) to minimise interference from pre-existing information, then executed the statement to populate the dictionary cache and some library cache information and to check the execution plan.

The call to the package snap_my_stats is my standard method for reporting changes in v$mystat across the test. I’ve called the start_snap procedure twice in a row to make sure that its first load doesn’t add some noise to the statistics that we’re trying to capture.
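
If you don’t have a copy of the package, here’s a minimal sketch of something equivalent – it’s only a sketch, and it assumes you have direct grants (not through a role) on v_$statname and v_$mystat so that the views can be referenced from PL/SQL; the real package also reports the snapshot interval, which is where lines like “Interval:- 0 seconds” in the results further down come from:

create or replace package snap_my_stats as
        procedure start_snap;
        procedure end_snap;
end snap_my_stats;
/

create or replace package body snap_my_stats as

        type stat_array is table of number index by varchar2(64);
        m_snap          stat_array;

        procedure start_snap as
        begin
                --  capture the current values of the session activity stats
                for r in (
                        select  sn.name, ms.value
                        from    v$statname sn, v$mystat ms
                        where   ms.statistic# = sn.statistic#
                ) loop
                        m_snap(r.name) := r.value;
                end loop;
        end start_snap;

        procedure end_snap as
        begin
                --  report only the stats that have changed since start_snap
                for r in (
                        select  sn.name, ms.value
                        from    v$statname sn, v$mystat ms
                        where   ms.statistic# = sn.statistic#
                ) loop
                        if m_snap.exists(r.name) and r.value != m_snap(r.name) then
                                dbms_output.put_line(
                                        rpad(r.name, 55) ||
                                        to_char(r.value - m_snap(r.name), '999,999,999,999')
                                );
                        end if;
                end loop;
        end end_snap;

end snap_my_stats;
/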

The test runs in three parts.

  • First I loop 8192 times executing the same statement, but with a different value for the optimizer_index_cost_adj for each execution – this gives me the limit of 8192 child cursors, each reporting “Optimizer Mismatch” as the reason for not sharing. I’ve run a query against v$sql after this to check that I have 8192 child cursors – you’ll need to make sure your shared pool is a few hundred megabytes if you want to be sure of keeping them all in memory.
  • The second part of the test simply repeats the loop, but only for the first 1,024 child cursors. At this point the child cursors exist, so the optimizer should be doing “soft” parses rather than hard parses.
  • The final part of the test repeats the loop again, but only for the last 1,024 child cursors. Again they should exist and be usable, so the optimizer should again be doing “soft” parses rather than hard parses.

What I’m looking for is the extra work it takes for Oracle to find the right child cursor when there’s a very long chain of child cursors. From my memory of dumping the library cache in older versions of Oracle, the parent will point to a “segmented array” of pointers to child cursors, and each segment of the array will consist of 16 pointers, plus a pointer to the next segment. So if you have to find child cursor 8191 you will have to follow 512 segment pointers and check 16 pointers per segment (totalling 8,704 pointers) before you find the child you want – and you’re probably holding a mutex (or latch) while doing so.

One peripheral question to ask, of course, is whether Oracle keeps appending to the segmented array, or whether it uses a “pushdown” approach when allocating a new segment so that newer child cursors are near the start of the array (i.e. will searching for child cursor 0 be the cheapest search or the most expensive one?).

And the results, limited to just the second and third parts, with just a couple of small edits are as follows:


host sdiff -w 120 -s temp1.txt temp2.txt >temp.txt

===============                                            |    ================
Low child reuse                                            |    High child reuse
===============                                            |    ================

Interval:-  0 seconds                                      |    Interval:-  6 seconds

opened cursors cumulative                      2,084       |    opened cursors cumulative                      2,054
recursive calls                                6,263       |    recursive calls                                6,151
recursive cpu usage                               33       |    recursive cpu usage                              570
session logical reads                          1,069       |    session logical reads                          1,027
CPU used when call started                        33       |    CPU used when call started                       579
CPU used by this session                          37       |    CPU used by this session                         579
DB time                                           34       |    DB time                                          580
non-idle wait count                               16       |    non-idle wait count                                5
process last non-idle time                         1       |    process last non-idle time                         6
session pga memory                           524,288       |    session pga memory                            65,536
enqueue requests                                  10       |    enqueue requests                                   3
enqueue releases                                  10       |    enqueue releases                                   3
consistent gets                                1,069       |    consistent gets                                1,027
consistent gets from cache                     1,069       |    consistent gets from cache                     1,027
consistent gets pin                            1,039       |    consistent gets pin                            1,024
consistent gets pin (fastpath)                 1,039       |    consistent gets pin (fastpath)                 1,024
consistent gets examination                       30       |    consistent gets examination                        3
consistent gets examination (fastpath)            30       |    consistent gets examination (fastpath)             3
logical read bytes from cache              8,757,248       |    logical read bytes from cache              8,413,184
calls to kcmgcs                                    5       |    calls to kcmgcs                                    3
calls to get snapshot scn: kcmgss              1,056       |    calls to get snapshot scn: kcmgss              1,026
table fetch by rowid                              13       |    table fetch by rowid                               1
rows fetched via callback                          6       |    rows fetched via callback                          1
index fetch by key                                 9       |    index fetch by key                                 1
index scans kdiixs1                            1,032       |    index scans kdiixs1                            1,024
session cursor cache hits                         14       |    session cursor cache hits                          0
cursor authentications                         1,030       |    cursor authentications                         1,025
buffer is not pinned count                     1,066       |    buffer is not pinned count                     1,026
parse time cpu                                    23       |    parse time cpu                                   558
parse time elapsed                                29       |    parse time elapsed                               556
parse count (total)                            2,076       |    parse count (total)                            2,052
parse count (hard)                                11       |    parse count (hard)                                 3
execute count                                  1,050       |    execute count                                  1,028
bytes received via SQL*Net from client         1,484       |    bytes received via SQL*Net from client         1,486

Two important points to note:

  • the CPU utilisation goes up from 0.33 seconds to 5.7 seconds.
  • the number of hard parses is virtually zero; this is all about searching for the right child cursor in a long chain.

You might question the 2,048-ish figures for parse count (total) – don’t forget that we do an “execute immediate” to change the optimizer_index_cost_adj on each pass through the loop. That’s probably why the parse count doubles, although the “alter session” doesn’t then report as an “execute count”.

The third call to a statement is often an important one – it’s often the first one that doesn’t need “cursor authentication” – so I ran a similar test executing the last two loops a second time: there was no significant change in the CPU or parse activity between the 2nd and 3rd executions of each cursor. For completeness I also ran a test where the loop for the last 1,024 child cursors ran before the loop for the first 1,024. Again this made no significant difference to the results – the low-numbered child cursors take less CPU to find than the high-numbered child cursors.

Bottom line

The longer the chain of child cursors the more time (elapsed and CPU) you spend searching for the correct child; and when a parent is allowed 8,192 child cursors the extra time can become significant. I would claim that the roughly 5 seconds difference in CPU time appearing in this test corresponds almost entirely to an extra 5 milliseconds per parse call spent walking an extra 7,000 steps down the chain.

If you have a well-behaved application that uses the session cursor cache effectively, or uses “held cursors”, then you may not be worried by very long chains of child cursors. But I have seen many applications where cursor caching is not used and every statement execution from the client turns into a parse call (usually implicit) followed by a hunt through the library cache and walk along the child chain. These applications will not scale well if they are cloned to multiple PDBs sharing the same CDB.

Footnote 1

The odd thing about this “cursor obsolete” feature is that I have a distinct memory that when PDBs were introduced at an ACE Director’s meeting a few years ago the first thought that crossed my mind was about the potential for someone running multiple copies of the same application as separate PDBs seeing a lot of library cache latch contention or cursor mutex contention because any popular statement would now be hitting the same parent cursor from multiple PDBs. I think the casual (i.e. neither formal, nor official) response I got when I raised the point was that the calculation of the sql_id in future releases would take the con_id into consideration. It seems that that idea fell by the wayside.

Footnote 2

If you do see a large number of child cursors for a single parent then you will probably end up looking at v$sql_shared_cursor for the sql_id to see if that gives you some good ideas about why a particular statement has generated so many child cursors. For a list of explanations of the different reasons captured in this view MOS Doc ID 296377.1 “Troubleshooting: High Version Count Issues” is a useful reference.
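
For example, as a first-pass query – the column list here is just a small, arbitrary selection from the view’s many Y/N flag columns, and &m_sql_id is whatever sql_id you happen to be investigating:

select  child_number, optimizer_mismatch, bind_mismatch, language_mismatch
from    v$sql_shared_cursor
where   sql_id = '&m_sql_id'
order by
        child_number
;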

September 16, 2019

Little sleeps

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 11:50 am BST Sep 16,2019

A peripheral question in a recent comment (made in response to me asking whether a loop had been written with a sleep time of 1/100th or 1/1000th of a second) asked “How do you sleep for 1/1000th of a second in pure PL/SQL?”

The answer starts with “How pure is pure ?” Here’s a “pure” PL/SQL solution that “cheats” by calling one of the routines in Oracle’s built-in Java library:


create or replace procedure milli_sleep( i_milliseconds in number) 
as 
        language java
        name 'java.lang.Thread.sleep(long)';
/


create or replace procedure nano_sleep( i_milliseconds in number, i_nanoseconds in number)
as
        language java
        name 'java.lang.Thread.sleep(long, int)';
/

prompt  Milli sleep
prompt  ===========
execute milli_sleep(18)

prompt  Nano sleep
prompt  ==========
execute  nano_sleep(0,999999)


The “nano-second” component of the nano_sleep() procedure is restricted to the range 0 – 999999. In both cases the “milli-second” component has to be positive.
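
So a sleep of roughly 2.5 milliseconds, for example, could be requested with a call like the following (subject, as ever, to the granularity that your O/S timer actually supports):

execute nano_sleep(2, 500000)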

Whether your machine is good at handling sleeps of less than 1/100th of a second is another question, of course.

Update – due to popular demand

If you want to find out what else is available in the database you can query the view all_java_methods, searching by partial name (which is very slow) for something you think might exist, for example:

SQL> select owner, name , method_name from all_java_methods where upper(method_name) like '%MILLI%'

OWNER           NAME                                     METHOD_NAME
--------------- ---------------------------------------- ----------------------------------------
SYS             java/util/concurrent/TimeUnit$4          toMillis
SYS             java/util/concurrent/TimeUnit$5          toMillis
SYS             java/util/concurrent/TimeUnit$6          toMillis
SYS             java/util/concurrent/TimeUnit$7          toMillis
SYS             java/util/concurrent/TimeUnit            toMillis
SYS             java/lang/System                         currentTimeMillis
SYS             javax/swing/ProgressMonitor              setMillisToDecideToPopup
SYS             javax/swing/ProgressMonitor              getMillisToDecideToPopup

There’s a lot more than the few listed above – but I just wanted to pick up currentTimeMillis. If you spot something that looks interesting the easiest next step is probably to do a Google search with (for example): Oracle java.lang.system currenttimemillis (alternatively you could just keep a permanent link to Oracle’s manual pages for the Java libraries and search them). In my case this link was high on the list of Google hits, giving me the following method description:


static long 	currentTimeMillis​() 	Returns the current time in milliseconds.

Conveniently this is easy to embed in pl/sql (be careful with case sensitivity):


create or replace function milli_time return number
as
        language java
        name 'java.lang.System.currentTimeMillis() return long';
/

execute dbms_output.put_line(milli_time)
execute dbms_lock.sleep(1)
execute dbms_output.put_line(milli_time)

----------------

SQL> @ java_procs

Function created.

1568719117725

PL/SQL procedure successfully completed.


PL/SQL procedure successfully completed.

1568719118734

PL/SQL procedure successfully completed.

You now have a PL/SQL function that will return the number of milliseconds since 1st January 1970.
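
That makes it easy to wrap a quick and dirty timing harness around a piece of PL/SQL – a trivial sketch, re-using the dbms_lock.sleep() call from above:

declare
        m_start number := milli_time;
begin
        dbms_lock.sleep(0.25);          -- target: roughly 250 ms
        dbms_output.put_line('Elapsed: ' || (milli_time - m_start) || ' ms');
end;
/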

September 13, 2019

Nologging

Filed under: Infrastructure,Oracle,redo — Jonathan Lewis @ 12:18 pm BST Sep 13,2019

Bobby Durrett recently published a note about estimating the volume of non-logged blocks written by an instance with the aim of getting some idea of the extra redo that would be generated if a database were switched to “force logging”.

Since my most recent blog notes have included various extracts and summaries from the symbolic dumps of redo logs it occurred to me that another strategy for generating the same information would be to dump the redo generated by Oracle when it wanted to log some information about non-logged blocks. This may sound like a contradiction, of course, but it’s the difference between data and meta-data: if Oracle wants to write data blocks to disc without logging their contents it needs to write a note into the redo log saying “there is no log of the contents of these blocks”.

In terms of redo op codes this is done through “layer 19”, the set of op codes relating to direct path loads, with op code 19.2 being the specific “invalidate range” one that we are (probably) interested in.

So here’s a little demo of extracting the information we need:

rem
rem     Script:         nologging_writes.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2019
rem
rem     Last tested 
rem             12.2.0.1
rem

column c_scn new_value m_scn_1
select to_char(current_scn,'999999999999999999999999') c_scn from v$database;

create table t1 nologging
as
select  * 
from    all_objects
where   rownum <= 10000
/

column c_scn new_value m_scn_2
select to_char(current_scn,'999999999999999999999999') c_scn from v$database;

alter session set tracefile_identifier = 'TABLE';
alter system dump redo scn min &m_scn_1 scn max &m_scn_2 layer 19;


create index t1_i1
on t1(object_name, owner, object_id)
pctfree 80
nologging
/

column c_scn new_value m_scn_3
select to_char(current_scn,'999999999999999999999999') c_scn from v$database;

alter session set tracefile_identifier = 'INDEX';
alter system dump redo scn min &m_scn_2 scn max &m_scn_3 layer 19;


insert /*+ append */ into t1
select * from t1
/

column c_scn new_value m_scn_4
select to_char(current_scn,'999999999999999999999999') c_scn from v$database;

alter session set tracefile_identifier = 'APPEND';
alter system dump redo scn min &m_scn_3 scn max &m_scn_4 layer 19;


I’ve executed a “create table nologging”, a “create index nologging”, then an “insert /*+ append */” into the nologging table. I’ve captured the current SCN before and after each call, added an individual identifier to the tracefile name for each call, then dumped the redo between each pair of SCNs, restricting the dump to layer 19. (I could have been more restrictive and said “layer 19 opcode 2”, but there is an opcode 19.4 which might also be relevant – though I don’t know when it might appear.)

Here’s the list of trace files I generated, plus a couple extra that appeared around the same time:

 ls -ltr *.trc | tail -6
-rw-r----- 1 oracle oinstall 361355 Sep 12 19:44 orcl12c_ora_23630.trc
-rw-r----- 1 oracle oinstall   5208 Sep 12 19:44 orcl12c_ora_23630_TABLE.trc
-rw-r----- 1 oracle oinstall  27434 Sep 12 19:44 orcl12c_ora_23630_INDEX.trc
-rw-r----- 1 oracle oinstall   2528 Sep 12 19:44 orcl12c_ora_23630_APPEND.trc
-rw-r----- 1 oracle oinstall 162633 Sep 12 19:45 orcl12c_mmon_3042.trc
-rw-r----- 1 oracle oinstall 162478 Sep 12 19:45 orcl12c_gen0_2989.trc


And having identified the trace files we can now extract the block invalidation records (I’ve inserted blank lines to separate the results from the three separate files):

grep OP orcl12c_ora_23630_*.trc

orcl12c_ora_23630_APPEND.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058001bd BLKS:0x0043 OBJ:125947 SCN:0x00000b860da6e1a5 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_APPEND.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800482 BLKS:0x006e OBJ:125947 SCN:0x00000b860da6e1a5 SEQ:1 OP:19.2 ENC:0 FLG:0x0000

orcl12c_ora_23630_INDEX.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058000c4 BLKS:0x0004 OBJ:125948 SCN:0x00000b860da6e162 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_INDEX.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058000c8 BLKS:0x0004 OBJ:125948 SCN:0x00000b860da6e162 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
...
...     70 lines deleted
...
orcl12c_ora_23630_INDEX.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800424 BLKS:0x0004 OBJ:125948 SCN:0x00000b860da6e181 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_INDEX.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800428 BLKS:0x0004 OBJ:125948 SCN:0x00000b860da6e181 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_INDEX.trc:CHANGE #1 MEDIA RECOVERY MARKER CON_ID:3 SCN:0x0000000000000000 SEQ:0 OP:18.3 ENC:0 FLG:0x0000

orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800103 BLKS:0x000d OBJ:125947 SCN:0x00000b860da6e13e SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800111 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e140 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800121 BLKS:0x0007 OBJ:125947 SCN:0x00000b860da6e141 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800268 BLKS:0x0008 OBJ:125947 SCN:0x00000b860da6e142 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800271 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e144 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800081 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e146 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800091 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e148 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058000a1 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e14a SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058000b1 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e14c SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800182 BLKS:0x003b OBJ:125947 SCN:0x00000b860da6e14c SEQ:1 OP:19.2 ENC:0 FLG:0x0000

Each line records the number of blocks (BLKS:) allocated and, as you can see, the APPEND trace shows much larger allocations than the TABLE trace (except for the last one) because the tablespace is locally managed with system allocated extents, and the first few invalidation records for the table creation are in the initial 8 block (64KB) extents; by the time we get to the last few blocks of the initial table creation we’ve just allocated the first 128 block (1MB) extent, which is why the last invalidation record for the table can cover so many more blocks than the first few.

It is interesting to note, though, that the invalidation records for the INDEX trace are all small, typically 4 blocks, sometimes 3, even when we’ve obviously got to a point where we’re allocating from extents of 128 blocks.

I believe that somewhere I have a note explaining that the invalidation records always identified batches of 5 blocks in older versions of Oracle – but that may simply have been a consequence of the way that freelist management used to work (allocating 5 blocks at a time from the segment to the master freelist).

Although we could simply list all the invalidation records and sum the block counts manually we could be a little smarter with our code, summing them with awk, for example.

grep -n "OP:19.2" orcl12c_ora_23630_TABLE.trc |
     sed 's/.*BLKS://' |
     sed 's/ .*$//'  |
     awk '{m = m + strtonum($0) ; printf("%8i %8i \n",strtonum($0),m)}'
 
      13       13 
      15       28 
       7       35 
       8       43 
      15       58 
      15       73 
      15       88 
      15      103 
      15      118 
      59      177 

It’s left as an exercise to the Unix enthusiast to work out how to take the base tracefile name, extract all the sets of data, cater for the odd 18.3 records (whose presence I didn’t request), report any lines for 19.x rows other than 19.2, and sum BLKS separately by TABLE, INDEX, and APPEND.

Once you’ve summed the number of blocks across all the invalidation records (and assuming you’re using the standard 8KB block size) the increase in the volume of redo generated if you alter the database to force logging will be (8KB + a little bit) * number of blocks. The “little bit” will be close to 44 bytes.
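
To put a number on that: the 177 blocks accumulated by the awk script above for the TABLE trace would translate to roughly 177 * (8,192 + 44) bytes, i.e. a little under 1.4MB of extra redo for that one operation.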

If you’ve set your database up to use multiple block sizes you’ll have to aggregate the invalidation records by the AFN (absolute file number) entry and check which files use which block size and multiply up accordingly. And if you’re using a pluggable database (as I was) inside a container database you might also want to filter the redo dump by CON_ID.

If you do set the database to force logging and repeat the search for layer 19 in the redo  you’ll find that each individual data block written using a direct path write generates its own redo record, which will have length “data block size + 44” bytes and hold a single change vector of type 19.1 (Direct Loader block redo entry).

Footnote

It’s worth mentioning that the dump of redo will go back into the archived redo logs in order to cover the entire range requested by the SCN min/max values; so it would be perfectly feasible (though possibly a little time and I/O consuming) to run the report across a full 24 hour window.
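
If you wanted to do that, one way of deriving a starting SCN for (roughly) 24 hours ago – bearing in mind that the timestamp to SCN mapping is only kept for a few days – would be something like the following, after which you can use &m_scn_start in the same “dump redo … layer 19” call as before:

column c_scn new_value m_scn_start
select  to_char(timestamp_to_scn(sysdate - 1),'999999999999999999999999') c_scn
from    dual
;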

September 10, 2019

Update restarts

Filed under: Infrastructure,Oracle,redo,undo — Jonathan Lewis @ 11:00 am BST Sep 10,2019

Somewhere I think I’ve published a note about an anomaly that’s been bugging me since at least Oracle 10g – but if it is somewhere on the Internet it’s hiding itself very well and I can’t find it, though I have managed to find a few scripts on my laptop that make a casual reference to the side effects of the problem. [Ed: a tweet from Timur Ahkmadeev has identified a conversation in the comments on an older post that could explain why I thought I’d written something about the issue.]

Anyway, I’ve decided to create some new code and write the article (all over again, maybe). The problem is a strange overhead that can appear when you do a simple but large update driving off a tablescan.

If you just want the conclusion and don’t feel like reading the detail, it’s this: a big update may roll itself back before it gets to the end of the tablescan and restart “invisibly”, doing a “select for update” (2nd tablescan, one row at a time) followed by the update (3rd tablescan, by array), massively increasing the redo and undo generated.

Here’s the script I’ve created to demonstrate the issue. As its header suggests, the results I’ll be showing come from 12.1.0.2:


rem
rem     Script:         update_restart.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2019
rem     Purpose:        
rem
rem     Last tested 
rem             12.1.0.2
rem

create table t1 
pctfree 80 
nologging
as
with g as (
        select rownum id 
        from dual 
        connect by level <= 500 -- > comment to avoid wordpress format issue
)
select
        rownum id, 
        lpad(rownum,10) v1
from    g,g
/

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

select  blocks, num_rows
from    user_tables 
where   table_name = 'T1'
/

I’ve created a table with 250,000 rows (500 * 500) and gathered stats on it (the table was 3,345 blocks) as gathering stats tends to sort out any block cleanout effects that might otherwise confuse the figures. Having done that I’m going to do an update of every row in the table:


alter system switch logfile;
execute snap_my_stats.start_snap

update t1 set id = id;

execute snap_my_stats.end_snap
execute dump_log

The update doesn’t actually change the column value (and in 12.2 there’s an interesting change to the Oracle code that comes into play in that case) but it will still generate redo and undo for every row in the table, and update ITL entries and do all the rest of the work we normally associate with an update.

Before the update I’ve issued a “switch logfile” and after the update I’ve called a procedure I wrote to do a symbolic dump of the current log file (“alter system dump logfile”). The calls to the snap_my_stats package allow me to see the changes in the session activity stats due to the update. A simple check of these statistics can often give you a good idea of what’s going on. So you might ask yourself what you might expect to see in the stats from this update. (As Isaac Asimov once said: The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’, but ‘That’s funny …’)
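
(For reference: dump_log is just a little wrapper procedure. A minimal sketch of the idea – assuming you have direct grants on v_$log and v_$logfile, and the privilege to issue the alter system call from PL/SQL – might look like this:)

create or replace procedure dump_log
as
        m_file  varchar2(513);
begin
        --  pick one member of the current online redo log group
        select  f.member
        into    m_file
        from    v$log l, v$logfile f
        where   f.group# = l.group#
        and     l.status = 'CURRENT'
        and     rownum  = 1;

        execute immediate 'alter system dump logfile ''' || m_file || '''';
end;
/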

Having updated 250,000 rows by tablescan I think we might expect to see a tablescan of 250,000 rows; we might also expect to see (in the worst case) something like 250,000 redo entries although we might expect to see far fewer since Oracle can probably do array updates and create one redo change vector and one undo record per block changed rather than per row changed.

So here are some of the more interesting numbers that we actually see (though the specific results are not 100% reproducible):

Name                                                                     Value
----                                                                     -----
session logical reads                                                  270,494
db block gets                                                          263,181
consistent gets                                                          7,313
db block changes                                                       534,032
free buffer requested                                                    7,274
switch current to new buffer                                             3,534
redo entries                                                           265,128
redo size                                                           83,803,996
rollback changes - undo records applied                                  1,002
table scan rows gotten                                                 522,470
table scan blocks gotten                                                 6,881 
HSC Heap Segment Block Changes                                         264,135
Heap Segment Array Updates                                              14,135

A pair of figures that stand out as most surprising are the tablescan figures – I’ve got a table of 3,345 blocks and 250,000 rows, but my tablescans total 6,881 blocks and 522,470 rows – I seem to have scanned the table (slightly more than) twice! At the same time I’ve requested 7,274 free buffers (some of them probably for undo blocks) and “switched current to new buffer” 3,534 times – so I seem to have read every table block then cloned it in memory.

Side note: the “switched current to new buffer” is a common feature of a tablescan update and not particularly relevant to this article. I assume the intent is to “leave behind” a read-consistent copy for any other queries that might have started before the big update.

Another really strange statistic is the one that says “rollback changes – undo records applied” – I’ve used 1,002 undo records in a rollback operation of some sort. I haven’t shown you the stats for “user rollbacks” or “transaction rollbacks” because, despite the undo records being applied, there were no rollbacks reported.

What about changes? We see 534,000 “db block changes” – and maybe it’s okay for that statistic to be roughly twice the number of rows we expect to change because for every row we change we have to generate some undo, so we may have to modify an undo block to create an undo record – but there are still 34,000 “db block changes” too many. The number of changes is fairly consistent with the number of “redo entries”, of course, allowing for a factor of 2 because redo entries often consist of a matching pair of changes, one for the “forward change” and one for the “undo block change”. Again, though, there’s a bit of an excess: the 34,000 extra “db block changes” seem to have resulted in an excess 15,000 “redo entries”. And we still have to ask: “Whatever happened to array processing?”

In fact, we do see some array processing – we did 14,135 “Heap Segment Array Updates” which, at an average of about 18 rows per array would account for our 250,000 row table – but we also did 264,135 “HSC Heap Segment Block Changes”, and here’s a (“That’s funny …”) coincidence: 264,135 – 14,135 = 250,000: the number of rows in the table. Maybe that means something, maybe not. Another funny coincidence: 264,135 (HSC Heap Segment Block Changes) + 1,002 (rollback changes – undo records applied) = 265,137 which is remarkably close to the number of “redo entries”.

If, by now, you’re not beginning to suspect that something a little odd has happened you haven’t been reading enough Agatha Christie novels, or watching enough Sherlock Holmes (or House) videos.

I don’t always … dump trace files but when I do I like to go over the top. My update generated 83MB of redo so I made sure my log files were 100MB each, and when I dumped the current log file I got a trace file that was a little over 466MB and 11M lines. Clearly I had no intention of reading it all. A critical skill in reading ANY type of trace file is knowing in advance what you ought to be looking for and working out a painless way of finding it. (Practising by slugging your way through a very small example is a good starting point.)

In this case I started by using grep to extract all the lines containing “OP:”, stripping them down to just the “OP:” followed by the redo op code, and then sorting and summarising the use of op codes. Here (with some annotations) is what I got:

grep OP or32_ora_14652.trc | sed 's/.*OP/OP/' | sed 's/ .*$//' | sort | uniq -c | sort -n

      1 OP:11.3
      2 OP:10.4
      2 OP:11.5
      3 OP:23.1
      3 OP:4.1
      4 OP:14.4
      4 OP:22.5
      5 OP:13.22
      8 OP:11.2
      8 OP:22.2
     13 OP:10.2
     25 OP:5.4
    119 OP:5.11         Mark undo as applied during rollback
    883 OP:5.6          Applying undo - NB 883 + 119 = 1002
   3776 OP:5.2          Get undo segment header
  15139 OP:11.19        Array update of table
 250000 OP:11.4         Lock table row piece
 264190 OP:5.1          Update undo block

The interesting point here is that I seem to have locked every single row in the table separately, but I’ve also done array updates on the table, and I’ve done some rolling back. I need to do a more detailed check to see when things happened, so let’s begin by examining a few thousand lines near the start of the trace:

head -10000 or32_ora_14652.trc | grep -n OP | sed 's/CON_ID.*OP/ OP/' | sed 's/ ENC.*$//'

68:CHANGE #1  OP:11.19
214:CHANGE #2  OP:5.2
217:CHANGE #3  OP:11.19
363:CHANGE #4  OP:11.19
509:CHANGE #5  OP:5.2
512:CHANGE #6  OP:11.19
658:CHANGE #7  OP:11.19
692:CHANGE #8  OP:11.19
838:CHANGE #9  OP:11.19
984:CHANGE #10  OP:11.19
1130:CHANGE #11  OP:11.19
1276:CHANGE #12  OP:11.19
1422:CHANGE #13  OP:11.19
1568:CHANGE #14  OP:11.19
1714:CHANGE #15  OP:5.2
1717:CHANGE #16  OP:11.19
1863:CHANGE #17  OP:11.19
2009:CHANGE #18  OP:11.19
2155:CHANGE #19  OP:11.19
2301:CHANGE #20  OP:11.19
2447:CHANGE #21  OP:11.19
2593:CHANGE #22  OP:11.19
2739:CHANGE #23  OP:11.19
2885:CHANGE #24  OP:5.2
2888:CHANGE #25  OP:11.19
3034:CHANGE #26  OP:11.19
3180:CHANGE #27  OP:5.1
3336:CHANGE #28  OP:5.1
3489:CHANGE #29  OP:5.1
3642:CHANGE #30  OP:5.1
3795:CHANGE #31  OP:5.1
3836:CHANGE #32  OP:5.1
3989:CHANGE #33  OP:5.1
4142:CHANGE #34  OP:5.1
4295:CHANGE #35  OP:5.1
4448:CHANGE #36  OP:5.1
4601:CHANGE #37  OP:5.1
4754:CHANGE #38  OP:5.1
4907:CHANGE #39  OP:5.1
5060:CHANGE #40  OP:5.1
5213:CHANGE #41  OP:5.1
5366:CHANGE #42  OP:5.1
5519:CHANGE #43  OP:5.1
5672:CHANGE #44  OP:5.1
5825:CHANGE #45  OP:5.1
5978:CHANGE #46  OP:5.1
6131:CHANGE #47  OP:5.1
6284:CHANGE #48  OP:5.1
6440:CHANGE #1  OP:5.1
6593:CHANGE #2  OP:11.19
6742:CHANGE #1  OP:5.1
6895:CHANGE #2  OP:11.19
7044:CHANGE #1  OP:5.1
7197:CHANGE #2  OP:11.19
7346:CHANGE #1  OP:5.1
7499:CHANGE #2  OP:11.19
7648:CHANGE #1  OP:5.1
7801:CHANGE #2  OP:11.19
7950:CHANGE #1  OP:5.2
7953:CHANGE #2  OP:5.1
8106:CHANGE #3  OP:11.19
8255:CHANGE #1  OP:5.1
8408:CHANGE #2  OP:11.19
8557:CHANGE #1  OP:5.1
8710:CHANGE #2  OP:11.19
8859:CHANGE #1  OP:5.1
9012:CHANGE #2  OP:11.19
9161:CHANGE #1  OP:5.1
9314:CHANGE #2  OP:11.19
9463:CHANGE #1  OP:5.1
9616:CHANGE #2  OP:11.19
9765:CHANGE #1  OP:5.1
9918:CHANGE #2  OP:11.19


Everything up to line 6284 (change #48) is one big “private redo” record where we do several 11.19 (“table array update”) vectors followed by a matching group of 5.1 (“update undo block”) vectors, before switching to the “public redo” mechanism used by long transactions where each redo record is a matched pair of “backward” and “forward” change vectors, i.e. one 5.1 with a matching 11.19.
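
(As an aside, if you want to see how much work a session pushed through the private redo / in-memory undo mechanism, the IMU statistics in the session activity stats give a clue – a quick query, assuming you have access to the dynamic performance views:)

select  sn.name, ms.value
from    v$statname sn, v$mystat ms
where   ms.statistic# = sn.statistic#
and     sn.name like 'IMU%'
order by
        sn.name
;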

So we start by doing array updates without doing any single row locking – so I want to know when the array updates stop. Here’s a small but continuous piece from the next grep I ran (which produced an output totalling 15,000 lines):

grep -n "OP:11.19" or32_ora_14652.trc | sed 's/:.*OP/:-OP/' | sed 's/ .*$//' 
...
446693:-OP:11.19
446736:-OP:11.19
446891:-OP:11.19
447045:-OP:11.19
447200:-OP:11.19
7461225:-OP:11.19    *** Big jump
7461527:-OP:11.19
7461829:-OP:11.19
7462131:-OP:11.19
7462321:-OP:11.19
7462511:-OP:11.19
...

The 11.19 opcodes were appearing roughly every 300 lines of trace, except they stopped after 2,004 occurrences then disappeared for about 7M lines before reappearing and running evenly (ca. every 300 lines) to the end of the trace file. Clearly we need to know what happened somewhere around lines 447200 and 7461225. But before looking too closely, let’s find out where the “11.4 lock row piece” and “5.6 apply undo” op codes start to appear:

grep -n "OP:11.4" or32_ora_14652.trc | sed 's/:.*OP/:-OP/' | sed 's/ .*$//' 

447374:-OP:11.4
447405:-OP:11.4
447433:-OP:11.4
447461:-OP:11.4
447489:-OP:11.4
447517:-OP:11.4
...
... lots of evenly spread 11.4, every 28 lines
...
7460948:-OP:11.4
7460976:-OP:11.4
7461004:-OP:11.4
7461032:-OP:11.4
7461060:-OP:11.4

Note where the 11.4 opcodes begin and end – they fit exactly into the gap between the two sets of 11.19 opcodes. And where do the 5.6 op codes come into play ?

grep -n "OP:5.6" or32_ora_14652.trc | sed 's/:.*OP/:-OP/' | sed 's/ .*$//'

295919:-OP:5.6
296046:-OP:5.6
296201:-OP:5.6
296356:-OP:5.6
...
... a few hundred appearances of 5.6, commonly spaced by 155 lines
...
446529:-OP:5.6
446684:-OP:5.6
446727:-OP:5.6
446882:-OP:5.6
447191:-OP:5.6

That was the entire set of “5.6 apply undo” records – a short burst of activity which finishes just before the big gap in the 11.19 opcodes. This leads to the question: “does anything different happen around the time that the 5.6 op codes start to appear?” Time to look at a very small extract from a very big grep output:

grep -n "OP" or32_ora_14652.trc | sed 's/ CON_ID.*OP/:-OP/' | sed 's/ ENC.*$//' 
..
294895:CHANGE #1:-OP:5.1
295048:CHANGE #2:-OP:11.19

295197:CHANGE #1:-OP:5.1
295322:CHANGE #2:-OP:11.19

295443:CHANGE #1:-OP:5.1
295596:CHANGE #2:-OP:11.19

295745:CHANGE #1:-OP:22.2

295753:CHANGE #1:-OP:22.5

295760:CHANGE #1:-OP:14.4
295766:CHANGE #2:-OP:22.2

295773:CHANGE #1:-OP:11.19
295919:CHANGE #2:-OP:5.6

295928:CHANGE #1:-OP:11.19
296046:CHANGE #2:-OP:5.6

296055:CHANGE #1:-OP:11.19
296201:CHANGE #2:-OP:5.6

...

I’ve inserted blank lines between redo records in this tiny extract from around the point where the 5.6 opcodes appear. The first few records show Oracle doing the update, then there’s a little bit of space management (op codes in layers 22 and 14) and then Oracle starts rolling back. (Note: whether updating or rolling back, Oracle applies “11.19 table array update” vectors – but the “5.6 apply undo” vectors appear as change vector #2 in the redo record as we roll back, while the “5.1 create undo” vectors appear as change vector #1 as we make forward changes.)

If we count the number of 11.19 opcodes up to the point where the 11.4 opcodes appear we find that we’ve used 1,002 to update the table, then another 1,002 to roll back the update.

Summary

Without going into very fine detail examining individual redo change vectors, what we’ve seen is as follows:

  • On executing the very large update the session activity statistics show the number of rows accessed by tablescan was more than twice the expected number. This obvious oddity made me look more closely and notice the surprising “rollback changes – undo records applied” which just shouldn’t have been there at all, and this led me to look more closely at the number of redo entries, which averaged more than one per row when it should have been a much smaller number thanks to Oracle’s internal array update mechanism.
  • Extracting just the redo op codes from a dump of the redo log generated during this update, we can see that the work done includes: (a) locking every single row in the table separately, (b) performing a load of table array updates, (c) applying a significant number of undo records to roll back changes.
  • Digging in for just a little more detail we see (in order of activity):
    • A thousand array updates  (roughly 20,000 rows by tablescan – I did actually check a few redo records in detail, 20 rows was a common array size)
    • The thousand updates being rolled back (luckily a long time before we got near the end of the first tablescan)
    • 250,000 rows locked individually (250,000 rows by tablescan – the 2nd tablescan) (NB each LKR – lock row – undo record is 68 bytes of undo)
    • 13,000 array updates   (250,000 rows by tablescan – the third tablescan) – at roughly 100 bytes per row of “overhead” in the undo

I don’t know why this happens – and from memory I think it started happening in 10g. I have an idea it relates to changes in the global SCN as the update takes place and something (possibly related to space management updates) making the update think it has to go into “write consistent” mode as its “start SCN” gets too far away from the current SCN.

The bottom line on this, though, is that there may be cases where a large, tablescan-driven, update could actually do the tablescan three times, and generate far more undo and redo than it should. In theory I could have completed the update in about 13,000 pairs of (11.19, 5.1) change vectors. In practice this simple test produced far more work.

And it gets worse …

Update – prompted by a tweet from Tanel Poder

In one of my tests I had added the /*+ monitor */ hint to see if the activity would be reported correctly. It didn’t look quite right – but what I hadn’t noticed was that the monitor reported the statement as executing 3 times (and, of course, the tablescan executing 3 times):

Global Information
------------------------------
 Status              :  DONE
 Instance ID         :  1
 Session             :  TEST_USER (14:13842)
 SQL ID              :  9a166bpd7455j
 SQL Execution ID    :  16777216
 Execution Started   :  09/10/2019 09:41:54
 First Refresh Time  :  09/10/2019 09:41:54
 Last Refresh Time   :  09/10/2019 09:42:00
 Duration            :  6s
 Module/Action       :  MyModule/MyAction
 Service             :  orcl
 Program             :  sqlplus@vbgeneric (TNS V1-V3)

Global Stats
===================================================================
| Elapsed |   Cpu   |    IO    |  Other   | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Reqs | Bytes |
===================================================================
|    5.84 |    5.77 |     0.01 |     0.06 |     1M |   42 |  26MB |
===================================================================

SQL Plan Monitoring Details (Plan Hash Value=2927627013)
=========================================================================================================================================
| Id |      Operation       | Name |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  | Activity | Activity Detail |
|    |                      |      | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes |   (%)    |   (# samples)   |
=========================================================================================================================================
|  0 | UPDATE STATEMENT     |      |         |      |         4 |     +3 |     3 |        0 |      |       |          |                 |
|  1 |   UPDATE             | T1   |         |      |         6 |     +1 |     3 |        0 |      |       |   100.00 | Cpu (6)         |
|  2 |    TABLE ACCESS FULL | T1   |    250K |  432 |         4 |     +3 |     3 |     698K |   42 |  26MB |          |                 |
=========================================================================================================================================

As you see from Tanel’s tweet, he has a little script you can use to scan through the library cache to see if any (big) updates have suffered from this problem.

 

Addendum

I couldn’t quite manage to leave this problem alone, so I’ve run the test and dumped the redo log three more times and there’s an interesting coincidence. Every time I ran the test, the switch from updating to rolling back was marked by the same 4 (annotated) redo opcodes:


158766:CHANGE #1:-OP:22.2        ktfbhredo - File Space Header Redo
158774:CHANGE #1:-OP:22.5        ktfbbredo - File BitMap Block Redo:
158781:CHANGE #1:-OP:14.4        kteop redo - redo operation on extent map
158787:CHANGE #2:-OP:22.2        ktfbhredo - File Space Header Redo:

The group represents some allocation of space from a file to an object – so the question is: what object gets an extent map update? Here’s the full redo change vector for OP:14.4:


REDO RECORD - Thread:1 RBA: 0x001cdd.000028a4.00d8 LEN: 0x00b0 VLD: 0x01 CON_UID: 0
SCN: 0x0b86.0db28b4c SUBSCN: 88 09/10/2019 15:06:13
CHANGE #1 CON_ID:0 TYP:0 CLS:29 AFN:2 DBA:0x00800688 OBJ:4294967295 SCN:0x0b86.0db28b4c SEQ:10 OP:14.4 ENC:0 RBL:0 FLG:0x0000
kteop redo - redo operation on extent map
   ADD: dba:0x802380 len:128 at offset:6
  ADDRET: offset:6 ctime:0
   SETSTAT: exts:9 blks:911 lastmap:0x0 mapcnt:0
   UPDXNT: extent:6 add:TRUE

The “ADD” is adding extent id 6, of 128 blocks, at file 2 block 9088. That happens to be the undo segment which my update statement is using. Further 14.4 change vectors that appear through the trace file take the size of the undo segment up to 35 extents, but in the (small number of) tests I did it was always the first 14.4 that seemed to trigger the rollback and restart.
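
(If you want to check the file and block number encoded in a dba like 0x802380 for yourself, dbms_utility will do the arithmetic – for the record above this returns file 2, block 9088:)

select
        dbms_utility.data_block_address_file(to_number('802380','XXXXXXXX'))  file#,
        dbms_utility.data_block_address_block(to_number('802380','XXXXXXXX')) block#
from
        dual
;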

Preliminary Hypothesis

When the transaction does sufficient work that it forces its undo segment to extend some feature of the resulting recursive transaction causes the transaction to rollback and restart.

This does fit with the observation that the number of rows updated before the rollback/restart occurs varies apparently at random. It also fits the observation that an update that doesn’t rollback and restart in 12.1 does rollback and restart in 12.2, because it gets a bit unlucky with the “single row locking” used by the 12.2 “no-change update” optimisation.

September 8, 2019

Quiz Night

Filed under: 12c,Infrastructure,Oracle,redo — Jonathan Lewis @ 10:15 pm BST Sep 8,2019

Upgrades cause surprises – here’s a pair of results from a model that I constructed more than 15 years ago, and ran today on 12.2, then modified and ran again, then ran on 11.2.0.4, then on 12.1.0.2. It’s very simple, I just create a table, gather stats, then update every row.

rem
rem     Script:         update_nochange.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2019
rem
rem     Last tested 
rem             12.2.0.1 

create table t1
as
with generator as (
        select
                rownum id 
        from dual 
        connect by 
                rownum <= 1e4  -- > comment to avoid wordpress format issue
)
select
        rownum                          id,
        lpad(rownum,10,'x')             small_vc,
--      lpad(rownum,10,'0')             small_vc,
        'Y'                             flag
from
        generator       v1
where   rownum <= 1e3   -- > comment to avoid wordpress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

execute snap_my_stats.start_snap

update t1 set small_vc = upper(small_vc);

execute snap_my_stats.end_snap

The calls to package snap_my_stats are my little routines to calculate the change in the session activity stats (v$sesstat, joined to v$statname) due to the update. Here are a few of the results for the test using the code as it stands:


Name                                    Value
----                                    -----
redo entries                               36
redo size                             111,756
undo change vector size                53,220

You’ll notice, however, that the CTAS has an option commented out to create the small_vc column using lpad(rownum,10,'0') rather than lpad(rownum,10,'x'). This is what the redo stats look like if I use '0' instead of 'x':


Name                                    Value
----                                    -----
redo entries                              909
redo size                             223,476
undo change vector size                68,256

What – they’re different ?!  (and it’s reproducible).

Running the test on 12.1.0.2 or 11.2.0.4, both variants of the code produce the smaller number of redo entries (and bytes) and undo – it’s only 12.2.0.1 that shows a change. [Update Jan 2020: the same behaviour, with slight variation in numbers, appears in 19.3.0.0]

Tonight’s quiz:

Figure out what’s happening in 12.2.0.1 to give two different sets of undo and redo figures.

If that problem is too easy – extrapolate the test to more complex cases to see when the difference stops appearing, and see if you can find any cases where this new feature might cause existing applications to break.

I’ll supply the basic answer in 48 hours.

Update (a few hours early)

The question has been answered in the comments – it’s an optimisation introduced in 12.2 that attempts to reduce the amount of undo and redo by minimising the work done for “no change” updates to data.  In principle – but we don’t yet know the rules and limitations – if an update does not change the column values Oracle 12.2 will not “save the old values in an undo record and log the new values in a redo change vector”, it will simply lock the row, to produce a minimal redo change vector.

Unfortunately Oracle goes into “single row” mode to lock rows, while it can do “block-level – i.e. multi-row/array” processing if it uses the “change” mechanism.  Inevitably there are likely to be cases where the 12.2 optimisation actually produces a worse result in terms of volume of redo, or contention for redo latches.

If we modify the code to dump the redo generated by the two different updates we can see more clearly what Oracle is doing:

alter session set tracefile_identifier = 'UPD';

column start_scn new_value m_start_scn
select to_char(current_scn,'999999999999999999999999') start_scn from v$database;

update t1 set small_vc = upper(small_vc);
commit;

column end_scn new_value m_end_scn
select to_char(current_scn,'999999999999999999999999') end_scn from v$database;

alter system dump redo scn min &m_start_scn scn max &m_end_scn;

Then, after running the test we can dump the list of redo op codes from the trace file:

First when we do the “no-change” update (with lots of repetitions deleted):

grep -n OP orcl12c_ora_21999_UPD.trc | sed 's/CON_ID.*OP/ OP/' | sed 's/ ENC.*$//'

129:CHANGE #2 MEDIA RECOVERY MARKER  OP:17.28
138:CHANGE #1  OP:11.4
147:CHANGE #2  OP:5.2
150:CHANGE #3  OP:11.4
159:CHANGE #4  OP:11.4
168:CHANGE #5  OP:11.4
177:CHANGE #6  OP:11.4
...
2458:CHANGE #189  OP:5.1
2474:CHANGE #190  OP:5.1
2490:CHANGE #191  OP:5.1
2506:CHANGE #192  OP:5.1
2525:CHANGE #1  OP:5.1
2541:CHANGE #2  OP:11.4
2553:CHANGE #1  OP:5.1
2569:CHANGE #2  OP:11.4
...
27833:CHANGE #1  OP:5.1
27849:CHANGE #2  OP:11.4
27861:CHANGE #1  OP:5.1
27877:CHANGE #2  OP:11.4
27889:CHANGE #1  OP:5.4

The dump starts with a large redo record (192 change vectors) that started life in the private redo buffer, and then switches to the standard “paired change vectors” in the public redo buffer. The 11.4 vectors are “lock row piece” while the 5.1 vectors are the “generate undo”. Counting the 11.4 and 5.1 lines there are exactly 1,000 of each – every row has been individually locked.

Now for the “real change” update:

grep -n OP orcl12c_ora_22255_UPD.trc | sed 's/CON_ID.*OP/ OP/' | sed 's/ ENC.*$//'

126:CHANGE #2 MEDIA RECOVERY MARKER  OP:17.28
135:CHANGE #1  OP:11.19
281:CHANGE #2  OP:5.2
284:CHANGE #3  OP:11.19
430:CHANGE #4  OP:11.19
576:CHANGE #5  OP:11.19
...
5469:CHANGE #41  OP:5.1
5573:CHANGE #42  OP:5.1
5726:CHANGE #43  OP:5.1
5879:CHANGE #44  OP:5.1
6035:CHANGE #1  OP:5.1
6188:CHANGE #2  OP:11.19
6337:CHANGE #1  OP:5.1
6490:CHANGE #2  OP:11.19
...
15029:CHANGE #2  OP:11.19
15101:CHANGE #1  OP:5.1
15177:CHANGE #2  OP:11.19
15249:CHANGE #1  OP:5.4

It’s a much smaller trace file (ca. 15,249 lines compared to ca. 27,889 lines), and the table change vectors are 11.19 (Table array update) rather than 11.4 (table lock row piece). Counting the op codes we get 52 of each of the 11.19 and 5.1. If we want a little more information about those vectors we can do the following:


egrep -n -e "OP:" -e "Array Update" orcl12c_ora_22255_UPD.trc | sed 's/CON_ID.*OP/ OP/' | sed 's/ ENC.*$//' 

126:CHANGE #2 MEDIA RECOVERY MARKER  OP:17.28
135:CHANGE #1  OP:11.19
140:Array Update of 20 rows:
281:CHANGE #2  OP:5.2
284:CHANGE #3  OP:11.19
289:Array Update of 20 rows:
430:CHANGE #4  OP:11.19
435:Array Update of 20 rows:
576:CHANGE #5  OP:11.19
581:Array Update of 20 rows:
...
5469:CHANGE #41  OP:5.1
5481:Array Update of 13 rows:
5573:CHANGE #42  OP:5.1
5585:Array Update of 20 rows:
5726:CHANGE #43  OP:5.1
5738:Array Update of 20 rows:
5879:CHANGE #44  OP:5.1
5891:Array Update of 20 rows:
6035:CHANGE #1  OP:5.1
6047:Array Update of 20 rows:
6188:CHANGE #2  OP:11.19
6193:Array Update of 20 rows:
6337:CHANGE #1  OP:5.1
6349:Array Update of 20 rows:
...
14953:CHANGE #1  OP:5.1
14965:Array Update of 9 rows:
15029:CHANGE #2  OP:11.19
15034:Array Update of 9 rows:
15101:CHANGE #1  OP:5.1
15113:Array Update of 9 rows:
15177:CHANGE #2  OP:11.19
15182:Array Update of 9 rows:
15249:CHANGE #1  OP:5.4

As you can see, the 11.19 (table change) and 5.1 (undo) change vectors both report that they are structured as array updates. In most cases the array size is 20 rows, but there are a few cases where the array size is smaller. In this test I found one update with an array size of 13 rows and three updates with an array size of 9 rows.

Summary

Oracle has introduced an optimisation for “no change” updates in 12.2 that tries to avoid generating quite so much undo and redo; however this may result in some cases where an “array update” change vector turns into several “single row lock” change vectors, so when you upgrade to 12.2 (or beyond) you may want to check any large update mechanism you run to see if your system has benefited or been penalised to any significant extent by this change. The key indicator will be an increase in the value of the session/system stats “redo entries” and “redo size”.
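
If you want a quick check at the system level, differencing these two statistics across a representative batch of work before and after the upgrade is the obvious starting point:

select  name, value
from    v$sysstat
where   name in ('redo entries', 'redo size')
order by
        name
;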
