Oracle Scratchpad

October 11, 2019

v$session

Filed under: Execution plans,Infrastructure,Oracle,Performance — Jonathan Lewis @ 12:29 pm BST Oct 11,2019

Here’s an odd, and unpleasant, detail about querying v$session in the “most obvious” way. (And if you were wondering what made me resurrect and complete a draft on “my session id” a couple of days ago, this posting is the reason). Specifically if you want to select some information for your own session from v$session the query you’re likely to use in any recent version of Oracle will probably be of the form:


select {list for columns} from v$session where sid = to_number(sys_context('userenv','sid'));

Unfortunately that one little statement hides two anomalies – which you can see in the execution plan. Here’s a demonstration cut from an SQL*Plus session running under 19.3.0.0:


SQL> select * from table(dbms_xplan.display_cursor);

SQL_ID  gcfrzq9knynj3, child number 0
-------------------------------------
select program from V$session where sid = sys_context('userenv','sid')

Plan hash value: 2422122865

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  MERGE JOIN CARTESIAN     |                 |     1 |    33 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    12 |     0   (0)|
|*  3 |    FIXED TABLE FULL       | X$KSLWT         |     1 |     8 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSLED (ind:2) |     1 |     4 |     0   (0)|
|   5 |   BUFFER SORT             |                 |     1 |    21 |     0   (0)|
|*  6 |    FIXED TABLE FULL       | X$KSUSE         |     1 |    21 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------ -------------------------------------------
   3 - filter("W"."KSLWTSID"=TO_NUMBER(SYS_CONTEXT('userenv','sid')))
   4 - filter("W"."KSLWTEVT"="E"."INDX")
   6 - filter((BITAND("S"."KSUSEFLG",1)<>0 AND BITAND("S"."KSSPAFLG",1)<>0 AND 
              "S"."INDX"=TO_NUMBER(SYS_CONTEXT('userenv','sid'))
              AND INTERNAL_FUNCTION("S"."CON_ID") AND "S"."INST_ID"=USERENV('INSTANCE')))

As you can see, v$session is a join of 3 separate structures – x$kslwt (v$session_wait), x$ksled (v$event_name), and x$ksuse (the original v$session as it was some time around 8i), and the plan shows two “full tablescans” and a Cartesian merge join. Tablescans and Cartesian merge joins are not necessarily bad – especially where small tables and tiny numbers of rows are concerned – but they do merit at least a brief glance.

x$ksuse is a C structure in the fixed SGA and that structure is a segmented array (which seems to be chunks of 126 entries in 19.3, and chunks of 209 entries in 12.2 – but that’s fairly irrelevant). The SID is simply the index into the array counting from 1, so if you have a query with a predicate like ‘SID = 99’ Oracle can work out the address of the 99th entry in the array and access it very quickly – which is why the SID column is reported as a “fixed index” column in the view v$indexed_fixed_column.

But we have two problems immediately visible:

  1. the optimizer is not using the “index” to access x$ksuse despite the fact that we’re giving it exactly the value we want to use (and we can see a suitable predicate at operation 6 in the plan)
  2. the optimizer has decided to start executing the query at the x$kslwt table

Before looking at why thing’s have gone wrong, let’s check the execution plan to see what would have happened if we’d copied the value from the sys_context() call into a bind variable and queried using the bind variable – which we’ll keep as a character type to make it a fair comparison:

SQL_ID  cm3ub1tctpdyt, child number 0
-------------------------------------
select program from v$session where sid = to_number(:v1)

Plan hash value: 1627146547

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  MERGE JOIN CARTESIAN     |                 |     1 |    32 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    12 |     0   (0)|
|*  3 |    FIXED TABLE FIXED INDEX| X$KSLWT (ind:1) |     1 |     8 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSLED (ind:2) |     1 |     4 |     0   (0)|
|   5 |   BUFFER SORT             |                 |     1 |    20 |     0   (0)|
|*  6 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    20 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("W"."KSLWTSID"=TO_NUMBER(:V1))
   4 - filter("W"."KSLWTEVT"="E"."INDX")
   6 - filter(("S"."INDX"=TO_NUMBER(:V1) AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))


When we have a (character) bind variable instead of a sys_context() value the optimizer manages to use the “fixed indexes”., but it’s still started executing at x$kslwt, and still doing a Cartesian merge join. The plan would be the same if the bind variable were a numeric type, and we’d still get the same plan if we replaced the bind variable with a literal number.

So problem number 1 is that Oracle only seems able to use the fixed index path for literal values and simple bind variables (plus a few “simple” functions). It doesn’t seem to use the fixed indexes for most functions (even deterministic ones) returning a value and the sys_context() function is a particular example of this.

Transitivity

Problem number 2 comes from a side-effect of something that I first described about 13 years ago – transitive closure. Take a look at the predicates in both the execution plans above. Where’s the join condition between x$ksuse and x$kslwt ? There should be one, because the underlying SQL defining [g]v$session  has the following joins:

from
      x$ksuse s,
      x$ksled e,
      x$kslwt w
where
      bitand(s.ksspaflg,1)!=0
and   bitand(s.ksuseflg,1)!=0
and   s.indx=w.kslwtsid       -- this is the SID column for v$session and v$session_wait
and   w.kslwtevt=e.indx
 

What’s happened here is that the optimizer has used transitive closure: “if a = b and b = c then a = c” to clone the predicate “s.indx = to_number(sys_context(…))” to “w.kslwtsid = to_number(sys_context(…))”. But at the same time the optmizer has eliminated the predicate “s.indx = w.kslwtsid”, which it shouldn’t do because this is 12.2.0.1 and ever since 10g we’ve had the parameter _optimizer_transitivity_retain = true — but SYS is ignoring the parameter setting.

So we no longer have a join condition between x$ksuse and x$kslwt – which means there has to be a cartesian merge join between them and the only question is whether this should take place before or after the join between x$kslwt and x$ksled. In fact, the order doesn’t really matter because there will be only one row identified in x$kslwt and one row in x$ksuse, and the join to x$ksled is simply a lookup (by undeclarable unique key) to translate an id into a name and it will take place only once whatever we do about the other two structures.

But there is a catch – especially if your sessions parameter is 25,000 (which it shouldn’t be) and the number of connected sessions is currently 20,000 (which it shouldn’t be) – the predicate against x$ksuse does a huge amount of work as it walks the entire array testing every row (and it doesn’t even do the indx test first – it does a couple of bitand() operations). Even then this wouldn’t be a disaster – we’re only talking a couple of hundredths of a second of CPU – until you find the applications that run this query a huge number of times.

We would prefer to avoid two full tablescans since the arrays could be quite large, and of the two it’s the tablescan of x$ksuse that is going to be the greater threat; so is there a way to bypass the threat?  Once we’ve identified the optimizer anomaly we’ve got a pointer to a solution. Transitivity is going wrong, so let’s attack the transitivity. Checking the hidden parameters we can find a parameter: _optimizer_generate_transitive_pred which defaults to true, so let’s set it to false for the query and check the plan:

select  /*+   opt_param('_optimizer_generate_transitive_pred','FALSE')
*/  program from  v$session where  sid = sys_context('userenv','sid')

Plan hash value: 3425234845

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  NESTED LOOPS             |                 |     1 |    32 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     1 |    28 |     0   (0)|
|   3 |    FIXED TABLE FULL       | X$KSLWT         |    47 |   376 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    20 |     0   (0)|
|*  5 |   FIXED TABLE FIXED INDEX | X$KSLED (ind:2) |     1 |     4 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(("S"."INDX"="W"."KSLWTSID" AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND "S"."INDX"=TO_NUMBER(SYS_CONTEXT('user
              env','sid')) AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))
   5 - filter("W"."KSLWTEVT"="E"."INDX")


Although it’s not nice to insert hidden parameters into the optimizer activity we do have a result. We don’t have any filtering on x$kslwt – fortunately this seems to be limited in size (but see footnote) to the number of current sessions (unlike x$ksuse which has an array size defined by the sessions parameter or derived from the processes parameter). For each row in x$kslwt we do an access into x$ksuse using the “index” (note that we don’t see access predicates for the fixed tables, we just have to note the operation says FIXED INDEX and spot the “index-related” predicate in the filter predicate list), so this strategy has reduced the number of times we check the complex predicate on x$ksuse rows.

It’s still far from ideal, though. What we’d really like to do is access x$kslwt by index using the known value from sys_context(‘userenv’,’sid’). As it stands the path we get from using a hidden parameter which isn’t listed as legal for the opt_param() hint is a plan that we would get if we used an unhinted query that searched for audsid = sys_context(‘userenv’,’sessionid’).

SQL_ID  7f3f9b9f32u7z, child number 0
-------------------------------------
select  program from  v$session where  audsid =
sys_context('userenv','sessionid')

Plan hash value: 3425234845

----------------------------------------------------------------------------------
| Id  | Operation                 | Name            | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                 |       |       |     1 (100)|
|   1 |  NESTED LOOPS             |                 |     2 |    70 |     0   (0)|
|   2 |   NESTED LOOPS            |                 |     2 |    62 |     0   (0)|
|   3 |    FIXED TABLE FULL       | X$KSLWT         |    47 |   376 |     0   (0)|
|*  4 |    FIXED TABLE FIXED INDEX| X$KSUSE (ind:1) |     1 |    23 |     0   (0)|
|*  5 |   FIXED TABLE FIXED INDEX | X$KSLED (ind:2) |     1 |     4 |     0   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(("S"."INDX"="W"."KSLWTSID" AND BITAND("S"."KSSPAFLG",1)<>0
              AND BITAND("S"."KSUSEFLG",1)<>0 AND "S"."KSUUDSES"=TO_NUMBER(SYS_CONTEXT('
              userenv','sessionid')) AND INTERNAL_FUNCTION("S"."CON_ID") AND
              "S"."INST_ID"=USERENV('INSTANCE')))
   5 - filter("W"."KSLWTEVT"="E"."INDX")


The bottom line, then, seems to be that if you need a query by SID against v$session to be as efficient as possible then your best bet is to load a numeric variable with the sys_context(‘userenv’,’sid’) and then select where “sid = :bindvariable”.  Otherwise query by audsid, or use a hidden parameter to affect the optimizer.

Until the the SYS schema follows the _optimizer_transitivity_retain parameter or treats sys_context() the same way it treats a bind variable there is always going to be some unnecessary work when querying v$session and that excess will grow with either the number of connected sessions (if you optimize the query) or with the value of the sessions parameter.

Footnote

In (much) older versions of Oracle v$session_wait sat on top of x$ksusecst, which was part of the same C structure as x$ksuse. In newer versions of Oracle x$kslwt is a structure that is created on demand in the PGA/UGA – I hope that there’s a short cut that allows Oracle to find the waiting elements in x$ksuse[cst] efficiently, rather than requiring a walk through the whole thing, otherwise a tablescan of the (nominally smaller) x$kslwt structure will be at least as expensive as a tablescan of the x$ksuse structure.

Update (just a few minutes after posting)

Bob Bryla has pointed out in a tweet that there are many “bugs” not fixed until 19.1 for which the workaround is to set “_optimizer_transitivity_retain” to false. So maybe this isn’t an example of SYS doing something particularly strange – it may be part of a general reworking of the mechanism that still has a couple of undesirable side effects.

Bob’s comment prompted me to clone the x$ tables into real tables in a non-SYS schema and model the fixed indexes with primary keys, and I found that the resulting plan (though very efficient) still discarded the join predicate. So we may be seeing the side effects of a code enhancement relating to generating predicates that produce unique key access paths. (A “contrary” test, the one in the 2013 article I linked to, still retains the join predicate for the query that has non-unique indexes.)

 

October 9, 2019

My SID

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 12:03 pm BST Oct 9,2019

Here’s a little note that’s been hanging around as a draft for more than eight years according to the OTN (as it was) posting that prompted me to start writing it. At the time there were still plenty of people using Oracle 10g. so the question didn’t seem entirely inappropriate:

On 10g R2 when I open a sqlplus session how can I know my session SID ? I’m not DBA then can not open as sysdba and query v$session.

In all fairly recent versions of Oracle, of course, we have the option to use the sys_context() function to get the SID, but this specific option didn’t appear until some time in the 10g timeline – so you might have spent years “knowing” that you could get the audsid though sys_context(‘userenv’,’sessionid’) there was no equivalent way to get the sid. Now, of course, and even in the timeline of the original posting, the simplest solution to the requirement is to execute:


select sys_context('userenv','sid') from dual;

But there are a number of alternatives – which may occasionally do a better job (and sometimes are just plain silly). It’s also worth noting that even in 19c Oracle still doesn’t have access to v$session.serial# through sys_context() and, anyway, sys_context() behaves like an unpeekable bind variable – which can be a problem.

So here’s the first of several options:

select sid from V$mystat where rownum = 1;

You’ll need SYS to grant you select on v_$mystat to use this one, of course, but v$mystat is a very convenient view giving you the session activity stats since logon for your own session – so there ought to be some mechanism that allows you to see some form of it in place anyway (ideally including the join to v$statname).

One of the oldest ways of getting access to your session ID without having access to any of the dynamic performance views was through the dbms_support package:

variable v1 varchar2(32)
execute :v1 := dbms_support.mysid
execute dbms_output.put_line(:v1)

Again you’ll need SYS to grant you extra privileges, in this case execute on the dbms_support package – worse still, the package is not installed by default. In fact (after installing it) if you call dbms_support.package_version it returns the value: “DBMS_SUPPORT Version 1.0 (17-Aug-1998) – Requires Oracle 7.2 – 8.0.5” – which gives you some idea of how old it is. It used to be useful for the start_trace_in_session() procedure it contains but that procedure has been superseded by many newer mechanisms. If you enable SQL tracing to see what’s happening under the covers when you call dbms_support.mysid you’ll see that the function actually runs the query I showed above against v$mystat .

Unlike dbms_support the dbms_session package is installed automatically with the privilege to execute granted to public,  and this gives you a function to generate a “unique session id”, . The notes in the scripts $ORACLE_HOME/rdbms/admin/dbmssess.sql that create the package say that the return value can be up to 24 bytes long, but so far the maximum I’ve seen is 12.


select dbms_session.unique_session_id from dual;
UNIQUE_SESSION_ID
--------------------------
00FF5F980001


select
        to_number(substr(dbms_session.unique_session_id,1,4),'XXXX') sid,
        to_number(substr(dbms_session.unique_session_id,5,4),'XXXX') serial#,
        to_number(substr(dbms_session.unique_session_id,9,4),'XXXX') instance
from
        dual
;

       SID    SERIAL# INSTANCE
---------- ---------- --------
       255      24472        1

As you can see, the session_unique_id can be decoded to produce three useful bits of information, and the nice thing about this call is that it gives you session serial# at the same time as the SID. It’s possible, of course, that this query is as efficient as it could be, but there’s some scope for writing a query that uses a non-mergeable in-line view to call the function once, then splits the result into three pieces.

While we’re on the session_unique_id, the dbms_pipe package also has a “unique identifier” function unique_session_name():

SQL> select dbms_pipe.unique_session_name from dual;

UNIQUE_SESSION_NAME
------------------------
ORA$PIPE$00FF5F980001

It doesn’t take a lot of effort to spot that the “unique session name” is the “unique session id” of dbms_session prefixed with the text “ORA$PIPE$”. It’s convenient for the dbms_pipe package to be able to generate a unique name so that one session can create a safely named pipe and tell another session about it. Anyone using pipes should take advantage of this function for its original purpose. Unlike dbms_session you’ll need to be granted the privilege to execute this package, it’s not available to public. Interestingly the script that creates dbms_pipe says that this function could return 30 bytes – since it appears to be 9 bytes prepended to the (“could be 24 bytes”) dbms_session.unique_session_id you have to wonder whether there’s something more subtle that could happen.

There may be many more mechanisms available as built-ins, but the last one I know of is in the dbms_debug_jdwp package (another package with execute privilege already granted to public and the ability to supply both the sid and serial#):

SQL> select
  2          dbms_debug_jdwp.current_session_id     sid,
  3          dbms_debug_jdwp.current_session_serial serial#
  4  from dual
  5  /

       SID    SERIAL#
---------- ----------
       255      24472

There is a reason why I’ve decided to resurrect this list of ways of getting at a session’s SID, but that’s the topic of another blog note.

 

 

October 7, 2019

Resumable

Filed under: Infrastructure,Oracle,Tuning — Jonathan Lewis @ 10:31 am BST Oct 7,2019

There are two questions about temporary space that appear fairly regularly on the various Oracle forums. One is of the form:

From time to time my temporary tablespace grows enormously (and has to be shrunk), how do I find what’s making this happen?

The other follows the more basic pattern:

My process sometimes crashes with Oracle error: “ORA-01652: unable to extend temp segment by %n in tablespace %s” how do I stop this happening?

Before moving on to the topic of the blog, it’s worth pointing out two things about the second question:

  • First, it’s too easy to get stuck at the word temp and leap to the conclusion that the problem is about the temporary tablespace without noticing that the error message includes the specific tablespace that’s raised the problem. If, for example, you rebuild an index in a nominated tablespace Oracle first creates the index as a temporary segment (with a name like {starting_file_number}.{starting_block_number}) in that tablespace then renames it to match the original index name once the rebuild is complete and drops the old index.
  • Secondly a process that raises ORA-01652 isn’t necessarily the guilty party – it may be the victim of some other process hogging all the available space when it shouldn’t. Moreover that other process may have completed and released its space by the time you start looking for the problem – causing extra confusion because your process seems to have crashed without a cause. Taking my example of an index rebuild – your index rebuild may fail because someone else was rebuilding a different index at the same time in the same tablespace; but when you check the tablespace all the space from their original index is now free as their rebuild completed in the interim.

So, before you start chasing something that you think is a problem with your code, pause a moment to double-check the error message and think about whether you could have been the victim of some concurrent, but now complete, activity.

I’ve listed the two questions as variants on the same theme because the workaround to one of them introduces the risk of the other – if you want to avoid ORA-01652 you could make all your data files and temp files “autoextensible”, but then there may be occasions when they extend far too much and you need to shrink them down again (and that’s not necessarily easy if it’s not the temporary tablespace). Conversely, if you think your data or temp files randomly explode to ludicrous sizes you could decide on a maximum size for your files and disable autoextension – then handle the complaints when a user reports an ORA-01652.

There are various ways you could monitor your system in near real time to spot the threat as it builds, of course; and there are various ways to identify potentially guilty SQL after the event. You could keep an eye on various v$ dynamic performance views or dba_ administrative views to try and intercept a problem; you could set event 1652 to dump an errorstack (or even systemstate) for post-crash analysis to see what that reported. Neither is an ideal solution – one requires you to pay excessive attention to the system, the other is designed to let the problem happen then leave you to clean up afterwards.  There is, however, a strategy that may stop the problem from appearing without requiring constant monitoring. The strategy is to enable (selectively) resumable operations.

If a resumable operation needs to allocate space but is unable to do so – i.e. it would normally be about to raise ORA-01652 – it will suspend itself for a while going into the wait state “statement suspended, wait error to be cleared” which will show up as the event in v$session_wait, timing out every 2 seconds The session will also be reporting its current action in the view v$resumable or, for slightly more information, dba_resumable. As it suspends the session will also write a message to the alert log but you can also create an “after suspend” database trigger to alert you that a problem has occurred.

If you set the resumable timeout to a suitable value then you may find:

  • the problem goes away of its own accord and the session resumes before the timeout is reached

or

  • you receive a warning and have some time to identify the source of the problem and take the minimum action needed to allow the session to resume

Implementation

The parameter resumable_timeout is a general control for resumable sessions if you don’t handle the feature at a more granular level than the system.

By default this parameter is set to zero which translates into a default value of 7,200 seconds but that default doesn’t come into effect unless a session declares itself resumable. If you set the parameter to a non-zero value all session will automatically be operating as resumable sessions – and you’ll soon hear why you don’t want to do that.

The second enabling feature for resumable sessions is the resumable privilege – a session can’t control it’s own resumability unless the schema has been granted the resumable privilege – which may be granted through a role. If a session has the privilege it may set its own resumable_timeout, even if the system value is zero.

Assume we have set resumable_timeout to 10 (seconds) through the instance parameter file and restarted the instance. If we now issue (for example) the following ‘create table’ statement:


create table t1 (n1, v1 ) 
pctfree 90 pctused 10
tablespace tiny
as
select 
        rownum, cast(lpad('x',800) as varchar2(1000))
from    all_objects
where   rownum <= 20000
/

This will attempt to allocate 1 row per block for 20,000 blocks (plus about 1.5% for bitmap space management blocks) – and tablespace tiny lives up (or down) to its name, consisting of a single file of only 10,000 Oracle blocks. Shortly after starting, the session will hit Oracle error “ORA-01652: unable to extend temp segment by 128 in tablespace TINY”, but it won’t report it; instead it will suspend itself for 10 seconds before failing and reporting the error. This will happen whether or not the session has the resumable privilege – in this case the behaviour is dictated by our setting the system parameter. If you look in the alert log after the session finally errors out you will find text like the following:

2019-10-04T14:01:11.847943+01:00
ORCL(3):ORA-1652: unable to extend temp segment by 128 in tablespace TINY [ORCL] 
ORCL(3):statement in resumable session 'User TEST_USER(138), Session 373, Instance 1' was suspended due to
ORCL(3):    ORA-01652: unable to extend temp segment by 128 in tablespace TINY
2019-10-04T14:01:23.957586+01:00
ORCL(3):statement in resumable session 'User TEST_USER(138), Session 373, Instance 1' was timed out

Note that there’s a 10 (plus a couple) second gap between the point where the session reports that it is suspending itself and the point where it fails with a timeout. The two-extra seconds appear because the session polls every 2 seconds to see whether the problem is still present or whether it has spontaneously disappeared so allowing the session to resume.

Let’s change the game slightly; let’s try to create the table again, but this time execute the following statement first:

alter session enable resumable timeout 60 name 'Help I''m stuck';

The initial response to this will be Oracle error “ORA-01031: insufficient privileges” because the session doesn’t have the resumable privilege, but after granting resumable to the user (or a relevant role) we try again and find we will be allowed a little extra time before the CTAS times out. Our session now overrides the system timeout and will wait 60 seconds (plus a bit) before failing.The “timeout” clause is optional and if we omit it the session will use the system value, similarly the “name” clause is optional though there’s no default for it, it’s just a message that will get into various views and reports.

There are several things you might check in this 60 second grace period. The session wait history will confirm that your session has been timing out every two seconds (as will the active session history if you’re licensed to use it):


select seq#, event, wait_time from v$session_wait_history where sid = 373

      SEQ# EVENT							     WAIT_TIME
---------- ---------------------------------------------------------------- ----------
	 1 statement suspended, wait error to be cleared			   204
	 2 statement suspended, wait error to be cleared			   201
	 3 statement suspended, wait error to be cleared			   201
	 4 statement suspended, wait error to be cleared			   201
	 5 statement suspended, wait error to be cleared			   200
	 6 statement suspended, wait error to be cleared			   200
	 7 statement suspended, wait error to be cleared			   202
	 8 statement suspended, wait error to be cleared			   200
	 9 statement suspended, wait error to be cleared			   200
	10 statement suspended, wait error to be cleared			   200

Then there’s a special dynamic performance view, v$resumable which I’ve reported below using a print_table() procedure that Tom Kyte wrote many, many years ago to report rows in a column format:

SQL> set serveroutput on
SQL> execute print_table('select * from v$resumable where sid = 373')

ADDR                          : 0000000074515B10
SID                           : 373
ENABLED                       : YES
STATUS                        : SUSPENDED
TIMEOUT                       : 60
SUSPEND_TIME                  : 10/04/19 14:26:20
RESUME_TIME                   :
NAME                          : Help I'm stuck
ERROR_NUMBER                  : 1652
ERROR_PARAMETER1              : 128
ERROR_PARAMETER2              : TINY
ERROR_PARAMETER3              :
ERROR_PARAMETER4              :
ERROR_PARAMETER5              :
ERROR_MSG                     : ORA-01652: unable to extend temp segment by 128 in tablespace TINY
CON_ID                        : 0
-----------------
1 rows selected

Notice how the name column reports the name I supplied when I enabled the resumable session. The view also tells us when the critical statement was suspended and how long it is prepared to wait (in total) – leaving us to work out from the current time how much time we have left to work around the problem.

There’s also a dba_resumable variant of the view which is slightly more informative (though the sample below is not consistent with the one above because I ran the CTAS several times, editing the blog as I did so):

SQL> execute print_table('select * from dba_resumable where session_id = 373')

USER_ID                       : 138
SESSION_ID                    : 373
INSTANCE_ID                   : 1
COORD_INSTANCE_ID             :
COORD_SESSION_ID              :
STATUS                        : SUSPENDED
TIMEOUT                       : 60
START_TIME                    : 10/04/19 14:21:14
SUSPEND_TIME                  : 10/04/19 14:21:16
RESUME_TIME                   :
NAME                          : Help I'm stuck
SQL_TEXT                      : create table t1 (n1, v1 ) pctfree 90 pctused 10 tablespace tiny as  select rownum, 
                                cast(lpad('x',800) as varchar2(1000)) from all_objects where rownum <= 20000
ERROR_NUMBER                  : 1652
ERROR_PARAMETER1              : 128
ERROR_PARAMETER2              : TINY
ERROR_PARAMETER3              :
ERROR_PARAMETER4              :
ERROR_PARAMETER5              :
ERROR_MSG                     : ORA-01652: unable to extend temp segment by 128 in tablespace TINY
-----------------
1 rows selected

This view includes the text of the statement that has been suspended and shows us when it started running (so that we can decide whether we really want to rescue it, or might be happy to kill it to allow some other suspended session to resume).

If you look at the alert log in this case you’ll see that the name has been reported there instead of the user, session and instance – which means you might want to think carefully about how you use the name option:


2019-10-04T14:21:16.151839+01:00
ORCL(3):statement in resumable session 'Help I'm stuck' was suspended due to
ORCL(3):    ORA-01652: unable to extend temp segment by 128 in tablespace TINY
2019-10-04T14:22:18.655808+01:00
ORCL(3):statement in resumable session 'Help I'm stuck' was timed out

Once your resumable task has completed (or timed out and failed) you can stop the session from being resumable with the command:

alter session disable resumable;

And it’s important that every time you enable resumability you should disable it as soon as the capability is no longer needed. Also, be careful about when you enable it, don’t be tempted to make every session resumable. Use it only for really important cases. Once a session is resumable virtually everything that goes on in that session is deemed to be resumable, and this has side effects.

The first side effect that may spring to mind is the impact of the view v$resumable – it’s a memory structure in the SGA so that everyone can see it and all the resumable sessions can populate and update it. That means there’s got to be some latch (or mutex) protection going on – and if you look at v$latch you’ll discover that there;s just a single (child) latch doing the job, so resumability can introduce a point of contention. Here’s a simple script (using my “start_XXX” strategy to “select 1 from dual;” one thousand times, with calls to check the latch activity:

set termout off
set serveroutput off
execute snap_latch.start_snap

@start_1000

set termout on
set serveroutput on
execute snap_latch.end_snap(750)

And here are the results of running the script – reporting only the latches with more than 750 gets in the interval – first without and then with a resumable session:

---------------------------------
Latch waits:-   04-Oct 15:04:31
Lower limit:-  750
---------------------------------
Latch                              Gets      Misses     Sp_Get     Sleeps     Im_Gets   Im_Miss Holding Woken Time ms
-----                              ----      ------     ------     ------     -------   ------- ------- ----- -------
session idle bit                  6,011           0          0          0           0         0       0     0      .0
enqueue hash chains               2,453           0          0          0           0         0       0     0      .0
enqueue freelist latch                1           0          0          0       2,420         0       0     0      .0
JS queue state obj latch          1,176           0          0          0           0         0       0     0      .0

SQL> alter session enable resumable;

SQL> @test
---------------------------------
Latch waits:-   04-Oct 15:04:46
Lower limit:-  750
---------------------------------
Latch                              Gets      Misses     Sp_Get     Sleeps     Im_Gets   Im_Miss Holding Woken Time ms
-----                              ----      ------     ------     ------     -------   ------- ------- ----- -------
session idle bit                  6,011           0          0          0           0         0       0     0      .0
enqueue hash chains               2,623           0          0          0           0         0       0     0      .0
enqueue freelist latch                1           0          0          0       2,588         0       0     0      .0
resumable state object            3,005           0          0          0           0         0       0     0      .0
JS queue state obj latch          1,260           0          0          0           0         0       0     0      .0

PL/SQL procedure successfully completed.

SQL> alter session disable resumable;

That’s 1,000 selects from dual – 3,000 latch gets on a single child latch. It looks like every call to the database results in a latch get and an update to the memory structure. (Note: You wouldn’t see the same effect if you ran a loop inside an anonymous PL/SQL block since the block would be the single database call).

For other side effects with resumability think about what else is going on around your session. If you allow a session to suspend for (say) 3600 seconds and it manages to resume just in time to avoid a timeout it now has 3,600 seconds of database changes to unwind if it’s trying to produce a read-consistent result; so not only do you have to allow for increasing the size of the undo tablespace and increasing the undo retention time, you have to allow for the fact that when the process resumes it may run much more slowly than usual because it spends more of its time trying to see the data as it was before it suspended, which may require far more single block reads of the undo tablespace – and the session may then crash anyway with an Oracle error ORA-01555 (which is so well-known that I won’t quote the text).

In the same vein – if a process acquires a huge amount of space in the temporary tablespace (in particular) and fails instantly because it can’t get any more space it normally crashes and releases the space. If you allow that process to suspend for an hour it’s going to hold onto that space – which means other processes that used to run safely may now crash because they find there’s no free space left for them in the temporary tablespace.

Be very cautious when you introduce resumable sessions – you need to understand the global impact, not just the potential benefit to your session.

Getting Alerts

Apart from the (passive) views telling you that a session has suspended it’s also possible to get some form of (active) alert when the event happens. There’s an “after suspend” event that you can use to create a database trigger to take some defensive action, e.g.:

create or replace trigger call_for_help
after suspend
on test_user.schema
begin
        if sysdate between trunc(sysdate) and trunc(sysdate) + 3/24 then
                null;
                -- use utl_mail, utl_smtp et. al. to page the DBA
        end if;
end;
/

This trigger is restricted to the test_user schema, and (code not included) sends a message to the DBA’s pager only between the hours of midnight and 3:00 a.m. Apart from the usual functions in dbms_standard that returnn error codes, names of objects and so on you might want to take a look at the dbms_resumable package for the “helper” functions and procedures it supplies.

For further information on resumable sessions here’s a link to the 12.2 manual to get you started.

October 3, 2019

Trace Files

Filed under: 12c,Infrastructure,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 1:38 pm BST Oct 3,2019

A recent blog note by Martin Berger about reading trace files in 12.2 poped up in my twitter timeline yesterday and reminded me of a script I wrote a while ago to create a simple view I could query to read the tracefile generated by the current session while the session was still connected. You either have to create the view and a public synonym through the SYS schema, or you have to use the SYS schema to grant select privileges on several dynamic performance views to the user to allow the user to create the view in the user’s schema. For my scratch database I tend to create the view in the SYS schema.

Script to be run by SYS:

rem
rem     Script: read_trace_122.sql
rem     Author: Jonathan Lewis
rem     Dated:  Sept 2018
rem
rem     Last tested
rem             12.2.0.1

create or replace view my_trace_file as
select 
        *
from 
        v$diag_trace_file_contents
where
        (adr_home, trace_filename) = (
                select
                --      substr(tracefile, 1, instr(tracefile,'/',-1)-1),
                        substr(
                                substr(tracefile, 1, instr(tracefile,'/',-1)-1),
                                1,
                                instr(
                                        substr(tracefile, 1, instr(tracefile,'/',-1)),
                                        'trace'
                                ) - 2
                        ),
                        substr(tracefile, instr(tracefile,'/',-1)+1) trace_filename
                from 
                        v$process
                where   addr = (
                                select  paddr
                                from    v$session
                                where   sid = (
                                        sys_context('userenv','sid')
                                        -- select sid from v$mystat where rownum = 1
                                        -- select dbms_support.mysid from dual
                                )
                        )
        )
;


create public synonym my_trace_file for sys.my_trace_file;
grant select on my_trace_file to {some role};

Alternatively, the privileges you could grant to a user from SYS so that they could create their own view:


grant select on v_$process to some_user;
grant select on v_$session to some_user;
grant select on v_$diag_trace_file_contents to some_user;
and optionally one of:
        grant select on v_$mystat to some_user;
        grant execute on dbms_support to some_user;
                but dbms_support is no longer installed by default.

The references to package dbms_support and view v$mystat are historic ones I have lurking in various scripts from the days when the session id (SID) wasn’t available in any simpler way.

Once the view exists and is available, you can enable some sort of tracing from your session then query the view to read back the trace file. For example, here’s a simple “self-reporting” (it’s going to report the trace file that it causes) script that I’ve run from 12.2.0.1 as a demo:


alter system flush shared_pool;
alter session set sql_trace true;

set linesize 180
set trimspool on
set pagesize 60

column line_number      format  999,999
column piece            format  a150    
column plan             noprint
column cursor#          noprint

break on plan skip 1 on cursor# skip 1

select
        line_number,
        line_number - row_number() over (order by line_number) plan,
        substr(payload,1,instr(payload,' id=')) cursor#,
        substr(payload, 1,150) piece
from
        my_trace_file
where
        file_name = 'xpl.c'
order by
        line_number
/

alter session set sql_trace false;

The script flushes the shared pool to make sure that it’s going to trigger some recursive SQL then enables a simple SQL trace. The query then picks out all the lines in the trace file generated by code in the Oracle source file xpl.c (execution plans seems like a likely guess) which happens to pick out all the STAT lines in the trace (i.e. the ones showing the execution plans).

I’ve used the “tabibitosan” method to identify all the lines that belong to a single execution plan by assuming that they will be consecutive lines in the output starting from a line which includes the text ” id=1 “ (the surrounding spaces are important), but I’ve also extracted the bit of the line which includes the cursor number (STAT #nnnnnnnnnnnnnnn) because two plans may be dumped one after the other if multiple cursors close at the same time. There is still a little flaw in the script because sometimes Oracle will run a sys-recursive statement in the middle of dumping a plan to turn an object_id into an object_name, and this will cause a break in the output.

The result of the query is to extract all the execution plans in the trace file and print them in the order they appear – here’s a sample of the output:


LINE_NUMBER PIECE
----------- ------------------------------------------------------------------------------------------------------------------------------------------------------
         38 STAT #140392790549064 id=1 cnt=0 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID BATCHED OBJ$ (cr=3 pr=0 pw=0 str=1 time=53 us cost=4 size=113 card
         39 STAT #140392790549064 id=2 cnt=0 pid=1 pos=1 obj=37 op='INDEX RANGE SCAN I_OBJ2 (cr=3 pr=0 pw=0 str=1 time=47 us cost=3 size=0 card=1)'


         53 STAT #140392790535800 id=1 cnt=1 pid=0 pos=1 obj=0 op='MERGE JOIN OUTER (cr=5 pr=0 pw=0 str=1 time=95 us cost=2 size=178 card=1)'
         54 STAT #140392790535800 id=2 cnt=1 pid=1 pos=1 obj=4 op='TABLE ACCESS CLUSTER TAB$ (cr=3 pr=0 pw=0 str=1 time=57 us cost=2 size=138 card=1)'
         55 STAT #140392790535800 id=3 cnt=1 pid=2 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'
         56 STAT #140392790535800 id=4 cnt=0 pid=1 pos=2 obj=0 op='BUFFER SORT (cr=2 pr=0 pw=0 str=1 time=29 us cost=0 size=40 card=1)'
         57 STAT #140392790535800 id=5 cnt=0 pid=4 pos=1 obj=73 op='TABLE ACCESS BY INDEX ROWID TAB_STATS$ (cr=2 pr=0 pw=0 str=1 time=10 us cost=0 size=40 card=1)
         58 STAT #140392790535800 id=6 cnt=0 pid=5 pos=1 obj=74 op='INDEX UNIQUE SCAN I_TAB_STATS$_OBJ# (cr=2 pr=0 pw=0 str=1 time=8 us cost=0 size=0 card=1)'


         84 STAT #140392791412824 id=1 cnt=1 pid=0 pos=1 obj=20 op='TABLE ACCESS BY INDEX ROWID BATCHED ICOL$ (cr=4 pr=0 pw=0 str=1 time=25 us cost=2 size=54 card
         85 STAT #140392791412824 id=2 cnt=1 pid=1 pos=1 obj=42 op='INDEX RANGE SCAN I_ICOL1 (cr=3 pr=0 pw=0 str=1 time=23 us cost=1 size=0 card=2)'


         94 STAT #140392790504512 id=1 cnt=2 pid=0 pos=1 obj=0 op='SORT ORDER BY (cr=7 pr=0 pw=0 str=1 time=432 us cost=6 size=374 card=2)'
         95 STAT #140392790504512 id=2 cnt=2 pid=1 pos=1 obj=0 op='HASH JOIN OUTER (cr=7 pr=0 pw=0 str=1 time=375 us cost=5 size=374 card=2)'
         96 STAT #140392790504512 id=3 cnt=2 pid=2 pos=1 obj=0 op='NESTED LOOPS OUTER (cr=4 pr=0 pw=0 str=1 time=115 us cost=2 size=288 card=2)'
         97 STAT #140392790504512 id=4 cnt=2 pid=3 pos=1 obj=19 op='TABLE ACCESS CLUSTER IND$ (cr=3 pr=0 pw=0 str=1 time=100 us cost=2 size=184 card=2)'
         98 STAT #140392790504512 id=5 cnt=1 pid=4 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=85 us cost=1 size=0 card=1)'
         99 STAT #140392790504512 id=6 cnt=0 pid=3 pos=2 obj=75 op='TABLE ACCESS BY INDEX ROWID IND_STATS$ (cr=1 pr=0 pw=0 str=2 time=8 us cost=0 size=52 card=1)'
        100 STAT #140392790504512 id=7 cnt=0 pid=6 pos=1 obj=76 op='INDEX UNIQUE SCAN I_IND_STATS$_OBJ# (cr=1 pr=0 pw=0 str=2 time=7 us cost=0 size=0 card=1)'
        101 STAT #140392790504512 id=8 cnt=0 pid=2 pos=2 obj=0 op='VIEW  (cr=3 pr=0 pw=0 str=1 time=47 us cost=3 size=43 card=1)'
        102 STAT #140392790504512 id=9 cnt=0 pid=8 pos=1 obj=0 op='SORT GROUP BY (cr=3 pr=0 pw=0 str=1 time=44 us cost=3 size=15 card=1)'
        103 STAT #140392790504512 id=10 cnt=0 pid=9 pos=1 obj=31 op='TABLE ACCESS CLUSTER CDEF$ (cr=3 pr=0 pw=0 str=1 time=21 us cost=2 size=15 card=1)'
        104 STAT #140392790504512 id=11 cnt=1 pid=10 pos=1 obj=30 op='INDEX UNIQUE SCAN I_COBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'


        116 STAT #140392791480168 id=1 cnt=4 pid=0 pos=1 obj=0 op='SORT ORDER BY (cr=3 pr=0 pw=0 str=1 time=62 us cost=3 size=858 card=13)'
        117 STAT #140392791480168 id=2 cnt=4 pid=1 pos=1 obj=21 op='TABLE ACCESS CLUSTER COL$ (cr=3 pr=0 pw=0 str=1 time=24 us cost=2 size=858 card=13)'
        118 STAT #140392791480168 id=3 cnt=1 pid=2 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'


        126 STAT #140392789565328 id=1 cnt=1 pid=0 pos=1 obj=14 op='TABLE ACCESS CLUSTER SEG$ (cr=3 pr=0 pw=0 str=1 time=21 us cost=2 size=68 card=1)'
        127 STAT #140392789565328 id=2 cnt=1 pid=1 pos=1 obj=9 op='INDEX UNIQUE SCAN I_FILE#_BLOCK# (cr=2 pr=0 pw=0 str=1 time=12 us cost=1 size=0 card=1)'


        135 STAT #140392789722208 id=1 cnt=1 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID BATCHED OBJ$ (cr=3 pr=0 pw=0 str=1 time=22 us cost=3 size=51 card=
        136 STAT #140392789722208 id=2 cnt=1 pid=1 pos=1 obj=36 op='INDEX RANGE SCAN I_OBJ1 (cr=2 pr=0 pw=0 str=1 time=16 us cost=2 size=0 card=1)'


        153 STAT #140392792055264 id=1 cnt=1 pid=0 pos=1 obj=68 op='TABLE ACCESS BY INDEX ROWID HIST_HEAD$ (cr=3 pr=0 pw=0 str=1 time=25 us)'
        154 STAT #140392792055264 id=2 cnt=1 pid=1 pos=1 obj=70 op='INDEX RANGE SCAN I_HH_OBJ#_INTCOL# (cr=2 pr=0 pw=0 str=1 time=19 us)'

If you want to investigate further, the “interesting” columns in the underlying view are probably: section_name, component_name, operation_name, file_name, and function_name. The possible names of functions, files, etc. vary with the trace event you’ve enabled.

 

October 2, 2019

_cursor_obsolete_threshold

Filed under: 12c,Infrastructure,Oracle,Performance — Jonathan Lewis @ 2:39 pm BST Oct 2,2019

At the recent Trivadis Performance Days in Zurich, Chris Antognini answered a question that had been bugging me for some time. Why would Oracle want to set the default value of _cursor_obsolete_threshold to a value like 8192 in 12.2 ?

In 11.2.0.3 the parameter was introduced with the default value 100; then in 11.2.0.4, continuing into 12.1, the default value increased to 1,024 – what possible reason could anyone have for thinking that 8192 was a good idea ?

The answer is PDBs – specifically the much larger number of PDBs a single CBD can (theoretically) support in 12.2.

In fact a few comments, and the following specific explanation, are available on MoS in Doc ID 2431353.1 “High Version Counts For SQL Statements (>1024) Post Upgrade To 12.2 and Above Causing Database Slow Performance”:

The default value of _cursor_obsolete_threshold is increased heavily (8192 from 1024) from 12.2 onwards in order to support 4096 PDBs which was only 252 PDBs till 12.1. This parameter value is the maximum limit for obsoleting the parent cursors in an multitenant environment and cannot be increased beyond 8192.

Having said, this is NOT applicable for non-CDB environment and hence for those databases, this parameter should be set to 12.1 default value manually i.e. 1024. The default value of 1024 holds good for non-CDB environment and the same parameter can be adjusted case-to-case basis should there be a problem.

It’s all about PDBs – more precisely, it’s all about CDBs running a huge number of PDBs, which is not necessarily the way that many companies are likely to use PDBs. So if you’re a fairly typical companyy running a handful of PDBs in a single CDB then it’s probably a good idea to set the parameter down to the 12.1 value of 1024 (and for bad applications I’d consider going even lower) – and this MOS note actually makes that an official recommendation.

Impact analysis

What’s the worst that could happen if you actually have many PDBs all executing the same application and that application has a few very popular and frequently executed statements? Chris Antognini described a model he’d constructed and some tests he’d done to show the effects. The following code is a variation onhis work. It addresses the following question:

If you have an application that repeatedly issues (explicitly or implicitly) parse calls but doesn’t take advantage of the session cursor cache it has to search the library cache by hash_value / sql_id for the parent cursor, then has to walk the chain of child cursors looking for the right child. What’s the difference in the work done if this “soft parse” has to walk the list to child number 8,191 instead of finding the right cursor at child number 0.

Here’s the complete code for the test:


create table t1
select 1 id from dual
/

alter table t1 add constraint t1_pk primary key (id)
/

spool cursor_obsolete.lst

alter system flush shared_pool;
alter system flush shared_pool;

set serveroutput off
select /*+ index(t1) */ id from t1 where id > 0;
select * from table(dbms_xplan.display_cursor);

execute snap_my_stats.start_snap
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+8192 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

column sql_text format a60
select sql_id, child_number, loaded_versions, executions, sql_text from v$sql where sql_text like 'SELECT%T1%' order by child_number;

prompt  ===============
prompt  Low child reuse
prompt  ===============

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

prompt  ================
prompt  High child reuse
prompt  ================

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 7168+1..7168+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

spool off

I’ve created a table with just one row and given it a primary key. My testing query is going to be very short and simple. A query hinted to return that one row by primary key index range scan.

I’ve flushed the shared pool (twice) to minimise fringe contention from pre-existing information, then executed the statement to populate the dictionary cache and some library cache information and to check the execution plan.

The call to the package snap_my_stats is my standard method for reporting changes in v$mystat across the test. I’ve called the start_snap procedure twice in a row to make sure that its first load doesn’t add some noise to the statistics that we’re trying to capture.

The test runs in three parts.

  • First I loop 8192 times executing the same statement, but with a different value for the optimizer_index_cost_adj for each execution – this gives me the limit of 8192 child cursors, each reporting “Optimizer Mismatch” as the reason for not sharing. I’ve run a query against v$sql after this to check that I have 8192 child cursors – you’ll need to make sure your shared pool is a few hundred megabytes if you want to be sure of keeping them all in memory.
  • The second part of the test simply repeats the loop, but only for the first 1,024 child cursors. At this point the child cursors exist, so the optimizer should be doing “soft” parses rather than hard parses.
  • The final part of the test repeats the loop again, but only for the last 1,024 child cursors. Again they should exist and be usable, so the optimizer should again be doing “soft” parses rather than hard parses.

What I’m looking for is the extra work it takes for Oracle to find the right child cursor when there’s a very long chain of child cursors. From my memory of dumping the library cache in older versions of Oracle, the parent will point to a “segmented array” of pointers to child cursors, and each segment of the array will consist of 16 pointers, plus a pointer to the next segment. So if you have to find child cursor 8191 you will have to following 512 segment pointers, and 16 pointers per segment (totalling 8708 pointers) before you find the child you want – and you’re probably holding a mutex (or latch) while doing so.

One preipheral question to ask, of course, is whether Oracle keeps appending to the segmented array, or whether it uses a “pushdown” approach when allocating a new segment so that newer child cursors are near the start of the array. (i.e. will searching for child cursor 0 be the cheapest one or the most expensive one).

And the results, limited to just the second and third parts, with just a couple of small edits are as follows:


host sdiff -w 120 -s temp1.txt temp2.txt >temp.txt

===============                                            |    ================
Low child reuse                                            |    High child reuse
===============                                            |    ================

Interval:-  0 seconds                                      |    Interval:-  6 seconds

opened cursors cumulative                      2,084       |    opened cursors cumulative                      2,054
recursive calls                                6,263       |    recursive calls                                6,151
recursive cpu usage                               33       |    recursive cpu usage                              570
session logical reads                          1,069       |    session logical reads                          1,027
CPU used when call started                        33       |    CPU used when call started                       579
CPU used by this session                          37       |    CPU used by this session                         579
DB time                                           34       |    DB time                                          580
non-idle wait count                               16       |    non-idle wait count                                5
process last non-idle time                         1       |    process last non-idle time                         6
session pga memory                           524,288       |    session pga memory                            65,536
enqueue requests                                  10       |    enqueue requests                                   3
enqueue releases                                  10       |    enqueue releases                                   3
consistent gets                                1,069       |    consistent gets                                1,027
consistent gets from cache                     1,069       |    consistent gets from cache                     1,027
consistent gets pin                            1,039       |    consistent gets pin                            1,024
consistent gets pin (fastpath)                 1,039       |    consistent gets pin (fastpath)                 1,024
consistent gets examination                       30       |    consistent gets examination                        3
consistent gets examination (fastpath)            30       |    consistent gets examination (fastpath)             3
logical read bytes from cache              8,757,248       |    logical read bytes from cache              8,413,184
calls to kcmgcs                                    5       |    calls to kcmgcs                                    3
calls to get snapshot scn: kcmgss              1,056       |    calls to get snapshot scn: kcmgss              1,026
table fetch by rowid                              13       |    table fetch by rowid                               1
rows fetched via callback                          6       |    rows fetched via callback                          1
index fetch by key                                 9       |    index fetch by key                                 1
index scans kdiixs1                            1,032       |    index scans kdiixs1                            1,024
session cursor cache hits                         14       |    session cursor cache hits                          0
cursor authentications                         1,030       |    cursor authentications                         1,025
buffer is not pinned count                     1,066       |    buffer is not pinned count                     1,026
parse time cpu                                    23       |    parse time cpu                                   558
parse time elapsed                                29       |    parse time elapsed                               556
parse count (total)                            2,076       |    parse count (total)                            2,052
parse count (hard)                                11       |    parse count (hard)                                 3
execute count                                  1,050       |    execute count                                  1,028
bytes received via SQL*Net from client         1,484       |    bytes received via SQL*Net from client         1,486

Two important points to note:

  • the CPU utilisation goes up from 0.33 seconds to 5.7 seconds.
  • the number of hard parses is zero, this is all about searching for the

You might question are the 2,048-ish parse count(total) – don’t forget that we do an “execute immediate” to change the optimizer_index_cost_adj on each pass through the loop. That’s probably why we double the parse count, although the “alter session” doesn’t then report as an “execute count”.

The third call to a statement is often an important one – it’s often the first one that doesn’t need “cursor authentication”, so I ran a similar test executing the last two loops a second time – there was no significant change in the CPU or parse activity between the 2nd and 3rd executions of each cursor. For completeness I also ran a test with the loop for the last 1,024 child cursors ran before the loop for the first child cursors. Again this made no significant difference to the results – the low number child cursors take less CPU to find than the high number child cursors.

Bottom line

The longer the chain of child cursors the more time (elapsed and CPU) you spend searching for the correct child; and when a parent is allowed 8,192 child cursors the extra time can become significant. I would claim that the ca. 5 seconds difference in CPU time appearing in this test corresponds purely to an extra 5 milliseconds walking an extra 7,000 steps down the chain.

If you have a well-behaved application that uses the session cursor cache effectively, or uses “held cursors”, then you may not be worried by very long chains of child cursors. But I have seen many applications where cursor caching is not used and every statement execution from the client turns into a parse call (usually implicit) followed by a hunt through the library cache and walk along the child chain. These applications will not scale well if they are cloned to multiple PDBs sharing the same CDB.

Footnote 1

The odd thing about this “cursor obselete” feature is that I have a distinct memory that when  PDBs were introduced at an ACE Director’s meeting a few years ago the first thought that crossed my mind was about the potential for someone running multiple copies of the same application as separate PDBs seeing a lot of library cache latch contention or cursor mutex contention because any popular statement would now be hitting the same parent cursor from multiple PDBs. I think the casual (i.e. neither formal, nor official) response I got when I raised the point was that the calculation of the sql_id in future releases would take the con_id into consideration. It seems that that idea fell by the wayside.

Footnote 2

If you do see a large number of child cursors for a single parent then you will probably end up looking at v$sql_shared_cursor for the sql_id to see if that gives you some good ideas about why a particular statement has generated so many child cursors. For a list of explainations of the different reasons captured in this view MOS Doc Id  296377.1“Troubleshooting: High Version Count Issues” is a useful reference.

September 16, 2019

Little sleeps

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 11:50 am BST Sep 16,2019

A peripheral question in a recent comment (made in response to me asking whether a loop had been written with a sleep time of 1/100th or 1/1000th of a second) asked “How do you sleep for 1/1000th of a second in pure PL/SQL?”

The answer starts with “How pure is pure ?” Here’s a “pure” PL/SQL solution that “cheats” by calling one of the routines in Oracle’s built-in Java library:


create or replace procedure milli_sleep( i_milliseconds in number) 
as 
        language java
        name 'java.lang.Thread.sleep(long)';
/


create or replace procedure nano_sleep( i_milliseconds in number, i_nanoseconds in number)
as
        language java
        name 'java.lang.Thread.sleep(long, int)';
/

prompt  Milli sleep
prompt  ===========
execute milli_sleep(18)

prompt  Nano sleep
prompt  ==========
execute  nano_sleep(0,999999)


The “nano-second” component of the nano_sleep() procedure is restricted to the ranage 0 – 999999. In both cases the “milli-second” component has to be positive.

Whether your machine is good at handling sleeps of less than 1/100th of a second is another question, of course.

Update – due to popular demand

If you want to find out what else is available in the database you can query view all_java_methods searching by partial name (which is very slow) for something you think might exist, for example:

SQL> select owner, name , method_name from all_java_methods where upper(method_name) like '%MILLI%'

OWNER           NAME                                     METHOD_NAME
--------------- ---------------------------------------- ----------------------------------------
SYS             java/util/concurrent/TimeUnit$4          toMillis
SYS             java/util/concurrent/TimeUnit$5          toMillis
SYS             java/util/concurrent/TimeUnit$6          toMillis
SYS             java/util/concurrent/TimeUnit$7          toMillis
SYS             java/util/concurrent/TimeUnit            toMillis
SYS             java/lang/System                         currentTimeMillis
SYS             javax/swing/ProgressMonitor              setMillisToDecideToPopup
SYS             javax/swing/ProgressMonitor              getMillisToDecideToPopup

There’s a lot more than the few listed above – but I just wanted to pick up currentTimeMillis. If you spot something that looks interesting the easiest next step is probably to do a google search with (for example): Oracle java.lang.system currenttimemillis (alternatively you could just keep a permanent link to Oracle’s manual pages for the Java and serarch them. In my case this link was high on the list of google hits, giving me the following method description:


static long 	currentTimeMillis​() 	Returns the current time in milliseconds.

Conveniently this is easy to embed in pl/sql (be careful with case sensitivity):


create or replace function milli_time return number
as
        language java
        name 'java.lang.System.currentTimeMillis() return long';
/

execute dbms_output.put_line(milli_time)
execute dbms_lock.sleep(1)
execute dbms_output.put_line(milli_time)

----------------

SQL> @ java_procs

Function created.

1568719117725

PL/SQL procedure successfully completed.


PL/SQL procedure successfully completed.

1568719118734

PL/SQL procedure successfully completed.

You now have a PL/SQL function that will return the number of millisecond since 1st January 1970.

September 13, 2019

Nologging

Filed under: Infrastructure,Oracle,redo — Jonathan Lewis @ 12:18 pm BST Sep 13,2019

Bobby Durrett recently published a note about estimating the volume of non-logged blocks written by an instance with the aim of getting some idea of the extra redo that would be generated if a database were switched to “force logging”.

Since my most recent blog notes have included various extracts and summaries from the symbolic dumps of redo logs it occurred to me that another strategy for generating the same information would be to dump the redo generated by Oracle when it wanted to log some information about non-logged blocks. This may sound like a contradiction, of course, but it’s the difference between data and meta-data: if Oracle wants to write data blocks to disc without logging their contents it needs to write a note into the redo log saying “there is no log of the contents of these blocks”.

In terms of redo op codes this is done through “layer 19”, the set of op codes relating to direct path loads, with op code 19.2 being the specific “invalidate range” one that we are (probably)interested in.

So here’s a little demo of extracting the information we need:

rem
rem     Script:         nologging_writes.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2019
rem
rem     Last tested 
rem             12.2.0.1
rem

column c_scn new_value m_scn_1
select to_char(current_scn,'999999999999999999999999') c_scn from v$database;

create table t1 nologging
as
select  * 
from    all_objects
where   rownum <= 10000
/

column c_scn new_value m_scn_2
select to_char(current_scn,'999999999999999999999999') c_scn from v$database;

alter session set tracefile_identifier = 'TABLE';
alter system dump redo scn min &m_scn_1 scn max &m_scn_2 layer 19;


create index t1_i1
on t1(object_name, owner, object_id)
pctfree 80
nologging
/

column c_scn new_value m_scn_3
select to_char(current_scn,'999999999999999999999999') c_scn from v$database;

alter session set tracefile_identifier = 'INDEX';
alter system dump redo scn min &m_scn_2 scn max &m_scn_3 layer 19;


insert /*+ append */ into t1
select * from t1
/

column c_scn new_value m_scn_4
select to_char(current_scn,'999999999999999999999999') c_scn from v$database;

alter session set tracefile_identifier = 'APPEND';
alter system dump redo scn min &m_scn_3 scn max &m_scn_4 layer 19;


I’ve executed a “create table nologging”, a “create index nologging”, then an “insert /*+ append */” into the nologging table. I’ve captured the current SCN before and after each call, added an individual identifier to the tracefile name for each call, then dumped the redo between each pair of SCNs, restricting the dump to layer 19. (I could have been more restrictive and said “layer 19 opcode 2”, but there is an opcode 19.4 which might also be relevant – though I don’t know when it might appear.)

Here’s the list of trace files I generated, plus a couple extra that appeared around the same time:

 ls -ltr *.trc | tail -6
-rw-r----- 1 oracle oinstall 361355 Sep 12 19:44 orcl12c_ora_23630.trc
-rw-r----- 1 oracle oinstall   5208 Sep 12 19:44 orcl12c_ora_23630_TABLE.trc
-rw-r----- 1 oracle oinstall  27434 Sep 12 19:44 orcl12c_ora_23630_INDEX.trc
-rw-r----- 1 oracle oinstall   2528 Sep 12 19:44 orcl12c_ora_23630_APPEND.trc
-rw-r----- 1 oracle oinstall 162633 Sep 12 19:45 orcl12c_mmon_3042.trc
-rw-r----- 1 oracle oinstall 162478 Sep 12 19:45 orcl12c_gen0_2989.trc


And having identified the trace files we can now extract the block invalidation records (I’ve inserted blank lines to separate the results from the three separate files):

grep OP orcl12c_ora_23630_*.trc

orcl12c_ora_23630_APPEND.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058001bd BLKS:0x0043 OBJ:125947 SCN:0x00000b860da6e1a5 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_APPEND.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800482 BLKS:0x006e OBJ:125947 SCN:0x00000b860da6e1a5 SEQ:1 OP:19.2 ENC:0 FLG:0x0000

orcl12c_ora_23630_INDEX.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058000c4 BLKS:0x0004 OBJ:125948 SCN:0x00000b860da6e162 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_INDEX.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058000c8 BLKS:0x0004 OBJ:125948 SCN:0x00000b860da6e162 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
...
...     70 lines deleted
...
orcl12c_ora_23630_INDEX.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800424 BLKS:0x0004 OBJ:125948 SCN:0x00000b860da6e181 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_INDEX.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800428 BLKS:0x0004 OBJ:125948 SCN:0x00000b860da6e181 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_INDEX.trc:CHANGE #1 MEDIA RECOVERY MARKER CON_ID:3 SCN:0x0000000000000000 SEQ:0 OP:18.3 ENC:0 FLG:0x0000

orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800103 BLKS:0x000d OBJ:125947 SCN:0x00000b860da6e13e SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800111 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e140 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800121 BLKS:0x0007 OBJ:125947 SCN:0x00000b860da6e141 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800268 BLKS:0x0008 OBJ:125947 SCN:0x00000b860da6e142 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800271 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e144 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800081 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e146 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800091 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e148 SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058000a1 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e14a SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x058000b1 BLKS:0x000f OBJ:125947 SCN:0x00000b860da6e14c SEQ:1 OP:19.2 ENC:0 FLG:0x0000
orcl12c_ora_23630_TABLE.trc:CHANGE #1 INVLD CON_ID:3 AFN:22 DBA:0x05800182 BLKS:0x003b OBJ:125947 SCN:0x00000b860da6e14c SEQ:1 OP:19.2 ENC:0 FLG:0x0000

Each line records the number of blocks (BLKS:) allocated and, as you can see, the APPEND trace shows much larger allocations than the TABLE trace (except for the last one) because the tablespace is locally managed with system allocated extents, and the first few invalidation records for the table creation are in the initial 8 block (64KB) extents; by the time we get to the last few blocks of the initial table creation we’ve just allocated the first 128 block (1MB) extent, which is why the last invalidation record for the table can cover so many more blocks than than the first few.

It is interesting to note, though, that the invalidation record for the INDEX trace are all small, typically 4 blocks, sometimes 3, even when we’ve obviously got to a point where we’re allocating from extents of 128 blocks.

I believe that somewhere I have a note explaining that the invalidation records always identified batches of 5 blocks in older versions of Oracle – but that may simply have been a consequence of the way that freelist management used to work (allocating 5 blocks at a time from the segment to the master freelist).

Although we could simply list all the invalidation records and sum the block counts manually we could be a little smarter with our code, summing them with awk, for example.

grep -n "OP:19.2" orcl12c_ora_23630_TABLE.trc |
     sed 's/.*BLKS://' |
     sed 's/ .*$//'  |
     awk '{m = m + strtonum($0) ; printf("%8i %8i \n",strtonum($0),m)}'
 
      13       13 
      15       28 
       7       35 
       8       43 
      15       58 
      15       73 
      15       88 
      15      103 
      15      118 
      59      177 

It’s left as an exercise to the Unix enthusiast to work out how to take the base tracefile name extract all the sets of data, cater for the odd 18.3 records (whose presence I didn’t request), report any lines for 19.x rows other than 19.2 and sum BLKS separately by TABLE, INDEX, and APPEND.

Once you’ve summed the number of blocks across all the invalidation records (and assuming you’re using the standard 8KB block size) the increease in the volume of redo generated if you alter the database to force logging will be (8KB + a little bit) * number of blocks.  The “little bit” will be close to 44 bytes.

If you’ve set your database up to use multiple block sizes you’ll have to aggregate the invalidation recrords by the AFN (absolute file number) entry and check which files use which block size and multiply up accordingly. And if you’re using a pluggable database (as I was) inside a container database you might also want to filter the redo dump by CON_ID.

If you do set the database to force logging and repeat the search for layer 19 in the redo  you’ll find that each individual data block written using a direct path write generates its own redo record, which will have length “data block size + 44” bytes and hold a single change vector of type 19.1 (Direct Loader block redo entry).

Footnote

It’s worth mentioning, that the dump of redo will go back into the archived redo logs in order to cover the entire range requested by the SCN man/max valeus; so it would be perfectly feasible (though possibly a little time and I/O consuming) to run the report across a full 24 hour window.

September 10, 2019

Update restarts

Filed under: Infrastructure,Oracle,redo,undo — Jonathan Lewis @ 11:00 am BST Sep 10,2019

Somewhere I think I’ve published a note about an anomaly that’s been bugging me since at least Oracle 10g – but if it is somewhere on the Internet it’s hiding itself very well and I can’t find it, though I have managed to find a few scripts on my laptop that make a casual reference to the side effects of the provlem. [Ed: a tweet from Timur Ahkmadeev has identified a conversation in the comments on an older post that could explain why I thought I’d written something about the issue.]

Anyway, I’ve decided to create some new code and write the article (all over again, maybe). The problem is a strange overhead that can appear when you do a simple but large update driving off a tablescan.

If you just want the conclusion and don’t feel like reading the detail, it’s this: a big update may roll itself back before it gets to the end of the tablescan and restart “invisibly”, doing a “select for update” (2nd tablescan, one row at a time) followed by the update (3rd tablescan, by array) massively increasing the redo and undo generated.

Here’s the script I’ve created to demonstrate the issue. As its header suggests the results I’ll be showing come from 12.1.0.2


rem
rem     Script:         update_restart.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Seo 2019
rem     Purpose:        
rem
rem     Last tested 
rem             12.1.0.2
rem

create table t1 
pctfree 80 
nologging
as
with g as (
        select rownum id 
        from dual 
        connect by level <= 500 -- > comment to avoid wordpress format issue
)
select
        rownum id, 
        lpad(rownum,10) v1
from    g,g
/

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

select  blocks, num_rows
from    user_tables 
where   table_name = 'T1'
/

I’ve created a table with 250,000 rows (500 * 500) and gathered stats on it (the table was 3,345 blocks) as  gathering stats tends to sort out any issues of block cleanout that might confuse the issue. Having done that I’m going to do an update of every row in the table:


alter system switch logfile;
execute snap_my_stats.start_snap

update t1 set id = id;

execute snap_my_stats.end_snap
execute dump_log

The update doesn’t actually change the column value (and in 12.2 there’s an interesting change to the Oracle code that comes into play in that case) but it will still generate redo and undo for every row in the table, and update ITL entries and do all the rest of the work we normally associate with an update.

Before the update I’ve issued a “switch logfile” and after the update I’ve called a procedure I wrote to do a symbolic dump of the current log file (“alter system dump logfile”). The calls to the snap_my_stats package allow me to see the changes in the session activity stats due to the update. A simple check of these statistics can often give you a good idea of what’s going on. So you might ask yourself what you might expect to see in the stats from this update. (As Isaac Asimov once said: The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’, but ‘That’s funny …’)

Having updated 250,000 rows by tablescan I think we might expect to see a tablescan of 250,000 rows; we might also expect to see (in the worst case) something like 250,000 redo entries although we might expect to see far fewer since Oracle can probably do array updates and create one redo change vector and one undo record per block changed rather than per row changed.

So here are some of the more interesting numbers that we actually see (though the specific results are not 100% reproducible):

Name                                                                     Value
----                                                                     -----
session logical reads                                                  270,494
db block gets                                                          263,181
consistent gets                                                          7,313
db block changes                                                       534,032
free buffer requested                                                    7,274
switch current to new buffer                                             3,534
redo entries                                                           265,128
redo size                                                           83,803,996
rollback changes - undo records applied                                  1,002
table scan rows gotten                                                 522,470
table scan blocks gotten                                                 6,881 
HSC Heap Segment Block Changes                                         264,135
Heap Segment Array Updates                                              14,135

A pair of figures that stand out as most surprising are the tablescan figures – I’ve got a table of 3,345 blocks and 250,000 rows, but my tablescans total 6,881 blocks and 522,370 rows – I seem to have scanned the table (slightly more than) twice! At the same time I’ve requested 7,274 free buffers (some of them probably for undo blocks) and “switched current to new buffer” 3,534 times – so I seem to have read every table block then cloned it in memory.

Side note: the “switched current to new buffer” is a common feature of a tablescan update and not particularly relevant to this article. I assume the intent is to “leave behind” a read-consistent copy for any other queries that might have started before the big update.

Another really strange statistic is the one that says “rollback changes – undo records applied” – I’ve used 1,002 undo records in a rollback operation of some sort. I haven’t show you the stats for “user rollbacks” or “transaction rollbacks” because, despite the undo records being applied, there were no rollbacks reported.

What about changes? We see 534,000 “db block changes” – and maybe it’s okay for that statistic to be roughly twice the number of rows we expect to change because for every row we change we have to generate some undo, so may have modify an undo block to create an undo record – but there are still 34,000 “db block changes” too many. The number of changes is fairly consistent with the number of “redo entries”, of course, allowing for a factor of 2 because redo entries often consist of a matching pair of changes, one for the “forward change” and one for the “undo block change”. Again, though, there a bit of an excess: the 34,000 extra “db block changes” seem to have resulted in an excess 15,000 “redo entries”. And we still have to ask: “Whatever happened to array processing ?”

In fact, we do see some array processing – we did 14,135 “Heap Segment Array Updates” which, at an average of about 18 rows per array would account for our 250,000 row table – but we also did 264,135 “HSC Heap Segment Block Changes”, and here’s a (“That’s funny …”) coincidence: 264,135 – 14,135 = 250,000: the number of rows in the table. Maybe that means something, maybe not. Another funny coincidence: 264,135 (HSC Heap Segment Block Changes) + 1,002 (rollback changes – undo records applied) = 265,137 which is remarkably close to the number of “redo entries”.

If, by now, you’re not beginning to suspect that something a little odd has happened you haven’t been reading enough Agatha Christie novels, or watching enough Sherlock Holmes (or House) videos.

I don’t always … dump trace files but when I do I like to go over the top. My update generated 83MB of redo so I made sure my log files were 100MB each, and when I dumped the current log file I got a trace file that was a little over 466MB and 11M lines. Clearly I had no intention of reading it all. A critical skill in reading ANY type of trace file is knowing in advance what you ought to be looking for and working out a painless way of finding it. (Practising by slugging your way through a very small example is a good starting point.)

In this case I started by using grep to extract all the lines containing “OP:”, stripping them down to just the “OP:” followed by the redo op code, and then sorting and summarising the use of op codes. Here (with some annotations) is what I got:

grep OP or32_ora_14652.trc | sed 's/.*OP/OP/' | sed 's/ .*$//' | sort | uniq -c | sort -n

      1 OP:11.3
      2 OP:10.4
      2 OP:11.5
      3 OP:23.1
      3 OP:4.1
      4 OP:14.4
      4 OP:22.5
      5 OP:13.22
      8 OP:11.2
      8 OP:22.2
     13 OP:10.2
     25 OP:5.4
    119 OP:5.11         Mark undo as applied during rollback
    883 OP:5.6          Applying undo - NB 883 + 119 = 1002
   3776 OP:5.2          Get undo segment header
  15139 OP:11.19        Array update of table
 250000 OP:11.4         Lock table row piece
 264190 OP:5.1          Update undo block

The interesting point here is that I seem to have locked every single row in the table separately, but I’ve also done array updates on the table, and I’ve done some rolling back. So I need to do a more detailed check to see when things happened. So let’s begin by examining a few thousand lines near the start of the trace:

head -10000 or32_ora_14652.trc | grep -n OP | sed 's/CON_ID.*OP/ OP/' | sed 's/ ENC.*$//'

68:CHANGE #1  OP:11.19
214:CHANGE #2  OP:5.2
217:CHANGE #3  OP:11.19
363:CHANGE #4  OP:11.19
509:CHANGE #5  OP:5.2
512:CHANGE #6  OP:11.19
658:CHANGE #7  OP:11.19
692:CHANGE #8  OP:11.19
838:CHANGE #9  OP:11.19
984:CHANGE #10  OP:11.19
1130:CHANGE #11  OP:11.19
1276:CHANGE #12  OP:11.19
1422:CHANGE #13  OP:11.19
1568:CHANGE #14  OP:11.19
1714:CHANGE #15  OP:5.2
1717:CHANGE #16  OP:11.19
1863:CHANGE #17  OP:11.19
2009:CHANGE #18  OP:11.19
2155:CHANGE #19  OP:11.19
2301:CHANGE #20  OP:11.19
2447:CHANGE #21  OP:11.19
2593:CHANGE #22  OP:11.19
2739:CHANGE #23  OP:11.19
2885:CHANGE #24  OP:5.2
2888:CHANGE #25  OP:11.19
3034:CHANGE #26  OP:11.19
3180:CHANGE #27  OP:5.1
3336:CHANGE #28  OP:5.1
3489:CHANGE #29  OP:5.1
3642:CHANGE #30  OP:5.1
3795:CHANGE #31  OP:5.1
3836:CHANGE #32  OP:5.1
3989:CHANGE #33  OP:5.1
4142:CHANGE #34  OP:5.1
4295:CHANGE #35  OP:5.1
4448:CHANGE #36  OP:5.1
4601:CHANGE #37  OP:5.1
4754:CHANGE #38  OP:5.1
4907:CHANGE #39  OP:5.1
5060:CHANGE #40  OP:5.1
5213:CHANGE #41  OP:5.1
5366:CHANGE #42  OP:5.1
5519:CHANGE #43  OP:5.1
5672:CHANGE #44  OP:5.1
5825:CHANGE #45  OP:5.1
5978:CHANGE #46  OP:5.1
6131:CHANGE #47  OP:5.1
6284:CHANGE #48  OP:5.1
6440:CHANGE #1  OP:5.1
6593:CHANGE #2  OP:11.19
6742:CHANGE #1  OP:5.1
6895:CHANGE #2  OP:11.19
7044:CHANGE #1  OP:5.1
7197:CHANGE #2  OP:11.19
7346:CHANGE #1  OP:5.1
7499:CHANGE #2  OP:11.19
7648:CHANGE #1  OP:5.1
7801:CHANGE #2  OP:11.19
7950:CHANGE #1  OP:5.2
7953:CHANGE #2  OP:5.1
8106:CHANGE #3  OP:11.19
8255:CHANGE #1  OP:5.1
8408:CHANGE #2  OP:11.19
8557:CHANGE #1  OP:5.1
8710:CHANGE #2  OP:11.19
8859:CHANGE #1  OP:5.1
9012:CHANGE #2  OP:11.19
9161:CHANGE #1  OP:5.1
9314:CHANGE #2  OP:11.19
9463:CHANGE #1  OP:5.1
9616:CHANGE #2  OP:11.19
9765:CHANGE #1  OP:5.1
9918:CHANGE #2  OP:11.19


Everything up to line 6284 (change# 48) is one big “private redo” record where we do several “11.19 (table array updates)” followed by a matching group of “(5.1 update undo block)”, before switching to the “public redo” mechanism used by long transactions where each redo record is a matched pair of “backward” and “forward” change vectors, i.e. one 5.1 with a matching 11.19.

So we start by doing array updates without doing any single row locking – so I want to know when the array updates stop. Here’s a small but continuous piece from the next grep I ran (which produced an output totalling 15,000 lines):

grep -n "OP:11.19" or32_ora_14652.trc | sed 's/:.*OP/:-OP/' | sed 's/ .*$//' 
...
446693:-OP:11.19
446736:-OP:11.19
446891:-OP:11.19
447045:-OP:11.19
447200:-OP:11.19
7461225:-OP:11.19    *** Big jump
7461527:-OP:11.19
7461829:-OP:11.19
7462131:-OP:11.19
7462321:-OP:11.19
7462511:-OP:11.19
...

The 11.19 opcodes were appearing roughly every 300 lines of trace, except they stopped after 2,004 occurrences then disappeared for about 7M lines before reappering and running evenly (ca. every 300 lines) to the end of the trace file. Clearly we need to know what happened somewhere around lines 447200 and 7461225. But before looking too closely, let’s find out when the “11.4 lock row piece” and “5.6 Apply undo” start to appear”

grep -n "OP:11.4" or32_ora_14652.trc | sed 's/:.*OP/:-OP/' | sed 's/ .*$//' 

447374:-OP:11.4
447405:-OP:11.4
447433:-OP:11.4
447461:-OP:11.4
447489:-OP:11.4
447517:-OP:11.4
...
... lots of evenly spread 11.4, every 28 lines
...
7460948:-OP:11.4
7460976:-OP:11.4
7461004:-OP:11.4
7461032:-OP:11.4
7461060:-OP:11.4

Note where the 11.4 opcodes begin and end – they fit exactly into the gap between the two sets of 11.19 opcodes. And where do the 5.6 op codes come into play ?

grep -n "OP:5.6" or32_ora_14652.trc | sed 's/:.*OP/:-OP/' | sed 's/ .*$//'

295919:-OP:5.6
296046:-OP:5.6
296201:-OP:5.6
296356:-OP:5.6
...
... a few hundred appearances of 5.6, common spaced by 155 lines 
...
446529:-OP:5.6
446684:-OP:5.6
446727:-OP:5.6
446882:-OP:5.6
447191:-OP:5.6

That was the entire set of “5.6 Apply undo” records – a short burst of activity which finishes just before the big gap in the 11.19 opcodes . This leads to the question: “does anything difference happen around the time that the 5.6 op codes start to appear?”  Time to look at a very small extract from a very big grep output:

grep -n "OP" or32_ora_14652.trc | sed 's/ CON_ID.*OP/:-OP/' | sed 's/ ENC.*$//' 
..
294895:CHANGE #1:-OP:5.1
295048:CHANGE #2:-OP:11.19

295197:CHANGE #1:-OP:5.1
295322:CHANGE #2:-OP:11.19

295443:CHANGE #1:-OP:5.1
295596:CHANGE #2:-OP:11.19

295745:CHANGE #1:-OP:22.2

295753:CHANGE #1:-OP:22.5

295760:CHANGE #1:-OP:14.4
295766:CHANGE #2:-OP:22.2

295773:CHANGE #1:-OP:11.19
295919:CHANGE #2:-OP:5.6

295928:CHANGE #1:-OP:11.19
296046:CHANGE #2:-OP:5.6

296055:CHANGE #1:-OP:11.19
296201:CHANGE #2:-OP:5.6

...

I’ve inserted blank lines between redo records in this tiny extract from around the point where the 5.6 opcodes appear. The first few records show Oracle doing the update, then there’s a little bit of space management (Op codes in the 22 and 14 layer) and then Oracle starts rolling back. (Note: whether updating or rolling back Oracle is applying “11.19 table array updates”, but the “5.6 undo applied” appear as change vector #2 in the redo record as we rollback, while the “5.1 create undo” appear as change vector #1 as we make forward changes).

If we count the number of 11.19 opcodes up to the point where the 11.4 opcodes appear we find that we’ve used 1,002 to update the table, then another 1,002 to rollback the update.

Summary

Without going into very fine detail examining individual redo change vectors, what we’ve seen is as follows:

  • On executing the very large update the session activity statistics show the number of rows accessed by tablescan was more than twice the expected number. This obvious oddity made me look more closely and notice the surprising “rollback changes – undo records applied” which just shouldn’t have been there at all, and this led me to more closely at the number of redo entries which averaged more than one per row when it should have been a much smaller number due to Oracle’s internal array update mechanism.
  • Extracting just the redo op codes from a dump of the redo log generated during this update we’ve extracted summary information that the work we’ve done includes: (a) locking every single row in the table separately, (b) performing a load of table array updates, (c) applying a significant number of undo records to rollback changes.
  • Digging in for just a little more detail we see (in order of activity):
    • A thousand array updates  (roughly 20,000 rows by tablescan – I did actually check a few redo records in detail, 20 rows was a common array size)
    • The thousand updates being rolled back (luckily a long time before we got near the end of the first tablescan)
    • 250,000 rows locked individually (250,000 rows by tablescan – the 2nd tablescan) (NB each LKR – lock row – undo record is 68 bytes of undo)
    • 13,000 array updates   (250,000 rows by tablescan – the third tablescan) – at roughly 100 bytes per row of “overhead” in the undo

I don’t know why this happens – and from memory I think it started happening in 10g. I have an idea it relates to changes in the global SCN as the update takes place and something (possibly related to space management updates) making the update think it has to go into “write consistent” mode as its “start SCN” gets too far away from the current SCN.

The bottom line on this, though, is that there may be cases where a large, tablescan-driven, update could actually do the tablescan three times, and generate far more undo and redo than it should. In theory I could have completed the update in about 13,000 pairs of (11.19, 5.1) change vectors. In practice this simple test produced far more work.

And it gets worse …

Update – prompted by a tweet from Tanel Poder

In one of my tests I had added the /+ monitor */ hint to see if the activity would be reported correctly. It didn’t look quite right – but what I hadn’t noticed was that the monitor reported the statement as executing 3 times (and, of course, the tablescan executing 3 times):

Global Information
------------------------------
 Status              :  DONE
 Instance ID         :  1
 Session             :  TEST_USER (14:13842)
 SQL ID              :  9a166bpd7455j
 SQL Execution ID    :  16777216
 Execution Started   :  09/10/2019 09:41:54
 First Refresh Time  :  09/10/2019 09:41:54
 Last Refresh Time   :  09/10/2019 09:42:00
 Duration            :  6s
 Module/Action       :  MyModule/MyAction
 Service             :  orcl
 Program             :  sqlplus@vbgeneric (TNS V1-V3)

Global Stats
===================================================================
| Elapsed |   Cpu   |    IO    |  Other   | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Reqs | Bytes |
===================================================================
|    5.84 |    5.77 |     0.01 |     0.06 |     1M |   42 |  26MB |
===================================================================

SQL Plan Monitoring Details (Plan Hash Value=2927627013)
=========================================================================================================================================
| Id |      Operation       | Name |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  | Activity | Activity Detail |
|    |                      |      | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes |   (%)    |   (# samples)   |
=========================================================================================================================================
|  0 | UPDATE STATEMENT     |      |         |      |         4 |     +3 |     3 |        0 |      |       |          |                 |
|  1 |   UPDATE             | T1   |         |      |         6 |     +1 |     3 |        0 |      |       |   100.00 | Cpu (6)         |
|  2 |    TABLE ACCESS FULL | T1   |    250K |  432 |         4 |     +3 |     3 |     698K |   42 |  26MB |          |                 |
=========================================================================================================================================

As you see from Tanel’s tweet, he has a little script you can use to scan through the library cache to see if any (big) updates have suffered from this problem.

 

Addendum

I couldn’t quite manage to leave this problem alone, so I’ve run the test and dumped the redo log three more times and there’s an interesting coincidence. Every time I ran the test the switch from updating to rolling back had these 4 (annotated) redo opcodes separating them:


158766:CHANGE #1:-OP:22.2        ktfbhredo - File Space Header Redo
158774:CHANGE #1:-OP:22.5        ktfbbredo - File BitMap Block Redo:
158781:CHANGE #1:-OP:14.4        kteop redo - redo operation on extent map
158787:CHANGE #2:-OP:22.2        ktfbhredo - File Space Header Redo:

The group represents some allocation of space from a file to an object – so the question is: what object gets an extent map update. Here’s the full redo vchange vector for OP:14.4


REDO RECORD - Thread:1 RBA: 0x001cdd.000028a4.00d8 LEN: 0x00b0 VLD: 0x01 CON_UID: 0
SCN: 0x0b86.0db28b4c SUBSCN: 88 09/10/2019 15:06:13
CHANGE #1 CON_ID:0 TYP:0 CLS:29 AFN:2 DBA:0x00800688 OBJ:4294967295 SCN:0x0b86.0db28b4c SEQ:10 OP:14.4 ENC:0 RBL:0 FLG:0x0000
kteop redo - redo operation on extent map
   ADD: dba:0x802380 len:128 at offset:6
  ADDRET: offset:6 ctime:0
   SETSTAT: exts:9 blks:911 lastmap:0x0 mapcnt:0
   UPDXNT: extent:6 add:TRUE

The “ADD” is adding extent id 6, of 128 blocks, at file 2 block 9088. That happens to be the undo segment which my update statement is using. Further 14.4 change vectors that appear through the trace file take the size of the undo segment up to 35 extents, but in the (small number of ) tests I did it was always first 14.4 that seemed to trigger the rollback and restart.

Preliminary Hypothesis

When the transaction does sufficient work that it forces its undo segment to extend some feature of the resulting recursive transaction causes the transaction to rollback and restart.

This does fit with the observation that the number of rows updated before the rollback/restart occurs varies apparently at random. It also fits the observation that an update that doesn’t rollback and restart in 12.1 does rollback and restart in 12.2 because it gets a bit unlcky using the “single row locking” optimisation for “nochange update” optimisation.

September 8, 2019

Quiz Night

Filed under: 12c,Infrastructure,Oracle,redo — Jonathan Lewis @ 10:15 pm BST Sep 8,2019

Upgrades cause surprises – here’s a pair of results from a model that I constructed more than 15 years ago, and ran today on 12.2, then modified and ran again, then ran on 11.2.0.4, then on 12.1.0.2. It’s very simple, I just create a table, gather stats, then update every row.

rem
rem     Script:         update_nochange.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2019
rem
rem     Last tested 
rem             12.2.0.1 

create table t1
as
with generator as (
        select
                rownum id 
        from dual 
        connect by 
                rownum <= 1e4  -- > comment to avoid wordpress format issue
)
select
        rownum                          id,
        lpad(rownum,10,'x')             small_vc,
--      lpad(rownum,10,'0')             small_vc,
        'Y'                             flag
from
        generator       v1
where   rownum <= 1e3   -- > comment to avoid wordpress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

execute snap_my_stats.start_snap

update t1 set small_vc = upper(small_vc);

execute snap_my_stats.end_snap

The calls to package snap_my_stats are my little routines to calculate the change in the session activity stats (v$sesstat, joined to v$statname) due to the update. Here are a a few of the results for the test using the code as it stands:


Name                                    Value
----                                    -----
redo entries                               36
redo size                             111,756
undo change vector size                53,220

You’ll notice, however, that the CTAS has an option commented out to create the small_vc column using lpad(rownum,‘0’) rather than lpad(rownum,‘x’). This is what the redo stats look like if I use ‘0’ instead of ‘x’:


Name                                    Value
----                                    -----
redo entries                              909
redo size                             223,476
undo change vector size                68,256

What – they’re different ?!  (and it’s reproducible).

Running the test on 12.1.0.2 or 11.2.0.4, both variants of the code produce the smaller number of redo entries (and bytes) and undo – it’s only 12.2.0.1 that shows a change. (I haven’t got around to testing 18.3 and 19.3 yet.)

Tonight’s quiz:

Figure out what’s happening in 12.2.0.1 to give two different sets of undo and redo figures.

If that problem is too easy – extrapolate the test to more complex cases to see when the difference stops appearing, and see if you can find any cases where this new feature might cause existing applications to break.

I’ll supply the basic answer in 48 hours.

Update (a few hours early)

The question has been answered in the comments – it’s an optimisation introduced in 12.2 that attempts to reduce the amount of undo and redo by minimising the work done for “no change” updates to data.  In principle – but we don’t yet know the rules and limitations – if an update does not change the column values Oracle 12.2 will not “save the old values in an undo record and log the new values in a redo change vector”, it will simply lock the row, to produce a minimal redo change vector.

Unfortunately Oracle goes into “single row” mode to lock rows, while it can do “block-level – i.e. multi-row/array” processing if it uses the “change” mechanism.  Inevitably there are likely to be cases where the 12.2 optimisation actually produces a worse result in terms of volume of redo, or contention for redo latches.

If we modify the code to dump the redo generated by the two different updates we can see more clearly what Oracle is doing:

alter session set tracefile_identifier = 'UPD';

column start_scn new_value m_start_scn
select to_char(current_scn,'999999999999999999999999') start_scn from v$database;

update t1 set small_vc = upper(small_vc);
commit;

column end_scn new_value m_end_scn
select to_char(current_scn,'999999999999999999999999') end_scn from v$database;

alter system dump redo scn min &m_start_scn scn max &m_end_scn;

Then, after running the test we can dump the list of redo op codes from the trace file:

First when we do the “no-change” update (with lots of repetitions deleted):

grep -n OP orcl12c_ora_21999_UPD.trc | sed 's/CON_ID.*OP/ OP/' | sed 's/ ENC.*$//'

129:CHANGE #2 MEDIA RECOVERY MARKER  OP:17.28
138:CHANGE #1  OP:11.4
147:CHANGE #2  OP:5.2
150:CHANGE #3  OP:11.4
159:CHANGE #4  OP:11.4
168:CHANGE #5  OP:11.4
177:CHANGE #6  OP:11.4
...
2458:CHANGE #189  OP:5.1
2474:CHANGE #190  OP:5.1
2490:CHANGE #191  OP:5.1
2506:CHANGE #192  OP:5.1
2525:CHANGE #1  OP:5.1
2541:CHANGE #2  OP:11.4
2553:CHANGE #1  OP:5.1
2569:CHANGE #2  OP:11.4
...
27833:CHANGE #1  OP:5.1
27849:CHANGE #2  OP:11.4
27861:CHANGE #1  OP:5.1
27877:CHANGE #2  OP:11.4
27889:CHANGE #1  OP:5.4

The dump starts with a large redo record (192 change vectors) that started life in the private redo buffer, and then switch to the standard “paired change vectors” in the public redo buffer. The 11.4 vectors are “lock row piece” while the 5.1 vectors are the “generate undo”. Counting the 11.4 and 5.1 lines there are exactly 1,000 of each – every row has been individually locked.

Now for the “real change” update:

grep -n OP orcl12c_ora_22255_UPD.trc | sed 's/CON_ID.*OP/ OP/' | sed 's/ ENC.*$//'

126:CHANGE #2 MEDIA RECOVERY MARKER  OP:17.28
135:CHANGE #1  OP:11.19
281:CHANGE #2  OP:5.2
284:CHANGE #3  OP:11.19
430:CHANGE #4  OP:11.19
576:CHANGE #5  OP:11.19
...
5469:CHANGE #41  OP:5.1
5573:CHANGE #42  OP:5.1
5726:CHANGE #43  OP:5.1
5879:CHANGE #44  OP:5.1
6035:CHANGE #1  OP:5.1
6188:CHANGE #2  OP:11.19
6337:CHANGE #1  OP:5.1
6490:CHANGE #2  OP:11.19
...
15029:CHANGE #2  OP:11.19
15101:CHANGE #1  OP:5.1
15177:CHANGE #2  OP:11.19
15249:CHANGE #1  OP:5.4

It’s a much smaller trace file (ca. 15,249 lines compared to ca. 27889 lines), and the table change vectors are 11.19 (Table array update) rather than 11.4 (table lock row piece). Counting the op codes we get 52 of each of the 11.19 and 5.1. If we want a little more information about those vectors we can do the following:


egrep -n -e "OP:" -e "Array Update" orcl12c_ora_22255_UPD.trc | sed 's/CON_ID.*OP/ OP/' | sed 's/ ENC.*$//' 

126:CHANGE #2 MEDIA RECOVERY MARKER  OP:17.28
135:CHANGE #1  OP:11.19
140:Array Update of 20 rows:
281:CHANGE #2  OP:5.2
284:CHANGE #3  OP:11.19
289:Array Update of 20 rows:
430:CHANGE #4  OP:11.19
435:Array Update of 20 rows:
576:CHANGE #5  OP:11.19
581:Array Update of 20 rows:
...
5469:CHANGE #41  OP:5.1
5481:Array Update of 13 rows:
5573:CHANGE #42  OP:5.1
5585:Array Update of 20 rows:
5726:CHANGE #43  OP:5.1
5738:Array Update of 20 rows:
5879:CHANGE #44  OP:5.1
5891:Array Update of 20 rows:
6035:CHANGE #1  OP:5.1
6047:Array Update of 20 rows:
6188:CHANGE #2  OP:11.19
6193:Array Update of 20 rows:
6337:CHANGE #1  OP:5.1
6349:Array Update of 20 rows:
...
14953:CHANGE #1  OP:5.1
14965:Array Update of 9 rows:
15029:CHANGE #2  OP:11.19
15034:Array Update of 9 rows:
15101:CHANGE #1  OP:5.1
15113:Array Update of 9 rows:
15177:CHANGE #2  OP:11.19
15182:Array Update of 9 rows:
15249:CHANGE #1  OP:5.4

As you can see, the 11.19 (table change) and 5.1 (undo) change vectors both report that they are are structured as array updates. In most cases the array size is 20 rows, but there are a few cases where the array size is smaller. In this test I found one update with an array size of 13 rows and three updates with an array size of 9 rows.

Summary

Oracle has introduced an optimisation for “no change” updates in 12.2 that tries to avoid generating quite so much undo and redo; however this may result in some cases where an “array update” change vector turns into several “single row lock” change vectors, so when you upgrade to 12.2 (or beyone) you may want to check any large update mechanism you run to see if your system has benefited or been penalised to any significant extent by this change. The key indicator will be an increase in the value of the session/system stats “redo entries” and “redo size”.

August 21, 2019

sql_patch

Filed under: Infrastructure,Oracle,Tuning — Jonathan Lewis @ 4:49 pm BST Aug 21,2019

This note is a short follow-up to a note I wrote some time ago about validating foreign key constraints where I examined the type of SQL Oracle generates internally to do the validation between parent and child tables.  In that article I suggested (before testing) that you could create an SQL patch for the generated SQL to over-ride the plan taken by Oracle – a plan dictated to some extent by hints (including a “deprecated” ordered hint) embedded in the code. I did say that the strategy might not work for SQL optimised by SYS, but it turned out that it did.

Here’s a little script I ran to test a few variations on the theme:


declare
        v1      varchar2(128);
begin
        v1 :=   dbms_sqldiag.create_sql_patch(
                        sql_id  => 'g2z10tbxyz6b0',
                        name    => 'validate_fk',
                        hint_text => 'ignore_optim_embedded_hints'
--                      hint_text => 'parallel(a@sel$1 8)'      -- worked
--                      hint_text => 'parallel(8)'              -- worked
--                      hint_text => q'{opt_param('_fast_full_scan_enabled' 'false')}'  -- worked
                );
        dbms_output.put_line(v1);
end;
/

I’ve tested this on 12.2.0.1 and 19.3.0.0, but for earlier versions of Oracle, and depending what patches you’ve applied, you will need to modify the code.

The SQL_ID represents the query for my specific tables, of course, so you will have to do a test run to find the query and SQL_ID for the validation you want to do. This is what the statement for my parent/child pair looked like (cosmetically adjusted):

select /*+ all_rows ordered dynamic_sampling(2) */ 
        A.rowid, :1, :2, :3
from
        "TEST_USER"."CHILD" A , 
        "TEST_USER"."PARENT" B 
where
        ("A"."OBJECT_ID" is not null) 
and     ("B"."OBJECT_ID"(+) = "A"."OBJECT_ID")
and     ("B"."OBJECT_ID" is null)

The patch that the script creates simply tells Oracle to ignore the embedded hints (in particular I don’t want that ordered hint), but I’ve left a few other options in the text, commenting them out.

Without the patch I got the following plan:.


Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  NESTED LOOPS ANTI (cr=399 pr=279 pw=0 time=47801 us starts=1 cost=70 size=22000 card=1000)
    100000     100000     100000   INDEX FAST FULL SCAN CHI_FK_PAR (cr=250 pr=247 pw=0 time=19943 us starts=1 cost=32 size=1700000 card=100000)(object id 73191)
     10000      10000      10000   INDEX UNIQUE SCAN PAR_PK (cr=149 pr=32 pw=0 time=3968 us starts=10000 cost=0 size=49995 card=9999)(object id 73189)

Rerunning the validation test with the patch in place I got the following plan – clearly the patch had had an effect.

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  HASH JOIN RIGHT ANTI (cr=246 pr=242 pw=0 time=96212 us starts=1 cost=39 size=22000 card=1000)
     10000      10000      10000   INDEX FAST FULL SCAN PAR_PK (cr=24 pr=23 pw=0 time=1599 us starts=1 cost=4 size=50000 card=10000)(object id 73235)
    100000     100000     100000   INDEX FAST FULL SCAN CHI_FK_PAR (cr=222 pr=219 pw=0 time=27553 us starts=1 cost=32 size=1700000 card=100000)(object id 73237)
(object id 73229)

Don’t worry too much about the fact that in my tiny example, and with a very new, nicely structured, data set the original plan was a little faster. In a production environment creating a hash table from the parent keys and probing it with the child keys may reduce the CPU usage and random I/O quite dramatically.

Bear in mind that the best possible plan may depend on many factors, such as the number of child rows per parent, the degree to which the parent and child keys arrive in sorted (or random) order, and then you have to remember that Oracle gets a little clever with the original anti-join (note that there are only 10,000 probes for 100,000 child rows – there’s an effect similar to the scalar subquery caching going on there), so trying to patch the plan the same way for every parent/child pair may not be the best strategy.

If you want to drop the patch after playing around with this example a call to execute dbms_sqldiag.drop_sql_patch(name=>’validate_fk’) will suffice.

 

August 20, 2019

Join View

Filed under: constraints,Infrastructure,Joins,Oracle — Jonathan Lewis @ 12:39 pm BST Aug 20,2019

It’s strange how one thing leads to another when you’re trying to check some silly little detail. This morning I wanted to find a note I’d written about the merge command and “stable sets”, and got to a draft about updatable join views that I’d started in 2016 in response to a question on OTN (as it was at the time) and finally led to a model that I’d written in 2008 showing that the manuals were wrong.

Since the manual – even the 19c manual – is still wrong regarding the “Delete Rule” for updatable (modifiable) join views I thought I’d quickly finish off the draft and post the 2008 script. Here’s what the manual says about deleting from join views (my emphasis on “exactly”):

Rows from a join view can be deleted as long as there is exactly one key-preserved table in the join. The key preserved table can be repeated in the FROM clause. If the view is defined with the WITH CHECK OPTION clause and the key preserved table is repeated, then the rows cannot be deleted from the view.

But here’s a simple piece of code to model a delete from a join view that breaks the rule:


rem
rem     Script:         delete_join.sql 
rem     Dated:          Dec 2008
rem     Author:         J P Lewis
rem

create table source
as
select level n1
from dual
connect by level <= 10
/ 
 
create table search
as
select level n1
from dual
connect by level <= 10
/ 

alter table source modify n1 not null;
alter table search modify n1 not null;

create unique index search_idx on search(n1);
-- create unique index source_idx on source(n1)

I’ve set up a “source” and a “search” table with 10 rows each and the option for creating unique indexes on each table for a column that’s declared non-null. Initially, though, I’ve only created the index on search to see what happens when I run a couple of “join view” deletes using “ANSI” syntax.

prompt  ===============================
prompt  Source referenced first in ANSI
prompt  ===============================

delete from (select * from source s join search s1 on s.n1 = s1.n1);
select count(1) source_count from source;
select count(1) search_count from search;
rollback;
 
prompt  ===============================
prompt  Search referenced first in ANSI
prompt  ===============================

delete from (select * from search s join source s1 on s.n1 = s1.n1);
select count(1) source_count from source;
select count(1) search_count from search;
rollback;

With just one of the two unique indexes in place the order of the tables in the inline view makes no difference to the outcome. Thanks to the unique index on search any row in the inline view corresponds to exactly one row in the source table, while a single row in the search table could end up appearing in many rows in the view – so the delete implictly has to operate as “delete from source”. So both deletes will result in the source_count being zero, and the search_count remaining at 10.

If we now repeat the experiment but create BOTH unique indexes, both source and search will be key-preserved in the join. According to the manual the delete should produce some sort of error. In fact the delete works in both cases – but the order that the tables appear makes a difference. When source is the first table in the in-line view the source_count drops to zero and the search_count stays at 10; when search is the first table in the in-line view the search_count drops to zero and the source_count stays at 10.

I wouldn’t call this totally unreasonable – but it’s something you need to know if you’re going to use the method, and something you need to document very carefully in case someone editing your code at a later date (or deciding that they could add a unique index) doesn’t realise the significance of the table order.

This does lead on to another important test – is it the order that the tables appear in the from clause that matters, or the order they appear in the join order that Oracle uses to optimise the query. (We hope – and expect – that it’s the join order as written, not the join order as optimised, otherwise the effect of the delete could change from day to day as the optimizer chose different execution plans!). To confirm my expectation I switched to traditional Oracle syntax with hints (still with unique indexes on both tables), writing a query with search as the first table in the from clause, but hinting the inline view to vary the optimised join order.


prompt  ============================================
prompt  Source hinted as leading table in join order 
prompt  ============================================

delete from (
        select 
                /*+ leading(s1, s) */
                * 
        from 
                search s,
                source s1 
        where
                s.n1 = s1.n1
        )
;

select count(1) source_count from source; 
select count(1) search_count from search;
rollback;

prompt  ============================================
prompt  Search hinted as leading table in join order 
prompt  ============================================

delete from (
        select 
                /*+ leading(s, s1) */
                * 
        from 
                search s,
                source s1 
        where
                s.n1 = s1.n1
        )
;

select count(1) source_count from source; 
select count(1) search_count from search;
rollback;

In both cases the rows were deleted from search (the first table in from clause). And, to answer the question you should be asking, I did check the execution plans to make sure that the hints had been effective:


============================================
Source hinted as leading table in join order
============================================

------------------------------------------------------------------
| Id  | Operation           | Name       | Rows  | Bytes | Cost  |
------------------------------------------------------------------
|   0 | DELETE STATEMENT    |            |    10 |    60 |     1 |
|   1 |  DELETE             | SEARCH     |       |       |       |
|   2 |   NESTED LOOPS      |            |    10 |    60 |     1 |
|   3 |    INDEX FULL SCAN  | SOURCE_IDX |    10 |    30 |     1 |
|*  4 |    INDEX UNIQUE SCAN| SEARCH_IDX |     1 |     3 |       |
------------------------------------------------------------------

============================================
Search hinted as leading table in join order
============================================

------------------------------------------------------------------
| Id  | Operation           | Name       | Rows  | Bytes | Cost  |
------------------------------------------------------------------
|   0 | DELETE STATEMENT    |            |    10 |    60 |     1 |
|   1 |  DELETE             | SEARCH     |       |       |       |
|   2 |   NESTED LOOPS      |            |    10 |    60 |     1 |
|   3 |    INDEX FULL SCAN  | SEARCH_IDX |    10 |    30 |     1 |
|*  4 |    INDEX UNIQUE SCAN| SOURCE_IDX |     1 |     3 |       |
------------------------------------------------------------------

Summary

Using updatable join views to handle deletes can be very efficient but the manual’s statement of the “Delete Rule” is incorrect. It is possible to have several key-preserved tables in the view that you’re using, and if that’s the case you need to play safe and ensure that the table you want to delete from is the first table in the from clause. This means taking steps to eliminate the risk of someone editing some code at a later date without realising the importance of the table order.

Update (very shortly after publication)

Iduith Mentzel has pointed out in comment #1 below that the SQL Language Reference Guide and the DBA Administration Guide are not consistent in their descriptions of deleting from a join view, and that the SQL Language Reference Guide correctly states that the delete will be applied to the first mentioned key-preserved table.

 

 

August 6, 2019

Parse Solution

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 1:14 pm BST Aug 6,2019

In the “Parse Puzzle” I posted a couple of days ago I showed a couple of extracts from an AWR report that showed contradictory results about the time the instance spent in parsing and hard parsing, and also showed an amazing factor of 4 difference between the DB Time and the “SQL ordered by Elapsed Time”. My example was modelling a real world anomaly I had come across, but was engineered to exaggerate the effect to make it easy to see what was going on.

The key feature was VPD (virtual private database) a.k.a. FGAC (find grained access control) or RLS (row-level security). I’ve created a “policy function” (the thing that generates the “security predicate”) to execute an extremely expensive SQL query; then I’ve created a policy with policy_type = ‘DYNAMIC’ so that the function gets executed every time a query against a particular table is executed. In fact my example holds three tables, and each table has its own policy function, and each policy function calls the same very expensive piece of SQL.

To see the effect this has on the AWR report I’ll now supply the contents of the “SQL ordered by Elapsed Time” and work through the list (though not in the order shown) explaining what each statement represents:


SQL ordered by Elapsed Time               DB/Inst: OR18/or18  Snaps: 2059-2060
-> Resources reported for PL/SQL code includes the resources used by all SQL
   statements called by the code.
-> % Total DB Time is the Elapsed Time of the SQL statement divided
   into the Total Database Time multiplied by 100
-> %Total - Elapsed Time  as a percentage of Total DB time
-> %CPU   - CPU Time      as a percentage of Elapsed Time
-> %IO    - User I/O Time as a percentage of Elapsed Time
-> Captured SQL account for  302.6% of Total DB Time (s):             158
-> Captured PL/SQL account for  101.7% of Total DB Time (s):             158

        Elapsed                  Elapsed Time
        Time (s)    Executions  per Exec (s)  %Total   %CPU    %IO    SQL Id
---------------- -------------- ------------- ------ ------ ------ -------------
           156.8              1        156.84   99.4   99.8     .0 1ubpdythth4q1
Module: SQL*Plus
select id, f_rls(n1, n2, n3) from fgac_base where rownum .le. 10                   -- edited to avoid WP  format issue

           156.8             33          4.75   99.3   99.8     .0 9dhvggqtk2mxh
Module: SQL*Plus
 select count(*) X from waste_cpu connect by n .gt. prior n start with n = 1       -- edited to avoid WP  format issue

            53.2             10          5.32   33.7   99.8     .0 5531kmrvrzcxy
Module: SQL*Plus
SELECT COUNT(*) FROM FGAC_REF3 WHERE ID = :B1

            53.2             11          4.83   33.7   99.8     .0 8g2uv26waqm8g
Module: SQL*Plus
begin :con := "FGAC_PACK"."FGAC_PREDICATE3"(:sn, :on); end;

            52.7             10          5.27   33.4   99.7     .0 awk070fhzd4vs
Module: SQL*Plus
SELECT COUNT(*) FROM FGAC_REF1 WHERE ID = :B1

            52.7             11          4.79   33.4   99.7     .0 c8pwn9j11gw5s
Module: SQL*Plus
begin :con := "FGAC_PACK"."FGAC_PREDICATE1"(:sn, :on); end;

            50.9             10          5.09   32.3   99.9     .0 964u0zv0rwpw1
Module: SQL*Plus
SELECT COUNT(*) FROM FGAC_REF2 WHERE ID = :B1

            50.9             11          4.63   32.3   99.9     .0 bgqf405f34u4v
Module: SQL*Plus
begin :con := "FGAC_PACK"."FGAC_PREDICATE2"(:sn, :on); end;

             2.8              1          2.79    1.8   98.4     .0 fhf8upax5cxsz
BEGIN sys.dbms_auto_report_internal.i_save_report (:rep_ref, :snap_id, :pr_class
, :rep_id, :suc); END;

             2.6              1          2.64    1.7   98.4     .0 0w26sk6t6gq98
SELECT XMLTYPE(DBMS_REPORT.GET_REPORT_WITH_SUMMARY(:B1 )) FROM DUAL

             2.4              1          2.43    1.5   98.3     .0 1q1spprb9m55h
WITH MONITOR_DATA AS (SELECT INST_ID, KEY, NVL2(PX_QCSID, NULL, STATUS) STATUS,
FIRST_REFRESH_TIME, LAST_REFRESH_TIME, REFRESH_COUNT, PROCESS_NAME, SID, SQL_ID,
 SQL_EXEC_START, SQL_EXEC_ID, DBOP_NAME, DBOP_EXEC_ID, SQL_PLAN_HASH_VALUE, SQL_
FULL_PLAN_HASH_VALUE, SESSION_SERIAL#, SQL_TEXT, IS_FULL_SQLTEXT, PX_SERVER#, PX

                          ------------------------------------------------------

The first statement is an SQL statement that calls a PL/SQL function f_rls() for 10 consecutive rows in an “ordinary table”. This is the query that actuallly takes 157 seconds to complete as far as the client SQL Plus session is concerned.

The function (called 10 times) passed in three values n1, n2, n3. The function uses n1 to query table FGAC_REF1, n2 to query table FGAC_REF2, and n3 to query FGAC_REF3 – and we can see those three queries appearing as statements 5, 7, and 3 (in that order) in the output. The main query takes 157 seconds to complete because each of those statements takes approximately 52 seconds to complete 10 executions each.

But each of those three statements references a table with a policy that causes a predicate function to be executed for each parse and execute of the statement (one parse, 10 executes) and we can see 11 executions each of calls to fgac_pack.fgac_predicateN (N in 1,2,3), which take roughly 4.8 seconds each, for a total of about 52 seconds across 11 executions.

But those calls to the functions (11 each) all execute the same “connect by” query that appears as statement 2 in the list – for a total off 33 calls of a SQL statement that averages 4.75 seconds – and almost all of the “real” database time is in that 33 calls (33 * 4.75 = 156.75).

So we count 157 seconds because that’s the time spent in the “connect by” queries”, but we count that time again (but under PL/SQL execution) because it’s called from the policy functions, then count it again (under SQL execution) because the functions are called by the “select count(*) from fgac_refN” queries, then count it one final time (under SQL execution) for the driving query. This gives us a total 300% of the actual database time apparently being spent in SQL and 100% apparently being spent in PL/SQL.

You’ll notice that “real” optimisation of the SQL that does run would have taken just fractions of a second (as we saw in the Instance Activity Statistics); but one execution of the “connect by” query would have been associated with the first parse call of each of the fgac_refN queries, so 15 seconds of policy function / connect by query time would have been accounted under the parse time elapsed / hard parse time elapsed we saw in the Time Model statistics.

One of the strangest things about the reporting showed up in the ASH figures – which only ever managed to sample three SQL_IDs, reporting them as “on CPU” in every single sample, and those three SQL_IDs were for the “select count(*) from fgac_refN” queries; the “connect by” queries – the real consumer of CPU resource – didn’t get into the ASH sample at all.

Footnote

I did two things to make an important anomaly very obvious – I added a CPU intensive query to the policy function so that it was easy to see where the time was really going, and I made the VPD policy_type “dynamic” so that the policy function would execute on every parse and execute against the underlying table.

In real life the typcial SQL called within a policy function is very lightweight, and policies tend to be declared with type “context_sensitive”, and this combination minimises the number of calls to the function and the cost of each call. Unfortunately there are 3rd party applications on the market that include complex PL/SQL frameworks in their policy functions and then have the application server reset the session context for every call to the database.

And that’s where I came in – looking at a system where 10% of the CPU was being spent on parsing that apparently couldn’t possibly be happening.

August 4, 2019

Parse Puzzle

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 10:46 am BST Aug 4,2019

Here are some details from an AWR report covering a few minutes in the lifetime of an instance of 18.3. It’s a carefully constructed demonstration and all I’ve done is take an AWR snapshot, execute a single SQL statement, then take another snapshot, so the only thing captured by the report is the work done in that brief time interval. The purpose of the exercise is to demonstrate how some Oracle features can make a complete nonsense of the AWR. (I have, as I often do, produced a model that reproduces an affect that can appear in production but exaggerates the effect to make it more clearly visible.)

First the Time Model statistics:

                                                                % of  % of Total
Statistic Name                                       Time (s) DB Time   CPU Time
------------------------------------------ ------------------ ------- ----------
sql execute elapsed time                                157.8   100.0
DB CPU                                                  157.5    99.8       97.3
parse time elapsed                                       13.6     8.6
hard parse elapsed time                                  13.6     8.6
PL/SQL execution elapsed time                             0.3      .2
PL/SQL compilation elapsed time                           0.0      .0
hard parse (sharing criteria) elapsed time                0.0      .0
repeated bind elapsed time                                0.0      .0
DB time                                                 157.8
background elapsed time                                   7.9
background cpu time                                       4.4                2.7
total CPU time                                          161.9
                          ------------------------------------------------------


Note particularly the parse times – the Time Model show 13.6 seconds spent in (hard) parsing.

Note also that (with a small error) DB time = DB CPU = SQL execute elapsed time, and the background time is very small (in particular it’s less than the parse time). This background time, by the way, is probably related to things that Oracle does behind the scenes when you take an AWR snapshot or run an AWR report.

Given the significant amount of time spent in hard parsing let’s take a look at the Instance Activity statistics – picking only the statistics relating to parse calls:


Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
parse count (describe)                            0            0.0           0.0
parse count (failures)                            0            0.0           0.0
parse count (hard)                              325            2.1         325.0
parse count (total)                           1,662           10.5       1,662.0
parse time cpu                                   39            0.3          39.0
parse time elapsed                               42            0.3          42.0

Although the Time Model thinks Oracle has spent 13.6 seconds in (hard) parse time, the Instance Activity Statistics says it has only spent 0.42 seconds (the statistic is repored in hundredths) That a fairly significant difference of opinion. So let’s see if we can find out more from the “SQL ordered by …”, and I’m only going to show you one heading as a teaser for the rest of the weekend:


SQL ordered by Elapsed Time               DB/Inst: OR18/or18  Snaps: 2059-2060
-> Resources reported for PL/SQL code includes the resources used by all SQL
   statements called by the code.
-> % Total DB Time is the Elapsed Time of the SQL statement divided
   into the Total Database Time multiplied by 100
-> %Total - Elapsed Time  as a percentage of Total DB time
-> %CPU   - CPU Time      as a percentage of Elapsed Time
-> %IO    - User I/O Time as a percentage of Elapsed Time
-> Captured SQL account for  302.6% of Total DB Time (s):             158
-> Captured PL/SQL account for  101.7% of Total DB Time (s):             158

How do you squeeze 400% of the elapsed time into SQL and PL/SQL executions? (Observation: it’s not an IBM P9 taking advantage of SMT/4)

One last set of stats – which will have some room for statistical error since they come from v$active_session_history:


select
        in_parse, in_hard_parse, sql_id, sql_exec_id, count(*)
from
        v$active_session_history  ash
where
        session_id = &m_session_id
and     sample_id between &m_start_sample_id and &m_end_sample_id
group by
        in_parse, in_hard_parse, sql_id, sql_exec_id
order by
        count(*)
/

I I SQL_ID	  SQL_EXEC_ID	COUNT(*)
- - ------------- ----------- ----------
Y Y 964u0zv0rwpw1		       3
Y Y awk070fhzd4vs		       4
Y Y 5531kmrvrzcxy		       5

N N 5531kmrvrzcxy		      42
N N 964u0zv0rwpw1		      42
N N awk070fhzd4vs		      51

6 rows selected.

So the ASH information seems to agree (reasonably closely) with the Time Model statistics – in the interval of the snapshot it’s noted 12 samples (which statistically represents 12 seconds) of hard parse time (and though my query doesn’t show it, the “event” is  null, i.e. “on CPU”).

 

To be continued, some time later this week …

August 1, 2019

LOB reads

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 9:24 pm BST Aug 1,2019

This is a blog note that started life in September 2001 (which was before I started blogging, I think), got drafted as a blog in January 2014 because of a question on the Oracle-L list server, and has finally got to publication following a question on the ODC database forum. (But the comments in the blog don’t look as if they are relevant to the question.)

The question on Oracle-l was:  Why is the same CLOB datablock read multiple times?

The answer is basically: client programs tend to set a “sensible” fetch buffer size for LOBs and if the buffer size is less than the size of the LOB the block(s) the LOB is in have to be visited multiple times and for nocache LOBs that means the block(s) will be read multiple times.

This can be seen quite easily in SQL*Plus which has two helpful parameters (one dating back to at least v5):

set long N
set longchunksize M

The “long” setting tells SQL*Plus how much of the LOB to display on screen, the “longchunksize” tells SQL*Plus how much of the LOB to fetch from the database on each call. The default value is 80 for both settings. So let’s set up some data and do a few simple demonstrations using nothing more subtle than autotrace.


rem
rem     Script:         lob_fetch_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2012
rem     Purpose:        
rem
rem     Last tested 
rem             18.3.0.0
rem             12.1.0.2
rem             11.2.0.2
rem

create table t1 (
        id      number , 
        v1      varchar2(60),
        c1      clob, 
        c2      clob
)
lob(c1) store as (enable storage in row cache)
lob(c2) store as (enable storage in row cache)
;

begin
        for i in 1..1000 loop
                insert into t1 values(i, lpad(i,60), lpad(i,60), empty_clob());
        end loop;
        commit;
end;
/

execute dbms_stats.gather_table_stats(null,'t1')

I’ve got a table with two CLOB columns, both declared to enable storage in row and cache. I’ve also got a varchar2() column, and I’ve loaded one of the CLOB columns and the varchar2() column with a 60 character string, setting the other CLOB column explicitly to the empty_clob() value – which is not the same as setting it to NULL.

Now I’ll do 3 selects, fetching the id column and one other column, showing only the autotrace statistics:


set long 80
set longchunksize 80

set arraysize 5000

set autotrace traceonly statistics

prompt  ==============
prompt  Varchar2 fetch
prompt  ==============

select id, v1 from t1;

prompt  ================
prompt  Small CLOB fetch
prompt  ================

select id, c1 from t1;

prompt  ================
prompt  Empty CLOB fetch
prompt  ================

select id, c2 from t1;

set autotrace off

I’ve explicitly set long and longchunksize to 80 (the defaults) just to make it clear that that’s what they are for this test; and I’ve set the arraysize to 5000 so that I can get the 1,000 rows of data in the smallest number of fetches. Here are the three sets of autotrace statistics:


==============
Varchar2 fetch
==============

1000 rows selected.


Statistics
----------------------------------------------------------
         11  recursive calls
         11  db block gets
         75  consistent gets
          0  physical reads
       2040  redo size
      70545  bytes sent via SQL*Net to client
        624  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       1000  rows processed

================
Small CLOB fetch
================

1000 rows selected.


Statistics
----------------------------------------------------------
          4  recursive calls
          5  db block gets
       2036  consistent gets
          0  physical reads
        992  redo size
     707454  bytes sent via SQL*Net to client
     296624  bytes received via SQL*Net from client
       2002  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       1000  rows processed

================
Enpty CLOB fetch
================

1000 rows selected.


Statistics
----------------------------------------------------------
          3  recursive calls
          5  db block gets
       2036  consistent gets
          0  physical reads
       1048  redo size
     585454  bytes sent via SQL*Net to client
     296624  bytes received via SQL*Net from client
       2002  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       1000  rows processed

Key points to note (ignoring minor details like the recursive calls to parse the statements):

  • The query for the the varchar2() column did two SQL*Net roundtrips and 75 consistent gets.
  • The query for the small, inline, CLOB did 2,002 SQL*Net roundtrips and 2,036 consistent gets
  • The query for the empty CLOB did exactly the same number of SQL*Net roundtrips and consistent gets as for the small CLOB

Because we were selecting a LOB column SQL*Plus switched from array fetches to single row fetches. In the first fetch of each pair of fetches it fetched the non-CLOB columns and the “LOB locator”; then in the second fetch of each pair it fetched the CLOB data – and it did this even when the LOB locator could (in principle) have told it that there was no data to fetch. (If you select both clob columns in the same query the number of SQL*Net roundtrips will go up to 3 per row in this test.)

Remember I said there was a difference between empty_clob() and NULL ? If you update the table to set c2 to null before running the query to select c2 you’ll see only 1,000 (and a few) SQL*Net roundtrips – you still get single row processing because you’re selecting a LOB column but when the NULL  arrives at SQL*Plus the code will know that there is no LOB data to retrieve.

Now try this:


truncate table t1;

begin
        for i in 1..1000 loop
                insert into t1 values(i, lpad(i,60), lpad(i,180), empty_clob());
        end loop;
        commit;
end;
/

execute dbms_stats.gather_table_stats(null,'t1')

set long 170
select id, c1 from t1;

Pause for thought as you work out what the stats will look like
.
.
.
.
.
The longchunksize (still at the default of 80) is now too small to collect all the data that should be displayed in one SQL*Net roundtrip – it’s going to take 3 roundtrips to get 170 bytes, so we might expect to see about 4,000 SQL*Net roundtrips and a similar number of consistent gets. This is what the stats look like:


Statistics
----------------------------------------------------------
          2  recursive calls
          0  db block gets
       4030  consistent gets
          0  physical reads
          0  redo size
    1485454  bytes sent via SQL*Net to client
     866624  bytes received via SQL*Net from client
       4002  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       1000  rows processed

Sure enough we do single row fetches, one SQL*Net roundtrip for each row with its LOB locator, then three more roundtrips for each LOB for a total of 4,000 round trips and (approximately) 4,000 consistent gets.

One last test – let’s change the c1 clob to nocache and disable storage in row then repeat the last experiment where the long setting is larger than the longchunksize setting.


alter table t1 move lob(c1) store as (disable storage in row nocache);
execute dbms_stats.gather_table_stats(null,'t1')
select id, c1 from t1;

Statistics
----------------------------------------------------------
          2  recursive calls
          0  db block gets
       4001  consistent gets
       6000  physical reads
          0  redo size
    1485454  bytes sent via SQL*Net to client
     866624  bytes received via SQL*Net from client
       4002  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       1000  rows processed

Now there’s a surprise!  I was expecting to see 3000 physical reads as each “out of row, nocache” LOB was accessed three times to pick up the three pieces of longchunksize using direct path reads. So why have I got 6,000 physical reads? Time to enable extended tracing (event 10046, level 8) and repeat.

Here’s a sample from the trace file – this trace is from 18.3, but the LOBREAD lines are a feature that appeared some time around 11.2.0.2.

FETCH #140355181475400:c=58,e=57,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=3617692013,tim=5288719901
WAIT #140355181475400: nam='SQL*Net message from client' ela= 78 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288720015

WAIT #0: nam='direct path read' ela= 12 file number=2 first dba=2867 block cnt=1 obj#=132008 tim=5288720198
WAIT #0: nam='direct path read' ela= 8 file number=2 first dba=2867 block cnt=1 obj#=132008 tim=5288720288
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288720326
LOBREAD: type=PERSISTENT LOB,bytes=80,c=247,e=246,p=2,cr=1,cu=0,tim=5288720350

WAIT #0: nam='SQL*Net message from client' ela= 82 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288720483
WAIT #0: nam='direct path read' ela= 12 file number=2 first dba=2867 block cnt=1 obj#=132008 tim=5288720733
WAIT #0: nam='direct path read' ela= 8 file number=2 first dba=2867 block cnt=1 obj#=132008 tim=5288720836
WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288720873
LOBREAD: type=PERSISTENT LOB,bytes=80,c=0,e=261,p=2,cr=1,cu=0,tim=5288720898

WAIT #0: nam='SQL*Net message from client' ela= 121 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288721071
WAIT #0: nam='direct path read' ela= 11 file number=2 first dba=2867 block cnt=1 obj#=132008 tim=5288721216
WAIT #0: nam='direct path read' ela= 8 file number=2 first dba=2867 block cnt=1 obj#=132008 tim=5288721300
WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288721335
LOBREAD: type=PERSISTENT LOB,bytes=20,c=0,e=236,p=2,cr=1,cu=0,tim=5288721359

WAIT #0: nam='SQL*Net message from client' ela= 73 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288721506
WAIT #140355181475400: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288721664
FETCH #140355181475400:c=51,e=51,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=3617692013,tim=5288721695
WAIT #140355181475400: nam='SQL*Net message from client' ela= 74 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288721801

WAIT #0: nam='direct path read' ela= 12 file number=2 first dba=2868 block cnt=1 obj#=132008 tim=5288721939
WAIT #0: nam='direct path read' ela= 47 file number=2 first dba=2868 block cnt=1 obj#=132008 tim=5288722065
WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288722122
LOBREAD: type=PERSISTENT LOB,bytes=80,c=357,e=357,p=2,cr=1,cu=0,tim=5288722204

WAIT #0: nam='SQL*Net message from client' ela= 81 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288722351
WAIT #0: nam='direct path read' ela= 10 file number=2 first dba=2868 block cnt=1 obj#=132008 tim=5288722489
WAIT #0: nam='direct path read' ela= 11 file number=2 first dba=2868 block cnt=1 obj#=132008 tim=5288722671
WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288722720
LOBREAD: type=PERSISTENT LOB,bytes=80,c=0,e=349,p=2,cr=1,cu=0,tim=5288722746

WAIT #0: nam='SQL*Net message from client' ela= 76 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288722874
WAIT #0: nam='direct path read' ela= 11 file number=2 first dba=2868 block cnt=1 obj#=132008 tim=5288723013
WAIT #0: nam='direct path read' ela= 10 file number=2 first dba=2868 block cnt=1 obj#=132008 tim=5288723160
WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=132008 tim=5288723199
LOBREAD: type=PERSISTENT LOB,bytes=20,c=0,e=302,p=2,cr=1,cu=0,tim=5288723224

As you can see, the client seems to FETCH one row then do two direct path reads to read 80 bytes from a LOB, then another two direct path reads for the next 80 bytes, then another two direct path reads for the last 20 bytes. So the 6,000 reads is (strangely) doubling the 3,000 I was expecting to see and the trace file seems to say it’s really happening that way.

Unfortunately I decided to take snapshots of the session stats at this point – and that made things even worse (claiming, for example, 4,000 “consistent gets” but 7,000 “no work consistent gets”), the stats seem to be rather messed up and the whole thing looked very different when I ran this particular test on 12.1.0.2 (which claimed only one direct path read per CLOB rather than the three I was expecting).

Bottom line, though: notwithstanding any oddities of reporting, when your client progam is fetching LOBs from the database, you probably have to put up with:
single row fetches – that get the LOB locator then multiple fetches for each LOB, with the number of LOBREAD calls dependent on the buffer size the client allocates for pulling LOB data from the database.

So when handling LOBs make sure you know how the client can be configured to minimise SQL*Net roundtrips, and check a few simple examples looking at the the trace files, session stats and session events before you start loading data at production volumes.

 

June 11, 2019

Redo Dumps

Filed under: ASSM,Infrastructure,Oracle,redo,Troubleshooting — Jonathan Lewis @ 12:53 pm BST Jun 11,2019

A thread started on the Oracle-L list-server a few days ago asking for help analysing a problem where a simple “insert values()” (that handled millions of rows per day) was running very slowly. There are many reasons why this might happen, ranging from the trivial (someone has locked the table in exclusive mode), through the slightly subtle (we’re trying to insert a row that collides on a uniqueness constraint with an uncommitted insert from another session) to the subtle (Oracle has to read through the undo to check current versions of blocks against read-consistent versions) ending up at the esoteric (the ASSM space management blocks are completely messed up again).

A 10046 trace of a session doing an insert showed only that there was a lot of time spent on single block reads. Unfortunately, since this was on an Exadata system the waits were reported as “cell single block physical read”. Unfortunately the parameters to this wait event are “cellhash#”, “diskhash#”, and “bytes” and we don’t see the file_id, block_id which can be very helpful for a case like this. The only information we got from the trace file was that the object_id was for the table were rows were being inserted.

Before digging into exotic debugging methods, the OP supplied us with a 1-second session report from Tanel Poder’s snapper script:


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  SID @INST, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>   1070  @2, SYSADMIN, STAT, session logical reads                                     ,         13865,      7.73k,         ,             ,          ,           ,      14.1k total buffer visits
>   1070  @2, SYSADMIN, STAT, user I/O wait time                                        ,           141,      78.65,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, non-idle wait time                                        ,           141,      78.65,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, non-idle wait count                                       ,         12230,      6.82k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read total IO requests                           ,          6112,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read requests optimized                          ,          6111,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read total bytes optimized                       ,      50069504,     27.93M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read total bytes                                 ,      50069504,     27.93M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, cell physical IO interconnect bytes                       ,      50069504,     27.93M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, gcs messages sent                                         ,             3,       1.67,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, db block gets                                             ,         13860,      7.73k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, db block gets from cache                                  ,         13860,      7.73k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, db block gets from cache (fastpath)                       ,          7737,      4.32k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, consistent gets                                           ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, consistent gets from cache                                ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, consistent gets pin                                       ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, consistent gets pin (fastpath)                            ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, logical read bytes from cache                             ,     113541120,     63.34M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical reads                                            ,          6111,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical reads cache                                      ,          6111,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, physical read IO requests                                 ,          6112,      3.41k,         ,             ,          ,           ,      8.19k bytes per request
>   1070  @2, SYSADMIN, STAT, physical read bytes                                       ,      50069504,     27.93M,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, db block changes                                          ,            11,       6.14,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, free buffer requested                                     ,          6112,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, hot buffers moved to head of LRU                          ,           958,     534.39,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, free buffer inspected                                     ,          6144,      3.43k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, shared hash latch upgrades - no wait                      ,             7,        3.9,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, blocks decrypted                                          ,          6110,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, redo entries                                              ,          6120,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, redo size                                                 ,        465504,    259.67k,         ,             ,          ,           ,          ~ bytes per user commit
>   1070  @2, SYSADMIN, STAT, redo entries for lost write detection                     ,          6110,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, redo size for lost write detection                        ,        464756,    259.25k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, redo subscn max counts                                    ,             7,        3.9,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, file io wait time                                         ,       1408659,    785.78k,         ,             ,          ,           ,   230.47us bad guess of IO wait time per IO request
>   1070  @2, SYSADMIN, STAT, gc current blocks received                                ,             3,       1.67,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, gc local grants                                           ,          6116,      3.41k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, ASSM cbk:blocks examined                                  ,         12366,       6.9k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, ASSM gsp:L1 bitmaps examined                              ,          2478,      1.38k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, ASSM gsp:L2 bitmaps examined                              ,             1,        .56,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, ASSM gsp:reject db                                        ,         12388,      6.91k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, STAT, buffer is pinned count                                    ,           230,      128.3,         ,             ,          ,           ,       1.63 % buffer gets avoided thanks to buffer pin caching
>   1070  @2, SYSADMIN, STAT, cell flash cache read hits                                ,          6723,      3.75k,         ,             ,          ,           ,          ~ per execution
>   1070  @2, SYSADMIN, TIME, background cpu time                                       ,        365192,   203.71ms,    20.4%, [##        ],          ,           ,
>   1070  @2, SYSADMIN, TIME, background elapsed time                                   ,       1273623,   710.45ms,    71.0%, [########  ],          ,           ,      28.95 % unaccounted time
>   1070  @2, SYSADMIN, WAIT, gc current block busy                                     ,           629,   350.87us,      .0%, [          ],         3,       1.67,   209.67us average wait
>   1070  @2, SYSADMIN, WAIT, cell single block physical read                           ,       1557638,   868.88ms,    86.9%, [WWWWWWWWW ],      6746,      3.76k,    230.9us average wait
>  
> --  End of Stats snap 1, end=2019-05-18 12:58:58, seconds=1.8

My first step was simply to read down the list (using a very small font to get the entire width on screen without wrapping) to see if anything stood out as unusual. The report showed two things I rarely see in the session stats:


blocks decrypted                                          ,          6110
redo entries for lost write detection                     ,          6110

These stats tell me that there are two “uncommon” features enabled: db_lost_write_protect, and block level encryption. (So whatever else is going on it’s just possible that mixing in two rarely used – and therefore less frequently tested – features may be confusing the issue.

Lost write protection means Oracle writes a “block read record” (BRR) to the redo log every time it reads a block from disc, so I decided to follow up the 6,110 figure to see what other stats reported similar values.


physical read total IO requests                           ,          6112
physical read requests optimized                          ,          6111
physical reads                                            ,          6111
physical reads cache                                      ,          6111
physical read IO requests                                 ,          6112
free buffer requested                                     ,          6112
redo entries                                              ,          6120
redo entries for lost write detection                     ,          6110
gc local grants                                           ,          6116
cell flash cache read hits                                ,          6723

There’s nothing particularly surprising here – basically we see all the blocks being read as single block reads, into cache. All the necessary global cache (gc) grants are local so it’s possible the table of interest has been remastered to this node. The value for “cell flash cache read hits” look a little odd as the cache is hit more frequently than blocks are read – but dynamic performance views are not read-consistent and this session is hammering away like crazy so this might just be a side effect of the time to gather the data for the report.

We can chase the redo a little further – the number of redo entries is slightly larger than the number of blocks read, so (even though small inconsistencies are not necessarily meaningful) this might tell us something:


redo entries                                              ,          6120
redo size                                                 ,        465504
redo entries for lost write detection                     ,          6110
redo size for lost write detection                        ,        464756 
db block changes                                          ,            11

The number of “redo entries” that were NOT for lost write detection is 10, totalling 748 bytes (not a lot – so indicative of “non-user” activity). The number of “db block changes” is 11 (close enough to 10), and generally it’s changes to db blocks that require redo to be generated. The final significant number is the one that isn’t there – there’s no undo generated, so no user-change to data. This system is working like crazy achieving absolutely nothing at this point.

The next point to ponder is what sort of work it is doing – so let’s check how the physical reads turn into buffer gets.


session logical reads                                     ,         13865
db block gets                                             ,         13860
db block gets from cache                                  ,         13860
db block gets from cache (fastpath)                       ,          7737
consistent gets                                           ,             1
consistent gets from cache                                ,             1
consistent gets pin                                       ,             1
consistent gets pin (fastpath)                            ,             1
hot buffers moved to head of LRU                          ,           958

buffer is pinned count                                    ,           230 

The unusual thing you notice with these figures is that virtually every buffer get is a current get. We’ve also got a number of blocks pinned – this might just be the segment header block, or the segment header and level 2 bitmap block that we keep revisiting. Finally we can see a lot of hot buffers being moved to the head of the LRU; since our session has been doing a lot of work for a long time it seems likely that those buffers are ones that our session is keeping hot – and for a big insert that shouldn’t really be happening unless, perhaps, we were managing to do a lot of maintenance of (well-clustered) indexes.

I’ve isolated the (new in 12.2) “ASSM gsp (get space)” statistics from this output – they’re all about handling blocks, but I wanted to look at them without being distracted by other stats.


ASSM cbk:blocks examined                                  ,         12366 
ASSM gsp:L1 bitmaps examined                              ,          2478 
ASSM gsp:L2 bitmaps examined                              ,             1 
ASSM gsp:reject db                                        ,         12388

We can see that we’ve examined 2,478 “level 1” bitmap blocks. A level 1 block holds the basic “bitmap” that records the state of a number of data blocks (typically 128 blocks once the object gets very large) so our session has worked its way through 2,478 maps trying to find a data block that it could use to insert a row. The “reject db” statistic tells us about data blocks that have been examined and rejected (presumably because the row we want to insert is too large to fit, or maybe because there are no free ITL (interested transaction list) entries available in the block). So we seem to be spending all our time searching for somewhere to insert rows. This shouldn’t really be happening – it’s a type of problem that Oracle has been worrying away at for quite some time: how do you avoid “losing” space by updating bitmap blocks too soon on inserts without going to the opposite extreme and leaving bitmap blocks that claim the space is free when it’s in use by uncommitted transactions.

Note to self: I don’t know how we managed to reject more blocks (12,388) than we’ve examined (12,366) but possibly it’s just one of those timing glitches (the error is less than one fifth of one percent) , possibly it’s something to do with the reject count including some of the L1 bitmap blocks.

Clearly there’s something funny going on with space management – and we need to look at a few blocks that are exhibiting the problems. But how do we find a few suitable blocks? And that’s where, finally, we get to the title of the piece.

We are in the lucky position of having “lost write protection” enabled – so the redo log file will hold lots of “block read records”. We can’t get the file and block addresses we need from the “cell physical read” wait events so let’s ask the redo log to supply them. We just have to pick a log file (online or archived) and tell Oracle to dump some of it – and we can probably get away with a fairly small dump since we want just a single type of redo record over a short period of time. Here’s an example showing the format of two slightly different commands you could execute:


alter system dump logfile '/u01/app/oracle/oradata/orcl12c/redo03.log'
        rba min 2781    1
        rba max 2781    1000
        layer 23 opcode 2
;

alter system dump redo 
        scn min 19859991 scn max 19964263 
        layer 23 opcode 2
;

The first command is to dump a log file by name – but you may have to fiddle around a bit to find the names of an archived log file because if you choose this option you need to know the sequence number (sequence# in v$log_history) of the file if you want to restrict the size of the dump. The second command simply dumps redo for (in this example) an SCN range – and it’s easy to query v$log_history to find dates, times, and SCN ranges – Oracle will work out for itself which file it has to access. In both cases I’ve restricted the dump to just those redo records that contain change vectors of type BRR (block read records) which is what the layer 23 opcode 2 line is about.

Here’s an example of a redo record that contains nothing but a single BRR. (It’s from a single-block read, a multi-block read would produce a redo record with multiple change vectors, one vector for each block read.)


REDO RECORD - Thread:1 RBA: 0x000add.00000019.01b0 LEN: 0x004c VLD: 0x10 CON_UID: 2846920952
SCN: 0x00000000025a7c13 SUBSCN:  1 05/23/2019 10:42:51
CHANGE #1 CON_ID:3 TYP:2 CLS:6 AFN:9 DBA:0x00407930 OBJ:40 SCN:0x00000000001a1e2a SEQ:2 OP:23.2 ENC:0 RBL:0 FLG:0x0000
 Block Read - afn: 9 rdba: 0x00407930 BFT:(1024,4225328) non-BFT:(1,31024)
              scn: 0x00000000001a1e2a seq: 0x02
              flags: 0x00000006 ( dlog ckval )
              where: qeilwh05: qeilbk

If you’re wondering about the two interpretations of the rdba (relative datablock address), one is for BFTs (big file tablespaces) and one for non-BFTs. The other thing you’ll notice about the interpretations is that neither file number (1024 or 1) matches the afn (absolute file number). In smaller, non-CDB databases you will probably find that the afn matches the file number in the non-BFT interpretation, but I happen to be testing on a PDB and the first file in my SYSTEM tablespace happens to be the 9th file created in the CDB – connecting as SYS in my PDB I can compare the absolute and “relative” file number very easily:


SQL> select file#, rfile#, name from v$datafile;

     FILE#     RFILE# NAME
---------- ---------- ----------------------------------------------------------------
         9          1 /u01/app/oracle/oradata/orcl12c/orcl/system01.dbf
        10          4 /u01/app/oracle/oradata/orcl12c/orcl/sysaux01.dbf
        11          9 /u01/app/oracle/oradata/orcl12c/orcl/undotbs01.dbf
        12         12 /u01/app/oracle/oradata/orcl12c/orcl/users01.dbf
        13         13 /u01/app/oracle/oradata/orcl12c/orcl/APEX_1991375173370654.dbf
        14         14 /u01/app/oracle/oradata/orcl12c/orcl/APEX_1993195660370985.dbf
        22         22 /u01/app/oracle/oradata/orcl12c/orcl/test_8k_assm.dbf
        23         23 /u01/app/oracle/oradata/orcl12c/orcl/test_8k.dbf

8 rows selected.

For bigfile tablespaces the “relative” file number is a complete fake and simply reports 1024 – you’re only allowed one file in a bigfile tablespace, so there is no “relativity” involved. (Unless you’re working at CERN and storing data about particle collisions in the LHC.)

The key point to remember when reading BRRs then, is that you should take the file number from the afn and the block number from the (appropriate) interpretation of the rdba. For the example above I would issue: “alter database dump datafile 9 block 31024;”

Finally

The originator of the thread hasn’t yet made any public response to the suggestion of dumping and reviewing blocks – possibly they’ve started a private conversation with Stefan Koehler who had suggested a strategy that examined function calls rather than block contents – so we’re unable to do any further analysis on what’s going on behind the scenes.

What we would be looking for is any indication that Oracle is repeatedly re-reading the same bitmap blocks and the same data blocks (by a simple check of block addresses); and if that is the case we would want to get some clue about why that might be happening by examining the contents of the data blocks that are subject to repeated reads without changing their status in the bitmap from “space available” to “full”.  As it is we just have to wait for the OP to tell us if they’ve made any further progress.

 

Next Page »

Powered by WordPress.com.