Oracle Scratchpad

October 5, 2020

Direct Path

Filed under: Oracle,Performance,Problem Solving,RAC,Troubleshooting,Tuning — Jonathan Lewis @ 11:29 am BST Oct 5,2020

Here’s a note that I might have written once already – but I can’t find it and I’ve just been reminded about what it (might have) said by a posting that came up on the Oracle database forum in the last few days.

The posting in question is asking why, after setting the hidden parameter _serial_direct_read to ‘always’ a particular query is now taking hours to complete when it used to complete in a minute or so.

The answer is partly “because you’ve forced direct path serial reads”, partly “because you’re running on RAC” and (most directly) because the optimizer is using a really bad execution plan for that query and the direct path reads have had a massive impact as a consequence. (It turns out, after modelling, that the answer might also include “because you’re running”)

I’m going to demonstrate the issue by forcing a very simple query to take a very bad execution plan.

rem     Script:         ko.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem     Purpose:        
rem     Last tested 

create table t1
select  *
from    all_objects
where   rownum <= 10000  -- > comment to avoid wordpress format issue

create table t2
select  *
from    all_objects
where   rownum <= 10000  -- > comment to avoid wordpress format issue

select  table_name, blocks 
from    user_tables
where   table_name in ('T1','T2')

alter system flush buffer_cache;

execute snap_events.start_snap
execute snap_my_stats.start_snap
alter session set "_serial_direct_read"=always;

                leading(t1 t2)
        t1, t2
        t2.object_id = t1.object_id + 0.5

execute snap_my_stats.end_snap
execute snap_events.end_snap

alter session set "_serial_direct_read"=auto;

My query very carefully ensures that it’s not going to return any rows; but it’s going to do a lot of work finding no data because I’ve forced Oracle into doing a tablescan of t2 for every row in t1 – so 10,000 scans of a table of 140 – 190 (depending on version) data blocks.

The snap_my_stats and snap_events packages are a couple of my simple diagnostic packages that allows me to find the change in some v$ content between the start and end snapshots. In this case it’s v$mystat and v$session_event for the session.

On a test using the query ran for about 41 seconds with the following wait events reported:

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
db file sequential read                               2           0           0.11        .054           1
direct path read                                  6,527           0         368.30        .056           6
SQL*Net message to client                            13           0           0.00        .000           0
SQL*Net message from client                          13           0      10,689.26     822.251      10,689

The 6,500 “direct path read” waits corresponded to 1.33M “physical reads direct” reported in the session activity stats. Although the t2 table was fairly small Oracle was forced to use direct path reads for every single cycle through the nested loop. As a quick comparison, here are the figures if I don’t force direct path serial scans.

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
db file sequential read                               2           0           0.09        .047           2
db file scattered read                               34           0           2.45        .072           0
SQL*Net message to client                            11           0           0.01        .001           0
SQL*Net message from client                          11           0         174.36      15.851      82,849

We’ve waited for only 34 “db file scattered reads” and 2.45 centiseconds as we read the t2 (and the t1) tables into the cache for the first time, but then we’ve been able to revisit the blocks in the cache. We also saw a reduction in CPU usage and the total run time dropped from 41 seconds to about 22 seconds.

In this tiny example it hasn’t made a staggering difference to the overall run time, but the OP wasn’t that lucky with his “couple of minutes” to “hours”.

If you look at the fragment of the SQL Monitor report supplied by the OP you’ll see that they have an operation which shows:

| Id    |                  Operation                    |            Name              | Rows    | Cost |  Time     |  Start | Execs |    Rows  | Read | Read  | Cell   | Mem | Activity |            Activity Detail               |
|       |                                               |                              | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload |    |    (%)   |              (# samples)                 |
| -> 25 |       INDEX STORAGE FAST FULL SCAN            | TMRC_IX1                     |      1  |      |     10745 |     +0 |  268K |     112K | 267K |   2GB |  96.53% | 1M |    99.74 | enq: KO - fast object checkpoint (4783)  |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | Cpu (753)                                |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | gcs drm freeze in enter server mode (25) |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | latch free (1)                           |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | reliable message (3505)                  |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | cell smart index scan (1635)             |

The “index fast full scan” is an example of a “segment scan” and will be subject to direct path reads, just like a tablescan. We don’t really know how big this index is – but we can see that we have read it 268K times (Execs column) reading approximately 2GB after 267K read requests. This suggests the index is at most 1MB in size since it can be read in a single read request, and may consist of only one populated block (2,000,000,000/267,000 = 7,490 bytes. Despite this small size the total time sampled for all these scans is more than 10,600 seconds – roughly 39 millisecond per scan. That’s not very good.

Looking more closely at the sampled time we notice 3 key features:

|            Activity Detail               |
|              (# samples)                 |
| enq: KO - fast object checkpoint (4783)  |
| Cpu (753)                                |
| gcs drm freeze in enter server mode (25) |
| latch free (1)                           |
| reliable message (3505)                  |
| cell smart index scan (1635)             |
  • A massive fraction of the time was spent on “enq: KO – fast object checkpoint”
  • A large fraction of the time was spent on “reliable message”
  • There was some time (relatively small, but large in absolute terms) for “gcs drm freeze …”

The last of these three is about global cache services, distributed resource manager” and is a clear indication that the system is running RAC and we have some hefty competition for “object mastering” between instances. But the actual time lost there is relatively small – though associated chatter between instances could be significant.

But what’s the “KO enqueue”? Every time an instance starts a direct path segment scan it has to get a message to the database writer (possibly via the checkpoint process) – hence the “reliable message” waits – to copy every dirty block for that segment from the buffer cache down to disc and it has to wait for the write to complete. This is necessary to ensure that the tablescan doesn’t miss any changes that have been made in memory without yet being written to disc.

The KO enqueue synchronises this activity – I haven’t worked out the complete chain of events, but the enqueue is negotiated between the session and the checkpoint process – and if you’re running RAC every instance has to write any dirty blocks it is holding for the segment, so you have to have a degree of cross-instance chatter to make this happen.

Thanks to the enforced serial direct reads the OP’s plan – which, surely, shouldn’t expect to do 267K index fast full scans – has a massive overhead thanks to the need for the repeated object checkpoints.

You may ask, at this point, why I didn’t see any KO enqueue waits in my test results – the answer is simple, I’d flushed the buffer cache before I started the test, so there were no dirty blocks for the session to worry about. Let’s see what happens if I introduce a little activity to keep dirtying a few blocks in the t2 table. Here’s a little loop that will update a few rows once per second:

        for i in 1..1 loop
                update t2 set data_object_id = 0 where mod(object_id,1000) = i;
        end loop;

You’ll notice the code runs through the loop just once – I started with a loop count of 30, and discovered it wasn’t necessary, but the option remains for further testing.

If I execute this anonymous block (which updates about 10 rows each time through the loop) from another session just after SQL*Plus reports my “alter session”, so that it runs just after the query starts, this is what the session event report looks like if I run the test against (the version reported by the OP):

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: KO - fast object checkpoint                  7,645           0         290.34        .038           6
db file sequential read                               2           0           0.13        .066           2
direct path read                                 10,714           0         675.50        .063           6
SQL*Net message to client                            14           0           0.00        .000           0
SQL*Net message from client                          14           0       1,556.23     111.160     101,653
events in waitclass Other                         5,607           0         218.04        .039           8

Suddenly we see a lot of time spent on the KO enqueue waits and the “events in waitclass Other” (which turn out to be “reliable message” waits). Apparently the session keeps finding dirty t2 blocks in the cache and telling the database writer they need to be written to disc before the next tablescan of t2 can take place.

There’s something odd here, though and I’ll introduce it with this comment: when I repeated the test on 19.3 (even with with the constant trickle of updates once per second), we only see a tiny number of KO enqueues and reliable message waits – the fact that we see a huge number of them in 11g is a defect in the design of the code.

Think about what’s happening with the KO enqueue: when you start the first tablescan of t2 you force every dirty block for that segment to be copied from the cache to the disc.

As the tablescan proceeds you may have to apply some undo change vectors to the blocks you’re reading to take them back to the SCN as at the start of query execution, but you know that any data that had been committed before the query started is on disc (even if it has been over-written by committed changes made after the query started, or by uncommitted changes made before the query started). What’s on the disc right now will be usable to get the correct read-consistent version of the data for the duration of the query run, no matter how many newer changes are made, whether or not they over-write the disc blocks before the query ends. There is no need to force write any dirty blocks as the tablescan is repeated and, it seems, by 19.3 the code has been adjusted to accomodate that fact.


After I had posted this comment on the forum, the OP raised the question of whether or not the fix might apply to 12c as well – so I ran up a VM of and and re-ran the tests. The results were initially promising – neither version reported an extreme number of KO enqueue waits or reliable message waits.

However when I changed the loop counter from 1 back to 30 I found that the waits re-appeared – though the numbers were significantly less than those from 11g – so perhaps there’s a timing element involved that might need further investigation and stress testing even for 19.3.

Footnote 2

Another snapshot I took in testing was from v$enqueue stat – which showed that (approximately) for every KO enqueue wait my session reported, the instance reported about 10 – 12 KO enqueue requests.

When checking v$enqueue_stat it’s important to remember that session activity stats (v$sesstat) report both “enqueue requests” and “enqueue conversions”. In v$enqueue_stat the conversions aren’t reported separately the view simply adds the two figures together under “requests”.

In the 11g test the session reported 7,645 KO enqueue waits, but the session activity stats reported 19,399 enqueue requests and 38,796 enqueue conversions; while v$enqueue_stat reported 96,990 KO enqueue requests. The remaining 38,796 KO enqueue requests were made by the checkpoint process (CKPT) – and it was only my session that repoted any waits for KO enqueue requests.

Without further low-level investigation this is what leads me to believe that the session sends CKPT a message that it wants an object-level checkpoint performed and waits for the message to be acknowledged (reliable message) before trying to convert a low-level KO enqueue to an exclusive one. But CKPT has acquired and converted the same KO enqueue before acknowledging the message from the session and will only release the enqueue when the database writer acknowledges that the checkpoint request has been completed. So this mechanism ensures that the session will have to wait until the checkpoint is complete and CKPT has released the enqueue before it can continue processing.

(P.S. If you enable event 10704 you will find that there seem to be two KO enqueues involved – one with id2 = 1, one with id2 = 2.)

February 22, 2018

Huge Pages

Filed under: Oracle,RAC,Troubleshooting,Tuning — Jonathan Lewis @ 9:03 am GMT Feb 22,2018

A useful quick summary from Neil Chandler replying to a thread on Oracle-L:

Topic: RAC install on Linux

You should always be using Hugepages.

They give a minor performance improvement and a significant memory saving in terms of the amount of memory needed to handle the pages – less Transaction Lookaside Buffers, which also means less TLB misses (which are expensive).

You are handling the memory chopped up into 2MB pieces instead of 4K. But you also have a single shared memory TLB for Hugepages.

The kernel has less work to do, bookkeeping fewer pointers in the TLB.

You also have contiguous memory allocation and it can’t be swapped.

If you are having problems with Hugepages, you have probably overallocated them (I’ve seen this several times at clients so it’s not uncommon). Hugepages can *only* be used for your SGA’s. All of your SGA’s should fit into the Hugepages and that should generally be no more than about 60% of the total server memory (but there are exceptions), leaving plenty of “normal” memory (small pages) for PGA , O/S and other stuff like monitoring agendas.

As an added bonus, AMM can’t use Hugepages, so your are forced to use ASMM. AMM doesn’t work well and has been kind-of deprecated by oracle anyway – dbca won’t let you setup AMM if the server has more than 4GB of memory.

There are a few follow-up emails after Neil’s; particularly helpful are two from Stefan Koehler, here and here.

For reference here’s a link to the Linux v6 guide to setting up huge pages for Oracle (if I remember to keep this up to date I’ll add further links for newer versions of Linux – unless someone conveniently supplies them in the comments). And a link to Tim Hall’s pages on the subject that add some further bits and pieces that are likely to be relevant in some environments. And a link to a page I use to connect to a very technical classic presentation and video on the topic. And a note I wrote several years ago with a bit of real-life arithmetic explaining why a hardware upgrade didn’t have the expected effect.

Update Nov 2020

I’ve been talking to a client recently about huge pages on a Linux based system, and thought I’d add a couple of notes here catching up a couple of details that I’d missed in the past.

Starting from, Oracle will start up and use a mixture of huge/large pages and standard pages if you haven’t allocated enough huge pages. Prior to the instance would refuse to start so it was easy to discover that you need to do some reconfiguring.

Since you don’t get the automatic warning that you’ve got it wrong, the alert log now tells you about your memory configuration (apparently only if it’s wrong) with an output like the following (which came from a 19c instance with a 9G SGA max size).

 Sys-V shared memory will be used for creating SGA
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

 Per process system memlock (soft) limit = 128G
 Expected per process system memlock (soft) limit to lock
 instance MAX SHARED GLOBAL AREA (SGA) into memory: 9218M
 Available system pagesizes:
  4K, 2048K
 Supported system pagesize(s):
        4K       Configured               7         1749997        NONE
     2048K             1200            4609            1191        NONE
 1. For optimal performance, configure system with expected number
 of pages for every supported system pagesize prior to the next
 instance restart operation.

As you can see I have a report that tells me my memlock (/etc/security/limits.conf) setting is fine, but I’ve had to allocate 1.7M pages of 4KB (along with almost all the available 2MB pages) to make space for the SGA. The report even shows me how many huge pages I need for this instance.

I’m not an expert at the O/S level but I think that means every process connecting to the SGA will (eventually) need a memory map of around 13.5MB (1.7M addresses * 8 bytes) to convert logical memory addresses to physical memory addresses if it needs to address the whole SGA.

January 25, 2018

gc buffer busy

Filed under: Oracle,RAC,Troubleshooting — Jonathan Lewis @ 2:12 pm GMT Jan 25,2018

I had to write this post because I can never remember which way round Oracle named the two versions of gc buffer busy when it started identifying them separately. There are two scenarios to cover when my session wants my instance to acquire a global cache lock on a block and some other session is already trying to acquire that lock or is holding it in an incompatible fashion:

  • The other session is in my instance
  • The other session is in a remote instance

One of these cases is reported as “gc buffer busy acquire”, the other as a “gc buffer busy release” and I always have to check which is which. I think I usually get it right first time when I see it but I always manage to convince myself that I might have got it wrong and end up searching the internet for Riyaj Shamsudeen’s blog posting about it.

The “release” is waiting for another instance to surrender the lock to my instance; the “acquire” is waiting for another session in my instance to finish acquiring the lock from the other  instance.

I decided to jot down this note so I didn’t have to keep searching for Riyaj’s and also because a little problem on OTN at the moment showed a couple of AWR reports with an unlikely combination of waits for acquire (180,000,000) and release (2,000) waits.

If you’re wondering why this looks odd – if I’m waiting for an acquire someone else in my instance must be waiting for a release.  Obviously many sessions could be waiting for one release, and if acquirers time out very rapidly (though they’re not reported as doing so) then the ratio could get very high – but 90,000 acquires per release doesn’t look right.


November 2, 2015


Filed under: Indexing,Oracle,RAC,Statistics — Jonathan Lewis @ 10:27 am GMT Nov 2,2015

I had a recent conversation at Oracle OpenWorld 2015 about a locking anomaly in a 3-node RAC system which was causing unexpected deadlocks. Coincidentally, this conversation came about shortly after I had been listening to Martin Widlake talking about using the procedure dbms_stats.set_table_prefs() to adjust the way that Oracle calculates the clustering_factor for indexes. The juxtaposition of these two topics made me realise that the advice I had given in “Cost Based Oracle – Fundamentals” 10 years ago was (probably) incomplete, and needed some verification. The sticking point was RAC.

In my original comments about setting the table_cached_blocks preference (as it is now known) I had pointed out that the effect of ASSM (with its bitmap space management blocks) was to introduce a small amount of random scattering as rows were inserted by concurrent sessions and this would adversely affect the clustering_factor of any indexes on the table, so a reasonable default value for the table_cached_blocks preference would be 16.

I had overlooked the fact that in RAC each instance tries to acquire ownership of its own level 1 (L1) bitmap block in an attempt to minimise the amount of global cache contention.  If each instance uses a different L1 bitmap block to allocate data blocks then (for tables and their partitions) they won’t be using the same data blocks for inserts and they won’t even have to pass the bitmap blocks between instances when searching for free space. The consequence of this, though, is that if N separate instances are inserting data into a single table there are typically 16 * N different blocks into which sessions could be inserting concurrently so the “most recent” data could be scattered across 16N blocks, which means the appropriate value table_cached_blocks is 16N.

To demonstrate the effect of RAC and multiple L1 blocks here’s a little demonstration code from a 12c RAC database with 3 active instances.

create tablespace test_8k_assm_auto
datafile size 67108864
logging online permanent
blocksize 8192
extent management local autoallocate default
nocompress segment space management auto

create table t1 (n1 number, c1 char(1000)) storage (initial 8M next 8M);

The code above simply creates a tablespace using locally managed extents with system allocated extent sizes then creates a table in that tablespace with a starting requirement of 8MB. Without this specification of initial the first few extents for the table would have been 64KB thanks to the system allocation algorithm and that would have spoiled the demonstration because the table would have started by allocating a single extent of 64KB with just one L1 bitmap block. Slightly different effects would also have appeared with an extent size of 1MB – with 2 L1 bitmap blocks – which is the second possible extent size for system allocation.

Having created the table I connected one session to each of the three instances and inserted one row, with commit, from each instance. Then I ran a simple SQL statement to show me the file and block numbers of the rows inserted:

        dbms_rowid.rowid_relative_fno(rowid)    file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        count(*)                                rows_in_block
group by
order by

---------- ---------- -------------
        19        518             1
        19        745             1
        19       2157             1

As you can see, each row has gone into a separate block – more significantly, though, those blocks are a long way apart from each other – they are in completely different sets of 16 block – each instance is working with its own L1 block (there are 16 of them to choose from in an 8MB extent) and has formatted 16 blocks associated with that L1 for its own use.

In fact this simple test highlighted an anomaly that I need to investigate further. In my first test, after inserting just 3 rows into the table I found that Oracle had formatted 288 blocks (18 groups of 16) across 2 extent – far more than seems reasonable. The effect looks hugely wasteful but that’s mainly because I’ve implied that I have a “large” table into which I’ve then inserted very little data – nevertheless something a little odd has happened. In my second test it got worse because Oracle formatted 16 blocks on the first insert,  took that up to 288 blocks on the second insert, then went up to 816 blocks (using a third extent) on the third insert; then in my third test Oracle behaved as I had assumed it ought to, formatting 3 chunks of 16 blocks each in a single extent – but that might have been because I did a truncate rather than a drop and recreate.


Whatever else is going on, the key point of this note is that if you’re trying to get Oracle to give you a better estimate for the clustering_factor in a RAC system then “16 * instance-count” is probably a good starting point for setting the table preference known as table_cached_blocks.

The anomaly of data being scattered extremely widely with more extents being allocated than you might expect is probably a boundary condition that you don’t have to worry about – but I may change my mind about that when I’ve had time to look at it a little more closely.


February 10, 2014

RAC Plans

Filed under: Execution plans,Hints,Oracle,RAC,Troubleshooting — Jonathan Lewis @ 1:12 pm GMT Feb 10,2014

Recently appeared on Mos – “Bug 18219084 : DIFFERENT EXECUTION PLAN ACROSS RAC INSTANCES”

Now, I’m not going to claim that the following applies to this particular case – but it’s perfectly reasonable to expect to see different plans for the same query on RAC, and it’s perfectly possible for the two different plans to have amazingly different performance characteristics; and in this particular case I can see an obvious reason why the two nodes could have different plans.

Here’s the query reported in the bug:


October 3, 2011

Personal RAC

Filed under: Oracle,RAC — Jonathan Lewis @ 12:55 pm BST Oct 3,2011

Thanks to Tim Hall, I can now show you a snapshot of OEM demonstrating my personal four-node RAC.


Website Powered by