Here’s a note that I might have written once already – but I can’t find it and I’ve just been reminded about what it (might have) said by a posting that came up on the Oracle database forum in the last few days.
The posting in question is asking why, after setting the hidden parameter _serial_direct_read to ‘always’, a particular query is now taking hours to complete when it used to complete in a minute or so.
The answer is partly “because you’ve forced direct path serial reads”, partly “because you’re running on RAC” and (most directly) because the optimizer is using a really bad execution plan for that query and the direct path reads have had a massive impact as a consequence. (It turns out, after modelling, that the answer might also include “because you’re running 11.2.0.4”)
I’m going to demonstrate the issue by forcing a very simple query to take a very bad execution plan.
rem
rem Script: ko.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem Purpose:
rem
rem Last tested
rem 19.3.0.0
rem
create table t1
as
select *
from all_objects
where rownum <= 10000 -- > comment to avoid wordpress format issue
;
create table t2
as
select *
from all_objects
where rownum <= 10000 -- > comment to avoid wordpress format issue
;
select table_name, blocks
from user_tables
where table_name in ('T1','T2')
;
alter system flush buffer_cache;
execute snap_events.start_snap
execute snap_my_stats.start_snap
alter session set "_serial_direct_read"=always;
select
        /*+
                leading(t1 t2)
                use_nl(t2)
                full(t2)
        */
        t1.object_type,
        t2.object_type
from
        t1, t2
where
        t2.object_id = t1.object_id + 0.5
;
execute snap_my_stats.end_snap
execute snap_events.end_snap
alter session set "_serial_direct_read"=auto;
My query very carefully ensures that it’s not going to return any rows; but it’s going to do a lot of work finding no data because I’ve forced Oracle into doing a tablescan of t2 for every row in t1 – so 10,000 scans of a table of 140 – 190 (depending on version) data blocks.
The snap_my_stats and snap_events packages are a couple of my simple diagnostic packages that allow me to find the change in some v$ content between the start and end snapshots. In this case it’s v$mystat and v$session_event for the session.
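If you don’t have a similar pair of packages, a crude substitute (just a sketch – pick whichever columns you care about) is to query the view directly before and after the test statement and subtract the “before” figures by hand:
rem
rem     Crude substitute for the snapshot packages: run before and
rem     after the test statement and difference the figures manually.
rem
select
        event, total_waits, time_waited_micro
from
        v$session_event
where
        sid = sys_context('userenv','sid')
order by
        time_waited_micro desc
;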
On a test using 11.2.0.4 the query ran for about 41 seconds with the following wait events reported:
Event                                 Waits  Time_outs         Csec  Avg Csec  Max Csec
-----                                 -----  ---------         ----  --------  --------
db file sequential read                   2          0         0.11      .054         1
direct path read                      6,527          0       368.30      .056         6
SQL*Net message to client                13          0         0.00      .000         0
SQL*Net message from client              13          0    10,689.26   822.251    10,689
The 6,500 “direct path read” waits corresponded to 1.33M “physical reads direct” reported in the session activity stats. Although the t2 table was fairly small Oracle was forced to use direct path reads for every single cycle through the nested loop. As a quick comparison, here are the figures if I don’t force direct path serial scans.
Event                                 Waits  Time_outs         Csec  Avg Csec  Max Csec
-----                                 -----  ---------         ----  --------  --------
db file sequential read                   2          0         0.09      .047         2
db file scattered read                   34          0         2.45      .072         0
SQL*Net message to client                11          0         0.01      .001         0
SQL*Net message from client              11          0       174.36    15.851    82,849
We’ve waited for only 34 “db file scattered reads” and 2.45 centiseconds as we read the t2 (and the t1) tables into the cache for the first time, but then we’ve been able to revisit the blocks in the cache. We also saw a reduction in CPU usage and the total run time dropped from 41 seconds to about 22 seconds.
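If you want to check the direct vs. cached split for yourself without the snapshot packages, the relevant counters can be picked out of the session activity stats – the statistic names below are the ones I’d normally look at, so check v$statname on your own version:
select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in (
                'physical reads direct',
                'physical reads cache',
                'table scans (direct read)',
                'table scans (short tables)'
        )
order by
        sn.name
;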
In this tiny example it hasn’t made a staggering difference to the overall run time, but the OP wasn’t that lucky with his “couple of minutes” to “hours”.
If you look at the fragment of the SQL Monitor report supplied by the OP you’ll see that they have an operation which shows:
=====================================================================================================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Read | Read | Cell | Mem | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes | Offload | | (%) | (# samples) |
=====================================================================================================================================================================================================================================
| -> 25 | INDEX STORAGE FAST FULL SCAN | TMRC_IX1 | 1 | | 10745 | +0 | 268K | 112K | 267K | 2GB | 96.53% | 1M | 99.74 | enq: KO - fast object checkpoint (4783) |
| | | | | | | | | | | | | | | Cpu (753) |
| | | | | | | | | | | | | | | gcs drm freeze in enter server mode (25) |
| | | | | | | | | | | | | | | latch free (1) |
| | | | | | | | | | | | | | | reliable message (3505) |
| | | | | | | | | | | | | | | cell smart index scan (1635) |
=====================================================================================================================================================================================================================================
The “index fast full scan” is an example of a “segment scan” and will be subject to direct path reads, just like a tablescan. We don’t really know how big this index is – but we can see that we have read it 268K times (Execs column) reading approximately 2GB after 267K read requests. This suggests the index is at most 1MB in size since it can be read in a single read request, and may consist of only one populated block (2,000,000,000 / 267,000 = 7,490 bytes per read request). Despite this small size the total time sampled for all these scans is more than 10,600 seconds – roughly 39 milliseconds per scan. That’s not very good.
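If we had access to the OP’s system we could confirm the guess about the size of the index with a couple of simple queries – the index name comes from the plan above, so this is purely illustrative:
select
        segment_name, blocks, bytes
from
        dba_segments
where
        segment_name = 'TMRC_IX1'
and     segment_type like 'INDEX%'
;

select
        index_name, blevel, leaf_blocks
from
        dba_indexes
where
        index_name = 'TMRC_IX1'
;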
Looking more closely at the sampled time we notice 3 key features:
============================================
| Activity Detail |
| (# samples) |
============================================
| enq: KO - fast object checkpoint (4783) |
| Cpu (753) |
| gcs drm freeze in enter server mode (25) |
| latch free (1) |
| reliable message (3505) |
| cell smart index scan (1635) |
============================================
- A massive fraction of the time was spent on “enq: KO – fast object checkpoint”
- A large fraction of the time was spent on “reliable message”
- There was some time (relatively small, but large in absolute terms) for “gcs drm freeze …”
The last of these three is about “global cache services, distributed resource manager” and is a clear indication that the system is running RAC and we have some hefty competition for “object mastering” between instances. But the actual time lost there is relatively small – though associated chatter between instances could be significant.
But what’s the “KO enqueue”? Every time an instance starts a direct path segment scan it has to get a message to the database writer (possibly via the checkpoint process) – hence the “reliable message” waits – to copy every dirty block for that segment from the buffer cache down to disc and it has to wait for the write to complete. This is necessary to ensure that the tablescan doesn’t miss any changes that have been made in memory without yet being written to disc.
The KO enqueue synchronises this activity – I haven’t worked out the complete chain of events, but the enqueue is negotiated between the session and the checkpoint process – and if you’re running RAC every instance has to write any dirty blocks it is holding for the segment, so you have to have a degree of cross-instance chatter to make this happen.
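If you want to see how much of your own session’s time is disappearing into this mechanism, the two wait events to watch are the enqueue itself and the associated “reliable message” wait – e.g. (for the current session):
select
        event, total_waits, time_waited
from
        v$session_event
where
        sid = sys_context('userenv','sid')
and     event in (
                'enq: KO - fast object checkpoint',
                'reliable message'
        )
;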
Thanks to the enforced serial direct reads the OP’s plan – which, surely, shouldn’t expect to do 267K index fast full scans – has a massive overhead thanks to the need for the repeated object checkpoints.
You may ask, at this point, why I didn’t see any KO enqueue waits in my test results – the answer is simple: I’d flushed the buffer cache before I started the test, so there were no dirty blocks for the session to worry about. Let’s see what happens if I introduce a little activity to keep dirtying a few blocks in the t2 table. Here’s a little loop that will update a few rows once per second:
begin
        for i in 1..1 loop
                update t2 set data_object_id = 0 where mod(object_id,1000) = i;
                dbms_output.put_line(sql%rowcount);
                commit;
                dbms_lock.sleep(1);
        end loop;
end;
/
You’ll notice the code runs through the loop just once – I started with a loop count of 30, and discovered it wasn’t necessary, but the option remains for further testing.
If I execute this anonymous block (which updates about 10 rows each time through the loop) from another session just after SQL*Plus reports my “alter session”, so that it runs just after the query starts, this is what the session event report looks like if I run the test against 11.2.0.4 (the version reported by the OP):
Event                                 Waits  Time_outs         Csec  Avg Csec  Max Csec
-----                                 -----  ---------         ----  --------  --------
enq: KO - fast object checkpoint      7,645          0       290.34      .038         6
db file sequential read                   2          0         0.13      .066         2
direct path read                     10,714          0       675.50      .063         6
SQL*Net message to client                14          0         0.00      .000         0
SQL*Net message from client              14          0     1,556.23   111.160   101,653
events in waitclass Other             5,607          0       218.04      .039         8
Suddenly we see a lot of time spent on the KO enqueue waits and the “events in waitclass Other” (which turn out to be “reliable message” waits). Apparently the session keeps finding dirty t2 blocks in the cache and telling the database writer they need to be written to disc before the next tablescan of t2 can take place.
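The “events in waitclass Other” line is simply how my snapshot report rolls up the miscellaneous wait class; if you want to see which individual events are hiding inside it you can expand the class directly from v$session_event:
select
        event, total_waits, time_waited
from
        v$session_event
where
        sid = sys_context('userenv','sid')
and     wait_class = 'Other'
order by
        time_waited desc
;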
There’s something odd here, though, and I’ll introduce it with this comment: when I repeated the test on 19.3 (even with the constant trickle of updates once per second) I saw only a tiny number of KO enqueues and reliable message waits – the fact that we see a huge number of them in 11g is a defect in the design of the code.
Think about what’s happening with the KO enqueue: when you start the first tablescan of t2 you force every dirty block for that segment to be copied from the cache to the disc.
As the tablescan proceeds you may have to apply some undo change vectors to the blocks you’re reading to take them back to the SCN as at the start of query execution, but you know that any data that had been committed before the query started is on disc (even if it has been over-written by committed changes made after the query started, or by uncommitted changes made before the query started). What’s on the disc right now will be usable to get the correct read-consistent version of the data for the duration of the query run, no matter how many newer changes are made, whether or not they over-write the disc blocks before the query ends. There is no need to force write any dirty blocks as the tablescan is repeated and, it seems, by 19.3 the code has been adjusted to accommodate that fact.
Footnote
After I had posted this comment on the forum, the OP raised the question of whether or not the fix might apply to 12c as well – so I ran up VMs of 12.1.0.2 and 12.2.0.1 and re-ran the tests. The results were initially promising – neither version reported an extreme number of KO enqueue waits or reliable message waits.
However when I changed the loop counter from 1 back to 30 I found that the waits re-appeared – though the numbers were significantly less than those from 11g – so perhaps there’s a timing element involved that might need further investigation and stress testing even for 19.3.
Footnote 2
Another snapshot I took in testing was from v$enqueue_stat – which showed that (approximately) for every KO enqueue wait my session reported, the instance reported about 10 – 12 KO enqueue requests.
When checking v$enqueue_stat it’s important to remember that the session activity stats (v$sesstat) report both “enqueue requests” and “enqueue conversions”. In v$enqueue_stat the conversions aren’t reported separately; the view simply adds the two figures together under “requests”.
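So to cross-check the two sources you have to add the session’s requests and conversions together before comparing them with the instance-level figure – and remember that the session statistics cover every enqueue type, not just KO, so this is only a rough reconciliation:
select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in ('enqueue requests', 'enqueue conversions', 'enqueue waits')
;

select
        eq_type, total_req#, total_wait#, succ_req#, failed_req#, cum_wait_time
from
        v$enqueue_stat
where
        eq_type = 'KO'
;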
In the 11g test the session reported 7,645 KO enqueue waits, but the session activity stats reported 19,399 enqueue requests and 38,796 enqueue conversions, while v$enqueue_stat reported 96,990 KO enqueue requests. The remaining 38,796 KO enqueue requests were made by the checkpoint process (CKPT) – and it was only my session that reported any waits for KO enqueue requests.
Without further low-level investigation this is what leads me to believe that the session sends CKPT a message that it wants an object-level checkpoint performed and waits for the message to be acknowledged (reliable message) before trying to convert a low-level KO enqueue to an exclusive one. But CKPT has acquired and converted the same KO enqueue before acknowledging the message from the session and will only release the enqueue when the database writer acknowledges that the checkpoint request has been completed. So this mechanism ensures that the session will have to wait until the checkpoint is complete and CKPT has released the enqueue before it can continue processing.
(P.S. If you enable event 10704 you will find that there seem to be two KO enqueues involved – one with id2 = 1, one with id2 = 2.)
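For anyone who wants to repeat that check, the event can be set with the usual diagnostic-event syntax; the levels for 10704 aren’t formally documented, so treat the level shown here as a starting point rather than a recommendation:
alter session set events '10704 trace name context forever, level 3';

-- run the test, then switch the trace off again
alter session set events '10704 trace name context off';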