Oracle Scratchpad

December 6, 2019

Temp space

Filed under: Execution plans,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 12:18 pm GMT Dec 6,2019

A question about hunting down the source of the error “ORA-01652 unable to extend temp segment by NNN in tablespace XXX” shows up on the Oracle-L mailing list or the Oracle developer community forum from time to time. In most cases the tablespace referenced is the temporary tablespace, which means the session reporting the error was probably trying to allocate some space for sorting, or doing a hash join, or instantiating a GTT (global temporary table) or a CTE (common table expression / “with” subquery). The difficulty in cases like this is that the session reporting the error might be the victim of some other session’s greed – so looking at what the session was doing won’t necessarily point you to the real problem.

Of course you then run into a further problem tracking down the source of the problem. By the time you hear the complaint (even if it’s only seconds after the error appeared) the session that had been hogging the temporary tablespace may have finished what it was doing, leaving a huge amount of free space in the temporary tablespace and suggesting (to the less experienced and cynical user) that there’s something fundamentally wrong with the way Oracle has been accounting for space usage.

If you find yourself in this situation remember that (if you’re licensed to take advantage of it) the active session history may be able to help.  One of the columns in v$active_session_history is called temp_space_allocated with the slightly misleading description: “Amount of TEMP memory (in bytes) consumed by this session at the time this sample was taken”. A simple query against v$active_session_history may be enough to identify the session and SQL  statement that had been holding the temporary space when the error was raised, for example:


column pga_allocated        format 999,999,999,999
column temp_space_allocated format 999,999,999,999

break on session_id skip 1 on session_serial#

select
        session_id, session_serial#, 
        sample_id, 
        sql_id, 
        pga_allocated,
        temp_space_allocated
from
        v$active_session_history
where
        sample_time between sysdate - 5/1440 and sysdate
and     nvl(temp_space_allocated,0) != 0
order by
        session_id, sample_id
/

All I’ve done for this example is query v$active_session_history for the last 5 minutes reporting a minimum of information from any rows that show temp space usage. As a minor variation on the theme you can obviously change the time range, and you might want to limit the output to rows reporting more than 1MB (say) of temp space usage.

You’ll notice that I’ve also reported the pga_allocated (Description: Amount of PGA memory (in bytes) consumed by this session at the time this sample was taken) in this query; this is just a little convenience – a query that’s taking a lot of temp space will probably start by acquiring a lot of memory so it’s nice to be able to see the two figures together.

There are plenty of limitations and flaws in the usefulness of this report and I’ll say something about that after showing an example of usage. Let’s start with a script to build some data before running a space-consuming query:


rem
rem     Script:         allocate_tempspace.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table t1 as 
select * from all_objects
;

insert into t1 select * from t1;
commit;

insert into t1 select * from t1;
commit;

insert into t1 select * from t1;
commit;

insert into t1 select * from t1;
commit;

execute dbms_stats.gather_table_stats(null,'t1')

execute dbms_lock.sleep(20)

set pagesize  60
set linesize 255
set trimspool on
set serveroutput off
alter session set statistics_level = all;

with ttemp as (
        select /*+ materialize */ * from t1
)
select 
        /*+ no_partial_join(@sel$2 t1b) no_place_group_by(@sel$2) */ 
        t1a.object_type,
        max(t1a.object_name)
from
        ttemp t1a, ttemp t1b
where
        t1a.object_id = t1b.object_id
group by
        t1a.object_type
order by
        t1a.object_type
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

My working table t1 consists of 16 copies of the view all_objects – so close to 1 million rows in my case – and the query is hinted to avoid any of the clever transformations that the optimizer could use to reduce the workload so it’s going to do a massive hash join and aggregation to report a summary of a couple of dozen rows. Here’s the execution plan (in this case from 12.2.0.1, though the plan is the same for 19.3 with some variations in the numbers).


SQL_ID  1cwabt12zq6zb, child number 0
-------------------------------------

Plan hash value: 1682228242

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |      1 |        |     29 |00:00:10.03 |   47413 |  21345 |  12127 |       |       |          |         |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |      1 |        |     29 |00:00:10.03 |   47413 |  21345 |  12127 |       |       |          |         |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D665B_E2772D3 |      1 |        |      0 |00:00:01.51 |   28915 |      0 |   9217 |  2068K|  2068K|          |         |
|   3 |    TABLE ACCESS FULL                     | T1                         |      1 |    989K|    989K|00:00:00.24 |   19551 |      0 |      0 |       |       |          |         |
|   4 |   SORT GROUP BY                          |                            |      1 |     29 |     29 |00:00:08.51 |   18493 |  21345 |   2910 |  6144 |  6144 | 6144  (0)|         |
|*  5 |    HASH JOIN                             |                            |      1 |     15M|     15M|00:00:03.93 |   18493 |  21345 |   2910 |    48M|  6400K|   65M (1)|   25600 |
|   6 |     VIEW                                 |                            |      1 |    989K|    989K|00:00:00.36 |    9233 |   9218 |      0 |       |       |          |         |
|   7 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D665B_E2772D3 |      1 |    989K|    989K|00:00:00.35 |    9233 |   9218 |      0 |       |       |          |         |
|   8 |     VIEW                                 |                            |      1 |    989K|    989K|00:00:00.40 |    9257 |   9217 |      0 |       |       |          |         |
|   9 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D665B_E2772D3 |      1 |    989K|    989K|00:00:00.39 |    9257 |   9217 |      0 |       |       |          |         |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1A"."OBJECT_ID"="T1B"."OBJECT_ID")

Critically this plan shows us two uses of the temp space but only reports one of them as Used-Tmp. The “hash join” at operation 5 tells us that it reached 65MB of (tunable PGA) memory before going “1-pass”, eventually dumping 25,600 KB to disc. This space usage is corroborated by the 2,910 writes (which, at an 8KB block size, would be 23,280 KB). The missing Used-Tmp, however, is the space taken up by the materialized CTE. We can see that operation 2 is a “load as select” that writes 9,217 blocks to disc (subsequently read back twice – the tablescans shown in operations 7 and 9). That’s  74,000 KB of temp space that doesn’t get reported Used-Tmp.

If we take a look at the plan from 19.3 we see different numbers, but the same “error of omission”:

SQL_ID  1cwabt12zq6zb, child number 0
-------------------------------------

Plan hash value: 1682228242

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |      1 |        |     25 |00:00:08.15 |   34905 |  13843 |   8248 |       |       |          |         |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |      1 |        |     25 |00:00:08.15 |   34905 |  13843 |   8248 |       |       |          |         |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6624_E259E68 |      1 |        |      0 |00:00:01.26 |   23706 |      0 |   5593 |  2070K|  2070K|          |         |
|   3 |    TABLE ACCESS FULL                     | T1                         |      1 |    907K|    907K|00:00:00.21 |   18024 |      0 |      0 |       |       |          |         |
|   4 |   SORT GROUP BY                          |                            |      1 |     25 |     25 |00:00:06.89 |   11193 |  13843 |   2655 |  6144 |  6144 | 6144  (0)|         |
|*  5 |    HASH JOIN                             |                            |      1 |     14M|     14M|00:00:03.55 |   11193 |  13843 |   2655 |    44M|  6400K|   64M (1)|      23M|
|   6 |     VIEW                                 |                            |      1 |    907K|    907K|00:00:00.26 |    5598 |   5594 |      0 |       |       |          |         |
|   7 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6624_E259E68 |      1 |    907K|    907K|00:00:00.25 |    5598 |   5594 |      0 |       |       |          |         |
|   8 |     VIEW                                 |                            |      1 |    907K|    907K|00:00:00.34 |    5595 |   5594 |      0 |       |       |          |         |
|   9 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6624_E259E68 |      1 |    907K|    907K|00:00:00.33 |    5595 |   5594 |      0 |       |       |          |         |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1A"."OBJECT_ID"="T1B"."OBJECT_ID")


With slightly fewer rows in t1 (907K vs. 989K) we write 5,593 blocks for the materialized CTE  (instead of 9,217) and spill 2,655 blocks during the hash join (instead of 2,910). But again it’s only the hash join spill that is reported under Used-Tmp. Note, by the way, that the Used-Tmp in 12.2 was reported in KB when it’s reported in MB in 19.3.0.0.

Side note: comparing the number of rows created and blocks written for the CTE, it looks as if 19.3 is using the data blocks much more efficiently than 12.2. There’s no obvious reason for this (though a first guess would be that the older mechanism is to write a GTT with pctfree=10 while the new avoid any free space and transactional details) so, as ever, I now have another draft for a blog note reminding me to investigate (eventually) what differences there are in CTE storage on the upgrade. It’s something that might make a difference in a few special cases.

With the figures from the execution plans in mind we can now look at the results of the query against v$active_session_history. Conveniently the queries took a few seconds to complete, so we’re going to see several rows for each execution.

First the results from 12.2.0.1

SESSION_ID SESSION_SERIAL#  SAMPLE_ID SQL_ID           PGA_ALLOCATED TEMP_SPACE_ALLOCATED
---------- --------------- ---------- ------------- ---------------- --------------------
        14           22234   15306218 1cwabt12zq6zb       95,962,112            1,048,576
                             15306219 1cwabt12zq6zb       97,731,584           37,748,736
                             15306220 1cwabt12zq6zb      148,194,304           77,594,624
                             15306221 1cwabt12zq6zb      168,117,248           85,983,232
                             15306222 1cwabt12zq6zb      168,117,248           90,177,536
                             15306223 1cwabt12zq6zb      168,117,248           95,420,416
                             15306224 1cwabt12zq6zb      168,117,248           98,566,144
                             15306225 1cwabt12zq6zb      168,117,248          102,760,448
                             15306226 1cwabt12zq6zb      116,933,632          103,809,024
                             15306227 1cwabt12zq6zb      116,933,632          103,809,024
                             15306228 b66ycurnwpgud        8,602,624            1,048,576

I pointed out that we had 25,600 KB reported as Used-Tmp and roughly 74,000 KB unreported – a total of nearly 100,000 KB that is reasonably close to the 103,800,000 bytes reported by ASH. Moreover the timing of the plan (loading the CTE in the first 2 seconds) seems to agree with the growth to 77,590,000 of temp_space_allocated by the time we get to sample_id 15306220 in ASH. Then we have several seconds of slow growth as the hash join takes place and feeds its resulte up to the sort group by. At the end of the query we happen to have been lucky enough to catch one last sample just before the session had released all its temp space and ceased to be active.  (Note: however, that the sql_id at that sample point was not the sql_id of our big query – and that’s a clue about one of the limitations of using ASH to find the greedy SQL.)

We see the same pattern of behaviour in 19.3.0.0:


SESSION_ID SESSION_SERIAL#  SAMPLE_ID SQL_ID           PGA_ALLOCATED TEMP_SPACE_ALLOCATED
---------- --------------- ---------- ------------- ---------------- --------------------
       136           42767    2217500 1cwabt12zq6zb      143,982,592           46,137,344
                              2217501 1cwabt12zq6zb      193,527,808           54,525,952
                              2217502 1cwabt12zq6zb      193,527,808           57,671,680
                              2217503 1cwabt12zq6zb      193,527,808           61,865,984
                              2217504 1cwabt12zq6zb      197,722,112           67,108,864
                              2217505 1cwabt12zq6zb      150,601,728           70,254,592
                              2217506 1cwabt12zq6zb      150,601,728           70,254,592

We start with an almost instantaneous jump to 46MB of temp_space_allocated in the first second of the query – that’s the 5,593 blocks of the CTE being materialized, then the slow growth of temp space as the hash join runs, spills to disc, and passes its data up to the sort group by. Again we can see that the peak usage was the CTE (46MB) plus the reported spill of 23MB (plus rounding errors and odd bits).

Preliminary Observations

Queries against ASH (v$active_session_history) can show us sessions that were holding space in the temporary tablespace at the moment a sample of active sessions was taken. This may allow us to identify greedy sessions that were causing other sessions to fail with ORA-01652 (unable to allocate temp segment).

We have seen that there is at least one case where we get better information about temp space allocation from ASH than we do from the variants on v$sql_plan that include the SQL Workarea information (v$sql_workarea, v$sql_workarea_active) because the space acquired during materialization of CTEs is not reported as a “tunable SQL workarea” but does appear in the ASH temp_space_allocated.

At first sight it looks as if we may be able to use the query against ASH to identify the statement (by sql_id) that was the one being run by the greedy session when it consumed all the space. As we shall see in a further article, there are various reasons why this may over-optimistic, however in many cases there’s a fair chance that when you see the same sql_id appearing in a number of consecutive rows of the report then that statement may be the thing that is responsible for the recent growth in temp space usage – and you can query v$sql to find the text and call dbms_xplan.display_cursor() to get as much execution plan information as possible.

Further questions

  • When does a session release the temp_space_allocated? Will the space be held (locked) as long as the cursor is open, or can it be released when it is no longer needed? Will it be held, but releasable, even after the cursor has (from the client program’s perspective) been closed?
  • Could we be fooled by a report that said a session was holding a lot of space when it didn’t need it and would have released it if the demand had appeared?
  • Under what conditions might the temp_space_allocated in an ASH sample have nothing to do with the sql_id reported in the same sample?
  • Are there any other reasons why ASH might report temp_space_allocated when an execution plan doesn’t?
  • Is temp_space_allocated only about the temporary tablespace, or could it include informatiom about other (“permanent”) tablespaces ?

Stay tuned for further analysis of the limitations of using v$active_session_history.temp_space_allocated to help identify the srouce of a space management ORA-01652 issue.

 

 

November 17, 2019

Parse Time

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 7:37 pm GMT Nov 17,2019

This is a note I started drafting In October 2012. It’s a case study from an optimizer (10053) trace file someone emailed to me, and it describes some of the high-level steps I went through to see if I could pinpoint what the optimizer was doing that fooled it into spending a huge amount of time optimising a statement that ultimately executed very quickly.

Unfortunately I never finished my notes and I can no longer find the trace file that the article was based on, so I don’t really know what I was planning to say to complete the last observation I had recorded.

I was prompted a  couple of days ago to publish the notes so far becuase I was reminded in a conversation with members of the Oak Table Network about an article that Franck Pachot wrote a couple of years ago. In 12c Oracle Corp. introduced a time-reporting mechanism for the optimizer trace. If some optimisation step takes “too long” (1 second, by default) then then optimizer will write a “TIMER:” line into the trace file telling you what the operation was and how long it took to complete and how much CPU time it used.  The default for “too long” can be adjusted by setting a “fix control”.  This makes it a lot easier to find out where the time went if you see a very long parse time.

But let’s get back to the original trace file and drafted blog note. It started with a question on OTN and an extract from a tkprof output to back up a nasty  performance issue.

=============================================================================================

 

What do you do about a parse time of 46 seconds ? That was the question that came up on OTN a few days ago – and here’s the tkprof output to demonstrate it.

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1     46.27      46.53          0          5          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.33       0.63        129      30331          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4     46.60      47.17        129      30336          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 144  
Number of plan statistics captured: 1
 
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=30331 pr=129 pw=0 time=637272 us)
       863        863        863   VIEW  VM_NWVW_1 (cr=30331 pr=129 pw=0 time=637378 us cost=1331 size=10 card=1)
       ... and lots more lines of plan

According to tkprof, it takes 46 seconds – virtually all CPU time – to optimise this statement, then 0.63 seconds to run it. You might spot that this is 11gR2 (in fact it’s 11.2.0.3) from the fact that the second line of the “Row Source Operation” includes a report of the estimated cost of the query, which is only 1,331.

Things were actually worse than they seem at first sight; when we saw more of tkprof output the following also showed up:

SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE 
  NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false') 
  NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),:"SYS_B_00"), 
  NVL(SUM(C2),:"SYS_B_01") 
FROM
 (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("VAL_000002") FULL("VAL_000002") 
  NO_PARALLEL_INDEX("VAL_000002") */ :"SYS_B_02" AS C1, 
  CASE WHEN
    ...
  END AS C2 FROM "BISWEBB"."RECORDTEXTVALUE" 
  SAMPLE BLOCK (:"SYS_B_21" , :"SYS_B_22") SEED (:"SYS_B_23") "VAL_000002" 
  WHERE ... 
 ) SAMPLESUB
 
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        5      0.00       0.00          0          0          0           0
Execute      5      0.00       0.00          0          0          0           0
Fetch        5     21.41      24.14      11108      37331          0           5
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       15     21.41      24.15      11108      37331          0           5
 
Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 144     (recursive depth: 1)
Number of plan statistics captured: 3
 
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=7466 pr=3703 pw=0 time=5230126 us)
   3137126    3137126    3137126   PARTITION HASH ALL PARTITION: 1 128 (cr=7466 pr=3703 pw=0 time=2547843 us cost=18758 size=131597088 card=3133264)
   3137126    3137126    3137126    TABLE ACCESS SAMPLE RECORDTEXTVALUE PARTITION: 1 128 (cr=7466 pr=3703 pw=0 time=2372509 us cost=18758 size=131597088 card=3133264)

This piece of SQL executed five times as the query was optimised, adding a further 24 seconds elapsed time and 21 CPU seconds which, surprisingly, weren’t included in the headline 46 seconds. The total time spent in optimising the statement was around 70 seconds, of which about 68 seconds were spent on (or waiting for) the CPU.

This is unusual – I don’t often see SQL statements taking more than a few seconds to parse – not since 8i, and not without complex partition views – and I certainly don’t expect to see a low cost query in 11.2.0.3 taking anything like 70 (or even 46) seconds to optimise.

The OP had enabled the 10046 and the 10053 traces at the same time – and since the parse time was sufficiently unusual I asked him to email me the raw trace file – all 200MB of it.

Since it’s not easy to process 200MB of trace the first thing to do is extract a few headline details, and I thought you might be interested to hear about some of the methods I use on the rare occasions when I decide to look at a 10053.

My aim is to investigate a very long parse time and the tkprof output had already shown me that there were a lot of tables in the query, so I had the feeling that the problem would relate to the amount of work done testing possible join orders; I’ve also noticed that the dynamic sampling code ran five times – so I’m expecting to see some critical stage of the optimisation run 5 times (although I don’t know why it should).

Step 1: Use grep (or find if you’re on Windows) to do a quick check for the number of join orders considered. I’m just searching for the text “Join order[” appearing at the start of line and then counting how many times I find it:

[jonathan@linux01 big_trace]$ grep "^Join order\[" orcl_ora_25306.trc  | wc -l
6266

That’s 6,266 join orders considered – let’s take a slightly closer look:

[jonathan@linux01 big_trace]$ grep -n "^Join order\[" orcl_ora_25306.trc >temp.txt
[jonathan@linux01 big_trace]$ tail -2 temp.txt
4458394:Join order[581]:  RECORDTYPEMEMBER[RTM]#9  RECORD_[VAL_000049]#13  ...... from$_subquery$_008[TBL_000020]#2
4458825:Join order[1]:  VM_NWVW_1[VM_NWVW_1]#0

The line of dots represents another 11 tables (or similar objects) in the join order. But there are only 581 join orders (apparently) before the last one in the file (which is a single view transformation). I’ve used the “-n” option with grep, so if I wanted to look at the right bit of the file I could tail the last few thousand lines, but my machine is happy to use vi on a 200MB file, and a quick search (backwards) through the file finds the number 581 in the following text (which does not appear in all versions of the trace file):

Number of join permutations tried: 581

So a quick grep for “join permutations” might be a good idea. (In the absence of this line I’d have got to the same result by directing the earlier grep for “^Join order\[“ to a file and playing around with the contents of the file.

[jonathan@linux01 big_trace]$ grep -n "join permutations" orcl_ora_25306.trc
11495:Number of join permutations tried: 2
11849:Number of join permutations tried: 1
12439:Number of join permutations tried: 2
13826:Number of join permutations tried: 2
14180:Number of join permutations tried: 1
14552:Number of join permutations tried: 2
15938:Number of join permutations tried: 2
16292:Number of join permutations tried: 1
16665:Number of join permutations tried: 2
18141:Number of join permutations tried: 2
18550:Number of join permutations tried: 2
18959:Number of join permutations tried: 2
622799:Number of join permutations tried: 374
624183:Number of join permutations tried: 2
624592:Number of join permutations tried: 2
624919:Number of join permutations tried: 1
625211:Number of join permutations tried: 2
1759817:Number of join permutations tried: 673
1760302:Number of join permutations tried: 1
1760593:Number of join permutations tried: 2
1760910:Number of join permutations tried: 1
1761202:Number of join permutations tried: 2
2750475:Number of join permutations tried: 674
2751325:Number of join permutations tried: 2
2751642:Number of join permutations tried: 1
2751933:Number of join permutations tried: 2
2752250:Number of join permutations tried: 1
2752542:Number of join permutations tried: 2
3586276:Number of join permutations tried: 571
3587133:Number of join permutations tried: 2
3587461:Number of join permutations tried: 1
3587755:Number of join permutations tried: 2
3588079:Number of join permutations tried: 1
3588374:Number of join permutations tried: 2
4458608:Number of join permutations tried: 581
4458832:Number of join permutations tried: 1

The key thing we see here is that there are five sections of long searches, and a few very small searches. Examination of the small search lists shows that they relate to some inline views which simply join a couple of tables. For each of the long searches we can see that the first join order in each set is for 14 “tables”. This is where the work is going. But if you add up the number of permutations in the long searches you get a total of 2,873, which is a long way off the 6,266 that we found with our grep for “^Join order[“ – so where do the extra join orders come from ? Let’s take a closer look at the file where we dumped all the Join order lines – the last 10 lines look like this:

4452004:Join order[577]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4452086:Join order[577]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4453254:Join order[578]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4453382:Join order[578]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4454573:Join order[579]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4454655:Join order[579]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4455823:Join order[580]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4455905:Join order[580]:  RECORD_[VAL_000033]#10  from$_subquery$_017[TBL_000029]#1 ...
4457051:Join order[581]:  RECORDTYPEMEMBER[RTM]#9  RECORD_[VAL_000049]#13  ...
4458394:Join order[581]:  RECORDTYPEMEMBER[RTM]#9  RECORD_[VAL_000049]#13  ...
4458825:Join order[1]:  VM_NWVW_1[VM_NWVW_1]#0

Every single join order seems to have appeared twice, and doubling the counts we got for the sum of the permutations gets us close to the total we got for the join order search. Again, we could zoom in a little closer, does the text near the start of the two occurrences of join order 581 give us any clues ? We see the following just before the second one:

****** Recost for ORDER BY (using join row order) *******

The optimizer has tried to find a way of eliminating some of the cost by letting the table join order affect the order of the final output. Let’s do another grep to see how many join orders have been recosted:

[jonathan@linux01 big_trace]$ grep "Recost for ORDER BY" orcl_ora_25306.trc | sort | uniq -c
    452 ****** Recost for ORDER BY (using index) ************
   2896 ****** Recost for ORDER BY (using join row order) *******

So we’ve done a huge amount recosting. Let’s check arithmetic: 452 + 2,896 + 2,873 = 6,221, which is remarkably close to the 6,266 we needed (and we have ignored a few dozen join orders that were needed for the inline views, and the final error is too small for me to worry about).

We can conclude, therefore, that we did a huge amount of work costing a 14 table join a little over 6,000 times. It’s possible, of course, that we discarded lots of join orders very early on in the cost stage, so we could count the number of times we see a “Now joining” message – to complete a single pass on a 14 table join the optimizer will have to report “Now joining” 13 times.

[jonathan@linux01 big_trace]$ grep -n "Now joining" orcl_ora_25306.trc | wc -l
43989

Since the message appeared 44,000 times from 6,200 join orders we have an average of 7 steps evaluated per join order. Because of the way that the optimizer takes short-cuts I think this is a fairly strong clue that most of the join order calculations actually completed, or get very close to completing, over the whole 14 tables. (The optimizer remembers “partial results” from previous join order calculations, so doesn’t have to do 13 “Now joining” steps on every single join order.)

We still need to know why the optimizer tried so hard before supplying a plan – so let’s look for the “Best so far” lines, which the trace file reports each time the optimizer finds a better plan than the previous best. Here’s an example of what we’re looking for:

       Cost: 206984.61  Degree: 1  Resp: 206984.61  Card: 0.00 Bytes: 632
***********************
Best so far:  Table#: 0  cost: 56.9744  card: 1.0000  bytes: 30
              Table#: 3  cost: 59.9853  card: 0.0000  bytes: 83
              Table#: 6  cost: 60.9869  card: 0.0000  bytes: 151
              Table#:10  cost: 61.9909  card: 0.0000  bytes: 185
              Table#: 5  cost: 62.9928  card: 0.0000  bytes: 253
              Table#: 2  cost: 65.0004  card: 0.0000  bytes: 306
              Table#: 1  cost: 122.4741  card: 0.0000  bytes: 336
              Table#: 8  cost: 123.4760  card: 0.0000  bytes: 387
              Table#: 4  cost: 125.4836  card: 0.0000  bytes: 440
              Table#: 7  cost: 343.2625  card: 0.0000  bytes: 470
              Table#: 9  cost: 345.2659  card: 0.0000  bytes: 530
              Table#:11  cost: 206981.5979  card: 0.0000  bytes: 564
              Table#:12  cost: 206982.6017  card: 0.0000  bytes: 598
              Table#:13  cost: 206984.6055  card: 0.0000  bytes: 632
***********************

As you can see, we get a list of the tables (identified by their position in the first join order examined) with details of accumulated cost. But just above this tabular display there’s a repeat of the cost that we end up with. So let’s write, and apply, a little awk script to find all the “Best so far” lines and then print the line two above. Here’s a suitable script, followed by a call to use it:

{
        if (index($0,"Best so far") != 0) {print NR m2}
        m2 = m1; m1 = $0;
}

awk -f cost.awk orcl_ora_25306.trc >temp.txt

There was a bit of a mess in the output – there are a couple of special cases (relating, in our trace file, to the inline views and the appearance of a “group by placement”) that cause irregular patterns to appear, but the script was effective for the critical 14 table join. And looking through the list of costs for the various permutations we find that almost all the options show a cost of about 206,000 – except for the last few in two of the five “permutation sets” that suddenly drop to costs of around 1,500 and 1,300. The very high starting cost explains why the optimizer was prepared to spend so much time trying to find a good path and why it kept working so hard until the cost dropped very sharply.

Side bar: I have an old note from OCIS (the precursor or the precursor of the precursor of MOS) that the optimizer will stop searching when the number of join orders tests * the number of “non-single-row” tables (according to the single table access path) * 0.3 is greater than the best cost so far.  I even have a test script (run against 8.1.7.4, dated September 2002) that seems to demonstrate the formula.  The formula may be terribly out of date by now and the rules of exactly how and when it applies may have changed – the model didn’t seem to work when I ran it against 19.3 – but the principle probably still holds true.

At this point we might decide that we ought to look at the initial join order and at the join order where the cost dropped dramatically, and try to work out why Oracle picked such a bad starting join order, and what it was about the better join order that the optimizer had missed. This might allow us to recognise some error in the statistics for either the “bad” starting order or the “good” starting order and allow us to solve the problem by (e.g.) creating a column group or gather some specific statistics. We might simply decide that we’ll take a good join order and pass it to the optimizer through a /*+ leading() */ hint, or simply take the entire outline and attach it to the query through a faked SQL Profile (or embedded set of hints).

However, for the purposes of this exercise (and because sometimes you have to find a strategic solution rather than a “single statement” solution) I’m going to carry on working through mechanisms for dissecting the trace file without looking too closely at any of the fine detail.

The final “high-level” target I picked was to pin down why there were 5 sets of join orders. I had noticed something particular about the execution plan supplied – it showed several occurrences of the operation “VIEW PUSHED PREDICATE” so I wondered if this might be relevant. So I did a quick check near the start of the main body of the trace file for anything that might be a clue, and found the following just after the “QUERY BLOCK SIGNATURE”.

QUERY BLOCK SIGNATURE
---------------------
  signature(): NULL
***********************************
Cost-Based Join Predicate Push-down
***********************************
JPPD: Checking validity of push-down in query block SEL$6E5D879B (#4)
JPPD:   Checking validity of push-down from query block SEL$6E5D879B (#4) to query block SEL$C20BB4FE (#6)
Check Basic Validity for Non-Union View for query block SEL$C20BB4FE (#6)
JPPD:     JPPD bypassed: View has non-standard group by.
JPPD:   No valid views found to push predicate into.
JPPD: Checking validity of push-down in query block SEL$799AD133 (#3)
JPPD:   Checking validity of push-down from query block SEL$799AD133 (#3) to query block SEL$EFE55ECA (#7)
Check Basic Validity for Non-Union View for query block SEL$EFE55ECA (#7)
JPPD:     JPPD bypassed: View has non-standard group by.
JPPD:   No valid views found to push predicate into.
JPPD: Checking validity of push-down in query block SEL$C2AA4F6A (#2)
JPPD:   Checking validity of push-down from query block SEL$C2AA4F6A (#2) to query block SEL$799AD133 (#3)
Check Basic Validity for Non-Union View for query block SEL$799AD133 (#3)
JPPD:     Passed validity checks
JPPD:   Checking validity of push-down from query block SEL$C2AA4F6A (#2) to query block SEL$6E5D879B (#4)
Check Basic Validity for Non-Union View for query block SEL$6E5D879B (#4)
JPPD:     Passed validity checks
JPPD:   Checking validity of push-down from query block SEL$C2AA4F6A (#2) to query block SEL$FC56C448 (#5)
Check Basic Validity for Non-Union View for query block SEL$FC56C448 (#5)
JPPD:     Passed validity checks
JPPD: JPPD:   Pushdown from query block SEL$C2AA4F6A (#2) passed validity checks.
Join-Predicate push-down on query block SEL$C2AA4F6A (#2)
JPPD: Using search type: linear
JPPD: Considering join predicate push-down
JPPD: Starting iteration 1, state space = (3,4,5) : (0,0,0)

As you can see we are doing cost-based join-predicate pushdown, and there are three targets which are valid for the operation. Notice the line that says “using search type: linear”, and the suggestive “starting iteration 1” – let’s look for more lines with “Starting iteration”

[jonathan@linux01 big_trace]$ grep -n "Starting iteration" orcl_ora_25306.trc
9934:GBP: Starting iteration 1, state space = (20,21) : (0,0)
11529:GBP: Starting iteration 2, state space = (20,21) : (0,C)
11562:GBP: Starting iteration 3, state space = (20,21) : (F,0)
12479:GBP: Starting iteration 4, state space = (20,21) : (F,C)
12517:GBP: Starting iteration 1, state space = (18,19) : (0,0)
13860:GBP: Starting iteration 2, state space = (18,19) : (0,C)
13893:GBP: Starting iteration 3, state space = (18,19) : (F,0)
14587:GBP: Starting iteration 4, state space = (18,19) : (F,C)
14628:GBP: Starting iteration 1, state space = (16,17) : (0,0)
15972:GBP: Starting iteration 2, state space = (16,17) : (0,C)
16005:GBP: Starting iteration 3, state space = (16,17) : (F,0)
16700:GBP: Starting iteration 4, state space = (16,17) : (F,C)
16877:JPPD: Starting iteration 1, state space = (3,4,5) : (0,0,0)
622904:JPPD: Starting iteration 2, state space = (3,4,5) : (1,0,0)
1759914:JPPD: Starting iteration 3, state space = (3,4,5) : (1,1,0)
2750592:JPPD: Starting iteration 4, state space = (3,4,5) : (1,1,1)

There are four iterations for state space (3,4,5) – and look at the huge gaps between their “Starting iteration” lines. In fact, let’s go a little closer and combine their starting lines with the lines above where I searched for “Number of join permutations tried:”


16877:JPPD: Starting iteration 1, state space = (3,4,5) : (0,0,0)
622799:Number of join permutations tried: 374

622904:JPPD: Starting iteration 2, state space = (3,4,5) : (1,0,0)
1759817:Number of join permutations tried: 673

1759914:JPPD: Starting iteration 3, state space = (3,4,5) : (1,1,0)
2750475:Number of join permutations tried: 674

2750592:JPPD: Starting iteration 4, state space = (3,4,5) : (1,1,1)
3586276:Number of join permutations tried: 571

4458608:Number of join permutations tried: 581

At this point my notes end and I don’t know where I was going with the investigation. I know that I suggested to the OP that the cost-based join predicate pushdown was having a huge impact on the optimization time and suggested he experiment with disabling the feature. (Parse time dropped dramatically, but query run-time went through the roof – so that proved a point, but wasn’t a useful strategy). I don’t know, however, what the fifth long series of permutations was for, so if I could find the trace file one of the things I’d do next would be to look at the detail a few lines before line 4,458,608 to see what triggered that part of the re-optimization. I’d also want to know whether the final execution plan came from the fifth series and could be reached without involving all the join predicate pushdown work, or whether it was a plan that was only going to appear after the optimizer had worked through all 4 iterations.

The final plan did involve all 3 pushed predicates (which looksl like it might have been from iteration 4), so it might have been possible to find a generic strategy for forcing unconditional predicate pushing without doing all the expensive intermediate work.

Version 12c and beyond

That was then, and this is now. And something completely different might have appeared in 12c (or 19c) – but the one thing that is particularly helpful is that you can bet that every iteration of the JPPD state spaces would have produced a “TIMER:” line in the trace file, making it very easy to run grep -n “TIMER:” (or -nT as I recently discovered) against the trace file to pinpoint the issue very quickly.

Here’s an example from my “killer_parse.sql” query after setting “_fix_control”=’16923858:4′ (1e4 microseconds = 1/100th second) in an instance of 19c:


$ grep -nT TIMER or19_ora_21051.trc

16426  :TIMER:      bitmap access paths cpu: 0.104006 sec elapsed: 0.105076 sec
252758 :TIMER:     costing general plans cpu: 0.040666 sec elapsed: 0.040471 sec
309460 :TIMER:      bitmap access paths cpu: 0.079509 sec elapsed: 0.079074 sec
312584 :TIMER: CBQT OR expansion SEL$765CDFAA cpu: 10.474142 sec elapsed: 10.508788 sec
313974 :TIMER: Complex View Merging SEL$765CDFAA cpu: 1.475173 sec elapsed: 1.475418 sec
315716 :TIMER: Table Expansion SEL$765CDFAA cpu: 0.046262 sec elapsed: 0.046647 sec
316036 :TIMER: Star Transformation SEL$765CDFAA cpu: 0.029077 sec elapsed: 0.026912 sec
318207 :TIMER: Type Checking after CBQT SEL$765CDFAA cpu: 0.220506 sec elapsed: 0.219273 sec
318208 :TIMER: Cost-Based Transformations (Overall) SEL$765CDFAA cpu: 13.632516 sec elapsed: 13.666360 sec
328948 :TIMER:      bitmap access paths cpu: 0.093973 sec elapsed: 0.095008 sec
632935 :TIMER: Access Path Analysis (Final) SEL$765CDFAA cpu: 7.703016 sec elapsed: 7.755957 sec
633092 :TIMER: SQL Optimization (Overall) SEL$765CDFAA cpu: 21.539010 sec elapsed: 21.632012 sec

The closing 21.63 seconds (line 633092) is largely 7.7559 seconds (632,935) plus 13.666 seconds (line 318208) Cost-Based Transformation time, and that 13.666 seconds is mostly the 1.475 seconds (line 313,974) plus 10.508 seconds (line 312,584) for CBQT OR expansion – so let’s try disabling OR expansion (alter session set “_no_or_expansion”=true;) and try again:


$ grep -nT TIMER or19_ora_22205.trc
14884  :TIMER:      bitmap access paths cpu: 0.062453 sec elapsed: 0.064501 sec
15228  :TIMER: Access Path Analysis (Final) SEL$1 cpu: 0.256751 sec elapsed: 0.262467 sec
15234  :TIMER: SQL Optimization (Overall) SEL$1 cpu: 0.264099 sec elapsed: 0.268183 sec

Not only was optimisation faster, the runtime was quicker too.

Warning – it’s not always that easy.

 

Library Cache Stats

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 9:36 am GMT Nov 17,2019

In resonse to a comment that one of my notes references a call to a packate “snap_libcache”, I’ve posted this version of SQL that can be run by SYS to create the package, with a public synonym, and privileges granted to public to execute it. The package doesn’t report the DLM (RAC) related activity, and is suitable only for 11g onwards (older versions require a massive decode of an index value to convert indx numbers into names).

rem
rem Script: snap_11_libcache.sql
rem Author: Jonathan Lewis
rem Dated: March 2001 (updated for 11g)
rem Purpose: Package to get snapshot start and delta of library cache stats
rem
rem Notes
rem Lots of changes needed by 11.2.x.x where x$kglst holds
rem two types – TYPE (107) and NAMESPACE (84) – but no
rem longer needs a complex decode.
rem
rem Has to be run by SYS to create the package
rem
rem Usage:
rem set serveroutput on size 1000000 format wrapped
rem set linesize 144
rem set trimspool on
rem execute snap_libcache.start_snap
rem — do something
rem execute snap_libcache.end_snap
rem

create or replace package snap_libcache as
procedure start_snap;
procedure end_snap;
end;
/

create or replace package body snap_libcache as

cursor c1 is
select
indx,
kglsttyp lib_type,
kglstdsc name,
kglstget gets,
kglstght get_hits,
kglstpin pins,
kglstpht pin_hits,
kglstrld reloads,
kglstinv invalidations,
kglstlrq dlm_lock_requests,
kglstprq dlm_pin_requests,
— kglstprl dlm_pin_releases,
— kglstirq dlm_invalidation_requests,
kglstmiv dlm_invalidations
from x$kglst
;

type w_type1 is table of c1%rowtype index by binary_integer;
w_list1 w_type1;
w_empty_list w_type1;

m_start_time date;
m_start_flag char(1);
m_end_time date;

procedure start_snap is
begin

m_start_time := sysdate;
m_start_flag := ‘U’;
w_list1 := w_empty_list;

for r in c1 loop
w_list1(r.indx).gets := r.gets;
w_list1(r.indx).get_hits := r.get_hits;
w_list1(r.indx).pins := r.pins;
w_list1(r.indx).pin_hits := r.pin_hits;
w_list1(r.indx).reloads := r.reloads;
w_list1(r.indx).invalidations := r.invalidations;
end loop;

end start_snap;

procedure end_snap is
begin

m_end_time := sysdate;

dbms_output.put_line(‘———————————‘);
dbms_output.put_line(‘Library Cache – ‘ ||
to_char(m_end_time,’dd-Mon hh24:mi:ss’)
);

if m_start_flag = ‘U’ then
dbms_output.put_line(‘Interval:- ‘ ||
trunc(86400 * (m_end_time – m_start_time)) ||
‘ seconds’
);
else
dbms_output.put_line(‘Since Startup:- ‘ ||
to_char(m_start_time,’dd-Mon hh24:mi:ss’)
);
end if;

dbms_output.put_line(‘———————————‘);

dbms_output.put_line(
rpad(‘Type’,10) ||
rpad(‘Description’,41) ||
lpad(‘Gets’,12) ||
lpad(‘Hits’,12) ||
lpad(‘Ratio’,6) ||
lpad(‘Pins’,12) ||
lpad(‘Hits’,12) ||
lpad(‘Ratio’,6) ||
lpad(‘Invalidations’,14) ||
lpad(‘Reloads’,10)
);

dbms_output.put_line(
rpad(‘—–‘,10) ||
rpad(‘—–‘,41) ||
lpad(‘—-‘,12) ||
lpad(‘—-‘,12) ||
lpad(‘—–‘,6) ||
lpad(‘—-‘,12) ||
lpad(‘—-‘,12) ||
lpad(‘—–‘,6) ||
lpad(‘————-‘,14) ||
lpad(‘——‘,10)
);

for r in c1 loop
if (not w_list1.exists(r.indx)) then
w_list1(r.indx).gets := 0;
w_list1(r.indx).get_hits := 0;
w_list1(r.indx).pins := 0;
w_list1(r.indx).pin_hits := 0;
w_list1(r.indx).invalidations := 0;
w_list1(r.indx).reloads := 0;
end if;

if (
(w_list1(r.indx).gets != r.gets)
or (w_list1(r.indx).get_hits != r.get_hits)
or (w_list1(r.indx).pins != r.pins)
or (w_list1(r.indx).pin_hits != r.pin_hits)
or (w_list1(r.indx).invalidations != r.invalidations)
or (w_list1(r.indx).reloads != r.reloads)
) then

dbms_output.put(rpad(substr(r.lib_type,1,10),10));
dbms_output.put(rpad(substr(r.name,1,41),41));
dbms_output.put(to_char(
r.gets – w_list1(r.indx).gets,
‘999,999,990’)
);
dbms_output.put(to_char(
r.get_hits – w_list1(r.indx).get_hits,
‘999,999,990’));
dbms_output.put(to_char(
(r.get_hits – w_list1(r.indx).get_hits)/
greatest(
r.gets – w_list1(r.indx).gets,
1
),
‘999.0’));
dbms_output.put(to_char(
r.pins – w_list1(r.indx).pins,
‘999,999,990’)
);
dbms_output.put(to_char(
r.pin_hits – w_list1(r.indx).pin_hits,
‘999,999,990’));
dbms_output.put(to_char(
(r.pin_hits – w_list1(r.indx).pin_hits)/
greatest(
r.pins – w_list1(r.indx).pins,
1
),
‘999.0’));
dbms_output.put(to_char(
r.invalidations – w_list1(r.indx).invalidations,
‘9,999,999,990’)
);
dbms_output.put(to_char(
r.reloads – w_list1(r.indx).reloads,
‘9,999,990’)
);
dbms_output.new_line;
end if;

end loop;

end end_snap;

begin
select
startup_time, ‘S’
into
m_start_time, m_start_flag
from
v$instance;

end snap_libcache;
/

drop public synonym snap_libcache;
create public synonym snap_libcache for snap_libcache;
grant execute on snap_libcache to public;

You’ll note that there are two classes of data, “namespace” and “type”. The dynamic view v$librarycache reports only the namespace rows.

October 31, 2019

IOT Hash

Filed under: Execution plans,Hash Join,Infrastructure,IOT,Joins,Oracle,Troubleshooting — Jonathan Lewis @ 2:59 pm GMT Oct 31,2019

It’s another of my double-entendre titles. The optimizer can turn a hash join involving an index-organized table into a real performance disaster (though you may have to help it along the way by using a silly definition for your primary key columns). This post was inspired by a question posted on the Oracle Developer Community forum recently so the table and column names I’ve used in my model reflect (almost, with a few corrections) the names used in the post.

We start with a simple requirement expressed through the following SQL:


rem
rem     Script:         iot_hash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

insert
        /*+
                qb_name(insert)
        */
into t_iot(
        id, inst_id, nr_time,
        o_time, st, null_col, svname
)
select
        /*+
                qb_name(main)
                unnest(@subq)
                leading(@sel$a93afaed apar@main ob@subq)
                use_hash(@sel$a93afaed ob@subq)
                swap_join_inputs(@sel$a93afaed ob@subq)
                index_ss_asc(@sel$a93afaed ob@subq (t_iot.st t_iot.inst_id t_iot.svname))a
        */
        apar.id,
        'UP',
        to_date('2019-10-24','yyyy-mm-dd'),
        to_date('1969-12-31','yyyy-mm-dd'),
        'IDLE',
        null,
        'tkt007.jj.bb.com'
from
        t_base apar
where
        apar.id not in (
                select
                        /*+
                                qb_name(subq)
                        */
                        id
                from
                        t_iot ob
                where
                        inst_id = 'UP'
        )
and     nvl(apar.gp_nm,'UA') = 'UA'
and     rownum <= 5000
/

The requirement is simple – insert into table t_iot a set of values dictated by a subset of the rows in table t_base if they do not already exist in t_iot. To model the issue that appeared I’ve had to hint the SQL to get the following plan (which I pulled from memory after enabling rowsource execution stats):


---------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |             |      1 |        |   296 (100)|      0 |00:00:00.03 |     788 |    148 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL | T_IOT       |      1 |        |            |      0 |00:00:00.03 |     788 |    148 |       |       |          |
|*  2 |   COUNT STOPKEY          |             |      1 |        |            |    100 |00:00:00.03 |      99 |     91 |       |       |          |
|*  3 |    HASH JOIN RIGHT ANTI  |             |      1 |    100 |   296   (2)|    100 |00:00:00.03 |      99 |     91 |    14M|  1843K|   15M (0)|
|*  4 |     INDEX SKIP SCAN      | T_IOT_STATE |      1 |  12614 |   102   (0)|  10000 |00:00:00.01 |      92 |     91 |       |       |          |
|*  5 |     TABLE ACCESS FULL    | T_BASE      |      1 |    100 |     2   (0)|    100 |00:00:00.01 |       7 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM<=5000)
   3 - access("APAR"."ID"="ID")
   4 - access("INST_ID"='UP')
       filter("INST_ID"='UP')
   5 - filter(NVL("APAR"."GP_NM",'UA')='UA')

The optimizer has unnested (as hinted) the subquery and converted it to an anti-join using a right hash anti-join. Take a look at the Used-mem for the hash join – would it surprise you to learn that the total size of the (not compressed in any way) IOT, and all its indexes, and the t_base table together total less than 4 MB. Something dramatically awful has happened in the hash join to generated a requirement of 14MB. (In the case of the OP this appeared as an unexpected 5GB written to the temporary tablespace.)

Before I address the source of the high memory usage, take a close look at the Predicate Information, particularly operation 3, and ask yourself what the definition of index t_iot_state might be. The predicate joins t_base.id to t_iot.id, and here’s the code to create both tables and all the indexes.

create table t_iot (
        nr_time         timestamp,
        id              varchar2(1024),
        inst_id         varchar2(200),
        o_time          timestamp,
        st              varchar2(200),
        null_col        varchar2(100),
        svname          varchar2(200),
        constraint t_iot_pk primary key(nr_time, id, inst_id)
)
organization index
/

insert into t_iot
select
        sysdate,
        dbms_random.string('l',10),
        'UP',
        sysdate,
        'IDLE',
        null,
        rpad('x',25,'x')
from
        all_objects
where
        rownum <= 1e4 -- > hint to avoid wordpress format issue
/

create index t_iot_state on t_iot(st, inst_id, svname); 
create index idx2        on t_iot(id, inst_id, svname);

create table t_base(
        id              varchar2(400) not null,
        gp_nm           varchar2(200)
)
/

insert into t_base
select
        dbms_random.string('l',10),
        'UA'
from
        all_objects
where
        rownum <= 100 -- > hint to avoid wordpress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_iot',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_base',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );
end;
/


The index t_iot_state that Oracle has used in the hash join is defined on the columns (st, inst_id, svname) – so the predicate is doing a comparison with a column that’s not in the index! At least, it’s not visibly declared in the index; but this is a secondary index on an IOT, and IOTs don’t have “normal” rowids, the rowid in a secondary index is the value of the primary key (plus a “block guess”). So the columns in the index (even though not declared in the index) are: (st, inst_id, svname, {nr_time, id, inst_id, blockguess}). So this index does supply the required id column.

Side note: you’ll see in the list of columns above that inst_id appears twice. In fact (since Oracle 9, I think) the code to handle secondary indexes has been smart enough to avoid this duplication. If the secondary index contains columns from the primary key then the “rowid” doesn’t store those columns, the code knows how to construct the primaryh key value from the stored PK columns combined with the needed columns from the index entry. This can make IOTs a very nice choice of implementation for “intersection” tables that are used to represent many-to-many joins between two other tables.

Unfortunately this “rowid” is the explanation for the massive memory demand. Take a look at the “Column Projection Information” for my execution plan:


Column Projection Information (identified by operation id):
-----------------------------------------------------------
   2 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200], ROWNUM[8]
   3 - (#keys=1) "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]
   4 - "OB".ROWID[ROWID,1249], "NR_TIME"[TIMESTAMP,11], "ID"[VARCHAR2,1024], "INST_ID"[VARCHAR2,200], "OB".ROWID[ROWID,1249]
   5 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]

The interesting line is operation 4. A hash join takes the rowsource from its first child (the build table) and creates an in-memory hash table (which may spill to disc, of course), so if I see an unreasonable memory allocation (or unexpected spill to disc) a good starting point is to look at what the first child is supplying. In this case the first child seems to be saying that it’s supplying (or allowing for) nearly 3,700 bytes to be passed up to the hash join.

On closer inspection we can see it’s reporting the “rowid” twice, and also reporting the three columns that make up that rowid. I think it’s reasonable to assume that it’s only supplying the rowid once, and maybe it’s not even supplying the other three columns because they are embedded in the rowid. Doing a quick arithmetic check, let’s multiply the size of the rowid by the value of A-rows: 1,249 * 10,000 = 12,490,000. That’s pretty close to the 14MB reported by the hash join in operation 3.

Hypothesis – to get at the id column, Oracle has used this index (actually a very bad choice of those available) to extract the rowid and then passed the rowid up to the parent in a (length padded) fixed format. Oracle has then created a hash table by extracting the id column from the rowid and building the hash table on it but also carrying the length-padded rowid into the hash table.  Possible variants on this theme are that some or all of the other columns in the Column Projection Information are also passed upwards so that the id doesn’t have to be extracted, but if they are they are not padded to their maximum length.

A simple test that this is broadly the right assumption is to re-run the model making the declared length of the rowid much larger to see what happens to the memory allocation. Changing the inst_id declaration from 200 bytes to 1000 bytes (note the stored value is only the 2 bytes needed for the value ‘UP’) the Used-mem jumps to 23 MB (which is an increment very similar to 800 * 10,000).  You’ll note that I chose to experiment with a column that wasn’t the column used in the join. It was a column in the secondary index definition, though, so another test would be to change the nr_time column from a timestamp (11 bytes) to a large varchar2, so I re-ran the test declaring the nr_time as a varchar2(1000) – reverting the inst_id to varchar2(200) – and the Used-mem increased to 25MB.

Preliminary Conclusion

If Oracle uses the contents of the rowid of a secondary index on an IOT in a join then it constructs a fixed format version for the rowid by padding every column in the primary key to its maximum length and concatenating the results. This can have catastrophic side effects on performance if you’ve declared some very long columns “just in case”. Any time you use index organized tables you should remember to check the Column Projection Information in any execution plans that use secondary indexes in case they are passing a large, padded, primary key through the plan to a point where a blocking operation (such as a hash join or merge join) has to accumulate a large number of rows.

Footnote

In my test case I had to hint the query heavily to force Oracle into the path I wanted to demonstrate.

It’s surprising that the optimizer should have chosen this path in the OP’s system, given that there’s another secondary index that contains the necessary columns in its definition. (So one thought is that there’s a statistics problem to address, or possibly the “good” index is subject to updates that make it become very inefficient (large) very quickly.)

Another oddity of the OP’s system was that Oracle should have chosen to do a right hash anti-join when it looked as if joining the tables in the opposite order would produce a much smaller memory demand and lower cost – there was an explict swap_join_inputs() hint in the Outline Information (so copying the outline into the query and changing that to no_swap_join_inputs() might have been abother viable workaround.) In the end the OP hinted the query to use a nested loop (anti-)join from t_base to t_iot – which is another way to avoid the hash table threat with padded rowids.

 

October 9, 2019

My SID

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 12:03 pm BST Oct 9,2019

Here’s a little note that’s been hanging around as a draft for more than eight years according to the OTN (as it was) posting that prompted me to start writing it. At the time there were still plenty of people using Oracle 10g. so the question didn’t seem entirely inappropriate:

On 10g R2 when I open a sqlplus session how can I know my session SID ? I’m not DBA then can not open as sysdba and query v$session.

In all fairly recent versions of Oracle, of course, we have the option to use the sys_context() function to get the SID, but this specific option didn’t appear until some time in the 10g timeline – so you might have spent years “knowing” that you could get the audsid though sys_context(‘userenv’,’sessionid’) there was no equivalent way to get the sid. Now, of course, and even in the timeline of the original posting, the simplest solution to the requirement is to execute:


select sys_context('userenv','sid') from dual;

But there are a number of alternatives – which may occasionally do a better job (and sometimes are just plain silly). It’s also worth noting that even in 19c Oracle still doesn’t have access to v$session.serial# through sys_context() and, anyway, sys_context() behaves like an unpeekable bind variable – which can be a problem.

So here’s the first of several options:

select sid from V$mystat where rownum = 1;

You’ll need SYS to grant you select on v_$mystat to use this one, of course, but v$mystat is a very convenient view giving you the session activity stats since logon for your own session – so there ought to be some mechanism that allows you to see some form of it in place anyway (ideally including the join to v$statname).

One of the oldest ways of getting access to your session ID without having access to any of the dynamic performance views was through the dbms_support package:

variable v1 varchar2(32)
execute :v1 := dbms_support.mysid
execute dbms_output.put_line(:v1)

Again you’ll need SYS to grant you extra privileges, in this case execute on the dbms_support package – worse still, the package is not installed by default. In fact (after installing it) if you call dbms_support.package_version it returns the value: “DBMS_SUPPORT Version 1.0 (17-Aug-1998) – Requires Oracle 7.2 – 8.0.5” – which gives you some idea of how old it is. It used to be useful for the start_trace_in_session() procedure it contains but that procedure has been superseded by many newer mechanisms. If you enable SQL tracing to see what’s happening under the covers when you call dbms_support.mysid you’ll see that the function actually runs the query I showed above against v$mystat .

Unlike dbms_support the dbms_session package is installed automatically with the privilege to execute granted to public,  and this gives you a function to generate a “unique session id”, . The notes in the scripts $ORACLE_HOME/rdbms/admin/dbmssess.sql that create the package say that the return value can be up to 24 bytes long, but so far the maximum I’ve seen is 12.


select dbms_session.unique_session_id from dual;
UNIQUE_SESSION_ID
--------------------------
00FF5F980001


select
        to_number(substr(dbms_session.unique_session_id,1,4),'XXXX') sid,
        to_number(substr(dbms_session.unique_session_id,5,4),'XXXX') serial#,
        to_number(substr(dbms_session.unique_session_id,9,4),'XXXX') instance
from
        dual
;

       SID    SERIAL# INSTANCE
---------- ---------- --------
       255      24472        1

As you can see, the session_unique_id can be decoded to produce three useful bits of information, and the nice thing about this call is that it gives you session serial# at the same time as the SID. It’s possible, of course, that this query is as efficient as it could be, but there’s some scope for writing a query that uses a non-mergeable in-line view to call the function once, then splits the result into three pieces.

While we’re on the session_unique_id, the dbms_pipe package also has a “unique identifier” function unique_session_name():

SQL> select dbms_pipe.unique_session_name from dual;

UNIQUE_SESSION_NAME
------------------------
ORA$PIPE$00FF5F980001

It doesn’t take a lot of effort to spot that the “unique session name” is the “unique session id” of dbms_session prefixed with the text “ORA$PIPE$”. It’s convenient for the dbms_pipe package to be able to generate a unique name so that one session can create a safely named pipe and tell another session about it. Anyone using pipes should take advantage of this function for its original purpose. Unlike dbms_session you’ll need to be granted the privilege to execute this package, it’s not available to public. Interestingly the script that creates dbms_pipe says that this function could return 30 bytes – since it appears to be 9 bytes prepended to the (“could be 24 bytes”) dbms_session.unique_session_id you have to wonder whether there’s something more subtle that could happen.

There may be many more mechanisms available as built-ins, but the last one I know of is in the dbms_debug_jdwp package (another package with execute privilege already granted to public and the ability to supply both the sid and serial#):

SQL> select
  2          dbms_debug_jdwp.current_session_id     sid,
  3          dbms_debug_jdwp.current_session_serial serial#
  4  from dual
  5  /

       SID    SERIAL#
---------- ----------
       255      24472

There is a reason why I’ve decided to resurrect this list of ways of getting at a session’s SID, but that’s the topic of another blog note.

 

 

October 3, 2019

Trace Files

Filed under: 12c,Infrastructure,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 1:38 pm BST Oct 3,2019

A recent blog note by Martin Berger about reading trace files in 12.2 poped up in my twitter timeline yesterday and reminded me of a script I wrote a while ago to create a simple view I could query to read the tracefile generated by the current session while the session was still connected. You either have to create the view and a public synonym through the SYS schema, or you have to use the SYS schema to grant select privileges on several dynamic performance views to the user to allow the user to create the view in the user’s schema. For my scratch database I tend to create the view in the SYS schema.

Script to be run by SYS:

rem
rem     Script: read_trace_122.sql
rem     Author: Jonathan Lewis
rem     Dated:  Sept 2018
rem
rem     Last tested
rem             12.2.0.1

create or replace view my_trace_file as
select 
        *
from 
        v$diag_trace_file_contents
where
        (adr_home, trace_filename) = (
                select
                --      substr(tracefile, 1, instr(tracefile,'/',-1)-1),
                        substr(
                                substr(tracefile, 1, instr(tracefile,'/',-1)-1),
                                1,
                                instr(
                                        substr(tracefile, 1, instr(tracefile,'/',-1)),
                                        'trace'
                                ) - 2
                        ),
                        substr(tracefile, instr(tracefile,'/',-1)+1) trace_filename
                from 
                        v$process
                where   addr = (
                                select  paddr
                                from    v$session
                                where   sid = (
                                        sys_context('userenv','sid')
                                        -- select sid from v$mystat where rownum = 1
                                        -- select dbms_support.mysid from dual
                                )
                        )
        )
;


create public synonym my_trace_file for sys.my_trace_file;
grant select on my_trace_file to {some role};

Alternatively, the privileges you could grant to a user from SYS so that they could create their own view:


grant select on v_$process to some_user;
grant select on v_$session to some_user;
grant select on v_$diag_trace_file_contents to some_user;
and optionally one of:
        grant select on v_$mystat to some_user;
        grant execute on dbms_support to some_user;
                but dbms_support is no longer installed by default.

The references to package dbms_support and view v$mystat are historic ones I have lurking in various scripts from the days when the session id (SID) wasn’t available in any simpler way.

Once the view exists and is available, you can enable some sort of tracing from your session then query the view to read back the trace file. For example, here’s a simple “self-reporting” (it’s going to report the trace file that it causes) script that I’ve run from 12.2.0.1 as a demo:


alter system flush shared_pool;
alter session set sql_trace true;

set linesize 180
set trimspool on
set pagesize 60

column line_number      format  999,999
column piece            format  a150    
column plan             noprint
column cursor#          noprint

break on plan skip 1 on cursor# skip 1

select
        line_number,
        line_number - row_number() over (order by line_number) plan,
        substr(payload,1,instr(payload,' id=')) cursor#,
        substr(payload, 1,150) piece
from
        my_trace_file
where
        file_name = 'xpl.c'
order by
        line_number
/

alter session set sql_trace false;

The script flushes the shared pool to make sure that it’s going to trigger some recursive SQL then enables a simple SQL trace. The query then picks out all the lines in the trace file generated by code in the Oracle source file xpl.c (execution plans seems like a likely guess) which happens to pick out all the STAT lines in the trace (i.e. the ones showing the execution plans).

I’ve used the “tabibitosan” method to identify all the lines that belong to a single execution plan by assuming that they will be consecutive lines in the output starting from a line which includes the text ” id=1 “ (the surrounding spaces are important), but I’ve also extracted the bit of the line which includes the cursor number (STAT #nnnnnnnnnnnnnnn) because two plans may be dumped one after the other if multiple cursors close at the same time. There is still a little flaw in the script because sometimes Oracle will run a sys-recursive statement in the middle of dumping a plan to turn an object_id into an object_name, and this will cause a break in the output.

The result of the query is to extract all the execution plans in the trace file and print them in the order they appear – here’s a sample of the output:


LINE_NUMBER PIECE
----------- ------------------------------------------------------------------------------------------------------------------------------------------------------
         38 STAT #140392790549064 id=1 cnt=0 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID BATCHED OBJ$ (cr=3 pr=0 pw=0 str=1 time=53 us cost=4 size=113 card
         39 STAT #140392790549064 id=2 cnt=0 pid=1 pos=1 obj=37 op='INDEX RANGE SCAN I_OBJ2 (cr=3 pr=0 pw=0 str=1 time=47 us cost=3 size=0 card=1)'


         53 STAT #140392790535800 id=1 cnt=1 pid=0 pos=1 obj=0 op='MERGE JOIN OUTER (cr=5 pr=0 pw=0 str=1 time=95 us cost=2 size=178 card=1)'
         54 STAT #140392790535800 id=2 cnt=1 pid=1 pos=1 obj=4 op='TABLE ACCESS CLUSTER TAB$ (cr=3 pr=0 pw=0 str=1 time=57 us cost=2 size=138 card=1)'
         55 STAT #140392790535800 id=3 cnt=1 pid=2 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'
         56 STAT #140392790535800 id=4 cnt=0 pid=1 pos=2 obj=0 op='BUFFER SORT (cr=2 pr=0 pw=0 str=1 time=29 us cost=0 size=40 card=1)'
         57 STAT #140392790535800 id=5 cnt=0 pid=4 pos=1 obj=73 op='TABLE ACCESS BY INDEX ROWID TAB_STATS$ (cr=2 pr=0 pw=0 str=1 time=10 us cost=0 size=40 card=1)
         58 STAT #140392790535800 id=6 cnt=0 pid=5 pos=1 obj=74 op='INDEX UNIQUE SCAN I_TAB_STATS$_OBJ# (cr=2 pr=0 pw=0 str=1 time=8 us cost=0 size=0 card=1)'


         84 STAT #140392791412824 id=1 cnt=1 pid=0 pos=1 obj=20 op='TABLE ACCESS BY INDEX ROWID BATCHED ICOL$ (cr=4 pr=0 pw=0 str=1 time=25 us cost=2 size=54 card
         85 STAT #140392791412824 id=2 cnt=1 pid=1 pos=1 obj=42 op='INDEX RANGE SCAN I_ICOL1 (cr=3 pr=0 pw=0 str=1 time=23 us cost=1 size=0 card=2)'


         94 STAT #140392790504512 id=1 cnt=2 pid=0 pos=1 obj=0 op='SORT ORDER BY (cr=7 pr=0 pw=0 str=1 time=432 us cost=6 size=374 card=2)'
         95 STAT #140392790504512 id=2 cnt=2 pid=1 pos=1 obj=0 op='HASH JOIN OUTER (cr=7 pr=0 pw=0 str=1 time=375 us cost=5 size=374 card=2)'
         96 STAT #140392790504512 id=3 cnt=2 pid=2 pos=1 obj=0 op='NESTED LOOPS OUTER (cr=4 pr=0 pw=0 str=1 time=115 us cost=2 size=288 card=2)'
         97 STAT #140392790504512 id=4 cnt=2 pid=3 pos=1 obj=19 op='TABLE ACCESS CLUSTER IND$ (cr=3 pr=0 pw=0 str=1 time=100 us cost=2 size=184 card=2)'
         98 STAT #140392790504512 id=5 cnt=1 pid=4 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=85 us cost=1 size=0 card=1)'
         99 STAT #140392790504512 id=6 cnt=0 pid=3 pos=2 obj=75 op='TABLE ACCESS BY INDEX ROWID IND_STATS$ (cr=1 pr=0 pw=0 str=2 time=8 us cost=0 size=52 card=1)'
        100 STAT #140392790504512 id=7 cnt=0 pid=6 pos=1 obj=76 op='INDEX UNIQUE SCAN I_IND_STATS$_OBJ# (cr=1 pr=0 pw=0 str=2 time=7 us cost=0 size=0 card=1)'
        101 STAT #140392790504512 id=8 cnt=0 pid=2 pos=2 obj=0 op='VIEW  (cr=3 pr=0 pw=0 str=1 time=47 us cost=3 size=43 card=1)'
        102 STAT #140392790504512 id=9 cnt=0 pid=8 pos=1 obj=0 op='SORT GROUP BY (cr=3 pr=0 pw=0 str=1 time=44 us cost=3 size=15 card=1)'
        103 STAT #140392790504512 id=10 cnt=0 pid=9 pos=1 obj=31 op='TABLE ACCESS CLUSTER CDEF$ (cr=3 pr=0 pw=0 str=1 time=21 us cost=2 size=15 card=1)'
        104 STAT #140392790504512 id=11 cnt=1 pid=10 pos=1 obj=30 op='INDEX UNIQUE SCAN I_COBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'


        116 STAT #140392791480168 id=1 cnt=4 pid=0 pos=1 obj=0 op='SORT ORDER BY (cr=3 pr=0 pw=0 str=1 time=62 us cost=3 size=858 card=13)'
        117 STAT #140392791480168 id=2 cnt=4 pid=1 pos=1 obj=21 op='TABLE ACCESS CLUSTER COL$ (cr=3 pr=0 pw=0 str=1 time=24 us cost=2 size=858 card=13)'
        118 STAT #140392791480168 id=3 cnt=1 pid=2 pos=1 obj=3 op='INDEX UNIQUE SCAN I_OBJ# (cr=2 pr=0 pw=0 str=1 time=11 us cost=1 size=0 card=1)'


        126 STAT #140392789565328 id=1 cnt=1 pid=0 pos=1 obj=14 op='TABLE ACCESS CLUSTER SEG$ (cr=3 pr=0 pw=0 str=1 time=21 us cost=2 size=68 card=1)'
        127 STAT #140392789565328 id=2 cnt=1 pid=1 pos=1 obj=9 op='INDEX UNIQUE SCAN I_FILE#_BLOCK# (cr=2 pr=0 pw=0 str=1 time=12 us cost=1 size=0 card=1)'


        135 STAT #140392789722208 id=1 cnt=1 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID BATCHED OBJ$ (cr=3 pr=0 pw=0 str=1 time=22 us cost=3 size=51 card=
        136 STAT #140392789722208 id=2 cnt=1 pid=1 pos=1 obj=36 op='INDEX RANGE SCAN I_OBJ1 (cr=2 pr=0 pw=0 str=1 time=16 us cost=2 size=0 card=1)'


        153 STAT #140392792055264 id=1 cnt=1 pid=0 pos=1 obj=68 op='TABLE ACCESS BY INDEX ROWID HIST_HEAD$ (cr=3 pr=0 pw=0 str=1 time=25 us)'
        154 STAT #140392792055264 id=2 cnt=1 pid=1 pos=1 obj=70 op='INDEX RANGE SCAN I_HH_OBJ#_INTCOL# (cr=2 pr=0 pw=0 str=1 time=19 us)'

If you want to investigate further, the “interesting” columns in the underlying view are probably: section_name, component_name, operation_name, file_name, and function_name. The possible names of functions, files, etc. vary with the trace event you’ve enabled.

 

September 28, 2019

Negative Offload

Filed under: Exadata,Execution plans,HCC,Oracle,Troubleshooting — Jonathan Lewis @ 5:38 pm BST Sep 28,2019

At the Trivadis Performance Days 2019 I did a presentation on using execution plans to understand what a query was doing. One of the examples I showed was a plan from an Exadata system (using 11.2.0.4) that needed to go faster. The plan was from the SQL Monitor report and all I want to show you is one line that’s reporting a tablescan. To fit the screen comfortably I’ve removed a number of columns from the output.

The report had been generated while the statement was still running (hence the “->” at the left hand edge) and the query had scanned 166 segments (with no partition elimination) of a table with 4,500 data segments (450 range partitions and 10 hash sub-partitions – note the design error, by the way, hash partitioning in Oracle should always hash for a powert of 2).


SQL Plan Monitoring Details (Plan Hash Value=3764612084)  
============================================================================================================================================
| Id   |           Operation            | Name  | Read  | Read  | Write | Write |   Cell   | Mem  | Activity |       Activity Detail       |  
|      |                                |       | Reqs  | Bytes | Reqs  | Bytes | Offload  |      |   (%)    |         (# samples)         |   
============================================================================================================================================
| -> 5 |      TABLE ACCESS STORAGE FULL | TXN   |  972K | 235GB |       |       | -203.03% |   7M |    63.43 | Cpu (1303)                  | 
|      |                                |       |       |       |       |       |          |      |          | cell smart table scan (175) | 
============================================================================================================================================

In the presentation I pointed out that for a “cell smart table scan” (note the Activity Detail colum) this line was using a surprisingly large amount of CPU.

We had been told that the table was using hybrid columnar compression (HCC) and had been given some figures that showed the compression factor was slightly better than 4. I had also pointed out that the typical size of a read request was 256KB. (Compare Read Reqs with Read Bytes)

To explain the excessive CPU I claimed that we were seeing “double decompression” – the cell was decompressing (uncompressing) compression units (CUs), finding that the resulting decompressed data was larger than the 1MB unit that Exadata allows and sending the original compressed CU to the database server where it was decompressed again – and the server side decompression was burning up the CPU.

This claim is (almost certainly) true – but the justification I gave for the claim was at best incomplete (though, to be brutally honest, I have to admit that I’d made a mistake): I pointed out that the Cell Offload was negative 200% and that this was what told us about the double decompression. While double decompression was probably happening the implication I had made was that a negative offload automatically indicated double decompression – and that’s was an incorrect assumption on my part. Fortunately Maurice Müller caught up with me after the session was over and pointed out the error then emailed me a link to a relevant article by Ahmed Aangour.

The Cell Offload is a measure of the difference between the volume of data read and the volume of data returned to the server. If the cell reads 256KB from disc, but the column and row selection means the cell returns 128KB the Cell Offload would be 50%; if the cell returns 64KB the Cell Offload would be 75% (100 * (1 – 64KB/256KB)). But what if you select all the rows and columns from a compressed table – the volume of data after decompression would be larger than the compressed volume the cell had read from disc – and in this case we knew that we were reading 256KB at a time and the compression factor was slightly greater than 4, so the uncompressed data would probably be around 1MB, giving us a Cell Offload of 100 * (1 – 1024KB / 256KB) = negative 300%

Key Point: Any time that decompression, combined with the row and column selection, produces more data than the volume of data read from disc the Cell Offload will go negative. A negative Cell Offload is not inherently a problem (though it might hint at a suboptimal use of compression).

Follow-up Analysis

Despite the error in my initial understanding the claim that we were seeing double decompression was still (almost certainly) true – but we need to be a little more sophisticated in the analysis. The clue is in the arithmetic a few lines further up the page. We can see that we are basically reading 256KB chunks of the table, and we know that 256KB will expand to roughly 1MB so we ought to see a Cell Offload of about -300%; but the Cell Offload is -200%. This suggests fairly strongly that on some of the reads the decompressed data is slightly less than 1MB, which allows the cell to return the decompressed data to the database server, while some of the time the decompressed data is greater than 1MB, forcing the cell to send the original (compressed) CU to the databsae server.

We may even be able work the arithmetic backwards to estimate the number of times double decompression appeared.  Assume that two-thirds of the time the cell decompressed the data and successfully sent (just less than) 1MB back to the database server and one-third of the time the cell decompressed the data and found that the result was too large and sent 256KB of compressed data back to the server, and let’s work with the 972,000 read requests reported to see what drops out of the arithmetic:

  • Total data read: 972,000 * 256KB = 243,000 MB
  • Data sent to db server:  648,000 * 1MB + 324,000 * 256KB = 729,000 MB
  • Cell Offload = 100 * (1 – 729/243) = -200%   Q.E.D.

Of course it would be nice to avoid guessing – and if we were able to check the session activity stats (v$sessstat) while the query was running (or after it had completed) we could pick up several numbers that confirmed our suspicion. For 11.2.0.4, for example, we would keep an eye on:

	cell CUs sent uncompressed
	cell CUs processed for uncompressed
	EHCC {class} CUs Decompressed

Differences between these stats allows you to work out the number of compression units that failed the 1MB test on the cell server and were sent to the database server to be decompressed. There is actually another statistic named “cell CUs sent compressed” which would make life easy for us, but I’ve not seen it populated in my tests – so maybe it doesn’t mean what it seems to say.

Here’s an example from an 11.2.0.4 system that I presented a few years ago showing some sample numbers.

cell CUs sent uncompressed              5,601
cell CUs processed for uncompressed     5,601

EHCC CUs Decompressed                  17,903
EHCC Query High CUs Decompressed       12,302 

This reveals an annoying feature of 11g (continued in 12.1) that results in double counting of the statistics, confusing the issue when you’re trying to analyze what’s going on. In this case the table consisted of 12,302 compression units, and the query was engineered to cause the performance problem to appear. The first two statistics show us how many CUs were decompressed successfully (we’ll see a change appearing there in 12.1). We then see that all 12,302 of the table’s “query high” compression units were decompressed – but the “total” of all CUs decompressed was 17.903.

It’s not a coincidence that 12,302 + 5,601 = 17,903; there’s some double counting going on. I don’t know how many of the statistics are affected in this way, but Oracle has counted the CUs that passsed decompression once as they were processed at the cell server and again as they arrived at the database server. In this example we can infer that 12,302 – 5,601 = 6,701 compression units failed decompression at the cell server and were sent to the database server in compressed form to be decompressed again.

Here’s a couple of sets of figures from some similar tests run on 12.1.0.2 – one with a table compressed to query high another compressed to query low. There is one critical difference from the 11g figures but the same double-counting seems to have happened. In both cases the “EHCC Query [Low|High] CUs Decompressed” show the correct number of CUs in each table. Note, though that the “cell CUs processed for uncompress” in 12.1 appear to report the number of attempted decompressions rather than 11g’s number of successful decompressions.


=========================

cell CUs sent uncompressed                     19,561	-- successful decompressions at cell server
cell CUs processed for uncompressed            19,564	=> 3 failures

EHCC CUs Decompressed                          39,125	=  2 * 19,561 successes + 3 db server decompression
EHCC Query High CUs Decompressed               19,564

=========================

cell CUs sent uncompressed                     80,037	-- successful decompressions at cell server
cell CUs processed for uncompressed            82,178	=> 2,141 failures

EHCC CUs Decompressed                         162,215	=  2 * 80,037 successes + 2,141 db server decompressions
EHCC Query Low CUs Decompressed                82,178

=========================

I’ve annotated the figures to explain the arithmetic.

There has been some significant renaming and separation of statistics in 12.2, as described in this post by Roger MacNicol, and the problems of double-counting should have disappeared. I haven’t yet tested my old models in the latest versions of Oracle, though, so can’t show you anyy figures to demonstrate the change.

Takeaways

There are 4 key points to note in this posting.

  • Hash (sub)partitioning should be based on powers of 2, otherwise some partitions will be twice size of others.
  • There is a 1MB limit on the “data packet” sent between the cell server and database server in Exadata.
  • If you select a large fraction of the rows and columns from an HCC compressed table you may end up decompressing a lot of your data twice if the decompressed data for a read request is larger than the 1MB unit (and the cost will be highly visible at the database server as CPU usage).
  • The Cell Offload figure for a tablescan (in particular) will go negative if the volume of data sent from the cell server to the database server is larger than the volume of data read from the disk- even if double decompression hasn’t been happening.

A little corollary to the third point: if you are writing to a staging table with the expectation of doing an unfiltered tablescan (or a select *), then you probably don’t want to use hybrid columnar compression on the table as you will probably end up using a lot of CPU at the database server to compress it, then do double-decompression using even more CPU on the database server.  It’s only if you really need to minimise disk usage and have lots of CPU capacity to spare that you have a case for using hybrid columnar compression for the table (and Oracle In-Memory features may also change the degree of desirability).

Footnote

I haven’t said anything about accessing table data by index when the table is subject to HCC compression. I haven’t tested the mechanism in recent versions of Oracle but it used to be the case that the cell server would supply the whole compression unit (CU) to the database server which would decompress it to construct the relevant row. One side effect of this was that the same CU could be decompressed (with a high CPU load) many times in the course of a single query.

 

August 27, 2019

AW-argh

Filed under: AWR,Oracle,Troubleshooting,Upgrades — Jonathan Lewis @ 3:59 pm BST Aug 27,2019

This is another of the blog notes that have been sitting around for several years – in this case since May 2014, based on a script I wrote a year earlier. It makes an important point about “inconsistency” of timing in the way that Oracle records statistics of work done. As a consequence of being first drafted in May 2014 the original examples showed AWR results from 10.2.0.5 and 11.2.0.4 – I’ve just run the same test on 19.3.0.0 to see if anything has changed.

 

[Originally drafted May 2014]: I had to post this as a reminder of how easy it is to forget things – especially when there are small but significant changes between versions. It’s based loosely on a conversation from Oracle-L, but I’m going to work the issue in the opposite order by running some code and showing you the ongoing performance statistics rather than the usual AWR approach of reading the performance stats and trying to guess what happened.

The demonstration needs two sessions to run; it’s based on one session running some CPU-intensive SQL inside an anonymous PL/SQL block with a second another session launching AWR snapshots at carefully timed moments. Here’s the code for the working session:

rem
rem     Script:         awr_timing.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2013
rem

alter session set "_old_connect_by_enabled"=true';

create table kill_cpu(n, primary key(n))
organization index
as
select  rownum n
from    all_objects
where   rownum <= 26 -- > comment to avoid wordpress format issue
;

execute dbms_stats.gather_table_stats(user,'kill_cpu')

pause Take an AWR snapshot from another session and when it has completed  press return

declare
        m_ct    number;
begin

        select  count(*) X
        into    m_ct
        from    kill_cpu
        connect by
                n > prior n
        start with
                n = 1
        ;

        dbms_lock.sleep(30);

end;
/

You may recognise an old piece of SQL that I’ve often used as a way of stressing a CPU and seeing how fast Oracle can run. The “alter session” at the top of the code is necessary to use the pre-10g method of running a “connect by” query so that the SQL does a huge number of buffer gets (and “buffer is pinned count” visits). On my current laptop the query takes about 45 seconds (all CPU) to complete. I’ve wrapped this query inside a pl/sql block that then sleeps for 30 seconds.

From the second session you need to launch an AWR snapshot 4 times – once in the pause shown above, then (approximately) every 30 seconds thereafter. The second one should execute while the SQL statement is still running, the third one should execute while the sleep(30) is taking place, and the fourth one should execute after the pl/sql block has ended and the SQL*Plus prompt is visible.

Once you’ve got 4 snapshots you can generate 3 AWR reports. The question to ask then, is “what do the reports say about CPU usage?” Here are a few (paraphrased) numbers – starting with 10.2.0.5 comparing the “Top 5 timed events”, “Time Model”, and “Instance Activity” There are three sets of figures, the first reported while the SQL was still running, the second reported after the SQL statement had completed and the dbms_lock.sleep() call is executing, the last reported after the PL/SQL block has completed. There are some little oddities in the numbers due to backgorund “noise” – but the key points are still clearly visible:

While the SQL was executing

Top 5
-----
CPU Time                       26 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                     26.9        100.0
DB CPU                                       26.2         97.6

Instance Activity
-----------------
CPU used by this session         0.65 seconds
recursive cpu usage              0.67 seconds

SQL ordered by CPU
------------------
31 seconds reported for both the SQL and PLSQL

During the sleep()

Top 5
-----
CPU Time                        19 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                     19.0        100.0
DB CPU                                       18.6         98.1

Instance Activity
-----------------
CPU used by this session         0.66 seconds
recursive cpu usage             44.82 seconds

SQL ordered by CPU
------------------
14 seconds reported for both the SQL and PLSQL

After the PL/SQL block ended

Top 5
-----
CPU Time                         1 second

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                      1.4         99.9
DB CPU                                        1.4         99.7

Instance Activity
-----------------
CPU used by this session        44.68 seconds
recursive cpu usage              0.50 seconds

SQL ordered by CPU
------------------
1 second reported for the PLSQL, but the SQL was not reported

Points to notice:

While the SQL was excecuting (and had been executing for about 26 seconds, the Time Model mechanism was recording the work done by the SQL, and the Top N information echoed the Time model CPU figure. At the same time the “CPU used …” Instance Activity Statistics have not recorded any CPU time for the session – and they won’t until the SQL statement completes. Despite this, the “SQL ordered by …” reports double-count in real-time, show the SQL and the PL/SQL cursors as consuming (with rounding errors, presumably) 31 seconds each.

After the SQL execution was over and the session was sleeping the Time model (hence the Top 5) had recorded a further 19 seconds of work. The instance activity, however, has now accumulated 44 seconds of CPU, but only as “recursive CPU usage” (recursive because our SQL was called from with a PL/SQL block), with no “CPU used by this session”. The “SQL ordered by …” figures have recorded the amount of CPU used by both the SQL and PL/SQL in the interval (i.e. 14 seconds – which is a little off) recorded against both.)

After the PL/SQL block has completed the Time Model and the Top 5 report both say that nothing much happened in the interval, but the Instance Activity suddenly reports 44.68 seconds of CPU used by this session – which (roughly speaking) is truish as the PL/SQL block ended and assigned the accumulated recursive CPU usage to the session CPU figure. Finally, when we get down to the “SQL ordered by CPU” the SQL was not reported  – it did no work in the interval – but the PL/SQL block was still reported but only with a generous 1 second of CPU since all it did in the interval was finish the sleep call and tidy up the stack before ending.

Now the same sets of figures for 11.2.0.4 – there’s a lot of similarity, but one significant difference:

While the SQL was executing


Top 5
-----
CPU Time                        26.6 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                     27.0        100.0
DB CPU                                       26.6         98.5

Instance Activity
-----------------
CPU used by this session         1.09 seconds
recursive cpu usage              1.07 seconds

SQL ordered by CPU
------------------
25.6 seconds reported for both the SQL and PLSQL

During the sleep()

Top 5
-----
CPU Time                        15.1 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                     15.3         99.8
DB CPU                                       15.1         98.2

Instance Activity
-----------------
CPU used by this session        41.09 seconds
recursive cpu usage             41.03 seconds

SQL ordered by CPU
------------------
14.3 seconds reported for the SQL
13.9 seconds reported for the PLSQL

After the PL/SQL block ended

Top 5
-----
CPU Time                         1.4 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                      1.5         99.6
DB CPU                                        1.4         95.4

Instance Activity
-----------------
CPU used by this session         1.02 seconds
recursive cpu usage              0.95 seconds

SQL ordered by CPU
------------------
0.5 seconds reported for the PLSQL, and no sign of the SQL

Spot the one difference in the pattern – during the sleep() the Instance Activity Statistic “CPU used by this session” is recording the full CPU time for the complete query, whereas the time for the query appeared only in the “recursive cpu” in the 10.2.0.5 report.

I frequently point out that for proper understanding of the content of an AWR report you need to cross-check different ways in which Oracle reports “the same” information. This is often to warn you about checking underlying figures before jumping to conclusions about “hit ratios”, sometimes it’s to remind you that while the Top 5 might say some average looks okay the event histogram may say that what you’re looking at is mostly excellent with an occasional disaster thrown in. In this blog note I just want to remind you that if you only ever look at one set of figures about CPU usage there are a few special effects (particularly relating to long running PL/SQL / Java / SQL) where you may have to work out a pattern of behaviour to explain unexpectedly large (or small) figures and contradictory figures, The key to the problem is recognising that different statistics may be updated at different stages in a complex process.

Footnote

I doubt if many people still run 11.2.0.4, so I also re-ran the test on 19.3.0.0 before publishing. The behaviour hasn’t changed since 11.2.0.4 although the query ran a little faster, perhaps due to changes in the mechanisms for this type of “connect by pump”.

11.2.0.4 stats


Name                                            Value
----                                            -----
session logical reads                      33,554,435
consistent gets                            33,554,435
consistent gets from cache                 33,554,435
consistent gets from cache (fastpath)      33,554,431
no work - consistent read gets             33,554,431
index scans kdiixs1                        33,554,433
buffer is not pinned count                 16,777,219


19.3.0.0 stats

Name                                            Value
----                                            -----
session logical reads                      16,843,299
consistent gets                            16,843,299
consistent gets from cache                 16,843,299
consistent gets pin                        16,843,298
consistent gets pin (fastpath)             16,843,298
no work - consistent read gets             16,790,166
index range scans                          33,554,433
buffer is not pinned count                 16,790,169

Some changes are trivial (like the change of name for “index scans kdiixs1”) some are interesting (like some gets not being labelled as “pin” and “pin (fastpath)”), some are baffling (like how you can manage 33M index range scans while doing only 16M buffer gets!)

August 26, 2019

Troubleshooting

Filed under: CBO,Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 12:19 pm BST Aug 26,2019

A recent thread on the Oracle Developer Community starts with the statement that a query is taking a very long time (with the question “how do I make it go faster?” implied rather than asked). It’s 12.1.0.2 (not that that’s particularly relevant to this blog note), and we have been given a number that quantifies “very long time” (again not particularly relevant to this blog note – but worth mentioning because your “slow” might be my “wow! that was fast” and far too many people use qualitative adjectives when the important detail is quantative). The query had already been running for 15 hours – and here it is:


SELECT 
        OWNER, TABLE_NAME 
FROM
        DBA_LOGSTDBY_NOT_UNIQUE 
WHERE
        (OWNER, TABLE_NAME) NOT IN (
                SELECT 
                        DISTINCT OWNER, TABLE_NAME 
                        FROM     DBA_LOGSTDBY_UNSUPPORTED
        ) 
AND     BAD_COLUMN = 'Y'

There are many obvious suggestions anyone could make for things to do to investigate the problem – start with the execution plan, check whether the object statistics are reasonably representative, run a trace with wait state tracing enabled to see where the time goes; but sometimes that are a couple of very simple observation you can make that point you to simple solutions.

Looking at this query we can recognise that it’s (almost certainly) about a couple of Oracle data dictionary views (which means it’s probably very messy under the covers with a horrendous execution plan) and, as I’ve commented from time to time in the past, Oracle Corp. developers create views for their own purposes so you should take great care when you re-purpose them. This query also has the very convenient feature that it looks like two simpler queries stitched together – so a very simple step in trouble-shooting, before going into any fine detail, is to unstitch the query and run the two parts separately to see how much data they return and how long they take to complete:


SELECT OWNER, TABLE_NAME FROM DBA_LOGSTDBY_NOT_UNIQUE WHERE BAD_COLUMN = 'Y'

SELECT DISTINCT OWNER, TABLE_NAME FROM DBA_LOGSTDBY_UNSUPPORTED

It’s quite possble that the worst case scenario for the total run time of the original query could be reduced to the sum of the run time of these two queries. One strategy to achieve this would be a rewrite of the form:

select  * 
from    (
        SELECT OWNER, TABLE_NAME FROM DBA_LOGSTDBY_NOT_UNIQUE WHERE BAD_COLUMN = 'Y'
        minus
        SELECT DISTINCT OWNER, TABLE_NAME FROM DBA_LOGSTDBY_UNSUPPORTED
)

Unfortunately the immediately obvious alternative may be illegal thanks to things like duplicates (which disappear in MINUS operations) or NULLs (which can make ALL the data “disappear” in some cases). In this case the original query might be capable of returning duplicates of (owner, table_name) from dba_lgstdby_not_unique which would collapse to a single ocurrence each in my rewrite – so my version of the query is not logically equivalent (unless the definition of the view enforces uniqueness); on the other hand tracking, back through the original thread to the MoS article where this query comes from, we can see that even if the query could return duplicates we don’t actually need to see them.

And this is the point of the blog note – it’s a general principle (that happens to be a very obvious strategy in this case): if a query takes too long, how does it compare with a simplified version of the query that might be a couple of steps short of the final target. If it’s easy to spot the options for simplification, and if the simplified version operates efficiently, them isolate it (using a no_merge hint if necessary), and work forwards from there. Just be careful that your rewrite remains logically equivalent to the original (if it really needs to).

In the case of this query, the two parts took 5 seconds and 9 seconds to complete, returning 209 rows and 815 rows respectively. Combining the two queries with a minus really should get the required result in no more than 14 seconds.

Footnote

The “distinct” in the second query is technically redundant as the minus operation applies a sort unique operation to both the two intermediate result sets before comparing them.  Similarly the  “distinct” was also redundant when the second query was used for the “in subquery” construction – again there would be an implied uniqueness operation if the optimizer decided to do a simple unnest of the subquery.

 

 

 

 

August 13, 2019

Multi-table

Filed under: Bugs,Oracle,Troubleshooting — Jonathan Lewis @ 1:34 pm BST Aug 13,2019

Here’s a problem (and I think it should be called a bug) that I first came across about 6 years ago, then forgot for a few years until it reappeared some time last year and then again a few days ago. The problem has been around for years (getting on for decades), and the first mention of it that I’ve found is MoS Bug 2891576, created in 2003, referring back to Oracle 9.2.0.1, The problem still exists in Oracle 19.2 (tested on LiveSQL).

Here’s the problem: assume you have a pair of tables (call them parent and child) with a referential integrity constraint connecting them. If the constraint is enabled and not deferred then the following code may fail, and if you’re really unlucky it may only fail on rare random occasions:


insert all
        into parent({list of parent columns}) values({list of source columns})
        into child ({list of child columns})  values({list of source columns})
select
        {list of columns}
from    {source}
;

The surprising Oracle error is “ORA-02291: integrity constraint ({owner.constraint_name}) violated – parent key not found”, and the reason is simple (and documented in MoS note 265826.1 Multi-table Insert Can Cause ORA-02291: Integrity Constraint Violated for Master-Detail tables: the order in which the insert operations take place is “indeterminate” so that child rows may be inserted before their parent rows (and for the multi-table insert the constraint checks are not postponed until the statement completes as they are, for instance, for updates to a table with a self-referencing RI constraint).

Two possible workarounds are suggested in Doc ID 265826.1

  • drop the foreign key constraint and recreate it after the load,
  • make the foreign key constraint deferrable and defer it before the insert so that it is checked only on commit (or following an explicit call to make it immediate)

The second option would probably be preferable to the first but it’s still not a very nice thing to do and could leave your database temporarily exposed to errors that are hard to clean up. There are some details of the implementation of deferrable constraints in the comments of this note on index rebuilds if you’re interested in the technicalities.

A further option which seems to work is to create a (null) “before row insert” trigger on the parent table – this appears to force the parent into a pattern of single row inserts and the table order of insertion then seems to behave. Of course you do pay the price of an increase in the volume of undo and redo. On the down-side Bug 2891576 MULTITABLE INSERT FAILS WITH ORA-02291 WHEN FK & TRIGGER ARE PRESENT can also be fouind on MoS, leading 265826.1 to suggests disabling triggers if their correctness is in some way dependent on the order in which your tables are populated. That dependency threat should be irrelevant if the trigger is a “do nothing” trigger. Sadly there’s a final note that I should mention: Bug 16133798 : INSERT ALL FAILS WITH ORA-2291 reports the issue as “Closed: not a bug”

There is a very simple example in the original bug note demonstrating the problem, but it didn’t work on the version of Oracle where I first tested it, so I’ve modified it slightly to get it working on a fairly standard install. (I suspect the original was executed on a database with a 4KB block size.)


drop table child purge;
drop table parent purge;

create table parent (id number primary key);

create table child  (id number, v1 varchar2(4000),v2 varchar2(3920));
alter table child add constraint fk1 foreign key (id) references parent (id);
 
create or replace trigger par_bri
before insert on parent
for each row
begin
        null;
end;
.

insert all
        into parent ( id ) values ( id )
        into child  ( id ) values ( id )
select  100 id from dual
;

In the model above, and using an 8KB block in ASSM, the code as is resulted in an ORA-02991 error. Changing the varchar2(3920) to varchar2(3919) the insert succeeded, and when I kept the varchar2(3920) but created the trigger the insert succeeded.

Fiddling around in various ways and taking some slightly more realistic table definitions here’s an initial setup to demonstrate the “randomness” of the failure (tested on various versions up to 18.3.0.0):


rem
rem     Script:         insert_all_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem
rem     Last tested 
rem             18.3.0.0
rem             12.2.0.1
rem             12.1.0.2
rem             10.2.0.5
rem              9.2.0.8
rem

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,10,'0')             small_vc,
        lpad(rownum,100,'0')            medium_vc,
        lpad(rownum,200,'0')            big_vc
from
        generator       v1
;

create table parent(
        id              number,
        small_vc        varchar2(10),
        medium_vc       varchar2(100),
        big_vc          varchar2(200),
        constraint par_pk primary key(id)
)
segment creation immediate
;

create table child(
        id              number,
        small_vc        varchar2(10),
        medium_vc       varchar2(100),
        big_vc          varchar2(200),
        constraint chi_pk primary key(id),
        constraint chi_fk_par foreign key (id) references parent(id)
)
segment creation immediate
;

create table child2(
        id              number,
        small_vc        varchar2(10),
        medium_vc       varchar2(100),
        big_vc          varchar2(200),
        constraint ch2_pk primary key(id),
        constraint ch2_fk_par foreign key (id) references parent(id)
)
segment creation immediate
;

I’ve created a “source” table t1, and three “target” tables – parent, child and child2. Table parent has a declared primary key and both child and child2 have a referential integrity constraint to parent. I’m going to do a multi-table insert selecting from t1 and spreading different columns across the three tables.

Historical note: When I first saw the “insert all” option of multi-table inserts I was delighted with the idea that it would let me query a de-normalised source data set just once and insert the data into a normalised set of tables in a single statement – so (a) this is a realistic test from my perspective and (b) it has come as a terrible disappointment to discover that I should have been concerned about referential integrity constraints (luckily very few systems had them at the time I last used this feature in this way).

The multi-table insert I’ve done is as follows:


insert all
        into parent(id, small_vc)  values(id, small_vc)
        into child (id, medium_vc) values(id, medium_vc)
        into child2(id, medium_vc) values(id, medium_vc)
--      into child2(id, big_vc)    values(id, big_vc)
select
        id, small_vc, medium_vc, big_vc
from
        t1
where
        rownum <= &m_rows_to_insert
;

You’ll notice that I’ve allowed user input to dictate the number of rows selected for insertion and I’ve also allowed for an edit to change the column that gets copied from t1 to child2. Althought it’s not visible in the create table statements I’ve also tested the effect of varying the size of the big_vc column in t1.

Starting with the CTAS and multi-table insert as shown the insert runs to completion if I select 75 rows from t1, but if I select 76 rows the insert fails with “ORA-02991: integrity constraint (TEST_USER.CHI_FK_PAR) violated – parent key not found”. If I change the order of the inserts into child1 and child2 the violated constraint is TEST_USER.CH2_FK_PAR – so Oracle appears to be applying the inserts in the order they appear in the statement in some circumstances.

Go back to the original order of inserts for child1 and child2, but use the big_vc option for child2 instead of the medium_vc. In this case the insert succeeds for 39 rows selected from t1, but fails reporting constraint TEST_USER.CH2_FK_PAR when selecting 40 rows. Change the CTAS and define big_vc with as lpad(rownum,195) and the insert succeeds with 40 rows selected and fails on 41 (still on the CH2_FK_PAR constraint); change big_vc to lpad(rownum,190) and the insert succeeds on 41 rows selected, fails on 42.

My hypothesis on what’s happening is this: each table in the multitable insert list gets a buffer of 8KB (maybe matching one Oracle block if we were to try different block sizes). As the statement executes the buffers will fill and, critically, when the buffer is deemed to be full (or full enough) it is applied to the table – so if a child buffer fills before the parent buffer is full you can get child rows inserted before their parent, and it looks like Oracle isn’t postponing foreign key checking to the end of statement execution as it does with other DML – it’s checking as each array is inserted.

Of course there’s a special boundary condition, and that’s why the very first test with 75 rows succeeds – neither of the child arrays gets filled before we reach the end of the t1 selection, so Oracle safely inserts the arrays for parent, child and child2 in that order. The same boundary applies occurs in the first of every other pair of tests that I’ve commented on.

When we select 76 rows from t1 in the first test the child and child2 arrays hit their limit and Oracle attempts to insert the child1 rows first – but the parent buffer is far from full so its rows are not inserted and the attempted insert results in the ORA-02991 error. Doing a bit of rough arithmetic the insert was for 76 rows totalling something like: 2 bytes for the id, plus a length byte, plus 100 bytes for the medium_vc plus a length byte, totalling 76 * 104 =7,904 bytes.

When we switch to using the big_vc for child2 the first array to fill is the child2 array, and we have 3 sets of results as we shorten big_vc:

  • 40 * ((1 + 2) + (1 + 200)) = 8160
  • 41 * ((1 + 2) + (1 + 195)) = 8159
  • 42 * ((1 + 2) + (1 + 190)) = 8148

While I’m fairly confident that my “8KB array” hypothesis is in the right ballpark I know I’ve still got some gaps to explain – I don’t like the fact that I’ve got a break point around 7,900 in the first example and something much closer to 8,192 in the other three examples.  I could try to get extra precision by running up a lot more examples with different numbers and lengths of columns to get a better idea of where the error is appearing – but I’m sufficiently confident that the idea is about right so I can’t persuade myself to make the effort to refine it. An example of an alternative algorithm (which is actually a better fit though a little unexpected) is to assume that the normal 5 byte row overhead (column count, lock byte, flags and 2-byte row directory entry) has been included in the array sizing code, and the insert takes place at the point incoming row breaks, or just touches, the limit. In this case our 4 results would suggest the following figures:

  • 75 * 109 = 8175
  • 39 * 209 = 8151
  • 40 * 204 = 8160
  • 41 * 199 = 8159

With these numbers we can see 8KB (8,192 bytes) very clearly, and appreciate that the one extra row would take us over the critical limit, hence triggering the insert and making the array space free to hold the row.

Bottom Line

If you’re using the multi-table “insert all” syntax and have referential integrity declared between the various target tables then you almost certainly need to ensure that the foreign key constraints are declared as deferrable and then deferred as the insert takes place otherwise you may get random (and, until now, surprisingly inexplicable) ORA-02991 foreign key errors.

A possible alternative workaround is to declare a “do nothing” before row insert trigger on the top-level as this seems to switch the process into single row inserts on the top-most parent that force the other array inserts to take place with their parent row using small array sizes and protecting against the foreign key error. This is not an officially sanctioned workaround, though, and may only have worked by accident in the examples I tried.

It is possible, if the 8KB working array hypothesis is correct, that you will never see the ORA-02991 if the volume of data (number of rows * row length) for the child rows of any given parent row is always less than the size of the parent row – but that might be a fairly risky thing to hope for in a production system. It might be much better to pay the overhead of deferred foreign key checking than having a rare, unpredictable error appearing.

 

August 11, 2019

Troubleshooting

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 9:28 pm BST Aug 11,2019

Here’s a question to provoke a little thought if you’ve got nothing more entertaining to do on a Sunday evening.  What threats do you think of when you see a statement like the following in (say) an AWR report, or in a query against v$sql ?

update tableX set
        col001 = :1, col002 = :2, col003 = :3, ...
        -- etc. 
        -- the names are supposed to indicate that the statement updates 302 columns
        -- etc.
        col301 = :301, col302 = :302
where
        pk_col = :303
;

I’ll be writing up some suggestions tomorrow (Monday, UK BST), possible linking to a few other articles for background reading.

Update

The first three comments have already hit the high points, but I’m going to jot down a few notes anyway.

The first two things that really (should) make an impact are:

  • There’s a table in the application with (at least) 303 columns – anything over 255 is a potential disaster area
  • An update statement that updates 302 columns is probably machine generated by a non-scalable application

A secondary detail that might be useful is recognising the pattern of the text – lower case for column names, simple “:nnn” for bind variables.  As it stands I don’t recognise the possible source for this style, but I know it’s not SQL embedded in PL/SQL (which would be all capitals with “:Bnnn” as bind variable names) and it’s not part of a distributed query from a remote database (which would be in capitals with quoted names, and table aliases like “A1, A2, …”), and it’s not “raw” Hiberbate code which produces meaningless table and column aliases based on numbers with a “t” for table and “c” for column.

So let’s think about possible problems and symptoms relating to the two obvious issues:

Wide tables

Once you have more than 255 (real) columns in a table – even if that count includes columns that have been marked unused – Oracle will have to split rows into “rowpieces” that do not exceed 255 columns and chain those pieces together. Oracle will try to be as efficient as possible – with various damage-limiting code changes appearing across versions – attempting store these row pieces together and keeping the number to a minimum, but there are a number of anomalies that can appear that have a significant impact on performance.

Simply having to visit two row pieces to pick up a column in the 2nd row piece (even if it is in the same block) adds to the cost of processing; but when you have to visit a second block to acquire a 2nd (or 3rd, or 4th) rowpiece the costs can be significant. As a quirky little extra, Oracle’s initial generation of row-pieces creates them from the end backwards – so a row with 256 columns starts with a row-piece of one column following by a rowpiece of 255 columns: so you may find that you have to fetch multiple row pieces for virtually every row you access.

It’s worth noting that a row splitting is based only on columns that have been used in the row. If your data is restricted to the first 255 column of a row then the entire row can be stored as a single row piece (following the basic rule that “trailing nulls take no space”); but as soon as you start to update such a row by populating columns past the 255 boundary Oracle will start splitting from the end – and it may create a new trailing row-piece each time you populate a column past the current “highest” column.  In an extreme case I’ve managed to show an example of a single row consisting of 746 row pieces, each in a different block (though that’s a bug/feature that’s been addressed in very recent versions of Oracle).

If rows have been split across multiple blocks then one of the nastier performance problems appears with direct path read tablescans. As Oracle follows the pointer to a secondary row piece it will do a physical read of the target block then immediately forget the target block so, for example, if you have inserted 20 (short) rows into a block then updated all of them in a way that makes them split and all their 2nd row pieces go to the same block further down the table you can find that Oracle does a single direct path read that picks up the head pieces, then 20 “db file sequential read” calls to the same block to pick up the follow-on pieces. (The same effect appears with simple migrated rows.) Contrarily, if you did the same tablescan using “db file scattered read” requests then Oracle might record a single, highly deceptive “table fetch continued row” because it managed to pin the target block and revisit it 20 times.

Often a very wide row (large number of columns) means the table is trying to hold data for multiple types of object. So a table of 750 columns may use the first 20 columns for generic data, columns 21 to 180 for data for object type A, 181 to 395 for data for object type B, and so on.  This can lead to rows with a couple of hundred used columns and several hundred null columns in the middle of each row – taking one byte per null column and burning up lots of CPU as Oracle walks a row to find a relevant column. A particularly nasty impact can appear from this type of arrangement when you upgrade an applications:  imagine you have millions of rows of the “type A” above which use only the first 180 columns.  For some reason the application adds one new “generic” column that (eventually) has to be populated for each row – as the column is populated for a type A row the row grows by 520 (null counting) bytes and splits into at least 3 pieces. The effect could be catastrophic for anyone who had been happy with their queries reporting type A data.

One of the difficulties of handling rows that are chained due to very high column counts is that the statistics can be quite confusing (and subject to change across versions). The most important clue comes from “table fetch continued row”; but this can’t tell you whether your “continued” rows are migrated or chained (or both), which table they come from, and whether you’ve been fetching the same small number multiple times or many different rows. Unfortunately the segment statistics (v$segstat / v$segment_statistics) don’t capture the number of continued fetches by segment – it would be nice if they did since it ought to be a rare (and therefore low-cost) event. The best you can do, probably, is to look at the v$sql_monitor report for queries that report tablescans against large tables but report large numbers of single block reads in the tablescan – and for repeatable cases enable SQL trace with wait tracing enabled against suspect queries to see if they show the characteristic mix of direct path reads and repeated db file sequential reads.

Update every column

The key performance threat in statements that update every column – including the ones that didn’t change – is that Oracle doesn’t compare before and after values when doing the update. Oracle’s code path assumes you know what you’re doing so it saves every “old” value to an undo record (which gets copied to the redo) and writes every “new” value to a redo change vector.  (Fortunately Oracle does check index definitions to see which index entries really have suffered changes, so it doesn’t visit index leaf blocks unnecessarily). It’s possible that some front-end tool that adopts this approach has a configuration option that switches from “one SQL statement for all update” to “construct minimal statement based on screen changes”.

The simple trade-off between these two options is the undo/redo overhead vs. parsing and optimisation overheads as the tool creates custom statements on demand. In the case of the table with more than 255 columns, of course, there’s the added benefit that an update of only the changed columns might limit the update to columns that are in the first rowpiece, eliminating the need (some of the time) to chase pointers to follow-up pieces.

Limiting the update can help with undo and redo, of course, but if the tool always pulls the entire row to the client anyway you still have network costs to consider. With the full row pulled and then updated you may find it takes several SQL*Net roundtrips to transfer the whole row between client and server.  In a quick test on a simple 1,000 column table with an update that set every column in a single row to null (using a bind variables) I found that the a default setup couldn’t even get 1,000 NULLs (let alone “real values”) across the network without resorting to one wait on “SQL*Net more data from client”


variable b1 number
exec :b1 := null;

update t1 set
        col0001 = :b1,
        col0002 = :b1,
        ...
        col1000 = :b1
;

Although “SQL*Net message to/from client” is typically (though not always correctly) seen as an ignorable wait, you need to be a little more suspicious of losing time to “SQL*Net more data to/from client”. The latter two waits mean you’re sending larger volumes of information across the network and maybe you can eliminate some of the data or make the transfer more efficient – perhaps a simple increase in the SDU (session data unit) in the tnsnames.ora, listener.ora, or sqlnet.ora (for default_sdu_size) might be helpful.

Warning

One of the features of trouble-shooting from cold is that you don’t know very much about the system you’re looking at – so it’s nice to be able to take a quick overview of a system looking for any potentially interesting anomalies and take a quick note of what they are and what thoughts they suggest before you start asking questions and digging into a system. This article is just a couple of brief notes along the lines of: “that doesn’t look nice- what questions does it prompt”.

August 8, 2019

Free Space

Filed under: Execution plans,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 1:27 pm BST Aug 8,2019

Several years ago I wrote a note about reporting dba_free_space and dba_extents to produce a map of the space usage in a tablespace in anticipation of messing about with moving or rebuilding objects to try and reduce the size of the files in the tablespace.  In the related page where I published the script I pointed out that a query against dba_extents would be expensive because it makes use of structure x$ktfbue which generates the information dynamically by reading segment header blocks. I also pointed out in a footnote to the original article that if you’ve enabled the recyclebin and have “dropped” some objects then there will be some space that is reported as free but is not quite free since the extents will still be allocated. This brings me to the topic for today’s blog.

While visiting a client site recently I came across an instance that was running a regular report to monitor available space in the database. Basically this was a query against view dba_free_space. Surprisingly it was taking a rather long time to complete – and the reason for this came in two parts. First, the recyclebin was enabled and had some objects in it and secondly there were no stats on the fixed object x$ktfbue.

In the case of the client the particular query produced a plan that included the following lines:


Id  Operation             Name              Rows    Bytes  Cost (%CPU)  Time
--  --------------------- ----------------  ----   ------  -----------  --------
63  HASH JOIN                               2785     212K     46  (85)  00:00:01
64    TABLE ACCESS FULL   RECYCLEBIN$       1589    20657      7   (0)  00:00:01
65    FIXED TABLE FULL    X$KTFBUE          100K    6347K     38 (100)  00:00:01 

This is part of the view where Oracle calculates the size of all the extents of objects in the recyclebin so that they can be reported as free space. Notice that in this plan (which is dependent on version, system stats, object_stats and various optimizer parameters) the optimizer has chosen to do a hash join between the recyclebin (recyclebin$) and the x$ structure – and that has resulted in a “full tablescan” of x$ktfbue, which means Oracle reads the segment header block of every single segment in the entire database. (I don’t know where the row stats came from as there were no stats on x$ktfbue, and this plan was pulled from the AWR history tables so the query had been optimised and captured some time in the past.)

If there had been nothing in the recyclebin the hash join and two tablescans wouldn’t have mattered, unfortunately the recyclebin had been enabled and there were a few rows in recyclebin$, so the “tablescan” happened. Here’s a cut-n-paste from a much simpler query run against a fairly new (no 3rd party app) database running 12.1.0.2 to give you some idea of the impact:


SQL> execute snap_events.start_snap

PL/SQL procedure successfully completed.

SQL> select count(*) from x$ktfbue;

  COUNT(*)
----------
      8774

1 row selected.

SQL> execute snap_events.end_snap
---------------------------------------------------------
Session Events - 01-Aug 21:28:13
---------------------------------------------------------
Event                                             Waits   Time_outs        Csec    Avg Csec    Max Csec
-----                                             -----   ---------        ----    --------    --------
Disk file operations I/O                              7           0           0        .018           1
db file sequential read                           5,239           0          14        .003           6
SQL*Net message to client                             7           0           0        .000           0
SQL*Net message from client                           7           0       1,243     177.562         572
events in waitclass Other                             3           1           0        .002           0

PL/SQL procedure successfully completed.

On my little laptop, with nothing else going on, I’ve managed to get away with “only” 5,239 single block reads, and squeezed them all into just 14 centiseconds (local SSD helps). The clients wasn’t so lucky – they were seeing tens of thousands of real physical reads.

The ideal solution, of course, was to purge the recyclebin and disable the feature – it shouldn’t be necessary to enable it on a production system – but that’s something that ought to require at least some paperwork. In the short term gathering stats on the fixed table helped because the plan changed from a hash join with “tablescan” of x$ktfbue to a nested loop with an “indexed” access path, looking more like the following (from a query against just recyclebin$ and x$ktfbue)

---------------------------------------------------------------------------------------------
| Id  | Operation                | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                  |       |       |     4 (100)|          |
|   1 |  NESTED LOOPS            |                  |     7 |   182 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL      | RECYCLEBIN$      |     6 |    66 |     4   (0)| 00:00:01 |
|*  3 |   FIXED TABLE FIXED INDEX| X$KTFBUE (ind:1) |     1 |    15 |     0   (0)|          |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("BUE"."KTFBUESEGBNO"="BLOCK#" AND "BUE"."KTFBUESEGTSN"="TS#" AND
              "BUE"."KTFBUESEGFNO"="FILE#"))

This was still fairly resource-intensive for the client, but was something of an improvement – they had a lot more than 6 items in their recyclebin.

Part of the problem, of course, is that x$ktfbue is one of the objects that Oracle skips when you gather “fixed object” stats – it can be a bit expensive for exactly the reason that querying it can be expensive, all those single block segment header reads.

If you want to check the stats and gather them (as a one-off, probably) here’s some suitable SQL:


select
        table_name, num_rows, avg_row_len, sample_size, last_analyzed
from
        dba_tab_statistics
where
        owner = 'SYS'
and     table_name = 'X$KTFBUE'
;

begin
        dbms_stats.gather_table_stats('SYS','X$KTFBUE');
end;
/

Summary

You probably shouldn’t have the recyclebin enabled in a production system; but if you do, and if you also run a regular report on free space (as many sites seem to do) make sure (a) you have a regular routine to minimise the number of objects that it accumulates and (b) gather statistics (occasionally) on x$ktfbue to minimise the overhead of the necessary join between recyclebin$ and x$ktfbue.

August 6, 2019

Parse Solution

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 1:14 pm BST Aug 6,2019

In the “Parse Puzzle” I posted a couple of days ago I showed a couple of extracts from an AWR report that showed contradictory results about the time the instance spent in parsing and hard parsing, and also showed an amazing factor of 4 difference between the DB Time and the “SQL ordered by Elapsed Time”. My example was modelling a real world anomaly I had come across, but was engineered to exaggerate the effect to make it easy to see what was going on.

The key feature was VPD (virtual private database) a.k.a. FGAC (find grained access control) or RLS (row-level security). I’ve created a “policy function” (the thing that generates the “security predicate”) to execute an extremely expensive SQL query; then I’ve created a policy with policy_type = ‘DYNAMIC’ so that the function gets executed every time a query against a particular table is executed. In fact my example holds three tables, and each table has its own policy function, and each policy function calls the same very expensive piece of SQL.

To see the effect this has on the AWR report I’ll now supply the contents of the “SQL ordered by Elapsed Time” and work through the list (though not in the order shown) explaining what each statement represents:


SQL ordered by Elapsed Time               DB/Inst: OR18/or18  Snaps: 2059-2060
-> Resources reported for PL/SQL code includes the resources used by all SQL
   statements called by the code.
-> % Total DB Time is the Elapsed Time of the SQL statement divided
   into the Total Database Time multiplied by 100
-> %Total - Elapsed Time  as a percentage of Total DB time
-> %CPU   - CPU Time      as a percentage of Elapsed Time
-> %IO    - User I/O Time as a percentage of Elapsed Time
-> Captured SQL account for  302.6% of Total DB Time (s):             158
-> Captured PL/SQL account for  101.7% of Total DB Time (s):             158

        Elapsed                  Elapsed Time
        Time (s)    Executions  per Exec (s)  %Total   %CPU    %IO    SQL Id
---------------- -------------- ------------- ------ ------ ------ -------------
           156.8              1        156.84   99.4   99.8     .0 1ubpdythth4q1
Module: SQL*Plus
select id, f_rls(n1, n2, n3) from fgac_base where rownum .le. 10                   -- edited to avoid WP  format issue

           156.8             33          4.75   99.3   99.8     .0 9dhvggqtk2mxh
Module: SQL*Plus
 select count(*) X from waste_cpu connect by n .gt. prior n start with n = 1       -- edited to avoid WP  format issue

            53.2             10          5.32   33.7   99.8     .0 5531kmrvrzcxy
Module: SQL*Plus
SELECT COUNT(*) FROM FGAC_REF3 WHERE ID = :B1

            53.2             11          4.83   33.7   99.8     .0 8g2uv26waqm8g
Module: SQL*Plus
begin :con := "FGAC_PACK"."FGAC_PREDICATE3"(:sn, :on); end;

            52.7             10          5.27   33.4   99.7     .0 awk070fhzd4vs
Module: SQL*Plus
SELECT COUNT(*) FROM FGAC_REF1 WHERE ID = :B1

            52.7             11          4.79   33.4   99.7     .0 c8pwn9j11gw5s
Module: SQL*Plus
begin :con := "FGAC_PACK"."FGAC_PREDICATE1"(:sn, :on); end;

            50.9             10          5.09   32.3   99.9     .0 964u0zv0rwpw1
Module: SQL*Plus
SELECT COUNT(*) FROM FGAC_REF2 WHERE ID = :B1

            50.9             11          4.63   32.3   99.9     .0 bgqf405f34u4v
Module: SQL*Plus
begin :con := "FGAC_PACK"."FGAC_PREDICATE2"(:sn, :on); end;

             2.8              1          2.79    1.8   98.4     .0 fhf8upax5cxsz
BEGIN sys.dbms_auto_report_internal.i_save_report (:rep_ref, :snap_id, :pr_class
, :rep_id, :suc); END;

             2.6              1          2.64    1.7   98.4     .0 0w26sk6t6gq98
SELECT XMLTYPE(DBMS_REPORT.GET_REPORT_WITH_SUMMARY(:B1 )) FROM DUAL

             2.4              1          2.43    1.5   98.3     .0 1q1spprb9m55h
WITH MONITOR_DATA AS (SELECT INST_ID, KEY, NVL2(PX_QCSID, NULL, STATUS) STATUS,
FIRST_REFRESH_TIME, LAST_REFRESH_TIME, REFRESH_COUNT, PROCESS_NAME, SID, SQL_ID,
 SQL_EXEC_START, SQL_EXEC_ID, DBOP_NAME, DBOP_EXEC_ID, SQL_PLAN_HASH_VALUE, SQL_
FULL_PLAN_HASH_VALUE, SESSION_SERIAL#, SQL_TEXT, IS_FULL_SQLTEXT, PX_SERVER#, PX

                          ------------------------------------------------------

The first statement is an SQL statement that calls a PL/SQL function f_rls() for 10 consecutive rows in an “ordinary table”. This is the query that actuallly takes 157 seconds to complete as far as the client SQL Plus session is concerned.

The function (called 10 times) passed in three values n1, n2, n3. The function uses n1 to query table FGAC_REF1, n2 to query table FGAC_REF2, and n3 to query FGAC_REF3 – and we can see those three queries appearing as statements 5, 7, and 3 (in that order) in the output. The main query takes 157 seconds to complete because each of those statements takes approximately 52 seconds to complete 10 executions each.

But each of those three statements references a table with a policy that causes a predicate function to be executed for each parse and execute of the statement (one parse, 10 executes) and we can see 11 executions each of calls to fgac_pack.fgac_predicateN (N in 1,2,3), which take roughly 4.8 seconds each, for a total of about 52 seconds across 11 executions.

But those calls to the functions (11 each) all execute the same “connect by” query that appears as statement 2 in the list – for a total off 33 calls of a SQL statement that averages 4.75 seconds – and almost all of the “real” database time is in that 33 calls (33 * 4.75 = 156.75).

So we count 157 seconds because that’s the time spent in the “connect by” queries”, but we count that time again (but under PL/SQL execution) because it’s called from the policy functions, then count it again (under SQL execution) because the functions are called by the “select count(*) from fgac_refN” queries, then count it one final time (under SQL execution) for the driving query. This gives us a total 300% of the actual database time apparently being spent in SQL and 100% apparently being spent in PL/SQL.

You’ll notice that “real” optimisation of the SQL that does run would have taken just fractions of a second (as we saw in the Instance Activity Statistics); but one execution of the “connect by” query would have been associated with the first parse call of each of the fgac_refN queries, so 15 seconds of policy function / connect by query time would have been accounted under the parse time elapsed / hard parse time elapsed we saw in the Time Model statistics.

One of the strangest things about the reporting showed up in the ASH figures – which only ever managed to sample three SQL_IDs, reporting them as “on CPU” in every single sample, and those three SQL_IDs were for the “select count(*) from fgac_refN” queries; the “connect by” queries – the real consumer of CPU resource – didn’t get into the ASH sample at all.

Footnote

I did two things to make an important anomaly very obvious – I added a CPU intensive query to the policy function so that it was easy to see where the time was really going, and I made the VPD policy_type “dynamic” so that the policy function would execute on every parse and execute against the underlying table.

In real life the typcial SQL called within a policy function is very lightweight, and policies tend to be declared with type “context_sensitive”, and this combination minimises the number of calls to the function and the cost of each call. Unfortunately there are 3rd party applications on the market that include complex PL/SQL frameworks in their policy functions and then have the application server reset the session context for every call to the database.

And that’s where I came in – looking at a system where 10% of the CPU was being spent on parsing that apparently couldn’t possibly be happening.

August 4, 2019

Parse Puzzle

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 10:46 am BST Aug 4,2019

Here are some details from an AWR report covering a few minutes in the lifetime of an instance of 18.3. It’s a carefully constructed demonstration and all I’ve done is take an AWR snapshot, execute a single SQL statement, then take another snapshot, so the only thing captured by the report is the work done in that brief time interval. The purpose of the exercise is to demonstrate how some Oracle features can make a complete nonsense of the AWR. (I have, as I often do, produced a model that reproduces an affect that can appear in production but exaggerates the effect to make it more clearly visible.)

First the Time Model statistics:

                                                                % of  % of Total
Statistic Name                                       Time (s) DB Time   CPU Time
------------------------------------------ ------------------ ------- ----------
sql execute elapsed time                                157.8   100.0
DB CPU                                                  157.5    99.8       97.3
parse time elapsed                                       13.6     8.6
hard parse elapsed time                                  13.6     8.6
PL/SQL execution elapsed time                             0.3      .2
PL/SQL compilation elapsed time                           0.0      .0
hard parse (sharing criteria) elapsed time                0.0      .0
repeated bind elapsed time                                0.0      .0
DB time                                                 157.8
background elapsed time                                   7.9
background cpu time                                       4.4                2.7
total CPU time                                          161.9
                          ------------------------------------------------------


Note particularly the parse times – the Time Model show 13.6 seconds spent in (hard) parsing.

Note also that (with a small error) DB time = DB CPU = SQL execute elapsed time, and the background time is very small (in particular it’s less than the parse time). This background time, by the way, is probably related to things that Oracle does behind the scenes when you take an AWR snapshot or run an AWR report.

Given the significant amount of time spent in hard parsing let’s take a look at the Instance Activity statistics – picking only the statistics relating to parse calls:


Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
parse count (describe)                            0            0.0           0.0
parse count (failures)                            0            0.0           0.0
parse count (hard)                              325            2.1         325.0
parse count (total)                           1,662           10.5       1,662.0
parse time cpu                                   39            0.3          39.0
parse time elapsed                               42            0.3          42.0

Although the Time Model thinks Oracle has spent 13.6 seconds in (hard) parse time, the Instance Activity Statistics says it has only spent 0.42 seconds (the statistic is repored in hundredths) That a fairly significant difference of opinion. So let’s see if we can find out more from the “SQL ordered by …”, and I’m only going to show you one heading as a teaser for the rest of the weekend:


SQL ordered by Elapsed Time               DB/Inst: OR18/or18  Snaps: 2059-2060
-> Resources reported for PL/SQL code includes the resources used by all SQL
   statements called by the code.
-> % Total DB Time is the Elapsed Time of the SQL statement divided
   into the Total Database Time multiplied by 100
-> %Total - Elapsed Time  as a percentage of Total DB time
-> %CPU   - CPU Time      as a percentage of Elapsed Time
-> %IO    - User I/O Time as a percentage of Elapsed Time
-> Captured SQL account for  302.6% of Total DB Time (s):             158
-> Captured PL/SQL account for  101.7% of Total DB Time (s):             158

How do you squeeze 400% of the elapsed time into SQL and PL/SQL executions? (Observation: it’s not an IBM P9 taking advantage of SMT/4)

One last set of stats – which will have some room for statistical error since they come from v$active_session_history:


select
        in_parse, in_hard_parse, sql_id, sql_exec_id, count(*)
from
        v$active_session_history  ash
where
        session_id = &m_session_id
and     sample_id between &m_start_sample_id and &m_end_sample_id
group by
        in_parse, in_hard_parse, sql_id, sql_exec_id
order by
        count(*)
/

I I SQL_ID	  SQL_EXEC_ID	COUNT(*)
- - ------------- ----------- ----------
Y Y 964u0zv0rwpw1		       3
Y Y awk070fhzd4vs		       4
Y Y 5531kmrvrzcxy		       5

N N 5531kmrvrzcxy		      42
N N 964u0zv0rwpw1		      42
N N awk070fhzd4vs		      51

6 rows selected.

So the ASH information seems to agree (reasonably closely) with the Time Model statistics – in the interval of the snapshot it’s noted 12 samples (which statistically represents 12 seconds) of hard parse time (and though my query doesn’t show it, the “event” is  null, i.e. “on CPU”).

 

To be continued, some time later this week …

June 14, 2019

Trouble-shooting

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 9:19 am BST Jun 14,2019

Here’s an answer I’ve just offered on the ODC database forum to a fairly generic type of problem.

The question was about finding out why a “program” that used to take only 10 minutes to complete is currently taking significantly longer. The system is running Standard Edition, and the program runs once per day. There’s some emphasis on the desirability of taking action while the program is still running with the following as the most recent statement of the requirements:

We have a program which run daily 10minutes and suddenly one day,it is running for more than 10minutes…in this case,we are asked to look into the backend session to check what exactly the session is doing.I understand we have to check the events,last sql etc..but we need to get the work done by that session in terms of buffergets or physical reads(in case of standard edition)

1)In this case,we cannot enable trace to the session as it already started running and 10 minutes passed way.If we enable trace at this point,will it gives the required info?

2)To check the statistics of this specific session,what is the best way to proceed and this is my initial question.One of my friend pointed out to check v$sess_io and he is not sure whether it will return cumulative values because this view has only sid and not having serial#..and sid is not unique

So, answering the questions as stated, with a little bit of padding:

1) In this case, we cannot enable trace to the session as it already started running and 10 minutes passed way.If we enable trace at this point,will it gives the required info?

If the session has started and has been running for 10 minutes it’s still possible to force tracing into the session and, depending what the program does, you may be lucky enough to get enough information in the trace/tkprof file to help you. The “most-approved” way of doing this for a session is probably through a call to dbms_monitor.session_trace_enable(), but if that’s a package that Standard Edition is not licensed to use then there’s dbms_system.set_sql_trace_in_session().

If this doesn’t help, and if the session is still live and running, you could also look at v$open_cursor for that SID to see what SQL statements (sql_id, child_address, last_active_time and first 60 characters of the text) are still open for that session, then query v$sql for more details about those statements (using sql_id and child_address). The stats you find in those statements are accumulated across all executions by all sessions from the moment the cursor went into the library cache, but if this is a program that runs once per day then it’s possible that the critical statement will only be run by that one session, and the cursor from the previous day will have aged out of the library cache so that what you see is just today’s run.

Since you’re on Standard Edition and don’t have access to the AWR you should have installed Statspack – which gives you nearly everything that you can get from the AWR reports (the most significant difference being the absence of the v$active_session_history – but there are open-source emulators that you can install as a fairly good substitute for that functionality). If there is one statement in your program that does a lot of work then it’s possible that it might be one of the top N statements in a Statspack snapshot.

If this program is a known modular piece of code could you alter the mechanism that calls it to include calls to enable session tracing at the start of the program (and disable it, perhaps, at the end of the progam).  This might be by modifying the code directly, or by modifying the wrapper that drive the program, or by adding a logon trigger if there’s a mechanism that would allow Oracle to recognise the specific session that runs this particular program, or if something sets an identifiable (and unambiguous) module and/or action as part of calling the program then you could use the dbms_monitor package to enable tracing for (e.g.) a particular module and action on a permanent basis.

2) To check the statistics of this specific session,what is the best way to proceed and this is my initial question.One of my friend pointed out to check v$sess_io and he is not sure whether it will return cumulative values because this view has only sid and not having serial#..and sid is not unique

An answer: the stats are for one SID and SERIALl#, whether you’re looking at v$sess_io, v$sesstat, v$session_event, v$sess_time_model and any other v$sesXXX views that I can’t recall off-hand.  In passing, if you can add a wrapper to the calling code, capturing sessions activity stats (v$sesstat) wait time (v$session_event) and time model summary (v$sess_time_model) is a very useful debugging aid.

And an explanation: the “session” array is a fixed size array, and the SID is the array subscript of the entry your session acquired at logon. Since the array is fixed size Oracle has to keep re-using the array entries so each time it re-uses an array entry it increments the serial# so that (sid, serial#) becomes the unique identifier across history[1]. As it acquires the array entry it also resets/re-initializes all the other v$sesXXX arrays/linked list/structures.

The one thing to watch out for when you try to capture any of the session numbers is that you don’t query these v$sesXXX things twice to find a difference unless you also capture the serial# at the same time so that you can be sure that the first session didn’t log off and a second session log on and reuse the same SID between your two snapshots.  (It’s quite possible that this will be obvious even if you haven’t captured the serial#, because you may spot that some of the values that should only increase with time have decreased)

 

Footnote

[1] I think there was a time when restarting an instance would reset the serial# to 1 and the serial# has to wrap eventually and the wrap used to be at 65536 because it was stored as a two-byte number – which doesn’t seem to have changed.  Nowadays the serial# seems to be maintained across instance restart (I wonder how that works with carefully timed instance crashes), and given the amount of history that Oracle could maintain in the AWR I suppose there could be some scope for connect two pieces of history that were really from two different sessions.

 

Next Page »

Powered by WordPress.com.