Oracle Scratchpad

January 25, 2022

Catalog Catalogue

Filed under: Oracle — Jonathan Lewis @ 12:22 pm GMT Jan 25,2022

This is a good starting point for researching a topic on my blog. It’s simply a list of the various catalog(ue)s of articles that I’ve written over the years. There’s still a lot of work to do, though, as there are more than 1,500 published articles on the blog, and though I have published a few of the catalogues already they are almost guaranteed to be incomplete.

The target state is for each catalogue to list all relevant articles (which means some articles will appear in more than one catalogue), generally in reverse order of publication, with the date of each entry.

In the short term this catalogue will simply record, for each catalogue listed, the date on which that catalogue was last updated.

255 column catalogue

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 12:17 pm GMT Jan 25,2022

This is a short list of articles discussing the problems (almost always) and oddities that appear when you define a table with more than 255 “real” columns. Odd details are always subject to change over time/version – so if it matters, test it and check it.
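If you want to know whether you have any such tables, a query like the following (my addition here, not one of the listed articles) will find candidates. Bear in mind that hidden columns (e.g. columns that have been "set unused") also count towards the 255-column row-piece limit, so dba_tab_cols may be worth a second check for tables whose count is close to the limit:

select
        owner, table_name, count(*)     column_count
from
        dba_tab_columns
group by
        owner, table_name
having
        count(*) > 255
order by
        column_count desc
;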

Articles from other authors:

January 20, 2022

Indexing Catalogue

Filed under: Oracle — Jonathan Lewis @ 1:07 pm GMT Jan 20,2022

This is a list of all articles I’ve written that talk about indexes and indexing strategy. Each entry has a date stamp and a short note of the contents. The articles are (will be) grouped in a couple of different categories, but otherwise are generally listed most-recent first.

I am currently working backwards through my entire blog adding articles to catalogues as I go, so any individual catalogue may grow very slowly.

Engineering Indexes

January 19, 2022

Hash Aggregation – 2

Filed under: Execution plans,Infrastructure,Oracle,Performance,Tuning — Jonathan Lewis @ 12:03 pm GMT Jan 19,2022

In the note I wrote a couple of days ago, about the way the optimizer switches from a hash group by to a sort group by when you add an order by X,Y clause to a query that already has a group by X,Y, I mentioned that I had first noticed the switch about 15 years ago (shortly after Oracle introduced hash aggregation, in fact), but that it was only the second of two details I’d noted when experimenting with the new operation. The first thing I’d noticed came from an even simpler example, and here’s a (cryptic) clue to what it was:


column operation_type format a24

select
        operation_type, count(*)
from
        v$sql_workarea
group by
        operation_type
order by
        operation_type
;

OPERATION_TYPE             COUNT(*)
------------------------ ----------
APPROXIMATE NDV                   1
BUFFER                          130
CONNECT-BY (SORT)                10
GROUP BY (HASH)                  12
GROUP BY (SORT)                 240
HASH-JOIN                       466
IDX MAINTENANCE (SORT)           39
LOAD WRITE BUFFERS               10
RANGE TQ COMPUTATION             13
RESULT CACHE                      4
SORT (v1)                        10
SORT (v2)                       147
WINDOW (SORT)                    35

The clue isn’t in the query, it’s in what’s missing from the result set, so here’s some code to create and query some data to make it easier to spot the anomaly:

rem
rem     Script:         hash_agg.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sept 2007
rem

create table t1
as
with generator as (
        select  rownum id
        from    dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        lpad(mod(rownum-1,1000),6)      small_vc_K,
        lpad(rownum-1,6)                small_vc_M
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid wordpress format issue
;

set serveroutput off

prompt  ===========================
prompt  1000 distinct values (hash)
prompt  ===========================

select
        /*+ gather_plan_statistics 1000 */
        count(*)
from
        (
        select  /*+ no_merge */
                distinct small_vc_K
        from
                t1
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

prompt  ===========================
prompt  1000 distinct values (sort)
prompt  ===========================

select
        /*+ gather_plan_statistics 1000 */
        count(*)
from
        (
        select  /*+ no_merge no_use_hash_aggregation */
                distinct small_vc_K
        from
                t1
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

I’ve added the /*+ gather_plan_statistics */ hint to the query so that I can check on the rowsource execution stats and (particularly) the memory and/or temporary space used; and I’ve blocked hash aggregation in one of the two queries, so I expect to see a “hash unique” operation in the first query and a “sort unique” operation in the second. Here’s what I get from 19.11.0.0:


===========================
1000 distinct values (hash)
===========================

  COUNT(*)
----------
      1000

SQL_ID  1baadqgv02r6b, child number 0
-------------------------------------
select  /*+ gather_plan_statistics 1000 */  count(*) from  (  select
/*+ no_merge */   distinct small_vc_K  from   t1  )

Plan hash value: 171694178

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |   765 (100)|      1 |00:00:00.07 |    2637 |   2632 |       |       |          |
|   1 |  SORT AGGREGATE      |      |      1 |      1 |            |      1 |00:00:00.07 |    2637 |   2632 |       |       |          |
|   2 |   VIEW               |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.07 |    2637 |   2632 |       |       |          |
|   3 |    HASH UNIQUE       |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.07 |    2637 |   2632 |  1889K|  1889K| 1417K (0)|
|   4 |     TABLE ACCESS FULL| T1   |      1 |   1000K|   405  (17)|   1000K|00:00:00.04 |    2637 |   2632 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

17 rows selected.

===========================
1000 distinct values (sort)
===========================

  COUNT(*)
----------
      1000

SQL_ID  a66rqhgw7a6dk, child number 0
-------------------------------------
select  /*+ gather_plan_statistics 1000 */  count(*) from  (  select
/*+ no_merge no_use_hash_aggregation */   distinct small_vc_K  from
t1  )

Plan hash value: 1750119335

-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |   765 (100)|      1 |00:00:00.22 |    2637 |       |       |          |
|   1 |  SORT AGGREGATE      |      |      1 |      1 |            |      1 |00:00:00.22 |    2637 |       |       |          |
|   2 |   VIEW               |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.22 |    2637 |       |       |          |
|   3 |    SORT UNIQUE       |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.22 |    2637 | 48128 | 48128 |43008  (0)|
|   4 |     TABLE ACCESS FULL| T1   |      1 |   1000K|   405  (17)|   1000K|00:00:00.02 |    2637 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

Look back at the summary of v$sql_workarea. Can you now spot something that might be missing? Since we’ve now got two execution plans with their sql_ids, let’s run a much more precise query against the view.

select
        sql_id, child_number, operation_id, operation_type ,
        total_executions, last_memory_used/1024 last_mem_kb
from
        v$sql_workarea
where
        sql_id in (
                '1baadqgv02r6b',
                'a66rqhgw7a6dk'
)
order by
        sql_id, child_number, operation_id
/

SQL_ID        CHILD_NUMBER OPERATION_ID OPERATION_TYPE       TOTAL_EXECUTIONS LAST_MEM_KB
------------- ------------ ------------ -------------------- ---------------- -----------
1baadqgv02r6b            0            3 GROUP BY (HASH)                     1       1,417
a66rqhgw7a6dk            0            3 GROUP BY (SORT)                     1          42

The first thing to note, of course, is that the sort aggregate at operation 1 doesn’t report a workarea at all; but this shouldn’t be a surprise: the operation is simply counting rows as they arrive, so there’s no real sorting going on.

Operation 3 in both cases is the more interesting one. In the plan it’s reported as a “hash/sort unique” but the workarea operation has changed this to a “group by (hash/sort)”.  It’s a little detail that probably won’t matter to most people most of the time – but it’s the sort of thing that can waste a couple of hours of time when you’re trying to analyze a performance oddity.

And another thing …

You’ll have noticed, by the way, that the hash unique operation actually demanded far more memory than the sort unique operation; 1.4MB compared to 42KB. In a large production system this may be fairly irrelevant, especially since the discrepancy tends to disappear for higher volumes, and hardly matters if the operation spills to disc. However, if you have a large number of processes doing a lot of small queries using hash aggregation you may find that cumulatively they use up an unexpected fraction of your pga_aggregate_target.
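If you want to check whether this is happening on a live system, a query like the following (my sketch, not part of the original investigation – check the view’s columns in your version) will report the memory currently allocated to active hash aggregation workareas:

select
        sid, sql_id, operation_type,
        round(actual_mem_used/1024)     mem_used_kb
from
        v$sql_workarea_active
where
        operation_type like 'GROUP BY%'
order by
        actual_mem_used desc
;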

There is some scope (with the usual “confirm with Oracle support” caveat) for modifying this behaviour with a fix_control:

16792882    QKSFM_COMPILATION_16792882  Cardinality threshold for hash aggregation(raised to power of 10)

alter session set "_fix_control"='16792882:3';

The default value is 0 and the maximum value seems to be 3; the description about raising to the power of 10 seems to be the wrong way round, but I haven’t done any careful testing. When I set the value to 3 the hash unique with an estimated output of 1,000 switched to a sort unique (hence my suggestion that the threshold behaves like 10^N rather than N^10).
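You can check the current session-level setting of the fix control with a query like this (my addition):

select
        bugno, value, description
from
        v$session_fix_control
where
        bugno = 16792882
;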

Footnote

In the previous note I pointed out that Oracle would use a single sort group by operation rather than a consecutive hash group by / sort order by pair of operations. This simpler example helps to explain why. If you check the CBO trace files for the two queries you can compare the costs of the inline aggregate view.

From a 19.11.0.0 trace file here are two costing fragments for the hash group by strategy; the first bit is the cost of the tablescan that acquires the rows (operation 4), the second is the final cost of the inline view / query block:


  Best:: AccessPath: TableScan
         Cost: 404.639881  Degree: 1  Resp: 404.639881  Card: 1000000.000000  Bytes: 0.000000

Final cost for query block SEL$2 (#2) - All Rows Plan:
  Best join order: 1
  Cost: 764.843155  Degree: 1  Card: 1000000.000000  Bytes: 7000000.000000
  Resc: 764.843155  Resc_io: 337.000000  Resc_cpu: 1069607888
  Resp: 764.843155  Resp_io: 337.000000  Resc_cpu: 1069607888

I won’t show you the equivalent extracts for the example where I blocked hash aggregation because the numbers are identical. So there’s no way that Oracle will want to do a hash group by followed by a sort order by when it can do just a single sort group by that costs exactly the same as the hash group by operation alone. This is a shame, and a little ironic, because when Oracle Corp introduced hash aggregation they made a bit of a fuss about how much more efficient it was than sorting – but it looks like no-one told the optimizer about this.

January 17, 2022

Hash Aggregation – 1

Filed under: CBO,Hints,Oracle,Performance,Tuning — Jonathan Lewis @ 1:15 pm GMT Jan 17,2022

Here’s an observation I made some time in 2007, but didn’t mention online until a (possibly) relevant question appeared on the Oracle database forum in 2017; and the topic reappeared in a nearly unrelated question a little while ago. The 2017 question was this:

I am running following SQL, which it taking almost 4-5 seconds and returning only 18 records.

SELECT
   SUM(TOTAL_AMOUNT) C1,
   a.LEVEL2_ENAME AS c2,
   c.CURR_YEARMO AS c3
FROM TERRITORY_PRESET_MSNP a,
   CV_RESTATED_MSNP b
   LEFT OUTER JOIN
   MONTH_D c
   ON b.YEARMO = c.CURR_YEARMO,
   PRODUCT_OFFERING d
WHERE   b.PO_ID = d.ROW_ID
    AND b.DATASOURCE_ID = 10
    AND b.YEARMO = 201704
    AND b.OWNER_TERR_ID = a.TERR_ID
    AND c.CURR_YEARMO = 201704
    AND a.YEARMO = 201706
GROUP BY c.CURR_YEARMO, a.LEVEL2_ENAME
ORDER BY C3, C2;

If I remove the ORDER BY clause it is returning results in 1 second.

Before saying anything else, I’ll just make a few points about the SQL:

  • It’s not a good idea to mix traditional Oracle syntax with “ANSI” syntax – it’s likely to make things harder for the next person to read the code and there’s just a slight possibility that the rewrite that Oracle applies to hide the ANSI syntax may block some of the possible execution paths.
  • The C3, C2 in the order by clause are the column aliases for the curr_yearmo, level2_ename columns used in the group by clause. Although Oracle allows you to use aliases in the order by (but not in the group by), doing so can only make the SQL a little harder to interpret (especially in a case like this where you have both clauses).
  • There’s a left outer join to month_d (aliased as c), but the where clause then includes the predicate c.CURR_YEARMO = 201704 which will eliminate any rows where curr_yearmo is null, thus converting (we hope – but the mix and match syntax might introduce a side-effect) the outer join to an inner join – so maybe that’s a design error in the SQL.

Addressing the question, though, the first thought (rapidly eliminated) is that perhaps this is the standard complaint of the GUI interface: “it’s fast until I add an order by clause”.

The commonest reason for this complaint is that the typical GUI interface shows you the first few rows and waits for you to page down, so your impression of the response time is “the time to see the first few rows” rather than “the time to get the complete result set” when it might take much more time to return the entire result set. When you add an order by clause it’s possible that Oracle will have to sort the entire result set before you see any of it. It’s often the difference between “first few rows” and “entire result set” that triggers the complaint.

In this case the “rapid elimination” of this thought is due to the OP saying the result set was only 18 rows, which is likely to produce the gut feeling that it shouldn’t take very long for Oracle to sort 18 rows if it had to find all of them before displaying them. On the other hand the thought might need a little follow-up, after all:

  • it’s possible that the GUI is only displaying 15 rows at a time and it takes a lot of time to find the extra 3 rows. Just think of a tablescan with a filter subquery when the rows you want are the first few in the table. Without an order by the rows can be displayed as they are found; with an order by Oracle will have to get to the end of the tablescan before the rows can be sorted and displayed.
  • the optimizer can produce terrible estimates and the order by clause might prompt it to say “if I start with a different table, driving through a specific index, and change the join order then I won’t have to do any sorting for the order by clause”. The resulting path may be a very bad idea if the arithmetic has produced the wrong results.

The OP hasn’t shown us the execution plan – and that’s what we really need to see; but there is an alternative guess we could make about what the optimizer is doing that would affect the performance this much.

The query is an aggregate query – we have a group by. Since 10g Oracle has been able to use “hash aggregation” – which shows up as the HASH GROUP BY operation in an execution plan. Here’s a little demo script, with a couple of sample queries:

rem
rem     Script:         sort_hash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2012
rem 

create table t1
as
with generator as (
        select
                rownum id 
        from dual 
        connect by 
                level <= 1e4  -- > comment to avoid wordpress format issue
)
select
        trunc(dbms_random.value(0,262144))      n_256K,
        trunc(dbms_random.value(0,131072))      n_128K,
        trunc(dbms_random.value(0,8192))        n_8k
from
        generator       v1,
        generator       v2
where
        rownum <= 8 * 1048576  -- > comment to avoid wordpress format issue
;

set arraysize 1000
set timing on
set autotrace traceonly 

prompt  ===========
prompt  No Order by
prompt  ===========

select
        n_8K, count(*) ct
from
        t1
group by
        n_8k
;


prompt  =============
prompt  With Order by
prompt  =============

select
        n_8K, count(*) ct
from
        t1
group by
        n_8k
order by
        1
;


My table has 8M rows, and my queries target the column with 8K distinct values. I’ve enabled autotrace from SQL*Plus, set a large arraysize (to reduce time lost to SQL*Net round-trips), and set timing on so we can get an elapsed time for total execution. I’ve set autotrace to “traceonly” so that the SQL*Plus client will fetch the data but won’t waste resources formatting it – I’m not actually interested in the handful of execution statistics that autotrace will report.

Here are the two sets of results from a test run on 19.11.0.0. Note, particularly, the figure for Elapsed:


===========
No Order by
===========

8192 rows selected.

Elapsed: 00:00:00.58

Execution Plan
----------------------------------------------------------
Plan hash value: 136660032

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
---------------------------------------------------------------------------

=============
With Order by
=============

8192 rows selected.

Elapsed: 00:00:03.03

Execution Plan
----------------------------------------------------------
Plan hash value: 3946799371

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   1 |  SORT GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
---------------------------------------------------------------------------

The time has jumped from slightly under 0.6 seconds to just over 3 seconds as the critical operation changes from a HASH GROUP BY to a SORT GROUP BY (even though the estimated cost, hence predicted run-time, of execution has not changed).

Your first thought at this point is probably along the lines of “surely it doesn’t take 2.4 seconds to sort 8,192 small rows, why doesn’t Oracle do a hash group by followed by a sort order by?” The answer seems to be “it just doesn’t”. So here’s one way to make it happen (with execution plan and elapsed time from 19.11.0.0 again):

select
        dist_8k, ct
from
        (
        select  /*+ no_merge */
                n_8K dist_8k, count(*) ct
        from
                t1
        group by
                n_8k
        )
order by 
        dist_8k
;

8192 rows selected.

Elapsed: 00:00:00.59

Execution Plan
----------------------------------------------------------
Plan hash value: 1705136228

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |  8192 |   208K|  6938  (61)| 00:00:01 |
|   1 |  SORT ORDER BY       |      |  8192 |   208K|  6938  (61)| 00:00:01 |
|   2 |   VIEW               |      |  8192 |   208K|  6938  (61)| 00:00:01 |
|   3 |    HASH GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
-----------------------------------------------------------------------------


Again the estimated cost of execution doesn’t (to the nearest whole number) change from the basic cost of the hash aggregation – but we have brought the time back down to just under 0.6 seconds.

It’s worth reminding you at this point that if you can re-engineer any SQL that’s performing badly and can see that the driving core of the query can be reduced to something much simpler and more efficient, then wrapping that core into an inline view with the /*+ no_merge */ hint (and possibly moving it up into a “with subquery” clause) might be the safest first step and most effective way of improving performance.
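As a sketch of that last suggestion (my rewrite, not from the original tests), the aggregate query could be restated with a factored subquery; I’ve used the (undocumented, but widely used) materialize hint in place of no_merge to stop the optimizer from merging the aggregation back into the outer query:

with agg as (
        select  /*+ materialize */
                n_8K            dist_8k,
                count(*)        ct
        from
                t1
        group by
                n_8k
)
select
        dist_8k, ct
from
        agg
order by
        dist_8k
;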

There is an option for avoiding the query rewrite here – hint the path you want to see:


select  /*+ use_hash_aggregation */
        n_8K, count(*) ct
from
        t1
group by
        n_8k
order by 
        1
;

8192 rows selected.

Elapsed: 00:00:00.59

Execution Plan
----------------------------------------------------------
Plan hash value: 2808104874

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   1 |  SORT ORDER BY      |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   2 |   HASH GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
----------------------------------------------------------------------------

The nice thing about this, of course, is that you don’t actually have to edit the text; the hint could be attached to the query through an SQL Patch (or by abusing the SQL Profile or SQL Plan Baseline mechanisms).
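As a sketch of the SQL Patch approach (assuming a recent version where dbms_sqldiag.create_sql_patch is a documented function with a sql_id overload – check the signature in your version; &m_sql_id is a placeholder for the sql_id of the statement you want to patch):

declare
        v_patch_name    varchar2(128);
begin
        v_patch_name := dbms_sqldiag.create_sql_patch(
                sql_id          => '&m_sql_id',
                hint_text       => 'use_hash_aggregation(@sel$1)',
                name            => 'hash_agg_patch'
        );
end;
/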

The difficult part of hinting is finding the correct query block name for a more complex query. I simply added the hint /*+ use_hash_aggregation */ but the hint can be aimed at a query block so, in the absence of explicit query block names, I could have used the hint /*+ use_hash_aggregation(@sel$1) */ using the default naming.

In a more complex case you can find the appropriate query block name by using the ‘alias’ format option when generating the execution plan. Consider the following query (where t2 and t3 are created from view all_objects), with its initial execution plan:


explain plan for
select 
        t2.owner, count(*)
from 
        t2 
where 
        t2.object_id in (
                select  t3.object_id 
                from    t3 
                where   t3.object_type = 'TABLE'
        )
group by 
        t2.owner
order by 
        t2.owner
/


select * from table(dbms_xplan.display(format=>'alias'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 2646727453

------------------------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |    23 |   575 |   319   (8)| 00:00:01 |
|   1 |  SORT GROUP BY        |      |    23 |   575 |   319   (8)| 00:00:01 |
|*  2 |   HASH JOIN RIGHT SEMI|      |  2298 | 57450 |   317   (7)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL  | T3   |  2298 | 34470 |   158   (7)| 00:00:01 |
|   4 |    TABLE ACCESS FULL  | T2   | 57448 |   561K|   156   (6)| 00:00:01 |
------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$5DA710D3
   3 - SEL$5DA710D3 / T3@SEL$2
   4 - SEL$5DA710D3 / T2@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T2"."OBJECT_ID"="T3"."OBJECT_ID")
   3 - filter("T3"."OBJECT_TYPE"='TABLE')

The Query Block Name / Object Alias information tells us that the query block holding the sort group by at operation 1 is named SEL$5DA710D3 so we can use that as the target query block in the hint: /*+ use_hash_aggregation(@SEL$5DA710D3) */ and the plan changes to:


-------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |    23 |   575 |   319   (8)| 00:00:01 |
|   1 |  SORT ORDER BY         |      |    23 |   575 |   319   (8)| 00:00:01 |
|   2 |   HASH GROUP BY        |      |    23 |   575 |   319   (8)| 00:00:01 |
|*  3 |    HASH JOIN RIGHT SEMI|      |  2298 | 57450 |   317   (7)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL  | T3   |  2298 | 34470 |   158   (7)| 00:00:01 |
|   5 |     TABLE ACCESS FULL  | T2   | 57448 |   561K|   156   (6)| 00:00:01 |
-------------------------------------------------------------------------------

It’s possible that you’ll only notice a significant difference in performance (and, perhaps, PGA memory allocated) when you aggregate a large number of rows into a small result set. And it does seem that this is one case where the only way to get the plan you want, without a significant code rewrite, is through a hint.

tl;dr

If you have some code which does a “group by X, Y, Z order by X, Y, Z” (the order of the columns/expressions used doesn’t have to be the same for the two clauses) then Oracle will use a sort group by operation to handle the aggregation and ordering in a single step, even though we can find cases where hash aggregation followed by sort ordering is more efficient.

If you come across such a case then injecting the hint /*+ use_hash_aggregation(@query_block_name) */ may be the only way to change the execution plan if you’re not allowed to edit the SQL.

Footnote

In the second of the two links to the Oracle Developer Forum you’ll see that one poster pointed out that if the order by clause uses a suitable expression to substitute for one of the columns in the group by clause then you don’t need to hint the code, e.g.

group by
        numeric_column
order by
        numeric_column + 0

Very cute, but not a good idea.

Footnote 2

There is another part to the testing I started in 2007, and I’ll come back to that later in Hash Aggregation – 2.

January 13, 2022

“Priors”

Filed under: Non-technical,Philosophy — Jonathan Lewis @ 12:34 pm GMT Jan 13,2022

I was listening to a podcast recently about “Thinking Clearly” when the presenter gave a very rapid description of a common mistake that occurs when human intuition meets statistical information.

The particular bit of statistical analysis, when you meet its effects in real-life, can be hard to understand so I thought I’d draw a picture that might help people to understand the mechanism of putting a “threat percentage” into the correct context.

Imagine there is a “nasty medical condition X” that affects one in 20 of all men over the age of 55. Now imagine that there is a diagnostic test for the condition that is 90% accurate (meaning it will return the correct result 90 times out of 100 and the wrong result 10 times out of 100).

You are male, just past your 55th birthday, and your doctor tells you that you’ve tested positive. How terrified should you be or, to put it another way, which is more likely (and by how much): you’ve got X, or you haven’t got X?

The human, intuitive, response is simple: you’ve been told the test is 90% accurate and you’ve tested positive; so your digestive system is probably telling you that it’s almost certain you’ve got X.

The statistician’s approach (after the initial reflexive shock, perhaps) is to apply Bayesian thinking which can be turned into pictures as follows:

  • Step 1: What does “one in 20” look like? (left hand square)
  • Step 2: What does “90% accurate” look like? (right hand square)
  • Step 3: What does the picture look like when we superimpose the individual (independent) probabilities:

The big white block toward the top left tells us about the group of people who are correctly informed that they are X-free; the tiny red block in the bottom right tells us about the really unlucky ones who are told they’re X-free when they actually have X (false negatives).

Now compare the two pink blocks: the vertical rectangle to the right is the group that have X and test positive; the horizontal rectangle at the bottom is the group who don’t have X but test positive anyway (false positives).

The visual impression from this image is that if you’ve been told that you tested positive it’s nearly twice as likely that you are X-free than you are to have X: but let’s put the numbers into the picture to get a better handle on this. I’ll use a population of 10,000 (which, conveniently, I can represent as a square measuring 100 by 100):

In a population of 10,000

  • X-free = 95 * (90 + 10) = 9,500 (95%)
  • Got X = 5 * (90 + 10) = 500 (5%)
  • Correct result given = 90 * (95 + 5) = 9,000 (90%)
  • Wrong result given = 10 * (95 + 5) = 1,000 (10%)
  • X-free and received right result = 8,550 … (95 * 90, top left)
  • Got X and received wrong result = 50 … (5 * 10, bottom right)
  • Got X and received right result = 450 … (5 * 90, top right)
  • X-free and received wrong result = 950 … (95 * 10, bottom left)

Given the underlying population data (“priors”) for this example, we see that a positive result from a test that’s 90% accurate means there’s a probability of 450 / (950 + 450) = 0.32 (32%) that you’ve got X.
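For anyone who prefers the formula to the picture, the same arithmetic expressed through Bayes’ theorem (my addition) is:

\[
P(X \mid \text{positive}) =
\frac{P(\text{positive} \mid X)\,P(X)}
     {P(\text{positive} \mid X)\,P(X) + P(\text{positive} \mid \text{not } X)\,P(\text{not } X)}
= \frac{0.9 \times 0.05}{0.9 \times 0.05 + 0.1 \times 0.95}
= \frac{0.045}{0.140} \approx 0.32
\]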

Footnote

The result of this very simple hypothetical case is not intuitively obvious to most people; but if you thought it was easy to get to the right answer you might want to look at the Monty Hall problem, which also leads to Bertrand’s Boxes and the Three Prisoners problem.

January 12, 2022

19c Trivia

Filed under: Oracle — Jonathan Lewis @ 11:06 am GMT Jan 12,2022

Here’s a little detail I’ve just noticed in 19.11.0.0. There are a lot of parameters with names like “_bug%”, some having proper descriptions and some simply echoing their previous identities with descriptions like: “tenth spare parameter – integer”.

Parameter                                                 System Value         Description
--------------------------------------------------------- -------------------- ----------------------------------------------------------------------
_bug27355984_xt_preproc_timeout                           100                  external table preprocessor timeout
_bug28322973_asm_ownerid_trace_timeout                    600                  ownerid tracing threshold in seconds
_bug28482048_fb_log_max_rereads                           100                  Maximum number of rereads after reading a corrupt flashback block
_bug29274428_modsvc_call_out_enabled                      FALSE                one-hundred-and-forty-third spare parameter - boolean
_bug29290173_securefiles_dealloc_cfs                      0                    securefiles deallocate directly to CFS
_bug29302220_tcpinfo_statistics_save_atexit               FALSE                TCP Info Statistics Save At Exit
_bug29386835_enable_per_container_acl                     FALSE                Enable Per Container ACL
_bug29394014_allow_triggers_on_vpd_table                  FALSE                Allow triggers on VPD protected table in DM
_bug29504103_runtime_index_key_length_check               0                    allow runtime checking of index key length
_bug29558926_ctas_numfrags_hint                           0                    Hint for number of partition fragments to be loaded in a CTAS
_bug29796916_audit_sys_privilege                          TRUE                 one-hundred-and-seventy-ninth spare parameter - boolean
_bug29825525_bct_public_dba_buffer_dynresize_delay        3600                 sixth spare parameter - integer
_bug29903454_ksws_enable_alb                              TRUE                 enable ALB metrics processing
_bug30159581_cputime_limit_parallel_fptr                                       twenty-seventh spare parameter - integer
_bug30159581_runtime_limit_parallel_fptr                                       twenty-eighth spare parameter - integer
_bug30165506_unified_pga_max_pool_size                                         twenty-first spare parameter - integer
_bug30186319_lm_hb_lgio_timeout                           70                   LGWR io timeout in seconds
_bug30196629_dbopen_breakpoint                            none                 ALTER DATABASE OPEN codepath breakpoint
_bug30224950_kjac_direct_path_enabled                     TRUE                 one-hundred-and-fifty-fifth spare parameter - boolean
_bug30316897_allow_fallback_to_dbkey                      FALSE                one-hundred-and-sixty-third spare parameter - boolean
_bug30346330_hang_sess_enq_wait_blocked_session_threshold 5                    tenth spare parameter - integer
_bug30346330_hang_sess_enq_wait_resltn_trig_time          0                    seventeenth spare parameter - integer
_bug30352623_asm_enable_vfreloc_on_rm                     0                    Enable voting file relocation on rolling migration
_bug30382982_keep_relocated_source_pdb                    FALSE                one-hundred-and-sixty-second spare parameter - boolean
_bug30570083_mv_master_lock_timeout                       3                    timeout that mv create/drop waits to lock the master table
_bug30624792_hang_px_resolution_on_asm_enabled            TRUE                 Hang Management px resolution on ASM enabled
_bug31261467_disable_block_corruption_track               FALSE                one-hundred-and-seventy-eighth spare parameter - boolean
_bug31304945_utlfile_dirpath                              FALSE                return Utl_File Accessible
_bug31417007_lm_mp_additional_reg_buffers                 20                   forty-seventh spare parameter - integer
_bug31417007_lm_mp_share_lms_pools                        TRUE                 one-hundred-and-eighty-sixth spare parameter - boolean
_bug31417007_lm_mp_small_lms_pools                        TRUE                 one-hundred-and-eighty-seventh spare parameter - boolean
_bug31701239_hang_high_session_count_update_hb_threshold                       Hang Manager high session count update heartbeat threshold
_bug31720839_lm_ges_rht_alloc_size                        4096                 ges rht allocation chunk size
_bug31720839_lm_recovery_pdbw_reduction                   4                    recovery member pdbw reduction
_bug31734583_hang_analyze_elapsed_time_limit              1800                 Hang Analyze elapsed time limit in seconds
_bug31747989_cputime_limit_parallel_fptr                  60                   CPU-time limit for for parallel first-pass recovery
_bug31747989_runtime_limit_parallel_fptr                  60                   run-time limit for for parallel first-pass recovery
_bug31823051_asm_slow_client_timeout_percentage           50                   Percentage of the write cancel timeout to detect slow client.
_bug32093558_j0_deq_maxwait_time                          0                    forty-first spare parameter - integer
_bug32093558_srvntfn_max_job_time                         180                  fortieth spare parameter - integer
_bug32095097_defer_rcv_pdb_attach                         TRUE                 Defer PDB domain attach in instance recovery
_bug32171622_asm_reloc_restart_timeout                    900                  csecs before signaling ORA-15424 when waiting for orphans

42 rows selected.

It looks as if Oracle may be implementing a new strategy to replace the prevalence of fix_control settings to allow for improved visibility of the state of point patches. Possibly the aim is to cater for systems that have had to tweak multiple fix controls and have problems with the maximum length of the parameter value.

It’s interesting to note that 21.3.0.0 has only 11 such parameters – some new and a few matching those in 19.11.0.0. I haven’t checked yet whether all the parameters that have disappeared from 19.11.0.0 are still in 21.3.0.0 with the bug number prefix removed, but I’ve spotted a couple with that appearance.

Many years ago Coskan Gundogar wrote a note about checking v$system_fix_control to see if it gave any clues about unexpected (optimizer) behaviour: the method is still appropriate, but now you have another place to check.

Although one of these parameters is now the record holder for “longest parameter name”, the longest description belongs to a hidden parameter that isn’t a “bug” parameter.

Footnote

The script to generate this list is available in this note about trace files, with most of the columns commented out. I’ve run the script from SQL*Plus using the command


start params "\_bug"
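If you don’t have a copy of that script, a minimal sketch of the underlying query follows (my reconstruction – you need access to the x$ structures, which usually means connecting as SYS; the backslash in the parameter above is there to escape the underscore in the LIKE pattern):

select
        pi.ksppinm      name,
        sv.ksppstvl     system_value,
        pi.ksppdesc     description
from
        x$ksppi         pi,
        x$ksppsv        sv
where
        sv.indx = pi.indx
and     pi.ksppinm like '\_bug%' escape '\'
order by
        pi.ksppinm
;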

January 10, 2022

Control File Waits

Filed under: Infrastructure,LOBs,Oracle,Performance — Jonathan Lewis @ 2:10 pm GMT Jan 10,2022

Originally drafted 4th Aug 2011, this is a blog note that has been waiting a little over 10 years to be completed and published; and it’s probably worth doing it even now since the topic came up on the Oracle database forum just a couple of weeks ago.

The original title was CF Enqueues but I’ve combined it with notes from another draft called LOB write waits to get to the final title because a couple of small variations in my test script can produce all the annoying symptoms you may want to investigate in that area.

The forum note that triggered completion and publication was a complaint that’s common: “why is this so slow” but with a cause that comes up less frequently: “BLOB operations and control file parallel writes”. The OP was running Oracle 19.9 but was suffering from a problem which (when I checked my archive and MOS) turned out to have a history (and a workaround) that started with 8.0.4. Their performance issue was due to their BLOB definition which, with the storage information removed, was as follows:

LOB (CONTENT) STORE AS BASICFILE (
        ENABLE     STORAGE IN ROW
        CHUNK      32768
        PCTVERSION 10
        NOCACHE
        NOLOGGING
)

Note, particularly, the key words basicfile, nocache, and nologging. You should not be using basicfile LOBs in 19c, and both the nocache and nologging options can introduce problems of I/O and enqueue contention.

The impact of I/O and enqueue contention depends, of course, on the nature of the work being done – you might be wasting effort or losing time, but it might be a waste of resources that you’re not worried about and time that is effectively invisible to the end-user. For this user the task looked like an example of a “batch running as a pl/sql loop” repeatedly executing a statement similar to the following:


insert into table2(blob_column2)
select  blob_column1 
from    table1
where   id = :B1

A large fraction of the processing time was spent on “control file parallel write” (a consequence of nologging basicfile LOBs) and a further significant fraction was spent on “direct path write” (a consequence of nocache basicfile LOBs).

To understand the problem of the control file parallel write waits you need to know that every time a session makes a nologging change to a data file it also has to update an “unrecoverable SCN” field in the controlfile entry for that file. If you’re inserting lots of individual LOBs that turns into a lot of control file writes, and the time spent on those writes – plus the time spent waiting for other sessions to complete their writes and release the CF enqueue that may be protecting the control file from conflicting writes – can become significant.
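You can see the field that’s being maintained by querying v$datafile (a query I’ve added here for reference – the columns have been around for a very long time):

select
        file#, unrecoverable_change#, unrecoverable_time
from
        v$datafile
where
        unrecoverable_change# != 0
;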

As I said, this problem was first noted in 8.0.4, and an event was created to disable this control file update for LOBs (which were only of the basicfile variant at the time). The event was 10359 and the entry in $ORACLE_HOME/rdbms/mesg/oraus.msg for this event (which still has the same effect in 19.9) says:


10359, 00000, "turn off updates to control file for direct writes"
// *Cause:
// *Action:  Control files won't get updated for direct writes for  LOBs
//           when NOCACHE NOLOGGING is set. The only bad impact that it
//           can have is that if you are using the recovery manager,
//           it may affect a warning that says that the user should
//           back everything up. Now the recovery manager won't know
//           to tell you that the files that were updated with
//           unrecoverable events should be backed up.

Setting this event solved the OP’s performance problem.
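The event is set in the usual way – this is the same call that appears (commented out) in the test script further down the page:

alter session set events '10359 trace name context forever, level 1';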

It doesn’t really sound very nice, of course, setting an event that stops Oracle from telling you that “the files … should be backed up”, but the text does say the only bad impact is that it “may affect a warning”, not that future attempts at recovery are going to fail. In fact there’s also MOS Doc ID: 1058851.6 “Performance Degradation as a result of ‘enq: CF – contention’” (last updated March 2019 despite only having references to Oracle 8.0.4) which says (my emphasis):

If event 10359 is set to level 1, update of the control file with invalidation
redo is stopped.  This will not affect any recovery, but note that recovery
manager reports will be stale.

With event 10359 out of the way (but see footnote), I want to get back to my comment about switching to securefile LOBs. Here’s a very simple script I wrote a few years ago (and tweaked occasionally thereafter) to test the performance characteristics of LOBs:

rem
rem     Script: lob_write_waits.sql
rem     Author: Jonathan Lewis
rem     Dated:  Aug 2014
rem

drop table test_lobs;

@@setup

create table test_lobs (
        id              number(5),
        bytes           number(38),
        text_content    clob
)
lob (text_content)
store as 
--      securefile
        basicfile
        text_lob(
        disable storage in row
        nocache nologging
--
--      cache logging compress low                              --  0.26
--      cache reads logging compress low                        --  1.00
--      nocache logging compress low                            --  1.00
--
--      cache nologging compress low                            -- 14.00
--      cache reads nologging compress low                      --  1.00
--      nocache nologging compress low                          --  1.00
--
--      cache filesystem_like_logging compress low              -- 11.00
--      cache reads filesystem_like_logging compress low        --  1.00
--      nocache filesystem_like_logging compress low            --  1.00
)
;

insert into test_lobs values (1,1000, rpad('x',3000));
commit;

execute snap_my_stats.start_snap
execute snap_events.start_snap
execute snap_enqueues.start_snap

set serveroutput off
set feedback off

-- alter session set events '10359 trace name context forever, level 1';
-- alter session set db_unrecoverable_scn_tracking=false;
-- alter session set events '10046 trace name context forever, level 8';

declare
        c1 clob;
begin
        for r in 1..1000 loop
                insert into test_lobs values(r, 1000, lpad(r,1000,0));
                commit;
        end loop;
end;
/

commit;

alter session set events '10046 trace name context off';
alter session set events '10359 trace name context off';
alter session set db_unrecoverable_scn_tracking=true;

spool lob_write_waits

set serveroutput on
set feedback on

execute snap_enqueues.end_snap
execute snap_events.end_snap
execute snap_my_stats.end_snap

spool off

The script creates a table with a LOB column and inserts one row into it to avoid minor details of object creation and/or block formatting affecting the results I see from a loop test inserting 1000 rows one at a time with commit. You’ll notice that I’ve got lots of possible configurations for the LOB segment, including a choice between basicfile and securefile.

As it stands it will create a basicfile lob with the nocache nologging attributes (all the other options I have commented out are possible only for securefile LOBs thanks to the compress low clause.) Apart from the loop the script also allows a couple of parameters to be changed, or events to be set; and captures snapshots of sundry statistics.

If I allow the controlfile activity to take place, the following snippet shows the highlights of what my basicfile LOB does for me (or to me!) in 21c when the database is in ARCHIVELOG mode (which you should be if you are trying to test the effects of any unrecoverable or nologging operation – the database is not in force_logging mode, by the way):

----------------------------------
System enqueues
Interval:-      6 seconds
----------------------------------
Type    Requests       Waits     Success      Failed    Wait m/s Reason
----    --------       -----     -------      ------    -------- ------
CF         2,006           0       2,006           0           0 contention
XR         1,000           0       1,000           0           0 database force logging
TM         1,002           0       1,002           0           0 contention
TX         1,031           0       1,031           0           0 contention

---------------------------------------------------------
Session Events
Interval:-      6 seconds
---------------------------------------------------------
Event                                             Waits   Time_outs            Csec    Avg Csec    Max Csec
-----                                             -----   ---------            ----    --------    --------
direct path sync                                  1,000           0           125.32        .125           6
control file sequential read                      5,014           0             5.50        .001           0
control file parallel write                       4,000           0           385.75        .096           6
direct path write                                 1,000           0             3.02        .003           5

---------------------------------
Session stats
Interval:-  6 seconds
---------------------------------
Name                                                                         Value
----                                                                         -----
total cf enq hold time                                                       3,953
total number of cf enq holders                                               1,000
physical writes                                                              1,000
physical writes direct                                                       1,000
physical writes direct (lob)                                                 1,000
redo entries                                                                 2,638
redo size                                                                1,647,276
redo size for direct writes                                                102,460
KTFB alloc space (block)                                                 9,437,184
ASSM gsp:get free block                                                      1,014
lob writes                                                                   1,000

As you can see, there’s a significant penalty due to the CF (controlfile) enqueues and control file reads and writes; and if I simply set event 10359 to level 1 (or set db_unrecoverable_scn_tracking to false) the most significant change is that the CF enqueues and control file I/O disappear and the job takes about 2 seconds instead of 6.

There’s nothing particularly interesting about the session activity stats, but since the activity is nologging it’s interesting to note that the redo size is nowhere near the 8MB we might expect (8KB blocks x 1,000 LOBs) even though we can see that Oracle has had to allocate 9MB of space for (mostly) the LOB data segment. Each nologging/unrecoverable LOB data block has averaged about 100 bytes of redo for tracking unrecoverable blocks. (When I changed the LOB chunk to 32K this KTFB allocation increased to 32MB – but that shouldn’t be a surprise).

When I switch to securefile LOBs (with the nocache nologging options) the results look like this:

----------------------------------
System enqueues
Interval:-      2 seconds
----------------------------------
Type    Requests       Waits     Success      Failed    Wait m/s Reason
----    --------       -----     -------      ------    -------- ------
TM         1,001           0       1,001           0           0 contention
TX         2,073           0       2,073           0           0 contention
DW         1,057           0       1,057           0           0 contention

---------------------------------------------------------
Session Events
Interval:-      2 seconds
---------------------------------------------------------
Event                                             Waits   Time_outs            Csec    Avg Csec    Max Csec
-----                                             -----   ---------            ----    --------    --------
direct path write                                 1,000           0            85.72        .086           4

---------------------------------
Session stats
Interval:-  2 seconds
---------------------------------
Name                                                                         Value
----                                                                         -----
redo entries                                                                 7,737
redo size                                                                3,594,280
KTFB alloc space (block)                                                 1,114,112
securefile allocation bytes                                              8,192,000
securefile allocation chunks                                                 1,000
securefile direct write bytes                                            8,192,000
securefile direct write ops                                                  1,000
securefile bytes non-transformed                                         2,000,000

If I use the compression low option at the same time the statistic about securefile bytes non-transformed drops to zero – not surprisingly – and I see 1,000 WG enqueues with reason “delete fso” and 3,000 WG enqueues with reason “lock fso”, but the timing is basically unchanged. The DW enqueue is the “In memory Dispenser” where the id1/id2 reference a tablespace and block address, and the WG enqueue is the “Write gather local enqueue” where the id1/id2 report the LOB ID (in two halves).

The most significant change, though, and the one that made me spend time on looking at results and traces etc. was what happened when I set the system to use “cache nologging” (or “cache filesystem_like_logging”) – the effect was really surprising; the report on enqueues didn’t show any significant differences (the same WG enqueues appeared if I enabled compression), but this is what happened with the session events:

---------------------------------------------------------
Session Events
Interval:-      12 seconds
---------------------------------------------------------
Event                                             Waits   Time_outs            Csec    Avg Csec    Max Csec
-----                                             -----   ---------            ----    --------    --------
write complete waits                              1,000           0         1,007.02       1.007           5

For every LOB write there was a write complete wait that averaged close to one centisecond! That’s a fairly suspect number, looking more like a time-out from a self-timer than a wait that’s terminated by some sort of signal of an action completed. After thinking about what might be happening and browsing through the session activity stats I noticed the following clue:

---------------------------------
Session stats
Interval:-  12 seconds
---------------------------------
Name                                                                         Value
----                                                                         -----
messages sent                                                                2,000

That looks like two messages sent for each LOB inserted – which I confirmed by doing a few more tests with different numbers of LOBs. So who is receiving those messages (and not replying to them)? I re-ran the test taking a snapshot of the session stats for every session in the instance and extracted (using grep) the figures for just the “messages sent” and “messages received”. Here are the results, rearranged slightly:

SID:   283: - oracle - oracle@linux21.localdomain (W003)
messages sent                                                              4
SID:   293: - oracle - oracle@linux21.localdomain (W006)
messages sent                                                              3
SID:   294: - oracle - oracle@linux21.localdomain (W005)
messages sent                                                              4
SID:    20: - oracle - oracle@linux21.localdomain (W000)
messages sent                                                              8
SID:    33: - oracle - oracle@linux21.localdomain (W002)
messages sent                                                              2
SID:    44: - oracle - oracle@linux21.localdomain (W007)
messages sent                                                              3

SID:    51:TEST_USER - jonathan - sqlplus@linux21.localdomain (TNS V1-V3)
messages sent                                                          2,004    (2,000)
SID:   263: - oracle - oracle@linux21.localdomain (DBW0)
messages sent                                                          1,989    (2,000)
messages received                                                      3,021    (3,000)
SID:    11: - oracle - oracle@linux21.localdomain (LGWR)
messages sent                                                          1,021    (1,000)
messages received                                                      2,019    (2,000)

If we ignore the messages sent by Wnnn (which, I believe, go to DBW0) and allow for a few timing/rounding errors, we might decide to work with the figures in brackets above. With a few extra tests – changing “commit;” to “commit write wait;”, or commenting out the “commit;” – my best interpretation (i.e. guess) of these numbers came down to:

  • The user session sends 1,000 messages to DBWR (to write the nologging LOB) and 1,000 messages to LGWR to write the “ordinary” table redo (and the number 1,000 also shows up in the “redo synch writes” stat and “log file sync” waits if we use “commit write wait”)
  • LGWR receives 1,000 messages from the session, and 1,000 messages from DBW0 (which would relate to “LOB management” with redo OP code 26.4 and 13.53) and sends 1,000 completion messages to DBW0.
  • DBW0 receives 1,000 messages from the session and 1,000 messages from LGWR, and sends 1,000 “log LOB Maintenance” messages to LGWR.

And if you’ve been counting carefully you’ll have noticed that there are still 1,000 messages sent and received by DBW0 that I haven’t been able to account for. It’s possibly a “double-counting” error; or maybe there’s another process involved in a dialogue with DBW0 that doesn’t record its end of the dialogue, or maybe it’s DBW0 sending messages to itself(!). At some point I may remember how to switch on message tracing and check exactly which messages are sent from where to where, but for the moment I’m going to put a line under the investigation and just highlight the critical effect.

Whatever’s going on, a key detail from the experiments above is that the session does not seem to get any sort of message telling it that the unlogged LOB is safely on disc; it just goes to sleep for 1/100 second – presumably checking the buffer status when it wakes up to decide whether or not the cached LOB has been written.

Footnote

You may have come across the parameter db_unrecoverable_scn_tracking which appeared some time around 11.2.0.4 with the description: “Track nologging SCN in controlfile”. If this parameter is set to true (the default value) at instance startup then you can use event 10359 to switch the feature on and off at the session (and possibly system) level. More elegantly, though, you can set the parameter at the session level instead of using the numeric event to address the problem:

alter session set db_unrecoverable_scn_tracking=false;

I would advise against setting the value to false at instance startup, or even through a call to “alter system”; there’s a comment in Doc ID 12360160.8 – ALTER SYSTEM / SESSION of DB_UNRECOVERABLE_SCN_TRACKING that if the parameter is set to false at startup it can’t be changed dynamically thereafter.  (I haven’t tested if this last detail is still true.)
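
For completeness, the event-based mechanism uses the generic event syntax – treat the level as a sketch, since I haven’t re-verified it recently:

-- switch unrecoverable SCN tracking off for the session
-- (level 1 is my recollection - not recently re-tested)
alter session set events '10359 trace name context forever, level 1';

-- and switch it back on again
alter session set events '10359 trace name context off';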

tl;dr

You really shouldn’t be using basicfile LOBs with newer versions of Oracle; you should convert them to securefile LOBs.

If you do use basicfile LOBs then you may need to set event 10359 (or adjust the db_unrecoverable_scn_tracking parameter) if you want them to be nologging LOBs.

Even with securefile LOBs the options “cache nologging” and “cache filesystem_like_logging” introduce a surprising I/O wait time of 1/100 second per LOB, so those are two options to avoid.  All other securefile variations seemed to behave nicely (for me) but the amount of undo and redo can vary considerably from one option to another, so even with a 1/100 second time penalty you might still choose to use one of these options.
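
For reference, the options mentioned above belong in the LOB storage clause of the table definition – a minimal sketch (t_lob is just an illustrative name):

create table t_lob (
        id      number,
        c1      clob
)
lob (c1) store as securefile (
        cache nologging         -- one of the two options to avoid
)
;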

 

January 9, 2022

OERI 6051

Filed under: Block Size,humour,Indexing,Infrastructure,Oracle — Jonathan Lewis @ 9:27 pm GMT Jan 9,2022

How smashed (in the non-alcoholic sense) can an index be?

One of the components in the cost calculation for an indexed access path is the “blevel” (branch-level) or, indirectly, the “height” of the index. Both count the steps from the root block down to a leaf block (and all leaf blocks are at the same distance from the root – that’s the meaning of “balanced” in the expression “balanced B-tree”) but the height includes the leaf level in the count while the blevel excludes it and counts down only to the lowest level of branch blocks (so height = blevel + 1).

In many cases you will find that even with a few million entries in a single index segment the height may still be only 3 (blevel = 2), and it may take a few tens of millions of rows before an index needs to grow to height = 4 (blevel = 3).
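
As a quick sanity check, you can see the blevel the optimizer will use for any of your indexes in the statistics views (assuming the stats are reasonably up to date) – e.g. for the t1 table that appears in the script below:

select  index_name, blevel, leaf_blocks, num_rows
from    user_indexes
where   table_name = 'T1'
;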

It’s often the case that the number of index entries per leaf block and block pointers per branch block is around 200 to 400, so the rate at which the height/blevel grows is tiny compared to the rate at which the number of rows in the index increases. But algorithms often have weak points, and some time around the year 2000 I started demonstrating an edge case where I could crash a session in less than 3 seconds (and most of that time was spent on Oracle creating the crash dump) by inserting just 25 (carefully designed) rows into a table.

I published an article about this in 2005 (see footnote), but since then the algorithm has changed. My demo worked in versions up to 9.2.0.4; but in later versions Oracle Corp. modified the way that index blocks (possibly just the branch blocks) split at the low end of the index, making it harder to achieve a crash. If you have a MOS account you can check Doc ID 1748260.8: OERI:6051 possible during index manipulation.

The change wasn’t a response to my demo, of course; it was in response to a problem that could occur in production systems running some of the big “database agnostic” accounting or HR or CRM systems that created huge indexes on multiple columns. Even when the crash never occurred the nature of the application and its indexing strategy could result in some indexes growing to a ridiculous height that made a dramatic difference to the cost calculations (hence the desirability of the “best” index).

It’s harder, and less likely to happen in the wild, but it’s still possible to make the same crash occur even in the newest versions of Oracle. It will (probably) take roughly 8 million (power(2,23) + 1) rows and 32GB of space to crash (or 128GB if you want to play nicely and use an 8KB block size – and tweak my code a little further).

Richard Foote spotted a slide with a surprising blevel in a short presentation about CBO arithmetic by Maria Colgan a couple of days ago, so I thought it would be entertaining to tweak the old code to see if it could still cause the crash. So here it is:

rem
rem     Script:         silly_index_3a.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2004 / Jan 2022
rem     Purpose:        Build an index with a large blevel
rem
rem     Notes: 
rem     Uses 2K block size for tablespace holding the index
rem     Estimated run-time (for me) with m_blevel = 23 - ca. 1 hour
rem

define m_blevel = 5
define m_rows = power(2,&m_blevel)

drop table t1 purge;
create table t1 (v1 varchar2(8));

create index t1_i1 on t1(substrb(lpad(v1,1469,'0'),1,1469))
tablespace test_2k
;

-- execute snap_my_stats.start_snap

prompt  ===================================================
prompt  Inserting &m_rows (short) rows in reverse order
prompt  ===================================================

begin
        for i in reverse 1..&m_rows loop
                insert into t1 values (i);
--              commit;
        end loop;
end;
/

-- execute snap_my_stats.end_snap

prompt  ================
prompt  Validating index
prompt  ================

validate index t1_i1;

select 
        lf_rows, height, height-1 blevel, lf_blks, br_blks
from
        index_stats
;

column  object_id new_value m_object_id

select  object_id
from    user_objects
where   object_name = 'T1_I1'
/

alter session set events 'immediate trace name treedump level &m_object_id';

insert into t1(v1) values('0');

I’ve precreated a tablespace called test_2k with a block size of 2KB for this demo; if you want to run the full m_blevel = 23 test you’ll need a couple of percent over 32GB for this tablespace.

This script then creates a table in my default tablespace to hold a small character column, and a function-based index on that column that produces a character result of 1469 bytes (which gives me the largest possible index entry that’s allowed in a 2KB block size). The older version of the code used a simple lpad() to do this, but newer versions of Oracle decided that that could produce up to 2*1,469 bytes thanks to my default character set – hence the substrb(); note especially the b for byte.
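
If you want to see why the substrb() matters, try creating the older, simple lpad() version of the index in the 2KB tablespace; with a multi-byte character set I’d expect it to fail with something like ORA-01450 (maximum key length exceeded) – though I’ll hedge on the exact message:

-- expected to fail: the worst-case key length is 2 * 1,469 bytes
create index t1_i2 on t1(lpad(v1,1469,'0'))
tablespace test_2k
;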

With the structure in place I’ve then inserted numeric values in descending order into the table so that the index is constantly doing leaf block splits at the left hand (low) end.

Once I’ve populated the table I use a call to validate index so that I can report the number of rows, leaf blocks, branch blocks and the height (and blevel) of the index; then I find its object_id so that I can do a treedump of it.

For m_blevel = 5, here are the results of the query against index_stats after the call to validate the index:

   LF_ROWS     HEIGHT     BLEVEL    LF_BLKS    BR_BLKS
---------- ---------- ---------- ---------- ----------
        32          6          5         32         31

As you can see, setting m_blevel = 5 I get an index with blevel = 5, and 2^5 leaf blocks each holding one row. If you set m_blevel to 23 you’ll end up (after about 1 hour, probably) with a blevel of 23 and 8,388,608 rows and leaf blocks (and branch blocks = leaf blocks – 1: hence the 32GB+ requirement for the tablespace … 16M blocks at 2KB per block, plus ASSM overheads).

To show you what’s happening inside the index here’s the treedump (from 19.11.0.0) with m_blevel = 5

branch: 0x4c00204 79692292 (0: nrow: 2, level: 5)
   branch: 0x4c0022a 79692330 (-1: nrow: 2, level: 4)
      branch: 0x4c00212 79692306 (-1: nrow: 2, level: 3)
         branch: 0x4c00206 79692294 (-1: nrow: 2, level: 2)
            branch: 0x4c0020c 79692300 (-1: nrow: 2, level: 1)
               leaf: 0x4c00209 79692297 (-1: row:1.1 avs:370)
               leaf: 0x4c00249 79692361 (0: row:1.1 avs:370)
            branch: 0x4c00248 79692360 (0: nrow: 2, level: 1)
               leaf: 0x4c00247 79692359 (-1: row:1.1 avs:370)
               leaf: 0x4c00246 79692358 (0: row:1.1 avs:370)
         branch: 0x4c00244 79692356 (0: nrow: 2, level: 2)
            branch: 0x4c00243 79692355 (-1: nrow: 2, level: 1)
               leaf: 0x4c00242 79692354 (-1: row:1.1 avs:370)
               leaf: 0x4c0024f 79692367 (0: row:1.1 avs:370)
            branch: 0x4c0024e 79692366 (0: nrow: 2, level: 1)
               leaf: 0x4c00235 79692341 (-1: row:1.1 avs:370)
               leaf: 0x4c00239 79692345 (0: row:1.1 avs:370)
      branch: 0x4c00238 79692344 (0: nrow: 2, level: 3)
         branch: 0x4c00237 79692343 (-1: nrow: 2, level: 2)
            branch: 0x4c00236 79692342 (-1: nrow: 2, level: 1)
               leaf: 0x4c00234 79692340 (-1: row:1.1 avs:370)
               leaf: 0x4c00233 79692339 (0: row:1.1 avs:370)
            branch: 0x4c00232 79692338 (0: nrow: 2, level: 1)
               leaf: 0x4c00231 79692337 (-1: row:1.1 avs:370)
               leaf: 0x4c00230 79692336 (0: row:1.1 avs:370)
         branch: 0x4c0023f 79692351 (0: nrow: 2, level: 2)
            branch: 0x4c0023e 79692350 (-1: nrow: 2, level: 1)
               leaf: 0x4c0023d 79692349 (-1: row:1.1 avs:370)
               leaf: 0x4c0023c 79692348 (0: row:1.1 avs:370)
            branch: 0x4c00225 79692325 (0: nrow: 2, level: 1)
               leaf: 0x4c0022d 79692333 (-1: row:1.1 avs:370)
               leaf: 0x4c0022c 79692332 (0: row:1.1 avs:370)
   branch: 0x4c0022b 79692331 (0: nrow: 2, level: 4)
      branch: 0x4c00229 79692329 (-1: nrow: 2, level: 3)
         branch: 0x4c00228 79692328 (-1: nrow: 2, level: 2)
            branch: 0x4c00227 79692327 (-1: nrow: 2, level: 1)
               leaf: 0x4c00226 79692326 (-1: row:1.1 avs:370)
               leaf: 0x4c00224 79692324 (0: row:1.1 avs:370)
            branch: 0x4c00223 79692323 (0: nrow: 2, level: 1)
               leaf: 0x4c00222 79692322 (-1: row:1.1 avs:370)
               leaf: 0x4c0022f 79692335 (0: row:1.1 avs:370)
         branch: 0x4c0022e 79692334 (0: nrow: 2, level: 2)
            branch: 0x4c00215 79692309 (-1: nrow: 2, level: 1)
               leaf: 0x4c00219 79692313 (-1: row:1.1 avs:370)
               leaf: 0x4c00218 79692312 (0: row:1.1 avs:370)
            branch: 0x4c00217 79692311 (0: nrow: 2, level: 1)
               leaf: 0x4c00216 79692310 (-1: row:1.1 avs:370)
               leaf: 0x4c00214 79692308 (0: row:1.1 avs:370)
      branch: 0x4c00213 79692307 (0: nrow: 2, level: 3)
         branch: 0x4c00211 79692305 (-1: nrow: 2, level: 2)
            branch: 0x4c00210 79692304 (-1: nrow: 2, level: 1)
               leaf: 0x4c0021f 79692319 (-1: row:1.1 avs:370)
               leaf: 0x4c0021e 79692318 (0: row:1.1 avs:370)
            branch: 0x4c0021d 79692317 (0: nrow: 2, level: 1)
               leaf: 0x4c0021c 79692316 (-1: row:1.1 avs:370)
               leaf: 0x4c00208 79692296 (0: row:1.1 avs:370)
         branch: 0x4c00207 79692295 (0: nrow: 2, level: 2)
            branch: 0x4c00205 79692293 (-1: nrow: 2, level: 1)
               leaf: 0x4c0020f 79692303 (-1: row:1.1 avs:370)
               leaf: 0x4c0020e 79692302 (0: row:1.1 avs:370)
            branch: 0x4c0020d 79692301 (0: nrow: 2, level: 1)
               leaf: 0x4c0020b 79692299 (-1: row:1.1 avs:370)
               leaf: 0x4c0020a 79692298 (0: row:1.1 avs:370)
----- end tree dump

As you can see, every branch block (which includes the root block) holds exactly 2 entries, and every leaf block holds just one row.

Once you’ve tested the code with a couple of small starting values you might want to skip the validate index and treedump steps – they might take quite a long time (especially since the treedump will write a trace file of 16M+ lines). The other thing to watch out for is that the script will generate something like 200GB of redo and 72GB of undo – so you might want to remove the comment marker from the commit in my PL/SQL loop and check your auto undo settings and auto extend settings on the undo files.

I should also point out that I’ve used 1469 for the substrb(lpad()) because that’s the largest string I can use for the index definition – but the same index pattern of use (i.e. one row per leaf block, two children per branch block) will appear if you reduce this (for a 2KB block size, using ASSM) to 915 bytes. (And this is instructive because if you use the smaller size and then eliminate the “reverse” in the loop the index branch blocks pack more efficiently and the blevel is smaller, even though the index leaf-block count is unchanged.)

In the good old days (9.2.0.4 and earlier) the maximum allowed height for a B-tree index was 24 (blevel = 23). I haven’t got a spare 32GB in any of my virtual machines at present so I haven’t checked to see if this is still true; but if you do run a test with m_blevel = 23, the final line of the script (inserting a zero row) should result in an ORA-00600 error with first parameter 6051 if the limit hasn’t changed.

Footnote

Here’s a link to the original document (how_high.doc) that I wrote and published in dbazine in 2005, now available (very slowly) on the Wayback Machine, and made redundant by a change in the branch block split algorithm in 10g.

Footnote 2

At some point I did discover a bug note on MOS (Metalink, in those days) that reported a performance problem due to an index with a very high blevel (I have a vague memory of it reaching double digits, but not getting very close to the limit – possibly something like 15). So there is a serious point to this post – bad column definitions with bad index definitions (and a silly block size) and a bit of bad luck with the pattern of data insertion can lead to unexpected problems.

January 1, 2022

Happy New Year

Filed under: 12c,Bugs,CBO,Execution plans,Oracle,Transformations — Jonathan Lewis @ 12:01 am GMT Jan 1,2022

Here’s an entertaining little bug that appeared on the oracle-l list server just in time to end one year and start another in a suitable way. The thread starts with an offering from Matthias Rogel (shown below with some cosmetic changes) to be run on Oracle 12.2.0.1:

rem
rem     Script:         group_by_bug.sql
rem     Author:         Matthias Rogel / Jonathan Lewis
rem     Dated:          Dec 2021
rem
rem     Last tested 
rem             19.11.0.0       Fixed
rem             12.2.0.1        Fail
rem

create table t as (
        select date'2021-12-30' d from dual 
        union all 
        select date'2021-12-31'   from dual
);

select extract(year from d), count(*) from t group by extract(year from d);

alter table t add primary key(d);
select extract(year from d), count(*) from t group by extract(year from d);

This doesn’t look particularly exciting – I’ve created a table with two rows holding two dates in the same year, then counted the number of rows for “each” year before and after adding a primary key on the date column. Pause briefly to think about what the results might look like …

Table created.


EXTRACT(YEARFROMD)   COUNT(*)
------------------ ----------
              2021          2

1 row selected.


Table altered.


EXTRACT(YEARFROMD)   COUNT(*)
------------------ ----------
              2021          1
              2021          1

2 rows selected.

After adding the primary key (with its unique index) the result changes to something that is clearly (for this very simple data set) wrong.

At this point I offered a hypothetical reason why Oracle might be producing the wrong result, but Tim Gorman was one step ahead of me and supplied a bug reference from MOS: Wrong Result Using GROUP BY with EXTRACT Function Against DATE (Doc ID 2629968.1)

The MOS document describes this as a bug introduced in the upgrade from 12.1.0.2 to 12.2.0.1, demonstrates the error with the extract() function applied to a date, and supplies three possible workarounds (but not the workaround or explanation I supplied in my response on oracle-l).

The document also pointed to a further bug note that described how the problem also appeared with the combination of the to_char() function applied to a date column with a unique index: 12.2 Wrong Results from Query with GROUP BY Clause Based on To_char Function of Unique Index Columns (Doc ID 2294763.1) with a further suggestion for applying a patch (see MOS Doc ID: 26588069.8) or upgrading to 18.1 (where the bug has been fixed).

Matthias Rogel supplied a follow-up demonstrating the problem with to_char(), which prompted me to create an example showing that it wasn’t just about dates – which I’ve tidied up below (reminder, all results on this page are from 12.2.0.1):

create  table t1 as 
select  round(rownum/10,1) n1 
from    all_objects 
where   rownum <= 10
;


select n1 from t1 order by n1;
select n1, count(*) from t1 group by n1 order by n1;

column nch format A3

select to_char(n1,'99') nch, count(*) from t1 group by to_char(n1,'99') order by 1,2;

select * from table(dbms_xplan.display_cursor(format =>'outline'));

alter table t1 add constraint t1_pk primary key(n1);
select to_char(n1,'99') nch , count(*) from t1 group by to_char(n1,'99') order by 1,2;

select * from table(dbms_xplan.display_cursor(format =>'outline'));

As before I’ve created a simple table, and populated it with a few rows of data. The first two queries are there to show you the data (0.1 to 1.0 by steps of 0.1), and show that aggregating the raw data produces one row per value.

I’ve then repeated the aggregation query, but converted each value to a character string that effectively rounds the value to an integer. Here are the two sets of results, before and after adding the primary key.

NCH   COUNT(*)
--- ----------
  0          4
  1          6

2 rows selected.

Table altered.

NCH   COUNT(*)
--- ----------
  0          1
  0          1
  0          1
  0          1
  1          1
  1          1
  1          1
  1          1
  1          1
  1          1

10 rows selected.

Again, the introduction of the primary key constraint on the column results in wrong results. In this example, though, I’ve pulled the execution plans from memory along with their outlines, and this is what the two plans look like.

SQL_ID  gt5a14jb0g4n0, child number 0
-------------------------------------
select to_char(n1,'99') nch, count(*) from t1 group by to_char(n1,'99')
order by 1,2

Plan hash value: 2808104874

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |       |       |     4 (100)|          |
|   1 |  SORT ORDER BY      |      |    10 |    30 |     4  (50)| 00:00:01 |
|   2 |   HASH GROUP BY     |      |    10 |    30 |     4  (50)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T1   |    10 |    30 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      DB_VERSION('12.2.0.1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      USE_HASH_AGGREGATION(@"SEL$1")
      END_OUTLINE_DATA
  */


SQL_ID  4fxxtmrh8cpzp, child number 0
-------------------------------------
select to_char(n1,'99') nch , count(*) from t1 group by
to_char(n1,'99') order by 1,2

Plan hash value: 1252675504

--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |       |       |     2 (100)|          |
|   1 |  SORT ORDER BY   |       |    10 |    30 |     2  (50)| 00:00:01 |
|   2 |   INDEX FULL SCAN| T1_PK |    10 |    30 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      DB_VERSION('12.2.0.1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$9BB7A81A")
      ELIM_GROUPBY(@"SEL$47952E7A")
      OUTLINE(@"SEL$47952E7A")
      ELIM_GROUPBY(@"SEL$1")
      OUTLINE(@"SEL$1")
      INDEX(@"SEL$9BB7A81A" "T1"@"SEL$1" ("T1"."N1"))
      END_OUTLINE_DATA
  */

In the absence of the primary key (index) Oracle does a full tablescan, then hash group by, then sort order by. When the primary key is put in place Oracle does an index full scan (which is legal because the index must contain all the data thanks to the not null declaration inherent in a primary key) and a sort order by without any group by.

You might wonder if the problem arises because Oracle assumes the indexed path somehow means the aggregation doesn’t apply – but with a /*+ full(t1) */ hint in place and a full tablescan in the plan the aggregation step is still missing – and if you look at the Outline Data section of the plan you can see that this is explicitly demanded by the hint(s): /*+ elim_groupby() */

My hypothesis (before I read the bug note) was that the optimizer had picked up the primary key declaration and seen that n1 was unique and therefore allowed the aggregating group by to be eliminated, but failed to “notice” that the to_char() – or extract() in the date example – meant that the assumption of uniqueness was no longer valid. To work around this problem very locally I simply added the hint /*+ no_elim_groupby */ (with no query block specified) to the query – and got the correct results.
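
To make the workaround concrete, this is a sketch of the hinted version of the earlier query:

select  /*+ no_elim_groupby */
        to_char(n1,'99') nch, count(*)
from    t1
group by
        to_char(n1,'99')
order by
        1,2
;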

Footnote

There is an interesting side note to this example (though not one that I would want to see used in a production system – this comment is for interest only). If you look at the Outline Data for the plan when there was no primary key you’ll notice that the only outline_leaf() is named sel$1 whereas in the plan with the primary key sel$1 appears as an outline() and the only outline_leaf() is named sel$9bb7a81a. An “outline leaf” is a query block that was used by the optimizer in constructing the final plan, while an “outline” is an intermediate query block that was examined before being transformed into another query block. So this difference in the Outline Data tells us that the problem appears thanks to a transformation that did not happen when there was no index – so what would our execution plan look like if the only hint we used in the query was /*+ outline_leaf(@sel$1) */ ?

SQL_ID  apgu34hc3ap7f, child number 0
-------------------------------------
select /*+ outline_leaf(@sel$1) */ to_char(n1,'99') nch , count(*) from
t1 group by to_char(n1,'99') order by 1,2

Plan hash value: 3280011052

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |       |       |     3 (100)|          |
|   1 |  SORT ORDER BY    |       |    10 |    30 |     3  (67)| 00:00:01 |
|   2 |   HASH GROUP BY   |       |    10 |    30 |     3  (67)| 00:00:01 |
|   3 |    INDEX FULL SCAN| T1_PK |    10 |    30 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      DB_VERSION('12.2.0.1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      INDEX(@"SEL$1" "T1"@"SEL$1" ("T1"."N1"))
      USE_HASH_AGGREGATION(@"SEL$1")
      END_OUTLINE_DATA
  */


This posting was scheduled to launch at 00:01 GMT on 1st January 2022. Happy new year – just be careful that you don’t try to extract() or to_char() it if you’re running 12.2.0.1 unless you’ve applied patch 26588069.

December 22, 2021

Explain Plan

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 1:26 pm GMT Dec 22,2021

Here’s a little example that appeared on the Oracle database forum a few years ago (2013, 11.2.0.3 – it’s another of my drafts that waited a long time for completion) which captures a key problem with execution plans:

you need to make sure you look at the right one.

We have the tkprof output from an SQL statement that has been traced because it needs to go faster.


select clndr_id , count(*)
from
 task where (clndr_id = :"SYS_B_0") group by clndr_id
union
select clndr_id , count(*) from project where (clndr_id = :"SYS_B_1") group by clndr_id

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.00          0          0          0           0
Execute      1      0.01       0.00          0          0          0           0
Fetch        2     53.32     612.03      81650      58920          0           2
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4     53.34     612.04      81650      58920          0           2

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 34  (PX)
Number of plan statistics captured: 1


Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         2          2          2  SORT UNIQUE (cr=58923 pr=81650 pw=22868 time=113329109 us cost=58277 size=24 card=2)
         2          2          2   UNION-ALL  (cr=58923 pr=81650 pw=22868 time=113329001 us)
         1          1          1    SORT GROUP BY NOSORT (cr=58330 pr=81070 pw=22868 time=104312437 us cost=58128 size=7 card=1)
   5589739    5589739    5589739     VIEW  index$_join$_003 (cr=58330 pr=81070 pw=22868 time=619784236 us cost=57240 size=38875249 card=5553607)
   5589739    5589739    5589739      HASH JOIN  (cr=58330 pr=81070 pw=22868 time=617373467 us)
   5590158    5590158    5590158       INDEX RANGE SCAN NDX_TASK_CALENDAR (cr=21676 pr=21676 pw=0 time=113637058 us cost=11057 size=38875249 card=5553607)(object id 24749)
   6673774    6673774    6673774       INDEX FAST FULL SCAN NDX_TASK_PROJ_RSRC (cr=36651 pr=36526 pw=0 time=213370625 us cost=21921 size=38875249 card=5553607)(object id 217274)
         1          1          1    SORT GROUP BY NOSORT (cr=593 pr=580 pw=0 time=9016527 us cost=149 size=17 card=1)
    136390     136390     136390     INDEX FAST FULL SCAN NDX_PROJECT_CALENDAR (cr=593 pr=580 pw=0 time=165434 us cost=132 size=2315876 card=136228)(object id 154409)


Rows     Execution Plan
-------  ---------------------------------------------------
      0  SELECT STATEMENT   MODE: ALL_ROWS
      2   SORT (UNIQUE)
      2    UNION-ALL
      1     SORT (GROUP BY NOSORT)
5589739      TABLE ACCESS   MODE: ANALYZED (BY INDEX ROWID) OF 'TASK' (TABLE)
5589739       INDEX   MODE: ANALYZED (RANGE SCAN) OF 'NDX_TASK_CALENDAR' (INDEX)
5590158     SORT (GROUP BY NOSORT)
6673774      INDEX   MODE: ANALYZED (RANGE SCAN) OF 'NDX_PROJECT_CALENDAR' (INDEX)

One of the first things you should notice from the tkprof output is that the “Rowsource Operation” section and the “Execution Plan” section do not match. Remember that tkprof is simply calling explain plan when it generates the “Execution Plan” section of the output and has no information about the type and value of bind variables, so it’s easy for it to generate a plan that didn’t happen. (There are other reasons why the two sets of output might differ – but this is the commonest one.)

Another thing you might note in passing is that the system has cursor_sharing set to force or similar – a detail you can infer from the bind variable names having the form :SYS_B_nnn. This, alone, might be enough to convince you to ignore the Execution Plan because of its potential to mislead.

Despite the tendency to mislead there is (in this case) a very important clue in the Execution Plan. In lines 2 and 3 we can see the “sort unique” and “union all” that the optimizer has used to implement the UNION operator in the query. Then we see that the “union all” has two “sort group by (nosort)” children, one for each of the aggregate query blocks – the “nosort” in both cases because the query blocks ensure that only a single value of clndr_id is selected in each case anyway.

The interesting thing in the Execution Plan is that the range scan of ndx_task_calendar in the first query block tells us that the optimizer thinks that all the information we need can be found in that index. So why does the Rowsource Operation tell us that at run-time the optimizer thought it needed to include the ndx_task_proj_rsrc index in an index join as well?

Assuming we haven’t found a bug the answer must be that there is another predicate hidden somewhere behind the query. There may be a security predicate, or it may simply be that task is actually a view with a definition like “select {list of columns} from task_t where {some predicate}”.
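
If you have access to the data dictionary, a quick check for either possibility might look like this – dba_policies for RLS and dba_views for a view layer (a sketch only; the owner and exact names aren’t visible in the trace file):

select  object_owner, object_name, policy_name, function
from    dba_policies
where   object_name = 'TASK'
;

select  owner, view_name
from    dba_views
where   view_name = 'TASK'
;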

The mismatch between Execution Plan and Rowsource Operation gives us a clue, but the output from tkprof (and even the underlying trace file) is incomplete – we need to see what predicates Oracle has used to execute this query, and where it used them. So here’s the plan for the query pulled from memory by a call to dbms_xplan.display_cursor():

--------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                 | E-Rows |  OMem |  1Mem | Used-Mem | Used-Tmp|
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                      |        |       |       |          |         |
|   1 |  SORT UNIQUE              |                      |      2 |  2048 |  2048 | 2048  (0)|         |
|   2 |   UNION-ALL               |                      |        |       |       |          |         |
|   3 |    SORT GROUP BY NOSORT   |                      |      1 |       |       |          |         |
|*  4 |     VIEW                  | index$_join$_003     |   5553K|       |       |          |         |
|*  5 |      HASH JOIN            |                      |        |   207M|    11M|  176M (1)|     181K|
|*  6 |       INDEX RANGE SCAN    | NDX_TASK_CALENDAR    |   5553K|       |       |          |         |
|*  7 |       INDEX FAST FULL SCAN| NDX_TASK_PROJ_RSRC   |   5553K|       |       |          |         |
|   8 |    SORT GROUP BY NOSORT   |                      |      1 |       |       |          |         |
|*  9 |     INDEX FAST FULL SCAN  | NDX_PROJECT_CALENDAR |    136K|       |       |          |         |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("CLNDR_ID"=:SYS_B_0)
   5 - access(ROWID=ROWID)
   6 - access("CLNDR_ID"=:SYS_B_0)
   7 - filter("DELETE_SESSION_ID" IS NULL)
   9 - filter(("CLNDR_ID"=:SYS_B_1 AND "DELETE_SESSION_ID" IS NULL))

And there in the Predicate Information section we see two extra predicates “delete_session_id is null” – one on each table. Presumably the index ndx_project_calendar includes this column, but ndx_task_calendar does not – hence the need for the index join.

Given that the predicate has been attached to both tables, it seems fairly likely (from an outsider’s perspective) that this is row-level security (RLS / FGAC / VPD) in place. Regardless of whether it is RLS, or just a view layer, it looks like the only options to improve the performance of this query are to persuade it into using a full tablescan, or to find an alternative index access path (which might mean adding delete_session_id to the index it’s already using – and checking whether this change would have any nasty side effects).
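
Purely as an illustration (the real table and index definitions aren’t visible in the thread, and task may turn out to be a view), the “extend the index” option might look like this hypothetical DDL:

-- hypothetical: assumes task is a real table and that the extra
-- column doesn't have nasty side effects on other critical queries
create index ndx_task_cal_del on
        task(clndr_id, delete_session_id)
;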

Footnote

A potentially significant, but easy to miss detail is the “Misses in library cache during parse:” – this statement had to be “hard parsed” (i.e. optimised) before execution; if you see a trace file where a statement has executed and parsed many times and the Misses is greater than one then it’s worth asking yourself why that might be.

MV Hacking

Filed under: Infrastructure,Materialized view,Oracle — Jonathan Lewis @ 11:54 am GMT Dec 22,2021

This is another article that I wrote a long time ago (at least 15 years according to the file’s timestamp but internally stamped as 2002, and it references 8i and 9i and “enhancements in 10g”), but the method described is still relevant and worthy of consideration so I thought I’d resurrect it with the caveat that the idea is sound, but the details need review.

Yet another fix for 3rd-party performance

 There are many ways to address the performance problems associated with third-party code that cannot be modified directly. Some vendors will allow you to modify indexes, some may even allow you to make more radical infrastructure changes – such as converting heap tables to index-organized tables. In some cases you may be able to make use of stored outlines to force specific execution plans on to particular SQL statements.

This article outlines yet another tactic that you can use to change the performance characteristics of someone else’s code, without changing the code itself: Query Rewrite.

Introduction to Query Rewrite.

If you say “materialized view” to most DBAs and developers, they will tend to think of data warehouses, big crunchy queries, and summary tables.  But, with a little luck, you can use materialized views to great effect even in OLTP systems.

Materialized views have been around for many years, of course, originally in the guise of the snapshots that were introduced to allow data to be replicated between different databases. But the newer, special, feature of materialized views is the ability to create them in the local database and get the cost based optimizer to rewrite queries to use them whenever it seems to be a good idea, so the query:

select
        dept, sale_dt, sum(val) 
from
        sales_table
where
        sale_dt > sysdate - 14
group by 
        dept, sale_dt
;

running against a huge table might invisibly be turned into:

select  
        dept, sale_dt, v_sum 
from
        sales_sum
where
        sale_dt > sysdate - 14
;

running against a much smaller summary table. But this depends on the fact that we have created a materialized view called sales_sum, typically using a command like the following:

create materialized view sales_sum
refresh on demand
enable query rewrite
as
select
        dept, sale_dt, 
        sum(val) v_sum
from    sales_table
group by 
        dept, sale_dt
;

There are many ways (with different schedules and side effects) in which the summary table (or materialized view) can be kept up to date, but I’m not going to go into that in this article.

To most people, then, materialized view = summary table, which pretty much limits their functionality to data warehouses and massive number crunching queries.

But take a wider viewpoint. The purpose of a materialized view is to allow data to be found more efficiently – just like an index. So could you use the technology in a way that is closer to indexing than it is to summary tables? The answer is yes.

Materialized clones

The commonest symptom I see in Oracle systems [ed: still, in 2021] that have performance problems is that they want to acquire too much data from too many different locations on disc – the issues of precision and scatter.

We may be able to address the problem of examining too much data (precision) by changing index definitions – but if we still have to visit many different disc locations to acquire just the correct data (scatter), we may still have a performance problem.

Consider, for example, the problem of producing a monthly statement of account on a credit card. If the debits table is an ordinary heap table, then the statement of account for a typical customer is likely to require Oracle to locate a few dozen different rows – which will probably be scattered across a few dozen table blocks – which will require a few dozen physical disc reads.

The solution, for the credit card company that builds its own application, is to create the debits table as an index organized table (IOT) with a primary key starting with the account number and payment date – so that all the payments for a given customer are collated into a few table (index) blocks as they are created.

But if you don’t own the application, and the vendor tells you that you may not change ‘their’ heap table into ‘your’ IOT, what can you do? One option – which will be appropriate for some cases, and needs a proper cost/benefit analysis – is simply to clone the debits table as an IOT and keep the clone up to date in real time with a trigger; then call the clone a materialized view with query rewrite enabled.

Figure 1 builds a simple example to demonstrate the process. The demonstration is designed to run under Oracle 9i, hence the use of subquery factoring (“with subquery”) which will not work in Oracle 8i. (The complete example can be downloaded from the Wayback Machine archive of my website: https://web.archive.org/web/20181217220623/http://www.jlcomp.demon.co.uk).

The example starts by building two tables with a referential integrity constraint to demonstrate the principle in a non-trivial case. The child table corresponds to the debits table described in the text above.

The script creates the child table with 100,000 rows using 1,000 account numbers. The account numbers are generated using the random number generator, so there will be approximately 100 rows for each account, and they will be scattered fairly evenly throughout the table – which will be about 1,700 blocks long. Given this build, a query for all the data for a single account will typically have to visit 100 blocks in the table – and in a full-size system, those block visits will probably turn into physical disc reads.

 
rem
rem     Script:         mv_idea_02.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sept 2002
rem
rem     Last tested
rem             10.2.0.1
rem

create table child
as
with generator as (
        select  --+ materialize
                rownum  id
        from    all_objects 
        where   rownum <= 5000 -- > comment to avoid WordPress formatting issue
)
select
        /*+ ordered use_nl(v2) */
        trunc(dbms_random.value(1,1001)) account,
        trunc(sysdate-10) + rownum/1000  tx_time,
        round(dbms_random.value(5,50),2) debit,
        rpad('x',100)                      padding
from
        generator       v1,
        generator       v2
where
        rownum <= 100000 -- > comment to avoid WordPress formatting issue
;

alter table child 
        add constraint c_pk primary key(account,tx_time);

create table parent as
select
        account,
        min(tx_time)    first_tx_time,
        rpad('x',300)   padding
from
        child
group by
        account
order by
        min(tx_time)
;

alter table parent 
        add constraint p_pk primary key(account);

alter table child  
        add constraint c_fk_p foreign key(account) 
        references parent(account);


Figure 1: Building the basic sample. 

To add a little depth to the model, and demonstrate a little more of the power of query rewrite, I have added a parent table (in the previous discussion, it might be the accounts table, where the child table holds the debits for each account).  Critically, I have declared and enabled a proper set of integrity constraints. If you try using materialized views, you either need to set up proper referential integrity, or you need to investigate the use of dimensions if you want the Cost Based Optimizer to recognise the opportunity for ‘subtle’ query rewrites.

Once I have my base table in place, the code in Figure 2 creates a clone of the data (in fact, using a subset of the columns) and a materialized view based on that clone.

create table mv_child(
        account,
        tx_time,
        debit,
        constraint mv_pk primary key (account, tx_time)
)
organization index
as
select  account, tx_time, debit
from    child
;

create materialized view mv_child
on prebuilt table 
never refresh
enable query rewrite
as
select  account, tx_time, debit
from    child
;

--      now collect statistics with dbms_stats


Figure 2: Building the materialized view.

You will notice that the table mv_child has been declared as an index organized table.  Since I’ve eliminated the padding column from the data as well, the data for each account is now packed into about 4 leaf blocks.

The materialized view (also called mv_child) is declared as prebuilt, with never refresh. As far as Oracle is concerned, the thing exists, and will ‘never’ be maintained, so there is no need to keep any sort of log for it. But it can be used for query rewrite because we have enabled it for query rewrite.

Of course, if Oracle does use this table for query rewrite, the results will start to go wrong as soon as the data in the base table (child) starts to change. So we have to tell the cost based optimizer to be very trusting about this materialized view. We start by telling Oracle to enable query rewrite, then set the rewrite integrity:

alter session set query_rewrite_enabled   = true;
alter session set query_rewrite_integrity = stale_tolerated;

Since we are going to be updating the base table, the stale_tolerated level is necessary. If we were not expecting to update the child table, then an integrity level of trusted would be sufficient.

Once we have set this up, we can start to run queries and check execution plans (making sure we check the real run-time plans, not just the output from explain plan which can produce incorrect results when materialized views come into play).
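
In 10g and later the easiest way to check the real run-time plan is dbms_xplan.display_cursor() – a minimal sketch, run immediately after executing the query (serveroutput must be off for the null/null parameters to pick up the last statement executed):

set serveroutput off

-- run the query of interest, then:
select * from table(dbms_xplan.display_cursor(null, null, 'typical'));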

select
        p.account,  c.tx_time, c.debit
from
        parent  p,
        child   c
where
        p.first_tx_time = trunc(sysdate) - :b1
and     c.account = p.account
order by
        p.account, c.tx_time
;

SELECT STATEMENT 
  SORT (order by)
    NESTED LOOPS
      TABLE ACCESS PARENT (full)  Filter: P.FIRST_TX_TIME= TRUNC(SYSDATE@!)-TO_NUMBER(TO_CHAR(:B1))
      INDEX MV_PK (range scan)    Access: MV_CHILD.ACCOUNT=P.ACCOUNT


Figure 3: Example query  with execution plan:

In the example in Figure 3, I have joined the parent and child tables. And, as you can see, the access to the child table (you might expect an index range scan on ‘C’ followed by table access to table ‘CHILD’) has been replaced by just the index range scan on index ‘MV_PK’ – the index on the mv_child table.

Next Steps:

Of course, this is just a simple example of how to use materialized views – to take advantage of the technique in an OLTP system, you have to ensure that the table mv_child stays in synch with the base child table. This is where simple triggers can come into play. (See Figure 4.)

Problems

The sample trigger code will keep the child and mv_child table in synch on an update. You will have to create insert and delete triggers as well. Of course, a better (more efficient) style of code for this part of the task would be to create a package containing procedures to execute the actual SQL, then use the triggers to call the packaged procedures – there’s a minimal sketch of that approach after Figure 4. The style shown just keeps the code short and simple.

create or replace trigger c_aru
after update of account, tx_time, debit 
on child
for each row
begin
        if (
                :new.account = :old.account
            and :new.tx_time = :old.tx_time
        ) then
                update mv_child
                        set debit = :new.debit
                where   account = :new.account
                and     tx_time = :new.tx_time
                ;
        else
                delete from mv_child
                where   account = :old.account
                and     tx_time = :old.tx_time
                ;

                insert into mv_child (
                        account, tx_time, debit
                )
                values (
                        :new.account, 
                        :new.tx_time, 
                        :new.debit
                )
                ;
        end if;
end;
/


Figure 4: Example of simple trigger on child table.
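
As a minimal sketch of the packaged approach mentioned above (the names are illustrative only), the update logic might move into a procedure like this:

create or replace package mv_child_sync as
        procedure set_debit(p_account number, p_tx_time date, p_debit number);
end mv_child_sync;
/

create or replace package body mv_child_sync as
        procedure set_debit(p_account number, p_tx_time date, p_debit number) is
        begin
                update  mv_child
                set     debit   = p_debit
                where   account = p_account
                and     tx_time = p_tx_time;
        end set_debit;
end mv_child_sync;
/

The “update in place” branch of the trigger would then reduce to the single call: mv_child_sync.set_debit(:new.account, :new.tx_time, :new.debit);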

But there are problems with this trigger-based approach that you may discover only after you have started to test the triggers.

Critically, if your session modifies the child table (in a way that makes the triggers fire) and then queries it before commit, the query will not be rewritten to use the materialized view. However, if your application always does a commit after any modification and before querying the table, your queries will be redirected against the materialized view.
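
In other words (a sketch of the pattern, based on the behaviour I’ve described rather than a recent re-test):

update child set debit = debit + 0.01 where account = 1;

-- before the commit this query runs against the base table (no rewrite)
select sum(debit) from child where account = 1;

commit;

-- after the commit the same query can be rewritten to use mv_child
select sum(debit) from child where account = 1;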

I have to say I was a little surprised when I found that this technique worked at all – according to various notes in the manuals query rewrite is supposed to be blocked by the presence of bind variables. But, as you can see in fig. 3, my tests with bind variables did actually work.

One final thought – if you are already familiar with materialized views and query rewrite you may be wondering why I didn’t simply define my view with the option to “refresh fast on commit” instead of writing my own triggers. In a system with low throughput, this might work – but for a high-throughput OLTP system the overheads are enormous, roughly 5,000% the last time I checked [ed: but that was a long time ago, and things have improved].

Conclusion

Many of the performance problems that I see boil down to excessive I/O – with solutions that basically find ways of reducing the amount of I/O needed to satisfy critical queries.

Although materialized views are usually associated with reducing I/O in Data Warehouse and Decision Support Systems, you may find that you can take advantage of them in a slightly fanciful way in OLTP systems as well, typically by creating copies of data restricted to subsets of columns (as above) and/or rows.

Footnote Dec 2021

Remember that this was written back when 10g was very new – the idea is useful if you haven’t considered it before, but I won’t guarantee that the details are still correct, or that there isn’t a better way of implementing the strategy I’ve described above.

December 3, 2021

Column Groups

Filed under: CBO,Oracle,Statistics — Jonathan Lewis @ 10:40 am GMT Dec 3,2021

Here’s a little test that should have been the second test I ran when column groups were first made available in Oracle. It’s a check on the second of the two basic algorithms that Oracle uses for combining predicates:

  1. sel(predA and predB) = sel(predA) * sel(predB)
  2. sel(predA or predB) = sel(predA) + sel(predB) – sel(predA and predB)

As I pointed out many years ago in Cost Based Oracle – Fundamentals these are the formulae for combining probabilities of independent events. But if the predicates are not independent the formulae make the optimizer produce bad cardinality estimates and that’s why the Oracle developers introduced column groups – virtual columns that gather statistics about combinations of columns.
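
As a reminder – and as a more explicit alternative to the method_opt approach used in the script below – a column group can be created through the dbms_stats.create_extended_stats() function, e.g.:

select
        dbms_stats.create_extended_stats(
                ownname   => user,
                tabname   => 'T1',
                extension => '(N1,N2)'
        ) extension_name
from
        dual
;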

So let’s set up an extreme example and test the results (I’ve run it on 21c and 19c, but the issue is the same for all versions with column groups):

rem
rem     Script:         column_group_and_or.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2021
rem
rem     Last tested 
rem             21.3.0.0
rem             19.11.0.0
rem

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        mod(rownum,100)                 n1,
        mod(rownum,100)                 n2,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator 
;

prompt  =========================================
prompt  Run the script twice, 
prompt  first without gathering the column group
prompt  then after gathering the column group
prompt  =========================================

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for columns (n1, n2) size 1'
        );
end;
/

column column_name format a32

select
        column_name, num_distinct, data_default
from
        user_tab_cols
where
        table_name = 'T1'
/

This script gives you a table with 10,000 rows, where the n1 and n2 columns always have exactly the same values as each other, and each column has 100 rows each of 100 distinct values. If you allow the PL/SQL block to run, the column stats query will report the following:

COLUMN_NAME                      NUM_DISTINCT DATA_DEFAULT
-------------------------------- ------------ --------------------------------------------------------------------------------
N1                                        100
N2                                        100
V1                                      10000
PADDING                                     1
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS            100 SYS_OP_COMBINED_HASH("N1","N2")

5 rows selected.

After creating the data (and statistics) we can execute the following simple statements:

set autotrace traceonly explain

select  * 
from    t1
where
        n1 = 50
/

select  * 
from    t1
where
        n1 = 50
and     n2 = 50
/

select  *
from    t1
where
        n1 = 50
or      n2 = 50
/

set autotrace off

We know that the first query will return 100 rows, and that the optimizer has enough information to get the correct estimate: the selectivity of “n1 = 50” is 1/100, the number of rows in the table is 10,000, so the estimate should be 1/100 * 10,000 = 100.

For the second query: if we don’t create the column group we expect the optimizer to multiply the selectivities of the two predicates together before multiplying by the number of rows in the table to get an estimate of (1/100 * 1/100) * 10,000 = 1

If we’ve created the column group we expect the optimizer to use the column group selectivity to calculate its estimate (1/100 * 10,000) = 100.

Then we get to the third query: if we don’t create the column group we expect the optimizer to slot 1/100 into the 4 places it has to appear in the formula to get an estimate (1/100 + 1/100 – (1/100 * 1/100)) * 10,000 = 199.

If we have created the column group the optimizer can use the column group’s 1/num_distinct for the third term in the formula to get: (1/100 + 1/100 – (1/100)) * 10,000 = 100

Here are the three execution plans (with a little cosmetic work) when the column group exists:

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 11800 |    26   (4)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 11800 |    26   (4)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=50)


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 11800 |    26   (4)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 11800 |    26   (4)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=50 AND "N2"=50)


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   199 | 23482 |    27   (8)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   199 | 23482 |    27   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=50 OR "N2"=50)

As you can see from the results in the Rows column, the optimizer has used the column group statistics for the AND estimate, but not for the OR estimate, resulting in an estimate that is out by a factor of nearly 2.

It should be noted that failure to use the column group for AND’ed predicates can introduce a massive error in cardinality estimates but for OR’ed predicates the worst it can be is a factor of 2.

Summary

There is a possible enhancement to the optimizer’s use of column group statistics that could make the arithmetic consistent for AND and OR predicates and could allow the optimizer to produce better estimates for OR conditions.

Although any error in cardinality estimates can lead to undesirable execution plans, the scale of the error due to the current inconsistency will be at worst a factor of 2, which is unlikely to introduce many surprise plans. The nature of the error means the estimate will be too big, which might push the optimizer into using tablescans and hash joins when indexed accesses and nested loops might be the better choice.

December 1, 2021

Best Practice

Filed under: Execution plans,Oracle,Performance — Jonathan Lewis @ 7:08 pm GMT Dec 1,2021

This showed up recently on the Oracle Developer Forum.

For a while I’m questioning myself about a best practice query for performance.

I’ve 2 tables: EMPLOYEES and DEPARTMENTS. The EMPLOYEES table can contains millions of records. Which query is better for performance ?

Select t1.emp_id, t1.emp_name, t2.dept_name
from EMPLOYEES t1, DEPARTMENTS t2
where t1.dept_id = t2.dept_id
order by t1.emp_name
Select emp_id, emp_name, (select mypackage.get_dept_name(dept_id) from dual) as dept_name
from EMPLOYEES
order by emp_name

The nice thing about this question is the correct use of English grammar; it uses “better” rather than “best” (the former is correct when comparing two things, the latter requires at least three things to be compared). On the down-side, any suggestion that a single pattern might be “best practice” for performance should be viewed with great caution; it’s far too easy for the less experienced to translate “this is a good idea for this query” as “you should always do this”. At best it’s possible to identify patterns at the edges of style where you could say either: “this usually works well enough” or “you shouldn’t do this unless it’s really necessary”.

So I’m going to ignore the generic question and pretend that the question was supposed to be about the performance of the specific case; and then I’m going to complain that the question is badly posed because it omits several pieces of information that are likely to be relevant.

Some points to consider as background before looking too closely into performance:

  • We might expect that there is a referential integrity (RI) constraint between departments and employees. We might also expect to see emp.dept_id declared as not null – if it’s not, do we really want employees who are not yet assigned to a department to disappear from the report? Maybe the join should be an outer join.
  • How many departments are there? If there are millions of employees might there be thousands, or even tens of thousands, of departments? That’s going to make a difference to performance and (depending on strategy) to the consistency of the performance.
  • How important is it to squeeze the last bit of performance from this query? An Oracle database can perform perfectly well with millions of rows but how rapidly can the network transfer the rows from the database to the client, and how well is the client process going to deal with such a large result set? (Why does the client want to run a query that returns millions of rows – is the client then going to do something, after a long and tedious data transfer, that should have been done in 1/100th of the time in the database to supply a much smaller result set?)
  • How competent are the people who may be required to support the query once it has gone into production. Should you follow the principle of preferring simplicity to optimum performance. What’s the difference in (database) performance between the better performance and the one that’s easier to understand?
  • Is there any reason why the query must not use parallel execution. Depending on the coding strategy adopted it’s possible that the slower serial query becomes the faster parallel query, and parallelism hasn’t even been mentioned.
  • Do you think that the version of Oracle may change the relative performance of different strategies for the query? (The correct answer is “yes”, and the query that uses a function call and scalar subquery is the one that’s more likely to be affected by future tweaks to the Oracle internal processing.)
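
Just to make that first point concrete, here’s a sketch of the outer join variant (keeping the table and column names of the original question) that would stop unassigned employees from disappearing:

select
        t1.emp_id, t1.emp_name, t2.dept_name
from
        employees t1
left outer join
        departments t2
on
        t2.dept_id = t1.dept_id
order by
        t1.emp_name
/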

So let’s look at specifics.

I’m going to ignore the option of using a PL/SQL function to generate a department name from an id. Possibly the hope was that using it would reduce the number of times a piece of SQL was executed against the dept table; but you shouldn’t use the public result cache for such a lightweight call, particularly when the session’s scalar subquery cache would probably get all the benefit that was available anyway.
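
For reference, the pattern I’m dismissing would presumably look something like the following – a purely hypothetical sketch of what mypackage.get_dept_name might be if it were written to use the PL/SQL function result cache:

create or replace function get_dept_name(p_dept_id in number)
return varchar2
result_cache
is
        v_name  dept.dept_name%type;
begin
        select  dept.dept_name
        into    v_name
        from    dept
        where   dept.id = p_dept_id;

        return v_name;
end;
/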

Assume, then, that the referential integrity and not null declarations are in place, and that the number of departments is vaguely commensurate with the millions of employees. I’ve set up a little test with 50,000 rows in an “emp” table and 200 rows in a “dept” table just to have some concrete results to prompt a few thoughts.

Given the nature of the query (no predicates to limit data selected) the simple join with “order by” probably has only one sensible path:

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |  50000 |00:00:00.03 |     953 |       |       |          |
|   1 |  SORT ORDER BY      |      |      1 |  50000 |  50000 |00:00:00.03 |     953 |  3738K|   834K| 3322K (0)|
|*  2 |   HASH JOIN         |      |      1 |  50000 |  50000 |00:00:00.01 |     953 |  1335K|  1335K| 1566K (0)|
|   3 |    TABLE ACCESS FULL| DEPT |      1 |    200 |    200 |00:00:00.01 |       5 |       |       |          |
|   4 |    TABLE ACCESS FULL| EMP  |      1 |  50000 |  50000 |00:00:00.01 |     948 |       |       |          |
-----------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DEPT"."ID"="EMP"."DEPT_ID")

I’ve run this with rowsource execution stats enabled so that I can highlight details that might suggest ways to save resources like CPU or memory. This gives rise to two thoughts:

First, would the subquery method do anything to reduce the volume of data sorted – i.e. would the subquery run late? Secondly, even if it didn’t run late, might it reduce the work needed to translate department ids into department names?

Here’s a version of the query using a scalar subquery with the resulting execution plan:

select 
        emp.id, emp.ename,
        (select dept.dept_name from dept where dept.id = emp.dept_id) dept_name
from 
        emp
order by
         emp.ename
/

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |      1 |        |  50000 |00:00:00.05 |    5884 |       |       |          |
|   1 |  TABLE ACCESS BY INDEX ROWID| DEPT    |   4931 |      1 |   4931 |00:00:00.01 |    4935 |       |       |          |
|*  2 |   INDEX UNIQUE SCAN         | DEPT_PK |   4931 |      1 |   4931 |00:00:00.01 |       4 |       |       |          |
|   3 |  SORT ORDER BY              |         |      1 |  50000 |  50000 |00:00:00.05 |    5884 |  3738K|   834K| 3322K (0)|
|   4 |   TABLE ACCESS FULL         | EMP     |      1 |  50000 |  50000 |00:00:00.01 |     949 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DEPT"."ID"=:B1)

The plan for any query with scalar subqueries in the select list reports the subqueries at the same level as, and above, the main query – so we can see here that the scalar subquery has been executed only 4,931 times rather than the full 50,000: that’s the benefit of scalar subquery caching. The execution count is in the thousands rather than being 200 (the number of departments) because the cache is rather small and, when the hash values for two department ids collide, only the first id to be hashed into the cache is retained (the cache doesn’t use linked lists to handle collisions).
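
If you want to see the effect of the caching (and the collisions) for yourself, one option is simply to count the executions – here’s a minimal sketch using a hypothetical package that wraps the department lookup around a session-level counter:

create or replace package call_counter as
        n number := 0;
        function dept_name(i_dept_id in number) return varchar2;
end;
/

create or replace package body call_counter as
        function dept_name(i_dept_id in number) return varchar2 is
                v_name  dept.dept_name%type;
        begin
                n := n + 1;             -- count each real execution
                select dept_name into v_name from dept where id = i_dept_id;
                return v_name;
        end;
end;
/

set serveroutput on

select
        emp.id, emp.ename,
        (select call_counter.dept_name(emp.dept_id) from dual) dept_name
from
        emp
order by
        emp.ename
/

execute dbms_output.put_line('Lookups executed: ' || call_counter.n)

With the scalar subquery cache working you’d expect the final count to be far smaller than the 50,000 rows returned – close to the Starts figure reported in the plan above.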

There’s some ambiguity in that plan, though. We can see that the Buffers statistic at operation 3 (the Sort) matches the total for the query and consists of the value for the tablescan plus the value for the subquery execution (5884 = 4935 + 949 – though don’t ask me why the emp tablescan reports 949 buffers instead of the 948 previously reported), so we can’t easily tell whether Oracle executed the subquery before or after it had sorted the data from the tablescan. This could make a difference to performance, since including the department name in the data to be sorted could mean using a lot more memory (and more I/O if the larger volume resulted in the sort spilling to disc). We can do a quick check with a simplified query – selecting only the department id from emp rather than translating id to name.

select 
        emp.id, emp.ename,
        emp.dept_id
--      (select dept.dept_name from dept where dept.id = emp.dept_id) dept_name
from 
        emp
order by
         emp.ename
/

----------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  50000 |00:00:00.02 |     948 |       |       |          |
|   1 |  SORT ORDER BY     |      |      1 |  50000 |  50000 |00:00:00.02 |     948 |  2746K|   746K| 2440K (0)|
|   2 |   TABLE ACCESS FULL| EMP  |      1 |  50000 |  50000 |00:00:00.01 |     948 |       |       |          |
----------------------------------------------------------------------------------------------------------------


With this simpler query the Used-Mem for the sort operation drops from 3322K to 2440K so Oracle must have been using the scalar subquery to fetch the department name before sorting. (I don’t know why Buffers for the tablescan has dropped to 948 again.)

So maybe it would be worth rewriting the query to sort the “narrower” data set before calling the scalar subquery:

select
        /*+ 
                qb_name(main) 
                no_merge(@inline)
                no_eliminate_oby(@inline)
        */
        v1.id, v1.ename, 
        (select /*+ qb_name(ssq) */ dept.dept_name from dept where dept.id = v1.dept_id) dept_name
from    (
        select 
                /*+ qb_name(inline) */
                emp.id, emp.ename, emp.dept_id
        from 
                emp
        order by
                emp.ename
        )       v1
/

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |      1 |        |  50000 |00:00:00.03 |     949 |       |       |          |
|   1 |  TABLE ACCESS BY INDEX ROWID| DEPT    |   4909 |      1 |   4909 |00:00:00.01 |    5112 |       |       |          |
|*  2 |   INDEX UNIQUE SCAN         | DEPT_PK |   4909 |      1 |   4909 |00:00:00.01 |     203 |       |       |          |
|   3 |  VIEW                       |         |      1 |  50000 |  50000 |00:00:00.03 |     949 |       |       |          |
|   4 |   SORT ORDER BY             |         |      1 |  50000 |  50000 |00:00:00.02 |     949 |  2746K|   746K| 2440K (0)|
|   5 |    TABLE ACCESS FULL        | EMP     |      1 |  50000 |  50000 |00:00:00.01 |     949 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DEPT"."ID"=:B1)

As you can see, my execution plan now has a View operation telling us that the Sort operation completed inside a non-mergeable view; and we can see that the Buffers and Used-Mem statistics for the Sort operation have stayed at the 948 (+1 again) buffers and 2440K of the simpler query. Of course, the need for the data (i.e. the sorted rowsource) to pass up the plan through the View operation means a little extra CPU at that point before we start executing the subquery. By a lucky fluke the hash collisions on the scalar subquery cache have dropped slightly because the department ids are arriving in a different order – this can be a good thing but, as you will have seen in the linked article above on filter subqueries, the change of order could have been a disaster. (More articles on scalar subquery caching – and deterministic functions – are linked from this URL.)

You might also note the odd little glitch in the reporting – the final figure for Buffers has “lost” the result from the scalar subquery. (And this was on 19.11.0.0 – so it’s not an old error.)

So it seems we can do better than either of the options supplied in the original post. But the important question then is whether we should adopt that strategy or not. And the answer in many cases will be “probably not, unless it’s in very special circumstances and carefully documented”.

I’ve used a no_merge() hint and the undocumented no_eliminate_oby() hint; and that happens to have done what I wanted in the release of Oracle I was using; but in principle the optimizer could unnest the “@qb_name(ssq)” scalar subquery and produce a join between the “@qb_name(inline)” inline view and the unnested table – and that might cause the ordered data to become disordered (even though the sort order by operation would still be taking place). That’s a little unlikely, of course, but if it did happen someone would have to notice and deal with an urgent rewrite. (Strangely the reason why the optimizer doesn’t do this unnest in 19c and 21c is because – according to the 10053 trace file – the subquery “might return more than one row”, even though it’s selecting by equality on primary key!)
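
If the rewrite did go into production I’d probably try to make the intention explicit by hinting the subquery as well – a sketch (adding a no_unnest hint inside the scalar subquery) that should block that unnest if a future optimizer decided it was legal:

select
        /*+
                qb_name(main)
                no_merge(@inline)
                no_eliminate_oby(@inline)
        */
        v1.id, v1.ename,
        (select /*+ qb_name(ssq) no_unnest */ dept.dept_name from dept where dept.id = v1.dept_id) dept_name
from    (
        select
                /*+ qb_name(inline) */
                emp.id, emp.ename, emp.dept_id
        from
                emp
        order by
                emp.ename
        )       v1
/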

Summary Observations

Resist the temptation to ask for “best practices” for writing performant SQL. There are no immediately practical answers to the question (though there may be a couple of suggestions about how not to write your SQL).

You don’t necessarily want to put the more performant query into production.

Footnote

If you want to play around with this model, here’s the script I used to generate the data and run the test:

rem
rem     Script:         emp_dept_optimise.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sept 2021
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem

define m_dept=200

drop table emp;
drop table dept;

execute dbms_random.seed(0)

create table dept as
select 
        rownum id, 
        lpad(dbms_random.value,20) dept_name, 
        rpad('x',100) padding
from
        all_objects
where 
        rownum <= &m_dept
/

alter table dept add constraint dept_pk primary key(id);

create table emp as
select 
        rownum id, 
        1 + mod(rownum-1,&m_dept) dept_id,
        lpad(dbms_random.value,20) ename, 
        rpad('x',100) padding
from
        all_objects
where
        rownum <= 50000
/

alter table emp add constraint emp_pk primary key (id)
/

alter table emp add constraint emp_fk_dept foreign key(dept_id) references dept
/

set linesize 180
set arraysize 250

set termout off
set serveroutput off

alter session set statistics_level = all;

select 
        emp.id, emp.ename,
        emp.dept_id
--      (select dept.dept_name from dept where dept.id = emp.dept_id) dept_name
from 
        emp
order by
         emp.ename
/

spool emp_dept_optimise.lst

prompt  ============================================
prompt  For a baseline on memory, query with dept_id
prompt  ============================================

select * from table(dbms_xplan.display_cursor(format=>'allstats last'));
spool off

select 
        emp.id, emp.ename,
        dept.dept_name
from 
        emp,
        dept
where
        dept.id = emp.dept_id
order by
         emp.ename
/

spool emp_dept_optimise.lst append

prompt  ======================================
prompt  For a simple join between emp and dept
prompt  ======================================

select * from table(dbms_xplan.display_cursor(format=>'allstats last projection'));
spool off

select 
        emp.id, emp.ename,
--      emp.dept_id
        (select dept.dept_name from dept where dept.id = emp.dept_id) dept_name
from 
        emp
order by
         emp.ename
/

spool emp_dept_optimise.lst append

prompt  ==============================================
prompt  Now switching to the dept_name scalar subquery
prompt  ==============================================

select * from table(dbms_xplan.display_cursor(format=>'allstats last'));
spool off



select
        /*+ 
                qb_name(main) 
                no_merge(@inline)
                no_eliminate_oby(@inline)
        */
        v1.id, v1.ename, 
        (select /*+ qb_name(ssq) */ dept.dept_name from dept where dept.id = v1.dept_id) dept_name
from    (
        select 
                /*+ qb_name(inline) */
                emp.id, emp.ename, emp.dept_id
        from 
                emp
        order by
                emp.ename
        )       v1
/

spool emp_dept_optimise.lst append

prompt  ==================================================
prompt  Now with the dept_name scalar subquery 'postponed'
prompt  ==================================================

select * from table(dbms_xplan.display_cursor(format=>'allstats last outline'));
spool off

set serveroutput on
alter session set statistics_level = typical;

November 22, 2021

Statistics_Level

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 11:12 am GMT Nov 22,2021

Prompted by a recent question on the MOSC community forum (link needs a support account) I thought I’d dust off this little script (that I wrote nearly 20 years ago for Oracle 9.2) to see if it still worked, whether it needed any new columns, and whether 21c had added any extra rows.

The script is just a simple report of v$statistics_level, which reports the various real-time statistics collections and advisors that could be enabled, the view that holds associated results (where relevant) and the “activation_level” – in effect telling you whether or not it is necessary to set the statistics_level to “all” before you can enable a particular collection.

The results are not as helpful as you might hope, however, and may result in a mild panic attack if you are prone to getting worried about licensing requirements.

rem
rem     Script:         statistics_level.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2002
rem     Purpose:        Simple script to pick up 9.2 statistics activation details
rem

column statistics_name          format a40
column statistics_view_name     format a24
column description              format a64 word_wrapped

column system_status            heading "Sys"
column session_Status           heading "Ses"
column session_settable         heading "Set"

break on activation_level skip 1

set linesize 160
set pagesize  90
set trimspool on

spool statistics_level

select
        activation_level,
        statistics_name,
        statistics_view_name,
        system_status,
        session_status,
        session_settable,
        description
from
        v$statistics_level
order by
        activation_level,
--      length(description) desc,
        statistics_name
;

clear breaks

spool off

Unless you’ve set up extra privileges the script has to be run by the SYS user, and you’ll note that I’ve omitted the con_id column, which always seemed to be zero whether I queried from the root or from a pluggable database on my 21.3 instance.

Here are the results I got from the root container with the statistics_level set to typical:

ACTIVAT STATISTICS_NAME                          STATISTICS_VIEW_NAME     Sys      Ses      Set DESCRIPTION
------- ---------------------------------------- ------------------------ -------- -------- --- ----------------------------------------------------------------
ALL     Plan Execution Statistics                V$SQL_PLAN_STATISTICS    DISABLED DISABLED YES Enables collection of plan execution statistics
        Timed OS Statistics                                               DISABLED DISABLED YES Enables gathering of timed operating system statistics

TYPICAL Active Session History                   V$ACTIVE_SESSION_HISTORY ENABLED  ENABLED  NO  Monitors active session activity using MMNL
        Adaptive Thresholds Enabled                                       ENABLED  ENABLED  NO  Controls if Adaptive Thresholds should be enabled
        Automated Maintenance Tasks                                       ENABLED  ENABLED  NO  Controls if Automated Maintenance should be enabled
        Automatic DBOP Monitoring                V$SQL_MONITOR            ENABLED  ENABLED  YES Controls if automatic DBOP Monitoring should be enabled
        Bind Data Capture                        V$SQL_BIND_CAPTURE       ENABLED  ENABLED  NO  Enables capture of bind values used by SQL statements
        Buffer Cache Advice                      V$DB_CACHE_ADVICE        ENABLED  ENABLED  NO  Predicts the impact of different cache sizes on number of
                                                                                                physical reads

        Column Tracking Level                                             ENABLED  ENABLED  YES Sets Up Column Tracking Level
        Global Cache Statistics                                           ENABLED  ENABLED  NO  RAC Buffer Cache statistics
        Longops Statistics                       V$SESSION_LONGOPS        ENABLED  ENABLED  NO  Enables Longops Statistics
        MTTR Advice                              V$MTTR_TARGET_ADVICE     ENABLED  ENABLED  NO  Predicts the impact of different MTTR settings on number of
                                                                                                physical I/Os

        Modification Monitoring                                           ENABLED  ENABLED  NO  Enables modification monitoring
        OLAP row load time precision                                      ENABLED  ENABLED  YES Sets precision of olap row load time statistics
        Object Activity Tracking                                          ENABLED  ENABLED  YES Sets Up Object Activity Tracking (OATS)
        PGA Advice                               V$PGA_TARGET_ADVICE      ENABLED  ENABLED  NO  Predicts the impact of different values of pga_aggregate_target
                                                                                                on the performance of memory intensive SQL operators

        Plan Execution Sampling                  V$ACTIVE_SESSION_HISTORY ENABLED  ENABLED  YES Enables plan lines sampling
        SQL Monitoring                           V$SQL_MONITORING         ENABLED  ENABLED  YES Controls if SQL Monitoring should be enabled
        Segment Level Statistics                 V$SEGSTAT                ENABLED  ENABLED  NO  Enables gathering of segment access statistics
        Shared Pool Advice                       V$SHARED_POOL_ADVICE     ENABLED  ENABLED  NO  Predicts the impact of different values of shared_pool_size on
                                                                                                elapsed parse time saved

        Streams Pool Advice                      V$STREAMS_POOL_ADVICE    ENABLED  ENABLED  NO  Predicts impact on Streams perfomance of different  Streams pool
                                                                                                sizes

        Threshold-based Alerts                                            ENABLED  ENABLED  NO  Controls if Threshold-based Alerts should be enabled
        Time Model Events                        V$SESS_TIME_MODEL        ENABLED  ENABLED  YES Enables Statics collection for time events
        Timed Statistics                                                  ENABLED  ENABLED  YES Enables gathering of timed statistics
        Ultrafast Latch Statistics                                        ENABLED  ENABLED  NO  Maintains statistics for ultrafast latches in the fast path
        Undo Advisor, Alerts and Fast Ramp up    V$UNDOSTAT               ENABLED  ENABLED  NO  Transaction layer manageability features
        V$IOSTAT_* statistics                                             ENABLED  ENABLED  NO  Controls if I/O stats in v$iostat_ should be enabled


27 rows selected.

The key thing to note from this output is that there are only two statistics collections that are enabled by setting statistics_level to all: “Plan Execution Statistics” and “Timed OS Statistics”. Make the change and you’ll see these two statistics reported as ENABLED. If you look behind the scenes you’ll also find that the parameter timed_os_statistics has changed from 0 to 60 (and v$sysstat starts reporting values for the statistics with names like ‘%OS%’), and that the hidden parameter “_rowsource_execution_statistics” has changed from false to true (so when you execute new queries and use the option format=>’allstats [last]’ in calls to dbms_xplan.display_cursor you get execution stats appearing on every line of the resulting execution plan).
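
If you want to confirm those side-effects for yourself, a query like the following will show the two parameters – a sketch only (it has to run as SYS since it references the x$ structures, and the structures may vary with version):

select
        p.ksppinm       parameter_name,
        v.ksppstvl      session_value
from
        x$ksppi         p,
        x$ksppcv        v
where
        p.indx = v.indx
and     p.ksppinm in (
                'timed_os_statistics',
                '_rowsource_execution_statistics'
        )
;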

Guideline: do not set statistics_level to all at the system level – the overheads can be significant. You might want to set it occasionally for a single session, for a brief interval, to investigate a performance problem – bearing in mind that setting the parameter might itself introduce a whole new performance problem.
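
In practice that means a pattern something like this for the investigating session (using the same display_cursor call that appears in the scripts above):

alter session set statistics_level = all;

-- run (or re-run) just the suspect statement here

select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

alter session set statistics_level = typical;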

User-friendly: NOT

The question on the forum that prompted this note was asking if there were licensing implications of setting the statistics_level, in particular whether there was any requirement to license the diagnostic and performance packs. I didn’t reply to the question – someone else took it on – but the answer is no.

If you look at the output above, though, you’ll see that it reports both “Active Session History” and “Plan Execution Sampling” as ENABLED – even though I know that I’ve set the parameter control_management_pack_access to none. And when I query v$active_session_history the rowcount is always zero – so it’s not really enabled.
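
The sanity check is easy enough – something like:

show parameter control_management_pack_access

select count(*) from v$active_session_history;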

Without going through the manuals, checking the dynamic performance views named in the output above, and looking for related parameters (e.g. the view V$MTTR_TARGET_ADVICE and the parameter fast_start_mttr_target), I can’t really be sure what it means to say that an entry in the output is “ENABLED”.

I wonder if my query to check whether v$active_session_history was populated will have set the feature audit to say that I’ve used the active session history!
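
If you share that worry you could (as a suitably privileged user) examine the feature usage view with a query like the one below – a sketch, bearing in mind that the view is only refreshed periodically so any effect of a recent query may take some time to appear:

select
        name, detected_usages, currently_used, last_usage_date
from
        dba_feature_usage_statistics
where
        name like '%Active Session History%'
;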

Footnote

There have been a couple of changes to this view over time. In 12.2, for example, I noted a statistic called “Global Cache CPU Statistics” which wasn’t present in 21.3; conversely in 21.3 I noted a statistic “Object Activity Tracking” that wasn’t present in 12.2.

You could disable most of the stats collections and advisors at the system level by setting statistics_level to “basic” – but (a) you might find that the attempt to do so raises Oracle errors if you haven’t previously disabled some of the default mechanisms, e.g.:

ORA-32017: failure in updating SPFILE
ORA-00830: cannot set statistics_level to BASIC with auto-tune SGA enabled

and (b) you will probably find one day that 20/20 hindsight tells you that you really could have used some of those advisors to find out why your production system is misbehaving.
