Oracle Scratchpad

January 27, 2022

Exadata Catalogue

Filed under: Oracle — Jonathan Lewis @ 6:18 pm GMT Jan 27,2022

This is a list of all articles I’ve written that talk about Exadata, with a date stamp and short note of the contents. The articles are listed most-recent first.

I am currently working backwards through my entire blog adding articles to catalogues as I go, so any individual catalogue may grow very slowly.

Other authors:


September 14, 2010

Exadata

Filed under: Exadata,Infrastructure — Jonathan Lewis @ 6:58 pm BST Sep 14,2010

Inevitably there’s still excitement and uncertainty in the air about Exadata – and/or the Database Machine. For ease of reference and a jolly good read I’ve started collecting a few blog postings about it:

Jason Arneil’s Blog

Pythian

Kerry Osborne

Kevin Closson

Tanel Poder

Uwe Hesse

Arup Nanda

Oracle Official Links

 

June 13, 2023

Case Study

Filed under: Oracle,Performance,Problem Solving,Troubleshooting,Tuning — Jonathan Lewis @ 12:50 pm BST Jun 13,2023

Here’s a problem that appeared on the Oracle-L list server a little while ago. It features an Exadata system with a “very large” GTT (global temporary table) and a pair of queries that are repeated frequently, using a different literal value on each repetition of the pair.

The queries, with sample execution plans, have been reported using the SQL Monitor feature and are available on github, but I’ll repeat them here, individually, for convenience. The first query is a very simple single-table with aggregation:

SELECT FG.FID,   FG.DCD, FG.MDT,1.00 , SUM (FG.N_CD), SUM (FG.d_nsd), and several more summed columns
    FROM F_GTT FG
   WHERE FG.FID = {literal constant}
GROUP BY FG.FID, FG.DCD, FG.MDT
ORDER BY 1, 3


Global Information
------------------------------
 Status              :  DONE (ALL ROWS)           
 Instance ID         :  3                         
 SQL ID              :  gmm2v6dtvu0yb             
 SQL Execution ID    :  50331648                  
 Execution Started   :  06/01/2023 14:52:21       
 First Refresh Time  :  06/01/2023 14:52:25       
 Last Refresh Time   :  06/01/2023 14:53:11       
 Duration            :  50s                       
 Fetch Calls         :  1                         

Global Stats
==================================================================================================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  |  Other   | Fetch | Buffer | Read  | Read  | Uncompressed |  Offload   |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Waits(s) | Calls |  Gets  | Reqs  | Bytes |    Bytes     | Elig Bytes | Returned Bytes |
==================================================================================================================================================
|      50 |      48 |     0.68 |        0.01 |     0.00 |     0.72 |     1 |     3M | 45632 |  22GB |         22GB |       22GB |           22GB |
==================================================================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3209725901)
==============================================================================================================================================================================
| Id |          Operation           |      Name      |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read  | Read  |  Mem  | Activity |      Activity Detail      |
|    |                              |                | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs  | Bytes | (Max) |   (%)    |        (# samples)        |
==============================================================================================================================================================================
|  0 | SELECT STATEMENT             |                |         |      |        47 |     +4 |     1 |       31 |       |       |     . |          |                           |
|  1 |   SORT GROUP BY              |                |     217 | 792K |        47 |     +4 |     1 |       31 |       |       | 14336 |          |                           |
|  2 |    TABLE ACCESS STORAGE FULL | F_GTT          |   63836 | 792K |        50 |     +1 |     1 |    37491 | 45632 |  22GB |  14MB |    97.96 | Cpu (46)                  |
|    |                              |                |         |      |           |        |       |          |       |       |       |          | cell smart table scan (2) |
==============================================================================================================================================================================

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - storage("FG"."FID"={literal value})
       filter("FG"."FID"={literal value})
	   

It’s not a difficult query, and it’s easy to see that almost all the time is spent in a tablescan that reads 22GB to identify 37,491 rows, with fractions of a second spent aggregating those rows. And of the time spent in the tablescan almost all of it is CPU time (rather than waits for I/O).

Details to note:

  • We’ve done 45,000 read requests for 22GB of data read, which suggests we’re doing reads of roughly 500KB (rather than the 1MB that you might have expected) – the arithmetic is sketched just after this list.
  • We’ve done 3M buffer gets – which is a good match for 22GB of data read with an 8KB block size, so it looks like we’re doing a simple tablescan into buffer cache. (This is a little unexpected since we spend time on “cell smart tablescan” so it would be nice to see the session statistics for one execution so that we could see the underlying reason for these buffer gets).
  • We don’t seem to have saved any data traffic through storage indexes (but that might be because Oracle doesn’t create storage indexes for temporary tablespaces – that’s a detail I’d have to check).
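
A back-of-envelope check on the first two observations, written as a quick query against dual just to show the arithmetic (the 45,632 comes from the Read Reqs column of the Global Stats above):

select
        round(22 * 1024 * 1024 / 45632)        kb_per_read_req,    -- approx 505 KB per read request
        round(22 * 1024 * 1024 * 1024 / 8192)  blocks_at_8kb       -- approx 2.9M blocks, consistent with the 3M buffer gets
from    dual
;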

It would be nice to know, of course, how many rows there were in the table, how many distinct values of FID there were, and how variable the number of rows per FID was across the table. Based on the little we know, though, and assuming that this single execution represents a fairly typical value of FID, the basic question seems to be: would we benefit from defining an appropriate index on the table? (There are several other ways you could choose to improve the performance if you had a completely free hand and could make whatever simple changes you liked in the surrounding code.)

We can note, of course, that with a suitable index we might have done about 37,500 (single block) read requests compared to 45,000 multi-block read requests, and that looks like a small win on I/O but with a potentially big win on CPU. The obvious index is on (FID), defined with “compress 1” but if we created the index on (FID, MDT, DCD) we might get some extra benefit since Oracle could walk the index using a “SORT GROUP BY NOSORT” operation to aggregate and return the data in the right order.
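
As a sketch only (the index names are hypothetical, the compression level on the larger index would have to be checked against the data, and – as noted further down – the indexes would have to be created as part of the GTT definition, before the data is loaded):

create index f_gtt_i1 on f_gtt(fid) compress 1;

-- or, to allow a "sort group by nosort" and return the data in the right order:
create index f_gtt_i2 on f_gtt(fid, mdt, dcd) compress 1;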

Note: the SQL says: select fid, dcd, mdt … group by fid, dcd, mdt order by 1,3 – but Oracle has assumed that the order by clause can be ignored because of the group by (the optimizer’s allowed to do that, programmers aren’t). This means the optimizer has recognized that it could treat the query as group by fid, mdt, dcd order by 1,2 and produce the same result.

Since the initial 37,491 rows for the FID collapsed to 31 rows after aggregation, we might find that we could even create the larger index with “compress 3”.

Here’s another thought that goes with the idea of creating the perfect index for walking the data and avoiding sorting for every execution of the query: this table is re-populated before the queries are executed so could we populate the table with pre-ordered data so that all the rows for any individual FID are packed into the smallest number of blocks?
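
A minimal sketch of that idea, assuming the load is (or could become) a simple insert/select – the source of the data here is purely hypothetical:

insert  /*+ append */ into f_gtt
select  {columns}
from    {source of the GTT data}
order by
        fid, mdt, dcd
;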

Of course, changes like adding an index and pre-sorting data might have significant side effects in other parts of the code. What we’re trying to do here is find big wins using simple strategies that avoid complicated code changes – we need to be sure that we haven’t introduced side-effects that mean we’ve moved the problem rather than solving it.

We should keep our thoughts about the first query in mind as we move on to the second query. This one is a very simple join and aggregation (via distinct) which features the same very large tablescan:

SELECT  DISTINCT ft.fT_NM, FGTT.DCD, ft.COl5, ft.FT_CID, RGTT.mcg_ind,   C.D_CD
    FROM rt_gtt         RGTT,
         F_GTT          FGTT,
         RD_FTIN ft,
         RDC           c
   WHERE     FGTT.eid = RGTT.nid
         AND FGTT.etyp = RGTT.nety
         AND FGTT.FID = ft.ftid
         AND FGTT.FID = {literal value}
         AND FGTT.DCD = c.DCD
         AND ft.FT_CID = RGTT.mcid
ORDER BY ft.fT_NM, DCD

	   
	   Global Information
------------------------------
 Status              :  DONE (ALL ROWS)           
 Instance ID         :  3                         
 SQL ID              :  c3y3n73an8b69             
 SQL Execution ID    :  50331648                  
 Execution Started   :  06/01/2023 14:54:39       
 First Refresh Time  :  06/01/2023 14:54:43       
 Last Refresh Time   :  06/01/2023 14:55:17       
 Duration            :  38s                       
 Fetch Calls         :  1                         

Global Stats
==================================================================================================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  |  Other   | Fetch | Buffer | Read  | Read  | Uncompressed |  Offload   |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Waits(s) | Calls |  Gets  | Reqs  | Bytes |    Bytes     | Elig Bytes | Returned Bytes |
==================================================================================================================================================
|      39 |      37 |     0.57 |        0.01 |     0.00 |     0.59 |     1 |     4M | 45640 |  22GB |         22GB |       22GB |           22GB |
==================================================================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=891026795)
========================================================================================================================================================================================
| Id |             Operation              |             Name             |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read  | Read  |  Mem  | Activity | Activity Detail |
|    |                                    |                              | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs  | Bytes | (Max) |   (%)    |   (# samples)   |
========================================================================================================================================================================================
|  0 | SELECT STATEMENT                   |                              |         |      |         1 |    +38 |     1 |        1 |       |       |     . |          |                 |
|  1 |   SORT UNIQUE                      |                              |   19196 |   6M |         1 |    +38 |     1 |        1 |       |       |  2048 |          |                 |
|  2 |    NESTED LOOPS                    |                              |         |      |         1 |    +38 |     1 |      380 |       |       |     . |          |                 |
|  3 |     NESTED LOOPS                   |                              |   19196 |   6M |         1 |    +38 |     1 |      380 |       |       |     . |          |                 |
|  4 |      NESTED LOOPS                  |                              |      94 |   6M |         1 |    +38 |     1 |        2 |       |       |     . |          |                 |
|  5 |       NESTED LOOPS                 |                              |      1M |   4M |         1 |    +38 |     1 |        2 |       |       |     . |          |                 |
|  6 |        VIEW                        | VW_DTP_E7E675E0              |      3M | 815K |         1 |    +38 |     1 |        2 |       |       |     . |          |                 |
|  7 |         HASH UNIQUE                |                              |      3M | 815K |        35 |     +4 |     1 |        2 |       |       |  11MB |          |                 |
|  8 |          TABLE ACCESS STORAGE FULL | F_GTT                        |      3M | 792K |        39 |     +0 |     1 |    27239 | 45638 |  22GB |  14MB |   100.00 | Cpu (38)        |
|  9 |        TABLE ACCESS BY INDEX ROWID | rt_gtt                       |       1 |    2 |         1 |    +38 |     2 |        2 |       |       |     . |          |                 |
| 10 |         INDEX RANGE SCAN           | rt_gtt_IX5                   |       1 |    1 |         1 |    +38 |     2 |        2 |       |       |     . |          |                 |
| 11 |       TABLE ACCESS BY INDEX ROWID  | RD_FTIN                      |       1 |    3 |         1 |    +38 |     2 |        2 |       |       |     . |          |                 |
| 12 |        INDEX RANGE SCAN            | RD_FTIN_IX1                  |       1 |    2 |         1 |    +38 |     2 |        2 |       |       |     . |          |                 |
| 13 |      INDEX RANGE SCAN              | RDC_IX1                      |     206 |    1 |         1 |    +38 |     2 |      380 |     2 | 16384 |     . |          |                 |
| 14 |     TABLE ACCESS BY INDEX ROWID    | RDC                          |     204 |  169 |         1 |    +38 |   380 |      380 |       |       |     . |          |                 |
========================================================================================================================================================================================

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - storage("FGTT"."FID"={literal value})
       filter("FGTT"."FID"={literal value})
   9 - filter("ITEM_3"="RGTT"."nety")
  10 - access("ITEM_4"="RGTT"."nid")
  11 - filter("FT"."FT_CID"="RGTT"."mcid")
  12 - access("ITEM_2"="FT"."ftid")
  13 - access("ITEM_1"="C"."DCD")

Again we have a query that selects all the rows for a given FID from the global temporary table, but this time it joins them to three other tables before eliminating duplicates. Unsurprisingly a very large fraction of the time is spent on the tablescan with the same 45,000 read requests pushing 22GB across the interconnect to the buffer cache, with most of the time recorded as CPU time not I/O wait time.

There’s a very interesting feature of this plan, though. The SQL statement says “join then make distinct”, but the plan reports “make distinct then join” – applying a hash unique operation to the GTT data to reduce 27,239 rows to 2 rows before joining the other three tables. (I’ve written about the “group by placement” transformation in the past; this is the equivalent transformation for select distinct.)

Note: it’s a little odd that two queries running just one and a half minutes apart from each other should have estimates which are out by a factor of two in one case and 100 in the other for a simple “column = constant” predicate. This might be something to do with “out of range” values, or a side effect of a hybrid histogram with an unlucky sample size.

The only thing that really matters in this query is the tablescan of the GTT, and obviously we could again consider creating an index on (FID) as a possible way to reduce the workload, giving us a single extra index that might benefit both queries. In this case, though, there are (only) three other columns from the GTT appearing in the query, so we could create an index that included those columns so that Oracle didn’t have to visit the table at all. For example (FID, eid, etyp, dcd) – with some level of compression that we’d have to decide on by reviewing the data.
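
Again only as a sketch (hypothetical index name, compression level to be confirmed against the data):

create index f_gtt_i3 on f_gtt(fid, eid, etyp, dcd) compress 1;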

Of course that does leave us wanting two new indexes on the same large table that have to be maintained as the data is loaded (remember that you can’t create indexes on GTTs after the data has been inserted). So we ought to look for alternative strategies.

Again the reduction from tens of thousands of rows to “just a few” rows prompts an idea. I’m guessing that Oracle aggregates (distinct) the GTT table down with the equivalent of an inline view: “select distinct fid, eid, etyp, dcd from … “ (the 10053 (CBO) trace might show this as one of the “unparsed” queries). If that’s the case, why not have a summary GTT that emulates a “refresh complete materialized view” on the main GTT and populate it with the result of that select distinct immediately after populating the main GTT; then reference the summary GTT in the join query instead of the main GTT. With a reduction by a factor of around 10,000 in the number of rows and a further reduction due to the column restriction, the tablescan might drop from 22GB to just a couple of (cached) megabytes.
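
A minimal sketch of the idea – every name and datatype here is an assumption, and the column list would have to match the real GTT definition:

create global temporary table f_gtt_join_summary (
        fid     number,
        eid     number,
        etyp    varchar2(10),
        dcd     varchar2(10)
)
on commit preserve rows
;

-- once per load, immediately after populating the main GTT:
insert into f_gtt_join_summary
select  distinct fid, eid, etyp, dcd
from    f_gtt
;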

With the “materialized view” concept in mind, we could revisit the first query. The big problem with the current coding strategy is that we populate a large table once, then keep executing a pair of queries that are interested in one FID at a time. Maybe we could put a wrapper around the whole thing to populate two pre-calculated, small, summary tables, then tweak the two problem queries to reference their corresponding “materialized views”.
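
The corresponding sketch for the first query would follow the same pattern as the one above – again all names are hypothetical and the remaining summed columns would have to be added:

insert into f_gtt_agg_summary
select  fid, dcd, mdt, sum(n_cd), sum(d_nsd)    -- plus the other summed columns
from    f_gtt
group by
        fid, dcd, mdt
;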

Summary

The system populates a very large GTT, then seems to be running a loop to execute two queries per “FID” value. Both queries spend most of their time doing a tablescan of the large GTT before aggregating a few thousand rows down to a handful of rows.

A basic strategy for improving performance would be to create two new indexes on this table (though one might be “good enough” for both queries in the short term); the indexing strategy could be made more effective if the code to load the GTT could be modified to pre-sort the data before loading it.

An alternative strategy that takes advantage of the massive drop in volume as the data is aggregated is to pre-load two (much smaller) “summary” GTTs after loading the main GTT and before starting the loop through the FIDs. This would require the two queries to be edited to reference their respective summary tables. (The summary tables could, of course, be indexed to add a little extra benefit.)

January 28, 2022

Hinting Catalogue

Filed under: Oracle — Jonathan Lewis @ 2:38 pm GMT Jan 28,2022

This is a list of all articles I’ve written that talk about hints. Each entry has a date stamp and a short note of the contents. The articles are (may be / will be) grouped in a couple of different categories, but otherwise are generally listed most-recent first.

I am currently working backwards through my entire blog adding articles to catalogues as I go, so any individual catalogue may grow very slowly.

  • opt_estimate() catalogue: a separate catalogue of articles about the (undocumented) opt_estimate() hint and loosely related hints. (Updated Nov 2022)
  • ignoring hints catalogue: a separate catalogue of articles about how Oracle doesn’t ignore hints, or examples of bugs, or explanations of the difficulty of hinting correctly.

  • gby_pushdown (Nov 2023): a hint to change the way parallel query handles aggregation
  • Swap_join_inputs (Nov 2023): Taking care when editing Outline Data to change a join method in an SQL Patch.
  • force_match SQL Patch (Sept 2023): Hinting through SQL patches is nice, but may need a force_match hack if your application uses literals too often
  • Case Study (June 2022): an example of the SQL Monitor report showing that a no_merge hint was the appropriate solution
  • System_stats() hint (Jan 2022): for Exadata, the optimizer can be told to override the stored system stats.
  • Hash Aggregation 1 (Jan 2022): you may need to hint hash aggregation for some queries that group by then order by the same columns
  • use_nl() redux (Oct 2021): where in its processing does Oracle apply join method hints like use_nl()
  • Ordered hint (Sept 2021): you shouldn’t be using it, and this note shows a surprising “place group by” side effect
  • Distributed Query (Aug 2021): A real tangle when using the rule, driving_site and materialize hints (badly) together
  • Driving_site() in a patch (May 2021): specific example of a general problem – changing an inline hint to an SQL Patch may require a target query block name and a fully qualified table alias.
  • use_nl_with_index (Mar 2021): a description of what it can do, and one (current) omission
  • Index descending First N (Dec 2020): until 19c a “first rows” query in descending order had to be hinted with index_desc()
  • Using the num_index_keys() hint (Nov 2018): an undocumented hint that adjusts the access/filter strategy for index range scans in special cases.
  • ignore_row_on_dupkey_index() hint (May 2017): an elegant solution to a “rare” error, but very expensive for frequent errors.
  • Five Hints (Dec 2015): the paper I wrote for a DOAG presentation on the (then) top 5 most important hints.
  • New Index hints (June 2013): Analysis of how Oracle handles “near misses” on hinting by index description.
  • New Index hints (Nov 2011): Oracle 10g introduced index hinting by description. Some examples and comments.
  • v$sql_shared_memory (Jan 2009): Even Oracle’s developers get the use_nl() hint wrong.
  • qb_name (June 2007): An introduction to the hint and its use.

January 27, 2022

Statistics catalogue

Filed under: Oracle — Jonathan Lewis @ 6:16 pm GMT Jan 27,2022

This is a list of all articles I’ve written that talk about statistics, with a date stamp and short note of the contents. I have also created a couple of sub-catalogues around major topics:

I am currently working backwards through my entire blog adding articles to catalogues as I go, so any individual catalogue may grow very slowly.

  • Quiz Night 38 (Jan 2023): What really happens when you try to gather stats on a single column?
  • 19c – better stats (April 2022): Oracle now uses approx_count_distinct() on the whole index in dbms_stats.gather_index_stats().
  • system_stats() hint (Jan 2022): notes on a hint that allows you to set the “noworkload” system statistics in a hint on Exadata.
  • Optimizer Tip (Sept 2021): reprint of a note for IOUG 2015 on column groups and the table_cached_blocks preference
  • Stats preferences (Aug 2021): a simple query to report all the system and table preferences, default or explicitly set
  • Raw to Timestamp (April 2020): how to convert the internal format of a timestamp to a usable value
  • Using the num_index_keys() hint (Nov 2018): controlling the split between access and filter predicates used in the index for complex in-lists.
  • Reporting times of stats operations (Sept 2018): dbms_stats functions to report details of stats gathering operations.
  • table_cached_blocks defect (July 2018): A table preference about indexes, but it isn’t used during create or rebuild of index.
    • table_cached_blocks and RAC (Nov 2015): if you set the table preference to adjust the index clustering_factor, should you allow for the effects of RAC?
    • table_cached_blocks (May 2013): At last, a “legal” way to adjust the clustering_factor to a realistic value.
  • Column Stats (Jan 2018): Getting the correct syntax for a complex method_opt.
  • Optimising statspack (Oct 2016): selective stats collection for data dictionary / internal queries.
  • v$lock efficiency (May 2013): the benefit of collecting stats on individual x$ structures
  • Minimum Stats (Sept 2012): auto_sample_size on an index tends to mean 1,140 blocks

Other Authors

January 26, 2022

system_stats() hint

Filed under: 18c,19c,Hints,Oracle,Statistics,System Stats — Jonathan Lewis @ 9:46 am GMT Jan 26,2022

Starting from 18.3 Oracle introduced the system_stats() hint, apparently allowing you to set the system statistics for the duration of a query. However the hint didn’t seem to have any effect in that version of Oracle – even though the fix_control that seemed to be the most relevant (QKSFM_DBMS_STATS_24952618) was set to 1 – so maybe the hint was acting strictly according to the fix control’s description, which was: “turn on Exadata stats: MBRC,IOTFRSPEED,IOSEEKTIME” (or maybe the fix control had nothing to do with the hint).

According to my notes I had a test that showed it working on live SQL, which (in my notes) I said was running 19.2 at the time; however, I can’t get it to work on 19.11.0.0 or 21.3.0.0 on a Linux VM (or on the current Live SQL version) despite a load of fiddling with potentially relevant hidden parameters, fix controls, and numeric event numbers. So maybe it is only for Exadata.

It’s not documented, of course, but I’m fairly confident I’m using the correct syntax – which was quite easy to find (sometimes you get lucky) because a search through the binary for the hint text produced a perfect result:


[oracle@linux183 bin]$ strings -a oracle | grep -T -n  -i system_stats\(
1762556:BEGIN :1 := dbms_stats_internal.store_system_stats(:2, :3, :4); END;
1787190:system_stats(mbrc=%f ioseektim=%f iotfrspeed=%f)

So it would seem (from line 1787190) that we can override three of the system statistics: mbrc, ioseektim, and iotfrspeed. Thanks to the hint_report option that 19c introduced for the display_xxxxxx() calls in dbms_xplan it’s easy to see that this syntax is correct but unused. From a call to dbms_xplan.display_cursor() in 19.11.0.0:

select  /*+ system_stats(mbrc=128 ioseektim=1 iotfrspeed=262144) */ count(*) from t1

Plan hash value: 3724264953

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |  2732 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 50000 |  2732   (1)| 00:00:01 |
-------------------------------------------------------------------

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (U - Unused (1))
---------------------------------------------------------------------------
   0 -  STATEMENT
         U -  system_stats(mbrc=128 ioseektim=1 iotfrspeed=262144)

Other tests reported shorter versions of the hint (e.g. /*+ system_stats(mbrc=128) */ ) as errors:


Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (E - Syntax error (1))
---------------------------------------------------------------------------
   1 -  SEL$1
         E -  system_stats

In passing, it’s interesting to note that the text was reported as a “query block” hint (sel$1) when it had a syntax error despite being a “statement-level” hint when it was recognised. Presumably the generic parsing rule is: “it’s a query block hint unless proved otherwise”.

The call to dbms_stats_internal.store_system_stats() that also dropped out of the scan of the executable looks as if it’s the function that sets the “noworkload” statistics – the three parameters are, in order: ioseektim, iotfrspeed, cpuspeednw – but possibly it’s the internal call used when you use the ‘EXADATA’ option for gathering system stats.
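
For comparison, the documented (and persistent) way of setting these values – nothing to do with the hint itself – is dbms_stats.set_system_stats(); a minimal sketch using the three values that the hint text reports:

begin
        dbms_stats.set_system_stats('ioseektim',  1);       -- I/O seek time (ms)
        dbms_stats.set_system_stats('iotfrspeed', 262144);  -- I/O transfer speed (bytes/ms)
        dbms_stats.set_system_stats('mbrc',       128);     -- multiblock read count (blocks)
end;
/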

Bottom line:

Maybe there’s a way to switch this hint on to override the default system stats; maybe it just needs to be run on Exadata; and maybe – if it can be switched on – it could be attached as an SQL_Patch. Experimentation is left to readers who have access to an Exadata system; any results are welcome.

January 25, 2022

Catalog Catalogue

Filed under: Oracle — Jonathan Lewis @ 12:22 pm GMT Jan 25,2022

This is a good starting point for researching a topic on my blog. It’s simply a list of the various catalog(ue)s of articles that I’ve written over the years. There’s still a lot of work to do, though, as there are more than 1,600 published articles on the blog and, though I have published a few of the catalogues already, they are almost guaranteed to be incomplete.

The target state is for each catalogue to list all relevant articles (which means some articles will appear in more than one catalogue), generally in reverse order of publication, with the date of each entry.

Each entry lists the number of articles currently in the catalogue, breaking them down by author (viz: me / others). In the near future, when most of the catalogues have a reasonable number of recent entries, this catalogue will datestamp each listed catalogue with its last update date and highlight a short selection of the most recently updated.

 

October 11, 2021

Adaptive Study

Filed under: Execution plans,Oracle — Jonathan Lewis @ 11:57 am BST Oct 11,2021

This is a little case study of adaptive optimisation in Oracle 19c with a surprising side-effect showing up when the optimizer gave the execution engine the option to “do the right thing” and the execution engine took it – except the “right thing” turned out to be a wrong thing.

We started with a request to the Oracle-L list server asking about the difference between the operations “table access by rowid” and “table access by rowid batched”, why changing the parameter “optimizer_adaptive_reporting_only” should make a plan switch from one to the other, and how much of a performance impact this would have – because this was the only change that showed up in a plan that went from fast (enough) to very slow when the parameter was changed from true to false.

The batching (or not) of the table access really shouldn’t make much difference; the batch option tends to appear if there’s a “blocking” operation (such as a hash join) further up the execution plan, but the mechanism by which a rowsource is produced and passed up the tree is only likely to be affected very slightly. So there had to be something else going on.

Fortunately the OP had the SQL Monitor reports available from a fast / non-batched / “reporting only = true” run and a slow / batched / “reporting only = false” run. I’ve shown both reports below:

Fast plan (reporting only):

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)         
 Instance ID         :  2                       
 Session             :  XXXXX (510:5394) 
 SQL ID              :  791qwn38bq6gv           
 SQL Execution ID    :  33554432                
 Execution Started   :  10/07/2021 11:46:56     
 First Refresh Time  :  10/07/2021 11:46:56     
 Last Refresh Time   :  10/07/2021 11:51:36     
 Duration            :  280s                    
 Module/Action       :  SQL*Plus/-              
 Service             :  XXXXX.XXXXX.com 
 Program             :  sqlplus.exe             
 Fetch Calls         :  370                     

Global Stats
===========================================================================
| Elapsed |   Cpu   |    IO    | Cluster  | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
===========================================================================
|     252 |     170 |       71 |       11 |   370 |    39M | 251K |   2GB |
===========================================================================

SQL Plan Monitoring Details (Plan Hash Value=250668601)
===============================================================================================================================================================================
| Id |                      Operation                       |             Name              |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read  | Read  |  Mem  |
|    |                                                      |                               | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs  | Bytes | (Max) |
===============================================================================================================================================================================
|  0 | SELECT STATEMENT                                     |                               |         |       |       279 |     +2 |     1 |       2M |       |       |     . |
|  1 |   FILTER                                             |                               |         |       |       279 |     +2 |     1 |       2M |       |       |     . |
|  2 |    NESTED LOOPS OUTER                                |                               |       1 |    3M |       279 |     +2 |     1 |       2M |       |       |     . |
|  3 |     NESTED LOOPS OUTER                               |                               |       1 |    3M |       279 |     +2 |     1 |       2M |       |       |     . |
|  4 |      HASH JOIN OUTER                                 |                               |       1 |    3M |       279 |     +2 |     1 |       2M |       |       |     . |
|  5 |       NESTED LOOPS OUTER                             |                               |       1 |    3M |       279 |     +2 |     1 |       2M |       |       |     . |
|  6 |        STATISTICS COLLECTOR                          |                               |         |       |       279 |     +2 |     1 |       2M |       |       |     . |
|  7 |         NESTED LOOPS OUTER                           |                               |       1 |    3M |       279 |     +2 |     1 |       2M |       |       |     . |
|  8 |          HASH JOIN OUTER                             |                               |       1 |    3M |       279 |     +2 |     1 |       2M |       |       |     . |
|  9 |           NESTED LOOPS OUTER                         |                               |       1 |    3M |       279 |     +2 |     1 |       2M |       |       |     . |
| 10 |            STATISTICS COLLECTOR                      |                               |         |       |       279 |     +2 |     1 |       2M |       |       |     . |
| 11 |             NESTED LOOPS OUTER                       |                               |       1 |    3M |       279 |     +2 |     1 |       2M |       |       |     . |
| 12 |              NESTED LOOPS OUTER                      |                               |       1 |    3M |       279 |     +2 |     1 |       2M |       |       |     . |
| 13 |               NESTED LOOPS                           |                               |    272K |    2M |       279 |     +2 |     1 |       2M |       |       |     . |
| 14 |                NESTED LOOPS OUTER                    |                               |    272K |    2M |       279 |     +2 |     1 |       2M |       |       |     . |
| 15 |                 NESTED LOOPS                         |                               |    272K |    2M |       279 |     +2 |     1 |       2M |       |       |     . |
| 16 |                  NESTED LOOPS OUTER                  |                               |    272K |    1M |       279 |     +2 |     1 |       2M |       |       |     . |
| 17 |                   NESTED LOOPS                       |                               |    272K |    1M |       279 |     +2 |     1 |       2M |       |       |     . |
| 18 |                    FILTER                            |                               |         |       |       279 |     +2 |     1 |       2M |       |       |     . |
| 19 |                     NESTED LOOPS OUTER               |                               |    272K |  598K |       279 |     +2 |     1 |       2M |       |       |     . |
| 20 |                      VIEW                            | index$_join$_006              |    276K | 48299 |       279 |     +2 |     1 |       2M |       |       |     . |
| 21 |                       HASH JOIN                      |                               |         |       |       279 |     +2 |     1 |       2M |       |       | 132MB |
| 22 |                        HASH JOIN                     |                               |         |       |         2 |     +1 |     1 |       2M |       |       | 124MB |
| 23 |                         INDEX STORAGE FAST FULL SCAN | TET_IX2                       |    276K |  8505 |         1 |     +2 |     1 |       2M |       |       |     . |
| 24 |                         INDEX STORAGE FAST FULL SCAN | TET_IX4                       |    276K | 13077 |         1 |     +2 |     1 |       2M |       |       |     . |
| 25 |                        INDEX STORAGE FAST FULL SCAN  | TET_PK                        |    276K | 11889 |       279 |     +2 |     1 |       2M |   149 |  62MB |     . |
| 26 |                      TABLE ACCESS BY INDEX ROWID     | TT                            |       1 |     2 |       279 |     +2 |    2M |       2M |  2347 |  18MB |     . |
| 27 |                       INDEX UNIQUE SCAN              | TT_PK                         |       1 |     1 |       279 |     +2 |    2M |       2M |    11 | 90112 |     . |
| 28 |                    TABLE ACCESS BY INDEX ROWID       | TM                            |       1 |     2 |       279 |     +2 |    2M |       2M | 12476 |  97MB |     . |
| 29 |                     INDEX UNIQUE SCAN                | TM_PK                         |       1 |     1 |       279 |     +2 |    2M |       2M |  1683 |  13MB |     . |
| 30 |                   TABLE ACCESS BY INDEX ROWID        | TU                            |       1 |     1 |       257 |    +21 |    2M |    17764 |   137 |   1MB |     . |
| 31 |                    INDEX UNIQUE SCAN                 | TU_PK                         |       1 |       |       257 |    +21 |    2M |    17764 |     1 |  8192 |     . |
| 32 |                  TABLE ACCESS BY INDEX ROWID         | TEP                           |       1 |     2 |       279 |     +2 |    2M |       2M |  155K |   1GB |     . |
| 33 |                   INDEX UNIQUE SCAN                  | TEP_PK                        |       1 |     1 |       279 |     +2 |    2M |       2M |  1729 |  14MB |     . |
| 34 |                 TABLE ACCESS BY INDEX ROWID          | TLIM                          |       1 |     1 |       279 |     +2 |    2M |       2M |       |       |     . |
| 35 |                  INDEX UNIQUE SCAN                   | TLIM_PK                       |       1 |       |       279 |     +2 |    2M |       2M |       |       |     . |
| 36 |                TABLE ACCESS BY INDEX ROWID           | TLPSE                         |       1 |     1 |       279 |     +2 |    2M |       2M |       |       |     . |
| 37 |                 INDEX UNIQUE SCAN                    | TLPSE_PK                      |       1 |       |       279 |     +2 |    2M |       2M |       |       |     . |
| 38 |               INDEX RANGE SCAN                       | TCX_IX2                       |       1 |     2 |       279 |     +2 |    2M |       2M |  8870 |  69MB |     . |
| 39 |              TABLE ACCESS BY INDEX ROWID             | TC                            |       1 |     2 |       279 |     +2 |    2M |       2M | 14648 | 114MB |     . |
| 40 |               INDEX UNIQUE SCAN                      | TC_PK                         |       1 |     1 |       279 |     +2 |    2M |       2M |   157 |   1MB |     . |
| 41 |            INDEX RANGE SCAN                          | TCX_PK                        |       1 |     2 |       279 |     +2 |    2M |       2M |       |       |     . |
| 42 |           INDEX RANGE SCAN                           | TCX_PK                        |       1 |     2 |           |        |       |          |       |       |     . |
| 43 |          TABLE ACCESS BY INDEX ROWID                 | TC                            |       1 |     2 |       279 |     +2 |    2M |       2M | 16037 | 125MB |     . |
| 44 |           INDEX UNIQUE SCAN                          | TC_PK                         |       1 |     1 |       279 |     +2 |    2M |       2M |   224 |   2MB |     . |
| 45 |        TABLE ACCESS BY INDEX ROWID                   | TP                            |       1 |     3 |       279 |     +2 |    2M |       2M |       |       |     . |
| 46 |         INDEX RANGE SCAN                             | TP_PK                         |      15 |     1 |       279 |     +2 |    2M |      28M |       |       |     . |
| 47 |       TABLE ACCESS BY INDEX ROWID                    | TP                            |       1 |     3 |           |        |       |          |       |       |     . |
| 48 |        INDEX RANGE SCAN                              | TP_PK                         |      15 |     1 |           |        |       |          |       |       |     . |
| 49 |      TABLE ACCESS STORAGE FULL FIRST ROWS            | TLIET                         |       1 |     3 |       279 |     +2 |    2M |       2M |       |       |     . |
| 50 |     VIEW PUSHED PREDICATE                            | TEB_VW                        |       1 |    57 |       256 |    +24 |    2M |     1459 |       |       |     . |
| 51 |      NESTED LOOPS OUTER                              |                               |       1 |    57 |       272 |     +8 |    2M |     1459 |       |       |     . |
| 52 |       NESTED LOOPS                                   |                               |       1 |    55 |       256 |    +24 |    2M |     1459 |       |       |     . |
| 53 |        NESTED LOOPS                                  |                               |       1 |    53 |       256 |    +24 |    2M |     1459 |       |       |     . |
| 54 |         NESTED LOOPS                                 |                               |       1 |    51 |       272 |     +9 |    2M |     1459 |       |       |     . |
| 55 |          NESTED LOOPS                                |                               |       5 |    41 |       279 |     +2 |    2M |     6965 |       |       |     . |
| 56 |           NESTED LOOPS                               |                               |       1 |     7 |       279 |     +2 |    2M |     770K |       |       |     . |
| 57 |            NESTED LOOPS                              |                               |       1 |     4 |       279 |     +2 |    2M |     770K |       |       |     . |
| 58 |             NESTED LOOPS                             |                               |       1 |     3 |       279 |     +2 |    2M |     770K |       |       |     . |
| 59 |              TABLE ACCESS BY INDEX ROWID             | TEP                           |       1 |     3 |       279 |     +2 |    2M |     770K |       |       |     . |
| 60 |               INDEX UNIQUE SCAN                      | TEP_PK                        |       1 |     2 |       279 |     +2 |    2M |       2M |       |       |     . |
| 61 |              INDEX RANGE SCAN                        | TLP_IX1                       |       1 |       |       279 |     +2 |  770K |     770K |       |       |     . |
| 62 |             VIEW                                     |                               |       1 |     1 |       279 |     +2 |  770K |     770K |       |       |     . |
| 63 |              SORT AGGREGATE                          |                               |       1 |       |       279 |     +2 |  770K |     770K |       |       |     . |
| 64 |               TABLE ACCESS BY INDEX ROWID            | TPR                           |       1 |     1 |       279 |     +2 |  770K |     770K |       |       |     . |
| 65 |                INDEX UNIQUE SCAN                     | TPR_PK                        |       1 |       |       279 |     +2 |  770K |     770K |       |       |     . |
| 66 |            TABLE ACCESS BY INDEX ROWID               | TET                           |       1 |     3 |       279 |     +2 |  770K |     770K | 28892 | 226MB |     . |
| 67 |             INDEX RANGE SCAN                         | TET_Ix1                       |       1 |     2 |       279 |     +2 |  770K |     899K |  6957 |  54MB |     . |
| 68 |           TABLE ACCESS BY INDEX ROWID                | TWE                           |       5 |    34 |       272 |     +9 |  770K |     6965 |   890 |   7MB |     . |
| 69 |            INDEX RANGE SCAN                          | TWE_IDX1                      |      35 |     2 |       272 |     +9 |  770K |     6965 |    22 | 176KB |     . |
| 70 |          TABLE ACCESS BY INDEX ROWID                 | TT                            |       1 |     2 |       272 |     +9 |  6965 |     1459 |       |       |     . |
| 71 |           INDEX UNIQUE SCAN                          | TT_PK                         |       1 |     1 |       272 |     +9 |  6965 |     6965 |       |       |     . |
| 72 |         INDEX RANGE SCAN                             | TCX_IX2                       |       1 |     2 |       256 |    +24 |  1459 |     1459 |   932 |   7MB |     . |
| 73 |        TABLE ACCESS BY INDEX ROWID                   | TC                            |       1 |     2 |       256 |    +24 |  1459 |     1459 |       |       |     . |
| 74 |         INDEX UNIQUE SCAN                            | TC_PK                         |       1 |     1 |       256 |    +24 |  1459 |     1459 |       |       |     . |
| 75 |       TABLE ACCESS BY INDEX ROWID                    | TLS                           |       1 |     2 |       256 |    +24 |  1459 |     1451 |       |       |     . |
| 76 |        INDEX SKIP SCAN                               | TLS_PK                        |       1 |     1 |       256 |    +24 |  1459 |     1451 |       |       |     . |
| 77 |    SORT AGGREGATE                                    |                               |       1 |       |       279 |     +2 |    2M |       2M |       |       |     . |
| 78 |     FIRST ROW                                        |                               |       1 |     3 |       279 |     +2 |    2M |       2M |       |       |     . |
| 79 |      INDEX RANGE SCAN (MIN/MAX)                      | TCX_IX2                       |       1 |     3 |       279 |     +2 |    2M |       2M |       |       |     . |
===============================================================================================================================================================================

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 3 (U - Unused (1))
---------------------------------------------------------------------------
    0 -  STATEMENT
         U -  first_rows / hint overridden by another in parent query block
           -  first_rows
 
  56 -  SEL$5
           -  no_merge
 
Note
-----
   - this is an adaptive plan

Slow plan (runtime adapted):

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)          
 Instance ID         :  2                        
 Session             :  XXXXX (509:27860) 
 SQL ID              :  8t19y7v5j9ztg            
 SQL Execution ID    :  33554432                 
 Execution Started   :  10/07/2021 07:56:09      
 First Refresh Time  :  10/07/2021 07:56:09      
 Last Refresh Time   :  10/07/2021 08:07:17      
 Duration            :  668s                     
 Module/Action       :  SQL*Plus/-               
 Service             :  XXXXX.XXXXX.com  
 Program             :  sqlplus.exe              
 Fetch Calls         :  370                      

Global Stats
==========================================================================================================================
| Elapsed |   Cpu   |    IO    | Concurrency | Cluster  | Fetch | Buffer | Read | Read  | Write | Write |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs | Bytes | Reqs  | Bytes | Returned Bytes |
==========================================================================================================================
|     705 |     280 |      270 |        0.00 |      155 |   370 |    40M | 984K |  11GB |  6422 |   3GB |            6GB |
==========================================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3015036808)
========================================================================================================================================================================================================
| Id |                      Operation                       |             Name              |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read  | Read  | Write | Write |  Mem  | Temp  |
|    |                                                      |                               | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs  | Bytes | Reqs  | Bytes | (Max) | (Max) |
========================================================================================================================================================================================================
|  0 | SELECT STATEMENT                                     |                               |         |       |       512 |   +157 |     1 |       2M |       |       |       |       |     . |     . |
|  1 |   FILTER                                             |                               |         |       |       512 |   +157 |     1 |       2M |       |       |       |       |     . |     . |
|  2 |    NESTED LOOPS OUTER                                |                               |       1 |    3M |       512 |   +157 |     1 |       2M |       |       |       |       |     . |     . |
|  3 |     NESTED LOOPS OUTER                               |                               |       1 |    3M |       512 |   +157 |     1 |       2M |       |       |       |       |     . |     . |
|  4 |      HASH JOIN OUTER                                 |                               |       1 |    3M |       538 |   +131 |     1 |       2M |  3387 |   2GB |  3387 |   2GB | 450MB |   2GB |
|  5 |       NESTED LOOPS OUTER                             |                               |       1 |    3M |        27 |   +131 |     1 |       2M |       |       |       |       |     . |     . |
|  6 |        STATISTICS COLLECTOR                          |                               |         |       |        27 |   +131 |     1 |       2M |       |       |       |       |     . |     . |
|  7 |         NESTED LOOPS OUTER                           |                               |       1 |    3M |        27 |   +131 |     1 |       2M |       |       |       |       |     . |     . |
|  8 |          HASH JOIN OUTER                             |                               |       1 |    3M |       155 |     +3 |     1 |       2M |  3035 |   1GB |  3035 |   1GB | 309MB |   1GB |
|  9 |           NESTED LOOPS OUTER                         |                               |       1 |    3M |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 10 |            STATISTICS COLLECTOR                      |                               |         |       |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 11 |             NESTED LOOPS OUTER                       |                               |       1 |    3M |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 12 |              NESTED LOOPS OUTER                      |                               |       1 |    3M |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 13 |               NESTED LOOPS                           |                               |    272K |    2M |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 14 |                NESTED LOOPS OUTER                    |                               |    272K |    2M |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 15 |                 NESTED LOOPS                         |                               |    272K |    2M |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 16 |                  NESTED LOOPS OUTER                  |                               |    272K |    1M |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 17 |                   NESTED LOOPS                       |                               |    272K |    1M |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 18 |                    FILTER                            |                               |         |       |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 19 |                     NESTED LOOPS OUTER               |                               |    272K |  598K |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 20 |                      VIEW                            | index$_join$_006              |    276K | 48299 |       129 |     +3 |     1 |       2M |       |       |       |       |     . |     . |
| 21 |                       HASH JOIN                      |                               |         |       |       129 |     +3 |     1 |       2M |       |       |       |       | 132MB |     . |
| 22 |                        HASH JOIN                     |                               |         |       |         3 |     +1 |     1 |       2M |       |       |       |       | 124MB |     . |
| 23 |                         INDEX STORAGE FAST FULL SCAN | TET_IX2                       |    276K |  8505 |         1 |     +1 |     1 |       2M |   129 |  54MB |       |       |     . |     . |
| 24 |                         INDEX STORAGE FAST FULL SCAN | TET_IX4                       |    276K | 13077 |         3 |     +1 |     1 |       2M |   167 |  81MB |       |       |     . |     . |
| 25 |                        INDEX STORAGE FAST FULL SCAN  | TET_PK                        |    276K | 11889 |       129 |     +3 |     1 |       2M |   198 |  61MB |       |       |     . |     . |
| 26 |                      TABLE ACCESS BY INDEX ROWID     | TT                            |       1 |     2 |       129 |     +3 |    2M |       2M |  1488 |  12MB |       |       |     . |     . |
| 27 |                       INDEX UNIQUE SCAN              | TT_PK                         |       1 |     1 |       129 |     +3 |    2M |       2M |     7 | 57344 |       |       |     . |     . |
| 28 |                    TABLE ACCESS BY INDEX ROWID       | TM                            |       1 |     2 |       129 |     +3 |    2M |       2M |  9875 |  77MB |       |       |     . |     . |
| 29 |                     INDEX UNIQUE SCAN                | TM_PK                         |       1 |     1 |       129 |     +3 |    2M |       2M |  1235 |  10MB |       |       |     . |     . |
| 30 |                   TABLE ACCESS BY INDEX ROWID        | TU                            |       1 |     1 |       119 |    +11 |    2M |    17764 |       |       |       |       |     . |     . |
| 31 |                    INDEX UNIQUE SCAN                 | TU_PK                         |       1 |       |       119 |    +11 |    2M |    17764 |       |       |       |       |     . |     . |
| 32 |                  TABLE ACCESS BY INDEX ROWID         | TEP                           |       1 |     2 |       129 |     +3 |    2M |       2M |  140K |   1GB |       |       |     . |     . |
| 33 |                   INDEX UNIQUE SCAN                  | TEP_PK                        |       1 |     1 |       129 |     +3 |    2M |       2M |  1478 |  12MB |       |       |     . |     . |
| 34 |                 TABLE ACCESS BY INDEX ROWID          | TLIM                          |       1 |     1 |       129 |     +3 |    2M |       2M |       |       |       |       |     . |     . |
| 35 |                  INDEX UNIQUE SCAN                   | TLIM_PK                       |       1 |       |       129 |     +3 |    2M |       2M |       |       |       |       |     . |     . |
| 36 |                TABLE ACCESS BY INDEX ROWID           | TLPSE                         |       1 |     1 |       129 |     +3 |    2M |       2M |       |       |       |       |     . |     . |
| 37 |                 INDEX UNIQUE SCAN                    | TLPSE_PK                      |       1 |       |       129 |     +3 |    2M |       2M |       |       |       |       |     . |     . |
| 38 |               INDEX RANGE SCAN                       | TCX_IX2                       |       1 |     2 |       129 |     +3 |    2M |       2M |  4642 |  36MB |       |       |     . |     . |
| 39 |              TABLE ACCESS BY INDEX ROWID             | TC                            |       1 |     2 |       129 |     +3 |    2M |       2M | 22307 | 174MB |       |       |     . |     . |
| 40 |               INDEX UNIQUE SCAN                      | TC_PK                         |       1 |     1 |       129 |     +3 |    2M |       2M |   546 |   4MB |       |       |     . |     . |
| 41 |            INDEX RANGE SCAN                          | TCX_PK                        |       1 |     2 |           |        |       |          |       |       |       |       |     . |     . |
| 42 |           INDEX RANGE SCAN                           | TCX_PK                        |       1 |     2 |         1 |   +131 |     1 |     976K |       |       |       |       |     . |     . |
| 43 |          TABLE ACCESS BY INDEX ROWID                 | TC                            |       1 |     2 |        27 |   +131 |    2M |       2M | 21549 | 168MB |       |       |     . |     . |
| 44 |           INDEX UNIQUE SCAN                          | TC_PK                         |       1 |     1 |        27 |   +131 |    2M |       2M |   959 |   7MB |       |       |     . |     . |
| 45 |        TABLE ACCESS BY INDEX ROWID BATCHED           | TP                            |       1 |     3 |           |        |       |          |       |       |       |       |     . |     . |
| 46 |         INDEX RANGE SCAN                             | TP_PK                         |      15 |     1 |           |        |       |          |       |       |       |       |     . |     . |
| 47 |       TABLE ACCESS BY INDEX ROWID BATCHED            | TP                            |       1 |     3 |        36 |   +157 |     1 |       15 |       |       |       |       |     . |     . |
| 48 |        INDEX RANGE SCAN                              | TP_PK                         |      15 |     1 |        36 |   +157 |     1 |       15 |       |       |       |       |     . |     . |
| 49 |      TABLE ACCESS STORAGE FULL FIRST ROWS            | TLIET                         |       1 |     3 |       512 |   +157 |    2M |       2M |       |       |       |       |     . |     . |
| 50 |     VIEW PUSHED PREDICATE                            | TEB_VW                        |       1 |    57 |       506 |   +163 |    2M |     1459 |       |       |       |       |     . |     . |
| 51 |      NESTED LOOPS OUTER                              |                               |       1 |    57 |       506 |   +163 |    2M |     1459 |       |       |       |       |     . |     . |
| 52 |       NESTED LOOPS                                   |                               |       1 |    55 |       506 |   +163 |    2M |     1459 |       |       |       |       |     . |     . |
| 53 |        NESTED LOOPS                                  |                               |       1 |    53 |       506 |   +163 |    2M |     1459 |       |       |       |       |     . |     . |
| 54 |         NESTED LOOPS                                 |                               |       1 |    51 |       506 |   +163 |    2M |     1459 |       |       |       |       |     . |     . |
| 55 |          NESTED LOOPS                                |                               |       5 |    41 |       510 |   +159 |    2M |     6965 |       |       |       |       |     . |     . |
| 56 |           NESTED LOOPS                               |                               |       1 |     7 |       510 |   +159 |    2M |     770K |       |       |       |       |     . |     . |
| 57 |            NESTED LOOPS                              |                               |       1 |     4 |       510 |   +159 |    2M |     770K |       |       |       |       |     . |     . |
| 58 |             NESTED LOOPS                             |                               |       1 |     3 |       510 |   +159 |    2M |     770K |       |       |       |       |     . |     . |
| 59 |              TABLE ACCESS BY INDEX ROWID             | TEP                           |       1 |     3 |       512 |   +157 |    2M |     770K |  661K |   5GB |       |       |     . |     . |
| 60 |               INDEX UNIQUE SCAN                      | TEP_PK                        |       1 |     2 |       512 |   +157 |    2M |       2M |  2934 |  23MB |       |       |     . |     . |
| 61 |              INDEX RANGE SCAN                        | TLP_IX1                       |       1 |       |       510 |   +159 |  770K |     770K |       |       |       |       |     . |     . |
| 62 |             VIEW                                     |                               |       1 |     1 |       510 |   +159 |  770K |     770K |       |       |       |       |     . |     . |
| 63 |              SORT AGGREGATE                          |                               |       1 |       |       510 |   +159 |  770K |     770K |       |       |       |       |     . |     . |
| 64 |               TABLE ACCESS BY INDEX ROWID            | TPR                           |       1 |     1 |       510 |   +159 |  770K |     770K |       |       |       |       |     . |     . |
| 65 |                INDEX UNIQUE SCAN                     | TPR_PK                        |       1 |       |       510 |   +159 |  770K |     770K |       |       |       |       |     . |     . |
| 66 |            TABLE ACCESS BY INDEX ROWID BATCHED       | TET                           |       1 |     3 |       511 |   +158 |  770K |     770K | 79759 | 623MB |       |       |     . |     . |
| 67 |             INDEX RANGE SCAN                         | TET_Ix1                       |       1 |     2 |       510 |   +159 |  770K |     899K | 15834 | 124MB |       |       |     . |     . |
| 68 |           TABLE ACCESS BY INDEX ROWID BATCHED        | TWE                           |       5 |    34 |       506 |   +163 |  770K |     6965 |  2080 |  16MB |       |       |     . |     . |
| 69 |            INDEX RANGE SCAN                          | TWE_IDX1                      |      35 |     2 |       506 |   +163 |  770K |     6965 |   118 | 944KB |       |       |     . |     . |
| 70 |          TABLE ACCESS BY INDEX ROWID                 | TT                            |       1 |     2 |       506 |   +163 |  6965 |     1459 |   208 |   2MB |       |       |     . |     . |
| 71 |           INDEX UNIQUE SCAN                          | TT_PK                         |       1 |     1 |       506 |   +163 |  6965 |     6965 |       |       |       |       |     . |     . |
| 72 |         INDEX RANGE SCAN                             | TCX_IX2                       |       1 |     2 |       506 |   +163 |  1459 |     1459 |  1388 |  11MB |       |       |     . |     . |
| 73 |        TABLE ACCESS BY INDEX ROWID                   | TC                            |       1 |     2 |       506 |   +163 |  1459 |     1459 |   936 |   7MB |       |       |     . |     . |
| 74 |         INDEX UNIQUE SCAN                            | TC_PK                         |       1 |     1 |       506 |   +163 |  1459 |     1459 |    75 | 600KB |       |       |     . |     . |
| 75 |       TABLE ACCESS BY INDEX ROWID BATCHED            | TLS                           |       1 |     2 |       506 |   +163 |  1459 |     1451 |     1 |  8192 |       |       |     . |     . |
| 76 |        INDEX SKIP SCAN                               | TLS_PK                        |       1 |     1 |       506 |   +163 |  1459 |     1451 |     1 |  8192 |       |       |     . |     . |
| 77 |    SORT AGGREGATE                                    |                               |       1 |       |       512 |   +157 |    2M |       2M |       |       |       |       |     . |     . |
| 78 |     FIRST ROW                                        |                               |       1 |     3 |       512 |   +157 |    2M |       2M |       |       |       |       |     . |     . |
| 79 |      INDEX RANGE SCAN (MIN/MAX)                      | TCX_IX2                       |       1 |     3 |       512 |   +157 |    2M |       2M |  9356 |  73MB |       |       |     . |     . |
=======================================================================================================================================================================================================

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 3 (U - Unused (1))
---------------------------------------------------------------------------
   0 -  STATEMENT
         U -  first_rows / hint overridden by another in parent query block
           -  first_rows
 
  56 -  SEL$5
           -  no_merge
 
Note
-----
   - this is an adaptive plan

If you want to pull these plans into separate windows and compare (nothing but) the Operations and Names line by line you’ll find that the only differences appear at operations 45, 47, 66, 68, and 75:

| 45 |        TABLE ACCESS BY INDEX ROWID                   | TP                            |
| 47 |       TABLE ACCESS BY INDEX ROWID                    | TP                            |
| 66 |            TABLE ACCESS BY INDEX ROWID               | TET                           |
| 68 |           TABLE ACCESS BY INDEX ROWID                | TWE                           |
| 75 |       TABLE ACCESS BY INDEX ROWID                    | TLS                           |

| 45 |        TABLE ACCESS BY INDEX ROWID BATCHED           | TP                            |
| 47 |       TABLE ACCESS BY INDEX ROWID BATCHED            | TP                            |
| 66 |            TABLE ACCESS BY INDEX ROWID BATCHED       | TET                           |
| 68 |           TABLE ACCESS BY INDEX ROWID BATCHED        | TWE                           |
| 75 |       TABLE ACCESS BY INDEX ROWID BATCHED            | TLS                           |

So what could possibly make one plan so much slower than the other?

There are all sorts of bits and pieces in these plans that you need to be able to spot “in passing” if you want to become fluent at understanding execution plans. It’s something that takes a lot of practice, but there’s one general tip (or warning, perhaps) that I can offer.

If you start out by looking for one particular thing you’ll miss lots of important clues; on a first pass through the plan just try to notice anything that looks a little interesting or informative, then go back for a more detailed examination on a second pass through the plan.

I won’t go through the entire pattern of thought that went through my mind as I started looking at these plans, but here are a couple of flags I raised:

  • Adaptive plans in SQL Monitor – we’re likely to see some “statistics collector” operations and that’s the “obvious” source of the anomaly, but reading plans that include their adaptive bits can be a mind-bending experience.
  • The Global Stats of the slow plan show 984K read requests (compared with 251K for the fast plan) – that might explain the difference in timing – keep an eye out for where the big numbers appear. (NB Don’t, at this point, go looking for them as that may lead you into missing the real issue.)
  • The slow plan shows the top operation with a Start Active of +157 while the fast plan has a Start Active of +2: that’s pretty consistent with a comment the user made (but I hadn’t mentioned) about the response time from the user’s perspective; and it tells us that there’s a blocking operation somewhere in the slow plan. That’s nice because that’s what we might expect from seeing an adaptive plan switching from a nested loop to a hash join. (So we already think we’re a clever bunny – too bad it wasn’t quite the right answer.)
  • There are two places in the plan which report Statistics Collector, both having a Nested Loop Outer then a Hash Join Outer immediately above them as the two candidate consumers for the rowsource supplied by the Statistics Collector’s child operation. (Note: in some cases you will see two Nested Loop operations above a Statistics Collector before you get to the Hash Join, but that’s just because Oracle may implement a nested loop join in two steps, first to the index, second to the table.)

First thoughts

Bearing in mind the warning about not digging “too deep too soon”, here are a few potentially significant observations. In the following notes I’ll simply refer to the plans as the “fast” plan and the “slow” plan.

  • The slow plan shows a large hash join with spill to disc at operation 4, and that’s one of the options triggered by the statistics collector at operation 6. It’s a little confusing that the nested loop join following it also reports execution statistics because only one of the two operations would have been doing anything, but we’ll assume for now that that’s a reporting error and postpone making a fuss about it. The same pattern appears at operations 8, 9 and 10.
  • A cross-check to the fast plan shows that the hash joins don’t do any work at the corresponding operations (although it also displays the oddity of operations that (we believe) didn’t actually run reporting a long Time Active).
  • The slow plan has an oddity at operation 41 – it’s an index range scan of a primary key (TCX_PK) that doesn’t report any execution statistics. Cross-checking to the fast plan we see that it’s operation 42 that doesn’t report any execution statistics (and it’s also an index range scan, using the same index!). We note that the fast plan “Starts” its range scan 2 million times, while the slow plan starts just once, starting at time +131 and having an active time of 1 second. [side-note: I am a little suspicious of that number – it looks to me as if it ought to be reporting 27 seconds]
  • Keep going – because just a bit further down we see that the slow plan has no stats for operations 45 and 46 (index range scan of TP_PK with table access of TP) while the fast plan has no stats for operations 47 and 48 (also an index range scan of TP_PK with table access to TP). Again we see the same pattern that the slow plan executes the operation just once while the fast plan executes its operations 2M times.
  • Keep going – the previous two observations are interesting and probably need further investigation, but they might not be critical. The very next line (operation 49) in both plans shows us a “TABLE ACCESS STORAGE FULL FIRST ROWS” that executes 2 million times – that’s got to hurt, surely, but let’s not get side-tracked just yet.
  • Operation 50 is a “VIEW PUSHED PREDICATE” – that almost certainly means it’s the second child of a nested loop join with a join predicate pushed into a non-mergeable view (and the view name is TEB_VW so it’s not a view created by an internal transformation) and the operation has, like so many other lines in the plan, started 2 million times.
  • Looking at the rest of the plan, there are no more statistics collectors and the plans have an exact match on operations. Unfortunately we don’t have a Predicate Information section, so we can’t tell whether matching operations were really doing the same thing (e.g. was an index range scan in one plan using more columns to probe the index than the corresponding index range scan in the other plan) however we can check times:
  • The View Pushed Predicate in the fast plan starts at time +24 [another slightly suspicious time given all the +2 starts in the locale] and is active for 256 seconds, while in the slow plan it starts at time +163 and is active for 506 seconds. So it looks as if a lot of the excess run time of the query is spent in this part of the plan – for no logical reason that we can see – although the access to table TEP by index TEP_PK does no physical reads in the fast plan and 661K reads in the slow plan.
  • Again we take a quick note, and move on. The final observation is that the last three lines of the plan look like the plan for a subquery block (executed very efficiently) of the “find the most recent / highest / lowest” type, and a quick check to the top of the plan shows that its parent is a FILTER operation, corroborating our first guess.

Starting to dig

Reviewing the first pass we can see that we lose a lot of “startup” time to the two hash joins where the build table in each case has to be completed before any further tables can be joined. This is in the order of 160 seconds, which is consistent with the OP’s observations, and it’s happening because adaptive plans are activated, triggering a change from nested loop joins to hash joins.

More significant, from the perspective of time, is that the nested loop join into the View Pushed Predicate is active for twice as long in the slow plan as it is in the fast plan – so that’s a place to look a little more closely, revealing that operation 59 is probably the reason for the difference: 661 thousand read requests in the slow plan but none in the fast plan.

Unfortunately we don’t have any Activity Stats (i.e. active session history data) in the report, but since the access to the table is reported as unique access by unique index in both cases we can be fairly sure that the difference isn’t due to a difference in the Predicate Information (that isn’t in the report).

Clearly we need to stop the adaptive plan from being taken to avoid the start-up delay (e.g. add a /*+ no_adaptive_plan */ hint to the query – see the sketch after the list below), but that still leaves two puzzles:

  1. why are the row estimates so bad (and at least part of the reason for that is that it turned out that the query was being optimized with optimizer_mode = first_rows – that’s the legacy first_rows, not a cost-based first_rows_N);
  2. how could the same sub-plan result in far more physical reads in one case compared to the other when the critical operation is a unique index access.
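
For reference, here’s a minimal sketch of the stop-gap mentioned above. The no_adaptive_plan hint is a real hint (available from 12c onwards) and disabling adaptive plans at the session level is the heavier-handed alternative; the query shape below is only a placeholder, not the OP’s actual statement:

-- hypothetical query shape; the hint is the only point of the sketch
select  /*+ no_adaptive_plan */
        col1, col2, sum(col3)
from    some_gtt
where   col1 = 101              -- the literal constant of the day
group by
        col1, col2
order by
        1, 2
;

-- or, more brutally, for the whole session (parameter exists from 12.2 onwards)
alter session set optimizer_adaptive_plans = false;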

The answer to the second question could be in an observation I first published 14 years ago – and it could indicate a generic threat to adaptive optimisation.

If you have an execution plan which, stripped to a minimum, looks like this:

Join Operation
        Table_X
        Table_Y

The order in which the result appears is likely to change depending on the join mechanism that Oracle chooses, viz Hash Join, Merge Join or Nested Loop Join.

Under “non-adaptive” conditions if you have a join that’s border-line between a hash join and a nested loop join it frequently means that the optimizer will flip-flop between two plans like the following (leading to the classic question: “nothing changed, why did the plan change?”):

Hash Join
        Table_X
        Table_Y

Nested Loop Join
        Table_Y
        Table_X

Note that the order in which the tables appear is reversed.

As it says in another article of mine: “all joins are nested loop joins, just with different startup costs”. In both the plans above Oracle picks a row from Table_Y and looks for a match in Table_X, so the order of the result set is dictated by Table_Y regardless of whether the join is a hash join or a nested loop join. However, if Oracle has decided to use an adaptive plan and starts with the nested loop (Y -> X) and decides to switch to a hash join it doesn’t swap the join order as the join mechanism is selected, so a result set whose order would have been dictated by Table_Y turns into the same result set (we hope) but in an order dictated by Table_X.
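
If you want to see the point for yourself, here’s a minimal sketch (the tables t_x and t_y and their join columns are hypothetical): both statements should return rows in an order dictated by t_y, even though the first is a nested loop driven by t_y and the second is a hash join that builds on t_x and probes t_y. It’s only if the build/probe (or outer/inner) roles reverse that the order of the result set changes.

-- nested loop, t_y drives, so the output order follows t_y
select  /*+ leading(y x) use_nl(x) */
        y.id, x.padding
from    t_y y, t_x x
where   x.id = y.x_id;

-- hash join, t_x is the build table and t_y the probe, so the output order still follows t_y
select  /*+ leading(x y) use_hash(y) no_swap_join_inputs(y) */
        y.id, x.padding
from    t_x x, t_y y
where   y.x_id = x.id;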

Consequences:

If you’re using very big tables and Oracle produces an adaptive nested loop join early in the plan this may result in a later nested loop being lucky and inducing lots of “self-caching” because its driving rowsource is in a nice order. If the plan adapts to a hash join the driving data set may appear in a completely different order that makes the later nested loop jump randomly around a very large table, inducing a lot of “self-flushing” as one table block is discarded from the buffer cache to make space for another. (I published an article several years ago about how a similar – though far from identical – type of anomaly could happen with Exadata and compression: an unlucky order of data access causing a massive extra workload.)

Conclusion and further thoughts

In this note I’ve tried to present my thoughts as I’ve read through an execution plan trying to understand what it’s doing and why it’s showing the performance characteristics it does.

In this case the interpretation was made harder because the plan was an adaptive plan – and there doesn’t appear to be an option in the procedure in dbms_sql_monitor to choose between hiding and revealing the adaptive parts [ed: this statement may apply only to the text option – see comment #1 for a counter-example using the ‘Active HTML” option]; moreover there was no Activity (ASH) information supplied and we didn’t have the Predicate Information.
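
For reference, this is roughly how you can pull the two flavours of report from SQL at the prompt – the function and parameter names are the documented ones for recent versions (dbms_sqltune.report_sql_monitor is the older equivalent), and the sql_id is just a placeholder to substitute:

set long 1000000 longchunksize 1000000 pagesize 0 linesize 255 trimspool on

select  dbms_sql_monitor.report_sql_monitor(
                sql_id       => '&m_sql_id',
                type         => 'TEXT',        -- plain text report
                report_level => 'ALL'
        )
from    dual;

select  dbms_sql_monitor.report_sql_monitor(
                sql_id       => '&m_sql_id',
                type         => 'ACTIVE',      -- "Active HTML" - spool it and open in a browser
                report_level => 'ALL'
        )
from    dual;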

The performance “issue” was that when adaptive plans were allowed (as opposed to reported only) we could see that two nested loops changed to hash joins. It was fairly clear that this explained the huge delay before results started to appear, but didn’t explain why the query took so much longer to complete.

We have a hypothesis that the extra run time of the query was due to “bad luck” because we can see very clearly that a nested loop into a non-mergeable view with pushed predicate reports a huge number of single block read requests; and we know that changing a join from a nested loop to a hash join without changing the order of the child operations will change the order in which the join’s rowsource is generated.

In this case the query was executing under the legacy first_rows optimizer mode, and it’s possible that if first_rows_N had been used the optimizer would have behaved differently, especially since we have a query that is returning 2M rows and we only want the first few rows.

Next Steps

The obvious “next step” in this investigation is to check whether first_rows_N co-operates nicely with adaptive optimisation. After all, the only significant thing that adaptive optimisation does to (serial) execution plans is set an inflexion point to dictate when a nested loop should change to a hash join – and a hash join is a blocking operation which is rarely a good thing for a first_rows_N plan.

So, does first_rows_N disable this adaptive plan analysis, does it move the inflection point significantly, or does the optimizer simply forget that hash joins are less desirable in first_rows_N optimisation? And if you’re running a system in first_rows_N mode should you disable adaptive plans by default, and only enable them for special cases?
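
If I (or you) do get round to that experiment, the session-level settings involved would be something like the following – both parameters are standard (optimizer_adaptive_plans appeared in 12.2; 12.1 bundled the behaviour into optimizer_adaptive_features) – but treat this as a test-harness sketch rather than a recommendation:

-- run the problem query under first_rows_N with adaptive plans enabled ...
alter session set optimizer_mode = first_rows_10;
alter session set optimizer_adaptive_plans = true;

-- ... then repeat with adaptive plans disabled and compare the plans,
-- paying particular attention to where (or whether) nested loops switch to hash joins
alter session set optimizer_adaptive_plans = false;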

I also have an urge to test a couple of ideas about why the two timing anomalies I mentioned have appeared, but it’s already taken me several hours to write notes (including a few replies to the list server) about the 30 minutes I’ve spent looking at an execution plan, so any further experiments will have to go on my to-do list.

December 8, 2020

Hash Joins

Filed under: Hash Join,Joins,Oracle — Jonathan Lewis @ 11:55 am GMT Dec 8,2020

This is a follow-up to a note that’s 10 years old, prompted by a thread on the Oracle Developer Community forum asking about the order of operation in an execution plan, and following up with a question about PGA memory use that I had left open in the original note.

The original note pointed out that the serial execution plan for a 4 table join that used hash joins for every join and had a leading() hint dictating a single join order could still have 8 distinct execution plans (which I then corrected to 4, because half of them were eliminated by an odd little inconsistency of the leading() hint).

The source of the number 8 was the fact that when you hint a hash join with /*+ use_hash(next_alias) */ the optimizer will consider a /*+ swap_join_inputs(next_alias) */ – in other words, it will choose which of the two inputs should be the “build” table and which the “probe” table, even though the human eye will probably assume that next_alias was clearly intended as the probe table. (The “little inconsistency” in the leading() hint is that it blocks the option for a swap for the first – and only the first – join unless it’s explicitly hinted.)

In the article I showed all 8 possible plans, and described how the two plans at the extreme ends of my list would operate at run-time. I’ve reproduced these two plans below, preceded by the hints that produced them:

leading(t1 t2 t3 t4)
use_hash(t2) no_swap_join_inputs(t2)
use_hash(t3) no_swap_join_inputs(t3)
use_hash(t4) no_swap_join_inputs(t4)
 
--------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost  |
--------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |     1 |    24 |    11 |
|   1 |  SORT AGGREGATE       |      |     1 |    24 |       |
|*  2 |   HASH JOIN           |      |    70 |  1680 |    11 |
|*  3 |    HASH JOIN          |      |    70 |  1260 |     8 |
|*  4 |     HASH JOIN         |      |    70 |   840 |     5 |
|   5 |      TABLE ACCESS FULL| T1   |    70 |   420 |     2 |
|   6 |      TABLE ACCESS FULL| T2   |    70 |   420 |     2 |
|   7 |     TABLE ACCESS FULL | T3   |    70 |   420 |     2 |
|   8 |    TABLE ACCESS FULL  | T4   |    70 |   420 |     2 |
--------------------------------------------------------------

leading(t1 t2 t3 t4)
use_hash(t2)    swap_join_inputs(t2)
use_hash(t3)    swap_join_inputs(t3)
use_hash(t4)    swap_join_inputs(t4)
 
--------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost  |
--------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |     1 |    24 |    11 |
|   1 |  SORT AGGREGATE       |      |     1 |    24 |       |
|*  2 |   HASH JOIN           |      |    70 |  1680 |    11 |
|   3 |    TABLE ACCESS FULL  | T4   |    70 |   420 |     2 |
|*  4 |    HASH JOIN          |      |    70 |  1260 |     8 |
|   5 |     TABLE ACCESS FULL | T3   |    70 |   420 |     2 |
|*  6 |     HASH JOIN         |      |    70 |   840 |     5 |
|   7 |      TABLE ACCESS FULL| T2   |    70 |   420 |     2 |
|   8 |      TABLE ACCESS FULL| T1   |    70 |   420 |     2 |
--------------------------------------------------------------

Side note: one of the comments on the original post raised the point that (e.g.) the second of the plans above could have been generated by the join order represented by the hint leading(t2 t1 t3 t4), and that’s absolutely correct. In fact, if you see the plan for a 4 table join consisting of nothing but 3 hash joins you cannot be certain what join order the optimizer was examining when it produced that plan unless you look at the CBO trace file (or check to see if it’s been explicitly and completely hinted in the code).
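
If you do want to see which join orders the optimizer actually examined, here’s a quick sketch of dumping the CBO (event 10053) trace for a single statement – the join predicates below are hypothetical, it’s the trace mechanism that matters:

alter session set tracefile_identifier = 'join_order_check';
alter session set events '10053 trace name context forever, level 1';

explain plan for
select  count(*)
from    t1, t2, t3, t4
where   t2.id = t1.id
and     t3.id = t2.id
and     t4.id = t3.id;

alter session set events '10053 trace name context off';

-- the trace file (see v$diag_info, "Default Trace File") will contain
-- "Join order[n]" lines listing each permutation the optimizer considered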

Having set the scene, we come to the question that prompted this note. The question related to the second plan above, and said:

“However, this one is the plan that takes the most memory in the PGA because it builds 3 hash tables before joining any table, right?”

The quick answer to the question is: “Not necessarily.”

In the original article I had pointed out that plans following the pattern of the first plan above with N tables and N – 1 joins would have at most two build tables in memory at any one moment while the bottom plan would create N – 1 build tables in memory before any join results could be created. This does rather invite the inference that the bottom plan has to be the one that is going to use most PGA memory, but what I had said in the original article was (emphasis added):

“Notice: the number of in-memory hash (build) tables we have in the first of the 8 plans at any instant after the first join starts is two and (no matter how many tables are involved in this pattern) the number of in-memory hash tables will always be two. The actual size of the two hash tables is a little unpredictable and, as a very crude guideline, you might expect the size to grow as more tables are joined into the result set.”

As a thought experiment, consider 3 small dimension tables and one big fact table. If Oracle were to create in-memory hash tables from the three dimension tables and then start scanning the fact table (following the pattern of the second plan above with t1 in the role of the fact table) probing each of the dimension tables in turn, it could deliver the first result row very quickly without requiring more memory to store intermediate results.

Conversely if Oracle were to create a tiny in-memory hash table from the first dimension and probe it with the fact table (following the pattern of the first plan above with t2 in the role of the fact table) Oracle would then have to build a very large in-memory hash table before probing it with the second dimension table, and as that second join takes place it would be generating a new result set that would become the next big in-memory hash table.

In this thought experiment we would probably find that the optimizer did the right thing without prompting and constructed three tiny in-memory hash tables – but it’s not always so clear-cut, and even the “right” decision can result in very large intermediate build tables (and if those intermediate build tables spill to disc in an Exadata environment the change in performance can be huge).

To finish off, here’s a script to turn the thought experiment into a concrete example (by careful, but deliberately silly, hinting).

rem
rem     Script:         c_treblehash_3.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem 

create table t1 
as
select
        rownum          id,
        to_char(rownum) small_vc,
        rpad('x',100)   padding
from
        all_objects
where
        rownum <= 70
;

alter table t1 
        add constraint t1_pk primary key(id)
;

create table t2
as
select
        rownum          id,
        to_char(rownum) small_vc,
        rpad('x',100)   padding
from
        all_objects
where
        rownum <= 70
;

alter table t2
        add constraint t2_pk primary key(id)
;

create table t3
as
select
        rownum          id,
        to_char(rownum) small_vc,
        rpad('x',100)   padding
from
        all_objects
where
        rownum <= 70
;

alter table t3
        add constraint t3_pk primary key(id)
;

create table t4
nologging
as
select
        t1.id                   id1,
        t2.id                   id2,
        t3.id                   id3,
        rpad(rownum,10)         small_vc,
        rpad('x',100)           padding
from
        t1, t2, t3
;

connect test_user/test
set linesize 156
set serveroutput off

select
        /*+ 
                leading(t4 t1 t2 t3)
                full(t4) 
                use_hash(t1) full(t1) swap_join_inputs(t1)
                use_hash(t2) full(t2) swap_join_inputs(t2)
                use_hash(t3) full(t3) swap_join_inputs(t3) 
        */
        count(t1.small_vc),
        count(t2.small_vc),
        count(t3.small_vc),
        count(t4.small_vc)
from
        t4,     
        t1,     
        t2,     
        t3
where
        t1.id = t4.id1
and     t2.id = t4.id2
and     t3.id = t4.id3
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

column pga_max_mem format 99,999,999

select pga_max_mem from v$process where addr = (
                select paddr from v$session where sid = (
                        select sid from V$mystat where rownum = 1
                )
        )
;

connect test_user/test
set linesize 156
set serveroutput off

select
        /*+ 
                leading(t4 t1 t2 t3)
                full(t4) 
                use_hash(t1) full(t1) no_swap_join_inputs(t1)
                use_hash(t2) full(t2) no_swap_join_inputs(t2)
                use_hash(t3) full(t3) no_swap_join_inputs(t3) 
        */
        count(t1.small_vc),
        count(t2.small_vc),
        count(t3.small_vc),
        count(t4.small_vc)
from
        t4,     
        t1,     
        t2,     
        t3
where
        t1.id = t4.id1
and     t2.id = t4.id2
and     t3.id = t4.id3
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

column pga_max_mem format 99,999,999

select pga_max_mem from v$process where addr = (
                select paddr from v$session where sid = (
                        select sid from V$mystat where rownum = 1
                )
        )
;

All I’ve done is create three small “dimension” tables of 70 rows each, then create a table which is their Cartesian join, producing a “fact” table of 343,000 rows. Then I’ve written a simple query to join the three dimension tables to the fact table.

I’ve used swap_join_inputs() for all the joins in one version of the query, and no_swap_join_inputs() in the other version; and I’ve reconnected to the database before running each query to make it easier to see the different impact on the PGA of the two plans.

I’ve used dbms_xplan.display_cursor() to pull the execution plans from memory after running the queries, and since I’ve not set statistics_level to all, or added the hint /*+ gather_plan_statistics */ to the queries, the only rowsource execution statistics I’ll get are the PGA usage: estimates and actuals.
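
For comparison, if you did want the full rowsource execution statistics (A-Rows, A-Time and so on) in the plans, the usual options are one of the following – I didn’t need them for this demonstration because the memory figures were the only thing of interest:

-- either: capture rowsource execution stats for everything in the session
alter session set statistics_level = all;

-- or: capture them for just one statement with a hint
select  /*+ gather_plan_statistics */
        count(t1.small_vc), count(t2.small_vc),
        count(t3.small_vc), count(t4.small_vc)
from    t4, t1, t2, t3
where   t1.id = t4.id1
and     t2.id = t4.id2
and     t3.id = t4.id3
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));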

The final step of each test reports the maximum PGA memory usage the session reached in the course of the test.

Here are the two plans, each followed by the PGA maximum memory size.

--------------------------------------------------------------------------
| Id  | Operation             | Name | E-Rows |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |        |       |       |          |
|   1 |  SORT AGGREGATE       |      |      1 |       |       |          |
|*  2 |   HASH JOIN           |      |    343K|  2171K|  2171K| 1684K (0)|
|   3 |    TABLE ACCESS FULL  | T3   |     70 |       |       |          |
|*  4 |    HASH JOIN          |      |    343K|  2171K|  2171K| 1681K (0)|
|   5 |     TABLE ACCESS FULL | T2   |     70 |       |       |          |
|*  6 |     HASH JOIN         |      |    343K|  2171K|  2171K| 1711K (0)|
|   7 |      TABLE ACCESS FULL| T1   |     70 |       |       |          |
|   8 |      TABLE ACCESS FULL| T4   |    343K|       |       |          |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T3"."ID"="T4"."ID3")
   4 - access("T2"."ID"="T4"."ID2")
   6 - access("T1"."ID"="T4"."ID1")

PGA_MAX_MEM
-----------
 13,859,925

--------------------------------------------------------------------------
| Id  | Operation             | Name | E-Rows |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |        |       |       |          |
|   1 |  SORT AGGREGATE       |      |      1 |       |       |          |
|*  2 |   HASH JOIN           |      |    343K|    23M|  3727K|   29M (0)|
|*  3 |    HASH JOIN          |      |    343K|    23M|  3727K|   29M (0)|
|*  4 |     HASH JOIN         |      |    343K|    23M|  3667K|   29M (0)|
|   5 |      TABLE ACCESS FULL| T4   |    343K|       |       |          |
|   6 |      TABLE ACCESS FULL| T1   |     70 |       |       |          |
|   7 |     TABLE ACCESS FULL | T2   |     70 |       |       |          |
|   8 |    TABLE ACCESS FULL  | T3   |     70 |       |       |          |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T3"."ID"="T4"."ID3")
   3 - access("T2"."ID"="T4"."ID2")
   4 - access("T1"."ID"="T4"."ID1")

PGA_MAX_MEM
-----------
 52,984,917

As you can see, each of the hash joins in the first plan required roughly 1.7MB of memory. All three hash tables would have been in existence at the same time, giving a total of about 5.1MB of memory for the query. The session’s maximum PGA usage shows up as 13MB, of which 5MB was my basic “startup” PGA, leaving 3MB “unaccounted”.

In comparison, each of the hash joins in the second plan required roughly 29MB, although only two of the hash tables would have been in existence at any one moment. That’s still an allocation of 58MB for the same basic 4 table join. In fact things aren’t quite as bad as they seem in this case since the maximum PGA allocated was only about 52MB (again with 5MB of startup PGA). The apparent contradiction may be due to the way that Oracle allocates PGA in increasing chunks – the 29MB reported may have been the result of the session doing something like: “I’ve reached 23MB, my next allocation will be 4MB – oops, I only needed another 128KB.”

As a final check of activity, I’ve also run a couple of queries against V$sql_workarea – a convenient memory structure you can examine to get some “recent history” of queries that have been using large amounts of memory, or spilling to disc. In this case I’ve queried the structure by sql_id for the two queries, reporting just a little detail about the last execution and memory usage.

SQL> select operation_id, operation_type, last_memory_used, last_execution  from V$sql_workarea where sql_id = 'dcc01q28gcbmy';

OPERATION_ID OPERATION_TYPE                   LAST_MEMORY_USED LAST_EXECUTION
------------ -------------------------------- ---------------- ----------------------------------------
           2 HASH-JOIN                                 1724416 OPTIMAL
           4 HASH-JOIN                                 1721344 OPTIMAL
           6 HASH-JOIN                                 1752064 OPTIMAL


SQL> select operation_id, operation_type, last_memory_used, last_execution  from V$sql_workarea where sql_id = 'b52uwjz07fwhk';

OPERATION_ID OPERATION_TYPE                   LAST_MEMORY_USED LAST_EXECUTION
------------ -------------------------------- ---------------- ----------------------------------------
           2 HASH-JOIN                                30930944 OPTIMAL
           3 HASH-JOIN                                30945280 OPTIMAL
           4 HASH-JOIN                                30650368 OPTIMAL

As you can see, the view reports each plan operation (with id and type) that needed memory for an SQL workarea – and the numbers confirm the 1.7MB and 29MB reported by the execution plans. (Which is not surprising since it’s exactly these figures that are used to supply the details in the plans.)

You need to be a little cautious with this view in older versions of Oracle – it used to do a “full tablescan” of the library cache, which resulted in a lot of library cache latch activity and could cause some contention if you ran it frequently, but it’s a very useful view for finding queries that are doing unexpectedly large sorts or hash joins, and one you might query occasionally if you see any space management threats in the temporary tablespace.
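
As a sketch of the sort of occasional check I mean – the columns are the standard ones in V$sql_workarea, but test it quietly on your own version before adding it to any monitoring:

select  sql_id, child_number, operation_id, operation_type,
        last_memory_used, last_execution, last_tempseg_size
from    V$sql_workarea
where   last_execution != 'OPTIMAL'     -- one-pass or multi-pass, i.e. spilled to temp
order by
        last_tempseg_size desc nulls last
;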

November 12, 2020

rowsets

Filed under: Execution plans,Infrastructure,Oracle,Performance — Jonathan Lewis @ 12:35 pm GMT Nov 12,2020

Here’s a little demonstration of the benefit of rowsets. It started with a very simple question that arrived in my in-box from Kaley Crum.

  • “Why does the query with the UNION ALL take so much longer than the query without the UNION ALL?”

Here are the two queries – surely they should take virtually the same amount of time.

select count(*)
from (select /*+ no_merge */
             object_id
      from drop_me);


select count(*)
from (select /*+ no_merge */
             object_id
      from drop_me
      --
      union all
      --
      select 1 from dual);

Look closely – the difference between the two queries is just the addition through a UNION ALL of “select 1 from dual”. The first query took about 3.5 seconds to complete (there were 70M rows in the table), and the second took nearly 69 seconds.

Here are the execution plans showing the rowsource execution statistics (and Query Block // Object Alias information) – first the baseline query:

SQL_ID  0ph1vfuuxkbqb, child number 0
-------------------------------------
select count(*) from (select /*+ no_merge */              object_id    
  from drop_me)
 
Plan hash value: 3609429292
 
--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name    | Starts | E-Rows | Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |      1 |        | 26465 (100)|          |      1 |00:00:03.34 |   95896 |  95891 |
|   1 |  SORT AGGREGATE     |         |      1 |      1 |            |          |      1 |00:00:03.34 |   95896 |  95891 |
|   2 |   VIEW              |         |      1 |     70M| 26465   (2)| 00:00:02 |     70M|00:00:03.32 |   95896 |  95891 |
|   3 |    TABLE ACCESS FULL| DROP_ME |      1 |     70M| 26465   (2)| 00:00:02 |     70M|00:00:03.27 |   95896 |  95891 |
--------------------------------------------------------------------------------------------------------------------------
 
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   2 - SEL$2 / from$_subquery$_001@SEL$1
   3 - SEL$2 / DROP_ME@SEL$2

And then the UNION ALL query:

SQL_ID  0chdajr28y0ub, child number 0
-------------------------------------
select count(*) from (select /*+ no_merge */              object_id    
  from drop_me       --       union all       --       select 1 from
dual)
 
Plan hash value: 3408528233
 
---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name    | Starts | E-Rows | Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |      1 |        | 26467 (100)|          |      1 |00:01:18.58 |   95896 |  95891 |
|   1 |  SORT AGGREGATE      |         |      1 |      1 |            |          |      1 |00:01:18.58 |   95896 |  95891 |
|   2 |   VIEW               |         |      1 |     70M| 26467   (2)| 00:00:02 |     70M|00:01:10.84 |   95896 |  95891 |
|   3 |    UNION-ALL         |         |      1 |        |            |          |     70M|00:00:53.13 |   95896 |  95891 |
|   4 |     TABLE ACCESS FULL| DROP_ME |      1 |     70M| 26465   (2)| 00:00:02 |     70M|00:00:19.28 |   95896 |  95891 |
|   5 |     FAST DUAL        |         |      1 |      1 |     2   (0)| 00:00:01 |      1 |00:00:00.01 |       0 |      0 |
---------------------------------------------------------------------------------------------------------------------------
 
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   2 - SET$1 / from$_subquery$_001@SEL$1
   3 - SET$1
   4 - SEL$2 / DROP_ME@SEL$2
   5 - SEL$3 / DUAL@SEL$3

This is not a trick question – I’ve got the SQL to create the data set and run the test; and I’ve run the test through 19.3.0.0 and 12.2.0.1 with results very similar to the above. Here’s a slightly modified version of the script – you’ll notice the column name has changed because I’ve used my usual “large data set” generator rather than scaling up a clone of all_objects:

rem
rem     Script:         rowset_kaley.sql
rem     Author:         Kaley Crum / JP Lewis
rem     Dated:          Oct 2020
rem
rem     Last tested 
rem             19.3.0.0
rem

create table drop_me
pctfree 0
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum  n1
from
        generator
cross join
        generator
where rownum <= 7e7
;

alter session set statistics_level = all;

set timing on
set serveroutput off

prompt  ===========================================
prompt  Baseline: Simple query without a union all.
prompt  Runs in 3-4 seconds
prompt  ===========================================

select
         count(*)
from (select /*+ no_merge */
             n1
      from drop_me);

select *
from table(dbms_xplan.display_cursor(format => 'allstats last advanced'));

prompt  ==========================================
prompt  Add a UNION ALL for one record from dual.
Prompt  Runs in over a minute
prompt  ==========================================

pause Press return

select
         count(*)
from (select /*+ no_merge */
             n1
      from drop_me
      --
      union all
      --
      select 1 from dual);

select *
from table(dbms_xplan.display_cursor(format => 'allstats last advanced'));

There are two factors involved in the massive variation in performance. The first factor is one that the test case will report (though I didn’t show it in the output above); the second is that the test case has enabled rowsource execution statistics.

Here’s the big clue – from the Column Projection Information, which is one of the things that appears with the “advanced” format option (or when you add the “projection” format option) in the call to dbms_xplan.display_xxx()

First for the fast query:

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=0) COUNT(*)[22]
   2 - (rowset=1019)
   3 - (rowset=1019)

The rowsource passes 1,019 rows at a time from the tablescan operation to the view operation and from the view operation to the sort operation, for a total of roughly 70,000 calls for each of those two steps of the plan.

Compare this to the projection information for the slow UNION ALL query:

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=0) COUNT(*)[22]

I haven’t missed a bit in copying – this really is it. The rows move up the stack one at a time, not in a rowset array. That’s 70,000,000 subroutine calls for each of the two steps of the plan.

I’m sure most of us have heard the mantra “row by row is slow by slow” (or some variant on the theme). This is true all the way down to the internal levels of an execution plan.

Of course 70M calls vs. 70K calls shouldn’t really explain a difference of more than 60 seconds – but when the calls to the O/S for timing information for rowsource execution stats are repeated for each call (or even if it’s only a 1% sample of the calls) down the stack that’s where a lot of time can disappear.

On my laptop, running Oracle 19.3 in a VM, this is how my timing went:

  • Fast query: 1.94 seconds, dropping to 1.79 seconds when I disabled rowsource execution stats.
  • Slow query: 65.91 seconds, dropping to a far less astounding 3.26 seconds when I disabled rowsource execution stats.

So there really is a noticeable difference between row-by-row and array-processing but most of the difference in the original test came from using rowsource execution statistics to measure how much of a difference there would be.

It’s also worth mentioning that this is probably the most extreme case you could produce to show the difference – using the largest possible rowset size with the smallest possible rows when you want to maximise the gap between (internal) array processing and single row processing – and then processing a very large number of rows. [But isn’t handling ridiculously large numbers of rows what you do with Exadata?]

If you want further evidence that the difference is due to the rowset size you can always alter session set “_rowsets_enabled”=false; and watch the fast query slow down. It will take about half the time of the slow query as it only has to pass 70M rows up one step of the plan rather than the two steps that are in the UNION ALL plan.

In theory it looks as if you could also restrict the size of the rowset by setting _rowsets_max_rows or _rowsets_target_maxsize, but since the default value for the former is 256 the results above suggest that the settings might be ignored, and when I tried adjusting them at the session level nothing changed.
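
If you want to check what the (hidden) rowset parameters are actually set to you have to connect as SYS and query the x$ structures; the following is a sketch from memory, so treat the column names as unofficial and version-dependent:

select  p.ksppinm        parameter_name,
        v.ksppstvl       current_value,
        v.ksppstdf       is_default,
        p.ksppdesc       description
from    x$ksppi  p,
        x$ksppcv v
where   v.indx = p.indx
and     p.ksppinm like '\_rowset%' escape '\'
order by
        p.ksppinm
;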

Another test you could run is to adjust (alter session) the parameter “_rowsource_statistics_sampfreq”. When I set this to 100 the time for the slow query dropped to about 21 seconds (and the fast query – with rowsets disabled – dropped to about 11 seconds).

Footnote

It’s always easy to hit an accidental special case without realising it, so when you’re testing something it’s important to think about what’s special in the example. I came up with two possibilities in this case – dual is always going to be a very special case in any circumstances, and I’ve got a table in my query that is guaranteed to return exactly one row. So I repeated the test with:

  • a real table with one row and primary key in case the reference to dual was disabling rowsets
  • a real table with 100 rows in case this enabled a rowset size of 100

In neither case did Oracle start using array processing.

Footnote 2

One of the patterns I see occasionally follows the form of the slow query above. Some front-end tool executes a select statement that starts with a select {list of desired column headings} from dual so that the front-end code doesn’t have to be programmed to create headings as it receives the data.

Although the impact this would have on rowsets looks like a large percentage of the work done in the database in this case, in more realistic cases it would probably be irrelevant, until someone starts testing with the statistics_level set to all, or _rowsource_execution_statistics explicitly enabled.

There is a related effect, though, from the SQL Monitor feature (which activates automatically for a statement if it’s executing in parallel, or if it’s predicted to take more than 5 seconds to complete). When I added the /*+ monitor */ hint to the two queries (and didn’t fiddle with the rowsets parameter) the fast query averaged 2.26 seconds instead of 1.79 seconds, and the slow query averaged 3.51 seconds instead of 3.26 seconds. I have to say, though, that the variation between executions of the same statement was quite significant compared with the variation produced by adding or eliminating the hint.

October 10, 2020

Interval Oddity

Filed under: CBO,Execution plans,Oracle,Partitioning — Jonathan Lewis @ 2:51 pm BST Oct 10,2020

Interval partitioning is a popular strategy for partitioning date-based data. It’s an enhanced variant of range partitioning that allows you to define a starting partition and an interval that should be used to derive the high values for all subsequent partitions – and Oracle doesn’t even have to create intervening partitions if you insert data that goes far beyond the current partition: it automatically creates exactly the right partition (with the correct high_value and correctly inferred lower boundary) for the incoming data and behaves as if the intervening partitions will become available when they’re needed at some later point in time. So there’s no need for DBAs to work out actual partition high_values, no need to ensure that all the partitions you need out into the future have been pre-created, and no more batch processes crashing with Oracle error ORA-14400: inserted partition key does not map to any partition.

But there’s a surprising difference between traditional range partitioning and the newer interval partitioning that will increase CPU usage in some cases and may (though I haven’t yet investigated this in sufficient detail to create an example) lead to variations in execution plans.

To demonstrate the difference I’m going to create two tables with the same structure and content then run the same query against them and show you the resulting execution plans. Here’s the code to create and populate the tables:

rem
rem     Script:         interval_or_range.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem
rem     Last tested 
rem             19.3.0.0
rem


create table t_interval(
        order_date      date not null,
        order_id        number(10,0) not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range (order_date) 
interval (numtoyminterval(1,'MONTH'))
subpartition by hash (order_id) subpartitions 4
        (
                partition start_p1 values less than (to_date('01-Jan-2020','dd-mon-yyyy'))
        )
;

create table t_range(
        order_date      date not null,
        order_id        number(10,0) not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range (order_date) 
subpartition by hash (order_id) subpartitions 4
        (
                partition start_p1 values less than (to_date('01-Jan-2020','dd-mon-yyyy')),
                partition start_p2 values less than (to_date('01-Feb-2020','dd-mon-yyyy')),
                partition start_p3 values less than (to_date('01-Mar-2020','dd-mon-yyyy')),
                partition start_p4 values less than (to_date('01-Apr-2020','dd-mon-yyyy')),
                partition start_p5 values less than (to_date('01-May-2020','dd-mon-yyyy')),
                partition start_p6 values less than (to_date('01-Jun-2020','dd-mon-yyyy')),
                partition start_p7 values less than (to_date('01-Jul-2020','dd-mon-yyyy')),
                partition start_p8 values less than (to_date('01-Aug-2020','dd-mon-yyyy')),
                partition start_p9 values less than (to_date('01-Sep-2020','dd-mon-yyyy'))
        )
;

insert into t_range select
        to_date('01-Jan-2020','dd-mon-yyyy') + rownum,
        rownum,
        lpad(rownum,10,'0'),
        lpad('x',100,'x')
from
        all_objects
where
        rownum <= 240 -- > comment to avoid wordpress format issue  
;

insert into t_interval select
        to_date('01-Jan-2020','dd-mon-yyyy') + rownum,
        rownum,
        lpad(rownum,10,'0'),
        lpad('x',100,'x')
from
        all_objects
where
        rownum <= 240 -- > comment to avoid wordpress format issue
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T_RANGE',
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T_INTERVAL',
                method_opt  => 'for all columns size 1'
        );
end;
/

I’ve used composite partitioned tables in this example but the same anomaly appears with simple partitioning – which you can test by commenting out the “subpartition by ..” lines (there’s a sketch of the simple version below). The two tables have partitions defined to hold a month’s data. I’ve inserted a few rows into most of the partitions, and ensured that I haven’t accidentally attempted to insert data that falls outside the legal range of the table with the predefined partitions.
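
For completeness, and to save you editing the script, this is roughly what the simple (non-composite) interval-partitioned table would look like once the subpartitioning clause has gone – the table name is changed so that it can co-exist with the original:

create table t_interval_simple(
        order_date      date not null,
        order_id        number(10,0) not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range (order_date)
interval (numtoyminterval(1,'MONTH'))
        (
                partition start_p1 values less than (to_date('01-Jan-2020','dd-mon-yyyy'))
        )
;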

To show that the tables are nominally identical here’s the list of partitions with their high values and number of rows:

column table_name       format a15
column partition_name   format a15
column high_value       format a80

break on table_name skip 1

select
        table_name, partition_name, num_rows, high_value
from
        user_tab_partitions
where
        table_name in ('T_INTERVAL','T_RANGE')
order by
        table_name, partition_name
;


TABLE_NAME     PARTITION_NAME    NUM_ROWS HIGH_VALUE
--------------- --------------- ---------- --------------------------------------------------------------------------------
T_INTERVAL      START_P1                 0 TO_DATE(' 2020-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10722              30 TO_DATE(' 2020-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10723              29 TO_DATE(' 2020-03-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10724              31 TO_DATE(' 2020-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10725              30 TO_DATE(' 2020-05-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10726              31 TO_DATE(' 2020-06-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10727              30 TO_DATE(' 2020-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10728              31 TO_DATE(' 2020-08-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10729              28 TO_DATE(' 2020-09-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA

T_RANGE         START_P1                 0 TO_DATE(' 2020-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P2                30 TO_DATE(' 2020-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P3                29 TO_DATE(' 2020-03-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P4                31 TO_DATE(' 2020-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P5                30 TO_DATE(' 2020-05-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P6                31 TO_DATE(' 2020-06-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P7                30 TO_DATE(' 2020-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P8                31 TO_DATE(' 2020-08-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P9                28 TO_DATE(' 2020-09-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA


18 rows selected.

With this setup we can now run a simple query against the two tables using a where clause that is an exact match for the partition definition for the July data:

set serveroutput off

select 
        count(*) 
from 
        t_range 
where   order_date >= date '2020-07-01'  -- to_date('01-Jul-2020','dd-mon-yyyy')
and     order_date <  date '2020-08-01'  -- to_date('01-Aug-2020','dd-mon-yyyy')
/

select * from table(dbms_xplan.display_cursor);

select 
        count(*) 
from 
        t_interval 
where   order_date >= date '2020-07-01'  -- to_date('01-Jul-2020','dd-mon-yyyy')
and     order_date <  date '2020-08-01'  -- to_date('01-Aug-2020','dd-mon-yyyy')
/

select * from table(dbms_xplan.display_cursor);

I’ve left in two versions of the date predicates – the “ANSI” date literal style, and a strictly formatted “to_date()” style that is still the approach used most commonly in Oracle systems. The effect on the execution plans is the same for both variations of the date declaration, and here are the two execution plans – reported from an instance of 19.3.0.0:

---------------------------------------------------------------------------------------------------
| Id  | Operation               | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |         |       |       |   130 (100)|          |       |       |
|   1 |  SORT AGGREGATE         |         |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE|         |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
|   3 |    TABLE ACCESS FULL    | T_RANGE |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
---------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |            |       |       |   130 (100)|          |       |       |
|   1 |  SORT AGGREGATE         |            |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE|            |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
|*  3 |    TABLE ACCESS FULL    | T_INTERVAL |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(("ORDER_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DATE"<TO_DATE(' 2020-08-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))

It’s quite possible that you won’t notice the difference between these plans, even though I’ve put them directly one below the other; and it’s very likely that most people wouldn’t think about the difference if they didn’t have the two plans available at the same time to compare.

If you can’t spot the difference it’s because I’ve deliberately arranged them in a way that might fool you into not looking carefully enough.

If you’ve spotted the difference it’s probably because you’ve examined the Predicate Information section. There isn’t one for the t_range example but there is one for the t_interval example – and it was a little sneaky of me to make it easy for you to assume that the one set of predicates I showed was common to the two plans – even though the first plan doesn’t have the asterisk by operation 3 that the second plan has.

Since the plans have been pulled from memory (v$sql_plan) it seems likely that they are truthful and the query based on the interval partitioning is actually checking every row in the partition against two date values. In my example that’s not going to make much difference to performance but in a data warehouse with many millions of rows per partition there’s scope for a noticeable increase in CPU between the two queries – especially since the block reads are likely to be direct path reads, bypassing much of the usual (CPU-intensive) buffer cache activity.

Interestingly when I checked the 10053 trace files for the two queries the CPU cost for the pair was identical. Normally you expect to see at least a little CPU cost (even if it’s only tens of units out of millions) for checking a predicate. This raises the question: does the optimizer include a cost for the t_range table for predicate checking that isn’t going to happen, or does it fail to cost the predicate checking that is going to happen for the t_interval table? [Edit: easy enough to check by doing the test on a simple heap table that clones the data from that one partition, and checking the calculated cost with and without the predicate.]
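Here’s a sketch of the sort of check I mean – t_heap is a hypothetical table name, the July data comes from partition start_p8 of the t_range table, and the fine detail of the CPU costing would really have to come from the 10053 trace rather than the plan output:

create table t_heap as
select * from t_range partition (start_p8);

execute dbms_stats.gather_table_stats(null, 't_heap', method_opt => 'for all columns size 1')

explain plan for
select count(*) from t_heap;

select * from table(dbms_xplan.display);

explain plan for
select  count(*)
from    t_heap
where   order_date >= date '2020-07-01'
and     order_date <  date '2020-08-01';

select * from table(dbms_xplan.display);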

There’s a further ramification to this anomaly, relating to the question on the Oracle-L list server that prompted the investigation. What happens on Exadata where suitable predicates can be offloaded to storage?

The original question wasn’t actually about the appearance (or not) of the predicates, it was about an unexpected cardinality estimate for a query involving two equi-partitioned tables, so this note has gone completely off-topic from the question; but another strange detail about the predicates showed up when I suggested the creation of a column group on the join columns. A (redundant) predicate that had been offloaded to storage stopped being offloaded; here, from the original posting with the original tables, are the two different Predicate Information sections that appeared – the first without the column group, the second after the column group had been created and its stats collected:

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("ORDER_DETL"."ORDR_RQST_NUMBER"="ORDER_REQST"."ORDR_RQST_NUMBER" AND
              "ORDER_DETL"."ORDR_RQST_DATE"="ORDER_REQST"."ORDR_RQST_DATE")
   6 - storage("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   8 - storage("ORDER_DETL"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DETL"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("ORDER_DETL"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DETL"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("ORDER_DETL"."ORDR_RQST_NUMBER"="ORDER_REQST"."ORDR_RQST_NUMBER" AND
              "ORDER_DETL"."ORDR_RQST_DATE"="ORDER_REQST"."ORDR_RQST_DATE")
   6 - storage("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<TO_DATE(' 2020-08-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<TO_DATE(' 2020-08-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   8 - filter("ORDER_DETL"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DETL"."ORDR_RQST_DATE"<TO_DATE(' 2020-08-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Notice how the storage() predicate that appears at operation 8 in the first set has disappeared from the second even though operation 6 manages to retain its storage() predicate throughout.
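For reference, the column group I suggested would have been created with a call to dbms_stats.create_extended_stats – the following is only a sketch using the column names visible in the predicates above, and the same thing would be done for the order_reqst table:

declare
        l_name  varchar2(128);
begin
        -- create a column group (extended stats) on the join columns
        l_name := dbms_stats.create_extended_stats(
                ownname   => user,
                tabname   => 'ORDER_DETL',
                extension => '(ordr_rqst_number, ordr_rqst_date)'
        );

        -- gather stats so the new column group gets its own statistics
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'ORDER_DETL',
                method_opt  => 'for all columns size 1'
        );
end;
/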

I’ve modelled a further example of odd behaviour using two pairs of tables – one pair using range/hash partitioning, the other interval/hash partitioning. I won’t reproduce it here but the problem of redundant predicates appearing and then disappearing gets even stranger.

I haven’t yet produced an example where the unexpected predicate behaviour has affected the costs or cardinalities – but I’ve only spent a couple of hours playing around with well-formed examples: it’s possible that in badly formed examples (e.g. with statistical inconsistencies) the side effect could mean that two notionally identical queries produce different numbers and different plans because they end up with different predicates in the final transformed query.

Footnote

Following an email from David Kurtz, it occurred to me that I should have made it clear that the disappearance of predicates on the partition key is expected behaviour when the predicates are clearly synchronised with the partition boundaries. This is why it’s a good idea to avoid the between operator when dealing with partitions (unless it’s needed to produce the correct result, of course) – the content of a partition is “greater than or equal to” the high_value of the previous partition and “strictly less than” its own high_value, whereas between expands to “greater than or equal to … and less than or equal to”.
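To put that in terms of the demonstration table, here’s a sketch of the two ways of asking for the July data – the closed range that between produces, and the half-open range that matches the partition definition exactly:

-- "between" produces a closed range - both end points are included
select  count(*)
from    t_range
where   order_date between date '2020-07-01' and date '2020-07-31'
;

-- the half-open range lines up exactly with the partition boundaries
select  count(*)
from    t_range
where   order_date >= date '2020-07-01'
and     order_date <  date '2020-08-01'
;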

The predicate behaviour for the interval partitioning is the oddity, the behaviour for the “normal” range partitioning is expected.

Update Oct 2023

Testing on 19.11.0.0 and 23.2.0.0 – the anomaly (i.e. the redundant predicate) is still present in 19.11, but has disappeared by 23.2. This may be due to fix control 34275901, introduced in 23c with the description: “enable predicate elimination for interval-partitioned tables”.
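If you want to check whether your version has that fix control available, a query like the following against v$system_fix_control should tell you – just a sketch, and the fix control simply won’t appear in older versions:

select  bugno, value, optimizer_feature_enable, description
from    v$system_fix_control
where   bugno = 34275901
;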

Any tests and reports of more recent versions of 19c (and 21c) would be welcomed in the comments. And the results of a test on 23c running on Exadata would be interesting – which predicates are eliminated, and which predicates are sent to storage?

October 1, 2020

Index FFS Cost

Filed under: Bugs,CBO,Execution plans,Indexing,Oracle — Jonathan Lewis @ 11:46 am BST Oct 1,2020

There are a number of unexpected issues with the optimizer’s treatment of the index fast full scan, the access path where Oracle ignores the structure of the B-tree and uses multiblock reads to do a brute-force segment scan as if the index were a “skinny table” with a few blocks of irrelevant garbage (i.e. the branch blocks) that could be ignored.

A recent comment made a few days ago on a blog note about the optimizer’s “index-join” access path reminded me that I had a few notes to finish and publish that might help some people address performance issues relating to a couple of little-known fixes for optimizer oddities in this area. Today’s “fix” is one that appeared in 10g but never became the default optimizer behaviour (possibly because it’s one of those “not a bug” types of bug with no “always correct” strategy).

The problem revolves around the optimizer_index_cost_adj parameter (which is another reason why the fix might have been kept out of the standard code base). For index-only access paths the parameter will apply to the index range scan, full scan, or skip scan, but not to the index fast full scan. Here’s a model to demonstrate this:


rem
rem     Script:         oica_iffs.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2012
rem
rem     Last tested
rem             23.3.0.0
rem

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        lpad(trunc(dbms_random.value(0,10000)),10)      v1,
        rpad('x',20)                                    padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid wordpress format issue
;

create index t1_i1 on t1(v1);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                cascade          => true,
                method_opt       => 'for all columns size 1'
        );
end;
/

select blocks from user_tables where table_name = 'T1';
select leaf_blocks from user_indexes where index_name = 'T1_I1';

I’ve created a fairly narrow table with 1M rows, and created an index on that table on the v1 column. In my test there were 5,219 blocks in the table and 3,068 blocks in the index.

I’m now going to run a query to count the rows in the table where v1 is not null and report its execution plan. The first run will be unhinted, the second run will include a hint that forces an index full scan, and the third run will include a hint to force an index fast full scan.


prompt  ==============
prompt  Unhinted query
prompt  ==============

select
        count(*)
from
        t1
where
        v1 is not null
;

prompt  ==========
prompt  Index hint
prompt  ==========

select
        /*+ index(t1) */
        count(*)
from
        t1
where
        v1 is not null
;

prompt  ==============
prompt  index_ffs hint
prompt  ==============

select
        /*+ index_ffs(t1) */
        count(*)
from
        t1
where
        v1 is not null
;

Here are the three execution plans. As you might have predicted the default plan (for my Oracle 19.3) is the index fast full scan:


==============
Unhinted query
==============
-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |   450  (15)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|    10M|   450  (15)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1" IS NOT NULL)

==========
Index hint
==========
--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |     1 |    11 |  3159   (3)| 00:00:01 |
|   1 |  SORT AGGREGATE  |       |     1 |    11 |            |          |
|*  2 |   INDEX FULL SCAN| T1_I1 |  1000K|    10M|  3159   (3)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1" IS NOT NULL)

==============
index_ffs hint
==============
-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |   450  (15)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|    10M|   450  (15)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1" IS NOT NULL)

Since the size of the index is less than the size of the table the cost of the index fast full scan will be lower than the cost of a tablescan; and since an index fast full scan does multiblock reads while the index full scan is assumed to use single block reads the cost of the index fast full scan will be lower than the cost of the index full scan.

Now we set the optimizer_index_cost_adj to 1 (as happens in a number of “legacy” commercial products) and repeat the experiment. Here are the three plans (reduced to just the body of the plan):

==================
OICA = 1, Unhinted
==================
--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |     1 |    11 |    32   (4)| 00:00:01 |
|   1 |  SORT AGGREGATE  |       |     1 |    11 |            |          |
|*  2 |   INDEX FULL SCAN| T1_I1 |  1000K|    10M|    32   (4)| 00:00:01 |
--------------------------------------------------------------------------

====================
OICA = 1, index hint
====================
--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |     1 |    11 |    32   (4)| 00:00:01 |
|   1 |  SORT AGGREGATE  |       |     1 |    11 |            |          |
|*  2 |   INDEX FULL SCAN| T1_I1 |  1000K|    10M|    32   (4)| 00:00:01 |
--------------------------------------------------------------------------

========================
OICA = 1, index_ffs hint
========================
-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |   450  (15)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|    10M|   450  (15)| 00:00:01 |
-------------------------------------------------------------------------------

The default plan is now the index full scan, not the index fast full scan; and when we look at the two hinted plans we can see why. The full scan cost has been reduced to 1% of the original cost, as dictated by the setting for optimizer_index_cost_adj (3,159 * 1/100, rounded up, gives the reported cost of 32); the fast full scan cost has not been adjusted.

So is this a good thing or a bad thing? Different people may answer that question differently. There are clearly going to be cases where changing the setting of optimizer_index_cost_adj (which you shouldn’t have been doing anyway for the last 20 years or so) will result in some plans switching from fast full scans to full scans, with unfortunate consequences;

On the other hand when you do a fast full scan it may do direct path reads, while the full scan will do cached reads, and the cache may be holding a lot of the index leaf blocks already, so some people might prefer the current behaviour;

On the other, other hand an Exadata system can use storage indexes when doing direct path segment scans (even when the segment is an index) so an Exadata fast full scan might be far more desirable than a partially cached full scan.

So no generic right answer, only specific answers for particular sets of circumstances.

Solutions

This issue was addressed some time in the 10g timeline – my notes say the earliest successful tests I did of the fix were on 10.2.0.4, and that it wasn’t available in 9.2.0.8. There are two possibilities – a fix control, and an event.

  • Fix control: 4483286 – described as “Discount FFS cost using optimizer_index_cost_adj”
  • Event: 38085 at level one. I’ve never seen this event in the oraus.msg file, but it is referenced by bug 4483286

Of the two, setting the fix control is probably the safer option since the event is (apparently) non-existent and there’s always the possibility that the number will be hi-jacked in a future release to do something completely different. It’s also worth noting that the fix control can be applied to an individual statement through the /*+ opt_param() */ hint (there’s an example of the syntax towards the end of another article of mine).
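As a sketch of that statement-level option, applied to the demonstration query (the second opt_param() line simply keeps the whole experiment inside the one statement):

select
        /*+
                opt_param('_fix_control' '4483286:1')
                opt_param('optimizer_index_cost_adj' 1)
        */
        count(*)
from
        t1
where
        v1 is not null
;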

A little demonstration:


-- alter session set events '38085 trace name context forever, level 1';
alter session set "_fix_control"='4483286:1';
alter session set optimizer_index_cost_adj = 1;

set autotrace traceonly explain

prompt  ==============
prompt  Unhinted query
prompt  ==============

select
        count(*)
from
        t1
where
        v1 is not null
;

-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|    10M|     4   (0)| 00:00:01 |
-------------------------------------------------------------------------------

With the event or _fix_control set the cost of the index fast full scan drops to 1%, and the path reappears without being hinted.

September 30, 2020

Serial Bloom

Filed under: Execution plans,Oracle,Parallel Execution,Performance,Upgrades — Jonathan Lewis @ 1:35 pm BST Sep 30,2020

Following the recent note I wrote about an enhancement to the optimizer’s use of Bloom filters, I received a question by email asking about the use of Bloom filters in serial execution plans:

I’m having difficulty understanding the point of a Bloom filter when used in conjunction with a hash join where everything happens within the same process.

I believe you mentioned in your book (Cost Based Oracle) that hash joins have a mechanism similar to a Bloom filter where a row from the probe table is checked against a bitmap, where each hash table bucket is indicated by a single bit. (You have a picture on page 327 of the hash join and bitmap, etc).

The way that bitmap looks and operates appears to be similar to a Bloom filter to me…. So it looks (to me) like hash joins have a sort of “Bloom filter” already built into them.

My question is… What is the advantage of adding a Bloom filter to a hash join if you already have a type of Bloom filter mechanism thingy built in to hash joins?

I can understand where it would make sense with parallel queries having to pass data from one process to another, but if everything happens within the same process I’m just curious where the benefit is.

 

The picture on page 327 of CBO-F is a variation on the following, which is the penultimate snapshot of the sequence of events in a multi-pass hash join. The key feature is the in-memory bitmap at the top of the image describing which buckets in the (partitioned and spilled) hash table hold rows from the build table. I believe that it is exactly this bitmap that is used as the Bloom filter.

The question of why it might be worth creating and using a Bloom filter in a simple serial hash join is really a question of scale. What is the marginal benefit of the Bloom filter when the basic hash join mechanism is doing all the hash arithmetic and comparing with a bitmap anyway?

If the hash join is running on an Exadata machine then the bitmap can be passed as a predicate to the cell servers and the hash function can be used at the cell server to minimise the volume of data that has to be passed back to the database server – with various optimisations dependent on the version of the Exadata software. Clearly minimising traffic through the interconnect is going to have some benefit.

Similarly, as the email suggests, for a parallel query where (typically) one set of parallel processes will read the probe table and distribute the data to the second set of parallel processes which then do the hash join, it’s clearly sensible to allow the first set of processes to apply the hash function and discard as many rows as possible before distributing the survivors – minimising inter-process communication.

In both these cases, of course, there’s a break point to consider of how effective the Bloom filter needs to be before it’s worth taking advantage of the technology. If the Bloom filter allows 99 rows out of every hundred to be passed to the database server / second set of parallel processes then Oracle has executed the hash function and checked the bitmap 100 times to avoid sending one row (and it will (may) have to do the same hash function and bitmap check again to perform the hash join); on the other hand if the Bloom filter discards 99 rows and leaves only one row surviving then that’s a lot of traffic eliminated – and that’s likely to be a good thing. This is why there are a few hidden parameters defining the boundaries of when Bloom filters should be used – in particular there’s a parameter “_bloom_filter_ratio” which defaults to 35 and is, I suspect, a figure which says something like “use Bloom filtering only if it’s expected to reduce the probe data to 35% of the unfiltered size”.
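If you want to see the Bloom-filter related hidden parameters for yourself, the standard approach is a query against the x$ structures – a sketch that assumes you can connect as SYS:

select
        x.ksppinm       name,
        y.ksppstvl      value,
        x.ksppdesc      description
from
        x$ksppi         x,
        x$ksppcv        y
where
        y.indx = x.indx
and     x.ksppinm like '%bloom%'
order by
        x.ksppinm
;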

So the question then becomes: “how could you benefit from a serial Bloom filter when it’s the same process doing everything and there’s no “long distance” traffic going on between processes?” The answer is simply that we’re operating at a much smaller scale. I’ve written blog notes in the past where the performance of a query depends largely on the number of rows that are passed up a query plan before being eliminated (for example here, where the volume of data being moved accounts for a significant fraction of the total time).

If you consider a very simple hash join its plan is going to be shaped something like this:


-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |   720 |    31 |
|*  1 |  HASH JOIN         |      |    45 |   720 |    31 |
|*  2 |   TABLE ACCESS FULL| T2   |    15 |   120 |    15 |
|   3 |   TABLE ACCESS FULL| T1   |  3000 | 24000 |    15 |
-----------------------------------------------------------

If you read Tanel Poder’s article on execution plans as a tree of Oracle function calls you’ll appreciate that you could translate this into informal English along the lines of:

  • Operation 1 calls a function (at operation 2) to do a tablescan of t1 and return all the relevant rows, building an in-memory hash table by applying a hashing function to the join column(s) of each row returned by the call to the tablescan. As the hash table is populated the operation also constructs a bitmap to flag buckets in the hash table that have been populated.
  • Operation 1 then calls a function (at operation 3) to start a tablescan and then makes repeated calls for it to return one row (or, in newer versions, a small rowset) at a time from table t2. For each row returned operation 1 applies the same hash function to the join column(s) and checks the bitmap to see if there’s a potential matching row in the relevant bucket of the hash table, and if there’s a potential match Oracle examines the actual contents of the bucket (which will be stored as a linked list) to see if there’s an actual match.

Taking the figures above, let’s imagine that Oracle is using a rowset size of 30 rows. Operation 1 will have to make 100 calls to Operation 3 to get all the data, and call the hashing function 3,000 times.  A key CPU component of the work done is that the function represented by operation 3 is called 100 times and (somehow) allocates and fills an array of 30 entries each time it is called.

Now assume operation 1 passes the bitmap to operation 3 as an input and it happens to be a perfect bitmap. Operation 3 starts its tablescan and will call the hash function 3,000 times, but at most 45 rows will get past the bitmap. So operation 1 will only have to call operation 3 twice.  Admittedly operation 1 will (possibly) call the hash function again for each row – but maybe operation 3 will supply the hash value in the return array. Clearly there’s scope here for a trade-off between the reduction in work due to the smaller number of calls and the extra work needed to take advantage of the bitmap technology.

Here’s an example that shows the potential for savings – if you want to recreate this test you’ll need about 800MB of free space in the database, the first table takes about 300MB and the second about 450MB.


rem
rem     Script:         bloom_filter_serial_02.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,30,'0')             v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7 -- > comment to avoid WordPress format issue
;

create table t2
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        round(rownum + 0.5,2)           id,
        mod(rownum,1e5)                 n1,
        lpad(rownum,10)                 v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7 -- > comment to avoid WordPress format issue
;


prompt  =================
prompt  With Bloom filter
prompt  =================

select 
        /*+ 
                px_join_filter(t1) 
                monitor
        */
        t1.v1, t2.v1
from 
        t2, t1
where 
        t2.n1 = 0
and 
        t1.id = t2.id
/

prompt  ===============
prompt  No Bloom filter
prompt  ===============

select 
        /*+
                monitor
        */
        t1.v1, t2.v1
from 
        t2, t1
where 
        t2.n1 = 0
and 
        t1.id = t2.id
/

I’ve created tables t1 and t2 with an id column that never quite matches, but the range of values is set so that the optimizer thinks the two tables might have a near-perfect 1 to 1 match. I’ve given t2 an extra column with 1e5 distinct values in its 1e7 rows, so it’s going to have 100 rows per distinct value. Then I’ve presented the optimizer with a query that looks as if it’s going to find 100 rows in t2 and needs to find a probable 100 rows of matches in t1. For my convenience, and to highlight a couple of details of Bloom filters, it’s not going to find any matches.

In both runs I’ve enabled the SQL Monitor feature with the /*+ monitor */ hint, and in the first run I’ve also hinted the use of a Bloom filter. Here are the resulting SQL Monitor outputs. Bear in mind we’re looking at a reasonably large scale query – volume of input data – with a small result set.
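For anyone repeating the test, the text version of a SQL Monitor report can be pulled with a call like the following once the query has completed – only a sketch, and &m_sql_id is a hypothetical substitution variable standing in for whatever sql_id your run produces:

set long 250000 longchunksize 250000 linesize 250 pagesize 0

select
        dbms_sqltune.report_sql_monitor(
                sql_id => '&m_sql_id',
                type   => 'TEXT'
        )
from
        dual
;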

First without the Bloom filter:


Global Stats
================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
================================================================
|    3.00 |    2.24 |     0.77 |     1 |  96484 |  773 | 754MB |
================================================================

SQL Plan Monitoring Details (Plan Hash Value=2959412835)
==================================================================================================================================================
| Id |      Operation       | Name |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                      |      | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT     |      |         |       |         2 |     +2 |     1 |        0 |      |       |     . |          |                 |
|  1 |   HASH JOIN          |      |     100 | 14373 |         2 |     +2 |     1 |        0 |      |       |   2MB |          |                 |
|  2 |    TABLE ACCESS FULL | T2   |      99 |  5832 |         2 |     +1 |     1 |      100 |  310 | 301MB |     . |          |                 |
|  3 |    TABLE ACCESS FULL | T1   |     10M |  8140 |         2 |     +2 |     1 |      10M |  463 | 453MB |     . |          |                 |
==================================================================================================================================================

According to the Global Stats the query has taken 3 seconds to complete, of which 2.24 seconds is CPU. (The 750MB read in 0.77 seconds would be due to the fact that I’m running off SSD, and I’ve got a 1MB read size that helps.) A very large fraction of the CPU appears because of the number of calls from operation 1 to operation 3 (the projection information pulled from memory reports a rowset size of 256 rows, so that’s roughly 40,000 calls to the function).
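If you want to see the rowset size for yourself it’s visible in the projection information of the plan pulled from memory – a sketch of the call, where the “+projection” format option is the relevant detail:

select * from table(
        dbms_xplan.display_cursor(null, null, 'allstats last +projection')
);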

When we force the use of a Bloom filter the plan doesn’t change much (though the creation and use of the Bloom filter has to be reported) – but the numbers do change quite significantly.

Global Stats
================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
================================================================
|    1.97 |    0.99 |     0.98 |     1 |  96484 |  773 | 754MB |
================================================================

SQL Plan Monitoring Details (Plan Hash Value=4148581417)
======================================================================================================================================================
| Id |       Operation       |  Name   |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                       |         | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
======================================================================================================================================================
|  0 | SELECT STATEMENT      |         |         |       |         1 |     +1 |     1 |        0 |      |       |     . |          |                 |
|  1 |   HASH JOIN           |         |     100 | 14373 |         1 |     +1 |     1 |        0 |      |       |   1MB |          |                 |
|  2 |    JOIN FILTER CREATE | :BF0000 |      99 |  5832 |         1 |     +1 |     1 |      100 |      |       |     . |          |                 |
|  3 |     TABLE ACCESS FULL | T2      |      99 |  5832 |         1 |     +1 |     1 |      100 |  310 | 301MB |     . |          |                 |
|  4 |    JOIN FILTER USE    | :BF0000 |     10M |  8140 |         1 |     +1 |     1 |    15102 |      |       |     . |          |                 |
|  5 |     TABLE ACCESS FULL | T1      |     10M |  8140 |         1 |     +1 |     1 |    15102 |  463 | 453MB |     . |          |                 |
======================================================================================================================================================


In this case the elapsed time dropped to 1.97 seconds (depending on your viewpoint that’s either a drop of “only 1.03 seconds” or a drop of “an amazing 34.3%”), with the CPU time dropping from 2.24 seconds to 0.99 seconds (a 55.8% drop!).

In this case you’ll notice that the tablescan of t1 produced only 15,102 rows to pass up to the hash join at operation 1 thanks to the application of the predicate (not reported here): filter(SYS_OP_BLOOM_FILTER(:BF0000,"T1"."ID")). Instead of 40,000 calls for the next rowset the hash function has been applied during the tablescan and operation 5 has exhausted the tablescan after only about 60 calls. This is what has given us the (relatively) significant saving in CPU.

This example of the use of a Bloom filter highlights the two points I referred to earlier.

  • First, although we see operations 4 and 5 as Join (Bloom) filter use and Table access full respectively, I don’t think the data from the tablescan is being “passed up” from operation 5 to 4; I believe operation 4 can be viewed as a “placeholder” in the plan to allow us to see the Bloom filter in action, with the hashing and filtering actually happening during the tablescan.
  • Secondly, we know that there are ultimately no rows in the result set, yet the application of the Bloom filter has not eliminated all the data. Remember that the bitmap that Oracle constructs of the hash table identifies used buckets, not actual values. Those 15,102 rows are rows that “might” find a match in the hash table because they belong in buckets that are flagged. A Bloom filter won’t discard any data that is needed, but it might fail to eliminate data that subsequently turns out to be unwanted.

How parallel is parallel anyway?

I’ll leave you with one other thought. Here’s an execution plan from 12c (12.2.0.1) which joins three dimension tables to a fact table. There are 343,000 rows in the fact table and the three joins individually identify about 4 percent of the data in the table. In a proper data warehouse we might have been looking at a bitmap star transformation solution for this query, but in a mixed system we might want to run warehouse queries against normalised data – this plan shows what Bloom filters can do to minimise the workload. The plan was acquired from memory after enabling rowsource execution statistics:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Starts | E-Rows |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |      1 |        |        |      |            |      1 |00:00:00.05 |      22 |      3 |       |       |          |
|   1 |  SORT AGGREGATE              |          |      1 |      1 |        |      |            |      1 |00:00:00.05 |      22 |      3 |       |       |          |
|   2 |   PX COORDINATOR             |          |      1 |        |        |      |            |      2 |00:00:00.05 |      22 |      3 | 73728 | 73728 |          |
|   3 |    PX SEND QC (RANDOM)       | :TQ10000 |      0 |      1 |  Q1,00 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   4 |     SORT AGGREGATE           |          |      2 |      1 |  Q1,00 | PCWP |            |      2 |00:00:00.09 |    6681 |   6036 |       |       |          |
|*  5 |      HASH JOIN               |          |      2 |     26 |  Q1,00 | PCWP |            |     27 |00:00:00.09 |    6681 |   6036 |  2171K|  2171K|     2/0/0|
|   6 |       JOIN FILTER CREATE     | :BF0000  |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|*  7 |        TABLE ACCESS FULL     | T3       |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|*  8 |       HASH JOIN              |          |      2 |    612 |  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6634 |   6026 |  2171K|  2171K|     2/0/0|
|   9 |        JOIN FILTER CREATE    | :BF0001  |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|* 10 |         TABLE ACCESS FULL    | T2       |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|* 11 |        HASH JOIN             |          |      2 |  14491 |  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6614 |   6022 |  2171K|  2171K|     2/0/0|
|  12 |         JOIN FILTER CREATE   | :BF0002  |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|* 13 |          TABLE ACCESS FULL   | T1       |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|  14 |         JOIN FILTER USE      | :BF0000  |      2 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|  15 |          JOIN FILTER USE     | :BF0001  |      2 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|  16 |           JOIN FILTER USE    | :BF0002  |      2 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|  17 |            PX BLOCK ITERATOR |          |      2 |    343K|  Q1,00 | PCWC |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|* 18 |             TABLE ACCESS FULL| T4       |     48 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.05 |    6594 |   6018 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------

It’s a parallel plan, but it’s used the 12c “PQ_REPLICATE” strategy. The optimizer has decided that all the dimension tables are so small that it’s going to allow every PX process to read every (dimension) table through the buffer cache and build its own hash tables from them. (In earlier versions you might have seen the query coordinator scanning and broadcasting the three small tables, or one set of PX processes scanning and broadcasting to the other set).

So every PX process has an in-memory hash table of all three dimension tables and then (operation 17) they start a tablescan of the fact table, picking non-overlapping rowid ranges to scan. But since they’ve each created three in-memory hash tables they’ve also been able to create three Bloom filters each, which can all be applied simultaneously as the tablescan takes place; so instead of 343,000 rows being passed up the plan and through the first hash join (where we see from operation 11 that the number of surviving rows would have been about 14,500) we see all but 27 rows discarded very early on in the processing. Like bitmap indexes, part of the power of Bloom filters lies in the fact that with the right plan the optimizer can combine them and identify a very small data set very precisely, very early.

The other thing I want you to realise about this plan, though, is that it’s not really an “extreme” parallel plan. It’s effectively running as a set of concurrent, non-interfering, serial plans. Since I was running (parallel 2) Oracle started just 2 PX processes: they both built three hash tables from the three dimension tables then split the fact table in half and took half each to do all the joins, and passed the nearly complete result to the query co-ordinator at the last moment. That’s as close as you can get to two serial, non-interfering, queries and still call it a parallel query. So, if you wonder why there might be any benefit in serial Bloom filters – Oracle has actually been benefiting from them under the covers for several years.

Summary

Bloom filters trade a decrease in messaging against an increase in preparation and hashing operations. For Exadata systems with predicate offloading it’s very easy to see the potential benefit; for parallel execution it’s also fairly easy to see the potential benefit, since inter-process messaging between two sets of PX processes can be resource-intensive; but even serial queries can benefit, though in absolute terms the saving is likely to be only a small amount of CPU.

 

September 9, 2020

Bloom Upgrade

Filed under: 18c,CBO,Joins,Oracle,Tuning,Upgrades — Jonathan Lewis @ 5:18 pm BST Sep 9,2020

It’s a common pattern of Oracle features that they start with various restrictions or limitations that disappear over time. This note is about an enhancement to Bloom filter processing that appeared in the 18.1 optimizer and, for some people, may be a good enough reason for upgrading to a newer version of Oracle. The enhancement came to my attention by way of the Oracle Developer forum, in a thread with the title “Bloom filters and view using UNION ALL” asking how to get a Bloom filter pushed inside a UNION ALL view. The original requirement wasn’t a trivial one so I’ll demonstrate the problem with a very simple example – first the data set:

rem
rem     Script:         bloom_pushdown.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2020
rem
rem     Last tested:
rem             19.3.0.0
rem

create table t1 as select * from all_objects where rownum <= 50000;
create table t2 as select t1.* from t1, (select rownum n1 from dual connect by level <= 4);
create table t3 as select t1.* from t1, (select rownum n1 from dual connect by level <= 4); -- > comment to avoid wordpress format issue

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns object_type size 254'
        );
end;
/

I’ve been a bit lazy here, copying data from the view all_objects. I’ve gathered stats on t1 so that I can generate a histogram on the object_type column because I’m going to query for a rare object_type and I want the optimizer to get a reasonable estimate of the number of rows. I’m going to hint a parallel query to join t1 to t2 (aliased, trivially, as v1 for reasons that will become apparent soon):

select
        /*+ 
                parallel(2) 
        */
        t1.object_name, v1.object_name
from
        t1,
        t2 v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/

In my case the optimizer chooses to do a hash join between these two tables, and creates a Bloom filter to try and minimise the data passing through the data flow operation. The result set in my 12.2.0.1 database is only 16 rows, so it would be nice if the parallel scan could eliminate most of the 200,000 rows in t2 early – here’s the execution plan pulled from memory after running the query with rowsource execution stats enabled:


----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |      1 |        |   371 (100)|     16 |00:00:00.06 |      20 |      0 |       |       |          |
|   1 |  PX COORDINATOR        |          |      1 |        |            |     16 |00:00:00.06 |      20 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)  | :TQ10000 |      0 |     16 |   371   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN           |          |      2 |     16 |   371   (5)|     16 |00:00:00.05 |    6278 |   3988 |  1250K|  1250K|     2/0/0|
|   4 |     JOIN FILTER CREATE | :BF0000  |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    2034 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL | T1       |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    2034 |      0 |       |       |          |
|   6 |     JOIN FILTER USE    | :BF0000  |      2 |    200K|   292   (4)|     16 |00:00:00.04 |    4244 |   3988 |       |       |          |
|   7 |      PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|     16 |00:00:00.04 |    4244 |   3988 |       |       |          |
|*  8 |       TABLE ACCESS FULL| T2       |     32 |    200K|   292   (4)|     16 |00:00:00.03 |    4244 |   3988 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   5 - filter("T1"."OBJECT_TYPE"='SCHEDULE')
   8 - access(:Z>=:Z AND :Z<=:Z)
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"V1"."OBJECT_ID"))

We see that Oracle has generated a Bloom filter at operation 4 from the data returned from t1 at operation 5, and then used that Bloom filter at operation 6 to eliminate most of the data from t2 before passing the remaining few rows up to the hash join.
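For the record, the rowsource execution stats in these plans come from enabling statistics collection and pulling the last plan from memory after each run – a minimal sketch of the calls I mean (the gather_plan_statistics hint is an alternative to changing statistics_level):

alter session set statistics_level = all;

-- run the query, then report its plan (with rowsource stats) from memory
select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));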

Let’s make the query more interesting – what if you want to use a UNION ALL of t2 and t3 in the query (for example one might be “current data” while the other is “historic data”)? Here’s the query and plan from 12.2.0.1:

select
        /*+ 
                parallel(2) 
        */
        t1.object_name, v1.object_name
from
        t1,
        (select * from t2 union all select * from t3) v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/

-----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
-----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |        |   667 (100)|     32 |00:00:00.37 |      40 |      0 |       |       |          |
|   1 |  PX COORDINATOR         |          |      1 |        |            |     32 |00:00:00.37 |      40 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)   | :TQ10000 |      0 |     32 |   667   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN            |          |      1 |     32 |   667   (5)|     32 |00:00:00.34 |    5125 |   3860 |  1250K|  1250K|     2/0/0|
|*  4 |     TABLE ACCESS FULL   | T1       |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    2034 |      0 |       |       |          |
|   5 |     VIEW                |          |      2 |    400K|   584   (4)|    400K|00:00:00.52 |    8488 |   7976 |       |       |          |
|   6 |      UNION-ALL          |          |      2 |        |            |    400K|00:00:00.24 |    8488 |   7976 |       |       |          |
|   7 |       PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|    200K|00:00:00.11 |    4244 |   3988 |       |       |          |
|*  8 |        TABLE ACCESS FULL| T2       |     32 |    200K|   292   (4)|    200K|00:00:00.07 |    4244 |   3988 |       |       |          |
|   9 |       PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|    200K|00:00:00.11 |    4244 |   3988 |       |       |          |
|* 10 |        TABLE ACCESS FULL| T3       |     32 |    200K|   292   (4)|    200K|00:00:00.03 |    4244 |   3988 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   4 - filter("T1"."OBJECT_TYPE"='SCHEDULE')
   8 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue
  10 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue

No Bloom filter – so all 400,000 rows feed up the plan and through the hash join. This won’t matter too much for my sub-second tiny data set but on a pair of 50GB tables, with the potential to offload the Bloom filter to storage in Exadata and, perhaps, eliminate 99% of the data at the cell servers, this could make a huge difference to performance.

Since Bloom filters are all about hashing data (in Oracle the standard Bloom filter is the bitmap summarising the build table in a hash join) let’s try pushing the optimizer into a hash distribution for the parallel join to see if that has any effect:


select
        /*+ 
                parallel(2) 
                gather_plan_statistics
                leading(@sel$1 t1@sel$1 v1@sel$1)
                use_hash(@sel$1 v1@sel$1)
                pq_distribute(@sel$1 v1@sel$1 hash hash)
        */
        t1.object_name, v1.object_name
from
        t1,
        (select * from t2 union all select * from t3) v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/

---------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
---------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |      1 |        |   667 (100)|     32 |00:00:00.43 |      60 |      0 |       |       |          |
|   1 |  PX COORDINATOR             |          |      1 |        |            |     32 |00:00:00.43 |      60 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)       | :TQ10002 |      0 |     32 |   667   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN BUFFERED       |          |      1 |     32 |   667   (5)|     32 |00:00:00.38 |    4000 |   3752 |  2290K|  2082K|     2/0/0|
|   4 |     JOIN FILTER CREATE      | :BF0000  |      2 |      4 |    75   (4)|      8 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      PX RECEIVE             |          |      2 |      4 |    75   (4)|      8 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       PX SEND HYBRID HASH   | :TQ10000 |      0 |      4 |    75   (4)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   7 |        STATISTICS COLLECTOR |          |      2 |        |            |      4 |00:00:00.01 |    1517 |      0 |       |       |          |
|   8 |         PX BLOCK ITERATOR   |          |      2 |      4 |    75   (4)|      4 |00:00:00.01 |    1517 |      0 |       |       |          |
|*  9 |          TABLE ACCESS FULL  | T1       |     26 |      4 |    75   (4)|      4 |00:00:00.01 |    1517 |      0 |       |       |          |
|  10 |     PX RECEIVE              |          |      2 |    400K|   584   (4)|     66 |00:00:00.77 |    8488 |   7976 |       |       |          |
|  11 |      PX SEND HYBRID HASH    | :TQ10001 |      2 |    400K|   584   (4)|     66 |00:00:00.77 |    8488 |   7976 |       |       |          |
|  12 |       JOIN FILTER USE       | :BF0000  |      2 |    400K|   584   (4)|     66 |00:00:00.77 |    8488 |   7976 |       |       |          |
|  13 |        VIEW                 |          |      2 |    400K|   584   (4)|    400K|00:00:00.68 |    8488 |   7976 |       |       |          |
|  14 |         UNION-ALL           |          |      2 |        |            |    400K|00:00:00.59 |    8488 |   7976 |       |       |          |
|  15 |          PX BLOCK ITERATOR  |          |      2 |    200K|   292   (4)|    200K|00:00:00.18 |    4244 |   3988 |       |       |          |
|* 16 |           TABLE ACCESS FULL | T2       |     32 |    200K|   292   (4)|    200K|00:00:00.06 |    4244 |   3988 |       |       |          |
|  17 |          PX BLOCK ITERATOR  |          |      2 |    200K|   292   (4)|    200K|00:00:00.12 |    4244 |   3988 |       |       |          |
|* 18 |           TABLE ACCESS FULL | T3       |     32 |    200K|   292   (4)|    200K|00:00:00.08 |    4244 |   3988 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   9 - access(:Z>=:Z AND :Z<=:Z)   -- > edit to avoid wordpress format issue 
       filter("T1"."OBJECT_TYPE"='SCHEDULE') 
  16 - access(:Z>=:Z AND :Z<=:Z)   -- > edit to avoid wordpress format issue
  18 - access(:Z>=:Z AND :Z<=:Z)   -- > edit to avoid wordpress format issue

We’ve managed to introduce a Bloom filter (which is visible as :BF0000 in the plan, even though there’s no reference to sys_op_bloom_filter() in the predicate information) but there’s a problem: we’re still passing 400,000 rows up the plan, and the Bloom filter is only being applied at (or just after) the VIEW operator, discarding all but 66 rows before the hash join. It’s an improvement but not ideal; we’d like to see the Bloom filter applied to each of the two tables separately to eliminate rows as early as possible.

This can’t be done in 12.2; you’d have to rewrite the query, changing a “join with union” into a “union of joins”, and that’s not really a desirable strategy.
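
For reference, a “union of joins” rewrite would look something like the following sketch (my own illustration, using the same tables and columns as the test query; I haven’t run it as part of this experiment). It duplicates the join and the t1 predicate in each branch, which is one of the reasons it’s not an attractive option:

select
        t1.object_name, t2.object_name
from
        t1, t2
where
        t1.object_type = 'SCHEDULE'
and     t2.object_id = t1.object_id
union all
select
        t1.object_name, t3.object_name
from
        t1, t3
where
        t1.object_type = 'SCHEDULE'
and     t3.object_id = t1.object_id
/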

Next Steps

Searching MOS, though, you will be able to find the following note:

Doc ID 18849313.8 – ENH : bloom filters/pruning are pushed through union-all view

There’s an enhancement request to do what we want, and the enhancement made it into the software in 18.1. Here’s the (unhinted) plan from 19.3 (the plan stays the same when optimizer_features_enable is set back to 18.1.0, but drops back to the 12.2 plan when OFE is set to 12.2.0.1):

------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |        |   666 (100)|     32 |00:00:00.11 |      10 |      0 |       |       |          |
|   1 |  PX COORDINATOR          |          |      1 |        |            |     32 |00:00:00.11 |      10 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)    | :TQ10000 |      0 |     32 |   666   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN             |          |      2 |     32 |   666   (5)|     32 |00:00:00.05 |   10020 |   7958 |  1250K|  1250K|     2/0/0|
|   4 |     JOIN FILTER CREATE   | :BF0000  |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    1998 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL   | T1       |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    1998 |      0 |       |       |          |
|   6 |     VIEW                 |          |      2 |    400K|   583   (4)|     32 |00:00:00.04 |    8022 |   7958 |       |       |          |
|   7 |      UNION-ALL           |          |      1 |        |            |     12 |00:00:00.02 |    4011 |   3979 |       |       |          |
|   8 |       JOIN FILTER USE    | :BF0000  |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|   9 |        PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|* 10 |         TABLE ACCESS FULL| T2       |     32 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|  11 |       JOIN FILTER USE    | :BF0000  |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|  12 |        PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|* 13 |         TABLE ACCESS FULL| T3       |     32 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   5 - filter("T1"."OBJECT_TYPE"='SCHEDULE')
  10 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T2"."OBJECT_ID")) 
  13 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T3"."OBJECT_ID"))

As you can see, we create a Bloom filter at operation 4, and use it twice at operations 8 and 11 – with the sys_op_bloom_filter() functions clearly visible in the predicate information showing us that the Bloom filter is applied to the object_id column in both cases.

If you want to disable this enhancement for some reason there are two hidden parameters available, both defaulting to true (you might set one of them to false for a single query using the opt_param() hint):

  • _bloom_filter_setops_enabled = true
  • _bloom_pruning_setops_enabled = true

The first is for Bloom filters in the situation shown; I assume the second deals with Bloom filters for partition pruning.
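
As a sketch only (I haven’t re-run the test this way, and hidden parameters should be changed only with due caution), the opt_param() hint could be used to disable the set-operation Bloom filter for just this query:

select
        /*+
                parallel(2)
                opt_param('_bloom_filter_setops_enabled' 'false')
        */
        t1.object_name, v1.object_name
from
        t1,
        (select * from t2 union all select * from t3) v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/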

Summary

In versions prior to 18.1 the optimizer is unable to push Bloom filters down to the individual tables in a UNION ALL view, but this limitation was removed in the 18.1 code set.

 

February 18, 2020

Interval Partition(s)

Filed under: Oracle,Parallel Execution,Partitioning — Jonathan Lewis @ 1:45 pm GMT Feb 18,2020

A quirky little feature of interval partitioning showed up on Twitter today – a parallel insert that would use only a single PX slave to do the inserting. With 1.1 billion rows and the option of running parallel 32 this made the loading process rather slower than it ought to have been.

Fortunately it’s quite easy to model (and work around) the oddity. So here’s a small data set and an empty partitioned table to work with:


rem
rem     Script:         pt_int_load_anomaly.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2020
rem

create table t1 
nologging 
as
select 
        ao.* 
from 
        (select * from all_objects where object_id is not null) ao,
        (select rownum id from dual connect by level <= 20)
;

create table pt1
partition  by range (object_id) interval (1000000) (
        partition p1 values less than (1)
)
as
select  * 
from    all_Objects
where   rownum = 0
/

I’ve created a table by copying all_objects 20 times, which, for my little sandbox, has given me a total of about 1.2M rows in 12.2.0.1. Then I’ve created an empty interval-partitioned clone of all_objects, with the first partition defined to hold all rows where the object_id is less than 1 (and there’s no object in the database that could match that criterion). I’ve defined the interval to be 1,000,000 and, since the highest object_id in my database is about 90,000, the first partition that gets added to this table will be able to hold all the data from t1.

So now we try to do a parallel insert from t1 into pt1, and check the execution plan and parallel execution statistics:


set serveroutput off

insert /*+ append enable_parallel_dml parallel(6) */ into pt1 select * from t1;

select * from table(dbms_xplan.display_cursor);

start pq_tqstat

Note how I’ve used the hint /*+ enable_parallel_dml */ (possibly a 12c hint back-ported to 11.2.0.4) rather than an “alter session” command; it’s just a little convenience to be able to embed the directive in the SQL. The pq_tqstat script is one I published some time ago to report the contents of the session-specific dynamic performance view v$pq_tqstat immediately after running a parallel statement.
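
If you don’t have the published script to hand, a minimal substitute would be something like the following (my own sketch – it omits the percentage columns that appear in the output below):

break on dfo_number skip 1 on tq_id on server_type

select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes, waits, timeouts, avg_latency
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
/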

Here’s the plan:


SQL_ID  25hub68pf1z1s, child number 0
-------------------------------------
insert /*+ append enable_parallel_dml parallel(6) */ into pt1 select *
from t1

Plan hash value: 2888707464

-------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                            |          |       |       |   631 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                             |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)                       | :TQ10001 |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT (HIGH WATER MARK BROKERED)| PT1      |       |       |            |          |  Q1,01 | PCWP |            |
|   4 |     OPTIMIZER STATISTICS GATHERING          |          |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE                             |          |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,01 | PCWP |            |
|   6 |       PX SEND RANDOM LOCAL                  | :TQ10000 |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,00 | P->P | RANDOM LOCA|
|   7 |        PX BLOCK ITERATOR                    |          |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,00 | PCWC |            |
|*  8 |         TABLE ACCESS FULL                   | T1       |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,00 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access(:Z>=:Z AND :Z<=:Z)

Note
-----
   - Degree of Parallelism is 6 because of hint

The most important detail of this plan is that the PX slaves do the load as select (operation 3), then send a message to the query coordinator (PX send QC, operation 2) to tell it about the data load. They do not send their data to the QC for the QC to do the load.

So the plan says we will be doing parallel DML, but here’s what v$pq_tqstat tells us:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P006                215880   34785363     17.47      16.86         16          0           0
                                             1 P007                202561   34436325     16.39      16.69         17          0           0
                                             1 P008                207519   34564496     16.79      16.75         17          0           0
                                             1 P009                208408   34594770     16.86      16.77         17          0           0
                                             1 P00A                198915   33529627     16.10      16.25         16          0           0
                                             1 P00B                202537   34430603     16.39      16.69         16          0           0
                      Consumer               1 P000                     0        144      0.00       0.00         51         47           0
                                             1 P001                     0        144      0.00       0.00         51         47           0
                                             1 P002               1235820  206340464    100.00     100.00         75         47           0
                                             1 P003                     0        144      0.00       0.00         51         47           0
                                             1 P004                     0        144      0.00       0.00       1138       1134           0
                                             1 P005                     0        144      0.00       0.00       1137       1133           0

                    1 Producer               1 P000                     0         24      0.00       5.91         51         42           0
                                             1 P001                     0         24      0.00       5.91         50         41           0
                                             1 P002                     2        286    100.00      70.44         58         14           0
                                             1 P003                     0         24      0.00       5.91         51         43           0
                                             1 P004                     0         24      0.00       5.91         51         42           0
                                             1 P005                     0         24      0.00       5.91         51         43           0
                      Consumer               1 QC                       2        406    100.00     100.00        311        179           0

19 rows selected.

The query did run parallel 6 as hinted – and 6 PX slaves scanned the t1 table; but they all sent all their data to one PX slave in the second slave set and that one PX slave did all the inserts. The plan was parallel, but the execution was effectively serial. (You’ll note there is something a little odd about the waits and timeouts for p004 and p005 when they are acting as consumers. I may worry about that later, but it could be a host-based side effect of running parallel 6 on a VM with 4 CPUs.)

The serialization leads to two questions:

  1. What went wrong?
  2. How do we work around this and make the insert “truly” parallel?

My answer to (1) is “I don’t know – but I’ll look at it if necessary”, combined with the guess that it’s something to do with the table having only one partition at the outset, and that this has an unexpected side effect on the randomising function for the PX distribution.

My answer to (2) is “if I’m right about (1), why not try pre-defining two partitions, and I’ll even let both of them stay empty”.

So here’s my new definition for pt1:


create table pt1
partition  by range (object_id) interval (1000000) (
        partition p0 values less than (0),
        partition p1 values less than (1)
)
as
select  * 
from    all_Objects
where   rownum = 0
/

Re-running the test with the completely redundant (and permanently empty) p0 partition, the plan doesn’t change but the results from v$pq_tqstat change dramatically:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P006                207897   34581153     16.82      16.76         23          4           0
                                             1 P007                215669   34786429     17.45      16.86         30          5           0
                                             1 P008                221474   36749626     17.92      17.81         28          5           0
                                             1 P009                204959   34497164     16.58      16.72         22          2           0
                                             1 P00A                177755   30141002     14.38      14.61         21          0           0
                                             1 P00B                208066   35585810     16.84      17.25         25          2           0
                      Consumer               1 P000                213129   35612973     17.25      17.26         82         57           0
                                             1 P001                200516   33570586     16.23      16.27         84         55           0
                                             1 P002                203395   33950449     16.46      16.45         83         56           0
                                             1 P003                205458   34235575     16.63      16.59         82         54           0
                                             1 P004                204111   33999932     16.52      16.48        581        555           0
                                             1 P005                209211   34971669     16.93      16.95        580        553           0

                    1 Producer               1 P000                     2        286     16.67      16.67        422        149           0
                                             1 P001                     2        286     16.67      16.67        398        130           0
                                             1 P002                     2        286     16.67      16.67        405        128           0
                                             1 P003                     2        286     16.67      16.67        437        161           0
                                             1 P004                     2        286     16.67      16.67        406        116           0
                                             1 P005                     2        286     16.67      16.67        440        148           0
                      Consumer               1 QC                      12       1716    100.00     100.00        242        111           0

19 rows selected.

Every consumer receives and inserts roughly 200,000 rows – it’s a totally fair parallel DML. Timings are pretty irrelevant for such a small data set but the execution time did drop from 7 seconds to 4 seconds when parallelism was working “properly”.

I’ve tested this script on 12.2.0.1 and 19c (19.3 and 19.11) – the same anomaly appears in both versions though it might be worth noting that the strange skew in the waits and timeouts doesn’t appear in 19c.
