In my previous note on *Join Cardinality* I described a method for calculating the figure that the optimizer would give for the special case where you had a query that:

- joined two tables
- joined on a single column with an equality predicate
- had no nulls in the join columns
- had a perfect frequency histogram on the columns at the two ends of the join
- had no filter predicates associated with either table

The method simply said: *“Match up rows from the two frequency histograms, multiply the corresponding frequencies”* and I supplied a simple SQL statement that would read and report the two sets of histogram data, doing the arithmetic and reporting the final cardinality for you. In an update I also added an adjustment needed in 11g (or, you might say, removed in 12c) where gaps in the histograms were replaced by “ghost rows” with a frequency that was half the lowest frequency in the histogram.
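The matching-and-multiplying step is easy to sketch outside the database. Here's a minimal Python illustration (not the SQL from the original note): the histograms are represented as value-to-frequency dictionaries, and the first few frequencies are taken from the sample output later in this post.

```python
# Sketch of the basic method: match values across two frequency histograms
# and sum the products of the corresponding frequencies.
# Histogram contents here are illustrative (first rows of the output below).
t1_hist = {0: 2488, 1: 2693, 2: 2635, 3: 2636}
t2_hist = {0: 2619, 1: 2599, 2: 2685, 4: 100}

def join_cardinality(h1, h2):
    # Only values present in BOTH histograms contribute to the join.
    return sum(freq * h2[value] for value, freq in h1.items() if value in h2)

print(join_cardinality(t1_hist, t2_hist))
```

Values 3 and 4 appear in only one histogram each, so they contribute nothing; the 11g "ghost row" adjustment mentioned above would fill such gaps with half the lowest frequency before multiplying.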

This is a nice place to start as the idea is very simple, and it’s likely that extensions of the basic idea will be used in all the other cases we have to consider. There are 25 possibilities that could need separate testing – though only 16 of them ought to be relevant from 12c onwards. Oracle allows for four kinds of histograms – in order of how precisely they describe the data they are:

- Frequency – with a perfect description of the data
- Top-N (a.k.a. Top-Frequency) – which describes all but a tiny fraction (ca. one bucket’s worth) of data perfectly
- Hybrid – which can (but doesn’t usually, by default) describe up to 2,048 popular values perfectly and gives an approximate distribution for the rest
- Height-balanced – which can (but doesn’t usually, by default) describe at most 1,024 popular values with some scope for misinformation.

Finally, of course, we have the general case of no histogram, using only 4 numbers (low value, high value, number of rows, number of distinct values) to give a rough picture of the data – and the need for histograms appears, of course, when the data doesn’t look anything like an even distribution of values between the low and high with close to “number of rows”/”number of distinct values” for each value.
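The "no histogram" model reduces to one line of arithmetic: with an assumed even distribution, the expected row count for any single value is simply rows divided by distinct values. A trivial Python sketch (the numbers are hypothetical, chosen only to show the shape of the calculation):

```python
def uniform_estimate(num_rows, num_distinct):
    # With no histogram the optimizer's basic model assumes each value
    # appears num_rows / num_distinct times.
    return num_rows / num_distinct

# e.g. 100,000 rows with 990 distinct values -> roughly 101 rows per value
print(uniform_estimate(100000, 990))
```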

So there are 5 possible statistical descriptions for the data in a column – which means there are 5 * 5 = 25 possible options to consider when we join two columns, or 4 * 4 = 16 if we label height-balanced histograms as obsolete and ignore them (which would be a pity because Chinar has done some very nice work explaining them).

Of course, once we’ve worked out a single-column equijoin between two tables there are plenty more options to consider: multi-column joins, joins involving range-based predicates, joins involving more than 2 tables, and queries which (as so often happens) have predicates which aren’t involved in the joins.

For the moment I’m going to stick to the simplest case – two tables, one column, equality – and comment on the effects of filter predicates. It seems to be very straightforward as I’ll demonstrate with a new model

```sql
rem
rem     Script:         freq_hist_join_03.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem

execute dbms_random.seed(0)

create table t1(
        id      number(8,0),
        n0040   number(4,0),
        n0090   number(4,0),
        n0190   number(4,0),
        n0990   number(4,0),
        n1      number(4,0)
)
;

create table t2(
        id      number(8,0),
        n0050   number(4,0),
        n0110   number(4,0),
        n0230   number(4,0),
        n1150   number(4,0),
        n1      number(4,0)
)
;

insert into t1
with generator as (
        select rownum id
        from   dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,  40) + 1                    n0040,
        mod(rownum,  90) + 1                    n0090,
        mod(rownum, 190) + 1                    n0190,
        mod(rownum, 990) + 1                    n0990,
        trunc(30 * abs(dbms_random.normal))     n1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;

insert into t2
with generator as (
        select rownum id
        from   dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,   50) + 1                   n0050,
        mod(rownum,  110) + 1                   n0110,
        mod(rownum,  230) + 1                   n0230,
        mod(rownum, 1150) + 1                   n1150,
        trunc(30 * abs(dbms_random.normal))     n1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns n1 size 254'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1 for columns n1 size 254'
        );
end;
/
```

You’ll notice that in this script I’ve created empty tables and then populated them. This is because of an anomaly that appeared in 18.3 when I used “create as select”, and it should allow the results from 18.3 to be an exact match for 12c. You don’t need to pay much attention to the Nxxx columns; they were there so I could experiment with a few variations in the selectivity of filter predicates.

Given the purpose of the demonstration I’ve gathered histograms on the column I’m going to use to join the tables (called n1 in this case), and here are the summary results:

```
TABLE_NAME           COLUMN_NAME          HISTOGRAM       NUM_DISTINCT NUM_BUCKETS
-------------------- -------------------- --------------- ------------ -----------
T1                   N1                   FREQUENCY                119         119
T2                   N1                   FREQUENCY                124         124


     VALUE  FREQUENCY  FREQUENCY      PRODUCT
---------- ---------- ---------- ------------
         0       2488       2619    6,516,072
         1       2693       2599    6,999,107
         2       2635       2685    7,074,975
         3       2636       2654    6,995,944
...
       113          1          3            3
       115          1          2            2
       116          4          3           12
       117          1          1            1
       120          1          2            2
                                ------------
sum                              188,114,543
```

We’ve got frequency histograms, and we can see that they don’t have a perfect overlap. I haven’t printed every single line from the cardinality query, just enough to show you the extreme skew, a few gaps, and the total. So here are three queries with execution plans:

```sql
set serveroutput off
alter session set statistics_level = all;
alter session set events '10053 trace name context forever';

select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
and     t1.n0990 = 20
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
and     t1.n0990 = 20
and     t2.n1150 = 25
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));
```

I’ve queried the pure join – the count was exactly the 188,114,543 predicted by the cardinality query, of course – then I’ve applied a filter to one table, then to both tables. The first filter *n0990 = 20* will (given the *mod(,990)* definition) identify one row in 990 from the original 100,000 in *t1*; the second filter *n1150 = 25* will identify one row in 1,150 from *t2*. That’s filtering down to 101 rows and 87 rows respectively from the two tables. So what do we see in the plans:

```
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:23.47 |     748 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:23.47 |     748 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    188M|    188M|00:00:23.36 |     748 |  6556K|  3619K| 8839K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100K|    100K|00:00:00.01 |     374 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    100K|    100K|00:00:00.01 |     374 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")


-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.02 |     748 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.02 |     748 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    190K|    200K|00:00:00.02 |     748 |  2715K|  2715K| 1647K (0)|
|*  3 |    TABLE ACCESS FULL| T1   |      1 |    101 |    101 |00:00:00.01 |     374 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    100K|    100K|00:00:00.01 |     374 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")
   3 - filter("T1"."N0990"=20)


-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |     748 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |     748 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    165 |    165 |00:00:00.01 |     748 |  2715K|  2715K| 1678K (0)|
|*  3 |    TABLE ACCESS FULL| T2   |      1 |     87 |     87 |00:00:00.01 |     374 |       |       |          |
|*  4 |    TABLE ACCESS FULL| T1   |      1 |    101 |    101 |00:00:00.01 |     374 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")
   3 - filter("T2"."N1150"=25)
   4 - filter("T1"."N0990"=20)
```

The first execution plan shows an estimate of 188M rows – but we’ll have to check the trace file to confirm whether that’s only an approximate match to our calculation, or whether it’s an exact match. So here’s the relevant pair of lines:

```
Join Card:  188114543.000000 = outer (100000.000000) * inner (100000.000000) * sel (0.018811)
Join Card - Rounded: 188114543 Computed: 188114543.000000
```

Yes, the cardinality calculation and the execution plan estimates match perfectly. But there are a couple of interesting things to note. First, Oracle seems to be deriving the cardinality by multiplying the individual cardinalities of the two tables with a figure it calls “sel” – the thing that *Chinar Aliyev* has labelled *J_sel*, the *“Join Selectivity”*. Secondly, Oracle can’t do arithmetic – or, removing tongue from cheek, the value it reports for the join selectivity is printed to only 6 decimal places but stored to far greater precision. What is the Join Selectivity, though? It’s the figure we derive from the cardinality SQL divided by the cardinality of the Cartesian join of the two tables – i.e. 188,114,543 / (100,000 * 100,000).
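That division is worth spelling out explicitly. A two-line sketch (Python, purely to show the arithmetic from the trace file above):

```python
# Join Selectivity = histogram-derived join cardinality
#                    divided by the cardinality of the Cartesian join.
join_card = 188114543        # sum of frequency products from the histograms
t1_rows   = 100000
t2_rows   = 100000

j_sel = join_card / (t1_rows * t2_rows)
print(j_sel)    # approx 0.0188114543 -- the trace prints only "sel (0.018811)"
```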

With the clue from the first trace file, can we work out why the second and third plans show 190K and 165 rows respectively? How about this – multiply the filtered cardinalities of the two separate tables, then multiply the result by the join selectivity:

- 1a) *n0990 = 20*: gives us 1 row in every 990. 100,000 / 990 = **101.010101…** (echoing the rounded execution plan estimate).
- 1b) (100,000/990) * 100,000 * 0.0188114543 = **190,014.69898989…** (which is in the ballpark of the plan and needs confirmation from the trace file).
- 2a) *n1150 = 25*: gives us 1 row in every 1,150. 100,000 / 1,150 = **86.9565217…** (echoing the rounded execution plan estimate).
- 2b) (100,000/990) * (100,000/1,150) * 0.0188114543 = **165.2301651…** (echoing the rounded execution plan estimate).
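The same arithmetic, collected into a few lines of Python so it's easy to replay with different filter selectivities (the numbers are those of the model in this post):

```python
# Filtered join cardinality = filtered rows(t1) * filtered rows(t2) * J_sel
j_sel = 188114543 / (100000 * 100000)   # join selectivity, ~0.0188114543

t1_filtered = 100000 / 990              # n0990 = 20  -> ~101.0101 rows
t2_filtered = 100000 / 1150             # n1150 = 25  -> ~86.9565 rows

one_filter  = t1_filtered * 100000      * j_sel   # filter on t1 only, ~190014.69
two_filters = t1_filtered * t2_filtered * j_sel   # filters on both,   ~165.23

print(round(one_filter), round(two_filters))
```

Rounding these gives the 190,015 and 165 that the optimizer reports.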

Cross-checking against extracts from the 10053 trace files:

```
Join Card:  190014.689899 = outer (101.010101) * inner (100000.000000) * sel (0.018811)
Join Card - Rounded: 190015 Computed: 190014.689899

Join Card:  165.230165 = outer (86.956522) * inner (101.010101) * sel (0.018811)
Join Card - Rounded: 165 Computed: 165.230165
```

### Conclusion.

Remembering that we’re still looking at very simple examples with perfect frequency histograms: it looks as if we can work out a *“Join Selectivity”* (J_sel) – the selectivity of a “pure” unfiltered join of the two tables – by querying the histogram data, then use the resulting value to calculate cardinalities for simple two-table equi-joins: multiply together the individual (filtered) table cardinality estimates and scale by the Join Selectivity.

### Acknowledgements

Most of this work is based on **a document written by Chinar Aliyev** in 2016 and presented at the Hotsos Symposium the same year. I am most grateful to him for responding to a recent post of mine and getting me interested in spending some time to get re-acquainted with the topic. His original document is a 35-page PDF, so there’s plenty more material to work through, experiment with, and write about.