## October 9, 2018

### Join Cardinality – 3

Filed under: CBO,Histograms,Oracle,Statistics — Jonathan Lewis @ 1:01 pm BST Oct 9,2018

In the previous posting I listed the order of precision of histograms as:

• Frequency
• Top-Frequency
• Hybrid
• Height-balanced
• None

Having covered the Frequency/Frequency join (for a single column, no nulls, equijoin) in the previous posting I’ve decided to work down the list and address Frequency/Top-Frequency in this posting. It gets a little harder to generate data as we move to the less precise histograms since we need to have skew, we want some gaps, and (for Top-Frequency) we need to have some data that can be “ignored”. On the plus side, though, I want to work with a small number of buckets to keep the output of any queries I run fairly short so I’m going to stick with a small number of buckets, which means the “small” volume of “ignorable” data (the “spare” bucket) can be relative large. Here’s the code I used to generate data for my investigation – 100 rows for the table with a frequency histogram and 800 rows for the table with a top-frequency.

```
rem
rem     Script:         freq_hist_join_05.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem     Purpose:
rem
rem     Last tested
rem             18.3.0.0
rem             12.2.0.1
rem

execute dbms_random.seed(0)

create table t1 (
id              number(6),
n04             number(6),
n05             number(6),
n20             number(6),
j1              number(6)
)
;

create table t2(
id              number(8,0),
n20             number(6,0),
n30             number(6,0),
n50             number(6,0),
j2              number(6,0)
)
;

insert into t1
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 -- > comment to avoid WordPress format issue
)
select
rownum                                  id,
mod(rownum,   4) + 1                    n04,
mod(rownum,   5) + 1                    n05,
mod(rownum,  20) + 1                    n20,
trunc(2.5 * trunc(sqrt(v1.id*v2.id)))   j1
from
generator       v1,
generator       v2
where
v1.id <= 10 -- > comment to avoid WordPress format issue
and     v2.id <= 10 -- > comment to avoid WordPress format issue
;

insert into t2
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 -- > comment to avoid WordPress format issue
)
select
rownum                                  id,
mod(rownum,   20) + 1                   n20,
mod(rownum,   30) + 1                   n30,
mod(rownum,   50) + 1                   n50,
28 - round(abs(7*dbms_random.normal))        j2
from
generator       v1
where
rownum <= 800 -- > comment to avoid WordPress format issue
;

begin
dbms_stats.gather_table_stats(
ownname          => null,
tabname          => 'T1',
method_opt       => 'for all columns size 1 for columns j1 size 254'
);
dbms_stats.gather_table_stats(
ownname          => null,
tabname          => 'T2',
method_opt       => 'for all columns size 1 for columns j2 size 16'
);
end;
/

```

In this example I’ve used the sqrt() function and the dbms_random.normal() function to generate the data. The scaling and truncating I’ve done on the results has given me two sets of data which have a nice skew, some gaps, but different patterns (though both have a small number of small values and a larger number of larger values). The data from dbms_random.normal() will produce 22 distinct values, so I’ve requested a histogram with 16 buckets and checked that this will produce a Top-Frequency histogram. (If I want a Hybrid histogram – for the next thrilling installment in the series – I’ll just reduce the number of buckets slightly).

Here are the resulting stats, preceded by the code that reported them:

```
select  table_name, column_name, histogram, num_distinct, num_buckets, density
from    user_tab_cols
where   table_name in ('T1','T2')
and     column_name in ('J1','J2')
order by table_name
;

select  table_name, num_rows
from    user_tables
where   table_name in ('T1','T2')
order by table_name
;

break on table_name skip 1 on report skip 1

with f1 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T1'
and     column_name = 'J1'
order by
endpoint_value
),
f2 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T2'
and     column_name = 'J2'
order by
endpoint_value
)
select f1.* from f1
union all
select f2.* from f2
order by 1,2
;

TABLE_NAME           COLUMN_NAME          HISTOGRAM       NUM_DISTINCT NUM_BUCKETS    DENSITY
-------------------- -------------------- --------------- ------------ ----------- ----------
T1                   J1                   FREQUENCY                 10          10       .005
T2                   J2                   TOP-FREQUENCY             22          16    .000625

TABLE_NAME             NUM_ROWS
-------------------- ----------
T1                          100
T2                          800

TABLE_NAME                VALUE ROW_OR_BUCKET_COUNT ENDPOINT_NUMBER
-------------------- ---------- ------------------- ---------------
T1                            2                   5               5
5                  15              20
7                  15              35
10                  17              52
12                  13              65
15                  13              78
17                  11              89
20                   7              96
22                   3              99
25                   1             100

T2                            1                   1               1
13                  14              15
15                  11              26
16                  22              48
17                  34              82
18                  31             113
19                  36             149
20                  57             206
21                  44             250
22                  45             295
23                  72             367
24                  70             437
25                  87             524
26                 109             633
27                  96             729
28                  41             770

```

Table t1 reports 100 rows, 10 distinct values and a Frequency histogram with 10 buckets.
Table t2 reports 800 rows, 22 distinct values and a Top-Frequency histogram with 16 buckets.

Things we notice from the histograms are: t1 has a range from 2 to 25, while t2 has a range from 1 to 28. We also notice that the highest endpoint_number for t2 is only 770 out of a possible 800 – we’ve “lost” 30 rows. We don’t really care what they are for the purposes of the arithmetic, but if we did a quick “select j2, count(*)” query we’d see that we had lost the following:

```
SQL> select j2, count(*) from t2 group by j2 order by count(*), j2;

J2   COUNT(*)
---------- ----------
1	    1
9	    1  *
8	    3  *
11	    4  *
10	    5  *
12	    8  *
14	    9  *
15	   11
...
```

The reason why the total number of rows accounted for is less than the total number of rows in the table comes in two parts. The Top-Frequency histogram is designed to hold the Top N most popular entries in the table, so there will be some entries that don’t make an appearance in the histogram despite contributing rows to the total table count; the number of “lost” rows can then be increased because the Top N popular values may not include the column low and high values, and these two values must appear in the histogram. Looking at the output above we can see that we could have reported 14 as the 16th most popular value, instead we have to record 1, losing a further 9 rows and regaining 1.

Let’s test the pure join query on the two tables to see what the optimizer is predicting as the join cardinality, and then try to re-create that cardinality from the histogram data:

```
alter session set statistics_level = all;
alter session set events '10053 trace name context forever';
alter session set tracefile_identifier='BASELINE';

select
count(*)
from
t1, t2
where
t1.j1 = t2.j2
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

alter session set statistics_level = typical;
alter session set events '10053 trace name context off';

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |      41 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |      41 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |   1608 |   1327 |00:00:00.01 |      41 |  2545K|  2545K| 1355K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100 |    100 |00:00:00.01 |       7 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    800 |    800 |00:00:00.01 |       7 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"."J1"="T2"."J2")

```

Our target is to work out how we can query the histogram data in a way that gets the result 1,608. Ideally we’ll also think of a rationale for justifying our method, and then we’ll apply the same method with 15 buckets and 17 buckets, and with a couple of variations to the data (e.g. update all rows where j1 = 25 to set j1 = 28), to see if the method still gets the right result.

All we did with the frequency/frequency join was to join the two histograms on matching values, multiply the frequencies on each resulting row , then sum down the set, and this automatically eliminated rows which were outside the “highest low” and “lowest high” (i.e. we only examined rows where the histograms overlapped). We might hope that things shouldn’t be too different when one of the histograms is a top-frequency histogram.

There is an important difference, though, between frequency and top-frequency histograms – in the latter case there are values in the table which will not be in the histogram, so we ought to make some allowance for these (even though it’s only “one bucket’s worth”). It’s possible that some of these values might match values in the frequency histogram so we need to include a mechanism for adding in a factor to allow for them. So as a first step let’s work out the “average number of rows per value” for the missing values.

We have 22 distinct values and 16 end points so there are 6 missing values. We have 800 rows in the table but only 770 rows reported in the histogram so there are 30 missing rows. So let’s say the missing values have an average cardinality of 30/6 = 5 (and we might extend that to say they have an average selectivity of 5/800 = 0.00625).

Let’s bring that value into the query we wrote for the frequency/frequency case by using an outer join (which I’ll write as an “ANSI” Full Outer Join”) with a predicate in place that restricts the result to just the overlapping range, which is [2,25], the “higher low value” and “lower high value” across the two histograms. Here’s some code – with an odd little detail included:

```
column product format 999,999,999.99
compute sum of product on report

compute sum of t1_count on report
compute sum of t1_value on report
compute sum of t2_count on report
compute sum of t2_value on report

with f1 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T1'
and     column_name = 'J1'
order by
endpoint_value
),
f2 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T2'
and     column_name = 'J2'
order by
endpoint_value
)
select
f1.value f1_value,
f2.value f2_value,
nvl(f1.row_or_bucket_count,0.00) t1_count,
nvl(f2.row_or_bucket_count,800*0.00625) t2_count,
nvl(f1.row_or_bucket_count,0.00) *
nvl(f2.row_or_bucket_count,800*0.006250) product
from
f1
full outer join
f2
on
f2.value = f1.value
where
coalesce(f1.value, f2.value) between 2 and 25
order by
coalesce(f1.value, f2.value)
;

```

I’ve included an nvl() on the columns for the top-frequency histograms that convert nulls (i.e. the preserved rows derived from the frequency histogram) into the average frequency we’ve just calculated, using the “num_rows * selectivity” representation. The odd little detail that I commented on above does something similar for the preserved rows derived from the top-frequency histogram because this first guess at the calculation was wrong and needed an adjustment which I’m anticipating. Here are the results I got with this code:

```  T1_VALUE   T2_VALUE   T1_COUNT   T2_COUNT         PRODUCT
---------- ---------- ---------- ---------- ---------------
2                     5          5           25.00
5                    15          5           75.00
7                    15          5           75.00
10                    17          5           85.00
12                    13          5           65.00
13          0         14             .00
15         15         13         11          143.00
16          0         22             .00
17         17         11         34          374.00
18          0         31             .00
19          0         36             .00
20         20          7         57          399.00
21          0         44             .00
22         22          3         45          135.00
23          0         72             .00
24          0         70             .00
25         25          1         87           87.00
---------- ---------- ---------- ---------- ---------------
135        233        100        548        1,463.00

```

The figure is too low, so there has to be an adjustment. What if the code is allowing for the “maybe there are other values” algorithm that the optimizer uses with fequency histograms ? If you’ve gathered a frequency histogram on a column but query it with a value that isn’t in the histogram than Oracle applies an algorithm that looks like: “if you’re asking for something that isn’t in the histogram I’ll assume that there must be some data there and use a frequency that’s half the lowest frequency I have recorded”**Important footnote. The value 25 appears once in our histogram so let’s include a fudge-factor of 0.5 (i.e. half a row) in the nvl() expression for the t1 frequencies and see what happens. This is what the new results look like:

```
T1_VALUE   T2_VALUE   T1_COUNT   T2_COUNT         PRODUCT
---------- ---------- ---------- ---------- ---------------
2                     5          5           25.00
5                    15          5           75.00
7                    15          5           75.00
10                    17          5           85.00
12                    13          5           65.00
13         .5         14            7.00
15         15         13         11          143.00
16         .5         22           11.00
17         17         11         34          374.00
18         .5         31           15.50
19         .5         36           18.00
20         20          7         57          399.00
21         .5         44           22.00
22         22          3         45          135.00
23         .5         72           36.00
24         .5         70           35.00
25         25          1         87           87.00
---------- ---------- ---------- ---------- ---------------
135        233      103.5        548        1,607.50

```

Since we were looking for 1,608 I’m going to call that a success. I can check precision, of course, by looking at the 10053 trace file. Extracting a few critical lines:

```egrep -e"Density" -e"Join Card" orcl12c_ora_6520_BASELINE.trc

AvgLen: 3 NDV: 22 Nulls: 0 Density: 0.006250 Min: 1.000000 Max: 28.000000
AvgLen: 3 NDV: 10 Nulls: 0 Density: 0.005000 Min: 2.000000 Max: 25.000000

Join Card:  1607.500000 = outer (100.000000) * inner (800.000000) * sel (0.020094)

```

The “Density” lines come from the column statistics – note the 0.00625 that matches the “average selectivity” I derived from the top-frequency figures. You might also note that the “half the least frequent value” could be derived from the t1.j1 density (0.005) * t1.num_rows (100).

The “Join Card” line is exactly what it says – the join cardinality calculation showing that the plan’s prediction of 1,608 rows was actually a rounded 1607.5

There is one more important thing to check before I start tweaking the data to see if there are any other factors involved. Is the 0.5 I stuck into the query really the value of “half the least common frequency” or is it a fixed value in all cases. A nice easy way of testing this is to update the t1 table to change one row from 22 to 25 (22 will still be present in the table and histogram before and after this test, so it’s a minimal and safe change). Making this change and re-running the calculation query leaving the 0.5 unchanged gives the following:

```
update t1 set j1 = 25 where j1 = 22 and rownum = 1;

...

21         .5         44           22.00
22         22          2         45           90.00
23         .5         72           36.00
24         .5         70           35.00
25         25          2         87          174.00
---------- ---------- ---------------
sum                        103.5        548        1,649.50

```

Without reporting all the details:

• the estimate in the plan went up from 1,608 to 1,794
• leaving 0.5 in the query the derived result was 1,649.5 (last few lines of output above)
• changing the 0.5 to 1.0 the derived result was 1,794.0

Conclusion – the “fudge factor” is consistent with the model the optimizer uses with frequency histogram calculations. The optimizer models “missing” rows in the join calculation as “half the number of the least frequently occuring value**Important footnote

### Filter Predicates:

After a dozen tests varying the number of buckets in the top-frequency histogram (and checking it really was still a top-frequency histogram), and tweaking the t1 (frequency histogram) data to use values on the boundaries of, or outside, the range of the t2 (top-frequency) data, I concluded that my approach was probably correct. Outer join the two histograms, restrict to the overlap, supply the “num_rows * density” figure on the top-frequency side, and “half the lowest frequency”**Important footnote on the frequency side, and the query produces the same result as the optimizer for the pure join cardinality.

So the next step is to check what happens when you add filter predicates on one, or both, sides. I listed a fragment of code earlier on to execute the pure join and count the number of rows it produced, enabling the 10053 trace and pulling the actual plan from memory at the same time. I repeated this code with 3 variations and checked the “Join Card” lines from the resulting trace files:

```
select count(*) from  t1, t2 where  t1.j1 = t2.j2
select count(*) from  t1, t2 where  t1.j1 = t2.j2 and t1.n04 = 2
select count(*) from  t1, t2 where  t1.j1 = t2.j2                and t2.n30 = 25
select count(*) from  t1, t2 where  t1.j1 = t2.j2 and t1.n04 = 2 and t2.n30 = 25

egrep -e"Join Card" orcl12c_ora_10447*.trc

orcl12c_ora_10447_BASELINE.trc:Join Card:  1607.500000 = outer (800.000000) * inner (100.000000) * sel (0.020094)
orcl12c_ora_10447_FILTERJ1.trc:Join Card:  401.875000 = outer (800.000000) * inner (25.000000) * sel (0.020094)
orcl12c_ora_10447_FILTERJ2.trc:Join Card:  53.583333 = outer (100.000000) * inner (26.666667) * sel (0.020094)
orcl12c_ora_10447_FILTJ1J2.trc:Join Card:  13.395833 = outer (26.666667) * inner (25.000000) * sel (0.020094)

```

As you can see in all 4 cases, Oracle reports an inner and outer cardinality estimate and a join selectivity. The join selectivity remains unchanged throughout; it’s the value we can derive from our pure join test (0.020094 = 1607.5 / (100 * 800)). All that changes is that the individual table predicates are applied to the base tables before the join selectivity is applied to the product of the filtered base table cardinalities:

• Column n04 has 4 distinct values in 100 rows – filter cardinality = 100/4 = 25
• Column n30 has 30 distinct values in 800 rows – filter cardinality = 800/30 = 26.66666…

### Conclusion

For a single column equijoin on columns with no nulls where one column has a frequency histogram and the other has a top-frequency histogram the optimizer calculates the “pure” join cardinality using the overlapping range of column values and two approximating frequencies, then derives the filtered cardinality by applying the base table filters, calculates the cardinality of the cartesian join of the filtered data sets, then multiplies by the pure join selectivity.

**Important Footnote  Until Chinar Aliyev questioned what I had written, I had never noticed that the “half the lowest frequency” that I describe at various point in the arithmetic was anything other than a fixed fudge factor. In fact, in perfect symmetry with the expression used for the average selectivity in the top-frequency part of the calculcation, this “fudge factor” is simple “num_rows * column_density” for the column with the frequency histogram. (Whether the “half the lowest frequency” drops out as a side effect of the density calculation, or whether the column density is derived from half the lowest frequency is another matter.)