In the previous posting I listed the order of precision of histograms as:
- Frequency
- Top-Frequency
- Hybrid
- Height-balanced
- None
Having covered the Frequency/Frequency join (for a single column, no nulls, equijoin) in the previous posting I’ve decided to work down the list and address Frequency/Top-Frequency in this posting. It gets a little harder to generate data as we move to the less precise histograms since we need to have skew, we want some gaps, and (for Top-Frequency) we need to have some data that can be “ignored”. On the plus side, though, I want to work with a small number of buckets to keep the output of any queries I run fairly short so I’m going to stick with a small number of buckets, which means the “small” volume of “ignorable” data (the “spare” bucket) can be relative large. Here’s the code I used to generate data for my investigation – 100 rows for the table with a frequency histogram and 800 rows for the table with a top-frequency.
rem rem Script: freq_hist_join_05.sql rem Author: Jonathan Lewis rem Dated: Oct 2018 rem Purpose: rem rem Last tested rem 18.3.0.0 rem 12.2.0.1 rem execute dbms_random.seed(0) create table t1 ( id number(6), n04 number(6), n05 number(6), n20 number(6), j1 number(6) ) ; create table t2( id number(8,0), n20 number(6,0), n30 number(6,0), n50 number(6,0), j2 number(6,0) ) ; insert into t1 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 4) + 1 n04, mod(rownum, 5) + 1 n05, mod(rownum, 20) + 1 n20, trunc(2.5 * trunc(sqrt(v1.id*v2.id))) j1 from generator v1, generator v2 where v1.id <= 10 -- > comment to avoid WordPress format issue and v2.id <= 10 -- > comment to avoid WordPress format issue ; insert into t2 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 20) + 1 n20, mod(rownum, 30) + 1 n30, mod(rownum, 50) + 1 n50, 28 - round(abs(7*dbms_random.normal)) j2 from generator v1 where rownum <= 800 -- > comment to avoid WordPress format issue ; begin dbms_stats.gather_table_stats( ownname => null, tabname => 'T1', method_opt => 'for all columns size 1 for columns j1 size 254' ); dbms_stats.gather_table_stats( ownname => null, tabname => 'T2', method_opt => 'for all columns size 1 for columns j2 size 16' ); end; /
In this example I’ve used the sqrt() function and the dbms_random.normal() function to generate the data. The scaling and truncating I’ve done on the results has given me two sets of data which have a nice skew, some gaps, but different patterns (though both have a small number of small values and a larger number of larger values). The data from dbms_random.normal() will produce 22 distinct values, so I’ve requested a histogram with 16 buckets and checked that this will produce a Top-Frequency histogram. (If I want a Hybrid histogram – for the next thrilling installment in the series – I’ll just reduce the number of buckets slightly).
Here are the resulting stats, preceded by the code that reported them:
select table_name, column_name, histogram, num_distinct, num_buckets, density from user_tab_cols where table_name in ('T1','T2') and column_name in ('J1','J2') order by table_name ; select table_name, num_rows from user_tables where table_name in ('T1','T2') order by table_name ; break on table_name skip 1 on report skip 1 with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' order by endpoint_value ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' order by endpoint_value ) select f1.* from f1 union all select f2.* from f2 order by 1,2 ; TABLE_NAME COLUMN_NAME HISTOGRAM NUM_DISTINCT NUM_BUCKETS DENSITY -------------------- -------------------- --------------- ------------ ----------- ---------- T1 J1 FREQUENCY 10 10 .005 T2 J2 TOP-FREQUENCY 22 16 .000625 TABLE_NAME NUM_ROWS -------------------- ---------- T1 100 T2 800 TABLE_NAME VALUE ROW_OR_BUCKET_COUNT ENDPOINT_NUMBER -------------------- ---------- ------------------- --------------- T1 2 5 5 5 15 20 7 15 35 10 17 52 12 13 65 15 13 78 17 11 89 20 7 96 22 3 99 25 1 100 T2 1 1 1 13 14 15 15 11 26 16 22 48 17 34 82 18 31 113 19 36 149 20 57 206 21 44 250 22 45 295 23 72 367 24 70 437 25 87 524 26 109 633 27 96 729 28 41 770
Table t1 reports 100 rows, 10 distinct values and a Frequency histogram with 10 buckets.
Table t2 reports 800 rows, 22 distinct values and a Top-Frequency histogram with 16 buckets.
Things we notice from the histograms are: t1 has a range from 2 to 25, while t2 has a range from 1 to 28. We also notice that the highest endpoint_number for t2 is only 770 out of a possible 800 – we’ve “lost” 30 rows. We don’t really care what they are for the purposes of the arithmetic, but if we did a quick “select j2, count(*)” query we’d see that we had lost the following:
SQL> select j2, count(*) from t2 group by j2 order by count(*), j2; J2 COUNT(*) ---------- ---------- 1 1 9 1 * 8 3 * 11 4 * 10 5 * 12 8 * 14 9 * 15 11 ...
The reason why the total number of rows accounted for is less than the total number of rows in the table comes in two parts. The Top-Frequency histogram is designed to hold the Top N most popular entries in the table, so there will be some entries that don’t make an appearance in the histogram despite contributing rows to the total table count; the number of “lost” rows can then be increased because the Top N popular values may not include the column low and high values, and these two values must appear in the histogram. Looking at the output above we can see that we could have reported 14 as the 16th most popular value, instead we have to record 1, losing a further 9 rows and regaining 1.
Let’s test the pure join query on the two tables to see what the optimizer is predicting as the join cardinality, and then try to re-create that cardinality from the histogram data:
alter session set statistics_level = all; alter session set events '10053 trace name context forever'; alter session set tracefile_identifier='BASELINE'; select count(*) from t1, t2 where t1.j1 = t2.j2 ; select * from table(dbms_xplan.display_cursor(null,null,'allstats last')); alter session set statistics_level = typical; alter session set events '10053 trace name context off'; ----------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 41 | | | | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 41 | | | | |* 2 | HASH JOIN | | 1 | 1608 | 1327 |00:00:00.01 | 41 | 2545K| 2545K| 1355K (0)| | 3 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |00:00:00.01 | 7 | | | | | 4 | TABLE ACCESS FULL| T2 | 1 | 800 | 800 |00:00:00.01 | 7 | | | | ----------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("T1"."J1"="T2"."J2")
Our target is to work out how we can query the histogram data in a way that gets the result 1,608. Ideally we’ll also think of a rationale for justifying our method, and then we’ll apply the same method with 15 buckets and 17 buckets, and with a couple of variations to the data (e.g. update all rows where j1 = 25 to set j1 = 28), to see if the method still gets the right result.
All we did with the frequency/frequency join was to join the two histograms on matching values, multiply the frequencies on each resulting row , then sum down the set, and this automatically eliminated rows which were outside the “highest low” and “lowest high” (i.e. we only examined rows where the histograms overlapped). We might hope that things shouldn’t be too different when one of the histograms is a top-frequency histogram.
There is an important difference, though, between frequency and top-frequency histograms – in the latter case there are values in the table which will not be in the histogram, so we ought to make some allowance for these (even though it’s only “one bucket’s worth”). It’s possible that some of these values might match values in the frequency histogram so we need to include a mechanism for adding in a factor to allow for them. So as a first step let’s work out the “average number of rows per value” for the missing values.
We have 22 distinct values and 16 end points so there are 6 missing values. We have 800 rows in the table but only 770 rows reported in the histogram so there are 30 missing rows. So let’s say the missing values have an average cardinality of 30/6 = 5 (and we might extend that to say they have an average selectivity of 5/800 = 0.00625).
Let’s bring that value into the query we wrote for the frequency/frequency case by using an outer join (which I’ll write as an “ANSI” Full Outer Join”) with a predicate in place that restricts the result to just the overlapping range, which is [2,25], the “higher low value” and “lower high value” across the two histograms. Here’s some code – with an odd little detail included:
column product format 999,999,999.99 compute sum of product on report compute sum of t1_count on report compute sum of t1_value on report compute sum of t2_count on report compute sum of t2_value on report with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' order by endpoint_value ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' order by endpoint_value ) select f1.value f1_value, f2.value f2_value, nvl(f1.row_or_bucket_count,0.00) t1_count, nvl(f2.row_or_bucket_count,800*0.00625) t2_count, nvl(f1.row_or_bucket_count,0.00) * nvl(f2.row_or_bucket_count,800*0.006250) product from f1 full outer join f2 on f2.value = f1.value where coalesce(f1.value, f2.value) between 2 and 25 order by coalesce(f1.value, f2.value) ;
I’ve included an nvl() on the columns for the top-frequency histograms that convert nulls (i.e. the preserved rows derived from the frequency histogram) into the average frequency we’ve just calculated, using the “num_rows * selectivity” representation. The odd little detail that I commented on above does something similar for the preserved rows derived from the top-frequency histogram because this first guess at the calculation was wrong and needed an adjustment which I’m anticipating. Here are the results I got with this code:
T1_VALUE T2_VALUE T1_COUNT T2_COUNT PRODUCT ---------- ---------- ---------- ---------- --------------- 2 5 5 25.00 5 15 5 75.00 7 15 5 75.00 10 17 5 85.00 12 13 5 65.00 13 0 14 .00 15 15 13 11 143.00 16 0 22 .00 17 17 11 34 374.00 18 0 31 .00 19 0 36 .00 20 20 7 57 399.00 21 0 44 .00 22 22 3 45 135.00 23 0 72 .00 24 0 70 .00 25 25 1 87 87.00 ---------- ---------- ---------- ---------- --------------- 135 233 100 548 1,463.00
The figure is too low, so there has to be an adjustment. What if the code is allowing for the “maybe there are other values” algorithm that the optimizer uses with fequency histograms ? If you’ve gathered a frequency histogram on a column but query it with a value that isn’t in the histogram than Oracle applies an algorithm that looks like: “if you’re asking for something that isn’t in the histogram I’ll assume that there must be some data there and use a frequency that’s half the lowest frequency I have recorded”**Important footnote. The value 25 appears once in our histogram so let’s include a fudge-factor of 0.5 (i.e. half a row) in the nvl() expression for the t1 frequencies and see what happens. This is what the new results look like:
T1_VALUE T2_VALUE T1_COUNT T2_COUNT PRODUCT ---------- ---------- ---------- ---------- --------------- 2 5 5 25.00 5 15 5 75.00 7 15 5 75.00 10 17 5 85.00 12 13 5 65.00 13 .5 14 7.00 15 15 13 11 143.00 16 .5 22 11.00 17 17 11 34 374.00 18 .5 31 15.50 19 .5 36 18.00 20 20 7 57 399.00 21 .5 44 22.00 22 22 3 45 135.00 23 .5 72 36.00 24 .5 70 35.00 25 25 1 87 87.00 ---------- ---------- ---------- ---------- --------------- 135 233 103.5 548 1,607.50
Since we were looking for 1,608 I’m going to call that a success. I can check precision, of course, by looking at the 10053 trace file. Extracting a few critical lines:
egrep -e"Density" -e"Join Card" orcl12c_ora_6520_BASELINE.trc AvgLen: 3 NDV: 22 Nulls: 0 Density: 0.006250 Min: 1.000000 Max: 28.000000 AvgLen: 3 NDV: 10 Nulls: 0 Density: 0.005000 Min: 2.000000 Max: 25.000000 Join Card: 1607.500000 = outer (100.000000) * inner (800.000000) * sel (0.020094)
The “Density” lines come from the column statistics – note the 0.00625 that matches the “average selectivity” I derived from the top-frequency figures. You might also note that the “half the least frequent value” could be derived from the t1.j1 density (0.005) * t1.num_rows (100).
The “Join Card” line is exactly what it says – the join cardinality calculation showing that the plan’s prediction of 1,608 rows was actually a rounded 1607.5
There is one more important thing to check before I start tweaking the data to see if there are any other factors involved. Is the 0.5 I stuck into the query really the value of “half the least common frequency” or is it a fixed value in all cases. A nice easy way of testing this is to update the t1 table to change one row from 22 to 25 (22 will still be present in the table and histogram before and after this test, so it’s a minimal and safe change). Making this change and re-running the calculation query leaving the 0.5 unchanged gives the following:
update t1 set j1 = 25 where j1 = 22 and rownum = 1; ... 21 .5 44 22.00 22 22 2 45 90.00 23 .5 72 36.00 24 .5 70 35.00 25 25 2 87 174.00 ---------- ---------- --------------- sum 103.5 548 1,649.50
Without reporting all the details:
- the estimate in the plan went up from 1,608 to 1,794
- leaving 0.5 in the query the derived result was 1,649.5 (last few lines of output above)
- changing the 0.5 to 1.0 the derived result was 1,794.0
Conclusion – the “fudge factor” is consistent with the model the optimizer uses with frequency histogram calculations. The optimizer models “missing” rows in the join calculation as “half the number of the least frequently occuring value“**Important footnote
Filter Predicates:
After a dozen tests varying the number of buckets in the top-frequency histogram (and checking it really was still a top-frequency histogram), and tweaking the t1 (frequency histogram) data to use values on the boundaries of, or outside, the range of the t2 (top-frequency) data, I concluded that my approach was probably correct. Outer join the two histograms, restrict to the overlap, supply the “num_rows * density” figure on the top-frequency side, and “half the lowest frequency”**Important footnote on the frequency side, and the query produces the same result as the optimizer for the pure join cardinality.
So the next step is to check what happens when you add filter predicates on one, or both, sides. I listed a fragment of code earlier on to execute the pure join and count the number of rows it produced, enabling the 10053 trace and pulling the actual plan from memory at the same time. I repeated this code with 3 variations and checked the “Join Card” lines from the resulting trace files:
select count(*) from t1, t2 where t1.j1 = t2.j2 select count(*) from t1, t2 where t1.j1 = t2.j2 and t1.n04 = 2 select count(*) from t1, t2 where t1.j1 = t2.j2 and t2.n30 = 25 select count(*) from t1, t2 where t1.j1 = t2.j2 and t1.n04 = 2 and t2.n30 = 25 egrep -e"Join Card" orcl12c_ora_10447*.trc orcl12c_ora_10447_BASELINE.trc:Join Card: 1607.500000 = outer (800.000000) * inner (100.000000) * sel (0.020094) orcl12c_ora_10447_FILTERJ1.trc:Join Card: 401.875000 = outer (800.000000) * inner (25.000000) * sel (0.020094) orcl12c_ora_10447_FILTERJ2.trc:Join Card: 53.583333 = outer (100.000000) * inner (26.666667) * sel (0.020094) orcl12c_ora_10447_FILTJ1J2.trc:Join Card: 13.395833 = outer (26.666667) * inner (25.000000) * sel (0.020094)
As you can see in all 4 cases, Oracle reports an inner and outer cardinality estimate and a join selectivity. The join selectivity remains unchanged throughout; it’s the value we can derive from our pure join test (0.020094 = 1607.5 / (100 * 800)). All that changes is that the individual table predicates are applied to the base tables before the join selectivity is applied to the product of the filtered base table cardinalities:
- Column n04 has 4 distinct values in 100 rows – filter cardinality = 100/4 = 25
- Column n30 has 30 distinct values in 800 rows – filter cardinality = 800/30 = 26.66666…
Conclusion
For a single column equijoin on columns with no nulls where one column has a frequency histogram and the other has a top-frequency histogram the optimizer calculates the “pure” join cardinality using the overlapping range of column values and two approximating frequencies, then derives the filtered cardinality by applying the base table filters, calculates the cardinality of the cartesian join of the filtered data sets, then multiplies by the pure join selectivity.
**Important Footnote Until Chinar Aliyev questioned what I had written, I had never noticed that the “half the lowest frequency” that I describe at various point in the arithmetic was anything other than a fixed fudge factor. In fact, in perfect symmetry with the expression used for the average selectivity in the top-frequency part of the calculcation, this “fudge factor” is simple “num_rows * column_density” for the column with the frequency histogram. (Whether the “half the lowest frequency” drops out as a side effect of the density calculation, or whether the column density is derived from half the lowest frequency is another matter.)
[…] I’ve covered two options for the types of histogram involved: frequency to frequency, and frequency to top-frequency. Today it’s time to examine frequency to […]
Pingback by Join Cardinality – 4 | Oracle Scratchpad — October 25, 2018 @ 9:10 am BST Oct 25,2018 |
[…] Top-Frequency […]
Pingback by Join Cardinality – 5 | Oracle Scratchpad — November 1, 2018 @ 1:34 pm GMT Nov 1,2018 |
[…] Frequency / Top Frequency with filter predicates […]
Pingback by Histogram catalogue | Oracle Scratchpad — November 8, 2022 @ 1:07 pm GMT Nov 8,2022 |