So far in this series I’ve written about the way that the optimizer estimates cardinality for an equijoin where one end of the join has a frequency histogram and the other end has a histogram of type:

It’s now time to look at a join where the other end has a * height-balanced* histogram. Arguably it’s not sensible to spend time writing about this since you shouldn’t be creating them in 12c (depending, instead, on the hybrid histogram that goes with the

*), and the arithmetic is different in 11g. However, there still seem to be plenty of people running 12c but not using the*

**auto_sample_size***and that means they could be generating some height-balanced histograms – so let’s generate some data and see what happens.*

**auto_sample_size**rem rem Script: freq_hist_join_04a.sql rem Author: Jonathan Lewis rem Dated: Oct 2018 rem Purpose: rem rem Last tested rem 18.3.0.0 rem 12.2.0.1 rem 12.1.0.2 rem 11.2.0.4 Different results rem drop table t2 purge; drop table t1 purge; set linesize 156 set trimspool on set pagesize 60 set feedback off execute dbms_random.seed(0) create table t1( id number(6), n04 number(6), n05 number(6), n20 number(6), j1 number(6) ) ; create table t2( id number(8,0), n20 number(6,0), n30 number(6,0), n50 number(6,0), j2 number(6,0) ) ; insert into t1 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 4) + 1 n04, mod(rownum, 5) + 1 n05, mod(rownum, 20) + 1 n20, trunc(2.5 * trunc(sqrt(v1.id*v2.id))) j1 from generator v1, generator v2 where v1.id <= 10 -- > comment to avoid WordPress format issue and v2.id <= 10 -- > comment to avoid WordPress format issue ; insert into t2 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 20) + 1 n20, mod(rownum, 30) + 1 n30, mod(rownum, 50) + 1 n50, 28 - round(abs(7*dbms_random.normal)) j2 from generator v1 where rownum <= 800 -- > comment to avoid WordPress format issue ; commit; prompt ========================================================== prompt Using estimate_percent => 100 to get height-balanced in t2 prompt ========================================================== begin dbms_stats.gather_table_stats( ownname => null, tabname => 'T1', method_opt => 'for all columns size 1 for columns j1 size 254' ); dbms_stats.gather_table_stats( ownname => null, tabname => 'T2', estimate_percent => 100, method_opt => 'for all columns size 1 for columns j2 size 20' ); end; /

As in earlier examples I’ve created some empty tables, then inserted randomly generated data (after calling the * dbms_random.seed(0)* function to make the data reproducible). Then I’ve gathered stats, knowing that there will be 22 distinct values in

*so forcing a height-balanced histogram of 20 buckets to appear.*

**t2**When we try to calculate the join cardinality we’re going to need various details from the histogram information, such as bucket sizes, number of distinct values, and so on, so in the next few queries to display the histogram information I’ve captured a few values into SQL*Plus variables. Here’s the basic information about the histograms on the join columns * t1.j1* and

*:*

**t2.j2**column num_distinct new_value m_t2_distinct column num_rows new_value m_t2_rows column num_buckets new_value m_t2_buckets column bucket_size new_value m_t2_bucket_size select table_name, column_name, histogram, num_distinct, num_buckets, density from user_tab_cols where table_name in ('T1','T2') and column_name in ('J1','J2') order by table_name ; select table_name, num_rows, decode(table_name, 'T2', num_rows/&m_t2_buckets, null) bucket_size from user_tables where table_name in ('T1','T2') order by table_name ; column table_name format a3 heading "Tab" break on table_name skip 1 on report skip 1 with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ) select f1.* from f1 union all select f2.* from f2 order by 1,2 ; Tab COLUMN_NAME HISTOGRAM NUM_DISTINCT NUM_BUCKETS DENSITY -------------------- -------------------- --------------- ------------ ----------- ---------- T1 J1 FREQUENCY 10 10 .005 T2 J2 HEIGHT BALANCED 22 20 .052652266 Tab NUM_ROWS BUCKET_SIZE -------------------- ---------- ----------- T1 100 T2 800 40 Tab VALUE ROW_OR_BUCKET_COUNT ENDPOINT_NUMBER --- ---------- ------------------- --------------- T1 2 5 5 5 15 20 7 15 35 10 17 52 12 13 65 15 13 78 17 11 89 20 7 96 22 3 99 25 1 100 T2 1 0 0 14 1 1 17 1 2 18 1 3 19 1 4 20 1 5 21 2 7 22 1 8 23 1 9 24 2 11 25 2 13 26 3 16 27 2 18 28 2 20

As you can see, there is a * frequency* histogram on

*reporting a cumulative total of 100 rows; and the histogram on*

**t1***is a*

**t2***histogram of 20 buckets, showing 21, 24, 25, 26, 27 and 28 as popular values with 2,2,2,2,3 and 2 endpoints (i.e. buckets) respectively. You’ll also note that the*

**height-balanced***histogram has 21 rows with row/bucket 0 showing us the minimum value in the column and letting us know that bucket 1 is not exclusively full of the value 14. (If 14 had been the minimum value for the column as well as an end point Oracle would not have created a bucket 0 – that may be a little detail that isn’t well-known – and will be the subject of a little follow-up blog note.)*

**t2**Let’s modify the code to join the two sets of hisogram data on data value – using a full outer join so we don’t lose any data but restricting ourselves to values where the histograms overlap. We’re going to follow the idea we’ve developed in earlier postings and multiply frequencies together to derive a join frequency, so we’ll start with a simple full outer join and assume that when we find a real match value we should behave as if the height-balanced buckets (* t2*) where the bucket count is 2 or greater represent completely full buckets and are popular values..

I’ve also included in this query (because it had a convenient full outer join) a column selection that counts how many rows there are in * t1* with values that fall inside the range of the

*histogram but*

**t2****don’t**match a popular value in

*.*

**t2**column unmatch_ct new_value m_unmatch_ct column product format 999,999.99 break on report skip 1 compute sum of product on report with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ), join1 as ( select f1.value t1_value, f2.value t2_value, f1.frequency t1_frequency, f2.frequency t2_frequency, sum( case when f2.frequency > 1 and f1.frequency is not null then 0 else f1.frequency end ) over() unmatch_ct, f2.frequency * &m_t2_bucket_size * case when f2.frequency > 1 and f1.frequency is not null then f1.frequency end product from f1 full outer join f2 on f2.value = f1.value where coalesce(f1.value, f2.value) between 2 and 25 -- coalesce(f1.value, f2.value) between &m_low and &m_high order by coalesce(f1.value, f2.value) ) select * from join1 ; T1_VALUE T2_VALUE T1_FREQUENCY T2_FREQUENCY UNMATCH_CT PRODUCT ---------- ---------- ------------ ------------ ---------- ----------- 2 5 99 5 15 99 7 15 99 10 17 99 12 13 99 14 1 99 15 13 99 17 17 11 1 99 18 1 99 19 1 99 20 20 7 1 99 21 2 99 22 22 3 1 99 23 1 99 24 2 99 25 25 1 2 99 80.00 ----------- sum 80.00

We captured the bucket size (* &m_bucket_size*) for the

*histogram as 40 in the earlier SQL, and we can see now that in the overlap range (which I’ve hard coded as 2 – 25) we have three buckets that identify popular values, but only one of them corresponds to a value in the frequency histogram on*

**t2***, so the*

**t1***column shows a value of 1 * 2 * 40 = 80. Unfortunately this is a long way off the prediction that the optimizer is going to make for the simple join. (Eventually we’ll see it’s 1,893 so we have a lot more rows to estimate for).*

**Product**Our code so far only acounts for items that are popular in both tables. Previous experience tells us that when a popular value exists only at one end of the join predicate we need to derive a contribution to the total prediction through an *“average selectivity”* calculated for the other end of the join predicate. For frequency histograms we’ve seen that *“half the number of the least frequently occuring value”* seems to be the appropriate frequency estimate, and if we add that in we’ll get two more contributions to the total from the values 21 and 24 which appear in the height-balanced (* t2*) histogram as popular but don’t appear in the frequency (

*) histogram. Since the lowest frequency in*

**t1***is 1 this would give us two contributions of 0.5 * 2 (buckets) * 40 (bucket size) viz: two contributions of 40 bringing our total to 160 – still a serious shortfall from Oracle’s prediction. So we need to work out how Oracle generates an*

**t1***“average frequency”*for the non-popular values of

*and then apply it to the 99 rows in*

**t2***that haven’t yet been accounted for in the output above.*

**t1**To calculate the *“average selectivity”* of a non-popular row in * t2* I need a few numbers (some of which I’ve already acquired above). The total number of rows in the table (NR), the number of distinct values (NDV), and the number of popular values (NPV), from which we can derive the the number of distinct non-popular values and the number of rows for the non-popular values. The model that Oracle uses to derive these numbers is simply to assume that a value is popular if its frequency in the histogram is greater than one and the number of rows for that value is

*“frequency * bucket size”*.

The first query we ran against the * t2* histogram showed 6 popular values, accounting for 13 buckets of 40 rows each. We reported 22 distinct values for the column and 800 rows for the table so the optimizer assumes the non-popular values account for (22 – 6) = 16 distinct values and (800 – 13 * 40) = 280 rows. So the selectivity of non-popular values is (280/800) * (1/16) = 0.021875. This needs to be multiplied by the 99 rows in

*which don’t match a popular value in*

**t1***– so we now need to write some SQL to derive that number.*

**t2**We could enhance our earlier full outer join and slot 0.5, 99, and 0.021875 into it as “magic” constants. Rather than do that though I’m going to write a couple of messy queries to derive the values (and the low/high range we’re interested in) so that I will be able to tweak the data later on and see if the formula still produces the right answer.

column range_low new_value m_low column range_high new_value m_high column avg_t1_freq new_value m_avg_t1_freq column new_density new_value m_avg_t2_dens with f1 as ( select endpoint_value ep_val, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select endpoint_value ep_val, endpoint_number ep_num, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ) select max(min_v) range_low, min(max_v) range_high, min(min_f)/2 avg_t1_freq, max(new_density) new_density from ( select min(ep_val) min_v, max(ep_val) max_v, min(frequency) min_f, to_number(null) new_density from f1 union all select min(ep_val) min_v, max(ep_val) max_v, null min_f, (max(ep_num) - sum(case when frequency > 1 then frequency end)) / ( max(ep_num) * (&m_t2_distinct - count(case when frequency > 1 then 1 end)) ) new_density from f2 ) ; RANGE_LOW RANGE_HIGH AVG_T1_FREQ NEW_DENSITY ---------- ---------- ----------- ----------- 2 25 .5 .021875

This query finds the overlap by querying the two histograms and reporting the lower high value and higher low value. It also reports the minimum frequency from the frequency histogram and divides by 2, and calculates the number of non-popular rows divided by the total number of rows and the number of distinct non-popular values. (Note that I’ve picked up the number of distinct values in * t2.j2* as a substituion variable generated by one of my earlier queries.) In my full script this messy piece of code runs before the query that showed I showed earlier on that told us how well (or badly) the two histograms matched.

Finally we can use the various values we’ve picked up in a slightly more complex version of the full outer join – with a special row added through a * union all* to give us our the estimate:

break on report skip 1 compute sum of product on report with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ), join1 as ( select f1.value t1_value, f2.value t2_value, f1.frequency t1_frequency, f2.frequency t2_frequency, f2.frequency * case when f2.frequency > 1 and f1.frequency is not null then f1.frequency when f2.frequency > 1 and f1.frequency is null then &m_avg_t1_freq end * &m_t2_bucket_size product from f1 full outer join f2 on f2.value = f1.value where coalesce(f1.value, f2.value) between &m_low and &m_high order by coalesce(f1.value, f2.value) ) select * from join1 union all select null, &m_avg_t2_dens, &m_unmatch_ct, &m_t2_rows * &m_avg_t2_dens, &m_t2_rows * &m_avg_t2_dens * &m_unmatch_ct from dual ; T1_VALUE T2_VALUE T1_FREQUENCY T2_FREQUENCY PRODUCT ---------- ---------- ------------ ------------ ----------- 2 5 5 15 7 15 10 17 12 13 14 1 15 13 17 17 11 1 18 1 19 1 20 20 7 1 21 2 40.00 22 22 3 1 23 1 24 2 40.00 25 25 1 2 80.00 .021875 99 17.5 1,732.50 ----------- sum 1,892.50

It remains only to check what the optimizer thinks the cardinality will be on a simple join, and then modify the data slightly to see if the string of queries continues to produce the right result. Here’s a starting test:

set serveroutput off alter session set statistics_level = all; alter session set events '10053 trace name context forever'; alter session set tracefile_identifier='BASELINE'; select count(*) from t1, t2 where t1.j1 = t2.j2 ; select * from table(dbms_xplan.display_cursor(null,null,'allstats last')); alter session set statistics_level = typical; alter session set events '10053 trace name context off'; COUNT(*) ---------- 1327 PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------------------------------------------------------------ SQL_ID f8wj7karu0hhs, child number 0 ------------------------------------- select count(*) from t1, t2 where t1.j1 = t2.j2 Plan hash value: 906334482 ----------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 41 | | | | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 41 | | | | |* 2 | HASH JOIN | | 1 | 1893 | 1327 |00:00:00.01 | 41 | 2545K| 2545K| 1367K (0)| | 3 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |00:00:00.01 | 7 | | | | | 4 | TABLE ACCESS FULL| T2 | 1 | 800 | 800 |00:00:00.01 | 7 | | | | ----------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("T1"."J1"="T2"."J2")

The * E-rows* for the hash join operation reports 1893 – and a quick check of the 10053 trace file shows that this is 1892.500000 rounded – a perfect match for the result from my query. I’ve modified the data in various ways (notably updating the

*table to change the value 25 (i.e. the current maximum value of*

**t1***) to other, lower, values) and the algorithm in the script seems to be sound – for 12c and 18c. I won’t be surprised, however, if someone comes up with a data pattern where the wrong estimate appears.*

**j1**### Don’t look back

Upgrades are a pain. With the same data set and same statistics on 11.2.0.4, running the same join query between * t1* and

*, here’s the execution plan I got:*

**t2**SQL_ID f8wj7karu0hhs, child number 0 ------------------------------------- select count(*) from t1, t2 where t1.j1 = t2.j2 Plan hash value: 906334482 ----------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 12 | | | | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 12 | | | | |* 2 | HASH JOIN | | 1 | 1855 | 1327 |00:00:00.01 | 12 | 2440K| 2440K| 1357K (0)| | 3 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |00:00:00.01 | 6 | | | | | 4 | TABLE ACCESS FULL| T2 | 1 | 800 | 800 |00:00:00.01 | 6 | | | | ----------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("T1"."J1"="T2"."J2")

Notice that the * E-rows* value is different. The join cardinality algorithm seems to have changed in the upgrade from 11.2.0.4 to 12c. I haven’t quite figured out how to get to the 11g result, but I seem to get quite close

**most of the time**by making a simple change to the final query that I used to predict the optimizer’s estimate. In the

*expression that chooses between the actual*

**case***frequency and the*

**t1.j1***“average frequency”*don’t choose, just use the latter:

case when f2.frequency > 1 and f1.frequency is not null -- then f1.frequency -- 12c then &m_avg_t1_freq -- 11g when f2.frequency > 1 and f1.frequency is null then &m_avg_t1_freq end *

As I modified the * t1* row with the value 25 to hold other values this change kept producing results that were exactly 2, 2.5, or 3.0 different from the execution plan

*– except in one case where the error was exactly 15.5 (which looks suspiciously like 17.5: the*

**E-Rows***“average frequency in t2”*minus 2). I’m not keen to spend time trying to work out exactly what’s going on but the takeaway from this change is that anyone upgrading from 11g to 12c may find that some of their queries change plans because they happen to match the type of example I’ve been working with in this post.

In some email I exchanged with Chinar Aliyev, he suggested three fix-controls that might be relevant. I’ve added these to * an earlier posting* I did when I first hit the anomaly a few days ago but I’ll repeat them here. I will be testing their effects at some point in the not too distant future:

14033181 1 QKSFM_CARDINALITY_14033181 correct ndv for non-popular values in join cardinality comp. (12.1.0.1) 19230097 1 QKSFM_CARDINALITY_19230097 correct join card when popular value compared to non popular (12.2.0.1) 22159570 1 QKSFM_CARDINALITY_22159570 correct non-popular region cardinality for hybrid histogram (12.2.0.1)