This note was prompted by a recent email asking about the optimizer’s method for estimating the selectivity of a predicate which compared two columns in the same table – for example: *“where orders.amount_invoiced = orders.amount_paid”*. It’s been about 14 years since I wrote “Cost Based Oracle – Fundamentals” so my memory of what I wrote (and whether I even mentioned this case) was rather hazy, so I sent off a quick reply and decided to do a little checking.

It turned out that I’d * already written a blog note* with a throwaway comment about the estimates and a general workaround for optimizer problems caused by examples of this kind. The comment I made about the estimate was that the selectivity seems to be the smaller of the selectivities of (using the example above)

*“amount_paid = :unpeekable_bind”*and

*“amount_invoice = :unpeekable_bind”*. I’m fairly sure I’ve made similar comments several times in the past, but after replying to the email I started to wonder whether this would still be true if there were histograms on the columns. So I ran up a little test and here, to start things off, is the code to generate the data I used for testing:

rem rem Script: column_equality_2.sql rem Author: Jonathan Lewis rem Dated: Apr 2019 rem Purpose: create table t1( id number(8,0), n1 number(6,0) ) ; create table t2( id number(8,0), n1 number(6,0) ) ; create table t3( n1 number(6,0), n2 number(6,0), v1 varchar2(50) ) ; execute dbms_random.seed(0) insert into t1 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, trunc(10 * abs(dbms_random.normal)) n1 from generator v1 ; insert into t2 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, trunc(10 * abs(dbms_random.normal)) n1 from generator v1 ; insert into t3 (n1, n2, v1) select t1.n1, t2.n1, rpad(rownum,50) from t1, t2 where t1.id = t2.id ; commit; begin dbms_stats.gather_table_stats( ownname => null, tabname => 'T1', method_opt => 'for all columns size 1 for columns n1 size 254' ); dbms_stats.gather_table_stats( ownname => null, tabname => 'T2', method_opt => 'for all columns size 1 for columns n1 size 254' ); dbms_stats.gather_table_stats( ownname => null, tabname => 'T3', method_opt => 'for all columns size 254 for columns v1 size 1' ); end; / select table_name, column_name, num_distinct, density, histogram, low_value, high_value from user_tab_cols where table_name in ('T1','T2','T3') and column_name in ('N1','N2') order by table_name, column_name ; TABLE_NAME COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM LOW_VALUE HIGH_VALUE --------------- --------------- ------------ ---------- --------------- ---------- ---------- T1 N1 38 .00005 FREQUENCY 80 C128 T2 N1 38 .00005 FREQUENCY 80 C126 T3 N1 38 .00005 FREQUENCY 80 C128 N2 38 .00005 FREQUENCY 80 C126

I’ve created two sets of 10,000 rows each of normally distributed data – but taken the absolute values so I’ve only got half the bell curve, and I’ve scaled up by a factor of 10 and truncated. This has given me two similar but slightly different sets of values which happen to cover 38 distinct values each.

I’ve then generated my test set by joining these two tables on the unique (though not declared as such) * id* column to give a table with the same number of rows and two skewed sets of data. The calls to

*create histograms on the skewed data sets, and I’ve reported a few significant numbers about the 4 relevant columns.*

**dbms_stats**Looking at the column statistics we have * num_distinct *= 38 across the board – so my observation from paragraph 2 above would tend to suggest that the optimizer would report 10,000/38 = 263 as the cardinality estimate for the predciate

*“t3.n1 = t3.n2”*(I’m fairly confident that in this case 1/

*will be preferred over using the*

**num_distinct***from*

**density***). But here’s what we get from a call to*

**user_tab_cols***:*

**explain plan**explain plan for select v1 from t3 where n1 = n2 ; select * from table(dbms_xplan.display); -------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 564 | 32148 | 18 (6)| 00:00:01 | |* 1 | TABLE ACCESS FULL| T3 | 564 | 32148 | 18 (6)| 00:00:01 | -------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("N1"="N2")

The estimate is 564 – which is a pretty good estimate in this case (the actual result was 552) as the two columns were randomly generated and there’s no correlation between them. Unfortunately this is quite a long way of my assumption of 263, so where did the optimizer get that number from?

Here’s a query (with result set) that you may recognise from * an earlier post*.

break on report skip 1 compute count of value on report compute sum of t1_frequency on report compute sum of t2_frequency on report compute sum of product on report column product format 999,999,999 with f1 as ( select endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency from user_tab_histograms where table_name = 'T3' and column_name = 'N1' ), f2 as ( select endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency from user_tab_histograms where table_name = 'T3' and column_name = 'N2' ) select f1.value, f1.frequency t1_frequency, f2.frequency t2_frequency, f1.frequency * f2.frequency product from f1, f2 where f2.value = f1.value order by f1.value ; VALUE T1_FREQUENCY T2_FREQUENCY PRODUCT ---------- ------------ ------------ ------------ 0 777 768 596,736 1 806 753 606,918 2 794 779 618,526 3 808 763 616,504 4 752 749 563,248 5 627 729 457,083 6 623 628 391,244 7 584 616 359,744 8 544 597 324,768 9 512 546 279,552 10 441 439 193,599 11 409 342 139,878 12 345 370 127,650 13 318 300 95,400 14 257 282 72,474 15 244 242 59,048 16 214 206 44,084 17 172 193 33,196 18 161 140 22,540 19 113 114 12,882 20 108 93 10,044 21 95 81 7,695 22 72 55 3,960 23 54 56 3,024 24 43 36 1,548 25 38 31 1,178 26 23 18 414 27 18 23 414 28 7 14 98 29 9 13 117 30 14 11 154 31 4 2 8 32 5 3 15 33 1 3 3 35 4 1 4 37 2 2 4 ---------- ------------ ------------ ------------ 36 9998 9998 5,643,754

I’m querying the histoggram information for the two columns, and where * t3.n1* and

*have a value in common I’ve reported the two frequencies for that value and the product of the frequencies. For convenience I’ve included a count and a couple of sums to show that there isn’t a perfect match in the set of values for the two columns. The most important number at the bottom of the page, though, is the sum of the products of frequencies of common values. Take that value and divide by 10,000 and you get 564.3754 – compare that with the cardinality estimate of the predicate*

**t3.n2***“t3.n1 = t3.n2”*, it’s a perfect match (allowing for rounding).

The query against * user_tab_histograms* is the query I used to calculate the cardinality of a join where there were frequency histograms on the columns at both ends of the join. The optimizer’s estimate for “intra-table” predicates is consistent with its estimate for joins (in the special cases of

*“no histograms”*and

*“two frequency histograms”*, at least). Viewing it from a slightly different angle: the selectivity of the predicate

*“n1 = n2”*can be derived as

*“the cardinality estimate for joining t3 to itself”*divided by

*“the cardinality of the cartesian join”*(the latter being

***

**num_rows***, of course).*

**num_rows**Just as a closing demo – lets generate a plan for the appropriate self-join of * t3* and check the cardinality estimate:

explain plan for select t3a.v1, t3b.v1 from t3 t3a, t3 t3b where t3a.n2 = t3b.n1 ; select * from table(dbms_xplan.display); --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 5643K| 581M| 138 (83)| 00:00:01 | |* 1 | HASH JOIN | | 5643K| 581M| 138 (83)| 00:00:01 | | 2 | TABLE ACCESS FULL| T3 | 10000 | 527K| 13 (8)| 00:00:01 | | 3 | TABLE ACCESS FULL| T3 | 10000 | 527K| 13 (8)| 00:00:01 | --------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("T3A"."N2"="T3B"."N1")

As expected the (rounded) join cardinality is reported as 5,643K.

So the selectivity of the single table predicate *“n1 = n2”* will be (5,643,000 / (10,000 * 10,000) = 0.05643 and the cardinality estimate of the single table query will be 10,000 * 0.05643 = 564.3 QED.

I haven’t tested any other variations of types of histogram, degree of overlap of value ranges, etc. but I suspect that the general principle is probably going to give the selectivity as (or with the appearance of): “estimated cardinality of self-join” / “square of num_rows (allowing for nulls)”.

## Comments and related questions are welcome.