Here’s a simple data set – I’m only interested in three of the columns in the work that follows, but it’s a data set that I use for a number of different models:

rem rem Script: aggregate_selectivity_c.sql rem Author: Jonathan Lewis rem Dated: Aug 2013 rem Purpose: rem execute dbms_random.seed(0) create table t1 nologging as with generator as ( select --+ materialize rownum id from dual connect by level <= 1e4 ) select trunc(dbms_random.value(0,1000)) n_1000, trunc(dbms_random.value(0,750)) n_750, trunc(dbms_random.value(0,600)) n_600, trunc(dbms_random.value(0,400)) n_400, trunc(dbms_random.value(0,90)) n_90, trunc(dbms_random.value(0,72)) n_72, trunc(dbms_random.value(0,40)) n_40, trunc(dbms_random.value(0,3)) n_3 from generator v1, generator v2 where rownum <= 1e6 -- > hint to avoid wordpress format issue ; create table t2 nologging as select * from t1 ; begin dbms_stats.gather_table_stats( ownname => user, tabname =>'T1', method_opt => 'for all columns size 1' ); dbms_stats.gather_table_stats( ownname => user, tabname =>'T2', method_opt => 'for all columns size 1' ); end; /

The columns I want to consider are * n_3*,

*, and*

**n_400***. As their names suggest the columns have 3, 400, and 1000 distinct values respectively and since I’ve used the*

**n_1000***function to generate the data the distinct values are fairly evenly spread across the million rows of the table.*

**dbms_random.value()**Consider, then, the following two queries:

select * from t1 where exists ( select null from t2 where t2.n_1000 = 0 and t2.n_400 = t1.n_400 and t2.n_3 = t1.n_3 ) ; select * from t1 where (t1.n_400, t1.n_3) in ( select t2.n_400, t2.n_3 from t2 where t2.n_1000 = 0 ) ;

The first point to check is that these two queries are logically equivalent.

Once you’re happy with that idea we can work out, informally, how many rows we should expect the queries ought to return: there are 1,200 combinations for * (n_400, n_3)* so each combination should return roughly 833 rows; if we pick 1,000 rows from the 1 million available we can expect to see 679 of those combinations (that’s

**formula that Oracle uses for adjusting**

*Alberto Dell’Era’s “selection without replacement”***to allow for filter predicates). So we might reasonably suggest that the final number of rows as 833 * 679 = 565,607. It turns out that that’s a pretty good estimate – when I ran the query the result was actually 567,018 rows.**

*num_distinct*So what does Oracle produce for the two execution plans – here are the result from 12c (EXISTS first, then IN):

=================== Multi-column EXISTS =================== ----------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 920K| 34M| 1259 (11)| 00:00:01 | |* 1 | HASH JOIN RIGHT SEMI| | 920K| 34M| 1259 (11)| 00:00:01 | |* 2 | TABLE ACCESS FULL | T2 | 1000 | 11000 | 610 (8)| 00:00:01 | | 3 | TABLE ACCESS FULL | T1 | 1000K| 26M| 628 (11)| 00:00:01 | ----------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("T2"."N_400"="T1"."N_400" AND "T2"."N_3"="T1"."N_3") 2 - filter("N_1000"=0) =================== Equivalent IN query =================== ----------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 833K| 30M| 1259 (11)| 00:00:01 | |* 1 | HASH JOIN RIGHT SEMI| | 833K| 30M| 1259 (11)| 00:00:01 | |* 2 | TABLE ACCESS FULL | T2 | 1000 | 11000 | 610 (8)| 00:00:01 | | 3 | TABLE ACCESS FULL | T1 | 1000K| 26M| 628 (11)| 00:00:01 | ----------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("T1"."N_400"="T2"."N_400" AND "T1"."N_3"="T2"."N_3") 2 - filter("T2"."N_1000"=0)

The first thing to note is that the shape of the plans is identical, and the predicate sections are identical – but the final cardinaliity estimates are different. Clearly at least one of the cardinalities has to be wrong by a significant amount (7.5% or 10.4%, depending which way round you want to look at it). If you run the test on 11.2.0.4 you find that both plans give the same estimated row count – and it’s the 920,000 rows; so arguably 12c has “fixed” the IN subquery calculation, bringing it closer to a reasonable prediction, but it hasn’t fixed the EXISTS subquery calculation. That 833K prediction, by the way, is what you would expect to see with this data with a basic join – and a semi-join shouldn’t be able to produce more data than a join!

But both predictions are way off the (informal) expectation, so how have they appeared ?

Working backwards it’s easy to spot that: 833K = 833 * 1,000: Oracle is behaving as if every single row identified in the subquery will produce a separate combination of ** (n_400, n_3)**. If we reverse engineer 920K we get: 920K / 833 = 1104 – it would appear that the optimizer thinks the 1,000 rows produced by the subquery will produce 1,104 distinct combinations of

*so how did the impossible 1,104 appear in the arithmetic.*

**(n_400, n_3)**If you apply the *“selection without replacement”* formula to picking 1,000 rows with 400 distinct values from 1,000,000 rows the expected number of distinct values (with rounding) will be 368; if you apply the formula for picking 1,000 rows with 3 distinct values from 1,000,000 rows the expected number will be 3. And 3 * 368 = 1,104. (Remember that in my original estimate I applied the formula * after* multiplying out the combination of distinct values). The optimizer is using its standard methods, but using intermediate results in an unsuitable fashion.

It’s impossible to say what the impact of this particular code path – and the change on the upgrade – might be. The optimizer has over-estimated by 47% in one case and 62% in the other but (a) there may be something about my data that exaggerated an effect that few people will see in the wild and (b) in many cases getting in the right ballpark is enough to get a reasonable plan, and a factor of 2 is the right ballpark.

Of course, a few people will be unlucky with a few queries on the upgrade where the estimate changes – after all a single row difference in the estimate can cause the optimizer to flip between a hash join and a nested loop – but at least you’ve got a little extra information that might help when you see a bad estimate on an important semi-join.

So is there a workaround ? Given that I’ve got 12c, the obvious thing to try is to create a column group at both ends of the semi-join and see what happens. It shouldn’t really make any difference because column groups are targeted at the problems of correlated column – but we might as well try it:

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for columns (n_400,n_3) size 1') execute dbms_stats.gather_table_stats(user,'t2',method_opt=>'for columns (n_400,n_3) size 1')

Unfortunately when I did this the final cardinality estimate for both queries dropped to just 833 (the absence of a K on the end isn’t a typo!).

Manually unnesting got me closer:

select * from ( select distinct n_3, n_400 from t2 where t2.n_1000 = 0 ) sq, t1 where sq.n_400 = t1.n_400 and sq.n_3 = t1.n_3 ; ----------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 649K| 33M| 1260 (11)| 00:00:01 | |* 1 | HASH JOIN | | 649K| 33M| 1260 (11)| 00:00:01 | | 2 | VIEW | | 779 | 20254 | 612 (8)| 00:00:01 | | 3 | HASH UNIQUE | | 779 | 8569 | 612 (8)| 00:00:01 | |* 4 | TABLE ACCESS FULL| T2 | 1000 | 11000 | 610 (8)| 00:00:01 | | 5 | TABLE ACCESS FULL | T1 | 1000K| 26M| 628 (11)| 00:00:01 | ----------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("SQ"."N_400"="T1"."N_400" AND "SQ"."N_3"="T1"."N_3") 4 - filter("N_1000"=0)

The cardinality of 649K is (allowing for rounding) 833 * 779; so we need to know where the 779 came from. It’s the optimizer’s standard arithmetic for *“distinct”* – multiply the N individual selectivities together then divide by the * sqrt(2)* “N-1” times. So we apply the “selection without replacement formula twice”:

- adjusted selectivity of n_400 = 367.21
- adjusted selectivity of n_3 = 3
- 367.21 * 3 / sqrt(2) = 779

If you create column group statistics for * (n_400, n_3)* this doesn’t change the optimizer’s estimate for the number of distinct combinations after selection – maybe that’s another enhancement in the pipeline – but, at least in this case, the manual unnesting has got us a little closer to the right estimates without any statistical intervention.

### Footnote:

Just for the sake of completeness, here are the plans (with yet more cardinality predictions) that you get if you block the unnesting:

select * from t1 where exists ( select /*+ no_unnest */ null from t2 where t2.n_1000 = 0 and t2.n_400 = t1.n_400 and t2.n_3 = t1.n_3 ) ; --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1179 | 33012 | 766K (12)| 00:00:30 | |* 1 | FILTER | | | | | | | 2 | TABLE ACCESS FULL| T1 | 1000K| 26M| 632 (11)| 00:00:01 | |* 3 | TABLE ACCESS FULL| T2 | 1 | 11 | 638 (12)| 00:00:01 | --------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "T2" WHERE "N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2)) 3 - filter("N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2) ===================================== Unnesting blocked and subquery pushed ===================================== select * from t1 where exists ( select /*+ no_unnest push_subq */ null from t2 where t2.n_1000 = 0 and t2.n_400 = t1.n_400 and t2.n_3 = t1.n_3 ) ; --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 50000 | 1367K| 1271 (12)| 00:00:01 | |* 1 | TABLE ACCESS FULL | T1 | 50000 | 1367K| 632 (11)| 00:00:01 | |* 2 | TABLE ACCESS FULL| T2 | 1 | 11 | 638 (12)| 00:00:01 | --------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter( EXISTS (SELECT /*+ PUSH_SUBQ NO_UNNEST */ 0 FROM "T2" "T2" WHERE "N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2)) 2 - filter("N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2)

The 1179 comes from the magic of sqrt(2): 1179 = 1,000,000 / (400 * 3 / sqrt(2)).

The 50,000 is just the basic *“I dunno, let’s call it 5%”*.

## Comments and related questions are welcome.