Oracle Scratchpad

September 2, 2015

IN/EXISTS bugs

Filed under: 12c,Bugs,CBO,Execution plans,Oracle,subqueries — Jonathan Lewis @ 8:11 am BST Sep 2,2015

Here’s a simple data set – I’m only interested in three of the columns in the work that follows, but it’s a data set that I use for a number of different models:


execute dbms_random.seed(0)

create table t1
nologging
as
with generator as (
	select	--+ materialize
		rownum id 
	from dual 
	connect by 
		level <= 1e4
)
select
	trunc(dbms_random.value(0,1000))	n_1000,
	trunc(dbms_random.value(0,750))		n_750,
	trunc(dbms_random.value(0,600))		n_600,
	trunc(dbms_random.value(0,400))		n_400,
	trunc(dbms_random.value(0,90))		n_90,
	trunc(dbms_random.value(0,72))		n_72,
	trunc(dbms_random.value(0,40))		n_40,
	trunc(dbms_random.value(0,3))		n_3
from
	generator	v1,
	generator	v2
where
	rownum <= 1e6
;
create table t2 nologging 
as
select * from t1
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt 	 => 'for all columns size 1'
	);

	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T2',
		method_opt 	 => 'for all columns size 1'
	);
end;
/

The columns I want to consider are n_3, n_400, and n_1000. As their names suggest the columns have 3, 400, and 1000 distinct values respectively and since I’ve used the dbms_random.value() function to generate the data the distinct values are fairly evenly spread across the million rows of the table.

Consider, then, the following two queries:


select
        *
from
        t1
where
        exists (
                select  null
                from    t2
                where   n_1000 = 0
                and     t2.n_400 = t1.n_400
                and     t2.n_3 = t1.n_3
        )
;


select
        *
from
        t1
where
        (t1.n_400, t1.n_3) in (
                select  t2.n_400, t2.n_3
                from    t2
                where   t2.n_1000 = 0
        )
;

The first point to check is that these two queries are logically equivalent.

Once you’re happy with that idea we can work out, informally, how many rows we should expect the queries ought to return: there are 1,200 combinations for (n_400, n_3) so each combination should return roughly 833 rows; if we pick 1,000 rows from the 1 million available we can expect to see 679 of those combinations (that’s Alberto Dell’Era’s “selection without replacement” formula that Oracle uses for adjusting num_distinct to allow for filter predicates). So we might reasonably suggest that the final number of rows as 833 * 679 = 565,607. It turns out that that’s a pretty good estimate – when I ran the query the result was actually 567,018 rows.

So what does Oracle produce for the two execution plans – here are the result from 12c (EXISTS first, then IN):


===================
Multi-column EXISTS
===================
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |   920K|    34M|  1259  (11)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT SEMI|      |   920K|    34M|  1259  (11)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T2   |  1000 | 11000 |   610   (8)| 00:00:01 |
|   3 |   TABLE ACCESS FULL  | T1   |  1000K|    26M|   628  (11)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."N_400"="T1"."N_400" AND "T2"."N_3"="T1"."N_3")
   2 - filter("N_1000"=0)

===================
Equivalent IN query
===================
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |   833K|    30M|  1259  (11)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT SEMI|      |   833K|    30M|  1259  (11)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T2   |  1000 | 11000 |   610   (8)| 00:00:01 |
|   3 |   TABLE ACCESS FULL  | T1   |  1000K|    26M|   628  (11)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."N_400"="T2"."N_400" AND "T1"."N_3"="T2"."N_3")
   2 - filter("T2"."N_1000"=0)

The first thing to note is that the shape of the plans is identical, and the predicate sections are identical – but the final cardinalities are different. Clearly at least one of the cardinalities has to be wrong by a significant amount (7.5% or 10.4%, depending which way round you want to look at it). If you run the test on 11.2.0.4 you find that both plans give the same estimated row count – and it’s the 920,000 rows; so arguably 12c has “fixed” the IN subquery calculation, bringing it closer to a reasonable prediction, but it hasn’t fixed the EXISTS subquery calculation. That 833K prediction, by the way, is what you would expect to see with this data with a basic join – and a semi-join shouldn’t be able to produce more data than  a join.

But both predictions are way off the (informal) expectation, so how have they appeared ?

Working backwards it’s easy to spot that: 833K = 833 * 1,000: Oracle is behaving as if every single row identified in the subquery will produce a separate combination of (n_400, n_3). If we reverse engineer 920K we get: 920K / 833 = 1104 – it would appear that the optimizer thinks the 1,000 rows produced by the subquery will produce 1,104 distinct combinations of (n_400, n_3) so we how did the impossible 1,104 appear in the arithmetic.

If you apply the “selection without replacement” formula to picking 1,000 rows with 400 distinct values from 1,000,000 rows the expected number of distinct values (with rounding) will be 368; if you apply the formula for picking 1,000 rows with 3 distinct values from 1,000,000 rows the expected number will be 3. And 3 * 368 = 1,104. (Remember that in my original estimate I applied the formula after multiplying out the combination of distinct values). The optimizer is using its standard methods, but using internediate results in an unsuitable fashion.

It’s impossible to say what the impact of this particular code path – and the change on the upgrade – might be. The optimizer has over-estimated by 47% in one case and 62% in the other but (a) there may be something about my data that exaggerated an effect that few people will see in the wild and (b) in many cases getting in the right ballpark is enough to get a reasonable plan, and a factor of 2 is the right ballpark.

Of course, a few people will be unlucky with a few queries on the upgrade where the estimate changes – after all a single row difference in the estimate can cause the optimizer to flip between a hash join and a nested loop – but at least you’ve got a little extra information that might help when you see a bad estimate on an important semi-join.

So is there a workaround ? Given that I’ve got 12c, the obvious thing to try is to create a column group at both ends of the semi-join and see what happens. It shouldn’t really make any difference because column groups are targeted at the problems of correlated column – but we might as well try it:


execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for columns (n_400,n_3) size 1')
execute dbms_stats.gather_table_stats(user,'t2',method_opt=>'for columns (n_400,n_3) size 1')

Unfortunately when I did this the final cardinality estimate for both queries dropped to just 833 (the absence of a K on the end isn’t a typo!).

Manually unnesting got me closer:


select
        *
from
        (
        select  distinct n_3, n_400
        from    t2
        where   n_1000 = 0
        )       sq,
        t1
where   
        sq.n_400 = t1.n_400
and     sq.n_3 = t1.n_3
;

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |   649K|    33M|  1260  (11)| 00:00:01 |
|*  1 |  HASH JOIN           |      |   649K|    33M|  1260  (11)| 00:00:01 |
|   2 |   VIEW               |      |   779 | 20254 |   612   (8)| 00:00:01 |
|   3 |    HASH UNIQUE       |      |   779 |  8569 |   612   (8)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL| T2   |  1000 | 11000 |   610   (8)| 00:00:01 |
|   5 |   TABLE ACCESS FULL  | T1   |  1000K|    26M|   628  (11)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("SQ"."N_400"="T1"."N_400" AND "SQ"."N_3"="T1"."N_3")
   4 - filter("N_1000"=0)

The cardinality of 649K is (allowing for rounding) 833 * 779; so we need to know where the 779 came from. It’s the optimizer standard arithmetic for “distinct” – multiply the N individual selectivities together then divide by the sqrt(2) “N-1” times. So we apply the “selection without replacement formula twice”:

  • adjusted selectivity of n_400 = 367.21
  • adjusted selectivity of n_3 = 3
  • 367.21 * 3 / sqrt(2) = 779

If you create column group statistics for (n_400, n_3) this doesn’t change the optimizer’s estimate for the number of distinct combinations after selection – maybe that’s another enhancement in the pipeline – but, at least in this case, the manual unnesting has got us a little closer to the right estimates without any statistical intervention.

Footnote:

Just for the sake of completeness, here are the plans (with yet more cardinality predictions) that you get if you block the unnesting:


select 
	*
from 
	t1 
where 
	exists (
		select	
			/*+ no_unnest */
			null  
		from	t2 
		where	n_1000 = 0 
		and	t2.n_400 = t1.n_400 
		and	t2.n_3 = t1.n_3
	)
;



---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  1179 | 33012 |   766K (12)| 00:00:30 |
|*  1 |  FILTER            |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL| T1   |  1000K|    26M|   632  (11)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| T2   |     1 |    11 |   638  (12)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "T2" WHERE
              "N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2))
   3 - filter("N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2)



=====================================
Unnesting blocked and subquery pushed
=====================================
select 
	*
from 
	t1 
where 
	exists (
		select	
			/*+ no_unnest push_subq */
			null  
		from	t2 
		where	n_1000 = 0 
		and	t2.n_400 = t1.n_400 
		and	t2.n_3 = t1.n_3
	)
;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 50000 |  1367K|  1271  (12)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL | T1   | 50000 |  1367K|   632  (11)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| T2   |     1 |    11 |   638  (12)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( EXISTS (SELECT /*+ PUSH_SUBQ NO_UNNEST */ 0 FROM "T2"
              "T2" WHERE "N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2))
   2 - filter("N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2)

The 1179 comes from the magic of sqrt(2):  1179 = 1,000,000 / (400 * 3 / sqrt(2)).

The 50,000 is just the basic “I dunno, let’s call it 5%”.

 

Reference script: aggregate_selectivity_c.sql

 

2 Comments »

  1. […] bugs by Jonathan […]

    Pingback by Log Buffer #439: A Carnival of the Vanities for DBAs | InsideMySQL — September 4, 2015 @ 7:41 pm BST Sep 4,2015 | Reply

  2. […] bugs by Jonathan […]

    Pingback by Log Buffer #439: A Carnival of the Vanities for DBAs | MySQL — September 5, 2015 @ 1:18 am BST Sep 5,2015 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.