Everyone gets caught out some of the time with NOT IN.
NOT IN is not the opposite of IN.
This came up in a (fairly typical) question on OTN recently where someone had the task of “deleting 6M rows from a table of 18M”. A common, and perfectly reasonable, suggestion for dealing with a delete on this scale is to consider creating a replacement table holding the data you do want rather than deleting the data you don’t want. In this case, however, the query for deleting the data looked like this:
DELETE FROM EI.CASESTATUS WHERE CASEID NOT IN (SELECT CASEID FROM DO.STG_CASEHEADER);
The suggested code for creating the kept data was this:
CREATE TABLE newTable as SELECT * FROM EI.CASESTATUS WHERE CASEID IN (SELECT CASEID FROM DO.STG_CASEHEADER);
You might get the same result sets in the two tables after this process – but it depends on the caseid in both tables never being null. (You might think that a column with “id” in the name probably ought to be a primary key, or a foreign key with a not null declaration but then again there’s that “stg_” in the subquery tablename that might be indicative of a “staging” table, so who knows what might happen if the data’s allowed to start life as dirty data.)
Here’s a little script to demonstrate the point. First some simple data creation – with an optional insert so that you’ve got two tests in one script – followed by the two strategies for identifying data:
rem rem Script: not_in_ctas.sql rem Author: Jonathan Lewis rem Dated: Apr 2015 rem drop table t3; drop table t2; drop table t1; create table t1 (n1 number); create table t2 (n1 number); insert into t1 values(null); insert into t1 values(1); insert into t1 values(2); /* two tests available, with or without a null in t2 */ insert into t2 values(1); -- insert into t2 values(null); commit; begin dbms_stats.gather_table_stats( ownname => user, tabname =>'T1', method_opt => 'for all columns size 1' ); dbms_stats.gather_table_stats( ownname => user, tabname =>'T2', method_opt => 'for all columns size 1' ); end; / set null n/a delete from t1 where t1.n1 not in (select t2.n1 from t2); prompt Remainder after delete select * from t1; rollback; prompt Selected on create create table t3 as select * from t1 where t1.n1 in (select t2.n1 from t2); select * from t3;
Then the two sets of output from running the test, first with the the null insert into t2:
Selected on create N1 ---------- n/a 1 2 Result of CTAS N1 ---------- 1
We haven’t deleted any data from t1 when we were probably hoping that the 2 would disappear – after all, it’s clearly not in t2; however since the equality comparison between a t1 row and every t2 row must evaluate to false before a t1 row is deleted and the comparison of 2 and null evaluates to null the 2 row won’t be deleted (similarly the comparison for “t1.null = t2.anything” evaluates to null rather than false so the null isn’t deleted).
Still, perhaps the rewrite would have been okay for the data set where we don’t have a null in t2:
Remainder after delete N1 ---------- n/a 1 Selected on create N1 ---------- 1
Oops – we still haven’t produced matching results . This time the row with the 2 has disappeared from t1 in both cases – which might have been closer to what the original OTN poster had hoped but we still have the difference in the survival of the NULLs from t1 – for the reason given for the previous data set
Footnote:
In passing, the execution plan subsequently supplied by the OP showed a “MERGE JOIN ANTI NA” with stg_caseheader (the subquery table) as the second table. The significance of the NA (Null Aware) is that it tells us that the join column in stg_caseheader definitely doesn’t have a NOT NULL constraint on it. (We can’t draw any conclusion about the join column in casestatus.)