Everyone gets caught out some of the time with NOT IN.
This came up in a (fairly typical) question on OTN recently where someone had the task of “deleting 6M rows from a table of 18M”. A common, and perfectly reasonable, suggestion for dealing with a delete on this scale is to consider creating a replacement table holding the data you do want rather than deleting the data you don’t want. In this case, however, the query for deleting the data looked like this:
DELETE FROM EI.CASESTATUS
WHERE  CASEID NOT IN (SELECT CASEID FROM DO.STG_CASEHEADER);
The suggested code for creating the kept data was this:
CREATE TABLE newTable AS
SELECT * FROM EI.CASESTATUS
WHERE  CASEID IN (SELECT CASEID FROM DO.STG_CASEHEADER);
You might get the same result sets in the two tables after this process – but it depends on the CASEID in both tables never being NULL. (You might think that a column with ID in the name probably ought to be a primary key, or a foreign key with a NOT NULL declaration, but then again there’s that STG_ in the subquery table that might be indicative of a “staging” table, so who knows what might happen if the data’s allowed to start life as dirty data.) Here’s a quick demo to prove the point. First some simple data creation – with an optional insert so that you’ve got two tests in one script – followed by the two strategies for identifying data:
drop table t3;
drop table t2;
drop table t1;

create table t1 (n1 number);
create table t2 (n1 number);

insert into t1 values(null);
insert into t1 values(1);
insert into t1 values(2);

/* two tests available, with or without a null in t2 */
-- insert into t2 values(null);
insert into t2 values(1);

commit;

-- gather stats here

set null n/a

delete from t1 where t1.n1 not in (select t2.n1 from t2);

prompt Remainder after delete
select * from t1;

rollback;

prompt Selected on create
create table t3 as select * from t1 where t1.n1 in (select t2.n1 from t2);

select * from t3;
Then the two sets of output from running the test, first with the NULL insert into t2:
Remainder after delete

        N1
----------
n/a
         1
         2

Selected on create

        N1
----------
         1
We haven’t deleted ANY data from t1, even though we were probably hoping that the 2 would disappear – after all, it’s not in t2. However, a t1 row is deleted only if its equality comparison with every t2 row evaluates to FALSE, and since the comparison of 2 and NULL evaluates to NULL (not FALSE) the 2 row isn’t deleted. (Similarly the comparison “t1.NULL = t2.anything” evaluates to NULL rather than FALSE, so the NULL row isn’t deleted either.)
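You can see the three-valued logic in isolation with a couple of one-liners against dual (my addition, not part of the original script):

-- 2 != 1 is TRUE, but 2 != NULL is NULL, so the whole
-- NOT IN predicate is NULL and the row is not returned
select 'gone' from dual where 2 not in (1, null);

-- no rows selected

-- with no NULL in the list every comparison is TRUE or FALSE,
-- so the predicate evaluates to TRUE and the row appears
select 'gone' from dual where 2 not in (1, 3);

In effect “x NOT IN (a, b)” is “x != a AND x != b”, and a single NULL in the list turns the whole conjunction into NULL.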
Still, perhaps the rewrite would have been okay for the data set where we don’t have a NULL in t2:
Remainder after delete

        N1
----------
n/a
         1

Selected on create

        N1
----------
         1
Oops – still doesn’t produce matching results. This time the row with the 2 has disappeared from t1 in both cases – which might be closer to what the original OTN poster had hoped for – but we still have the difference in the survival of the NULL from t1, for the reason given for the previous data set.
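If the intention really is “delete every row whose value has no non-NULL match in the other table”, then a null-safe alternative (my sketch, not the OP’s code) is to rewrite the delete with NOT EXISTS, whose correlated subquery either finds a matching row or it doesn’t – there’s no three-valued trap:

-- delete t1 rows that have no matching t2 row;
-- "t2.n1 = t1.n1" can never be TRUE for a NULL on either side,
-- so the 2 AND the NULL row in t1 are both deleted
delete from t1
where  not exists (
         select null
         from   t2
         where  t2.n1 = t1.n1
       );

Note that this version also deletes the NULL row from t1 (NULL never equals anything), which makes its surviving data consistent with the IN-based “create table as select”, since that likewise discards the NULL.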
In passing, the execution plan subsequently supplied by the OP showed a “MERGE JOIN ANTI NA” with stg_caseheader (the subquery table) as the second table. The significance of the NA (Null-aware) is that it tells us that the join column in stg_caseheader definitely doesn’t have a NOT NULL constraint on it. (We can’t draw any conclusion about the join column in casestatus.)