Oracle Scratchpad

April 5, 2015

Not In CTAS

Filed under: Infrastructure,NULL,Oracle — Jonathan Lewis @ 6:49 pm GMT Apr 5,2015

Everyone gets caught out some of the time with NOT IN.

NOT IN is not the opposite of IN.

This came up in a (fairly typical) question on OTN recently where someone had the task of “deleting 6M rows from a table of 18M”. A common, and perfectly reasonable, suggestion for dealing with a delete on this scale is to consider creating a replacement table holding the data you do want rather than deleting the data you don’t want.  In this case, however, the query for deleting the data looked like this:


DELETE FROM EI.CASESTATUS
     WHERE CASEID NOT IN (SELECT CASEID FROM DO.STG_CASEHEADER);

The suggested code for creating the kept data was this:


CREATE TABLE newTable as
  SELECT * FROM EI.CASESTATUS
     WHERE CASEID IN (SELECT CASEID FROM DO.STG_CASEHEADER);

You might get the same result sets in the two tables after this process – but it depends on the CASEID in both tables never being NULL. (You might think that a column with ID in the name probably ought to be a primary key, or a foreign key with a NOT NULL declaration, but then again there’s that STG_ in the subquery table that might be indicative of a “staging” table, so who knows what might happen if the data’s allowed to start life as dirty data.) Here’s a quick demo to prove the point. First some simple data creation – with an optional insert so that you’ve got two tests in one script – followed by the two strategies for identifying data:


drop table t3;
drop table t2;
drop table t1;

create table t1 (n1 number);
create table t2 (n1 number);

insert into t1 values(null);
insert into t1 values(1);
insert into t1 values(2);

/* two tests available, with or without a null in t2 */

-- insert into t2 values(null);
insert into t2 values(1);

commit;

-- gather stats here

set null n/a
delete from t1
where  t1.n1 not in (select t2.n1 from t2);

prompt  Remainder after delete

select  *  from t1;

rollback;

prompt  Selected on create

create table t3 as
select  *  from t1
where   t1.n1 in (select t2.n1 from t2);

select * from t3;

Then the two sets of output from running the test, first with the the NULL insert into t2:


Remainder after delete

        N1
----------
n/a
         1
         2
Selected on create

        N1
----------
         1

We haven’t deleted ANY data from t1 when we were probably hoping that the 2 would disappear – after all, it’s not in t2; however since the equality comparison between a t1 row and every t2 row must evaluate to FALSE before a t1 row is deleted and the comparison of 2 and NULL evaluates to NULL the 2 row won’t be deleted (similarly the comparison for “t1.NULL = t2.anything” evaluates to NULL rather than FALSE, so the NULL isn’t deleted).

Still, perhaps the rewrite would have been okay for the data set where we don’t have a NULL in t2:


Remainder after delete

        N1
----------
n/a
         1
Selected on create

        N1
----------
         1

Oops – still doesn’t produce matching results . This time the row with the 2 has disappeared from t1 in both cases – which might have been closer to what the original OTN poster had hoped but we still have the difference in the survival of the NULLs from t1 – for the reason given for the previous data set

Footnote:

In passing, the execution plan subsequently supplied by the OP showed a “MERGE JOIN ANTI NA” with stg_caseheader (the subquery table) as the second table. The significance of the NA (Null-aware) is that it tells us that the join column in stg_caseheader definitely doesn’t have a NOT NULL constraint on it. (We can’t draw any conclusion about the join column in casestatus.)

4 Comments »

  1. Hi Jonathan, I guess Lex would have liked that entry. Thanks.

    Comment by Lothar Flatz — April 7, 2015 @ 10:12 am GMT Apr 7,2015 | Reply

  2. Jonathan,

    thanks for the idea to present all the possible combinaitons – with and without the NULL. I “know” that NOT IN should be treated with care regarding the NULL, but always have difficulties getting my head around the details.

    Some statements seem contradictory to me, you might want to check them:

    “EVERY equality comparison between a t1 row and all the t2 rows must evaluate to TRUE before a t1 row is deleted”
    **** you probably mean “EVERY non-equality”

    “The NULLs disappear even though there is is a “matching NULL” in t2 for the NULL in t1”
    **** the statement actually refers to the case when we “we get rid of that NULL in t2”, so it doesn’t sound right

    Regards
    Todor

    Comment by Todor Botev — April 7, 2015 @ 9:22 pm GMT Apr 7,2015 | Reply

    • Todor,

      Thanks for highlighting the errors – which I’ve now eliminated.
      Perversely I tend to think of the equality evaluating to FALSE rather than “not equals” evaluating to TRUE, though.

      Comment by Jonathan Lewis — April 8, 2015 @ 8:35 am GMT Apr 8,2015 | Reply

      • Yes, this is how I prefer to think, too. But this is our human interpretation of what is really going on which is – as you wrote in your previous post – exactly “not equals” evaluating to TRUE:

        colX NOT in (value1, value2, value3)

        means

        colX != value1
        AND colX != value2
        AND colX != value3

        Comment by Todor Botev — April 9, 2015 @ 5:12 pm GMT Apr 9,2015 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.