Oracle Scratchpad

April 29, 2015

Not Exists

Filed under: Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 8:21 pm GMT Apr 29,2015

This whole thing about “not exists” subqueries can run and run. In the previous episode I walked through some ideas of how the following query might perform depending on the data, the indexes, and the transformation that the optimizer might apply:

select
        count(*)
from    t1 w1
where   not exists (
                select  1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);  

As another participant in the original OTN thread had suggested, however, it might be possible to find a completely different way of writing the query, avoiding the subquery approach completely. In particular there are (probably) several ways that we could write an equivalent query where the table only appears once. In other words, if we restate the requirement we might be able to find a different SQL translation for that requirement.

Looking at the current SQL, it looks like the requirement is: “Count the number of rows in t1 that have values of X that only have one associated value of Y”.

Based on this requirement, the following SQL statements (supplied by two different people) look promising:


    WITH counts AS
       (SELECT x,y,count(*) xy_count
        FROM   t1
        GROUP BY x,y)
    SELECT SUM(x_count)
    FROM  (SELECT x, SUM(xy_count) x_count
           FROM   counts
           GROUP BY x
           HAVING COUNT(*) = 1);


SELECT SUM(COUNT(*))
  FROM t1
GROUP BY x HAVING COUNT(DISTINCT y)<=1

Logically they do seem to address the description of the problem – but there’s a critical difference between these statements and the original. The clue about the difference appears in the absence of any comparisons between columns in the new forms of the query, no t1.colX = t2.colX, no t1.colY != t2.colY, and this might give us an idea about how to test the code. Here’s some test data:


drop table t1 purge;

create table t1 (
        x       number(2,0),
        y       varchar2(10)
);

create index t1_i1 on t1(x,y);

--      Pick one of the three following pairs of rows

insert into t1(x,y) values(1,'a');
insert into t1(x,y) values(1,null);

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'b');

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'a');

commit;

--      A pair to be skipped

insert into t1(x,y) values(2,'c');
insert into t1(x,y) values(2,'c');

--      A pair to be reported

insert into t1(x,y) values(3,'d');
insert into t1(x,y) values(3,'e');

commit;

execute dbms_stats.gather_table_stats(user,'t1')

Notice the NULLs – comparisons with NULL lead to rows disappearing, so might the new forms of the query get different results from the old ?
The original query returns a count of 4 rows whichever pair we select from the top 6 inserts.

With the NULL in the Y column the new forms report 2 and 4 rows respectively – so only the second query looks viable.
With the NULLs in the X columns and differing Y columns the new forms report 2 and 2 rows respectively – so even the second query is broken.

However, if we add “or X is null” to the second query it reports 4 rows for both tests.
Finally, having added the “or x is null” predicate, we check that it returns the correct 4 rows for the final test pair – and it does.

It looks as if there is at least one solution to the problem that need only access the table once, though it then does two aggregates (hash group by in 11g). Depending on the data it’s quite likely that this single scan and double hash aggregation will be more efficient than any of the plans that do a scan and filter subquery or scan and hash anti-join. On the other hand the difference in performance might be small, and the ease of comprehension is just a little harder.

Footnote:

I can’t help thinking that the “real” requirement is probably as given in the textual restatement of the problem, and that the first rewrite of the query is probably the one that’s producing the “right” answers while the original query is probably producing the “wrong” answer.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.