This whole thing about “not exists” subqueries can run and run. In the previous episode I walked through some ideas of how the following query might perform depending on the data, the indexes, and the transformation that the optimizer might apply:
select count(*)
from   t1 w1
where  not exists (
        select  1
        from    t1 w2
        where   w2.x = w1.x
        and     w2.y <> w1.y
);
As another participant in the original OTN thread had suggested, however, it might be possible to find a completely different way of writing the query, avoiding the subquery approach entirely. In particular there are (probably) several ways we could write an equivalent query in which the table appears only once. In other words, if we restate the requirement we might be able to find a different SQL translation for that requirement.
Looking at the current SQL, the requirement appears to be: “Count the number of rows in t1 that have values of X with only one associated value of Y”.
Based on this requirement, the following SQL statements (supplied by two different people) look promising:
WITH counts AS (
        SELECT x, y, COUNT(*) xy_count
        FROM   t1
        GROUP BY x, y
)
SELECT SUM(x_count)
FROM   (
        SELECT x, SUM(xy_count) x_count
        FROM   counts
        GROUP BY x
        HAVING COUNT(*) = 1
       );

SELECT SUM(COUNT(*))
FROM   t1
GROUP BY x
HAVING COUNT(DISTINCT y) <= 1;
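As a quick sanity check that the two rewrites agree with the original on straightforward data, here's a sketch using Python's built-in sqlite3 rather than Oracle (the behaviour of NOT EXISTS, GROUP BY and COUNT(DISTINCT) is the same for this example; note that sqlite doesn't accept Oracle's nested SUM(COUNT(*)), so the second rewrite is expressed as an inline view with the same meaning):

```python
# Compare the original NOT EXISTS query with the two rewrites on
# NULL-free data, using sqlite3 as a stand-in for Oracle.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1 (x int, y text);
    insert into t1(x,y) values (1,'a'),(1,'b'),   -- two y values: excluded
                               (2,'c'),(2,'c'),   -- one y value:  counted
                               (3,'d');           -- one y value:  counted
""")

original = con.execute("""
    select count(*) from t1 w1
    where not exists (select 1 from t1 w2
                      where w2.x = w1.x and w2.y <> w1.y)
""").fetchone()[0]

rewrite_1 = con.execute("""
    with counts as (select x, y, count(*) xy_count from t1 group by x, y)
    select sum(x_count) from (select x, sum(xy_count) x_count
                              from counts group by x having count(*) = 1)
""").fetchone()[0]

# sqlite version of SUM(COUNT(*)) ... HAVING COUNT(DISTINCT y) <= 1:
rewrite_2 = con.execute("""
    select sum(c) from (select count(*) c from t1
                        group by x having count(distinct y) <= 1)
""").fetchone()[0]

print(original, rewrite_1, rewrite_2)   # 3 3 3
```

On this clean data all three statements count the same three rows (both x = 2 rows and the single x = 3 row).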
Logically they do seem to address the description of the problem – but there's a critical difference between these statements and the original. The clue to the difference lies in the absence of any column-to-column comparisons in the new forms of the query (no w2.x = w1.x, no w2.y <> w1.y), and this suggests a way to test the code. Here's some test data:
drop table t1 purge;

create table t1 (
        x number(2,0),
        y varchar2(10)
);

create index t1_i1 on t1(x,y);

-- Pick one of the three following pairs of rows
insert into t1(x,y) values(1,'a');
insert into t1(x,y) values(1,null);

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'b');

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'a');

commit;

-- A pair to be skipped
insert into t1(x,y) values(2,'c');
insert into t1(x,y) values(2,'c');

-- A pair to be reported
insert into t1(x,y) values(3,'d');
insert into t1(x,y) values(3,'e');

commit;

execute dbms_stats.gather_table_stats(user,'t1')
Notice the NULLs – comparisons with NULL make rows disappear, so might the new forms of the query get different results from the old?
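To see the disappearing-comparison effect in isolation, here's a minimal sketch (sqlite3 again – its three-valued NULL logic matches Oracle's for this case) using just the first pair of rows:

```python
# A comparison with NULL is neither true nor false, so the correlated
# subquery finds no row even though x = 1 has two different-looking
# y values.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1 (x int, y text);
    insert into t1(x,y) values (1,'a'),(1,null);
""")

# NULL <> 'a' evaluates to NULL, not true:
print(con.execute("select null <> 'a'").fetchone()[0])   # None

# ... so NOT EXISTS is satisfied for both rows and both are counted:
print(con.execute("""
    select count(*) from t1 w1
    where not exists (select 1 from t1 w2
                      where w2.x = w1.x and w2.y <> w1.y)
""").fetchone()[0])                                      # 2
```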
The original query returns a count of 4 rows whichever pair we select from the top 6 inserts.
With the NULL in the Y column the new forms report 2 and 4 rows respectively – so only the second query looks viable.
With the NULLs in the X column and differing Y values the new forms report 2 and 2 rows respectively – so even the second query is broken.
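The x-is-NULL failure can be sketched the same way in sqlite3 (with the second rewrite again expressed as an inline view in place of Oracle's nested SUM(COUNT(*))):

```python
# With NULLs in x, the original query and the COUNT(DISTINCT) rewrite
# disagree: the join predicate can never be true for a NULL x, but
# GROUP BY still collects the NULL rows into a single group.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1 (x int, y text);
    insert into t1(x,y) values (null,'a'),(null,'b'),
                               (2,'c'),(2,'c'),
                               (3,'d'),(3,'e');
""")

# w2.x = w1.x can never be true when x is NULL, so both NULL rows
# survive the NOT EXISTS:
original = con.execute("""
    select count(*) from t1 w1
    where not exists (select 1 from t1 w2
                      where w2.x = w1.x and w2.y <> w1.y)
""").fetchone()[0]

# ... but GROUP BY puts both NULL rows into one group with two distinct
# y values, so the HAVING clause throws them away:
rewrite_2 = con.execute("""
    select sum(c) from (select count(*) c from t1
                        group by x having count(distinct y) <= 1)
""").fetchone()[0]

print(original, rewrite_2)   # 4 2
```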
However, if we add “or x is null” to the HAVING clause of the second query it reports 4 rows for both of the first two test pairs.
Finally, having added the “or x is null” predicate, we check that it returns the correct 4 rows for the final test pair – and it does.
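For completeness, here's the repaired second query sketched in sqlite3 against the x-is-NULL data (once more using an inline view in place of Oracle's SUM(COUNT(*))):

```python
# The "or x is null" predicate lets the NULL-x group back through the
# HAVING clause, bringing the rewrite back into line with the original.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1 (x int, y text);
    insert into t1(x,y) values (null,'a'),(null,'b'),
                               (2,'c'),(2,'c'),
                               (3,'d'),(3,'e');
""")

result = con.execute("""
    select sum(c)
    from (select count(*) c from t1
          group by x
          having count(distinct y) <= 1 or x is null)
""").fetchone()[0]

print(result)   # 4, matching the original query
```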
It looks as if there is at least one solution to the problem that need only access the table once, though it then does two aggregations (hash group by in 11g). Depending on the data it's quite likely that this single scan with double hash aggregation will be more efficient than any of the plans that do a scan with a filter subquery, or a scan with a hash anti-join. On the other hand the performance difference might be small, and the rewritten query is just a little harder to comprehend.
I can’t help thinking that the “real” requirement is probably the one given in the textual restatement of the problem, and that the first rewrite is the one producing the “right” answers while the original query is producing the “wrong” answer.