A recent question on the OTN Database Forum asked:
I need to check if at least one record present in table before processing rest of the statements in my PL/SQL procedure. Is there an efficient way to achieve that considering that the table is having huge number of records like 10K.
I don’t think many readers of the forum would consider 10K a huge number of records; nevertheless it is a question that could reasonably be asked, and it should prompt a little discussion.
The first question to ask, of course, is: how often do you do this, and how important is it to be as efficient as possible? We don’t want to waste a couple of days of coding and testing to save five seconds every 24 hours. Some context is needed before charging into high-tech geek solution mode.
The next question is: what’s wrong with writing code that just does the job? If it finds that the job is complete after zero rows then you haven’t wasted any effort. This seems reasonable in (say) a PL/SQL environment, where we might compare the following pair of strategies:
Option 1:
=========
-- execute a select statement to see if any rows exist

if (flag is set to show rows) then
        for r in (select all the rows) loop
                do something for each row
        end loop;
end if;

Option 2:
=========
for r in (select all the rows) loop
        do something for each row;
end loop;
If this is the type of activity you have to do then it does seem reasonable to question the sense of adding an extra statement to see if there are any rows to process before processing them. But there is a possible justification for doing so. The query to find just one row may produce a very efficient execution plan, while the query to find all the rows may have to do something much less efficient even when (eventually) it finds that there is no data. Think of the differences you often see between a first_rows_1 plan and an all_rows plan; think about how Oracle can use index-only access paths and table elimination – if you’re only checking for existence you may be able to produce a MUCH faster plan than you can for selecting the whole of the first row.
Next question, if you think that there is a performance benefit from the two-stage approach: is the performance gain worth the cost (and risk) of adding a near-duplicate statement to the code? That’s two statements that have to be maintained every time you make a change. Maybe it’s worth “wasting” a few seconds on every execution to avoid getting the wrong results (or losing an odd extra hour of programmer time) once every few months. Bear in mind, also, that the optimizer now has to optimize two statements instead of one – you may not notice the extra CPU usage in testing, but perhaps in the live environment the execution benefit will be eroded by the optimization cost.
Next question, if you still think that the two-stage process is a good idea: will it result in an inconsistent database state? If you select and find a row, then run the main query and find that there are no rows to process because something has modified and “hidden” the row you found on the first pass – what are you going to do? Will this make the program crash? Will it produce an erroneous result on this run, or will a silent side effect be that the next run produces the wrong results? (See Billy Verreynne’s comment on the original post.) Should you set the session to serializable before you start the program, or maybe lock a critical table to make sure it can’t change?
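If consistency really matters, the protective steps might be sketched like this (a sketch only – the table name t1 is illustrative, and each option has its own cost in concurrency):

```sql
-- Option A: run the whole job at serializable isolation so the
-- existence check and the main query see the same committed state.
alter session set isolation_level = serializable;

-- Option B: lock the critical table in share mode so no-one can
-- modify it while the two-pass process runs (writers will block
-- until this session commits or rolls back).
lock table t1 in share mode;
```

Either option trades concurrency for safety, so the choice depends on how busy the table is and how long the processing loop runs.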
So, assuming you’ve decided that some form of “check for existence then do the job” is both desirable and safe, what’s the most efficient strategy? Here’s one of the smarter solutions, one that minimises risk and effort (in this case in a PL/SQL environment):
select  count(*)
into    m_counter
from    dual
where   exists ({your original driving select statement})
;

if m_counter = 0 then
        null;
else
        for c1 in {your original driving select statement} loop
                -- do whatever
        end loop;
end if;
The reason I describe this solution as smarter, with minimum risk and effort, is that (a) you use EXACTLY the same SQL statement in both locations so there should be no need to worry about making the same effective changes twice to two slightly different bits of SQL and (b) the optimizer will recognise the significance of the existence test and run in first_rows_1 mode with maximum join elimination and avoidance of redundant table visits. Here’s a little data set I can use to demonstrate the principle:
rem
rem     Script:         existence_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jul 2015
rem

create table t1
as
select
        mod(rownum,200)         n1,     -- scattered data
        mod(rownum,200)         n2,
        rpad(rownum,180)        v1
from
        dual
connect by
        level <= 10000 -- > comment to avoid WordPress formatting problem
;

delete from t1 where n1 = 100;
commit;

create index t1_i1 on t1(n1);

begin
        dbms_stats.gather_table_stats(
                user,
                't1',
                cascade          => true,
                method_opt       => 'for all columns size 1'
        );
end;
/
It’s just a simple table with an index, but the index isn’t very good for finding the data – the data is repetitive and widely scattered through the table: 10,000 rows with only 200 distinct values. But check what happens when you run the dual/exists test – first we run our “driving” query to show the plan that the optimizer would choose for it, then we run the existence test to show the different strategy the optimizer takes when the driving query is embedded:
alter session set statistics_level = all;

select  *
from    t1
where   n1 = 100
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));

select  count(*)
from    dual
where   exists (
                select * from t1 where n1 = 100
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));
Notice how I’ve enabled rowsource execution statistics and pulled the execution plans from memory with their execution statistics. Here they are:
select * from t1 where n1 = 100

-------------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |    38 (100)|      0 |00:00:00.01 |     274 |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |     50 |    38   (3)|      0 |00:00:00.01 |     274 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=100)


select count(*) from dual where exists ( select * from t1 where n1 = 100 )

---------------------------------------------------------------------------------------------------
| Id  | Operation           | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |       |      1 |        |     3 (100)|      1 |00:00:00.01 |       2 |
|   1 |  SORT AGGREGATE     |       |      1 |      1 |            |      1 |00:00:00.01 |       2 |
|*  2 |   FILTER            |       |      1 |        |            |      0 |00:00:00.01 |       2 |
|   3 |    FAST DUAL        |       |      0 |      1 |     2   (0)|      0 |00:00:00.01 |       0 |
|*  4 |    INDEX RANGE SCAN | T1_I1 |      1 |      2 |     1   (0)|      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NOT NULL)
   4 - access("N1"=100)
For the original query the optimizer did a full tablescan – that was the most efficient path. For the existence test the optimizer decided it didn’t need to visit the table for “*” and it would be quicker to use an index range scan to access the data and stop after one row. Note, in particular, that the scan of the dual table didn’t even start – in effect we’ve got all the benefits of a “select {minimum set of columns} where rownum = 1” query, without having to work out what that minimum set of columns was.
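For comparison, the hand-crafted equivalent that the dual/exists test saves you from writing might look like this (a sketch – in this demo the “minimum set of columns” happens to be just the indexed column, but in a multi-table query you would have to work that out for yourself):

```sql
-- Manual existence check: select only the indexed column and stop at
-- the first row, so the optimizer can satisfy the query from the
-- index alone without visiting the table.
select  n1
from    t1
where   n1 = 100
and     rownum = 1
;
```

The dual/exists version gives you the same effect without having to decide which columns (and which tables) can safely be dropped from the check.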
But there’s an even more cunning option – remember that we didn’t scan dual when there were no matching rows:
for c1 in (
        with driving as (
                select  /*+ inline */
                        *
                from    t1
        )
        select  /*+ track this */
                *
        from    driving d1
        where   n1 = 100
        and     exists (
                        select  *
                        from    driving d2
                        where   n1 = 100
                )
) loop
        -- do your thing
end loop;
In this specific case the factored subquery would automatically be copied inline so the hint here is actually redundant; in general you’re likely to find the optimizer materializing your subquery and bypassing the cunning strategy if you don’t use the hint. (This example is one of the special cases where subquery factoring doesn’t automatically materialize – there’s no where clause in the subquery.)
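To see the difference the hint makes, you could compare the inline version with a deliberately materialized one (a sketch – the /*+ materialize */ hint forces the temp-table transformation that would otherwise bypass the cunning strategy):

```sql
-- Forcing materialization: the factored subquery is written once to a
-- global temporary table and both references then read that copy, so
-- the cheap index-only existence test is no longer available.
with driving as (
        select /*+ materialize */ * from t1
)
select  *
from    driving d1
where   n1 = 100
and     exists (select * from driving d2 where n1 = 100)
;
```

Running dbms_xplan.display_cursor() against the two variants should show a LOAD AS SELECT into a SYS_TEMP table in the materialized plan, in place of the filter/index-range-scan shape shown below.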
Here’s the execution plan pulled from memory (after running this SQL through an anonymous PL/SQL block):
SQL_ID  7cvfcv3zarbyg, child number 0
-------------------------------------
WITH DRIVING AS ( SELECT /*+ inline */ * FROM T1 ) SELECT /*+ track
this */ * FROM DRIVING D1 WHERE N1 = 100 AND EXISTS ( SELECT * FROM
DRIVING D2 WHERE N1 = 100 )

---------------------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |    39 (100)|      0 |00:00:00.01 |       2 |
|*  1 |  FILTER            |       |      1 |        |            |      0 |00:00:00.01 |       2 |
|*  2 |   TABLE ACCESS FULL| T1    |      0 |     50 |    38   (3)|      0 |00:00:00.01 |       0 |
|*  3 |   INDEX RANGE SCAN | T1_I1 |      1 |      2 |     1   (0)|      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( IS NOT NULL)
   2 - filter("T1"."N1"=100)
   3 - access("T1"."N1"=100)
You’ve got just one statement – and you’ve only got one version of the complicated text because you put it into a factored subquery; but the optimizer manages to use one access path for one instantiation of the text and a different one for the other. You get an efficient test for existence and only run the main query if some suitable data exists, and the whole thing is entirely read-consistent.
I have to say, though, I can’t quite make myself 100% enthusiastic about this code strategy – there’s just a nagging little doubt that the optimizer might come up with some insanely clever trick to transform the existence test into something that’s supposed to be faster but does a lot more work. But maybe that’s only likely to happen on an upgrade, which is when you’d be testing everything very carefully anyway (wouldn’t you?), and you’ve got the “dual/exists” fallback position if necessary.
Footnote:
Does anyone remember the thing about reading execution plans “first child first”? This particular usage of an existence test is one of the interesting cases where it’s not the first child of a parent operation that runs first: it’s the case I often refer to as the “constant subquery”.
This is so good it should probably be adopted into the PL/SQL language (if not SQL itself) as a preemptory verb, let’s say “ifany” that could be fired on any “with” definition plus optionally added predicate that might preempt even the full planning process (and possibly read even more simply.) Bravo, my friend.
Comment by rsiz — July 29, 2015 @ 2:16 pm BST Jul 29,2015 |
Jonathan,
wouldn’t you put the “original query” completely inside the WITH subquery, so in your example including the “WHERE N1 = 100” predicate? That would make more sense to me (to avoid double maintenance) and correspond more to the previous suggestions.
The main query would then always look something like this:
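The code sample appears to have been lost in the comment formatting, but based on the surrounding description the shape Randolf means would presumably be something like this (a sketch reconstructed against the t1 demo, with the predicate moved inside the factored subquery):

```sql
-- Predicate inside the WITH clause: only one place to maintain it.
with driving as (
        select  /*+ inline */
                *
        from    t1
        where   n1 = 100
)
select  *
from    driving d1
where   exists (select * from driving d2)
;
```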
In that case the INLINE hint is then actually required again even in that simple example since there is a predicate in the WITH clause query triggering materialization.
For safety purposes adding an explicit NO_UNNEST hint to the EXISTS subquery along with a suitable comment could be helpful to avoid clever tricks of the optimizer (of course there might be (future) transformations that still transform the query into something unwanted even with the hint in place).
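In code, Randolf’s suggestion might look like this (a sketch – the no_unnest hint and its comment are the point of the example):

```sql
with driving as (
        select /*+ inline */ * from t1
)
select  *
from    driving d1
where   n1 = 100
and     exists (
                select  /*+ no_unnest */   -- keep the existence test as a
                        *                  -- filter subquery; don't let the
                from    driving d2         -- optimizer unnest it into a join
                where   n1 = 100
        )
;
```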
Randolf
Comment by Randolf Geist — July 29, 2015 @ 5:44 pm BST Jul 29,2015 |
Randolf,
There may have been some rationale behind the duplication when I wrote the note – possibly it was just an attempt at visual continuity – but it’s one I wrote with a delayed publication date over a week ago and I don’t recall why I did it this way. Your comment about including the predicate in the “with subquery” is most relevant, especially in view of my comment about “only one version of the complicated text”.
Comment by Jonathan Lewis — August 1, 2015 @ 10:31 pm BST Aug 1,2015 |
[…] Using subquery factoring as a strategy for checking for existence […]
Pingback by CTE Catalogue | Oracle Scratchpad — June 10, 2020 @ 6:46 pm BST Jun 10,2020 |