A question about parallel query and cardinality estimates appeared on OTN a little while ago that prompted me to write this note about helping the optimizer do the best job with the least effort. (A critical point in the correct answer to the original question is that parallel query may lead to “unexpected” dynamic sampling, which can make a huge difference to the choice of execution plans, but that’s another matter.)
The initial cardinality error in the plan came from the following predicate on a “Date dimension” table:
AR_DATE.CALENDAR_DT = AR_DATE.MONTH_END_DT
AND AR_DATE.CALENDAR_DT >= ADD_MONTHS(TRUNC(SYSDATE,'fmyy'), -36)
AND AR_DATE.MONTH_END_DT >= ADD_MONTHS(TRUNC(SYSDATE,'fmyy'), -36)
In the parallel plan the estimated number of rows on a full tablescan of the table was 742, while on the serial plan the same tablescan produced a cardinality of 1. You will appreciate that having an estimate of 1 (or less) that is nearly three orders of magnitude wrong is likely to lead to a very bad execution plan.
My first thought when I saw this was (based on a purely intuitive interpretation): “there’s one day every month that’s the last day of the month and we’re looking at roughly the last 36 months so we might predict a cardinality of about 36”. That’s still a long way off the 742 estimate and 1,044 actual for the parallel query, but it’s a warning flag that the serial estimate is probably an important error – it’s also an example of the very simple “sanity checking” mental exercises that can accompany almost any execution plan analysis.
My second thought (which happened to be wrong, and would only have been right some time well before version 10.2.0.5) was that the optimizer would treat the add_months() expressions as unknown values and assign a selectivity of 5% to each of the predicates, reducing the combined selectivity to 1/400th (i.e. 5% of 5%) of the selectivity it gave to the first predicate. In fact the optimizer evaluates the expressions and would have used the normal (required range / total range) calculation for those two predicates.
It’s the first predicate that I want to examine, though – how does the optimizer calculate a selectivity for it ? Here’s some code to generate sample data to use for testing.
rem
rem     Script:         month_end.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2016
rem

create table t1
nologging
as
select
        rownum                                                  id,
        to_date('01-Jan-2010','dd-mon-yyyy') + rownum - 1       calendar_date,
        add_months(
                trunc(to_date('01-Jan-2010','dd-mon-yyyy') + rownum - 1 ,'MM' ),
                1
        ) - 1                                                   month_end_date
from
        dual
connect by
        level <= trunc(sysdate) - to_date('01-Jan-2010','dd-mon-yyyy') + 1
;
execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')
This clunky bit of code gives me consecutive dates from 1st Jan 2010 up to “today”, with the month_end_date column holding the month-end date corresponding to the row’s calendar_date. So now we can check what the optimizer makes of the predicate calendar_date = month_end_date:
set autotrace on explain

select count(*) from t1 where calendar_date = month_end_date;

  COUNT(*)
----------
        79

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    16 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    16 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     1 |    16 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("CALENDAR_DATE"="MONTH_END_DATE")
Looking at operation 2 we can see that, in effect, the optimizer has considered two independent predicates “calendar_date = {unknown}” and “month_end_date = {unknown}” and taken the lower of the two selectivities – which means the cardinality estimate is 1 because the calendar_date column is unique across this table.
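If you want to see where that 1 comes from, a quick look at the column statistics shows the arithmetic (this is just a sketch, assuming the table was built by the script above): the selectivity of “column = {unknown}” is effectively 1/num_distinct, the optimizer takes the smaller of the two figures, and since calendar_date is unique the resulting cardinality rounds up to 1.

select
        column_name, num_distinct, num_nulls, density
from    user_tab_cols
where   table_name = 'T1'
and     column_name in ('CALENDAR_DATE','MONTH_END_DATE')
order by
        column_name
;

rem
rem     cardinality ~= num_rows * least(
rem             1/num_distinct(calendar_date),
rem             1/num_distinct(month_end_date)
rem     )
rem     With calendar_date unique this rounds up to 1.
rem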
There are various ways to work around problems like this. One of the simplest would be to tell Oracle to sample this table with the (table-level) hint /*+ dynamic_sampling(t1 1) */; in fact, since this predicate is effectively treated as an examination of two predicates the (cursor-level) hint /*+ dynamic_sampling(4) */ would also cause sampling to take place – note that level 4 or higher is required to trigger sampling for “multiple” predicates on a single table. As a general guideline we always try to minimise the risk of side effects so if this problem were embedded in a much larger query I would prefer the table-level hint over the cursor-level hint.
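For reference, the two hinted forms would look something like this – just a sketch of the syntax, using the levels discussed above:

rem     table-level hint: sample t1 only
select  /*+ dynamic_sampling(t1 1) */
        count(*)
from    t1
where   calendar_date = month_end_date
;

rem     cursor-level hint: level 4 (or higher) needed to sample for
rem     multiple predicates against a single table
select  /*+ dynamic_sampling(4) */
        count(*)
from    t1
where   calendar_date = month_end_date
;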
There are other options, though, that would allow you to bypass sampling – provided you can modify the SQL. The script I used to create this table also included the following statement:
alter table t1
add (
        date_offset1
                generated always as (calendar_date - month_end_date) virtual,
        date_flag
                generated always as (
                        case when calendar_date - month_end_date = 0 then 'Y' end
                ) virtual
);
In 12c I would declare these virtual columns to be invisible to avoid problems with any SQL that didn’t use explicit column lists (there’s a sketch of the syntax a little further on). Bear in mind that the optimizer needs statistics on the virtual columns to produce the estimates shown below, so table stats have to be gathered (or re-gathered) after adding them. For demonstration purposes I’ve set up two options – I can find the rows I want with one of two obvious predicates:
date_offset1 = 0
date_flag = 'Y'
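Going back to the 12c comment above, the invisible declaration might look something like the following – purely a sketch of the syntax, not part of the test script, and the column name date_flag2 is invented simply to avoid clashing with the column I’ve already created:

alter table t1
add (
        date_flag2 invisible
                generated always as (
                        case when calendar_date - month_end_date = 0 then 'Y' end
                ) virtual
);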
In fact there’s a third predicate I could use that doesn’t need to know about the virtual columns:
calendar_date - month_end_date = 0
Unfortunately I can’t arbitrarily swap the order of the two dates in the last predicate, and the optimizer won’t spot that it is also equivalent to “calendar_date = month_end_date”. Here are a few execution plans – for which the only significant bit is the cardinality estimate of the full tablescans:
select count(*) from t1 where date_flag = 'Y';

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     2 |     4  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     2 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |    79 |   158 |     4  (25)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_FLAG"='Y')


select count(*) from t1 where date_offset1 = 0;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |    78 |   312 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_OFFSET1"=0)


select count(*) from t1 where calendar_date - month_end_date = 0;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |    78 |   312 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("T1"."DATE_OFFSET1"=0)
It’s interesting to note that the optimizer has transformed the last predicate into the equivalent virtual column expression to do the arithmetic. You might also note that the date_flag option is slightly more accurate, but that’s because it’s based on an expression which is null for the rows we don’t want while the date_offset1 column has a value for every row and a little bit of arithmetical rounding comes into play. You might also note that there’s a small cost difference – which I’d ascribe to the CPU cost that the optimizer has added for the CASE expression being applied on top of the simple date arithmetic.
Of course, whatever else you might play around with when working around a cardinality problem like this, I think the strategic aim for a data warehouse system would be to get a REAL flag column on the table and populate it at data loading time if month-end dates played an important part in the proceedings – though I have to say that the virtual flag column is one I do rather fancy.
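Just to make that idea concrete, here’s the sort of thing I mean – a minimal sketch with an invented column name (month_end_flag), and in a real system you would set the flag as part of the load rather than patching the data afterwards:

alter table t1 add (month_end_flag varchar2(1));

update  t1
set     month_end_flag = 'Y'
where   calendar_date = month_end_date
;

commit;

rem     re-gather stats so the optimizer has figures for the new column
execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')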
Comments and related questions are welcome.