Oracle Scratchpad

January 4, 2016

ANSI bug

Filed under: ANSI Standard,Bugs,Oracle — Jonathan Lewis @ 1:12 pm GMT Jan 4,2016

In almost all cases the SQL you write using the ANSI (SQL-92) standard syntax is tranformed into a statement using Oracle’s original syntax before being optimised – and there are still odd cases where the translation is not ideal.  This can result in poor performance, it can result in wrong results. The following examples arrived in my in-tray a couple of weeks ago:

with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select *
    from (
        select sum(table3.table2.my_number) the_answer
            from table1
            left join table2 on table1.my_number = table2.my_number
            group by table1.my_number
        );


with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select sum(table3.table2.my_number) the_answer
    from table1
    left join table2 on table1.my_number = table2.my_number
    group by table1.my_number;

Notice the reference to table3.table2.my_number in the select list of both queries – where does the “table3” bit come from ? These queries should result in Oracle error ORA-00904: “TABLE3″.”TABLE2″.”MY_NUMBER”: invalid identifier.

If you’re running 11.2.0.4 (and, probably, earlier versions) both queries produce the following result:


THE_ANSWER
----------
         1

1 row selected.

If you’re running 12.1.0.2 the first query produces the ORA-00904 error that it should do, but the second query still survives to produce the same result as 11.2.0.4.

Update (8th July 2016)

I’ve just seen a reference to this blog note on the OTN database forum that prompted me to point out that the problem doesn’t occur if you switch to traditional Oracle syntax – the expected Oracle error message appears in 11.2.0.4:

SQL> ed temp

with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select *
    from (
        select sum(table3.table2.my_number) the_answer
/*
            from table1
            left join table2 on table1.my_number = table2.my_number
*/
            from table1, table2
            where table2.my_number(+) = table1.my_number
            group by table1.my_number
        )
 15  ;
        select sum(table3.table2.my_number) the_answer
                   *
ERROR at line 6:
ORA-00904: "TABLE3"."TABLE2"."MY_NUMBER": invalid identifier


SQL> ed temp3

with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select *
    from (
        select sum(table3.table2.my_number) the_answer
--            from table1
--            left join table2 on table1.my_number = table2.my_number
            from table1, table2
            where table2.my_number(+) = table1.my_number
            group by table1.my_number
        )
 13  ;
        select sum(table3.table2.my_number) the_answer
                   *
ERROR at line 6:
ORA-00904: "TABLE3"."TABLE2"."MY_NUMBER": invalid identifier

22 Comments »

  1. Hi Jonathan,

    we have always been learning from your blogs,
    I have executed above queries , first is having issue not the second.

    [oracle@nava ~]$ SQL
    
    SQL*Plus: Release 12.1.0.2.0 Production on Mon Jan 4 19:00:09 2016
    
    Copyright (c) 1982, 2014, Oracle.  All rights reserved.
    
    
    Connected to:
    Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
    With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
    
    SQL> with
        table1 as ( select 1 my_number from dual ),
        table2 as ( select 1 my_number from dual )
    select *
        from (
            select sum(table3.table2.my_number) the_answer
                from table1
                left join table2 on table1.my_number = table2.my_number
                group by table1.my_number
            );
       2    3    4    5    6    7    8    9   10    select sum(table3.table2.my_number) the_answer
                       *
    ERROR at line 6:
    ORA-00904: "TABLE3"."TABLE2"."MY_NUMBER": invalid identifier
    
    
    SQL>
    SQL> with
        table1 as ( select 1 my_number from dual ),
        table2 as ( select 1 my_number from dual )
    select sum(table3.table2.my_number) the_answer
        from table1
        left join table2 on table1.my_number = table2.my_number
        group by table1.my_number;  2    3    4    5    6    7
    
    THE_ANSWER
    ----------
             1
    

    Regards,
    Nava

    Comment by nava — January 4, 2016 @ 1:34 pm GMT Jan 4,2016 | Reply

  2. So to be clear, the problem is that a query that should be flagged as invalid are being allowed to pass (and run, somehow)? This bug doesn’t affect queries that are actually valid?

    I think I’ve seen this occur, come to think of it.

    Off the top of my head, another issue I’ve encountered: if you have a WITH clause that has the same name as a table, weird things happen. (I don’t remember offhand what those weird things are.)

    I’m inclined to say that the issue might be tied to the use of WITH, not because of ANSI joins. Well, if you need both to manifest the bug, then it’s tied to both… but the wonkiness of WITH naming makes me suspect it’s the WITH code, if we were to somehow rummage around inside the DBMS source code.

    Comment by Jason Bucata — January 4, 2016 @ 1:56 pm GMT Jan 4,2016 | Reply

    • Jason,

      Invalid queries allowed to pass is the immediate problem. I don’t know if the same bug has a side effect on valid queries (e.g. like losing a table name and thereby allowing the wrong subquery to capture a column and return wrong results).

      I wouldn’t be surprised if your example turned out to be a combination of WITH and ANSI. Traditional Oracle syntax would tend to spot the ambiguity very quickly whereas the ANSI transforms can add so many layers to what seems to be a simple join that it’s easy to imagine it losing track of the ambiguity.

      Comment by Jonathan Lewis — January 4, 2016 @ 2:04 pm GMT Jan 4,2016 | Reply

  3. Jonathan,

    It seems to be a known bug 13570803 : QUERY WITH CASE CLAUSE IN ITS SUBQUERY DOES NOT ERROR WITH EXPECTED ORA-904. Though, according to the note the bug should be fixed in the 12.2, but your test confirmed that it is fixed in 12.1.

    Comment by Nenad Noveljic — January 4, 2016 @ 2:40 pm GMT Jan 4,2016 | Reply

    • Nenad,

      At best only half-fixed; but there is no CASE clause in this example, so it may not be the same bug, and document 13570803.8 reports the bug as fixed in 11.2.0.4

      Comment by Jonathan Lewis — January 4, 2016 @ 2:57 pm GMT Jan 4,2016 | Reply

  4. There is description in v$system_fix_control which mentions an ANSI re-architecture in 12.1.0.2.
    Do you know any details about this?
    Perhaps this re-architecture is responsible and perhaps it wasn’t backported to 11.2.0.4.

    Comment by Dom Brooks — January 4, 2016 @ 3:54 pm GMT Jan 4,2016 | Reply

  5. Hi Jonathan,

    I seem to recall seeing this same identifier issue in a very recent blog post from you, though I don’t think it was called out.

    It seems that when 9i was created an architectural decision was made to first translate ANSI SQL into Oracle specific SQL and then optimize.

    It is not too difficult to imagine this has been the topic of debate at times within Oracle.

    Changing that decision to either translate all to ANSI, or choose different paths based on SQL type would have some profound affects on the optimizer code.

    Perhaps just more attention needs to be given to the ANSI -> Oracle translation, as doing an MOS search on ‘ANSI’ and ‘bug ‘ brings up a length list of hits.

    Comment by jkstill — January 4, 2016 @ 6:07 pm GMT Jan 4,2016 | Reply

  6. Here is a version of the bug that doesn’t use a with clause.

    select table3.t2.dummy
    from dual t1
    left join dual t2 on t1.dummy = t2.dummy;

    Comment by Shawn — January 5, 2016 @ 5:07 pm GMT Jan 5,2016 | Reply

    • Shawn,

      Thanks – that produced the appropriate error message on 12.1.0.2 for me, but erroneously produced an answer in 11.2.0.4

      Comment by Jonathan Lewis — January 6, 2016 @ 2:03 pm GMT Jan 6,2016 | Reply

  7. Ran into a nasty problem with LEFT OUTER JOIN on partitioned (and remote tables). Table is partitioned on pkey

    SELECT COUNT (*) 
      FROM t1 a
           LEFT OUTER JOIN t2 b
              ON (    a.col1a = b.col2b
                  AND a.pkey = b.pkey
                  and a.pkey = 20151231
                  AND b.pkey = 20151231)
    

    Returned ~994,000,000 in 6 hours. Result is wrong due to incorrect access path of outer join and failure to partition prune on one of the tables. (Same result/timing with specifying only 1 of the pkey columns).

    Rewrote using Oracle syntax, Correct result, correct partition pruning, 4 minutes execution.

    Comment by Jeff Jacobs — February 17, 2016 @ 6:24 pm GMT Feb 17,2016 | Reply

    • Jeff,

      Thanks for the note.
      You didn’t mention a version, but I’ve just recreated it (minus the 6 hours wait) on 11.2.0.4 and 12.1.0.2.

      Comment by Jonathan Lewis — February 17, 2016 @ 7:06 pm GMT Feb 17,2016 | Reply

    • Jeff,

      Having just tweeted about this, it crossed my mind that I wasn’t 100% convinced that the ANSI was doing the wrong thing and producing the wrong result.
      For partition elimination on t1 shouldn’t “a.pkey = 2015131” be in the WHERE clause rather than the ON clause ?
      What did your traditional syntax query look like ?

      Comment by Jonathan Lewis — February 18, 2016 @ 9:02 am GMT Feb 18,2016 | Reply

  8. Not per my understanding of ANSI LEFT JOIN (which of course could be wrong :-)…

    The execution plan is clearly wrong. The number of rows where t1.cycle_dt_num = 20151231 is approximately 76K, or one partition, but it returns ~997M rows. Not only is there no partition pruning But look at the CASE in the access path!!!:

    ---------------------------------------------------------------------------------------------------------------------
    | Id  | Operation              | Name     | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time   | Pstart| Pstop | Inst   |
    ---------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT REMOTE|          |      1 |    26 |       |    33M  (1)|246:41:16 |       |       |        |
    |   1 |  SORT AGGREGATE        |          |      1 |    26 |       |            |          |       |       |        |
    |   2 |   PARTITION RANGE ALL  |          |    996M|    24G|       |    33M  (1)|246:41:16 |     1 |  1381 |        |
    |*  3 |    HASH JOIN OUTER     |          |    996M|    24G|    17M|    33M  (1)|246:41:16 |       |       |        |
    |   4 |     TABLE ACCESS FULL  | t1       |    996M|    12G|       |    16M  (1)|120:03:08 |     1 |  1381 |   PDST |
    |   5 |     TABLE ACCESS FULL  | t2       |   1002M|    12G|       |    16M  (1)|121:10:04 |     1 |  1381 |   PDST |
    ---------------------------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       3 - access("A3"."CYCLE_DT_NUM"="A1"."CYCLE_DT_NUM"(+) AND 
                  "A3"."col1a"="A1"."col2b"(+) AND "A3"."CYCLE_DT_NUM"=CASE  WHEN 
                  ("A1"."col2b"(+) IS NOT NULL) THEN 20151231 ELSE 20151231 END )
    

    (NOTE: original query is production to a remote database and “fully” remote; I have no access to any DBA views to do any further analysis, and no time at home :-)

    To simplify and have apples to apples, the query:

    
    SELECT COUNT (*) 
      FROM t1 a
           LEFT OUTER JOIN t2 b
              ON (    a.col1a = b.col2b
                  AND a.pkey = b.pkey
                  and a.pkey = 20151231)
    [sourcecode]
    
    returns the same result and performance (~6 hours)..
    
    The Oracle syntax:
    [sourcecode]
    
    SELECT COUNT (*)
      FROM t1 a, t2 b
     WHERE     a.col1a = b.col2b(+)
           AND a.cycle_dt_num = b.cycle_dt_num(+)
           AND a.pkey = 20151231;;
    

    Returns the correct result in under 5 minutes.

    Even more interesting:

    
    SELECT COUNT (*) 
      FROM t1 a
           LEFT OUTER JOIN t2 b
              ON (    a.col1a = b.col2b
                  AND a.pkey = b.pkey)
                  where a.pkey = 20151231)
    

    Also returns correct number of rows in under 5 minutes!!!

    Comment by Jeff Jacobs, Oracle Ace — February 21, 2016 @ 7:27 pm GMT Feb 21,2016 | Reply

    • Cycle_dt_num should be pkey.

      Comment by Jeff Jacobs — February 21, 2016 @ 7:55 pm GMT Feb 21,2016 | Reply

    • Jeff,

      This is expected behaviour – and although your original comment was about partitioned tables and distributed joins neither of those conditions is necessary.

      The ON clause tells you how to test for a match, so when you have “ON a.pkey = 20151231” the check is “this row from B matches the current A row only if a.pkey = 20151231”, it does not mean “look only at rows in A that have pkey = 20151231”. So Oracle has to look at EVERY row in A, and fails that test on almost all of them, and then returns the row anyway because the join is an outer join.

      This is the critical difference between the ON clause and the WHERE clause, and it only causes problems in outer joins (and it’s the bit of ANSI where I have to stop and think VERY carefully whenever I see it).

      Comment by Jonathan Lewis — February 23, 2016 @ 8:48 am GMT Feb 23,2016 | Reply

  9. Version is 11.02.

    Comment by Jeff Jacobs — February 22, 2016 @ 10:10 pm GMT Feb 22,2016 | Reply

  10. This mere mortal humbly receives enlightenment from he who is forever an “alien of unusual abilities” (no matter what the INS says :-)

    I didn’t think the partition or distributed were necessary contributors, just additional info. I kept thinking that I was missing a key difference with the ANSI LEFT join, but since the result was not at all what was desired, I was more focused on fixing the user’s query…

    Thanks again!
    .

    Comment by Jeff Jacobs — February 23, 2016 @ 6:32 pm GMT Feb 23,2016 | Reply

  11. Jeff,

    Additional information can be dangerous. I can’t remember the name of the fallacy but I fell into the trap of assuming that the additional information was the relevent information – viz: I assumed that “something changed” when the query was distributed and (made very obvious when) partitioned. It wasn’t until the following day that I realised that that was simply an assumption I had made.

    Comment by Jonathan Lewis — February 24, 2016 @ 9:53 am GMT Feb 24,2016 | Reply

    • The only thing that changed was hiding the actual column names. I didn’t assume that the alleged bug would reproduce in a local database.

      I don’t know what to call my fallacy. A little knowledge can be useful? I knew that there were bugs in ANSI syntax and that enabled me to solve the problem even if my root cause analysis was incorrect.

      Comment by Jeff Jacobs — February 24, 2016 @ 3:59 pm GMT Feb 24,2016 | Reply

      • Jeff,

        I was the one guilty of exercising the fallacy – you cited the available evidence, and I extrapolated it to include something that you had not said.

        Comment by Jonathan Lewis — February 24, 2016 @ 4:14 pm GMT Feb 24,2016 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.