Oracle Scratchpad

July 12, 2012

PK Problem

Filed under: Execution plans,Oracle — Jonathan Lewis @ 4:18 pm BST Jul 12,2012

I have been saying for a very long time (probably since some time in the last century) that if you want to add a primary key (or unique) constraint to a table where there might be some duplicate data then one of the best strategies for doing so might be to create a non-unique index (with the online option), then add the constraint in the state enable novalidate, and then validate the constraint. For example:

create table t1
as
select
	*
from
	all_objects
;

create index t1_i1 on t1(object_id) online;
-- collect stats
alter table t1 add constraint t1_pk primary key(object_id) enable novalidate;
alter table t1 modify constraint t1_pk validate;


The theory was simple – the online build would require only a brief period of locking, the addition of the constraint would require only a brief period of locking, after which all new (or modified) data would be checked for uniqueness; and then the validation would simply walk the index in order checking for duplicates. If you wanted you could always choose the option to list the rowids of duplicates into an exceptions table.
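The exceptions option mentioned above can be sketched as follows – a hedged example only, assuming the standard EXCEPTIONS table has been created by the script $ORACLE_HOME/rdbms/admin/utlexcpt.sql (adjust the table name to taste):

```sql
-- Sketch only: assumes the EXCEPTIONS table created by utlexcpt.sql
-- (columns row_id, owner, table_name, constraint).
alter table t1 enable validate constraint t1_pk
	exceptions into exceptions;

-- If the validation fails with ORA-02437 the duplicates can be inspected:
select
	t1.rowid, t1.object_id
from
	t1,
	exceptions	e
where
	t1.rowid = e.row_id
;
```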

In the last 48 hours it has been brought to my attention that that last assumption is, and perhaps has always been, wrong. I’ve just been running some tests of the mechanism and checked what Oracle does about the validation step all the way back to 8.1.7.4.

Depending on whether the constraint is single column or multi-column, depending on whether you’re adding a unique or primary key constraint, depending on whether the column(s) are pre-declared as NOT NULL, and depending on the exact version of Oracle, there are several slightly different scenarios to review if you want to do an exhaustive analysis; but here’s the SQL, with execution plan, that Oracle ran for the validate step in the example above on 11.2.0.3:

select /*+ all_rows ordered */
	A.rowid, :1, :2, :3
from
	TEST_USER.T1 A,
	(
	select /*+ all_rows */
		OBJECT_ID
	from
		TEST_USER.T1 A
	where	(OBJECT_ID is not null)
	group by
		OBJECT_ID
	having
		count(1) > 1
	) B
where
	(A.OBJECT_ID = B.OBJECT_ID)
union all
select
	/*+ all_rows ordered */
	A.rowid, :1, :2, :3
from
	TEST_USER.T1 A
where
	(OBJECT_ID is null)
;

-------------------------------------------------------------------------------------------
| Id  | Operation                 | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |       |  2706 | 81167 |       |   149   (3)| 00:00:02 |
|   1 |  UNION-ALL                |       |       |       |       |            |          |
|*  2 |   HASH JOIN               |       |  2705 | 81150 |  1536K|   149   (3)| 00:00:02 |
|   3 |    INDEX FAST FULL SCAN   | T1_I1 | 54081 |   897K|       |    34   (0)| 00:00:01 |
|   4 |    VIEW                   |       |  2705 | 35165 |       |    36   (6)| 00:00:01 |
|*  5 |     FILTER                |       |       |       |       |            |          |
|   6 |      HASH GROUP BY        |       |  2705 | 13525 |       |    36   (6)| 00:00:01 |
|   7 |       INDEX FAST FULL SCAN| T1_I1 | 54081 |   264K|       |    34   (0)| 00:00:01 |
|*  8 |   FILTER                  |       |       |       |       |            |          |
|   9 |    INDEX FAST FULL SCAN   | T1_I1 | 54081 |   897K|       |    34   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("A"."OBJECT_ID"="B"."OBJECT_ID")
   5 - filter(COUNT(*)>1)
   8 - filter(NULL IS NOT NULL)

A couple of background points to note:

- This execution plan came from “explain plan” and a call to dbms_xplan.display, but it does match the run-time plan as reported in the trace file.
- The “intent” of the code is to identify all the rowids for rows where the same object_id has been used more than once (if there are none then the PK is valid).
- Because of the standard optimizer arithmetic for “having count(*) > {const}”, the optimizer assumes that duplicates account for 5% of the data.
- The second query block in the union all allows Oracle to capture rows where the primary key can’t be valid because the column is null. This is a redundant check in my example because the column is defined to be not null – and if you check line 8 of the execution plan you can see that there is a FILTER operation (“null is not null”) which means that part of the query will be short-circuited and do no work.
- The three bind variables in the main select hold the table owner, table name, and constraint name – as required by the exceptions table (not that we’ve asked Oracle to capture exceptions in this example).

I think we can easily explain the redundancy in the code by assuming that the developer who wrote it decided to maximise code reuse; but why is there an ordered hint in it that (surely) maximises the inefficiency of the code? The resulting execution plan for any large index is almost fixed by that hint. Oracle is going to build an in-memory (if it fits) hash table of every entry in the index, then do a massive aggregation job to generate the data to probe the hash table. As a consequence, it’s quite likely that we will end up doing the equivalent of dumping the whole index into the temporary tablespace twice – once for the hash table, once for the aggregation – and this might be something we’d like to avoid.

But if we’re hoping to enable a primary key constraint aren’t we expecting a very small volume of data (i.e. a small number of duplicates) in the aggregate, and wouldn’t it make sense to do a nested loop driven by that small number rather than a hash join? And since we have a suitable index in place, could we generate the aggregate without doing any sorting? In other words, can we avoid doing any I/O to the temporary tablespace and produce a (pseudo-)plan like:

NESTED LOOP
    Aggregated object_id having count(*) > 1
    Index range scan t1_i1

Since the code is embedded in a package we can’t change it but, for very special (extremely large) cases, we might decide that we need to change the execution plan for the query – so we might look at the option for creating an SQL Plan Baseline for a modified version of the query, and then attaching that plan to this query. Let’s work on that idea in steps – first we’ll just try aggregating the object ids by walking the index in order:
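For reference, the baseline mechanics I have in mind follow the usual “attach the plan of a hinted statement to the original statement” pattern with dbms_spm. This is a sketch only – the sql_ids, plan hash values and the sql_handle are placeholders you would pick up from v$sql and dba_sql_plan_baselines, not values from my test:

```sql
-- Sketch only: all &-substitution values are placeholders.
declare
	n	pls_integer;
begin
	-- Step 1: create a (disabled) baseline for the original statement,
	-- which gives us a sql_handle to attach alternative plans to.
	n := dbms_spm.load_plans_from_cursor_cache(
		sql_id          => '&original_sql_id',
		plan_hash_value => &original_plan_hash_value,
		enabled         => 'NO'
	);

	-- Step 2: load the plan of the hinted rewrite against that handle.
	n := dbms_spm.load_plans_from_cursor_cache(
		sql_id          => '&hinted_sql_id',
		plan_hash_value => &hinted_plan_hash_value,
		sql_handle      => '&sql_handle_from_step_1'
	);
end;
/
```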

select * from (
	select /*+ index(t1(object_id)) */
		object_id
	from
		t1
	where
		object_id is not null
	group by
		object_id
	having
		count(1) > 1
)
;
--------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |  2705 | 35165 |   121   (0)| 00:00:02 |
|   1 |  VIEW                  |       |  2705 | 35165 |   121   (0)| 00:00:02 |
|*  2 |   FILTER               |       |       |       |            |          |
|   3 |    SORT GROUP BY NOSORT|       |  2705 | 13525 |   121   (0)| 00:00:02 |
|   4 |     INDEX FULL SCAN    | T1_I1 | 54081 |   264K|   121   (0)| 00:00:02 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(COUNT(*)>1)

In this plan we walk the index in order, keeping a running count for each object_id and discarding any ids where the count is just 1. Note particularly that the aggregation step is “sort group by nosort” – we DON’T sort the data. In the absence of the index() hint my system chose to do an “index fast full scan” with a “hash group by” – it’s up to you to decide which of those two plans is likely to be the more efficient way to generate the small number of exceptions in your case.

Once we have this starting step we can try to use it as the first step of a nested loop (since we believe it will return a handful of rows rather than the 5% indicated by the optimizer):

select
	--+ leading(b a) use_nl(a) index(a(object_id)) no_merge(b)
	a.rowid
from
	(
	select /*+ index(t1 (object_id)) */
		object_id
	from
		t1
	where
		object_id is not null
	group by
		object_id
	having
		count(1) > 1
	)
	b ,
	t1	a
where
	a.object_id = b.object_id
;

---------------------------------------------------------------------------------
| Id  | Operation               | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |       |  2705 | 81150 |   121   (0)| 00:00:02 |
|   1 |  NESTED LOOPS           |       |  2705 | 81150 |   121   (0)| 00:00:02 |
|   2 |   VIEW                  |       |  2705 | 35165 |   121   (0)| 00:00:02 |
|*  3 |    FILTER               |       |       |       |            |          |
|   4 |     SORT GROUP BY NOSORT|       |  2705 | 13525 |   121   (0)| 00:00:02 |
|   5 |      INDEX FULL SCAN    | T1_I1 | 54081 |   264K|   121   (0)| 00:00:02 |
|*  6 |   INDEX RANGE SCAN      | T1_I1 |     1 |    17 |     0   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(COUNT(*)>1)
   6 - access("A"."OBJECT_ID"="B"."OBJECT_ID")

This is exactly what we want – as each row comes out of the aggregate rowsource at line 2, we access the index to get the details that we need of the multiple rows (just the rowids) – and the relevant index leaf block is the one we’ve just walked to generate the aggregate – so no further random I/O needed. This HAS to be pretty efficient.

There’s just one problem – I lied.

I’ve hacked the plan by hand; when you embed the previous query as an in-line view the nosort option disappears. Line 4 should read “SORT GROUP BY” – I edited in the nosort by hand – Oracle actually sorts the entire result set before starting the nested loop. In fact it’s slightly worse than that, because by default the optimizer chooses a “HASH GROUP BY”, which I blocked by setting the hidden parameter _gby_hash_aggregation_enabled to false. Looking at the costing, by the way, it struck me that it was just possible that Oracle was doing a nosort sort – the incremental cost was zero – so I enabled the 10046 and 10032 traces to show that Oracle wrote the result set to the temporary tablespace, then read it back, and did actually record the expected number of records sorted.
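For anyone who wants to repeat the experiment, the session-level settings I’m describing look something like this – strictly a test-system sketch, since hidden parameters and numeric events should not be set casually on production systems:

```sql
-- Test-system sketch only.
alter session set "_gby_hash_aggregation_enabled" = false;

-- 10032 reports sort statistics; 10046 level 8 includes wait events.
alter session set events '10032 trace name context forever';
alter session set events '10046 trace name context forever, level 8';
```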

I couldn’t find any way to avoid this undesirable sort by hinting – so that seems to have scuppered our plans for using an SQL Plan Baseline to eliminate the large volume of writes to the temporary tablespace (note, we might still be happy to do an index fast full scan at this point with a hash group by – it should be better than the default plan because it will only dump the content of the index to the temporary tablespace once.)

But our plans might not be completely wrecked. It is possible to get the intermediate result set with a nosort and then use it as follows:

with b as (
	select /*+ materialize index(t1(object_id))  */
		object_id
	from
		t1
	where
		object_id is not null
	group by
		object_id
	having
		count(1) > 1
)
select
	--+ leading(b a) use_nl(a) index(a(object_id))
	a.rowid
from
	b ,
	t1	a
where
	a.object_id = b.object_id
;

----------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                             |  2705 | 81150 |   123   (0)| 00:00:02 |
|   1 |  TEMP TABLE TRANSFORMATION |                             |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D6608_4017FEF5 |       |       |            |          |
|*  3 |    FILTER                  |                             |       |       |            |          |
|   4 |     SORT GROUP BY NOSORT   |                             |  2705 | 13525 |   121   (0)| 00:00:02 |
|   5 |      INDEX FULL SCAN       | T1_I1                       | 54081 |   264K|   121   (0)| 00:00:02 |
|   6 |   NESTED LOOPS             |                             |  2705 | 81150 |     2   (0)| 00:00:01 |
|   7 |    VIEW                    |                             |  2705 | 35165 |     2   (0)| 00:00:01 |
|   8 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9D6608_4017FEF5 |  2705 | 13525 |     2   (0)| 00:00:01 |
|*  9 |    INDEX RANGE SCAN        | T1_I1                       |     1 |    17 |     0   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(COUNT(*)>1)
   9 - access("A"."OBJECT_ID"="B"."OBJECT_ID")

If we can rewrite the SQL to use subquery factoring and then materialize the subquery (which should only produce a small temporary table) we can get the nosort back and get the nested loop – and I didn’t even have to play with a hidden parameter to do it. But how do we rewrite the SQL that’s generated by code embedded in a package? First of all, here’s the SQL (with plan) I would like to use – I’ve eliminated the original hints, introduced a subquery factoring clause with materialization and an index hint:


with b as (
	select /*+ materialize index(t1 (object_id))  */
		object_id
	from
		t1
	where
		object_id is not null
	group by
		object_id
	having
		count(1) > 1
)
select
	A.rowid, :1, :2, :3
from
	t1 A,
	B
where
	(a.object_id = b.object_id)
union all
select
	a.rowid, :1, :2, :3
from
	t1 A
where
	object_id is null
;

----------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                             |  2706 | 81167 |     2   (0)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION |                             |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D660D_4017FEF5 |       |       |            |          |
|*  3 |    FILTER                  |                             |       |       |            |          |
|   4 |     SORT GROUP BY NOSORT   |                             |  2705 | 13525 |   121   (0)| 00:00:02 |
|   5 |      INDEX FULL SCAN       | T1_I1                       | 54081 |   264K|   121   (0)| 00:00:02 |
|   6 |   UNION-ALL                |                             |       |       |            |          |
|   7 |    NESTED LOOPS            |                             |  2705 | 81150 |     2   (0)| 00:00:01 |
|   8 |     VIEW                   |                             |  2705 | 35165 |     2   (0)| 00:00:01 |
|   9 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D660D_4017FEF5 |  2705 | 13525 |     2   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN       | T1_I1                       |     1 |    17 |     0   (0)| 00:00:01 |
|* 11 |    FILTER                  |                             |       |       |            |          |
|  12 |     INDEX FAST FULL SCAN   | T1_I1                       | 54081 |   897K|    34   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(COUNT(*)>1)
  10 - access("A"."OBJECT_ID"="B"."OBJECT_ID")
  11 - filter(NULL IS NOT NULL)

As you can see, we load a small (we hope) temporary table by doing an index walk with running count; we use that result set in a nested loop back into the index, and then we do a union all with a result set that we don’t attempt to create because the “null is not null” filter terminates that branch of the plan. All we have to do now is make Oracle run OUR query instead of its own.

Plan B – dbms_advanced_rewrite

The package dbms_advanced_rewrite basically allows you to tell Oracle that when it sees a particular piece of SQL it should run some other piece of SQL. You do have to be a little careful to make it work, of course – the text should be “sufficiently similar”, and the number and type of values used and returned has to be the same. Here’s an example of setting up a “rewrite equivalence”:

begin
        sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
                name             => 'TEST_HINT',
                source_stmt      => 'select * from t1 where n1 = 15',
                destination_stmt =>
                        'select /*+ index(v1.t1) */ * from v1 where n1 = 15',
                validate         => false,
                rewrite_mode     => 'general'
        );
end;
/

In this example, whenever the statement “select * from t1 where n1 = 15” reaches the database, it will execute the hinted SQL query against the view instead.
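A couple of prerequisites are worth noting before you try this yourself – check the documentation for your version, but roughly (a hedged sketch, with the schema name as an assumption):

```sql
-- Sketch of the prerequisites: the declaring schema needs execute
-- on the package (granted as SYS):
grant execute on sys.dbms_advanced_rewrite to test_user;

-- And the session that should pick up a rewrite declared with
-- rewrite_mode => 'general' needs:
alter session set query_rewrite_enabled = true;
alter session set query_rewrite_integrity = trusted;
```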

Following this example, all we have to do (and this is a technique that became available in 10g) is let Oracle generate the SQL that’s giving us the bad plan, and set up an equivalence so that what actually runs is our version with the subquery factoring clause instead – except we run into another problem when we try to declare the equivalence:

ERROR at line 1:
ORA-30389: the source statement is not compatible with the destination statement
ORA-32034: unsupported use of WITH clause
ORA-06512: at "SYS.DBMS_ADVANCED_REWRITE", line 29
ORA-06512: at "SYS.DBMS_ADVANCED_REWRITE", line 185
ORA-06512: at line 11

Brilliant idea number 2 isn’t going to work – subquery factoring isn’t supported with advanced rewrite. So I guess we’re just going to have to wait for Oracle to improve the code.

Summary

After many years of believing that Oracle did something sensible when validating a primary key or unique constraint against an existing index, I finally discovered that it was doing something weirdly inefficient that invariably (probably) will result in the content of a very large index being dumped to the temporary tablespace twice as the validation takes place.

I tried to find a way of making the existing code avoid any large joins and aggregations by working out how to construct an SQL Plan Baseline for the query that would make it use a different join order and walk the index in order to produce an optimal nested loop approach. A limitation in the optimizer made the plan unreachable. (A variation on the plan that did an index fast full scan and hash aggregation followed by a nested loop is still viable, and could be a better strategy than the default.)

Because I couldn’t use an SQL Plan Baseline to get the plan I wanted, I looked at the possibility of a rewrite that would allow me to aggregate without using any temporary space, and it turned out that materializing a subquery factoring clause would work. The only way to impose this on Oracle, though, would be through the package dbms_advanced_rewrite – and it is a current limitation of that package that it won’t allow you to use subquery factoring in the rewrite process.

It is worth noting that the work I’ve done was triggered by a question relating to a table of 1.8 Billion (with a B) rows and the way that the default validation process needed more temporary space than was available. Although the general principle of what to do isn’t going to change, the actual code you need would have to be tailored to the specific table and index every time you tried to apply the method. In general it’s the sort of task which you don’t do unless you’ve got a very good reason. I went to the trouble of working out a few possibilities (a) because I was curious and (b) because I thought it might make an interesting little study for the blog.

9 Comments

  1. Interesting study indeed!

    I *hope* it leads to Oracle performing pk constraint validations explicitly by a treewalk (not fast) full index scan when over size X, where you only have to keep 1 previous value (and the start of the next section if parallel) in memory per parallel thread, and even with exception tracking only the duplicates need to get written. Parallel scan points could be picked from the starting node, and a thread finishing would pick up a point not yet assigned, which would make the finishing times similar for all but the most pathological cases (which would be indicated as one of those few “you need an index rebuild” cases).

    Size X would default to the size of hash that could be held in memory routinely, and might be something you could tune or a hint could be applied to the constraint enablement to use the special procedure instead of standard set aggregation code. Sigh.

    I too never tested this. Thanks for quickly addressing this issue. Once again the hallmark of an actual expert is to publicly and cogently correct any discovered error. And you’ve gone on in great depth to depict the actual lay of the land.

    Nice work, my friend.

    Comment by rsiz — July 12, 2012 @ 6:32 pm BST Jul 12,2012

  2. Hi Jonathan,

    a great post, and very practical, too. I imagine that it could have a much wider application than making PK validation faster — it’s probably not the only place in Oracle internal code where an inefficient plan is hardwired for a maintenance operation. E.g. I hate the way Oracle implemented redefinition of multiple columns in one ALTER TABLE (in my case Oracle was doing as many full table scans as the number of columns to be modified) — and even though in this particular case the tricks you described won’t work, I would tend to believe there are dozens similar cases when they might help.

    Best regards,
    Nikolay

    Comment by savvinov — July 13, 2012 @ 1:50 pm BST Jul 13,2012

  3. [...] the constraint in the state enable novalidate, and then validate the constraint, a cool advice from Jonathan once [...]

    Pingback by Log Buffer #277, A Carnival of the Vanities for DBAs | The Pythian Blog — July 14, 2012 @ 4:19 am BST Jul 14,2012

  4. Hi Jonathan, please take a simple trick to avoid “unsupported use of WITH clause”:

    SQL> begin
      2          sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
      3                  name             => 'TEST_HINT2',
      4                  source_stmt      => 'select * from usr.t1 where n1 = 15',
      5                  destination_stmt =>
      6                          'with a as (select /*+ index(v1.t1) */ * from usr.v1 where n1 = 15) select * from a',
      7                  validate         => false,
      8                  rewrite_mode     => 'general'
      9          );
     10  end;
     11  /
     
    ORA-30389: the source statement is not compatible with the destination statement
    ORA-32034: unsupported use of WITH clause
    ORA-06512: at "SYS.DBMS_ADVANCED_REWRITE", line 29
    ORA-06512: at "SYS.DBMS_ADVANCED_REWRITE", line 185
    ORA-06512: at line 3
    
    SQL> begin
      2          sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
      3                  name             => 'TEST_HINT2',
      4                  source_stmt      => 'select * from usr.t1 where n1 = 15',
      5                  destination_stmt =>
      6                          'select * from (with a as (select /*+ index(v1.t1) */ * from usr.v1 where n1 = 15) select * from a)',
      7                  validate         => false,
      8                  rewrite_mode     => 'general'
      9          );
     10  end;
     11  /
     
    PL/SQL procedure successfully completed
    

    Comment by Valentin Nikotin — July 14, 2012 @ 10:11 am BST Jul 14,2012

    • The last thing for which I couldn’t find any workaround is

      *
      ERROR at line 1:
      ORA-30353: expression not supported for query rewrite
      ORA-06512: at "SYS.DBMS_ADVANCED_REWRITE", line 29
      ORA-06512: at "SYS.DBMS_ADVANCED_REWRITE", line 185
      ORA-06512: at line 13

      Comment by Valentin Nikotin — July 15, 2012 @ 7:18 pm BST Jul 15,2012

    • Valentin,

      That’s a brilliant workaround.
      I think the ORA-30389 is a re-interpretation of the underlying ORA-32034 because Oracle tests the validity of the destination statement by attempting to parse a statement of the form:

      (old_statement)
      union all
      (new_statement)
      
      -- example
      
      select object_id from user_objects
      union all
      with v1 as (select data_object_id  from user_objects)
      
      

      Conveniently your workaround hides the factored subquery from the union all.

      Comment by Jonathan Lewis — July 19, 2012 @ 9:42 am BST Jul 19,2012

  5. Would globally hash-partitioning the index change the situation any? For a sufficiently large table it’s a good idea anyway (assuming partitioning is licensed, of course).

    Oh, and since you’re a Brit, does “Billion with a B” mean 1e9 or 1e12 (which would be a trillion to us Yankees)? If the latter, hash partitioning the index sounds like an even better idea.

    Comment by Jason Bucata — July 16, 2012 @ 6:19 pm BST Jul 16,2012

    • Hi
      For the record the quoted table size meant one thousand eight hundred million. The American definition of a billion won out around the globe a long time ago and is now the official meaning of a “Billion” in the U.K. As for partitioning, in this case only hash partitioning may be possible, because there is nothing of use in the table definition to range or list partition by. This was casually mentioned to me by the development team, who looked somewhat bemused when I told them that you have to plan this sort of thing up front! So this is what I am looking at next.

      Thanks to Jonathan for the posts on Oracle-L and this blog entry.

      Pete

      Comment by Pete — July 17, 2012 @ 4:45 pm BST Jul 17,2012

    • Jason,

      Hash partitioning the index is a good strategy (if you’ve got the license) for reducing buffer busy waits on the high-values end of a sequential or time-based index; and if you do need to rebuild the index from time to time it’s nice to be able to rebuild it in parts; but partitioning doesn’t affect the SQL used to check for uniqueness.

      As Pete points out – a billion generally defaults to the US standard of 10^9.

      Comment by Jonathan Lewis — July 19, 2012 @ 10:48 am BST Jul 19,2012

