Oracle Scratchpad

June 5, 2007

Design

Filed under: Performance,Tuning — Jonathan Lewis @ 9:13 pm GMT Jun 5,2007

The following comes from a recent posting on comp.databases.oracle.server:

I have two tables that I need to join. Package and Audit_Trail. Package has ~300,000 rows and Audit_Trail > 8 million. In Package there is a field CaseRef. It is a VarChar2 and has the form xx-xxx ie 34-145 or 67-12345 etc etc. In Audit_Trail there are two fields. Proc and Num. You guessed it – they are both Numbers and form the two parts of the CaseRef in Package.

I need to join them in the fastest way possible as I am dealing with rather large tables. Both fields in Audit_Trail are indexed (not unique). CaseRef in Package is primary key. I can change / alter Package table but I have select only permission on Audit_Trail as it’s in the schema of a 3rd party product.

The best I can come up with is splitting the CaseRef in Package (as it’s the smaller table) and then joining on two fields rather than concatenating the Proc and Num in Audit_Trail and joining on one field.

The only better solution I have is using a Materialized View on Audit_Trail providing the PRE concatenated CaseRef along with Proc and Num fields and unique indexing on all three.

Anyone have suggestions, comments or considerations?

What’s the problem? Here’s a perfectly reasonable join that needs no special consideration (at first sight):

 
SQL> @ afiedt.buf 
SQL> select 
  2  	     /*+ ordered use_nl(aud) index_combine(aud) */ 
  3  	     Pck.padding, 
  4  	     aud.padding 
  5  from 
  6  	     package	     pck, 
  7  	     audit_trail     aud 
  8  where 
  9  	     aud.proc = to_number(substr(pck.caseref,1,instr(pck.caseref,'-')-1)) 
 10  and     aud.num  = to_number(substr(pck.caseref,instr(pck.caseref,'-')+1)) 
 11  /         

Execution Plan 
---------------------------------------------------------- 
Plan hash value: 166471044         

------------------------------------------------------------------------------------------------- 
| Id  | Operation                         | Name        | Rows  | Bytes | Cost (%CPU)| Time     | 
------------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT                  |             |     1 |    47 |    11   (0)| 00:00:01 | 
|   1 |  TABLE ACCESS BY INDEX ROWID      | AUDIT_TRAIL |     1 |    33 |    11   (0)| 00:00:01 | 
|   2 |   NESTED LOOPS                    |             |     1 |    47 |    11   (0)| 00:00:01 | 
|   3 |    TABLE ACCESS FULL              | PACKAGE     |     1 |    14 |     2   (0)| 00:00:01 | 
|   4 |    BITMAP CONVERSION TO ROWIDS    |             |       |       |            |          | 
|   5 |     BITMAP AND                    |             |       |       |            |          | 
|   6 |      BITMAP CONVERSION FROM ROWIDS|             |       |       |            |          | 
|*  7 |       INDEX RANGE SCAN            | AUD_NUM     |     1 |       |     1   (0)| 00:00:01 | 
|   8 |      BITMAP CONVERSION FROM ROWIDS|             |       |       |            |          | 
|*  9 |       INDEX RANGE SCAN            | AUD_PROC    |     1 |       |     1   (0)| 00:00:01 | 
-------------------------------------------------------------------------------------------------         

Predicate Information (identified by operation id): 
--------------------------------------------------- 
   7 - access("AUD"."NUM"=TO_NUMBER(SUBSTR("PCK"."CASEREF",INSTR("PCK"."CASEREF",'-')+1))) 
   9 - access("AUD"."PROC"=TO_NUMBER(SUBSTR("PCK"."CASEREF",1,INSTR("PCK"."CASEREF",'-')-1)))         

Here’s an alternative path, going the other way round:

 
SQL> @afiedt.buf 
SQL> select 
  2  	     /*+ ordered use_nl(pck) index(pck) */ 
  3  	     Pck.padding, 
  4  	     aud.padding 
  5  from 
  6  	     audit_trail     aud, 
  7  	     package	     pck 
  8  where 
  9  	     pck.caseref = 
 10  		     to_char(aud.proc,'fm99999') || '-' || 
 11  		     to_char(aud.num,'fm99999') 
 12  /      

Execution Plan 
---------------------------------------------------------- 
Plan hash value: 3689512721      

-------------------------------------------------------------------------------------------- 
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     | 
-------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT             |             |     1 |    47 |     3   (0)| 00:00:01 | 
|   1 |  NESTED LOOPS                |             |     1 |    47 |     3   (0)| 00:00:01 | 
|   2 |   TABLE ACCESS FULL          | AUDIT_TRAIL |     1 |    33 |     2   (0)| 00:00:01 | 
|   3 |   TABLE ACCESS BY INDEX ROWID| PACKAGE     |     1 |    14 |     1   (0)| 00:00:01 | 
|*  4 |    INDEX UNIQUE SCAN         | PACK_PK     |     1 |       |     0   (0)| 00:00:01 | 
--------------------------------------------------------------------------------------------      

Predicate Information (identified by operation id): 
--------------------------------------------------- 
   4 - access("PCK"."CASEREF"=TO_CHAR("AUD"."PROC",'fm99999')||'-'||TO_CHAR("AUD"."NUM",'fm99999')) 

Given the demonstration that there is no apparent problem in joining the tables as they stand, it becomes clear that it’s not possible to answer this question (sensibly) without further information. To see why, just consider a similar request for help: “I need to join the orders table to the order_lines table, does anyone have any ideas on what to do ?” The ‘obvious’ answer to the question is probably not the answer to the question that was really being asked. 

When asking a question like this, you really need to be precise about your requirements, and why you are asking the question, before you can hope for a reasonable answer. Consider the possibilities in this example.

Do you want to join all the data from one table to the other, or do you typically pick just a few rows from one table to join to the other? And if it’s just a few rows (or just one row), which table drives the join?

If your queries are low volume, your main concerns are whether you can engineer a nested loop join between the two tables, and whether the initial access path into the driving table can be made precise.
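
For the low-volume case the first demonstration above is already the right shape; the sketch below simply adds a selective driving predicate to make the point. The date_opened column and its supporting index are invented for the illustration – they stand in for whatever precise filter the real application has on the driving table:

```sql
-- Sketch only: date_opened is a hypothetical filter column on package.
-- The point is that a selective, indexed predicate on the driving table
-- means the nested loop touches only a handful of audit_trail rows,
-- regardless of how big the two tables are.
select
        /*+ ordered use_nl(aud) */
        pck.padding,
        aud.padding
from
        package         pck,
        audit_trail     aud
where
        pck.date_opened >= trunc(sysdate)       -- precise driving predicate (hypothetical)
and     aud.proc = to_number(substr(pck.caseref, 1, instr(pck.caseref,'-') - 1))
and     aud.num  = to_number(substr(pck.caseref, instr(pck.caseref,'-') + 1))
;
```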

If your queries are high volume, you need to be able to do a hash join (or possibly a merge join) between the two tables; and again it’s not the join condition that is the real problem, it’s the other filters you may have available to minimise the volume of driving data.
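
For the high-volume case, a sketch of the same join pushed towards a hash join (hints shown only to make the intent explicit; on a real system you would hope the optimizer picks this path from the arithmetic):

```sql
-- Sketch only: for a high-volume query you would aim for a plan like
--   HASH JOIN
--     TABLE ACCESS FULL PACKAGE
--     TABLE ACCESS FULL AUDIT_TRAIL
-- The to_number()/substr() conversion is done once per package row as it
-- is scanned, so the "incompatible" column definitions cost very little.
select
        /*+ leading(pck) use_hash(aud) full(pck) full(aud) */
        pck.padding,
        aud.padding
from
        package         pck,
        audit_trail     aud
where
        aud.proc = to_number(substr(pck.caseref, 1, instr(pck.caseref,'-') - 1))
and     aud.num  = to_number(substr(pck.caseref, instr(pck.caseref,'-') + 1))
;
```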

The requirement for this pair of tables, with their incompatible column definitions, becomes difficult only if (a) you want to cover every possible query requirement with a single generic approach and (b) you want every possible query to return its result in the shortest time possible for the volume of data requested.
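
One practical option worth noting: the original poster can alter the package table (it’s only audit_trail that is select-only), so the conversion could be eliminated entirely by storing the two halves of caseref as real number columns on package. The column and index names below are invented for the sketch, and the ongoing maintenance of the two columns (by trigger, or as part of the load) is left out:

```sql
-- Sketch only: names are hypothetical; keeping the columns in step with
-- caseref after the one-off backfill is not shown.
alter table package add (
        case_proc       number,
        case_num        number
);

update package set
        case_proc = to_number(substr(caseref, 1, instr(caseref,'-') - 1)),
        case_num  = to_number(substr(caseref, instr(caseref,'-') + 1));

create index pck_proc_num_idx on package(case_proc, case_num);

-- The join then needs no type conversion in either direction:
select  pck.padding, aud.padding
from    audit_trail aud, package pck
where   pck.case_proc = aud.proc
and     pck.case_num  = aud.num;
```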

But until we know more about what you are trying to achieve, how can we give you appropriate advice?

5 Comments »

  1. Good advice that — very “bookmarkable” for linking to in forum postings.

    I try to promote phraseology such as “I need to join TableA _and_ TableB …” or “… Table A _with_ TableB …”, rather than using “to” so that we avoid prejudging the identification of the driving table.

    And there’s also the question, I think, of how often you perform this join. If you generally access TableB only when accessing TableA and joining them together (I’m thinking of an invoice_detail and an invoice_header table accessed through an OLTP system screen or web interface for customer enquiries, for example) then I’d be thinking of clustering as another option.

    Comment by David Aldridge — June 6, 2007 @ 11:38 am GMT Jun 6,2007

  2. [...] enjoyed this post by Jonathan Lewis, in which he discusses his reaction to a seemingly simple question on [...]

    Pingback by Oracle Musings — June 7, 2007 @ 1:35 pm GMT Jun 7,2007

  3. This:

    “I need to join them in the fastest way possible as I am dealing with rather large tables.”

    is really the scary bit in the initial question.

    Why does the join need to be done in the fastest way possible just because the tables are big? By itself, that makes no sense at all.

    If the number of rows being retrieved in a typical query is small, then the size of the tables we start from is immaterial – assuming of course a sensible indexing strategy.

    If the number of rows being retrieved in a typical query is large, then – and only then – does the large size of the tables become critical. Mostly because indexed retrieval may not be the best strategy to read large numbers of rows.

    But like you said: there is not enough information in the question to answer it in a consistent fashion.

    The problem of course becomes: what happens when I *may* need to retrieve *both* large and small amounts of data from the join? Where is the cutoff point for indexing versus hash/sort-merge and how do I tell the optimizer to go each way depending on my intentions? This is where histograms may help, but often that’s not enough…

    Comment by Noons — June 7, 2007 @ 10:23 pm GMT Jun 7,2007

  4. As you stated, it depends on how much information you need to retrieve, which table would be the leading table, the filters, etc. But if “package” is to be the leading table, you can always create a concatenated index on audit_trail:

    create index ProcNumConcatIdx
    on audit_trail(proc||'-'||num);

    select /*+ index(aud ProcNumConcatIdx) */
            pck.*
    from
            package     pck,
            audit_trail aud
    where
            pck.caseref = aud.proc||'-'||aud.num;

    PLAN_TABLE_OUTPUT
    ---------------------------------------------------------------------------------------
    Plan hash value: 4083992475

    ---------------------------------------------------------------------------------------
    | Id  | Operation          | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
    ---------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT   |                  |  1000 | 13000 |    31   (4)| 00:00:01 |
    |*  1 |  HASH JOIN         |                  |  1000 | 13000 |    31   (4)| 00:00:01 |
    |   2 |   TABLE ACCESS FULL| PACKAGE          |  1000 |  6000 |     2   (0)| 00:00:01 |
    |   3 |   INDEX FULL SCAN  | PROCNUMCONCATIDX |  9999 | 69993 |    28   (0)| 00:00:01 |
    ---------------------------------------------------------------------------------------

    1 - access("PCK"."CASEREF"=TO_CHAR("PROC")||'-'||TO_CHAR("NUM"))

    select /*+ use_nl(pck aud) index(aud ProcNumConcatIdx) */
            pck.*
    from
            package     pck,
            audit_trail aud
    where
            pck.caseref = aud.proc||'-'||aud.num;

    PLAN_TABLE_OUTPUT
    ---------------------------------------------------------------------------------------
    Plan hash value: 1711062606

    ---------------------------------------------------------------------------------------
    | Id  | Operation           | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
    ---------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT    |                  |  1000 | 13000 |  1003   (1)| 00:00:13 |
    |   1 |  NESTED LOOPS       |                  |  1000 | 13000 |  1003   (1)| 00:00:13 |
    |   2 |   TABLE ACCESS FULL | PACKAGE          |  1000 |  6000 |     2   (0)| 00:00:01 |
    |*  3 |   INDEX RANGE SCAN  | PROCNUMCONCATIDX |     1 |     7 |     1   (0)| 00:00:01 |
    ---------------------------------------------------------------------------------------

    Comment by Hugo — June 8, 2007 @ 4:35 am GMT Jun 8,2007

  5. “you can always create a concatenated index on audit_trail”

    Given that the original quote mentioned only select permission on audit_trail, this might in fact _not_ always be the case.

    Comment by HolgerBaer — July 20, 2007 @ 3:07 pm GMT Jul 20,2007

