The following comes from a recent posting on comp.databases.oracle.server:
I have two tables that I need to join. Package and Audit_Trail. Package has ~300,000 rows and Audit_Trail > 8 million. In Package there is a field CaseRef. It is a VarChar2 and has the form xx-xxx ie 34-145 or 67-12345 etc etc. In Audit_Trail there are two fields. Proc and Num. You guessed it – they are both Numbers and form the two parts of the CaseRef in Package.
I need to join them in the fastest way possible as I am dealing with rather large tables. Both fields in Audit_Trail are indexed (not unique). CaseRef in Package is primary key. I can change / alter Package table but I have select only permission on Audit_Trail as it’s in the schema of a 3rd party product.
The best I can come up with is splitting the CaseRef in Package (as it’s the smaller table) and then joining on two fields rather than concatenating the Proc and Num in Audit_Trail and joining on one field.
The only better solution I have is using a Materialized View on Audit_Trail providing the PRE concatenated CaseRef along with Proc and Num fields and unique indexing on all three.
Anyone have suggestions, comments or considerations?
What’s the problem – here’s a perfectly reasonable join which doesn’t need any special consideration (at first sight):
SQL> @ afiedt.buf
SQL> select
2 /*+ ordered use_nl(aud) index_combine(aud) */
3 Pck.padding,
4 aud.padding
5 from
6 package pck,
7 audit_trail aud
8 where
9 aud.proc = to_number(substr(pck.caseref,1,instr(pck.caseref,'-')-1))
10 and aud.num = to_number(substr(pck.caseref,instr(pck.caseref,'-')+1))
11 /
Execution Plan
----------------------------------------------------------
Plan hash value: 166471044
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 47 | 11 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID | AUDIT_TRAIL | 1 | 33 | 11 (0)| 00:00:01 |
| 2 | NESTED LOOPS | | 1 | 47 | 11 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL | PACKAGE | 1 | 14 | 2 (0)| 00:00:01 |
| 4 | BITMAP CONVERSION TO ROWIDS | | | | | |
| 5 | BITMAP AND | | | | | |
| 6 | BITMAP CONVERSION FROM ROWIDS| | | | | |
|* 7 | INDEX RANGE SCAN | AUD_NUM | 1 | | 1 (0)| 00:00:01 |
| 8 | BITMAP CONVERSION FROM ROWIDS| | | | | |
|* 9 | INDEX RANGE SCAN | AUD_PROC | 1 | | 1 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("AUD"."NUM"=TO_NUMBER(SUBSTR("PCK"."CASEREF",INSTR("PCK"."CASEREF",'-')+1)))
9 - access("AUD"."PROC"=TO_NUMBER(SUBSTR("PCK"."CASEREF",1,INSTR("PCK"."CASEREF",'-')-1)))
Here’s an alternative path, going the other way round:
SQL> @afiedt.buf
SQL> select
2 /*+ ordered use_nl(pck) index(pck) */
3 Pck.padding,
4 aud.padding
5 from
6 audit_trail aud,
7 package pck
8 where
9 pck.caseref =
10 to_char(aud.proc,'fm99999') || '-' ||
11 to_char(aud.num,'fm99999')
12 /
Execution Plan
----------------------------------------------------------
Plan hash value: 3689512721
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 47 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 47 | 3 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL | AUDIT_TRAIL | 1 | 33 | 2 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID| PACKAGE | 1 | 14 | 1 (0)| 00:00:01 |
|* 4 | INDEX UNIQUE SCAN | PACK_PK | 1 | | 0 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("PCK"."CASEREF"=TO_CHAR("AUD"."PROC",'fm99999')||'-'||TO_CHAR("AUD"."NUM",'fm99999'))
Given the demonstration that there is no apparent problem in joining the tables as they stand, it becomes clear that it’s not possible to answer this question (sensibly) without further information. To see why, just consider a similar request for help: “I need to join the orders table to the order_lines table, does anyone have any ideas on what to do ?” The ‘obvious’ answer to the question is probably not the answer to the question that was really being asked.
When asking a question like this, you really need to be precise about your requirements, and why you are asking the question, before you can hope for a reasonable answer. Consider the possibilities in this example.
Do you want to join all the data from one table to the other table; or do you typically pick a very few rows from one table to join to the other; if it’s just a few rows (or just one row) which table is it that drives the join.
If your queries are low volume then your main concerns are whether or not you can engineer a nested loop join between the two tables and whether the initial access path to the driving table can be precise.
If your queries are high volume then you need to be able to do a hash join (or possibly a merge join) between the two tables; and again it’s not the join condition that is the real problem, it’s the other filters that you may have available to minimise the driving table data.
The requirment for this pair of tables, with their incompatible column definitions, becomes difficult only if (a) you want to cover every possible query requirement with a single generic approach and (b) you want every possible query to return the result in the shortest time possible given the volume of data requested.
But, until we know more about what you are trying to achieve – how can we give you appropriate advice?

Good advice that — very “bookmarkable” for linking to in forum postings.
I try to promote phraseology such as “I need to join TableA _and_ TableB …” or “… Table A _with_ TableB …”, rather than using “to” so that we avoid prejudging the identification of the driving table.
And there’s also the question, I think, of how often you perform this join. If you generally access TableB only when accessing TableA and joining them together (I’m thinking of an invoice_detail and an invoice_header table accessed through an OLTP system screen or web interface for customer enquiries, for example) then I’d be thinking of clustering as another option.
Comment by David Aldridge — June 6, 2007 @ 11:38 am UTC Jun 6,2007 |
[...] enjoyed this post by Jonathan Lewis, in which he discusses his reaction to a seemingly simple question on [...]
Pingback by Oracle Musings » ???????, ?? ???????? — June 7, 2007 @ 1:35 pm UTC Jun 7,2007 |
This:
“I need to join them in the fastest way possible as I am dealing with rather large tables.”
is really the scary bit in the initial question.
Why is it needed to join in the fastest way possible because the tables are big? That makes no sense at all, by itself.
If the number of rows being retrieved in a typical query is small, then the size of the tables we start from is immaterial – assuming of course a sensible indexing strategy.
If the number of rows being retrieved in a typical query is large, then – and only then – does the large size of the tables become critical. Mostly because indexed retrieval may not be the best strategy to read large numbers of rows.
But like you said: there is not enough information in the question to answer it in a consistent fashion.
The problem of course becomes: what happens when I *may* need to retrieve *both* large and small amounts of data from the join? Where is the cutoff point for indexing versus hash/sort-merge and how do I tell the optimizer to go each way depending on my intentions? This is where histograms may help, but often that’s not enough…
Comment by Noons — June 7, 2007 @ 10:23 pm UTC Jun 7,2007 |
As you stated, It depends on how much information you need to retrieve, and wich one would be the leading table, filters, etc… But if “package” is to be the leading table, you can always create a concatenated index on audit_trail
create index ProcNumConcatIdx
on audit_trail(proc||’-'||num);
select /*+ index(aud ProcNumConcatIdx) */
pck.*
from
package pck,
audit_trail aud
where
pck.caseref=aud.proc||’-'||aud.num;
PLAN_TABLE_OUTPUT
————————————————————————————–
Plan hash value: 4083992475
—————————————————————————————
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————————
| 0 | SELECT STATEMENT | | 1000 | 13000 | 31 (4)| 00:00:01 |
|* 1 | HASH JOIN | | 1000 | 13000 | 31 (4)| 00:00:01 |
| 2 | TABLE ACCESS FULL| PACKAGE | 1000 | 6000 | 2 (0)| 00:00:01 |
| 3 | INDEX FULL SCAN | PROCNUMCONCATIDX | 9999 | 69993 | 28 (0)| 00:00:01 |
—————————————————————————————
1 – access(“PCK”.”CASEREF”=TO_CHAR(“PROC”)||’-'||TO_CHAR(“NUM”))
select /*+ use_nl(pck aud) index(aud ProcNumConcatIdx) */
pck.*
from
package pck,
audit_trail aud
where
pck.caseref=aud.proc||’-'||aud.num;
PLAN_TABLE_OUTPUT
————————————————————————————–
Plan hash value: 1711062606
—————————————————————————————
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————————
| 0 | SELECT STATEMENT | | 1000 | 13000 | 1003 (1)| 00:00:13 |
| 1 | NESTED LOOPS | | 1000 | 13000 | 1003 (1)| 00:00:13 |
| 2 | TABLE ACCESS FULL| PACKAGE | 1000 | 6000 | 2 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | PROCNUMCONCATIDX | 1 | 7 | 1 (0)| 00:00:01 |
—————————————————————————————
Comment by Hugo — June 8, 2007 @ 4:35 am UTC Jun 8,2007 |
“you can always create a concatenated index on audit_trail”
Given that in the original quote only select on audit trail was mentioned, this might in fact _not_ always be the case.
Comment by HolgerBaer — July 20, 2007 @ 3:07 pm UTC Jul 20,2007 |