Oracle Scratchpad

May 21, 2015

Understanding SQL

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 6:12 pm BST May 21,2015

From time to time someone publishes a query on the OTN database forum and asks how to make it go faster, and you look at it and think it’s a nice example to explain a couple of principles because it’s short, easy to understand, obvious what sort of things might be wrong, and easy to fix. Then, after you’ve made a couple of suggestions and explained a couple of ideas the provider simply fades into the distance and doesn’t tell you any more about the query, or whether they’ve taken advantage of your advice, or found some other way to address the problem.

Such a query, with its execution plan, appeared a couple of weeks ago:

UPDATE FACETS_CUSTOM.MMR_DTL
SET
	CAPITN_PRCS_IND = 2,
	FIL_RUN_DT = Current_fil_run_dt,
	ROW_UPDT_DT = dta_cltn_end_dttm
WHERE
	CAPITN_PRCS_IND = 5
AND	HSPC_IND ='Y'
AND	EXISTS (
		SELECT	1
		FROM	FACETS_STAGE.CRME_FUND_DTL_STG STG_CRME
		WHERE	STG_CRME.MBR_CK = MMR_DTL.MBRSHP_CK
		AND	MMR_DTL.PMT_MSA_STRT_DT BETWEEN STG_CRME.ERN_FROM_DT AND STG_CRME.ERN_THRU_DT
		AND	STG_CRME.FUND_ID IN ('AAB1', '1AA2', '1BA2', 'AAB2', '1AA3', '1BA3', '1B80', '1A80')
	)
AND	EXISTS (
		SELECT	1
		FROM	FACETS_CUSTOM.FCTS_TMS_MBRID_XWLK XWLK
		WHERE	XWLK.MBR_CK = MMR_DTL.MBRSHP_CK
		AND	MMR_DTL.PMT_MSA_STRT_DT BETWEEN XWLK.HSPC_EVNT_EFF_DT AND XWLK.HSPC_EVNT_TERM_DT
	)
;

-------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT              |                       |     1 |   148 | 12431   (2)| 00:02:30 |
|   1 |  UPDATE                       | MMR_DTL               |       |       |            |          |
|   2 |   NESTED LOOPS SEMI           |                       |     1 |   148 | 12431   (2)| 00:02:30 |
|*  3 |    HASH JOIN RIGHT SEMI       |                       |    49 |  5488 | 12375   (2)| 00:02:29 |
|   4 |     TABLE ACCESS FULL         | FCTS_TMS_MBRID_XWLK   |  6494 | 64940 |    24   (0)| 00:00:01 |
|*  5 |     TABLE ACCESS FULL         | MMR_DTL               |   304K|    29M| 12347   (2)| 00:02:29 |
|*  6 |    TABLE ACCESS BY INDEX ROWID| CRME_FUND_DTL_STG     |     1 |    36 |     5   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN          | IE1_CRME_FUND_DTL_STG |     8 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("XWLK"."MBR_CK"="MMR_DTL"."MBRSHP_CK")
       filter("XWLK"."HSPC_EVNT_EFF_DT"<=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT") AND
              "XWLK"."HSPC_EVNT_TERM_DT">=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT"))
   5 - filter("CAPITN_PRCS_IND"=5 AND "HSPC_IND"='Y')
   6 - filter(("STG_CRME"."FUND_ID"='1A80' OR "STG_CRME"."FUND_ID"='1AA2' OR
              "STG_CRME"."FUND_ID"='1AA3' OR "STG_CRME"."FUND_ID"='1B80' OR "STG_CRME"."FUND_ID"='1BA2' OR
              "STG_CRME"."FUND_ID"='1BA3' OR "STG_CRME"."FUND_ID"='AAB1' OR "STG_CRME"."FUND_ID"='AAB2') AND
              "STG_CRME"."ERN_FROM_DT"<=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT") AND
              "STG_CRME"."ERN_THRU_DT">=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT"))
   7 - access("STG_CRME"."MBR_CK"="MMR_DTL"."MBRSHP_CK")

The most informative bit of narrative that went with this query said:

“The table MMR_DTL doesnt have index on these columns CAPITN_PRCS_IND , HSPC_IND .. Since this is an update stmt which will update 85k records, In a dilema whether to index these columns or not .. And this table MMR_DTL is an ever growing table. Worried about the update performance. “

This was in response an observation that there was a full tablescan on MMR_DTL at operation 5 despite the predicate “CAPITN_PRCS_IND”=5 AND “HSPC_IND”=’Y’. You’ll note that the predicted cardinality for that scan is 304K and the update statement is going to change CAPITN_PRCS_IND from the value 5 to the value 2 – so it’s not entirely unreasonable to be worried about the impact of creating an index that included the column capitn_prcs_ind.

What more can we say about this query, given the limited information. Lots – unfortunately the owner of the query isn’t giving anything else away.

I’m going to leave this note unfinished to give people a little chance to think about the clues in the request, the questions they might ask, reasons why there might be a performance problem, and strategies they might investigate, then I’ll update the posting with a few ideas some time in the next 24 hours.

Update 1 – 24th May

There are so many ideas that spring up from a small amount of information that it’s very hard to write a concise and coherent description of what you’ve noticed, when and how far you pursued it, and how relevant the ideas might be to the problem in hand – especially when most of the thoughts require you to ask for more information. Something has come up that has taken up the free time I had to write this note, so I’m just going to have to complete it in rapidly written installments. The first bit is an outline of the immediate response I had to the initial presentation of the problem and the execution plan that went with it.

The only comment from the OP on this statement and plan was: “I couldnt optimize this query for better performance and optimized cost.. Can some one guide me on this.”

We have no idea how many rows would be updated, how long it took, or how long the OP thinks it ought to take; it’s not until a subsequent post that we learn that the number of rows targetted for update is 85,000 – which tells us that the optimizer has run into some problems with its cardinality estimates. This suggests that IF there’s a serious performance problem then POSSIBLY there’s a better execution plan and we might get the optimizer to find it automatically if we could tell it how to adjust its cardinality estimates. It would be nice, however to know where the query spent it’s time (i.e. can we re-run it with rowsource execution stats or monitoring enabled, and see the actual run-time work in the plan).

If it took a couple of minutes to update that 85,000 rows, I probably wouldn’t want to spend time making it go faster; if it took 2 hours, of which 1 hour 50 minutes was spent waiting for a transaction (row) lock then I’d want to look at why the update collision could happen and see if that problem could be avoided – it might then be the case that the last 10 minutes was spent rolling back and restarting an update that ought to have taken 2 minutes “in vacuo”. Underlying all this, I would want to be sure (as I’ve implicitly, and I think resonably, assumed) that it’s an update that runs only occasionally, perhaps once per day or once per week.

In the absence of hard information – let’s resort to a few hypotheticals; looking at the plan itself (and knowing the target 85,000 rows) I am prepared to make a few guesses about the run-time activity.

  1. We build an inmemory hash table from the whole of FCTS_TMS_MBRID_XWLK, a step for which the optimizer ought to be able to give a reasonable cost and cardinality – assuming (as I will from now on) that the basic stats are reasonably accurate.
  2. We scan the (fairly large) MMR_DETAIL table, applying a couple of filters; again the optimizer ought to do a reasonable job of estimating the cost of such a table, and we might expect a significant fraction of the time to be spent on multiblock (possibly direct path) reads of the table. The cardinality reported is 304,000 but we note there are two predcicates and both are for columns which we might guess have a small number of distinct values – one of which we are changing. Perhaps there’s a bad cardinality error there and maybe a couple of single column histograms would help, but maybe column group stats with a histogram on the pair would be even better. I also wonder when (if) HSPC_IND ever changes from Y to N, and contemplate the possibility of creating a function-based index that records ONLY the rows that match this predicate pair (see the note on indexing that will appear some time over the next week further down the page). It’s at this point that we might ask whether the number of rows returned by this scan should be very similar to the number of rows updated, or whether the scan identifies far too many rows and the two existence tests do a lot of work to eliminate the excess and, if the latter, which test should we apply first and how should we apply it.
  3. Having scanned the MMR_DTL we probe the in-memory hash table copy of FCTS_TMS_MBRID_XWLK for the first match, using an equality predicate (which will be the access predicate) and a range-based (filter) predicate which looks as if it is checking that some “start date” is between an “effective date” and a “termination date”. The estimated size of the result set is FAR too small at 49 rows when we know we have to have at least 85,000 rows survive this test; moreover, this tiny estimate comes from inputs of 6,500 and 304,000 rows being joined so we ought to wonder how such a tiny estimate could appear. A possible explanation is that the application has used some extreme dates to represent NULL values. If that’s the case then it’s possible that suitable histograms might help the optimizer recognise the extreme distribution; alternatively virtual columns that change the extreme values back to NULL and a predicate referencing the virtual columns may help.
  4. After estimating the cardinality of the intermediate result so badly, the optimizer decides that the second existence test can be performed as a semi-join using a nested loop. The thing to note here is that the optimizer “knows” that this is an expensive step – the cost of each table access operation is 5 (4 + 1) – but it’s a step that shouldn’t happen very frequently so the cost is considered acceptable. We know, however, that this step has to execute at least 85,000 times, so the optimizer’s prediction of visiting 4 blocks in the table to identify (on average) 8 rows and discard (on average) 7 of them looks nasty. Again we note that one of the predicates is range-based on a pair of dates – and in this case we might wonder whether or not most of the rows we discard are outside the date range, and whether we ought to consider (as a general point, and not just for this query) whether or not we should add one, other, or both the ERN_FROM_DT and ERN_THRU_DAT to the IE1_CRME_FUND_DTL_STG index. It’s at this point in the query that we are most ignorant of time spent at present (and in the future as the MMR_DTL table grows) – on one hand it’s possible that the rows for each MMR_DTL are widely scattered across the CRME_FUND_DTL_STG and this step could do a lot of random I/O, on the other hand the (assumed) time-dependent nature of the data may mean that the only MMR_DTL rows we look at are recently entered and the associated CRME_FUND_DTL_STG rows are therefore also recently entered and closely clustered – leading to a beneficial “self-caching” effect at the “high” end of the table as the query runs, which introduces an efficiency that the optimizer won’t notice. There is one numerical highlight in this join – we have a cost of 5 for each probe and 49 rows to test, so we might expect the incremental cost of the query to be around 250, but the difference between operations 3 and 2 is only 56 – suggesting that the optimizer does have some “self-caching” information, possibly based on there being a significant difference between the two tables for the number of distinct values of the join column. (See, for example: http://oracle-randolf.blogspot.co.uk/2012/05/nested-loop-join-costing.html )

Update 2 – 25th May

Scalability is a question that should always be considered – and there’s a scalability threat in the information we have so far. The plan shows a full tablescan of the MMR_DTL table, and while tablescans are not necessarily a bad thing we’ve been told that: “this table MMR_DTL is an ever growing table“. It’s possible that Oracle can be very quick and efficient when doing the existence tests on the rows it selects from the table – but it is inevitable that the tablescan will take longer to complete as time passes; whether or not this is likely to matter is something we can’t decide from the information given: we don’t know how much of the time is the tablescan, we don’t know what fraction of the total time is due to the tablescan, and we don’t know  how much larger the table will grow each day.

Another scalability detail we ought to ask about is the volume of data that we expect to update each time we run this statement. A time passes do we expect to see the same number of rows waiting to be updated, or are we expecting the business (whatever that may be) to grow steadily each month with an increase of a few percent in the number of rows to be updated on each execution. Our coding strategy may vary depending on the answer to that question – we might, for example, try to pre-empt a future problem by introducing some partitioning now.

The final scalablility issue is one I’ve raised already and comes from the CRME_FUND_DTL_STG table. According to the plan there about 8 rows in this table for each distinct value of MMR_DTL.MBRSHP_CK; if MMR_DTL is large and growing, is CRME_FUND_DTL_STG very large and growing even faster – as time passes will there be more rows for each distinct value of MMR_DTL.MBRSHP_CK.  Answers to these questions will help us decide whether we should use a hash join or a nested loop in the join to this table, and how to index the table to minimise random I/O.

 

To be continued …

 

May 18, 2015

Migrated rows

Filed under: Infrastructure,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 6:43 pm BST May 18,2015

I received an email recently describing a problem with a query which was running a full tablescan but: “almost all the waits are on ‘db file sequential read’ and the disk read is 10 times the table blocks”.  Some further information supplied was that the tablespace was using ASSM and 16KB block size; the table had 272 columns (ouch!) and the Oracle version was 11.2.0.4.

In his researches he had read my article on wide rows, and had picked out of one of the comments the line: “the very bad thing about chained rows and direct reads that is that finding the rest of row by ‘db file sequential read’ is never cached”, but he wasn’t sure that this was the problem he was seeing so, very sensibly, he had re-run the query with extended tracing available, and dumped (and formatted/edited) a couple of blocks from the table.

He then sent me the trace file and block dump. Generally this is a mistake – especially when the trace file is several megabytes – but he had prepared the ground well and had linked it back to one of my blog notes, and I thought there might be an opportunity for publishing a few more comments, so I took a look. Here’s a carefully edited subset of the block dump – showing all the pertinent information:


Start dump data blocks tsn: 99 file#:100 minblk 2513181 maxblk 2513181

Block header dump:  0x1926591d
 Object id on Block? Y
 seg/obj: 0x1652a7  csc: 0x53.891880b8  itc: 12  flg: E  typ: 1 - DATA
     brn: 1  bdba: 0x1965b70c ver: 0x01 opc: 0
     inc: 84  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0010.01d.0000dff9  0x2b442286.3469.09  C---    0  scn 0x0053.891880b1
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x04   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x05   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x06   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x07   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x08   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x09   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x0a   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x0b   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x0c   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
bdba: 0x1926591d

data_block_dump,data header at 0x11083f154
===============
tsiz: 0x1ea8
hsiz: 0x26
pbl: 0x11083f154
     76543210
flag=--------
ntab=1
nrow=10
frre=-1
fsbo=0x26
fseo=0x4c5
avsp=0x49f
tosp=0x49f
0xe:pti[0]	nrow=10	offs=0
0x12:pri[0]	offs=0x1c15
0x14:pri[1]	offs=0x197b
0x16:pri[2]	offs=0x16e1
0x18:pri[3]	offs=0x1448
0x1a:pri[4]	offs=0x11b8
0x1c:pri[5]	offs=0xf1f
0x1e:pri[6]	offs=0xc85
0x20:pri[7]	offs=0x9ec
0x22:pri[8]	offs=0x752
0x24:pri[9]	offs=0x4c5
block_row_dump:
tab 0, row 0, @0x1c15
tl: 659 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 1, @0x197b
tl: 666 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 2, @0x16e1
tl: 666 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 3, @0x1448
tl: 665 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 4, @0x11b8
tl: 656 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 5, @0xf1f
tl: 665 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 7, @0x9ec
tl: 665 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 8, @0x752
tl: 666 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 9, @0x4c5
tl: 653 fb: -----L-- lb: 0x0  cc: 255

In the ITL you can see 10 entries with the flag set to “C—-” (committed) with no XID or SCN – that’s consistent with 10 rows migrating into the block in a single transaction. In the row directory you can see the block holds 10 rows, and in the body of the block you can see the header for each of those 10 rows with 255 columns (presumably the 2nd section of each row of 272 columns), and the flag bytes set to “—–L–” (the Last piece of a chained – as opposed to simply migrated – row).

So the block dump is consistent with the possiblity of a direct path read of a block somewhere (10 head pieces) having to read this block 10 times shortly afterwards. Can we find further corroboration in the trace file? The blockdump was for block 0x1926591d = 421943581 decimal


PARSE #4573135368:c=29,e=48,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4116693033,tim=79008343283418
EXEC #4573135368:c=53,e=93,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4116693033,tim=79008343283607
WAIT #4573135368: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=15477650 tim=79008343283636
WAIT #4573135368: nam='Disk file operations I/O' ela= 38 FileOperation=2 fileno=101 filetype=2 obj#=1462951 tim=79008343283973
WAIT #4573135368: nam='direct path read' ela= 8991 file number=100 first dba=947580 block cnt=13 obj#=1462951 tim=79008343293041

WAIT #4573135368: nam='db file sequential read' ela= 4934 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298032
WAIT #4573135368: nam='db file sequential read' ela= 155 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298216
WAIT #4573135368: nam='db file sequential read' ela= 127 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298378
WAIT #4573135368: nam='db file sequential read' ela= 125 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298526
WAIT #4573135368: nam='db file sequential read' ela= 128 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298677
WAIT #4573135368: nam='db file sequential read' ela= 123 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298826
WAIT #4573135368: nam='db file sequential read' ela= 134 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298983
WAIT #4573135368: nam='db file sequential read' ela= 129 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343299135
WAIT #4573135368: nam='db file sequential read' ela= 180 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343299341
WAIT #4573135368: nam='db file sequential read' ela= 133 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343299497

WAIT #4573135368: nam='db file sequential read' ela= 11039 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343310565
WAIT #4573135368: nam='db file sequential read' ela= 133 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343310730
WAIT #4573135368: nam='db file sequential read' ela= 139 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343310895
WAIT #4573135368: nam='db file sequential read' ela= 124 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311045
WAIT #4573135368: nam='db file sequential read' ela= 122 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311190
WAIT #4573135368: nam='db file sequential read' ela= 127 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311339
WAIT #4573135368: nam='db file sequential read' ela= 125 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311490
WAIT #4573135368: nam='db file sequential read' ela= 134 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311647
WAIT #4573135368: nam='db file sequential read' ela= 128 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311797
WAIT #4573135368: nam='db file sequential read' ela= 124 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311947

WAIT #4573135368: nam='db file sequential read' ela= 10592 file#=100 block#=2513309 blocks=1 obj#=1462951 tim=79008343322564
WAIT #4573135368: nam='db file sequential read' ela= 142 file#=100 block#=2513309 blocks=1 obj#=1462951 tim=79008343322740
WAIT #4573135368: nam='db file sequential read' ela= 126 file#=100 block#=2513309 blocks=1 obj#=1462951 tim=79008343322889

There are a couple of interesting details in this trace file.

First we note (as the OP said) there are very few direct path reads – but direct path reads can be asynchronous with several running concurrently, which means that we may report one direct path read while the data returned from others records no time. (You’ll have to take my word for the sparseness of direct path reads – there were 5 reading a total of 58 blocks from the object, compared to 50,000 db file sequential reads)

Then you can see that although each block that was subject to “db file sequential read” is reported 10 times, the first read is much slower than the subsequent ones – a fairly good indication that the later reads are coming from a cache somewhere. (The 50,00 reads consisted of roughly 5,300 blocks being read 10 times, 1,400 blocks being read 9 times, 460 blocks being read 8 times, and a few blocks being read 7 or fewer times.)

You might also notice that the “coincidental” jump of 64 blocks between the sets of 10 reads – this appears fairly frequently, and it’s the type of pattern you might expect to see when a serial process is allocating blocks for use in a clean ASSM tablespace after the extent sizes have become fairly large (possibly the 64MB size that eventually appears with system managed extent sizes). There’s a “pseudo-random” choice of block within extent dicated by the process id, that spreads the work done by a single process steadily through the extent. Having filled 2513181, 2513245, 2513309 and so on for 16 steps the trace file comes back to 2513182, 2513246, 2513309 and so on.

It’s interesting (and time-consuming) to check the patterns but what we really need next, and don’t have, to check the theory is the set of 13 blocks dictated by the first direct path read:

WAIT #4573135368: nam='direct path read' ela= 8991 file number=100 first dba=947580 block cnt=13 obj#=1462951 tim=79008343293041

It’s likely that somewhere in the 13 blocks in the range 947580 onwards we would find the 10 row head pieces pointing to block 2513181; then the 10 row head pieces pointing to block 2513245, and so on – and I’d hope that we might see a pattern of many consecutive (or near-consecutive) rows in each originating block pointing to the same “next block”. In fact, with a few blocks in the early range, we might even get some idea of how the application was loading and updating data and be able to make some suggestions for changing the strategy to avoid row chaining.

Footnote

The OP also had a follow-up question which was: “One question for the block dump is why there is no hrid in it since the row pieces are the second row pieces and the flag bit is ‘—–L–‘?”  It would be nice to see this, of course – then we wouldn’t need to see the 947580-947592 range to see what had been happening to the data – but that’s not the way Oracle works, as I’ve pointed out above; but since the answer was in another posting of mine I simply emailed the relevant URL to the OP.

May 15, 2015

Quiz Night

Filed under: Oracle — Jonathan Lewis @ 8:34 pm BST May 15,2015

How many rows will you return from a single table query if you include the predicate


        rownum > 2

in the where clause.

Warning: this IS a catch question

To make it easier and avoid ambiguity, you may assume the table is the standard SCOTT.EMP table.

 

Part 2:

This posting was prompted by noticing a note that Dominic Brooks posted a few months ago.

Can you supply a workaround for the little oddity he’s described.

Answers

Part 1

I did say that the first part was a catch question. The semi-automatic answer that I would expect most people to give would be “none” – because putting a specific limit on a result is (probably) the reason that most people use rownum, and therefore it’s easy to forget exactly what it’s really doing. The best reply (because it’s also a nice demonstration of an important feature of the run-time engine came from comment 8 by Balazs Papp with the predicate “1 = 1 or rownum > 2″. A disjunct (OR) evaluates to TRUE if any of its components evaluates to TRUE; since 1 is always equal to 1 the check against rownum is irrelevant and the entire data set will be returned. (Conversely, thinking back to an old post of mine about NOT IN subqueries, a conjunct (AND) evaluates to TRUE if only if every individual component evalues to TRUE.)

It’s always instructive to look at execution plans (ideally pulled from memory, and including the rowsoruce execution stats) when you experiment. First the query with just the rownum predicate, then with the “1=1″ included.


select * from emp where rownum > 2;

no rows returned

--------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  COUNT              |      |      1 |        |      0 |00:00:00.01 |       2 |
|*  2 |   FILTER            |      |      1 |        |      0 |00:00:00.01 |       2 |
|   3 |    TABLE ACCESS FULL| EMP  |      1 |     14 |     14 |00:00:00.01 |       2 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM>2)

=======================================================================================

select * from emp where rownum > 2 or 1=1;

14 rows selected.

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |     14 |00:00:00.01 |       3 |
|   1 |  COUNT             |      |      1 |        |     14 |00:00:00.01 |       3 |
|   2 |   TABLE ACCESS FULL| EMP  |      1 |     14 |     14 |00:00:00.01 |       3 |
-------------------------------------------------------------------------------------

{This space really was blank - no predicates reported]

Note how the first query had Oracle acquire every row (A-rows) and then discard it because the check “rownum > 2″ would be false for each row in turn.
Note how the entire predicate section of the second query has disappeared because “1=1″ is always true, so the filter is irrelevant. It is an interesting oddity, though, that the redundant COUNT operator still appears in the second plan – presumably an automatic response to the presence of the rownum predicate.

The query I had originally planned to use in my answer modelled a typing error I had seen in a production system that might cause a double take on first sight.  (This is using the standard SCOTT.EMP table again on 11.2.0.4):

select
        rowid, rownum, job
from
        emp 
where   job = 'CLERK' or rownum > 2
;

ROWID                  ROWNUM JOB
------------------ ---------- ---------
AAAvgVAAFAAAAiBAAA          1 CLERK
AAAvgVAAFAAAAiBAAK          2 CLERK
AAAvgVAAFAAAAiBAAL          3 CLERK
AAAvgVAAFAAAAiBAAM          4 ANALYST
AAAvgVAAFAAAAiBAAN          5 CLERK


select
        rowid, rownum, job
from
        emp
where   job = 'ANALYST' or rownum > 2
;

ROWID                  ROWNUM JOB
------------------ ---------- ---------
AAAvgVAAFAAAAiBAAH          1 ANALYST
AAAvgVAAFAAAAiBAAM          2 ANALYST
AAAvgVAAFAAAAiBAAN          3 CLERK

I’ve included the rowid for each row because (although it’s not an assumption you should generally make) it shows you the order the rows appeared in the block. As you can see, after two CLERKS (first query) anything is accepted; after two ANALYSTS (second query) anything is accepted.

Part 2

As Dominic points out in his comments below, the anomaly comes from a predicate “rownum <= nvl(:2, rownum)” which the optimizer applies to every row rather than recognising that when :2 is not null it could use its STOPKEY optimisation. This is a variant of the old problem of the optimizer finding cases where the only way it can get the correct answer in all circumstances is to do something inefficient – a problem I described some time ago in a note on “OR with subquery”. We recognise, of course, that in general we should either use the front end to choose one of two possible queries (depending on whether the bind variable is null or not) or we should rewrite the query as a UNION ALL, taking care that the code eliminates any rows from the later query blocks that have already appeared in the earlier query blocks, and being very careful about dealing with NULLs in the data; for example (setting the bind variable :n to the value 10, and with a table t1 that is a copy of the data from view all_objects, indexed on object_id):


select  *
from    t1
where   object_id <= 100
and     :n is not null
and     rownum <= :n
union all
select  *
from    t1
where   object_id <= 100
and     :n is null
;

--------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |       |      1 |        |     10 |00:00:00.01 |       5 |
|   1 |  UNION-ALL                     |       |      1 |        |     10 |00:00:00.01 |       5 |
|*  2 |   COUNT STOPKEY                |       |      1 |        |     10 |00:00:00.01 |       5 |
|*  3 |    FILTER                      |       |      1 |        |     10 |00:00:00.01 |       5 |
|   4 |     TABLE ACCESS BY INDEX ROWID| T1    |      1 |     37 |     10 |00:00:00.01 |       5 |
|*  5 |      INDEX RANGE SCAN          | T1_I1 |      1 |     37 |     10 |00:00:00.01 |       3 |
|*  6 |   FILTER                       |       |      1 |        |      0 |00:00:00.01 |       0 |
|   7 |    TABLE ACCESS BY INDEX ROWID | T1    |      0 |     37 |      0 |00:00:00.01 |       0 |
|*  8 |     INDEX RANGE SCAN           | T1_I1 |      0 |     37 |      0 |00:00:00.01 |       0 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM<=:N)
   3 - filter(:N IS NOT NULL)
   5 - access("OBJECT_ID"<=100)
   6 - filter(:N IS NULL)
   8 - access("OBJECT_ID"<=100)

When the bind variable :n is null the optimizer will execute only the second query block; but when :n is not null it will take the first query block, and take advantage of the COUNT STOPKEY optimisation. (I was a little surprised that the operations 2 and 3 were in the order show, I had expected them to be in the opposite order.)

In this example we can engineer the code so that precisely one of the two query blocks executes, so we don’t need to worry about the threat of an overlap that I raised in the note on “Subquery with OR” with its solution of a call to lnnvl().

 

May 13, 2015

Instance stats

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 7:31 pm BST May 13,2015

+

While reading a posting by Martin Bach on a new buffering option for 12c I was prompted to take a look at another of his posts on the instance activity stats, which reminded me that the class column on v$statname is a bit flag, which we can dissect using the bitand() function to pick out the statistics that belong to multiple classes. I’ve got 2 or 3 little scripts that do this one, for example, picks out all the statistics relating to RAC, another is just a cross-tab of the class values used and their breakdown by class.  Originally this latter script used the “diagonal” method of decode() then sum() – but when the 11g pivot() option appeared I used it as an experiment on pivoting.

This is the script as it now stands, with the output from 12.1.0.2




select
        *
from    (
        select
                st.class,
                pwr.class_id,
                case bitand(st.class, pwr.expn)
                        when 0 then to_number(null)
                               else 1
                end     class_flag
        from
                v$statname      st,
                (select
                        level                   class_id,
                        power(2,level - 1)      expn
                from
                        dual
                connect by level <= 8
                )       pwr
        where
                bitand(class,pwr.expn) = pwr.expn
        )
pivot   (
                sum(class_flag)
        for     class_id in (
                        1 as EndUser,
                        2 as Redo,
                        3 as Enqueue,
                        4 as Cache,
                        5 as OS,
                        6 as RAC,
                        7 as SQL,
                        8 as Debug
                )
        )
order by
        class
;


     CLASS    ENDUSER       REDO    ENQUEUE      CACHE         OS        RAC        SQL      DEBUG
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
         1        130
         2                    68
         4                                9
         8                                         151
        16                                                     16
        32                                                                35
        33          3                                                      3
        34                     1                                           1
        40                                          53                    53
        64                                                                          130
        72                                          15                               15
       128                                                                                     565
       192                                                                            2          2

13 rows selected.

The titles given to the columns come from Martin’s blog, but the definitive set is in the Oracle documentation in the reference manual for v$statname. (I’ve changed the first class from “User” to “EndUser” because of a reserved word problem, and I abbreviated the “RAC” class for tidiness.) It’s interesting to note how many of the RAC statistics are also about the Cache layer.

May 12, 2015

Parallel Query

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 7:22 pm BST May 12,2015

According to the Oracle Database VLDB and Partitioning Guide (10g version and 11g version):

A SELECT statement can be executed in parallel only if the following conditions are satisfied:

  • The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX) or the schema objects referred to in the query have a PARALLEL declaration associated with them.
  • At least one table specified in the query requires one of the following:
    • A full table scan
    • An index range scan spanning multiple partitions
  • No scalar subqueries are in the SELECT list.

Note, particularly, that last restriction. I was looking at a query recently that seemed to be breaking this rule so, after examining the 10053 trace file for a while, I decided that I would construct a simplified model of the client’s query to demonstrate how the manuals can tell you the truth while being completely deceptive or (conversely) be wrong while still giving a perfectly correct impression. So here’s a query, with execution plan, from 11.2.0.4:

select
        /*+ parallel(t1 2) */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 15700 |  1340   (3)| 00:00:07 |        |      |            |
|   1 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   2 |   TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  3 |    INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   4 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   5 |   TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  6 |    INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   7 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   8 |   PX SEND QC (RANDOM)        | :TQ10003 |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | P->S | QC (RAND)  |
|*  9 |    HASH JOIN                 |          |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 10 |     HASH JOIN                |          |   100 | 14700 |  1317   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 11 |      HASH JOIN               |          |   100 | 13300 |  1294   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  12 |       BUFFER SORT            |          |       |       |            |          |  Q1,03 | PCWC |            |
|  13 |        PX RECEIVE            |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |         PX SEND BROADCAST    | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 15 |          TABLE ACCESS FULL   | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  16 |       PX BLOCK ITERATOR      |          |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWC |            |
|* 17 |        TABLE ACCESS FULL     | T1       |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  18 |      BUFFER SORT             |          |       |       |            |          |  Q1,03 | PCWC |            |
|  19 |       PX RECEIVE             |          | 10000 |   136K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  20 |        PX SEND BROADCAST     | :TQ10001 | 10000 |   136K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  21 |         TABLE ACCESS FULL    | T2       | 10000 |   136K|    23   (5)| 00:00:01 |        |      |            |
|  22 |     BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|  23 |      PX RECEIVE              |          | 10000 |    97K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  24 |       PX SEND BROADCAST      | :TQ10002 | 10000 |    97K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  25 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    23   (5)| 00:00:01 |        |      |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=:B1)
   6 - access("N1"=:B1)
   9 - access("T3"."ID"="T1"."R3")
  10 - access("T2"."ID"="T1"."R2")
  11 - access("T1"."N1"="D1"."ID")
  15 - filter("D1"."N1"=1)
  17 - filter("T1"."N2"=10 AND "T1"."N3"=10)

Thanks to my hint the query has been given a parallel execution plan – and a check of v$pq_tqstat after running the query showed that it had run parallel. Note, however, where the PX SEND QC and PX COORDINATOR operations appear – lines 7 and 8, and above those lines we see the two scalar subqueries.

This means we’re running the basic select statement as a parallel query but the query co-ordinator has serialised on the scalar subqueries in the select list.  Is the manual “right but deceptive” or “wrong but giving the right impression” ?  Serialising on (just) the scalar subqueries can have a huge impact on the performance and effectively make the query behave like a serial query even though, technically, the statement has run as a parallel query.

You may recall that an example of this type of behaviour, and its side effects when the scalar subqueries executed independently as parallel queries, showed up some time ago. At the time I said I would follow up with a note about the change in behaviour in 12c; this seems to be an appropriate moment to show the 12c plan(s), first the default:


----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 19100 |  1364   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10005 |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED        |          |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  4 |     HASH JOIN OUTER          |          |   100 | 18100 |  1340   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  5 |      HASH JOIN               |          |   100 | 16400 |  1335   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  6 |       HASH JOIN OUTER        |          |   100 | 15000 |  1311   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  7 |        HASH JOIN             |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|   8 |         PX RECEIVE           |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,05 | PCWP |            |
|   9 |          PX SEND BROADCAST   | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |
|  10 |           PX SELECTOR        |          |       |       |            |          |  Q1,00 | SCWC |            |
|* 11 |            TABLE ACCESS FULL | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | SCWP |            |
|  12 |         PX BLOCK ITERATOR    |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWC |            |
|* 13 |          TABLE ACCESS FULL   | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|  14 |        PX RECEIVE            |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  15 |         PX SEND BROADCAST    | :TQ10001 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | S->P | BROADCAST  |
|  16 |          PX SELECTOR         |          |       |       |            |          |  Q1,01 | SCWC |            |
|  17 |           VIEW               | VW_SSQ_1 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  18 |            HASH GROUP BY     |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  19 |             TABLE ACCESS FULL| REF2     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,01 | SCWP |            |
|  20 |       PX RECEIVE             |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  21 |        PX SEND BROADCAST     | :TQ10002 | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | S->P | BROADCAST  |
|  22 |         PX SELECTOR          |          |       |       |            |          |  Q1,02 | SCWC |            |
|  23 |          TABLE ACCESS FULL   | T2       | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | SCWP |            |
|  24 |      PX RECEIVE              |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  25 |       PX SEND BROADCAST      | :TQ10003 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | S->P | BROADCAST  |
|  26 |        PX SELECTOR           |          |       |       |            |          |  Q1,03 | SCWC |            |
|  27 |         VIEW                 | VW_SSQ_2 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  28 |          HASH GROUP BY       |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  29 |           TABLE ACCESS FULL  | REF1     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,03 | SCWP |            |
|  30 |     PX RECEIVE               |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  31 |      PX SEND BROADCAST       | :TQ10004 | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | S->P | BROADCAST  |
|  32 |       PX SELECTOR            |          |       |       |            |          |  Q1,04 | SCWC |            |
|  33 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | SCWP |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T3"."ID"="T1"."R3")
   4 - access("ITEM_2"(+)="T2"."N21")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("ITEM_1"(+)="T1"."R1")
   7 - access("T1"."N1"="D1"."ID")
  11 - filter("D1"."N1"=1)
  13 - filter("T1"."N2"=10 AND "T1"."N3"=10)

The first thing to note is the location of the PX SEND QC and PX COORDINATOR operations – right at the top of the plan: there’s no serialisation at the query coordinator. Then we spot the views at operations 17 and 27 – VW_SSQ_1, VW_SSQ_2 (would SSQ be “scalar subquery”, perhaps). The optimimzer has unnested the scalar subqueries out of the select list into the join. When a scalar subquery in the select list returns no data it’s value is deemed to be NULL so the joins (operations 4 and 6) have to be outer joins.

You’ll notice that there are a lot of PX SELECTOR operations – each feeding a PX SEND BROADCAST operations that reports its “IN-OUT” column as S->P (i.e. serial to parallel). Historically a serial to parallel operation started with the query coordinator doing the serial bit but in 12c the optimizer can dictate that one of the PX slaves should take on that task (see Randolf Geist’s post here). Again my code to report v$pq_tqstat confirmed this behaviour in a way that we shall see shortly.

This type of unnesting is a feature of 12c and in some cases will be very effective. It is a cost-based decision, though, and the optimizer can make mistakes; fortunately we can control the feature. We could simply set the optimizer_features_enable back to 11.2.0.4 (perhaps through the hint) and this would take us back to the original plan, but this isn’t the best option in this case. There is a hidden parameter _optimizer_unnest_scalar_sq enabling the feature so we could, in principle, disable just the one feature by setting that parameter to false; a more appropriate strategy would simply be to tell the optimizer that it should not unnest the subqueries. In my case I could put the /*+ no_unnest */ hint into both the subqueries or use the qb_name() hint to give the two subquery blocks names, and then used the /*+ no_unnest() */ hint with the “@my_qb_name” format at the top of the main query. Here’s the execution plan I get whether I use the hidden parameter or the /*+ no_unnest */ mechanim:

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |       |       |  1554 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                 |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10003 |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    EXPRESSION EVALUATION        |          |       |       |            |          |  Q1,03 | PCWC |            |
|*  4 |     HASH JOIN                   |          |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  5 |      HASH JOIN                  |          |   100 | 14700 |  1330   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  6 |       HASH JOIN                 |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|   8 |         PX RECEIVE              |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|   9 |          PX SEND BROADCAST      | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL     | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  11 |        PX BLOCK ITERATOR        |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWC |            |
|* 12 |         TABLE ACCESS FULL       | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|  13 |       BUFFER SORT               |          |       |       |            |          |  Q1,03 | PCWC |            |
|  14 |        PX RECEIVE               |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  15 |         PX SEND BROADCAST       | :TQ10001 | 10000 |   136K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  16 |          TABLE ACCESS FULL      | T2       | 10000 |   136K|    24   (5)| 00:00:01 |        |      |            |
|  17 |      BUFFER SORT                |          |       |       |            |          |  Q1,03 | PCWC |            |
|  18 |       PX RECEIVE                |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  19 |        PX SEND BROADCAST        | :TQ10002 | 10000 |    97K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  20 |         TABLE ACCESS FULL       | T3       | 10000 |    97K|    24   (5)| 00:00:01 |        |      |            |
|  21 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  22 |      TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 23 |       INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|  24 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  25 |      TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 26 |       INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."ID"="T1"."R3")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("T1"."N1"="D1"."ID")
  10 - filter("D1"."N1"=1)
  12 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N2"=10 AND "T1"."N3"=10))
  23 - access("N1"=:B1)
  26 - access("N1"=:B1)

Note particularly that the PX SEND QC and PX COORDINATOR operations are operations 2 and 1,, and we have a new operator EXPRESSSION EVALUATION at operation 3. This has three child operations – the basic select starting at operation 4, and the two scalar subqueries starting at lines 21 and 24. We are operating the scalar subqueries as correlated subqueries, but we don’t leave all the work to the query coordinator – each slave is running its own subqueries before forwarding the final result to the coordinator. There is a little side effect that goes with this change – the “serial to parallel” operations are now, as they always used to be, driven by the query co-ordinator, the PX SELECTOR operations have disappeared.

And finally

Just to finish off, let’s take a look at the results from v$pq_tqstat in 12.1.0.2. First from the default plan with the PX SELECTOR operations. Remember that this turned into a five table join where two of the “tables” were non-correlated aggregate queries against the reference tables.


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                   200       2428          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                   100       1238         59         27           0
                                             1 P001                   100       1238         41         24           0

                    1 Producer               1 P002                  2000      23830          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                  1000      11939         57         26           0
                                             1 P001                  1000      11939         41         24           0

                    2 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     339732          0          0           0
                      Consumer               1 P000                 10000     169890         49         22           0
                                             1 P001                 10000     169890         31         21           0

                    3 Producer               1 P002                     0         48          0          0           0
                                             1 P003                  2000      23830          0          0           0
                      Consumer               1 P000                  1000      11939         58         26           0
                                             1 P001                  1000      11939         38         23           0

                    4 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     239986          0          0           0
                      Consumer               1 P000                 10000     120017         50         22           0
                                             1 P001                 10000     120017         34         21           0

                    5 Producer               1 P000                     1        169          0          0           0
                                             1 P001                     1        169          1          0           0
                      Consumer               1 QC                       2        338          3          0           0

As you read down the table queues you can see that in the first five table queues (0 – 4) we seem to operate parallel to parallel, but only one of the two producers (p002 and p003) produces any data at each stage. A more traditional plan would show QC as the single producer in each of these stages.

Now with scalar subquery unnesting blocked – the plan with the three table join and EXPRESSION EVALUATION – we see the more traditional serial to parallel, the producer is QC in all three of the first table queues (the full scan and broadcast of tables t1, t2, and t3).

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 QC                     200       1726          0          0           0
                      Consumer               1 P000                   100       1614         28         15           0
                                             1 P001                   100       1614         34         13           0

                    1 Producer               1 QC                   20000     339732          0          0           0
                      Consumer               1 P000                 10000     169866         19         10           0
                                             1 P001                 10000     169866         25          8           0

                    2 Producer               1 QC                   20000     239986          0          0           0
                      Consumer               1 P000                 10000     119993         23         11           0
                                             1 P001                 10000     119993         31         11           0

                    3 Producer               1 P000                     1        155          1          0           0
                                             1 P001                     1        155          0          0           0
                      Consumer               1 QC                       2        310          3          1           0

It’s an interesting point that this last set of results is identical to the set produced in 11g – you can’t tell from v$pq_tqstat whether the parallel slaves or the query co-ordinator executed the subqueries – you have to look at the output from SQL trace (or similar) to see the individual Rowsource Executions Statistics for the slaves and coordinator to see which process actually ran the subqueries.

 

May 11, 2015

Efficiency

Filed under: Oracle,Performance — Jonathan Lewis @ 9:24 pm BST May 11,2015

Here’s a question to which I don’t know the answer, and for which I don’t think anyone is likely to need the answer; but it might be an entertaining little puzzle for thr curious.

Assume you’re doing a full tablescan against a simple heap table of no more than 255 columns (and not using RAC, Exadata, In-memory, etc. etc. etc.), and the query is something like:


select  {columns 200 to 250}
from    t1
where   column_255 = {constant}
;

To test the predicate Oracle has to count its way along each row column by column to find column 255. Will it:

  1. transfer columns 200 to 250 to local memory as it goes, then check column_255 — which seems to be a waste of CPU if the predicate doesn’t evaluate to TRUE
  2. evaluate the predicate, then walk the row again to transfer columns 200 to 250 to local memory if the predicate evaluates to TRUE — which also seems to be a waste of CPU
  3. one or other of the above depending on which columns are needed, how long they are, and the selectivity of the predicate

How would you test your hypothesis ?

Parallel Execution

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 10:16 am BST May 11,2015

This is another little reference list I should have created some time ago. It covers a series of posts on interpreting parallel execution plans and understanding where the work happens.

I may add further links to this page in the future relating to other aspects of parallel execution.

 

April 29, 2015

Not Exists

Filed under: Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 8:21 pm BST Apr 29,2015

This whole thing about “not exists” subqueries can run and run. In the previous episode I walked through some ideas of how the following query might perform depending on the data, the indexes, and the transformation that the optimizer might apply:

select
        count(*)
from    t1 w1
where   not exists (
                select  1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);  

As another participant in the original OTN thread had suggested, however, it might be possible to find a completely different way of writing the query, avoiding the subquery approach completely. In particular there are (probably) several ways that we could write an equivalent query where the table only appears once. In other words, if we restate the requirement we might be able to find a different SQL translation for that requirement.

Looking at the current SQL, it looks like the requirement is: “Count the number of rows in t1 that have values of X that only have one associated value of Y”.

Based on this requirement, the following SQL statements (supplied by two different people) look promising:


    WITH counts AS
       (SELECT x,y,count(*) xy_count
        FROM   t1
        GROUP BY x,y)
    SELECT SUM(x_count)
    FROM  (SELECT x, SUM(xy_count) x_count
           FROM   counts
           GROUP BY x
           HAVING COUNT(*) = 1);


SELECT SUM(COUNT(*))
  FROM t1
GROUP BY x HAVING COUNT(DISTINCT y)<=1

Logically they do seem to address the description of the problem – but there’s a critical difference between these statements and the original. The clue about the difference appears in the absence of any comparisons between columns in the new forms of the query, no t1.colX = t2.colX, no t1.colY != t2.colY, and this might give us an idea about how to test the code. Here’s some test data:


drop table t1 purge;

create table t1 (
        x       number(2,0),
        y       varchar2(10)
);

create index t1_i1 on t1(x,y);

--      Pick one of the three following pairs of rows

insert into t1(x,y) values(1,'a');
insert into t1(x,y) values(1,null);

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'b');

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'a');

commit;

--      A pair to be skipped

insert into t1(x,y) values(2,'c');
insert into t1(x,y) values(2,'c');

--      A pair to be reported

insert into t1(x,y) values(3,'d');
insert into t1(x,y) values(3,'e');

commit;

execute dbms_stats.gather_table_stats(user,'t1')

Notice the NULLs – comparisons with NULL lead to rows disappearing, so might the new forms of the query get different results from the old ?
The original query returns a count of 4 rows whichever pair we select from the top 6 inserts.

With the NULL in the Y column the new forms report 2 and 4 rows respectively – so only the second query looks viable.
With the NULLs in the X columns and differing Y columns the new forms report 2 and 2 rows respectively – so even the second query is broken.

However, if we add “or X is null” to the second query it reports 4 rows for both tests.
Finally, having added the “or x is null” predicate, we check that it returns the correct 4 rows for the final test pair – and it does.

It looks as if there is at least one solution to the problem that need only access the table once, though it then does two aggregates (hash group by in 11g). Depending on the data it’s quite likely that this single scan and double hash aggregation will be more efficient than any of the plans that do a scan and filter subquery or scan and hash anti-join. On the other hand the difference in performance might be small, and the ease of comprehension is just a little harder.

Footnote:

I can’t help thinking that the “real” requirement is probably as given in the textual restatement of the problem, and that the first rewrite of the query is probably the one that’s producing the “right” answers while the original query is probably producing the “wrong” answer.

April 26, 2015

Not Exists

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 7:17 pm BST Apr 26,2015

Another question on a seemingly simple “not exists” query has appeared on OTN just a few days after my last post about the construct. There are two little differences between the actual form of the two queries that make it worth repeating the analysis.

The first query was of the form:


select from big_table
where  not exists (select exact_matching_row from small table);

while the new query is of the form:


select from big_table alias1
where not exists (select inexact_matching_row from big_table alias2)

In the absence of a complete explanation, we might guess that the intention of the first query is to: “check correctness of undeclared foreign key constraint” i.e. small_table is the parent table with unique values, and big_table is the child end with some data that may be invalid due to a recent bulk data load. (In my example for the previous posting I relaxed the uniqueness assumption in small_table to make the problem a little more expensive.)

Our guess for the second query ought to be different; we are using the same table twice, and we are checking for the non-existence of “imperfect matches”. This introduces two potential threats – first that the (possibly pseudo-)join between the two tables is between two large tables and therefore inherently likely to be expensive; second that we may be allowed to have very large numbers of “perfect matches” that will escalate the scale of the join quite dramatically. Here’s the actual query (second version) from the posting; this will make it easier to explain why the structure of the query introduces the second threat and requires us (as so often) to understand the data in order to optimise the query execution:


select  count(*)
from    ubl_stg.wk_sap_fat w1
where   not exists (
                select  1
                from    ubl_stg.wk_sap_fat w2
                where   w2.mes_mese_id =  w1.mes_mese_id
                and     w2.sistema     <> w1.sistema
        )
;

Note the schema name ubl_stg – doesn’t that hint at “staging tables” for a data load.
Note the column name mes_mese_id – an “id” column, but clearly not one that’s supposed to be unique, so possibly a foreign key column that allows repetitions

The code is looking for, then counting, cases where for a given value of mes_mese_id there is only one corresponding value used for sistema. Since we don’t know the application we don’t know whether this count should be low or high relative to the number of rows in the table, nor do we know how many distinct values there might be of mes_mese_id – and these pieces of information are critical to identifying the best execution plan for the query.

The owner of this query told us that the table held 2 million rows which are “deleted and reloaded” every day, showed us that the (default) execution plan was a hash anti-join which took two hours to complete, told us that he (or she) didn’t think that an index would help because of the “not equal” predicate, and finally reported back that the problem was solved (no timing information supplied) by the addition of the /*+ no_unnest */ hint.

The question is – what has happened and why is there such a difference in performance ?

The first observation is that the 2 hours does seem an unreasonably long time for the query to run – and since it’s only a select statement it would be good to run it again and take a snapshot of the session statistics and session events (v$sesstat, v$session_event) for the session to see what the workload was and where the time was spent. Given the “deleted and reloaded” comment it’s possible that there may be some unexpected overhead due to some strange effects of read-consistency or delayed block cleanout, so we might also take a snapshot of tablespace or file I/O to check for lots of I/O on the undo tablespace (which might also show an I/O problem on the table’s tablespace or the user’s TEMP tablespace). The snapshots give us a very cheap, non-invasive option for getting some summary stats – but the tablespace/file stats are system-wide, of course, so may not tell us anything about our specific task, so we might even enable extended tracing (event 10046 level 8 / dbms_monitor with waits) to see in detail where we are losing time.

Since we don’t currently have the information we need to explain the two hours (it may have appeared by the time I post this note) it might be instructive to make some guesses about where the time could go in the hash anti-join, and why the /*+ no_unnest */ hint could make a difference.

Ignoring the possibility of strange undo/read-consistency/cleanout effects the first possiblity is simply that the hash join is large and turns into a multi-pass I/O thrash. The mes_mese_id column looks like it might be a number (id columns so often are) but the sistema column has the flavour of a reasonably large character column – so maybe our hash table has to be a couple of hundred megabytes – that could certainly be enough to spill to disc, though you’d have to have a really small PGA availability for it to turn into a multi-pass hash join.

Another possibility is that the pattern in the data makes the hash join burn up a huge amount of CPU – that should be easy to see on a re-run.  If there are relatively few distinct sets of values for (mes_mese_id, sistema) and there are very few cases where a mes_mese_id is associated with more than one sistema, then a large fraction of the hash table probes would have to follow a very long chain of matches to the very end, and that would take a large amount of CPU.

Pursue that “long chain” hypothesis to a slight extreme – what if there’s one mes_mese_id that appears in 250,000 of the 2M rows, and the sistema value is the same for every one of those quarter million rows,  which would require Oracle to walk a chain of 250,000 elements 250,000 times, for a total of 62.5 billion pointers to follow and comparisions to make – how much CPU might that take.  Worse still,since having the same mes_mese_id (the hashing is only on the equality predicate) means the quarter of a million rows would have to go into the same hash bucket so we might end up doing a multipass operation because that one bucket was very much larger than anything Oracle had anticipated when it did its internal partitioning calculations for the hash table. (There are some related notes in this article and in chapter 12 of Cost Based Oracle – Fundamentals)

Why would the /*+ unnest */ hint address these performance problems ? My first guess would be that the OP may have created the index on (mes_mese_id, sistema), in which case a full scan of that index (or even a fast-full scan if the index were newly created, or even a tablescan if the data had been loaded in the right order) followed by the filter subquery being driven by an index range scan would result in a relatively efficient subquery being executed once per distinct value of (mes_mese_id, sistema) rather than once per row. This could be much more efficient than a really badly skewed data set doing the hash anti-join. In fact, even in the absence of the index, if the number of distinct combinations was really quite small, that many tablescans – if the table were cached, whether in Oracle or the filesystem or the SAN cache – might still be a lot faster than the hash anti-join. (We were told that the problem was solved – but not told how much time constituted a viable solution.)

Models

Hand-waving and talk is fine – but a few modelled results might make it easier to comprehend, so here’s a way of generating a few versions of data sets and testing:


drop table t1 purge;

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                          id,
/*
        --
        --      Random generation nearly guaranteeing no random
        --      duplicates of (x,y) but quite a lot of mismatches
        --      Count 735,715 in 3.36 seconds.
        --
        trunc(dbms_random.value(0,2e6))                 x,
        lpad(trunc(dbms_random.value(0,1e6)),64,'0')    y
*/
/*
        --
        --      One specific pair repeated many times with no mismatch the rest
        --      as previously generated above. Times with different repeat counts:
        --       10,000            14 seconds
        --       20,000            52 seconds
        --       40,000           202 seconds
        --      CPU time quadruples as the count doubles (twice the number of
        --      probes walking twice the number of steps in the hash chain)
        --       80,000 =>        800 seconds
        --      160,000 =>      3,200 seconds
        --      240,000 =>      ca. 2 hours.
*/
        case
                when rownum <= 40000
                        then 2e6
                        else trunc(dbms_random.value(0,2e6))
        end                                             x,
        case
                when rownum <= 40000
                        then
                                lpad(0,64,'0')
                        else
                                lpad(
                                        trunc((rownum - 1)/1000) +
                                                case mod(rownum-1,1000) when 0 then 0 else 0 end,
                                        64,'0'
                                )
        end                                             y
/*
        --
        --      2,000 distinct values repeated 1,000 times each
        --      Query result: 2,000,000 in 235 seconds.
        --
        trunc((rownum - 1)/1000)                x,
        lpad(
                trunc((rownum - 1)/1000) +
                        case mod(rownum-1,1000) when 0 then 0 else 0 end,
                64,'0'
        )                                       y
*/
from
        generator       v1,
        generator       v2
where
        rownum <= 2e6
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        count(*)
from    t1 w1
where   not exists (
                select  1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);  

As you can see I’ve changed the table and column names – this keeps them in line with the original SQL statement presented on OTN before we got the graphic display of a similar statement and plan. The SQL to create the data includes three variants and 5 sets of results (and 3 conjectures) running 11.2.0.4. These results appeared when the optimizer took the hash anti-join execution plan, and also spilled to disc with a one-pass workarea operation that first extended to about 200MB of PGA then dropped back to about 8MB.

To summarise the results recorded in the SQL – if we use the term “bad data” to describe rows where more than one Y value appears for a given X value then:

  1. With a large number of distinct pairs and a lot of bad data: the anti-join is pretty fast at 3.36 seconds.
  2. With no bad data, a small number of distinct pairs (2,000) and lots of rows per pair (1,000): the anti-join takes 235 CPU seconds
  3. As for #1 above, but with one extreme “good” pair that appears a large number of times: CPU time is proportional to the square of the number of duplicates of this value

I didn’t actually test beyond 40,000 duplicates for the last case, but you can see the double/quadruple pattern very clearly and the CPU time would have hit 2 hours at around 240,000 identical copies of one (x,y) pair.

/*+ no_unnest */

So what happens if you try using the /*+ no_unnest */ hint ? The target here is that the more repetitive the data the smaller the number of times you may have to run the subquery; and if you can get the driving data ordered you can guarantee the smallest possible number of runs of the subquery. I haven’t worked through all the possibilities, but to give you the flavour of what can happen, when I added the /*+ no_unnest */ hint to the query (and ensured that the table would be cached rather than read using direct path reads into the PGA) the execution time for the test with 1,000 copies of 2,000 pairs took 181 seconds to do 2,001 tablescans (compared with 235 seconds to do 2 tablescans and a hash anti-join) with the following execution path:


----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    69 |  8256K  (4)| 11:28:04 |
|   1 |  SORT AGGREGATE     |      |     1 |    69 |            |          |
|*  2 |   FILTER            |      |       |       |            |          |
|   3 |    TABLE ACCESS FULL| T1   |  2000K|   131M|  2877   (4)| 00:00:15 |
|*  4 |    TABLE ACCESS FULL| T1   |     2 |   138 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST NO_INDEX ("W2") */ 0
              FROM "T1" "W2" WHERE "W2"."X"=:B1 AND "W2"."Y"<>:B2))
   4 - filter("W2"."X"=:B1 AND "W2"."Y"<>:B2)

More significantly, when I created the optimum index the execution time dropped to 0.9 seconds – here’s the create index statement and subsequent plan – the extreme benefit appears because the data was effectively loaded in sorted order; if this had not been the case I would have forced an index full scan for the driving data set (with a “not null” predicate or constraint to make it possible for the optimizer to use the index to drive the query)


create index t1_i1 on t1(x,y) compress pctfree 0;

-----------------------------------------------------------------------------
| Id  | Operation           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |       |     1 |    69 |  6008K  (1)| 08:20:45 |
|   1 |  SORT AGGREGATE     |       |     1 |    69 |            |          |
|*  2 |   FILTER            |       |       |       |            |          |
|   3 |    TABLE ACCESS FULL| T1    |  2000K|   131M|  2877   (4)| 00:00:15 |
|*  4 |    INDEX RANGE SCAN | T1_I1 |     2 |   138 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T1" "W2"
              WHERE "W2"."X"=:B1 AND "W2"."Y"<>:B2))
   4 - access("W2"."X"=:B1)
       filter("W2"."Y"<>:B1)

Similarly, when I had the index in place and the 40,000 repetitions of a “good pair”, Oracle took a total of 9 seconds even though it had to run the subquery 1,960,000 times for the non-repetitive data (and once for the repetitive_pair). I have to say I was a little surprised at how rapidly it managed to get through that 2M subquery executions – but then I keep forgetting how ridiculously overpowered my new laptop is.

With these figures in mind you can appreciate that if the OP had lots of pairs with tens of thousands of repetitions, then even without creating the index on the table, the query time might drop from 2 hours for the hash anti-join to “a few minutes” for the filter subquery with a time that was good enough to look like “problem solved”.

Summary

If you have a few “long hash chains” in the build table – i.e. rows from the “first” table in the join that hash to the same value – then the amount of work Oracle has to do to check a single row from the probe (“second”) table that matches on the hash value can become significant. If a large number of rows from the probe table hit a long hash chain then the CPU time for the whole join can climb dramatically and you may want to force Oracle away from the hash join.

If the the long chains are the result of a skewed distribution where a small number of values appear very frequently in a table with a large number of distinct values that each appears infrequently then the optimizer may not notice the threat and may choose the hash plan when there is a much less resource-intensive alternative available.

Footnote:

There was another interesting observations I made while doing the experiments relating to whether the chains are due to hash collisions or require exact matches in the data – but I’ve spent an hour on desgining and running tests, and nearly 4 hours writing up the results so far. I need to do a few more tests to work out whether I’m seeing a very clever optimisation or a lucky coincidence in a certain scenario – so I’m going to save that for another day.

 

 

April 23, 2015

Golden Oldies

Filed under: bitmaps,Indexing,Oracle — Jonathan Lewis @ 8:45 am BST Apr 23,2015

I’ve just been motivated to resurrect a couple of articles I wrote for DBAZine about 12 years ago on the topic of bitmap indexes. All three links point to Word 97 documents which I posted on my old website in September 2003. Despite their age they’re still surprisingly good.

Update: 26th April 2015

Prompted by my reply to comment #2 below to look at what I said about bitmap indexes in Practical Oracle 8i (published 15 years ago), and found this gem:

An interesting feature of bitmap indexes is that it is rather hard to predict how large the index segment will be. The size of a B-tree index is based very closely on the number of rows and the typical size of the entries in the index column. The size of a bitmap index is dictated by a fairly small number of bit-strings which may have been compressed to some degree depending upon the number of consecutive 1’s and 0’s.

To pick an extreme example, imagine a table of one million rows that has one column that may contain one of eight values ‘A’ to ‘H’ say, which has been generated in one of of the two following extreme patterns:

  • All the rows for a given value appear together, so scanning down the table we get 125,000 rows with ‘A’ followed by 125,000 rows of ‘B’ and so on.
  • The rows cycle through the values in turn, so scanning down the table we get ‘A’,’B’. . . ‘H’ repeated 125,000 times.

What will the bitmap indexes look like in the two cases case?

For the first example, the basic map for the ‘A’ value will be 125,000 one-bits, followed by 875,000 zero bits – which will be trimmed off. Splitting the 125,000 bits into bytes and adding the necessary overhead of about 12% we get an entry of the ‘A’ rows of 18K. A similar argument applies for each of the values ‘B’ to ‘H’, so we get a total index size of around 8 x 18K – giving 156K.

For the second example, the basic map for the ‘A’ value will be a one followed by 7 zeros, repeated 125,000 times. There is no chance of compression here, so the ‘A’ entry will start at 125,000 bytes. Adding the overhead this goes up to 140K, and repeating the argument for the values ‘B’ to ‘H’ we get a total index of 1.12 MB.

This wild variation in size looks like a threat, but to put this into perspective, a standard B-tree index on this column would run to about 12 Mb irrespective of the pattern of the data. It would probably take about ten times as long to build as well.

As we can see, the size of a bitmap index can be affected dramatically by the packing of the column it depends upon as well as the number of different possible values the column can hold and the number of rows in the table. The compression that is applied before the index is stored, and the amazing variation in the resulting index does mean that the number of different values allowed in the column can be much larger than you might first expect. In fact it is often better to think of bitmap indexes in terms of how many occurrences of each value there are, rather than in terms of how many different values exist. Viewing the issue from this direction, a bitmap is often better than a B-tree when each value occurs more than a few hundred times in the table (but see the note below following the description of bitmap index entries).

 

April 20, 2015

Manuals

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 4:24 pm BST Apr 20,2015

From time to time I read a question (or, worse, an answer) on OTN and wonder how someone could have managed to misunderstand some fundamental feature of Oracle – and then, as I keep telling people everyone should do – I re-read the manuals and realise that that sometimes the manuals make it really easy to come to the wrong conclusion.

Having nothing exciting to do on the plane to Bucharest today, I decided it was time to read the Concepts manual again – 12c version – to remind myself of how much I’ve forgotten. Since I was reading the mobi version on an iPad mini I can’t quote page numbers, but at “location 9913 of 16157″ I found the following text in a sidebar:

“LGWR can write redo log entries to disk before a transaction commits. The redo entries become permanent only if the transaction later commits.”

Now I know what that’s trying to say because I already know how Oracle works – but it explains the various questions that I’ve seen on OTN (and elsewhere) struggling with the idea of how Oracle manages to “not have” redo for transactions that didn’t commit.

The redo entries become permanent the moment they are written to disc – nothing makes any of the content of the redo log files disappear 1, nothing goes back and flags some bits of the redo log as “not really there”. It’s the changes to the data blocks that have been described by the redo that become permanent only if the transaction later commits. If the transaction rolls back2 the session doesn’t “seek and destroy” the previous redo, it generates MORE redo (based on the descriptions that it originally put into the undo segment) and applies the changes described by that redo to reverse out the effects of the previous changes.

So next time you see a really bizarre question about how Oracle works remember that it could have arisen from someone reading the manual carefully; because sometimes the manual writers know exactly what they mean to say but don’t actually say it clearly and unambiguously.

1 I am aware that strange and rare events such disc crashes could make all sorts of things disappear, but I think it’s reasonable to assume here that we’re talking about standard processing mechanisms.

2 I am also aware that there are variations dependent on events like sessions being killed, or instance failure that could need some further explanation, but there’s a time, place, and pace, for everything.

April 15, 2015

Cartesian join

Filed under: Oracle,Performance,Tuning — Jonathan Lewis @ 6:40 pm BST Apr 15,2015

Some time ago I pulled off the apocryphal “from 2 hours to 10 seconds” trick for a client using a technique that is conceptually very simple but, like my example from last week, falls outside the pattern of generic SQL. The problem (with some camouflage) is as follows: we have a data set with 8 “type” attributes which are all mandatory columns. We have a “types” table with the same 8 columns together with two more columns that are used to translate a combination of attributes into a specific category and “level of relevance”. The “type” columns in the types table are, however, allowed to be null although each row must have at least one column that is not null – i.e. there is no row where every “type” column is null.

The task is to match each row in the big data set with all “sufficiently similar” rows in the types table and then pick the most appropriate of the matches – i.e. the match with the largest “level of relevance”. The data table had 500,000 rows in it, the types table has 900 rows. Here’s a very small data set representing the problem client data (cut down from 8 type columns to just 4 type columns):


create table big_table(
	id		number(10,0)	primary key,
	v1		varchar2(30),
	att1		number(6,0),
	att2		number(6,0),
	att3		number(6,0),
	att4		number(6,0),
	padding		varchar2(4000)
);

create table types(
	att1		number(6,0),
	att2		number(6,0),
	att3		number(6,0),
	att4		number(6,0),
	category	varchar2(12)	not null,
	relevance	number(4,0)	not null
);

insert into big_table values(1, 'asdfllkj', 1, 1, 2, 1, rpad('x',4000));
insert into big_table values(2, 'rirweute', 1, 3, 1, 4, rpad('x',4000));

insert into types values(   1, null, null, null, 'XX',  10);
insert into types values(   1, null, null,    1, 'YY',  20);
insert into types values(   1, null,    1, null, 'ZZ',  20);

commit;

A row from the types table is similar to a source row if it matches on all the non-null columns. So if we look at the first row in big_table, it matches the first row in types because att1 = 1 and all the other attN columns are null; it matches the second row because att1 = 1 and att4 = 1 and the other attN columns are null, but it doesn’t match the third row because types.att3 = 1 and big_table.att3 = 2.

Similarly, if we look at the second row in big_table, it matches the first row in types, doesn’t match the second row because types.att4 = 1 and big_table.att4 = 4, but does match the third row. Here’s how we can express the matching requirement in SQL:


select
	bt.id, bt.v1,
	ty.category,
	ty.relevance
from
	big_table	bt,
	types		ty
where
	nvl(ty.att1(+), bt.att1) = bt.att1
and	nvl(ty.att2(+), bt.att2) = bt.att2
and	nvl(ty.att3(+), bt.att3) = bt.att3
and	nvl(ty.att4(+), bt.att4) = bt.att4
;

You’ll realise, of course, that essentially we have to do a Cartesian merge join between the two tables. Since there’s no guaranteed matching column that we could use to join the two tables we have to look at every row in types for every row in big_table … and we have 500,000 rows in big_table and 900 in types, leading to an intermediate workload of 450,000,000 rows (with, in the client case, 8 checks for each of those rows). Runtime for the client was about 2 hours, at 100% CPU.

When you have to do a Cartesian merge join there doesn’t seem to be much scope for reducing the workload, however I didn’t actually know what the data really looked like so I ran a couple of queries to analyse it . The first was a simple “select count (distinct)” query to see how many different combinations of the 8 attributes existed in the client’s data set. It turned out to be slightly less than 400.

Problem solved – get a list of the distinct combinations, join that to the types table to translate to categories, then join the intermediate result set back to the original table. This, of course, is just applying two principles that I’ve discussed before: (a) be selective about using a table twice to reduce the workload, (b) aggregate early if you can reduce the scale of the problem.

Here’s my solution:


with main_data as (
	select
		/*+ materialize */
		id, v1, att1, att2, att3, att4
	from
		big_table
),
distinct_data as (
	select
		/*+ materialize */
		distinct att1, att2, att3, att4
	from	main_data
)
select
	md.id, md.v1, ty.category, ty.relevance
from
	distinct_data	dd,
	types		ty,
	main_data	md
where
	nvl(ty.att1(+), dd.att1) = dd.att1
and	nvl(ty.att2(+), dd.att2) = dd.att2
and	nvl(ty.att3(+), dd.att3) = dd.att3
and	nvl(ty.att4(+), dd.att4) = dd.att4
and	md.att1 = dd.att1
and	md.att2 = dd.att2
and	md.att3 = dd.att3
and	md.att4 = dd.att4
;

And here’s the execution plan.


---------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                            |    12 |  2484 |    11  (10)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION |                            |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D6619_8FE93F1 |       |       |            |          |
|   3 |    TABLE ACCESS FULL       | BIG_TABLE                  |     2 |   164 |     2   (0)| 00:00:01 |
|   4 |   LOAD AS SELECT           | SYS_TEMP_0FD9D661A_8FE93F1 |       |       |            |          |
|   5 |    HASH UNIQUE             |                            |     2 |   104 |     3  (34)| 00:00:01 |
|   6 |     VIEW                   |                            |     2 |   104 |     2   (0)| 00:00:01 |
|   7 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D6619_8FE93F1 |     2 |   164 |     2   (0)| 00:00:01 |
|*  8 |   HASH JOIN                |                            |    12 |  2484 |     6   (0)| 00:00:01 |
|   9 |    NESTED LOOPS OUTER      |                            |     6 |   750 |     4   (0)| 00:00:01 |
|  10 |     VIEW                   |                            |     2 |   104 |     2   (0)| 00:00:01 |
|  11 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D661A_8FE93F1 |     2 |   104 |     2   (0)| 00:00:01 |
|* 12 |     TABLE ACCESS FULL      | TYPES                      |     3 |   219 |     1   (0)| 00:00:01 |
|  13 |    VIEW                    |                            |     2 |   164 |     2   (0)| 00:00:01 |
|  14 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9D6619_8FE93F1 |     2 |   164 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - access("MD"."ATT1"="DD"."ATT1" AND "MD"."ATT2"="DD"."ATT2" AND
              "MD"."ATT3"="DD"."ATT3" AND "MD"."ATT4"="DD"."ATT4")
  12 - filter("DD"."ATT1"=NVL("TY"."ATT1"(+),"DD"."ATT1") AND
              "DD"."ATT2"=NVL("TY"."ATT2"(+),"DD"."ATT2") AND
              "DD"."ATT3"=NVL("TY"."ATT3"(+),"DD"."ATT3") AND
              "DD"."ATT4"=NVL("TY"."ATT4"(+),"DD"."ATT4"))

Critically I’ve taken a Cartesian join that had a source of 500,000 and a target of 900 possible matches, and reduced it to a join between the 400 distinct combinations and the 900 possible matches. Clearly we can expect this to to take something like one twelve-hundredth (400/500,000) of the work of the original join – bringing 7,200 seconds down to roughly 6 seconds. Once this step is complete we have an intermediate result set which is the 4 non-null type columns combined with the matching category and relevance columns – and can use this in a simple and efficient hash join with the original data set.

Logic dictated that the old and new results would be the same – but we did run the two hour query to check that the results matched.

Footnote: I was a little surprised that the optimizer produced a nested loops outer join rather than a Cartesian merge in the plan above – but that’s probably an arterfact of the very small data sizes in my test.There’s presumably little point in transferring the data into the PGA when the volume is so small.

Footnote 2: I haven’t included the extra steps in the SQL to eliminate the reduce the intermediate result to just “the most relevant” – but that’s just an inline view with an analytic function. (The original code actually selected the data with an order by clause and used a client-side filter to eliminate the excess!).

Footnote 3: The application was a multi-company application – and one of the other companies had not yet gone live on the system because they had a data set of 5 million rows to process and this query had never managed to run to completion in the available time window.  I’ll have to get back to the client some day and see if the larger data set also collapsed to a very small number of distinct combinations and how long the rewrite took with that data set.

 

April 13, 2015

Not Exists

Filed under: CBO,Execution plans,Oracle,Performance — Jonathan Lewis @ 12:51 pm BST Apr 13,2015

The following requirement appeared recently on OTN:


=========================================================================================================
I have a following query and want to get rid of the "NOT EXISTS' clause without changing the end results.

SELECT   A.c,
         A.d,
         A.e,
         A.f
  FROM   A
WHERE   NOT EXISTS (SELECT   1
                       FROM   B
                      WHERE   B.c = A.c AND B.d = A.d AND B.e = A.e);
===========================================================================================================

Inevitably this wasn’t the problem query, and almost inevitably the OP was asking us how to implement a solution which wasn’t appropriate for a problem that shouldn’t have existed. Despite this it’s worth spending a little time to take the request at its face value and look at the sort of thing that could be going on.

First, of course, you cannot get rid of the “not exists” clause, although you may be able to make it look different. If you want “all the rows in A that are not referenced in B” then you HAVE to examine all the rows in A, and you have to do some sort of check for each row to see whether or not it exists in B. The only option you’ve got for doing something about the “not exists” clause is to find a way of making it as a cheap as possible to implement.

A couple of people came up with suggestions for rewriting the query to make it more efficient. One suggested writing it as a “NOT IN” subquery, but it’s worth remembering that the optimizer may cheerfully transform a “NOT IN” subquery to a “NOT EXISTS” subquery if it’s legal and a manual rewrite may overlook the problem of NULLs; another suggested rewriting the query as an outer join, but again it’s worth remembering that the optimimzer may transform a “NOT EXISTS” subquery to an “ANTI-JOIN” – which is a bit like an outer join with filter, only more efficient. So, before suggesting a rewrite, it’s worth looking at the execution plan to see what the optimizer is doing just in case it’s doing something silly. There are two options – anti-join or filter subquery.

Here, with code I’ve run under 10.2.0.5 to match the OP, is a demonstration data set, with the two plans you might expect to see – first, some the data:


execute dbms_random.seed(0)

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(0,4))           c,
        trunc(dbms_random.value(0,5))           d,
        trunc(dbms_random.value(0,300))         e,
        rownum                                  f,
        rpad('x',100)                   padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

create table t2
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(0,4))           c,
        trunc(dbms_random.value(0,5))           d,
        trunc(dbms_random.value(0,300))         e,
        rownum                                  f,
        rpad('x',100)                   padding
from
        generator       v1,
        generator       v2
where
        rownum <= 24000
;

create index t1_i1 on t1(c,d,e);
create index t2_i1 on t2(c,d,e);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

The OP had followed up their original query with a claim that “Table A holds 100 million rows and table B holds 24,000″ – that’s a lot of checks (if true) and you ought to be asking how quickly the OP expects the query to run and how many of the 100 M rows are going to survive the check. I’ve set up just 1M rows with 6,000 distinct values for the column combination (c,d,e), and a reference table with 24,000 rows which are likely to include most, but not all, of those 6,000 combinations.

Rather than generate a very large output, I’ve written a query that generates the required data set, then counts it:


select
        max(f), count(*)
from (
        SELECT   /*+ no_merge */
                 A.c,
                 A.d,
                 A.e,
                 A.f
          FROM   t1 A
        WHERE   NOT EXISTS (SELECT   /* no_unnest */
                                      1
                               FROM   t2 B
                              WHERE   B.c = A.c AND B.d = A.d AND B.e = A.e)
)
;

This took about 0.35 seconds to run – aggregating roughly 14,500 rows from 1M. The plan was (as I had expected) based on a (right) hash anti join:


---------------------------------------------------------------------------------
| Id  | Operation               | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |       |     1 |    13 |  2183   (5)| 00:00:11 |
|   1 |  SORT AGGREGATE         |       |     1 |    13 |            |          |
|   2 |   VIEW                  |       |   999K|    12M|  2183   (5)| 00:00:11 |
|*  3 |    HASH JOIN RIGHT ANTI |       |   999K|    23M|  2183   (5)| 00:00:11 |
|   4 |     INDEX FAST FULL SCAN| T2_I1 | 24000 |   234K|    11  (10)| 00:00:01 |
|   5 |     TABLE ACCESS FULL   | T1    |  1000K|    14M|  2151   (4)| 00:00:11 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("B"."C"="A"."C" AND "B"."D"="A"."D" AND "B"."E"="A"."E")

Oracle has built an in-memory hash table from the 24,000 rows in t2, then scanned the t1 table, probing the hash table with each row in turn. That’s 1M probe in less than 0.35 seconds. You ought to infer from this that most of the time spent in the original query should have been spent scanning the 100M rows, and only a relatively small increment appear due to the “not exists” clause.

You’ll notice, though that there was a comment in my subquery with the /* no_unnest */ hint embedded – if I change this from a comment to a hint (/*+ */) I should get a plan with a filter subquery, and maybe that’s what’s happening to the OP for some odd reason. Here’s the plan:


------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |     1 |    13 | 15166   (1)| 00:01:16 |
|   1 |  SORT AGGREGATE      |       |     1 |    13 |            |          |
|   2 |   VIEW               |       |   999K|    12M| 15166   (1)| 00:01:16 |
|*  3 |    FILTER            |       |       |       |            |          |
|   4 |     TABLE ACCESS FULL| T1    |  1000K|    14M|  2155   (4)| 00:00:11 |
|*  5 |     INDEX RANGE SCAN | T2_I1 |     4 |    40 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "B"
              WHERE "B"."E"=:B1 AND "B"."D"=:B2 AND "B"."C"=:B3))
   5 - access("B"."C"=:B1 AND "B"."D"=:B2 AND "B"."E"=:B3)

The query took 1.65 seconds to complete. (And re-running with rowsource execution statistics enabled, I found that the subquery had executed roughly 914,000 times in that 1.65 seconds). Even if the original query had used the filter subquery plan the subquery shouldn’t have made much difference to the overall performance. Of course if T2 didn’t have that index on (c,d,e) then the filter subquery plan would have been much more expensive – but then, we would really have expected to see the hash anti-join.

If you’re wondering why the subquery ran 914,000 times instead of 1M times, you’ve forgotten “scalar subquery caching”.  The session caches a limited number of results from subquery execution as a query runs and may be able to use cached results (or simply a special “previous-execution” result) to minimise the number of executions of the subquery.

Did you notice the index I created on t1(c,d,e) ? If I drive the query through this index I’ll access all the rows for a given combination of (c,d,e) one after the other and only have to run the subquery once for the set. To make this happen, though, I’ll have to declare one of the columns to be NOT NULL, or add a suitable “column is not null” predicate to the query; and then I’ll probably have to hint the query anyway:


select
        max(f)
from (
        SELECT   /*+ no_merge index(a) */
                 A.c,
                 A.d,
                 A.e,
                 A.f
          FROM   t1 A
        WHERE   NOT EXISTS (SELECT   /*+ no_unnest */
                                      1
                               FROM   t2 B
                              WHERE   B.c = A.c AND B.d = A.d AND B.e = A.e)
        and     c is not null
)
;

---------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |     1 |    13 | 65706   (1)| 00:05:29 |
|   1 |  SORT AGGREGATE               |       |     1 |    13 |            |          |
|   2 |   VIEW                        |       |   999K|    12M| 65706   (1)| 00:05:29 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1    | 50000 |   732K| 52694   (1)| 00:04:24 |
|*  4 |     INDEX FULL SCAN           | T1_I1 | 50000 |       |  2869   (2)| 00:00:15 |
|*  5 |      INDEX RANGE SCAN         | T2_I1 |     4 |    40 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("C" IS NOT NULL AND  NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM
              "T2" "B" WHERE "B"."E"=:B1 AND "B"."D"=:B2 AND "B"."C"=:B3))
   5 - access("B"."C"=:B1 AND "B"."D"=:B2 AND "B"."E"=:B3)

Re-running this code with rowsource execution statistics enabled showed that the subquery ran just 6,000 times (as expected) – for a total run time that was slightly faster than the hash anti-join method (0.17 seconds – but I do have a new laptop using SSD only, with a 3.5GHz CPU and lots of memory).

Every which way, if we can get reasonable performance from the underlying table access there’s no way that introducing a “NOT EXISTS” ought to be a disaster. The worst case scenario – for some reason Oracle chooses to run a filter subquery plan and the appropriate index hasn’t been created to support it.

Footnote:

Of course, table A didn’t really exist, it was a three table join; and it didn’t produce 100M rows, it produced anything between zero and 5 million rows, and the effect of the subquery (which correlated back to two of the joined tables) was to leave anything between 0 and 5 million rows. And (apparently) the query was quick enough in the absence of the subquery (producing, for example, 1 million rows in only 5 minutes), but too slow with the subquery in place.

But that’s okay. Because of our tests we know that once we’ve produced a few million rows it takes fractions of a second more to pass them through a hash table with an anti-join to deal with the “not exists” subquery; and I doubt if we have to play silly games to push the data through a filter subquery plan in the right order to squeeze a few extra hundredths of a second from the query.

If the OP is happy with the basic select statement before the “not exists” subquery, all he has to do is take advantage of a no_merge hint:


select  {list of columns}
from
        (
        select /*+ no_merge */ .... rest of original query
        )    v1
where
        not exists (
                select  null
                from    b
                where   b.c = v1.c and b.d = v1.d and b.e = v1.e
        )
;

You’re probably wondering why the OP currently sees a performance problem as the subquery is added. The best guess is that the subquery has introduce a “magic 5% fudge factor” to the arithmetic (did you notice the cardinality of t1 dropping to 50,000 from 1M in the plan above) and made it pick a worse execution plan for the rest of the query. We can’t tell, though, since the OP hasn’t yet given us the information that would allow us to see what’s going wrong.

April 10, 2015

Counting

Filed under: Execution plans,Indexing,Oracle,Performance — Jonathan Lewis @ 5:27 pm BST Apr 10,2015

There’s a live example on OTN at the moment of an interesting class of problem that can require some imaginative thinking. It revolves around a design that uses a row in one table to hold the low and high values for a range of values in another table. The problem is then simply to count the number of rows in the second table that fall into the range given by the first table. There’s an obvious query you can write (a join with inequality) but if you have to join each row in the first table to several million rows in the second table, then aggregate to count them, that’s an expensive strategy.  Here’s the query (with numbers of rows involved) that showed up on OTN; it’s an insert statement, and the problem is that it takes 7 hours to insert 37,600 rows:


    INSERT INTO SA_REPORT_DATA
    (REPORT_ID, CUTOFF_DATE, COL_1, COL_2, COL_3)
    (
    SELECT 'ISRP-734', to_date('&DateTo', 'YYYY-MM-DD'),
           SNE.ID AS HLR
    ,      SNR.FROM_NUMBER||' - '||SNR.TO_NUMBER AS NUMBER_RANGE
    ,      COUNT(M.MSISDN) AS AVAILABLE_MSISDNS
    FROM
           SA_NUMBER_RANGES SNR          -- 10,000 rows
    ,      SA_SERVICE_SYSTEMS SSS        --  1,643 rows
    ,      SA_NETWORK_ELEMENTS SNE       --    200 rows
    ,      SA_MSISDNS M                  --    72M rows
    WHERE
           SSS.SEQ = SNR.SRVSYS_SEQ
    AND    SSS.SYSTYP_ID = 'OMC HLR'
    AND    SNE.SEQ = SSS.NE_SEQ
    AND    SNR.ID_TYPE = 'M'
    AND    M.MSISDN  >= SNR.FROM_NUMBER
    AND    M.MSISDN  <= SNR.TO_NUMBER
    AND    M.STATE  = 'AVL'
    GROUP BY
           SNE.ID,SNR.FROM_NUMBER||' - '||SNR.TO_NUMBER
    )  

The feature here is that we are counting ranges of MSISDN: we take 10,000 number ranges (SNR) and join with inequality to a 72M row table. It’s perfectly conceivable that at some point the data set expands (not necessarily all at once) to literally tens of billions of rows that are then aggregated down to the 37,500 that are finally inserted.

The execution plan shows the optimizer joining the first three tables before doing a merge join between that result set and the relevant subset of the MSISDNs table – which means the MSISDNs have to be sorted and buffered (with a probably spill to disc) before they can be used. It would be interesting to see the rowsource execution stats for the query – partly to see how large the generated set became, but also to see if the ranges involved were so large that most of the time went in constantly re-reading the sorted MSISDNs from the temporary tablespace.

As far as optimisation is concerned, there are a couple of trivial things around the edges we can examine: we have 10,000 number ranges but insert 37,600 results, and the last stages of the plan generated those results so we’ve scanned and aggregated the sorted MSISDNs 37,600 times. Clearly we could look for a better table ordering that (eliminated any number ranges early), then did the minimal number of joins to MSISDN, aggregated, then scaled up to 37,600: with the best join order we might reduce the run time by a factor of 3 or more. (But that’s still a couple of hours run time.)

What we really need to do to make a difference is change the infrastructure in some way – prefereably invisibly to the rest of the application. There are a number of specific details relating to workload, read-consistency, timing, concurrency, etc. that will need to be considered, but broadly speaking, we need to take advantage of a table that effectively holds the “interesting” MSISDNs in sorted order. I’ve kept the approach simple here, it needs a few modifications for a production system. The important bit of the reports is the bit that produces the count, so I’m only going to worry about a two-table join – number ranges and msidn; here’s some model data:


execute dbms_random.seed(0)

create table msisdns
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(1e9,1e10))      msisdn
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

create table number_ranges
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(1e9,1e10))      from_number,
        trunc(dbms_random.value(1e9,1e10))      to_number
from
        generator       v1
where
        rownum  <= 1000
;

update number_ranges set
        from_number = to_number,
        to_number = from_number
where
        to_number < from_number
;

commit;

I’ve created a table of numbers with values between 10e9 and 10e10 to represent 1 million MSISDNs, and a list of 1,000 number ranges – making sure that the FROM number is not greater than the TO number. Now I need a “summary” table of the MSISDNs, which I’m going to create as an index-organized table:


create table tmp_msisdns (
        msisdn,
        counter,
        constraint tmp_pk primary key (msisdn, counter)
)
organization index
as
select
        msisdn,
        row_number() over(order by msisdn)      counter
from
        msisdns
;

This is only a demonstration so I’ve haven’t bothered with production-like code to check that the MSISDNs I had generated were unique (they were); and I’ve casually included the row_number() as part of the primary key as a performance fiddle even though it’s something that could, technically, allow some other program to introduce bad data if I made the table available for public use rather than task specific.

Finally we get down to the report. To find out how many MSISDN values there are between the FROM and TO number in a range I just have to find the lowest and highest MSISDNs from tmp_msisdn in that range and find the difference between their counter values, and add 1. And there’s a very fast way to find the lowest or highest values when you have the appropriate index – the min/max range scan – but you have to access the table twice, once for the low, once for the high. Here’s the necessary SQL, with execution plan from 12.1.0.2:


select
        nr.from_number, nr.to_number,
--      fr1.msisdn, fr1.counter,
--      to1.msisdn, to1.counter,
        1 + to1.counter - fr1.counter range_count
from
        number_ranges   nr,
        tmp_msisdns     fr1,
        tmp_msisdns     to1
where
        fr1.msisdn = (
                select min(msisdn) from tmp_msisdns where tmp_msisdns.msisdn >= nr.from_number
        )
and     to1.msisdn = (
                select max(msisdn) from tmp_msisdns where tmp_msisdns.msisdn <= nr.to_number
        )
;

-------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |               |       |       |  4008 (100)|          |
|   1 |  NESTED LOOPS                   |               |  1000 | 38000 |  4008   (1)| 00:00:01 |
|   2 |   NESTED LOOPS                  |               |  1000 | 26000 |  2005   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL            | NUMBER_RANGES |  1000 | 14000 |     2   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN             | TMP_PK        |     1 |    12 |     2   (0)| 00:00:01 |
|   5 |     SORT AGGREGATE              |               |     1 |     7 |            |          |
|   6 |      FIRST ROW                  |               |     1 |     7 |     3   (0)| 00:00:01 |
|*  7 |       INDEX RANGE SCAN (MIN/MAX)| TMP_PK        |     1 |     7 |     3   (0)| 00:00:01 |
|*  8 |   INDEX RANGE SCAN              | TMP_PK        |     1 |    12 |     2   (0)| 00:00:01 |
|   9 |    SORT AGGREGATE               |               |     1 |     7 |            |          |
|  10 |     FIRST ROW                   |               |     1 |     7 |     3   (0)| 00:00:01 |
|* 11 |      INDEX RANGE SCAN (MIN/MAX) | TMP_PK        |     1 |     7 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("FR1"."MSISDN"=)
   7 - access("TMP_MSISDNS"."MSISDN">=:B1)
   8 - access("TO1"."MSISDN"=)
  11 - access("TMP_MSISDNS"."MSISDN"<=:B1)

Execution time – with 1 million MSISDNs and 1,000 ranges: 0.11 seconds.

For comparative purposes, and to check that the code is producing the right answers, here’s the basic inequality join method:


select
        nr.from_number, nr.to_number, count(*) range_count
from
        number_ranges   nr,
        msisdns         ms
where
        ms.msisdn >= nr.from_number
and     ms.msisdn <= nr.to_number
group by
        nr.from_number, nr.to_number
order by
        nr.from_number
;

-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |               |       |       |       |   472K(100)|          |
|   1 |  HASH GROUP BY        |               |   707K|    14M|  6847M|   472K (17)| 00:00:19 |
|   2 |   MERGE JOIN          |               |   255M|  5107M|       | 13492  (77)| 00:00:01 |
|   3 |    SORT JOIN          |               |  1000 | 14000 |       |     3  (34)| 00:00:01 |
|   4 |     TABLE ACCESS FULL | NUMBER_RANGES |  1000 | 14000 |       |     2   (0)| 00:00:01 |
|*  5 |    FILTER             |               |       |       |       |            |          |
|*  6 |     SORT JOIN         |               |  1000K|  6835K|    30M|  3451   (7)| 00:00:01 |
|   7 |      TABLE ACCESS FULL| MSISDNS       |  1000K|  6835K|       |   245  (14)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - filter("MS"."MSISDN"<="NR"."TO_NUMBER")
   6 - access("MS"."MSISDN">="NR"."FROM_NUMBER")
       filter("MS"."MSISDN">="NR"."FROM_NUMBER")

The two queries produced the same results (apart from ordering); but the second query took 2 minutes 19.4 seconds to complete.

 

Update:

In a moment of idle curiosity I recreated the data with 40 Million rows in the MSISDNs table to get some idea of how fast the entire report process could go when re-engineered (remember the OP has 72M rows, but select the subset flagged as ‘AVL’). It took 1 minute 46 seconds to create the IOT – after which the report for 1,000 number ranges still took less than 0.2 seconds.

 

 

 

 

 

April 5, 2015

Not In CTAS

Filed under: Infrastructure,NULL,Oracle — Jonathan Lewis @ 6:49 pm BST Apr 5,2015

Everyone gets caught out some of the time with NOT IN.

NOT IN is not the opposite of IN.

This came up in a (fairly typical) question on OTN recently where someone had the task of “deleting 6M rows from a table of 18M”. A common, and perfectly reasonable, suggestion for dealing with a delete on this scale is to consider creating a replacement table holding the data you do want rather than deleting the data you don’t want.  In this case, however, the query for deleting the data looked like this:


DELETE FROM EI.CASESTATUS
     WHERE CASEID NOT IN (SELECT CASEID FROM DO.STG_CASEHEADER);

The suggested code for creating the kept data was this:


CREATE TABLE newTable as
  SELECT * FROM EI.CASESTATUS
     WHERE CASEID IN (SELECT CASEID FROM DO.STG_CASEHEADER);

You might get the same result sets in the two tables after this process – but it depends on the CASEID in both tables never being NULL. (You might think that a column with ID in the name probably ought to be a primary key, or a foreign key with a NOT NULL declaration, but then again there’s that STG_ in the subquery table that might be indicative of a “staging” table, so who knows what might happen if the data’s allowed to start life as dirty data.) Here’s a quick demo to prove the point. First some simple data creation – with an optional insert so that you’ve got two tests in one script – followed by the two strategies for identifying data:


drop table t3;
drop table t2;
drop table t1;

create table t1 (n1 number);
create table t2 (n1 number);

insert into t1 values(null);
insert into t1 values(1);
insert into t1 values(2);

/* two tests available, with or without a null in t2 */

-- insert into t2 values(null);
insert into t2 values(1);

commit;

-- gather stats here

set null n/a
delete from t1
where  t1.n1 not in (select t2.n1 from t2);

prompt  Remainder after delete

select  *  from t1;

rollback;

prompt  Selected on create

create table t3 as
select  *  from t1
where   t1.n1 in (select t2.n1 from t2);

select * from t3;

Then the two sets of output from running the test, first with the the NULL insert into t2:


Remainder after delete

        N1
----------
n/a
         1
         2
Selected on create

        N1
----------
         1

We haven’t deleted ANY data from t1 when we were probably hoping that the 2 would disappear – after all, it’s not in t2; however since the equality comparison between a t1 row and every t2 row must evaluate to FALSE before a t1 row is deleted and the comparison of 2 and NULL evaluates to NULL the 2 row won’t be deleted (similarly the comparison for “t1.NULL = t2.anything” evaluates to NULL rather than FALSE, so the NULL isn’t deleted).

Still, perhaps the rewrite would have been okay for the data set where we don’t have a NULL in t2:


Remainder after delete

        N1
----------
n/a
         1
Selected on create

        N1
----------
         1

Oops – still doesn’t produce matching results . This time the row with the 2 has disappeared from t1 in both cases – which might have been closer to what the original OTN poster had hoped but we still have the difference in the survival of the NULLs from t1 – for the reason given for the previous data set

Footnote:

In passing, the execution plan subsequently supplied by the OP showed a “MERGE JOIN ANTI NA” with stg_caseheader (the subquery table) as the second table. The significance of the NA (Null-aware) is that it tells us that the join column in stg_caseheader definitely doesn’t have a NOT NULL constraint on it. (We can’t draw any conclusion about the join column in casestatus.)

Next Page »

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 5,095 other followers