Oracle Scratchpad

June 17, 2015

Reverse Key

Filed under: Indexing,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:11 pm BST Jun 17,2015

A question came up on the OTN database forum recently asking if you could have a partitioned index on a non-partitioned table.

(Aside: I’m not sure whether it would be quicker to read the manuals or try the experiment – either would probably be quicker than posing the question to the forum. As so often happens in these RTFM questions the OP didn’t bother to acknowledge any of the responses)

The answer to the question is yes – you can create a globally partitioned index, though if it uses range partitioning you have to specify a MAXVALUE partition. The interesting thing about the question, though is that several people tried to guess why it had been asked and then made suggestions based on the most likely guess (and wouldn’t it have been nice to see some response from the OP ). The common guess was that there was a performance problem with the high-value block of a sequence-based (or time-based) index – a frequent source of “buffer busy wait” events and other nasty side effects.

Unfortunately too many people suggesting reverse key as a solution to this “right-hand” problem. If you’re licensed for partitioning it’s almost certain that a better option would simple be to use global hash partitioning (with 2^N for some N) partitions. Using reverse keys can result in a bigger performance than the one you’re trying to avoid – you may end up turning a little time spent on buffer busy waits into a large amount of time spent on db file sequential reads. To demonstrate the issue I’ve created a sample script – and adjusted my buffer cache down to the appropriate scale:

create table t1(
	id	not null
)
nologging
as
with generator as (
	select	--+ materialize
		rownum id 
	from dual 
	connect by 
		rownum <= 1e4
)
select
	1e7 + rownum	id
from
	generator	v1,
	generator	v2
where
	rownum <= 1e7 
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1'
	);
end;
/

alter table t1 add constraint t1_pk primary key(id) 
using index 
	reverse 
	nologging 
;

alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 8';

begin
	for i in 20000001..20010000 loop
		insert into t1 values(i);
	end loop;
end;
/

I’ve created a table with 10,000,000 rows using a sequential value as the primary key, then inserted “the next” 10,000 rows into the table in order. The index occupied about about 22,000 blocks, so to make my demonstration show you the type of effect you could get from a busy production system with more tables and many indexes I ran my test with the buffer cache limited to 6,000 blocks – a fair fraction of the total index size. Here’s a small section of the trace file from the test running 10.2.0.3 on an elderly machine:


WAIT #43: nam='db file sequential read' ela= 13238 file#=6 block#=12653 blocks=1 obj#=63623 tim=3271125590
WAIT #43: nam='db file sequential read' ela=  7360 file#=6 block#=12749 blocks=1 obj#=63623 tim=3271133150
WAIT #43: nam='db file sequential read' ela=  5793 file#=6 block#=12844 blocks=1 obj#=63623 tim=3271139110
WAIT #43: nam='db file sequential read' ela=  5672 file#=6 block#=12940 blocks=1 obj#=63623 tim=3271145028
WAIT #43: nam='db file sequential read' ela= 15748 file#=5 block#=13037 blocks=1 obj#=63623 tim=3271160998
WAIT #43: nam='db file sequential read' ela=  8080 file#=5 block#=13133 blocks=1 obj#=63623 tim=3271169314
WAIT #43: nam='db file sequential read' ela=  8706 file#=5 block#=13228 blocks=1 obj#=63623 tim=3271178240
WAIT #43: nam='db file sequential read' ela=  7919 file#=5 block#=13325 blocks=1 obj#=63623 tim=3271186372
WAIT #43: nam='db file sequential read' ela= 15553 file#=6 block#=13549 blocks=1 obj#=63623 tim=3271202115
WAIT #43: nam='db file sequential read' ela=  7044 file#=6 block#=13644 blocks=1 obj#=63623 tim=3271209420
WAIT #43: nam='db file sequential read' ela=  6062 file#=6 block#=13741 blocks=1 obj#=63623 tim=3271215648
WAIT #43: nam='db file sequential read' ela=  6067 file#=6 block#=13837 blocks=1 obj#=63623 tim=3271221887
WAIT #43: nam='db file sequential read' ela= 11516 file#=5 block#=13932 blocks=1 obj#=63623 tim=3271234852
WAIT #43: nam='db file sequential read' ela=  9295 file#=5 block#=14028 blocks=1 obj#=63623 tim=3271244368
WAIT #43: nam='db file sequential read' ela=  9466 file#=5 block#=14125 blocks=1 obj#=63623 tim=3271254002
WAIT #43: nam='db file sequential read' ela=  7704 file#=5 block#=14221 blocks=1 obj#=63623 tim=3271261991
WAIT #43: nam='db file sequential read' ela= 16319 file#=6 block#=14444 blocks=1 obj#=63623 tim=3271278492
WAIT #43: nam='db file sequential read' ela=  7416 file#=6 block#=14541 blocks=1 obj#=63623 tim=3271286129
WAIT #43: nam='db file sequential read' ela=  5748 file#=6 block#=14637 blocks=1 obj#=63623 tim=3271292163
WAIT #43: nam='db file sequential read' ela=  7131 file#=6 block#=14732 blocks=1 obj#=63623 tim=3271299489
WAIT #43: nam='db file sequential read' ela= 16126 file#=5 block#=14829 blocks=1 obj#=63623 tim=3271315883
WAIT #43: nam='db file sequential read' ela=  7746 file#=5 block#=14925 blocks=1 obj#=63623 tim=3271323845
WAIT #43: nam='db file sequential read' ela=  9208 file#=5 block#=15020 blocks=1 obj#=63623 tim=3271333239
WAIT #43: nam='db file sequential read' ela=  7708 file#=5 block#=15116 blocks=1 obj#=63623 tim=3271341141
WAIT #43: nam='db file sequential read' ela= 15484 file#=6 block#=15341 blocks=1 obj#=63623 tim=3271356807
WAIT #43: nam='db file sequential read' ela=  5488 file#=6 block#=15437 blocks=1 obj#=63623 tim=3271362623
WAIT #43: nam='db file sequential read' ela= 10447 file#=6 block#=15532 blocks=1 obj#=63623 tim=3271373342
WAIT #43: nam='db file sequential read' ela= 12565 file#=6 block#=15629 blocks=1 obj#=63623 tim=3271386741
WAIT #43: nam='db file sequential read' ela= 17168 file#=5 block#=15725 blocks=1 obj#=63623 tim=3271404135
WAIT #43: nam='db file sequential read' ela=  7542 file#=5 block#=15820 blocks=1 obj#=63623 tim=3271411882
WAIT #43: nam='db file sequential read' ela=  9400 file#=5 block#=15917 blocks=1 obj#=63623 tim=3271421514
WAIT #43: nam='db file sequential read' ela=  7804 file#=5 block#=16013 blocks=1 obj#=63623 tim=3271429519
WAIT #43: nam='db file sequential read' ela= 14470 file#=6 block#=16237 blocks=1 obj#=63623 tim=3271444168
WAIT #43: nam='db file sequential read' ela=  5788 file#=6 block#=16333 blocks=1 obj#=63623 tim=3271450154
WAIT #43: nam='db file sequential read' ela=  9630 file#=6 block#=16429 blocks=1 obj#=63623 tim=3271460008
WAIT #43: nam='db file sequential read' ela= 10910 file#=6 block#=16525 blocks=1 obj#=63623 tim=3271471174
WAIT #43: nam='db file sequential read' ela= 15683 file#=5 block#=16620 blocks=1 obj#=63623 tim=3271487065
WAIT #43: nam='db file sequential read' ela=  8094 file#=5 block#=16717 blocks=1 obj#=63623 tim=3271495454
WAIT #43: nam='db file sequential read' ela=  6670 file#=5 block#=16813 blocks=1 obj#=63623 tim=3271502293
WAIT #43: nam='db file sequential read' ela=  7852 file#=5 block#=16908 blocks=1 obj#=63623 tim=3271510360
WAIT #43: nam='db file sequential read' ela= 10500 file#=6 block#=17133 blocks=1 obj#=63623 tim=3271521039
WAIT #43: nam='db file sequential read' ela= 11038 file#=6 block#=17229 blocks=1 obj#=63623 tim=3271532275
WAIT #43: nam='db file sequential read' ela= 12432 file#=6 block#=17325 blocks=1 obj#=63623 tim=3271544974
WAIT #43: nam='db file sequential read' ela=  7784 file#=6 block#=17421 blocks=1 obj#=63623 tim=3271553331
WAIT #43: nam='db file sequential read' ela=  7774 file#=5 block#=17517 blocks=1 obj#=63623 tim=3271561346
WAIT #43: nam='db file sequential read' ela=  6583 file#=5 block#=17613 blocks=1 obj#=63623 tim=3271568146
WAIT #43: nam='db file sequential read' ela=  7901 file#=5 block#=17708 blocks=1 obj#=63623 tim=3271576231
WAIT #43: nam='db file sequential read' ela=  6667 file#=5 block#=17805 blocks=1 obj#=63623 tim=3271583259
WAIT #43: nam='db file sequential read' ela=  9427 file#=6 block#=18029 blocks=1 obj#=63623 tim=3271592988
WAIT #43: nam='db file sequential read' ela= 52334 file#=6 block#=18125 blocks=1 obj#=63623 tim=3271646055
WAIT #43: nam='db file sequential read' ela= 50512 file#=6 block#=18221 blocks=1 obj#=63623 tim=3271697284
WAIT #43: nam='db file sequential read' ela= 10095 file#=6 block#=18317 blocks=1 obj#=63623 tim=3271708095

Check the block number for this list of single block reads – we’re jumping through the index about 100 blocks at a time to read the next block where an index entry has to go. The jumps are the expected (and designed) effect of reverse key indexes: the fact that the jumps turn into physical disc reads is the (possibly unexpected) side effect. Reversing an index makes adjacent values look very different (by reversing the bytes) and go to different index leaf blocks: the purpose of the exercise is to scatter concurrent similar inserts across multiple blocks, but if you scatter the index entries you need to buffer a lot more of the index to keep the most recently used values in memory. Reversing the index may eliminate buffer busy waits, but it may increase time lost of db file sequential reads dramatically.

Here’s a short list of interesting statistics from this test – this time running on 11.2.0.4 on a machine with SSDs) comparing the effects of reversing the index with those of not reversing the index – normal index first:


Normal index
------------
CPU used by this session               83
DB time                                97
db block gets                      40,732
physical reads                         51
db block changes                   40,657
redo entries                       20,174
redo size                       5,091,436
undo change vector size         1,649,648

Repeat with reverse key index
-----------------------------
CPU used by this session              115
DB time                               121
db block gets                      40,504
physical reads                     10,006
db block changes                   40,295
redo entries                       19,973
redo size                       4,974,820
undo change vector size         1,639,232

Because of the SSDs there’s little difference in timing between the two sets of data and, in fact, all the other measures of work done are very similar except for the physical read, and the increase in reads is probably the cause of the extra CPU time thanks to both the LRU manipulation and the interaction with the operating system.

If you want to check the effect of index reversal you can take advantage of the sys_op_lbid() function to sample a little of your data – in my case I’ve queried the last 10,000 rows (values) in the table:


select 
	/*+ 
		cursor_sharing_exact 
		dynamic_sampling(0) 
		no_monitoring 
		no_expand 
		index_ffs(t1,t1_i1) 
		noparallel_index(t,t1_i1) 
	*/ 
	count (distinct sys_op_lbid( &m_ind_id ,'L',t1.rowid)) as leaf_blocks
from 
	t1
where 
	id between 2e7 + 1 and 2e7 + 1e4
;

The &m_ind_id substition variable is the object_id of the index t1_i1.

In my case, with an index of 22,300 leaf blocks, my 10,000 consecutive values were scattered over 9,923 leaf blocks. If I want access to “recent data” to be as efficient as possible I need to keep that many blocks of the index cached, compared to (absolute) worst case for my data 100 leaf blocks. When you reverse key an index you have to think about how much bigger you have to make your buffer cache to keep the performance constant.

June 2, 2015

Predicate Order

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 7:10 pm BST Jun 2,2015

A recent OTN post demonstrated a very important point about looking at execution plans – especially when you don’t use the right data types. The question was:

We’ve this query which throws invalid number

SELECT * FROM table A
WHERE A.corporate_id IN (59375,54387) AND TRUNC(created_dt) BETWEEN '19-DEC-14' AND '25-DEC-14';

However it works fine if we use not in instead of in

SELECT * FROM table A  
WHERE A.corporate_id  NOT IN (59375,54387) AND TRUNC(created_dt) BETWEEN '19-DEC-14' AND '25-DEC-14';

Please assist.

A follow-up post told us that corporate_id was a varchar() type – so the root cause of the ORA-01722: invalid number error is simply that you shouldn’t be mixing data types. Either the corporate_id should have been defined as numeric or the in-list should have been a list of varchar2() values. (And, of course, the character strings that look like dates should have been converted explicitly to date data types using either the to_date() function with a 4-digit year or the date ‘yyyy-mm-dd’ syntax; and using “created_dt >=  19th Dec and created_dt < 26th Dec” would have given the optimizer a chance to get a better cardinality estimate)

The answer to the slightly more specific problem – why does changing NOT IN to IN allow the query to run rather than crashing – is (probably) one that I first addressed in an article in Oracle Magazine just over eleven years ago: with CPU costing enabled Oracle can change the order in which it applies filter predicates to a table. It’s also a question that can easily be answered by my commonest response to many of the optimizer questions that appear on OTN – look at the execution plan.

In this example it’s a fairly safe bet that there’s a reasonable small volume of data (according to the optimizer’s estimate) where to_number(corporate_id) is one of the required values, and a much larger volume of data where it is not; with some intermediate volume of data where the created_dt falls in the required date range. With CPU costing enabled (optional in 9i, enabled by default in 10g) the optimizer would then do some arithmetic to calculate the most cost-effective order of applying the filter predicates based on things like: the number of CPU cycles it takes to walk along a row to find a particular column. the number of CPU cycles it takes to convert a character column to a number and compare it with a number; the number of CPU cycles it takes truncate a date column and compare it with a string, the number of rows that would pass the numeric test hence requiring the first-applied date test, compared with the number of rows that would survive the first-applied date test hence requiring either the second date test or the numeric test to take place.

Here’s some code to demonstrate the point. It may require the system stats to be set to a particular values to ensure that it is probably repeatable, but there’s probably some flexibility in the range, which is why I’ve called dbms_stats.set_system_stats() in the first few lines:

drop table t1 purge;

create table t1 (
        v1      varchar2(10),
        d1      date
)
;

insert into t1 values(1,'01-Jan-2015');

insert into t1 values('x','02-Jan-2015');

insert into t1 values(3,'03-Jan-2015');
insert into t1 values(4,'04-Jan-2015');
insert into t1 values(5,'05-Jan-2015');
insert into t1 values(6,'06-Jan-2015');
insert into t1 values(7,'07-Jan-2015');
insert into t1 values(8,'08-Jan-2015');
insert into t1 values(9,'09-Jan-2015');
insert into t1 values(10,'10-Jan-2015');

execute dbms_stats.gather_table_stats(user,'t1');

First we create a table, load some data, and gather stats. You’ll notice that I’ve got a varchar2(10) column into which I’ve inserted numbers for all rows except one where it holds the value ‘x’. Now we just run some code to check the execution plans for a couple of queries.


explain plan for
select
        *
from    t1
where   v1 in (4,6)
and     d1 between '03-Jan-2015' and '09-Jan-2015'
;

select * from table(dbms_xplan.display);

explain plan for
select
        *
from    t1
where   v1 not in (4,6)
and     d1 between '03-Jan-2015' and '&1-Jan-2015'
;

select * from table(dbms_xplan.display);

As with the original question I’ve take a query with an IN operator and changed it to NOT IN. The in-list is numeric even though the relevant column is varchar2(10). The first query crashes with ORA-01722: invalid number, the second one runs and returns the correct result. You’ll notice, of course, that the “bad” value for v1 is not in the set of rows
where d1 is between 3rd and 9th Jan 2015. You’ll also notice that in my code I’ve used &1 for the end day in the query with the NOT IN clause so that I can re-run the query a few times to show the effects of changing the date range. Here are the execution plans – first with the IN clause:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     2 |    20 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     2 |    20 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter((TO_NUMBER("V1")=4 OR TO_NUMBER("V1")=6) AND
              "D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "D1"<=TO_DATE(' 2015-01-09 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

The optimizer arithmetic predicts 2 rows returned using a full tablescan – it doesn’t know the query is going to crash. Notice the predicate information, though. The first predicate says Oracle will attempt to convert v1 to a number and compare it with 4 and then (if the first test fails) with 6. The query will crash as soon as it hits a row with a non-numeric value for v1. In outline, the optimizer has decided that the numeric conversion and test is very cheap (on CPU) and only a few rows will survive to take the more expensive date comparison; wherease either of the (expensive) date comparisons would leave a lot of rows that would still have to be checked with the numeric test. It makes sense to do the numeric comparison first.

Here’s the plan for the query with the NOT IN clause when I set the date range to be 3rd Jan to 7th Jan.


Execution plan for NOT IN:  7th Jan
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     5 |    50 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     5 |    50 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("D1"<=TO_DATE(' 2015-01-07 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND TO_NUMBER("V1")<>4 AND TO_NUMBER("V1")<>6)

The plan is still a full tablescan – there are no indexes available – and the estimated number of rows has gone up to 5. The important thing, though, is the predicate section. In this case the optimizer has decided that the first thing it will apply is the (relatively expensive) predicate “d1 >= 3rd Jan” before worrying about the “NOT IN” numeric predicate. The optimizer has worked out that almost all the data will survive the NOT IN predicate, so it’s not efficient to apply it before using other predicates that eliminate more data.

By a stroke of luck my simple example happened to be a very good example. Here’s what happened when I set the end date to 8th Jan:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     6 |    60 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     6 |    60 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "D1"<=TO_DATE(' 2015-01-08 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND TO_NUMBER("V1")<>4 AND TO_NUMBER("V1")<>6)

The estimated rows has gone up to 6 – but the interesting thing, as before, is the predicate section: in the previous example Oracle did the tests in the order “upper bound”, “lower bound”, “numeric”; in this test it has done “lower bound”, “upper bound”, “numeric”.

And this is what I got when I ran the test with 9th Jan:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     7 |    70 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     7 |    70 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND TO_NUMBER("V1")<>4 AND TO_NUMBER("V1")<>6 AND
              "D1"<=TO_DATE(' 2015-01-09 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Again the estimated rows has gone up by one, but the ever-interesting predicate section now shows the evaluation order as: “lower bound”, “numeric”, “upper bound”.

There are only 6 possible orders for the predicate evaluation for the query with the NOT IN clause, and we’ve seen three of them. Three will fail, three will succeed – and I got all three of the successful orders. It wouldn’t take much fiddling around with the data (careful choice of duplicate values, variation in ranges of low and high values, and so on) and I could find a data set where small changes in the requested date range would allow me to reproduce all six variations. In fact when I changed my nls_date_format to “dd-mon-yy” and used a 2 digit year for testing I got two of the three possible failing predicate evaluation orders – “numeric”, “lower bound”, “higher bound” and “higher bound”, “numeric”, “lower bound” without changing the data set. (To be able to get all six orders with a single data set I’d probably need a data set where the “bad” v1 value corresponded to a d1 value somewhere mear the middle of the d1 range.)

The bottom line – use the correct data types; make sure your date literals are dates with a 4-digit year; check the predicate section to see if the optimizer did any implicit conversions with your predicates and what order it used them in. If you don’t do this you may find that a query can work perfectly for months, then crash because you finally got unlucky with the arithmetic.

May 21, 2015

Understanding SQL

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 6:12 pm BST May 21,2015

From time to time someone publishes a query on the OTN database forum and asks how to make it go faster, and you look at it and think it’s a nice example to explain a couple of principles because it’s short, easy to understand, obvious what sort of things might be wrong, and easy to fix. Then, after you’ve made a couple of suggestions and explained a couple of ideas the provider simply fades into the distance and doesn’t tell you any more about the query, or whether they’ve taken advantage of your advice, or found some other way to address the problem.

Such a query, with its execution plan, appeared a couple of weeks ago:

UPDATE FACETS_CUSTOM.MMR_DTL
SET
	CAPITN_PRCS_IND = 2,
	FIL_RUN_DT = Current_fil_run_dt,
	ROW_UPDT_DT = dta_cltn_end_dttm
WHERE
	CAPITN_PRCS_IND = 5
AND	HSPC_IND ='Y'
AND	EXISTS (
		SELECT	1
		FROM	FACETS_STAGE.CRME_FUND_DTL_STG STG_CRME
		WHERE	STG_CRME.MBR_CK = MMR_DTL.MBRSHP_CK
		AND	MMR_DTL.PMT_MSA_STRT_DT BETWEEN STG_CRME.ERN_FROM_DT AND STG_CRME.ERN_THRU_DT
		AND	STG_CRME.FUND_ID IN ('AAB1', '1AA2', '1BA2', 'AAB2', '1AA3', '1BA3', '1B80', '1A80')
	)
AND	EXISTS (
		SELECT	1
		FROM	FACETS_CUSTOM.FCTS_TMS_MBRID_XWLK XWLK
		WHERE	XWLK.MBR_CK = MMR_DTL.MBRSHP_CK
		AND	MMR_DTL.PMT_MSA_STRT_DT BETWEEN XWLK.HSPC_EVNT_EFF_DT AND XWLK.HSPC_EVNT_TERM_DT
	)
;

-------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT              |                       |     1 |   148 | 12431   (2)| 00:02:30 |
|   1 |  UPDATE                       | MMR_DTL               |       |       |            |          |
|   2 |   NESTED LOOPS SEMI           |                       |     1 |   148 | 12431   (2)| 00:02:30 |
|*  3 |    HASH JOIN RIGHT SEMI       |                       |    49 |  5488 | 12375   (2)| 00:02:29 |
|   4 |     TABLE ACCESS FULL         | FCTS_TMS_MBRID_XWLK   |  6494 | 64940 |    24   (0)| 00:00:01 |
|*  5 |     TABLE ACCESS FULL         | MMR_DTL               |   304K|    29M| 12347   (2)| 00:02:29 |
|*  6 |    TABLE ACCESS BY INDEX ROWID| CRME_FUND_DTL_STG     |     1 |    36 |     5   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN          | IE1_CRME_FUND_DTL_STG |     8 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("XWLK"."MBR_CK"="MMR_DTL"."MBRSHP_CK")
       filter("XWLK"."HSPC_EVNT_EFF_DT"<=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT") AND
              "XWLK"."HSPC_EVNT_TERM_DT">=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT"))
   5 - filter("CAPITN_PRCS_IND"=5 AND "HSPC_IND"='Y')
   6 - filter(("STG_CRME"."FUND_ID"='1A80' OR "STG_CRME"."FUND_ID"='1AA2' OR
              "STG_CRME"."FUND_ID"='1AA3' OR "STG_CRME"."FUND_ID"='1B80' OR "STG_CRME"."FUND_ID"='1BA2' OR
              "STG_CRME"."FUND_ID"='1BA3' OR "STG_CRME"."FUND_ID"='AAB1' OR "STG_CRME"."FUND_ID"='AAB2') AND
              "STG_CRME"."ERN_FROM_DT"<=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT") AND
              "STG_CRME"."ERN_THRU_DT">=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT"))
   7 - access("STG_CRME"."MBR_CK"="MMR_DTL"."MBRSHP_CK")

The most informative bit of narrative that went with this query said:

“The table MMR_DTL doesnt have index on these columns CAPITN_PRCS_IND , HSPC_IND .. Since this is an update stmt which will update 85k records, In a dilema whether to index these columns or not .. And this table MMR_DTL is an ever growing table. Worried about the update performance. “

This was in response an observation that there was a full tablescan on MMR_DTL at operation 5 despite the predicate “CAPITN_PRCS_IND”=5 AND “HSPC_IND”=’Y’. You’ll note that the predicted cardinality for that scan is 304K and the update statement is going to change CAPITN_PRCS_IND from the value 5 to the value 2 – so it’s not entirely unreasonable to be worried about the impact of creating an index that included the column capitn_prcs_ind.

What more can we say about this query, given the limited information. Quite a lot – unfortunately the owner of the query isn’t giving anything else away.

I’m going to leave this note unfinished to give people a little chance to think about the clues in the request, the questions they might ask, reasons why there might be a performance problem, and strategies they might investigate, then I’ll update the posting with a few ideas some time in the next 24 hours.

Update 1 – 24th May

There are so many ideas that spring up from a small amount of information that it’s very hard to write a concise and coherent description of what you’ve noticed, when and how far you pursued it, and how relevant the ideas might be to the problem in hand – especially when most of the thoughts require you to ask for more information. Unfortunately the time I had available to write this note has just disappeared, so I’m just going to have to complete it in rapidly written installments. The first bit is an outline of the immediate response I had to the initial presentation of the problem and the execution plan that went with it.

The only comment from the OP on this statement and plan was: “I couldn’t optimize this query for better performance and optimized cost.. Can some one guide me on this.”

We have no idea how many rows would be updated, how long it took, or how long the OP thinks it ought to take; it’s not until a subsequent post that we learn that the number of rows targetted for update is 85,000 – which tells us that the optimizer has run into some problems with its cardinality estimates. This suggests that IF there’s a serious performance problem then POSSIBLY there’s a better execution plan and we might get the optimizer to find it automatically if we could tell it how to adjust its cardinality estimates. It would be nice, however to know where the query spent its time (i.e. can we re-run it with rowsource execution stats or monitoring enabled, and see the actual run-time work in the plan).

If it took a couple of minutes to update that 85,000 rows, I probably wouldn’t want to spend time making it go faster; if it took 2 hours, of which 1 hour 50 minutes was spent waiting for a transaction (row) lock then I’d want to look at why the update collision could happen and see if that problem could be avoided – it might then be the case that the last 10 minutes was spent rolling back and restarting an update that ought to have taken 2 minutes “in vacuo”. Underlying all this, I would want to be sure (as I’ve implicitly, and I think reasonably, assumed) that it’s an update that runs only occasionally, perhaps once per day or once per week.

In the absence of hard information – let’s resort to a few hypotheticals; looking at the plan itself (and knowing the target 85,000 rows) I am prepared to make a few guesses about the run-time activity.

  1. We build an inmemory hash table from the whole of FCTS_TMS_MBRID_XWLK, a step for which the optimizer ought to be able to give a reasonable cost and cardinality – assuming (as I will from now on) that the basic stats are reasonably accurate.
  2. We scan the (fairly large) MMR_DETAIL table, applying a couple of filters; again the optimizer ought to do a reasonable job of estimating the cost of such a tablescan, and we might expect a significant fraction of the time to be spent on multiblock (possibly direct path) reads of the table. The cardinality reported is 304,000 but we note there are two predicates and both are for columns which we might guess have a small number of distinct values – one of which we are changing. Perhaps there’s a bad cardinality error there and maybe a couple of single column histograms would help, but maybe column group stats with a histogram on the pair would be even better. I also wonder when (if) HSPC_IND ever changes from Y to N, and contemplate the possibility of creating a function-based index that records ONLY the rows that match this predicate pair (see the note on indexing that will appear some time over the next week further down the page). It’s at this point that we might ask whether the number of rows returned by this scan should be very similar to the number of rows updated, or whether the scan identifies far too many rows and the two existence tests do a lot of work to eliminate the excess and, if the latter, which test should we apply first and how should we apply it.
  3. Having scanned the MMR_DTL we probe the in-memory hash table copy of FCTS_TMS_MBRID_XWLK for the first match, using an equality predicate (which will be the access predicate) and a range-based (filter) predicate which looks as if it is checking that some “start date” is between an “effective date” and a “termination date”. The estimated size of the result set is FAR too small at 49 rows when we know we have to have at least 85,000 rows survive this test; moreover, this tiny estimate comes from inputs of 6,500 and 304,000 rows being joined so we ought to wonder how such a tiny estimate could appear. A possible explanation is that the application has used some extreme dates to represent NULL values. If that’s the case then it’s possible that suitable histograms might help the optimizer recognise the extreme distribution; alternatively virtual columns that change the extreme values back to NULL and a predicate referencing the virtual columns may help.
  4. After estimating the cardinality of the intermediate result so badly, the optimizer decides that the second existence test can be performed as a semi-join using a nested loop. The thing to note here is that the optimizer “knows” that this is an expensive step – the cost of each table access operation is 5 (4 + 1) – but it’s a step that the optimizer “thinks” shouldn’t happen very frequently so the cost is considered acceptable. We know, however, that this step has to execute at least 85,000 times, so the optimizer’s prediction of visiting 4 blocks in the table to identify (on average) 8 rows and discard (on average) 7 of them looks nasty. Again we note that one of the predicates is range-based on a pair of dates – and in this case we might wonder whether or not most of the rows we discard are outside the date range, and whether we ought to consider (as a general point, and not just for this query) whether or not we should add one, other, or both the ERN_FROM_DT and ERN_THRU_DAT to the IE1_CRME_FUND_DTL_STG index. It’s at this point in the query that we are most ignorant of time spent witht the present data set and how that time will change in the future as the data in the MMR_DTL table grows – on one hand it’s possible that the rows for each MMR_DTL are widely scattered across the CRME_FUND_DTL_STG and this step could do a lot of random I/O, on the other hand the (assumed) time-dependent nature of the data may mean that the only MMR_DTL rows we look at are recently entered and the associated CRME_FUND_DTL_STG rows are therefore also recently entered and closely clustered – leading to a beneficial “self-caching” effect at the “high” end of the table as the query runs, which introduces an efficiency that the optimizer won’t notice. There is one numerical highlight in this join – we have a cost of 5 for each probe and 49 rows to test, so we might expect the incremental cost of the query to be around 250 (i.e. 49 * 5), but the difference between operations 3 and 2 is only 56 – suggesting that the optimizer does have some “self-caching” information, possibly based on there being a significant difference between the two tables for the number of distinct values of the join column. (See, for example: http://oracle-randolf.blogspot.co.uk/2012/05/nested-loop-join-costing.html )

Update 2 – 25th May

Scalability is a question that should always be considered – and there’s a scalability threat in the information we have so far. The plan shows a full tablescan of the MMR_DTL table, and while tablescans are not necessarily a bad thing we’ve been told that: “this table MMR_DTL is an ever growing table“. It’s possible that Oracle can be very quick and efficient when doing the existence tests on the rows it selects from the table – but it is inevitable that the tablescan will take longer to complete as time passes. Whether or not this is likely to matter is something we can’t decide from the information given: we don’t know how much of the time is the tablescan, we don’t know what fraction of the total time is due to the tablescan, and we don’t know  how much larger the table will grow each day (although, taking a wild guess, if we assume that the query runs once per day to manipulate recently arrived data the table would appear to be growing fairly rapidly.)

Another scalability detail we ought to ask about is the volume of data that we expect to update each time we run this statement. As time passes do we expect to see the same number of rows waiting to be updated, or are we expecting the business (whatever that may be) to grow steadily each month with an increase of a few percent in the number of rows to be updated on each execution. Our coding strategy may vary depending on the answer to that question – we might, for example, try to pre-empt a future problem by introducing some partitioning now.

The final scalablility issue is one I’ve raised already and comes from the CRME_FUND_DTL_STG table. According to the plan there about 8 rows (the number of rowids reporteed by the index range scan in operation 7) in this table for each distinct value of MMR_DTL.MBRSHP_CK; if MMR_DTL is large and growing, is CRME_FUND_DTL_STG very large and growing at 8 times the speed, or worse – as time passes will there be more rows for each distinct value of MMR_DTL.MBRSHP_CK.  Answers to these questions will help us decide whether we should use a hash join or a nested loop in the join to this table, and how to index the table to minimise random I/O.

Update 3 – 1st June

Since we know that the optimizer has come up with some very bad estimates, and given that the statement is short and simple, I’d be happy to pass the statement to the Tuning Advisor to see what suggestions it made. At the least it ought to come up with a suggestion for an SQL profile (i.e a set of opt_estimate() hints) to address the extremely poor cardinality estimates; it’s also likely to make some suggestions about potentially helpful indexes. I can think of three basic warnings, though:

  • Make sure you are licensed to use the advisor
  • Don’t run the advisor right after you’ve just done the big update – wait until the next big batch of data is ready for update
  • Be cautious about the indexes – the indexing suggestions may point you at the problem, but you may do better by using function-based indexes and modified SQL

The most obvious “simple” index to (assuming a lot of time goes on the table scan) is one of:


create index dtl_cpi on mmr_dtl(capitn_prcs_ind) compress;
create index dtl_cpi on mmr_dtl(capitn_prcs_ind, hspc_ind) compress;

If almost all the rows with capitn_prcs_ind = 5 also had hspc_ind = ‘Y’ then I’d probably choose the former, if a large number of rows (actually index rowids) could be eliminated before visiting the table I’d choose the latter. I would also make sure I had a frequency – or Top-N frequency in 12c – histogram on capitn_prcs_ind, and may even “fake” it to ensure that an automatic call to gather table stats didn’t do something that made the histogram a threat instead of a benefit. If I created the two-column index I would also create a frequency histogram on hspc_ind. The benefit of the histograms is based on my assumption that both columns have a very small number of values and a highly skewed distribution. I might go one step further and create a column group, but keeping an eye open for an “out of range” anomaly, on the pair of columns if there was a strong degree of correlation between the two columns (in particular if there were a relatively small number of ‘Y’, but most of the ‘Y’ was also capitn_prcs_ind).

A simple index, though, might get used for other statements where it was not appropriate; there’s also the extra overhead of maintence as the capitn_prcs_ind value changes from 5 to 2 – to update the index we delete the old entry and insert the new entry. Just to make matters a little worse, when an index has a large number of rows for the same key value it’s fairly common to find that its space utilisation averages about 50% per leaf block. On top of everything else, if 2 is a “final state” value for most of the rows in this very large table then a very large fraction of the index we’re building is probably a total waste of space. So we might look at maximising efficiency and minimising space (and undo and redo wastage) with a function-based index.  Again one or two columns as appropriate.


create index dtl_cpi_1 on mmr_dtl(case when capitn_prcs_ind = 5 then 0 end) compress;

create index dtl_cpi_2 on mmr_dtl(case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then 0 end) compress;

execute dbms_stats.gather_table_stats(user, 'mmr_dtl',method_opt=>'for all hidden columns size 1')

Again my choice of index might be affected by the extra benefit of having two columns to eliminate the maximum number of rows from the table, but since the index would be very small either way I’d probably go for the two-column index. Note that once you’ve created a function-based index you have to create stats on the underlying hidden column, which I’ve done a little lazily in this case by simple gathering “for all hidden columns”. I don’t need a histogram since there’s only a single value in the index. Apart from the size benefit, the other efficiency benefit is that when I update the table I only have to worry about deleting a row from the index, I don’t have to insert a replacement row – getting rid of the third undo record and redo vector is likely to save about 30% on the index maintenance background costs.

I now have to change the original SQL to match the index definition, of course, so my driving where clause would become one of:

WHERE
	(case when CAPITN_PRCS_IND = 5 then 0 end) = 0
AND	HSPC_IND ='Y'
AND	EXISTS (...

WHERE
	(case when CAPITN_PRCS_IND = 5 AND HSPC_IND ='Y' then 0 end ) = 0
AND	EXISTS (...

Update 4 – 2nd June 2015

At this point we really do need better information to make sensible decisions about what to do next; however, as I indicated earlier on, the hash semi join to the small table FCTS_TMS_MBRID_XWLK does look like a sensible choice: it’s small so it’s a good target for a build table and small tables joined to big tables tend to be things that are used to eliminate rapidly (at least, in the case of existence tests).

The final big question is what to do about the table CRME_FUND_DTL_STG which looks as if it might be quite big (several rows for each MMR_DTL row) and therefore be contributing a lot of random I/O. We don’t really have many options here – and we’ve seen what they are in a previous post about “NOT EXISTS” subqueries. We could try to make the nested loop semi-join more efficient by adding columns to the index so that we can do more filtering in the index; we could consider forcing Oracle to stick with a filter subquery and see if we can introduce some indexing that could make the subquery run a very small number of times, or we could consider the impact of switching to a Hash semi-join (possibly switch the build and probe data sets to make it a “hash join right semi”).

A full tablescan on the CRME_FUND_DTL_STG to do the hash join might be rather expensive, though, so we might look at the significance of the predicate on STG_CRME.FUND_ID which lists 8 different literal values. Perhaps there’s some scope for creating an index on that column if those fund_ids represent a small fraction of the total data; maybe even consider list-partitioning the table (not necessarily one fund per partition) on the FUND_ID.

The only other significant “pre-processing” predicate on the CRME_FUND_DTL_STG table is the one that requires the MMR_DTL.PMT_MSA_STRT_DT to be between STG_CRME.ERN_FROM_DT and the STG_CRME.ERN_THRU_DT; so we might look at the how many rows in the table could be eliminated by finding the maximum and minimum values for MMR_DTL.PMT_MSA_STRT_DT and eliminating any rows where STG_CRME.ERN_FROM_DT was greater than the maximum PMT_MSA_START_DT or the STG_CRME.ERN_THRU_DT was less than the minimum PMT_MSA_START_DT.

We might play around with CTEs (subquery factoring / common table expressions) to make this happen – I’m not going to work the exact SQL, but it might start something like:


with starting_view as (
       select from mmr_detail
       where  ...
       and exists (
                select from FACETS_CUSTOM.FCTS_TMS_MBRID_XWLK XWLK
                where ...
       )
),
minmax_view as (
        select  /*+ materialize */
                min(PMT_MSA_STRT_DT), 
                max(PMT_MSA_STRT_DT) max_date
        from    starting_view
),
crme_fund_view as (
        select
        from
                minmax_view,
                CRME_FUND_DTL_STG
        where   ...
)
select
from
        starting_view   sv,
        crme_fund_view  cv
where
        ...

An alternative strategy to get the minimum and maximum dates more efficiently might be to include them in the function-based index, and then include TWO inline views, or maybe scalar subqueries in the where clause, that depend on the min/max range scan to find the values that can be used to eliminate as much data from CRME_FUND_DTL_STG as early as possible.

Footnote

I think it took me about 10 minutes looking at the original posting to come up with a few thoughts that might be appropriate – but it’s taken five sessions spread over nearly two week to write those ideas down. The same difference can appear between theoretical strategy and final action – everything I’ve said revolves around the two simple ideas of “how much data” (absolute and relative) and “how hard to we work to collect it” so it’s easy to come up with ideas; but it may take time to learn what the data looks like, and finding a SAFE way of making it possible to collect it efficiently.

May 18, 2015

Migrated rows

Filed under: Infrastructure,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 6:43 pm BST May 18,2015

I received an email recently describing a problem with a query which was running a full tablescan but: “almost all the waits are on ‘db file sequential read’ and the disk read is 10 times the table blocks”.  Some further information supplied was that the tablespace was using ASSM and 16KB block size; the table had 272 columns (ouch!) and the Oracle version was 11.2.0.4.

In his researches he had read my article on wide rows, and had picked out of one of the comments the line: “the very bad thing about chained rows and direct reads that is that finding the rest of row by ‘db file sequential read’ is never cached”, but he wasn’t sure that this was the problem he was seeing so, very sensibly, he had re-run the query with extended tracing available, and dumped (and formatted/edited) a couple of blocks from the table.

He then sent me the trace file and block dump. Generally this is a mistake – especially when the trace file is several megabytes – but he had prepared the ground well and had linked it back to one of my blog notes, and I thought there might be an opportunity for publishing a few more comments, so I took a look. Here’s a carefully edited subset of the block dump – showing all the pertinent information:


Start dump data blocks tsn: 99 file#:100 minblk 2513181 maxblk 2513181

Block header dump:  0x1926591d
 Object id on Block? Y
 seg/obj: 0x1652a7  csc: 0x53.891880b8  itc: 12  flg: E  typ: 1 - DATA
     brn: 1  bdba: 0x1965b70c ver: 0x01 opc: 0
     inc: 84  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0010.01d.0000dff9  0x2b442286.3469.09  C---    0  scn 0x0053.891880b1
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x04   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x05   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x06   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x07   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x08   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x09   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x0a   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x0b   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x0c   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
bdba: 0x1926591d

data_block_dump,data header at 0x11083f154
===============
tsiz: 0x1ea8
hsiz: 0x26
pbl: 0x11083f154
     76543210
flag=--------
ntab=1
nrow=10
frre=-1
fsbo=0x26
fseo=0x4c5
avsp=0x49f
tosp=0x49f
0xe:pti[0]	nrow=10	offs=0
0x12:pri[0]	offs=0x1c15
0x14:pri[1]	offs=0x197b
0x16:pri[2]	offs=0x16e1
0x18:pri[3]	offs=0x1448
0x1a:pri[4]	offs=0x11b8
0x1c:pri[5]	offs=0xf1f
0x1e:pri[6]	offs=0xc85
0x20:pri[7]	offs=0x9ec
0x22:pri[8]	offs=0x752
0x24:pri[9]	offs=0x4c5
block_row_dump:
tab 0, row 0, @0x1c15
tl: 659 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 1, @0x197b
tl: 666 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 2, @0x16e1
tl: 666 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 3, @0x1448
tl: 665 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 4, @0x11b8
tl: 656 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 5, @0xf1f
tl: 665 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 7, @0x9ec
tl: 665 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 8, @0x752
tl: 666 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 9, @0x4c5
tl: 653 fb: -----L-- lb: 0x0  cc: 255

In the ITL you can see 10 entries with the flag set to “C—-” (committed) with no XID or SCN – that’s consistent with 10 rows migrating into the block in a single transaction. In the row directory you can see the block holds 10 rows, and in the body of the block you can see the header for each of those 10 rows with 255 columns (presumably the 2nd section of each row of 272 columns), and the flag bytes set to “—–L–” (the Last piece of a chained – as opposed to simply migrated – row).

So the block dump is consistent with the possiblity of a direct path read of a block somewhere (10 head pieces) having to read this block 10 times shortly afterwards. Can we find further corroboration in the trace file? The blockdump was for block 0x1926591d = 421943581 decimal


PARSE #4573135368:c=29,e=48,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4116693033,tim=79008343283418
EXEC #4573135368:c=53,e=93,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4116693033,tim=79008343283607
WAIT #4573135368: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=15477650 tim=79008343283636
WAIT #4573135368: nam='Disk file operations I/O' ela= 38 FileOperation=2 fileno=101 filetype=2 obj#=1462951 tim=79008343283973
WAIT #4573135368: nam='direct path read' ela= 8991 file number=100 first dba=947580 block cnt=13 obj#=1462951 tim=79008343293041

WAIT #4573135368: nam='db file sequential read' ela= 4934 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298032
WAIT #4573135368: nam='db file sequential read' ela= 155 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298216
WAIT #4573135368: nam='db file sequential read' ela= 127 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298378
WAIT #4573135368: nam='db file sequential read' ela= 125 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298526
WAIT #4573135368: nam='db file sequential read' ela= 128 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298677
WAIT #4573135368: nam='db file sequential read' ela= 123 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298826
WAIT #4573135368: nam='db file sequential read' ela= 134 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298983
WAIT #4573135368: nam='db file sequential read' ela= 129 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343299135
WAIT #4573135368: nam='db file sequential read' ela= 180 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343299341
WAIT #4573135368: nam='db file sequential read' ela= 133 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343299497

WAIT #4573135368: nam='db file sequential read' ela= 11039 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343310565
WAIT #4573135368: nam='db file sequential read' ela= 133 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343310730
WAIT #4573135368: nam='db file sequential read' ela= 139 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343310895
WAIT #4573135368: nam='db file sequential read' ela= 124 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311045
WAIT #4573135368: nam='db file sequential read' ela= 122 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311190
WAIT #4573135368: nam='db file sequential read' ela= 127 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311339
WAIT #4573135368: nam='db file sequential read' ela= 125 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311490
WAIT #4573135368: nam='db file sequential read' ela= 134 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311647
WAIT #4573135368: nam='db file sequential read' ela= 128 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311797
WAIT #4573135368: nam='db file sequential read' ela= 124 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311947

WAIT #4573135368: nam='db file sequential read' ela= 10592 file#=100 block#=2513309 blocks=1 obj#=1462951 tim=79008343322564
WAIT #4573135368: nam='db file sequential read' ela= 142 file#=100 block#=2513309 blocks=1 obj#=1462951 tim=79008343322740
WAIT #4573135368: nam='db file sequential read' ela= 126 file#=100 block#=2513309 blocks=1 obj#=1462951 tim=79008343322889

There are a couple of interesting details in this trace file.

First we note (as the OP said) there are very few direct path reads – but direct path reads can be asynchronous with several running concurrently, which means that we may report one direct path read while the data returned from others records no time. (You’ll have to take my word for the sparseness of direct path reads – there were 5 reading a total of 58 blocks from the object, compared to 50,000 db file sequential reads)

Then you can see that although each block that was subject to “db file sequential read” is reported 10 times, the first read is much slower than the subsequent ones – a fairly good indication that the later reads are coming from a cache somewhere. (The 50,00 reads consisted of roughly 5,300 blocks being read 10 times, 1,400 blocks being read 9 times, 460 blocks being read 8 times, and a few blocks being read 7 or fewer times.)

You might also notice that the “coincidental” jump of 64 blocks between the sets of 10 reads – this appears fairly frequently, and it’s the type of pattern you might expect to see when a serial process is allocating blocks for use in a clean ASSM tablespace after the extent sizes have become fairly large (possibly the 64MB size that eventually appears with system managed extent sizes). There’s a “pseudo-random” choice of block within extent dicated by the process id, that spreads the work done by a single process steadily through the extent. Having filled 2513181, 2513245, 2513309 and so on for 16 steps the trace file comes back to 2513182, 2513246, 2513309 and so on.

It’s interesting (and time-consuming) to check the patterns but what we really need next, and don’t have, to check the theory is the set of 13 blocks dictated by the first direct path read:

WAIT #4573135368: nam='direct path read' ela= 8991 file number=100 first dba=947580 block cnt=13 obj#=1462951 tim=79008343293041

It’s likely that somewhere in the 13 blocks in the range 947580 onwards we would find the 10 row head pieces pointing to block 2513181; then the 10 row head pieces pointing to block 2513245, and so on – and I’d hope that we might see a pattern of many consecutive (or near-consecutive) rows in each originating block pointing to the same “next block”. In fact, with a few blocks in the early range, we might even get some idea of how the application was loading and updating data and be able to make some suggestions for changing the strategy to avoid row chaining.

Footnote

The OP also had a follow-up question which was: “One question for the block dump is why there is no hrid in it since the row pieces are the second row pieces and the flag bit is ‘—–L–‘?”  It would be nice to see this, of course – then we wouldn’t need to see the 947580-947592 range to see what had been happening to the data – but that’s not the way Oracle works, as I’ve pointed out above; but since the answer was in another posting of mine I simply emailed the relevant URL to the OP.

April 29, 2015

Not Exists

Filed under: Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 8:21 pm BST Apr 29,2015

This whole thing about “not exists” subqueries can run and run. In the previous episode I walked through some ideas of how the following query might perform depending on the data, the indexes, and the transformation that the optimizer might apply:

select
        count(*)
from    t1 w1
where   not exists (
                select  1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);  

As another participant in the original OTN thread had suggested, however, it might be possible to find a completely different way of writing the query, avoiding the subquery approach completely. In particular there are (probably) several ways that we could write an equivalent query where the table only appears once. In other words, if we restate the requirement we might be able to find a different SQL translation for that requirement.

Looking at the current SQL, it looks like the requirement is: “Count the number of rows in t1 that have values of X that only have one associated value of Y”.

Based on this requirement, the following SQL statements (supplied by two different people) look promising:


    WITH counts AS
       (SELECT x,y,count(*) xy_count
        FROM   t1
        GROUP BY x,y)
    SELECT SUM(x_count)
    FROM  (SELECT x, SUM(xy_count) x_count
           FROM   counts
           GROUP BY x
           HAVING COUNT(*) = 1);


SELECT SUM(COUNT(*))
  FROM t1
GROUP BY x HAVING COUNT(DISTINCT y)<=1

Logically they do seem to address the description of the problem – but there’s a critical difference between these statements and the original. The clue about the difference appears in the absence of any comparisons between columns in the new forms of the query, no t1.colX = t2.colX, no t1.colY != t2.colY, and this might give us an idea about how to test the code. Here’s some test data:


drop table t1 purge;

create table t1 (
        x       number(2,0),
        y       varchar2(10)
);

create index t1_i1 on t1(x,y);

--      Pick one of the three following pairs of rows

insert into t1(x,y) values(1,'a');
insert into t1(x,y) values(1,null);

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'b');

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'a');

commit;

--      A pair to be skipped

insert into t1(x,y) values(2,'c');
insert into t1(x,y) values(2,'c');

--      A pair to be reported

insert into t1(x,y) values(3,'d');
insert into t1(x,y) values(3,'e');

commit;

execute dbms_stats.gather_table_stats(user,'t1')

Notice the NULLs – comparisons with NULL lead to rows disappearing, so might the new forms of the query get different results from the old ?
The original query returns a count of 4 rows whichever pair we select from the top 6 inserts.

With the NULL in the Y column the new forms report 2 and 4 rows respectively – so only the second query looks viable.
With the NULLs in the X columns and differing Y columns the new forms report 2 and 2 rows respectively – so even the second query is broken.

However, if we add “or X is null” to the second query it reports 4 rows for both tests.
Finally, having added the “or x is null” predicate, we check that it returns the correct 4 rows for the final test pair – and it does.

It looks as if there is at least one solution to the problem that need only access the table once, though it then does two aggregates (hash group by in 11g). Depending on the data it’s quite likely that this single scan and double hash aggregation will be more efficient than any of the plans that do a scan and filter subquery or scan and hash anti-join. On the other hand the difference in performance might be small, and the ease of comprehension is just a little harder.

Footnote:

I can’t help thinking that the “real” requirement is probably as given in the textual restatement of the problem, and that the first rewrite of the query is probably the one that’s producing the “right” answers while the original query is probably producing the “wrong” answer.

April 26, 2015

Not Exists

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 7:17 pm BST Apr 26,2015

Another question on a seemingly simple “not exists” query has appeared on OTN just a few days after my last post about the construct. There are two little differences between the actual form of the two queries that make it worth repeating the analysis.

The first query was of the form:


select from big_table
where  not exists (select exact_matching_row from small table);

while the new query is of the form:


select from big_table alias1
where not exists (select inexact_matching_row from big_table alias2)

In the absence of a complete explanation, we might guess that the intention of the first query is to: “check correctness of undeclared foreign key constraint” i.e. small_table is the parent table with unique values, and big_table is the child end with some data that may be invalid due to a recent bulk data load. (In my example for the previous posting I relaxed the uniqueness assumption in small_table to make the problem a little more expensive.)

Our guess for the second query ought to be different; we are using the same table twice, and we are checking for the non-existence of “imperfect matches”. This introduces two potential threats – first that the (possibly pseudo-)join between the two tables is between two large tables and therefore inherently likely to be expensive; second that we may be allowed to have very large numbers of “perfect matches” that will escalate the scale of the join quite dramatically. Here’s the actual query (second version) from the posting; this will make it easier to explain why the structure of the query introduces the second threat and requires us (as so often) to understand the data in order to optimise the query execution:


select  count(*)
from    ubl_stg.wk_sap_fat w1
where   not exists (
                select  1
                from    ubl_stg.wk_sap_fat w2
                where   w2.mes_mese_id =  w1.mes_mese_id
                and     w2.sistema     <> w1.sistema
        )
;

Note the schema name ubl_stg – doesn’t that hint at “staging tables” for a data load.
Note the column name mes_mese_id – an “id” column, but clearly not one that’s supposed to be unique, so possibly a foreign key column that allows repetitions

The code is looking for, then counting, cases where for a given value of mes_mese_id there is only one corresponding value used for sistema. Since we don’t know the application we don’t know whether this count should be low or high relative to the number of rows in the table, nor do we know how many distinct values there might be of mes_mese_id – and these pieces of information are critical to identifying the best execution plan for the query.

The owner of this query told us that the table held 2 million rows which are “deleted and reloaded” every day, showed us that the (default) execution plan was a hash anti-join which took two hours to complete, told us that he (or she) didn’t think that an index would help because of the “not equal” predicate, and finally reported back that the problem was solved (no timing information supplied) by the addition of the /*+ no_unnest */ hint.

The question is – what has happened and why is there such a difference in performance ?

The first observation is that the 2 hours does seem an unreasonably long time for the query to run – and since it’s only a select statement it would be good to run it again and take a snapshot of the session statistics and session events (v$sesstat, v$session_event) for the session to see what the workload was and where the time was spent. Given the “deleted and reloaded” comment it’s possible that there may be some unexpected overhead due to some strange effects of read-consistency or delayed block cleanout, so we might also take a snapshot of tablespace or file I/O to check for lots of I/O on the undo tablespace (which might also show an I/O problem on the table’s tablespace or the user’s TEMP tablespace). The snapshots give us a very cheap, non-invasive option for getting some summary stats – but the tablespace/file stats are system-wide, of course, so may not tell us anything about our specific task, so we might even enable extended tracing (event 10046 level 8 / dbms_monitor with waits) to see in detail where we are losing time.

Since we don’t currently have the information we need to explain the two hours (it may have appeared by the time I post this note) it might be instructive to make some guesses about where the time could go in the hash anti-join, and why the /*+ no_unnest */ hint could make a difference.

Ignoring the possibility of strange undo/read-consistency/cleanout effects the first possiblity is simply that the hash join is large and turns into a multi-pass I/O thrash. The mes_mese_id column looks like it might be a number (id columns so often are) but the sistema column has the flavour of a reasonably large character column – so maybe our hash table has to be a couple of hundred megabytes – that could certainly be enough to spill to disc, though you’d have to have a really small PGA availability for it to turn into a multi-pass hash join.

Another possibility is that the pattern in the data makes the hash join burn up a huge amount of CPU – that should be easy to see on a re-run.  If there are relatively few distinct sets of values for (mes_mese_id, sistema) and there are very few cases where a mes_mese_id is associated with more than one sistema, then a large fraction of the hash table probes would have to follow a very long chain of matches to the very end, and that would take a large amount of CPU.

Pursue that “long chain” hypothesis to a slight extreme – what if there’s one mes_mese_id that appears in 250,000 of the 2M rows, and the sistema value is the same for every one of those quarter million rows,  which would require Oracle to walk a chain of 250,000 elements 250,000 times, for a total of 62.5 billion pointers to follow and comparisions to make – how much CPU might that take.  Worse still,since having the same mes_mese_id (the hashing is only on the equality predicate) means the quarter of a million rows would have to go into the same hash bucket so we might end up doing a multipass operation because that one bucket was very much larger than anything Oracle had anticipated when it did its internal partitioning calculations for the hash table. (There are some related notes in this article and in chapter 12 of Cost Based Oracle – Fundamentals)

Why would the /*+ unnest */ hint address these performance problems ? My first guess would be that the OP may have created the index on (mes_mese_id, sistema), in which case a full scan of that index (or even a fast-full scan if the index were newly created, or even a tablescan if the data had been loaded in the right order) followed by the filter subquery being driven by an index range scan would result in a relatively efficient subquery being executed once per distinct value of (mes_mese_id, sistema) rather than once per row. This could be much more efficient than a really badly skewed data set doing the hash anti-join. In fact, even in the absence of the index, if the number of distinct combinations was really quite small, that many tablescans – if the table were cached, whether in Oracle or the filesystem or the SAN cache – might still be a lot faster than the hash anti-join. (We were told that the problem was solved – but not told how much time constituted a viable solution.)

Models

Hand-waving and talk is fine – but a few modelled results might make it easier to comprehend, so here’s a way of generating a few versions of data sets and testing:


drop table t1 purge;

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                          id,
/*
        --
        --      Random generation nearly guaranteeing no random
        --      duplicates of (x,y) but quite a lot of mismatches
        --      Count 735,715 in 3.36 seconds.
        --
        trunc(dbms_random.value(0,2e6))                 x,
        lpad(trunc(dbms_random.value(0,1e6)),64,'0')    y
*/
/*
        --
        --      One specific pair repeated many times with no mismatch the rest
        --      as previously generated above. Times with different repeat counts:
        --       10,000            14 seconds
        --       20,000            52 seconds
        --       40,000           202 seconds
        --      CPU time quadruples as the count doubles (twice the number of
        --      probes walking twice the number of steps in the hash chain)
        --       80,000 =>        800 seconds
        --      160,000 =>      3,200 seconds
        --      240,000 =>      ca. 2 hours.
*/
        case
                when rownum <= 40000
                        then 2e6
                        else trunc(dbms_random.value(0,2e6))
        end                                             x,
        case
                when rownum <= 40000
                        then
                                lpad(0,64,'0')
                        else
                                lpad(
                                        trunc((rownum - 1)/1000) +
                                                case mod(rownum-1,1000) when 0 then 0 else 0 end,
                                        64,'0'
                                )
        end                                             y
/*
        --
        --      2,000 distinct values repeated 1,000 times each
        --      Query result: 2,000,000 in 235 seconds.
        --
        trunc((rownum - 1)/1000)                x,
        lpad(
                trunc((rownum - 1)/1000) +
                        case mod(rownum-1,1000) when 0 then 0 else 0 end,
                64,'0'
        )                                       y
*/
from
        generator       v1,
        generator       v2
where
        rownum <= 2e6
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        count(*)
from    t1 w1
where   not exists (
                select  1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);  

As you can see I’ve changed the table and column names – this keeps them in line with the original SQL statement presented on OTN before we got the graphic display of a similar statement and plan. The SQL to create the data includes three variants and 5 sets of results (and 3 conjectures) running 11.2.0.4. These results appeared when the optimizer took the hash anti-join execution plan, and also spilled to disc with a one-pass workarea operation that first extended to about 200MB of PGA then dropped back to about 8MB.

To summarise the results recorded in the SQL – if we use the term “bad data” to describe rows where more than one Y value appears for a given X value then:

  1. With a large number of distinct pairs and a lot of bad data: the anti-join is pretty fast at 3.36 seconds.
  2. With no bad data, a small number of distinct pairs (2,000) and lots of rows per pair (1,000): the anti-join takes 235 CPU seconds
  3. As for #1 above, but with one extreme “good” pair that appears a large number of times: CPU time is proportional to the square of the number of duplicates of this value

I didn’t actually test beyond 40,000 duplicates for the last case, but you can see the double/quadruple pattern very clearly and the CPU time would have hit 2 hours at around 240,000 identical copies of one (x,y) pair.

/*+ no_unnest */

So what happens if you try using the /*+ no_unnest */ hint ? The target here is that the more repetitive the data the smaller the number of times you may have to run the subquery; and if you can get the driving data ordered you can guarantee the smallest possible number of runs of the subquery. I haven’t worked through all the possibilities, but to give you the flavour of what can happen, when I added the /*+ no_unnest */ hint to the query (and ensured that the table would be cached rather than read using direct path reads into the PGA) the execution time for the test with 1,000 copies of 2,000 pairs took 181 seconds to do 2,001 tablescans (compared with 235 seconds to do 2 tablescans and a hash anti-join) with the following execution path:


----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    69 |  8256K  (4)| 11:28:04 |
|   1 |  SORT AGGREGATE     |      |     1 |    69 |            |          |
|*  2 |   FILTER            |      |       |       |            |          |
|   3 |    TABLE ACCESS FULL| T1   |  2000K|   131M|  2877   (4)| 00:00:15 |
|*  4 |    TABLE ACCESS FULL| T1   |     2 |   138 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST NO_INDEX ("W2") */ 0
              FROM "T1" "W2" WHERE "W2"."X"=:B1 AND "W2"."Y"<>:B2))
   4 - filter("W2"."X"=:B1 AND "W2"."Y"<>:B2)

More significantly, when I created the optimum index the execution time dropped to 0.9 seconds – here’s the create index statement and subsequent plan – the extreme benefit appears because the data was effectively loaded in sorted order; if this had not been the case I would have forced an index full scan for the driving data set (with a “not null” predicate or constraint to make it possible for the optimizer to use the index to drive the query)


create index t1_i1 on t1(x,y) compress pctfree 0;

-----------------------------------------------------------------------------
| Id  | Operation           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |       |     1 |    69 |  6008K  (1)| 08:20:45 |
|   1 |  SORT AGGREGATE     |       |     1 |    69 |            |          |
|*  2 |   FILTER            |       |       |       |            |          |
|   3 |    TABLE ACCESS FULL| T1    |  2000K|   131M|  2877   (4)| 00:00:15 |
|*  4 |    INDEX RANGE SCAN | T1_I1 |     2 |   138 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T1" "W2"
              WHERE "W2"."X"=:B1 AND "W2"."Y"<>:B2))
   4 - access("W2"."X"=:B1)
       filter("W2"."Y"<>:B1)

Similarly, when I had the index in place and the 40,000 repetitions of a “good pair”, Oracle took a total of 9 seconds even though it had to run the subquery 1,960,000 times for the non-repetitive data (and once for the repetitive_pair). I have to say I was a little surprised at how rapidly it managed to get through that 2M subquery executions – but then I keep forgetting how ridiculously overpowered my new laptop is.

With these figures in mind you can appreciate that if the OP had lots of pairs with tens of thousands of repetitions, then even without creating the index on the table, the query time might drop from 2 hours for the hash anti-join to “a few minutes” for the filter subquery with a time that was good enough to look like “problem solved”.

Summary

If you have a few “long hash chains” in the build table – i.e. rows from the “first” table in the join that hash to the same value – then the amount of work Oracle has to do to check a single row from the probe (“second”) table that matches on the hash value can become significant. If a large number of rows from the probe table hit a long hash chain then the CPU time for the whole join can climb dramatically and you may want to force Oracle away from the hash join.

If the the long chains are the result of a skewed distribution where a small number of values appear very frequently in a table with a large number of distinct values that each appears infrequently then the optimizer may not notice the threat and may choose the hash plan when there is a much less resource-intensive alternative available.

Footnote:

There was another interesting observations I made while doing the experiments relating to whether the chains are due to hash collisions or require exact matches in the data – but I’ve spent an hour on desgining and running tests, and nearly 4 hours writing up the results so far. I need to do a few more tests to work out whether I’m seeing a very clever optimisation or a lucky coincidence in a certain scenario – so I’m going to save that for another day.

 

 

March 27, 2015

ANSI expansion

Filed under: ANSI Standard,Oracle,Troubleshooting — Jonathan Lewis @ 10:46 am BST Mar 27,2015

Here’s a quirky little bug that appeared on the OTN database forum in the last 24 hours which (in 12c, at least) produces an issue which I can best demonstrate with the following cut-n-paste:


SQL> desc purple
 Name                                Null?    Type
 ----------------------------------- -------- ------------------------
 G_COLUMN_001                        NOT NULL NUMBER(9)
 P_COLUMN_002                                 VARCHAR2(2)

SQL> select p.*
  2  from GREEN g
  3    join RED r on g.G_COLUMN_001 = r.G_COLUMN_001
  4    join PURPLE p on g.G_COLUMN_001 = p.G_COLUMN_001;
  join PURPLE p on g.G_COLUMN_001 = p.G_COLUMN_001
       *
ERROR at line 4:
ORA-01792: maximum number of columns in a table or view is 1000

SQL> select p.g_column_001, p.p_column_002
  2  from GREEN g
  3    join RED r on g.G_COLUMN_001 = r.G_COLUMN_001
  4    join PURPLE p on g.G_COLUMN_001 = p.G_COLUMN_001;

no rows selected

A query that requires “star-expansion” fails with ORA-01792, but if you explicitly expand the ‘p.*’ to list all the columns it represents the optimizer is happy. (The posting also showed the same difference in behaviour when changing “select constant from  {table join}” to “select (select constant from dual) from {table join}”)

The person who highlighted the problem supplied code to generate the tables so you can repeat the tests very easily; one of the quick checks I did was to modify the code to produce tables with a much smaller number of columns and then expanded the SQL to see what Oracle would have done with the ANSI. So, with only 3 columns each in table RED and GREEN, this is what I did:

set serveroutput on
set long 20000

variable m_sql_out clob

declare
    m_sql_in    clob :=
                        '
                        select p.*
                        from GREEN g
                        join RED r on g.G_COLUMN_001 = r.G_COLUMN_001
                        join PURPLE p on g.G_COLUMN_001 = p.G_COLUMN_001
                        ';
begin

    dbms_utility.expand_sql_text(
        m_sql_in,
        :m_sql_out
    );

end;
/

column m_sql_out wrap word
print m_sql_out

The dbms_utility.expand_sql_text() function is new to 12c, and you’ll need the execute privilege on the dbms_utility package to use it; but if you want to take advantage of it in 11g you can also find it (undocumented) in a package called dbms_sql2.

Here’s the result of the expansion (you can see why I reduced the column count to 3):


M_SQL_OUT
--------------------------------------------------------------------------------
SELECT "A1"."G_COLUMN_001_6" "G_COLUMN_001","A1"."P_COLUMN_002_7" "P_COLUMN_002"
FROM  (SELECT "A3"."G_COLUMN_001_0" "G_COLUMN_001","A3"."G_COLUMN_002_1"
"G_COLUMN_002","A3"."G_COLUMN_003_2" "G_COLUMN_003","A3"."G_COLUMN_001_3"
"G_COLUMN_001","A3"."R_COLUMN__002_4" "R_COLUMN__002","A3"."R_COLUMN__003_5"
"R_COLUMN__003","A2"."G_COLUMN_001" "G_COLUMN_001_6","A2"."P_COLUMN_002"
"P_COLUMN_002_7" FROM  (SELECT "A5"."G_COLUMN_001"
"G_COLUMN_001_0","A5"."G_COLUMN_002" "G_COLUMN_002_1","A5"."G_COLUMN_003"
"G_COLUMN_003_2","A4"."G_COLUMN_001" "G_COLUMN_001_3","A4"."R_COLUMN__002"
"R_COLUMN__002_4","A4"."R_COLUMN__003" "R_COLUMN__003_5" FROM
"TEST_USER"."GREEN" "A5","TEST_USER"."RED" "A4" WHERE
"A5"."G_COLUMN_001"="A4"."G_COLUMN_001") "A3","TEST_USER"."PURPLE" "A2" WHERE
"A3"."G_COLUMN_001_0"="A2"."G_COLUMN_001") "A1"

Tidying this up:


SELECT
        A1.G_COLUMN_001_6 G_COLUMN_001,
        A1.P_COLUMN_002_7 P_COLUMN_002
FROM    (
        SELECT
                A3.G_COLUMN_001_0 G_COLUMN_001,
                A3.G_COLUMN_002_1 G_COLUMN_002,
                A3.G_COLUMN_003_2 G_COLUMN_003,
                A3.G_COLUMN_001_3 G_COLUMN_001,
                A3.R_COLUMN__002_4 R_COLUMN__002,
                A3.R_COLUMN__003_5 R_COLUMN__003,
                A2.G_COLUMN_001 G_COLUMN_001_6,
                A2.P_COLUMN_002 P_COLUMN_002_7
        FROM    (
                SELECT
                        A5.G_COLUMN_001 G_COLUMN_001_0,
                        A5.G_COLUMN_002 G_COLUMN_002_1,
                        A5.G_COLUMN_003 G_COLUMN_003_2,
                        A4.G_COLUMN_001 G_COLUMN_001_3,
                        A4.R_COLUMN__002 R_COLUMN__002_4,
                        A4.R_COLUMN__003 R_COLUMN__003_5
                FROM
                        TEST_USER.GREEN A5,
                        TEST_USER.RED A4
                WHERE
                        A5.G_COLUMN_001=A4.G_COLUMN_001
                ) A3,
                TEST_USER.PURPLE A2
        WHERE
                A3.G_COLUMN_001_0=A2.G_COLUMN_001
        ) A1

As you can now see, the A1 alias lists all the columns in GREEN, plus all the columns in RED, plus all the columns in PURPLE – totalling 3 + 3 + 2 = 8. (There is a little pattern of aliasing and re-aliasing that turns the join column RED.g_column_001 into G_COLUMN_001_3, making it look at first glance as if it has come from the GREEN table).

You can run a few more checks, increasing the number of columns in the RED and GREEN tables, but essentially when the total number of columns in those two tables goes over 998 then adding the two extra columns from PURPLE makes that intermediate inline view break the 1,000 column rule.

Here’s the equivalent expanded SQL if you identify the columns explicitly in the select list (even with several hundred columns in the RED and GREEN tables):


SELECT
        A1.G_COLUMN_001_2 G_COLUMN_001,
        A1.P_COLUMN_002_3 P_COLUMN_002
FROM    (
        SELECT
                A3.G_COLUMN_001_0 G_COLUMN_001,
                A3.G_COLUMN_001_1 G_COLUMN_001,
                A2.G_COLUMN_001 G_COLUMN_001_2,
                A2.P_COLUMN_002 P_COLUMN_002_3
        FROM    (
                SELECT
                        A5.G_COLUMN_001 G_COLUMN_001_0,
                        A4.G_COLUMN_001 G_COLUMN_001_1
                FROM
                        TEST_USER.GREEN A5,
                        TEST_USER.RED A4
                WHERE
                        A5.G_COLUMN_001=A4.G_COLUMN_001
                ) A3,
                TEST_USER.PURPLE A2
        WHERE
                A3.G_COLUMN_001_0=A2.G_COLUMN_001
        ) A1

As you can see, the critical inline view now holds only the original join columns and the columns required for the select list.

If you’re wondering whether this difference in expansion could affect execution plans, it doesn’t seem to; the 10053 trace file includes the following (cosmetically altered) output:


Final query after transformations:******* UNPARSED QUERY IS *******
SELECT
        P.G_COLUMN_001 G_COLUMN_001,
        P.P_COLUMN_002 P_COLUMN_002
FROM
        TEST_USER.GREEN   G,
        TEST_USER.RED     R,
        TEST_USER.PURPLE  P
WHERE
        G.G_COLUMN_001=P.G_COLUMN_001
AND     G.G_COLUMN_001=R.G_COLUMN_001

So it looks as if the routine to transform the syntax puts in a lot of redundant text, then the optimizer takes it all out again.

The problem doesn’t exist with traditional Oracle syntax, by the way, it’s an artefact of Oracle’s expansion of the ANSI syntax, and 11.2.0.4 is quite happy to handle the text generated by the ANSI transformation when there are well over 1,000 columns in the inline view.

ASH

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 9:41 am BST Mar 27,2015

There was a little conversation on Oracle-L about ASH (active session history) recently which I thought worth highlighting – partly because it raised a detail that I had got wrong until Tim Gorman corrected me a few years ago.

Once every second the dynamic performance view v$active_session_history copies information about active sessions from v$session. (There are a couple of exceptions to the this rule – for example if a session has called dbms_lock.sleep() it will appear in v$session as state = ‘ACTIVE’, but it will not be recorded in v$active_session_history.) Each of these snapshots is referred to as a “sample” and may hold zero, one, or many rows.

The rows collected in every tenth sample are flagged for copying into the AWR where, once they’ve been copied into the underlying table, they can be seen in the view dba_hist_active_sess_history.  This is where a common misunderstanding occurs: it is not every 10th row in v$active_session_history it’s every 10th second; and if a sample happens to be empty that’s still the sample that is selected (which means there will be a gap in the output from dba_hist_active_sess_history). In effect dba_hist_active_sess_history holds copies of the information you’d get from v$session if you sampled it once every 10 seconds instead of once per second.

It’s possible to corroborate this through a fairly simple query as the rows from v$active_session_history that are going to be dumped to the AWR are as they are created:


select
        distinct case is_awr_sample when 'Y' then 'Y' end flag,
        sample_id,
        sample_time
from
        v$active_session_history
where
        sample_time > sysdate - 1/1440
order by
        2,1
;

F  SAMPLE_ID SAMPLE_TIME
- ---------- --------------------------------
     3435324 26-MAR-15 05.52.53.562 PM
     3435325 26-MAR-15 05.52.54.562 PM
     3435326 26-MAR-15 05.52.55.562 PM
     3435327 26-MAR-15 05.52.56.562 PM
     3435328 26-MAR-15 05.52.57.562 PM
     3435329 26-MAR-15 05.52.58.562 PM
     3435330 26-MAR-15 05.52.59.562 PM
     3435331 26-MAR-15 05.53.00.562 PM
Y    3435332 26-MAR-15 05.53.01.562 PM
     3435333 26-MAR-15 05.53.02.572 PM
     3435334 26-MAR-15 05.53.03.572 PM
     3435335 26-MAR-15 05.53.04.572 PM
     3435336 26-MAR-15 05.53.05.572 PM
     3435337 26-MAR-15 05.53.06.572 PM
     3435338 26-MAR-15 05.53.07.572 PM
     3435339 26-MAR-15 05.53.08.572 PM
     3435340 26-MAR-15 05.53.09.572 PM
     3435341 26-MAR-15 05.53.10.582 PM
Y    3435342 26-MAR-15 05.53.11.582 PM
     3435343 26-MAR-15 05.53.12.582 PM
     3435344 26-MAR-15 05.53.13.582 PM
     3435345 26-MAR-15 05.53.14.582 PM
     3435346 26-MAR-15 05.53.15.582 PM
     3435347 26-MAR-15 05.53.16.582 PM
     3435348 26-MAR-15 05.53.17.582 PM
     3435349 26-MAR-15 05.53.18.592 PM
     3435350 26-MAR-15 05.53.19.592 PM
     3435351 26-MAR-15 05.53.20.592 PM
Y    3435352 26-MAR-15 05.53.21.602 PM
     3435355 26-MAR-15 05.53.24.602 PM
     3435358 26-MAR-15 05.53.27.612 PM
     3435361 26-MAR-15 05.53.30.622 PM
     3435367 26-MAR-15 05.53.36.660 PM
     3435370 26-MAR-15 05.53.39.670 PM
     3435371 26-MAR-15 05.53.40.670 PM
     3435373 26-MAR-15 05.53.42.670 PM
     3435380 26-MAR-15 05.53.49.700 PM
     3435381 26-MAR-15 05.53.50.700 PM
Y    3435382 26-MAR-15 05.53.51.700 PM
     3435383 26-MAR-15 05.53.52.700 PM

40 rows selected.

As you can see at the beginning of the output the samples have a sample_time that increases one second at a time (with a little slippage), and the flagged samples appear every 10 seconds at 5.53.01, 5.53.11 and 5.53.21; but then the instance becomes fairly idle and there are several sample taken over the next 20 seconds or so where we don’t capture any active sessions; in particular there are no rows in the samples for 5.53.31, and 5.53.41; but eventually the instance gets a little busy again and we see that we’ve had active sessions in consecutive samples for the last few seconds, and we can see that we’ve flagged the sample at 5.53.51 for dumping into the AWR.

You’ll notice that I seem to be losing about 1/100th second every few seconds; this is probably a side effect of virtualisation and having a little CPU-intensive work going on in the background. If you see periods where the one second gap in v$active_session_history or 10 second gap in dba_hist_active_sess_history has been stretched by several percent you can assume that the CPU was under pressure over that period. The worst case I’ve seen to date reported gaps of 12 to 13 seconds in dba_hist_active_sess_history.  The “one second” algorithm is “one second since the last snapshot was captured” so if the process that’s doing the capture doesn’t get to the top of the runqueue in a timely fashion the snapshots slip a little.

When the AWR snapshot is taken, the flagged rows from v$active_session_history are copied to the relevant AWR table. You can adjust the frequency of sampling for both v$active_session_history, and dba_hist_active_sess_history, of course – there are hidden parameters to control both: _ash_sampling_interval (1,000 milliseconds) and _ash_disk_filter_ratio (10). There’s also a parameter controlling how much memory should be reserved in the shared pool to hold v$active_session_history.: _ash_size (1048618 bytes per session in my case).  The basic target is to keep one hour’s worth of data in memory, but if there’s no pressure for memory you can find that the v$active_session_history holds more than the hour; conversely, if there’s heavy demand for memory and lots of continuously active sessions you may find that Oracle does “emergency flushes” of v$active_session_history between the normal AWR snapshots. I have heard of people temporarily increasing the memory and reducing the interval and ratio – but I haven’t yet felt the need to do it myself.

 

March 11, 2015

Flashback Logging

Filed under: Flashback,Infrastructure,Locks,Oracle,Troubleshooting — Jonathan Lewis @ 3:21 pm BST Mar 11,2015

One of the waits that is specific to ASSM (automatic segment space management) is the “enq: FB – contention” wait. You find that the “FB” enqueue has the following description and wait information when you query v$lock_type, and v$event_name:


SQL> execute print_table('select * from v$lock_type where type = ''FB''')
TYPE                          : FB
NAME                          : Format Block
ID1_TAG                       : tablespace #
ID2_TAG                       : dba
IS_USER                       : NO
DESCRIPTION                   : Ensures that only one process can format data blocks in auto segment space managed tablespaces

SQL> execute print_table('select * from v$event_name where name like ''enq: FB%''')
EVENT#                        : 806
EVENT_ID                      : 1238611814
NAME                          : enq: FB - contention
PARAMETER1                    : name|mode
PARAMETER2                    : tablespace #
PARAMETER3                    : dba
WAIT_CLASS_ID                 : 1893977003
WAIT_CLASS#                   : 0
WAIT_CLASS                    : Other

This tells us that a process will acquire the lock when it wants to format a batch of blocks in a segment in a tablespace using ASSM – and prior experience tells us that this is a batch of 16 consecutive blocks in the current extent of the segment; and when we see a wait for an FB enqueue we can assume that two session have simultaneously tried to format the same new batch of blocks and one of them is waiting for the other to complete the format. In some ways, this wait can be viewed (like the “read by other session” wait) in a positive light – if the second session weren’t waiting for the first session to complete the block format it would have to do the formatting itself, which means the end-user has a reduced response time. On the other hand the set of 16 blocks picked by a session is dependent on its process id, so the second session might have picked a different set of 16 blocks to format, which means in the elapsed time of one format call the segment could have had 32 blocks formatted – this wouldn’t have improved the end-user’s response time, but it would mean that more time would pass before another session had to spend time formatting blocks. Basically, in a highly concurrent system, there’s not a lot you can do about FB waits (unless, of course, you do some clever partitioning of the hot objects).

There is actually one set of circumstances where you can have some control of how much time is spent on the wait, but before I mention it I’d like to point out a couple more details about the event itself. First, the parameter3/id2_tag is a little misleading: you can work out which blocks are being formatted (if you really need to), but the “dba” is NOT a data block address (which you might think if you look at the name and a few values). There is a special case when the FB enqueue is being held while you format the blocks in the 64KB extents that you get from system allocated extents, and there’s probably a special case (which I haven’t bothered to examine) if you create a tablespace with uniform extents that aren’t a multiple of 16 blocks, but in the general case the “dba” consists of two parts – a base “data block address” and a single (hex) digit offset identifying which batch of 16 blocks will be formatted.

For example: a value of 0x01800242 means start at data block address 0x01800240, count forward 2 * 16 blocks then format 16 blocks from that point onwards. Since the last digit can only range from 0x0 to 0xf this means the first 7 (hex) digits of a “dba” can only reference 16 batches of 16 blocks, i.e. 256 blocks. It’s not coincidence (I assume) that a single bitmap space management block can only cover 256 blocks in a segment – the FB enqueue is tied very closely to the bitmap block.

So now it’s time to ask why this discussion of the FB enqueue appears in an article titled “Flashback Logging”. Enable the 10704 trace at level 10, along with the 10046 trace at level 8 and you’ll see. Remember that Oracle may have to log the old version of a block before modifying it and if it’s a block that’s being reused it may contribute to “physical reads for flashback new” – here’s a trace of a “format block” event:


*** 2015-03-10 12:50:35.496
ksucti: init session DID from txn DID:
ksqgtl:
        ksqlkdid: 0001-0023-00000014

*** 2015-03-10 12:50:35.496
*** ksudidTrace: ksqgtl
        ktcmydid(): 0001-0023-00000014
        ksusesdi:   0000-0000-00000000
        ksusetxn:   0001-0023-00000014
ksqgtl: RETURNS 0
WAIT #140627501114184: nam='db file sequential read' ela= 4217 file#=6 block#=736 blocks=1 obj#=192544 tim=1425991835501051
WAIT #140627501114184: nam='db file sequential read' ela= 674 file#=6 block#=737 blocks=1 obj#=192544 tim=1425991835501761
WAIT #140627501114184: nam='db file sequential read' ela= 486 file#=6 block#=738 blocks=1 obj#=192544 tim=1425991835502278
WAIT #140627501114184: nam='db file sequential read' ela= 522 file#=6 block#=739 blocks=1 obj#=192544 tim=1425991835502831
WAIT #140627501114184: nam='db file sequential read' ela= 460 file#=6 block#=740 blocks=1 obj#=192544 tim=1425991835503326
WAIT #140627501114184: nam='db file sequential read' ela= 1148 file#=6 block#=741 blocks=1 obj#=192544 tim=1425991835504506
WAIT #140627501114184: nam='db file sequential read' ela= 443 file#=6 block#=742 blocks=1 obj#=192544 tim=1425991835504990
WAIT #140627501114184: nam='db file sequential read' ela= 455 file#=6 block#=743 blocks=1 obj#=192544 tim=1425991835505477
WAIT #140627501114184: nam='db file sequential read' ela= 449 file#=6 block#=744 blocks=1 obj#=192544 tim=1425991835505985
WAIT #140627501114184: nam='db file sequential read' ela= 591 file#=6 block#=745 blocks=1 obj#=192544 tim=1425991835506615
WAIT #140627501114184: nam='db file sequential read' ela= 449 file#=6 block#=746 blocks=1 obj#=192544 tim=1425991835507157
WAIT #140627501114184: nam='db file sequential read' ela= 489 file#=6 block#=747 blocks=1 obj#=192544 tim=1425991835507684
WAIT #140627501114184: nam='db file sequential read' ela= 375 file#=6 block#=748 blocks=1 obj#=192544 tim=1425991835508101
WAIT #140627501114184: nam='db file sequential read' ela= 463 file#=6 block#=749 blocks=1 obj#=192544 tim=1425991835508619
WAIT #140627501114184: nam='db file sequential read' ela= 685 file#=6 block#=750 blocks=1 obj#=192544 tim=1425991835509400
WAIT #140627501114184: nam='db file sequential read' ela= 407 file#=6 block#=751 blocks=1 obj#=192544 tim=1425991835509841

*** 2015-03-10 12:50:35.509
ksqrcl: FB,16,18002c2
ksqrcl: returns 0

Note: we acquire the lock (ksqgtl), read 16 blocks by “db file sequential read”, write them to the flashback log (buffer), format them in memory, release the lock (ksqrcl). That lock can be held for quite a long time – in this case 13 milliseconds. Fortunately the all the single block reads after the first have been accelerated by O/S prefetching, your timings may vary.

The higher the level of concurrent activity the more likely it is that processes will collide trying to format the same 16 blocks (the lock is exclusive, so the second will request and wait, then find that the blocks are already formatted when it finally get the lock). This brings me to the special case where waits for the FB enqueue waits might have a noticeable impact … if you’re running parallel DML and Oracle decides to use “High Water Mark Brokering”, which means the parallel slaves are inserting data into a single segment instead of each using its own private segment and leaving the query co-ordinator to clean up round the edges afterwards. I think this is most likely to happen if you have a tablespace using fairly large extents and Oracle thinks you’re going to process a relatively small amount of data (e.g. small indexes on large tables) – the trade-off is between collisions between processes and wasted space from the private segments.

January 19, 2015

Bitmap Counts

Filed under: bitmaps,Indexing,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 12:15 pm BST Jan 19,2015

In an earlier (not very serious) post about count(*) I pointed out how the optimizer sometimes does a redundant “bitmap conversion to rowid” when counting. In the basic count(*) example I showed this wasn’t a realistic issue unless you had set cursor_sharing to “force” (or the now-deprecated “similar”). There are, however, some cases where the optimizer can do this in more realistic circumstances and this posting models a scenario I came across a few years ago. The exact execution path has changed over time (i.e. version) but the anomaly persists, even in 12.1.0.2.

First we create a “fact” table and a dimension table, with a bitmap index on the fact table and a corresponding primary key on the dimension table:


create table area_sales (
	area		varchar2(10)	not null,
	dated		date		not null,
	category	number(3)	not null,
	quantity	number(8,0),
	value		number(9,2),
	constraint as_pk primary key (dated, area),
	constraint as_area_ck check (area in ('England','Ireland','Scotland','Wales'))
)
;

insert into area_sales
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	decode(mod(rownum,4),
		0,'England',
		1,'Ireland',
		2,'Scotland',
		3,'Wales'
	),
	sysdate + 0.0001 * rownum,
	mod(rownum-1,300),
	rownum,
	rownum
from
	generator,
	generator
where
	rownum <= 1e6
;

create bitmap index as_bi on area_sales(category) pctfree 0;

create table dim (
	id	number(3) not null,
	padding	varchar2(40)
)
;

alter table dim add constraint dim_pk primary key(id);

insert into dim
select
	distinct category, lpad(category,40,category)
from	area_sales
;

commit;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'AREA_SALES',
		method_opt 	 => 'for all columns size 1',
		cascade		 => true
	);

	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'DIM',
		method_opt 	 => 'for all columns size 1',
		cascade		 => true
	);
end;
/

Now we run few queries and show their execution plans with rowsource execution statistics. First a query to count the number of distinct categories used in the area_sales tables, then a query to list the IDs from the dim table that appear in the area_sales table, then the same query hinted to run efficiently.


set trimspool on
set linesize 156
set pagesize 60
set serveroutput off

alter session set statistics_level = all;

select
	distinct category
from
	area_sales
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

==========================================
select  distinct category from  area_sales
==========================================
---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |    300 |00:00:00.01 |     306 |       |       |          |
|   1 |  HASH UNIQUE                 |       |      1 |    300 |    300 |00:00:00.01 |     306 |  2294K|  2294K| 1403K (0)|
|   2 |   BITMAP INDEX FAST FULL SCAN| AS_BI |      1 |   1000K|    600 |00:00:00.01 |     306 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------

As you can see, Oracle is able to check the number of distinct categories very quickly by scanning the bitmap index and extracting ONLY the key values from each of the 600 index entries that make up the whole index (the E-rows figure effectively reports the number of rowids identified by the index, but Oracle doesn’t evaluate them to answer the query).


=======================================================================
select  /*+   qb_name(main)  */  dim.* from dim where  id in (   select
   /*+     qb_name(subq)    */    distinct category   from
area_sales  )
========================================================================

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |    300 |00:00:10.45 |     341 |       |       |          |
|*  1 |  HASH JOIN SEMI               |       |      1 |    300 |    300 |00:00:10.45 |     341 |  1040K|  1040K| 1260K (0)|
|   2 |   TABLE ACCESS FULL           | DIM   |      1 |    300 |    300 |00:00:00.01 |      23 |       |       |          |
|   3 |   BITMAP CONVERSION TO ROWIDS |       |      1 |   1000K|    996K|00:00:02.64 |     318 |       |       |          |
|   4 |    BITMAP INDEX FAST FULL SCAN| AS_BI |      1 |        |    599 |00:00:00.01 |     318 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------

What we see here is that (unhinted) oracle has converted the IN subquery to an EXISTS subquery then to a semi-join which it has chosen to operate as a HASH semi-join. But in the process of generating the probe (sescond) table Oracle has converted the bitmap index entries into a set of rowids – all 1,000,000 of them in my case – introducing a lot of redundant work. In the original customer query (version 9 or 10, I forget which) the optimizer unnested the subquery and converted it into an inline view with a distinct – but still performed a redundant bitmap conversion to rowids. In the case of the client, with rather more than 1M rows, this wasted a lot of CPU.


=====================================================================
select  /*+   qb_name(main)  */  dim.* from (  select   /*+
qb_name(inline)    no_merge    no_push_pred   */   distinct category
from   area_sales  ) sttv,  dim where  dim.id = sttv.category
=====================================================================

-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |       |      1 |        |    300 |00:00:00.02 |     341 |       |       |          |
|*  1 |  HASH JOIN                     |       |      1 |    300 |    300 |00:00:00.02 |     341 |  1969K|  1969K| 1521K (0)|
|   2 |   VIEW                         |       |      1 |    300 |    300 |00:00:00.01 |     306 |       |       |          |
|   3 |    HASH UNIQUE                 |       |      1 |    300 |    300 |00:00:00.01 |     306 |  2294K|  2294K| 2484K (0)|
|   4 |     BITMAP INDEX FAST FULL SCAN| AS_BI |      1 |   1000K|    600 |00:00:00.01 |     306 |       |       |          |
|   5 |   TABLE ACCESS FULL            | DIM   |      1 |    300 |    300 |00:00:00.01 |      35 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------

By introducing a manual unnest in the original client code I avoided the bitmap conversion to rowid, and the query executed much more efficiently. As you can see the optimizer has predicted the 1M rowids in the inline view, but used only the key values from the 600 index entries. In the case of the client it really was a case of manually unnesting a subquery that the optimizer was automatically unnesting – but without introducing the redundant rowids.

In my recent 12c test I had to include the no_merge and no_push_pred hints. In the absence of the no_merge hint Oracle did a join then group by, doing the rowid expansion in the process; if I added the no_merge hint without the no_push_pred hint then Oracle did a very efficient nested loop semi-join into the inline view. Although this still did the rowid expansion (predicting 3,333 rowids per key) it “stops early” thanks to the “semi” nature of the join so ran very quickly:


=========================================================================
select  /*+   qb_name(main)  */  dim.* from (  select   /*+
qb_name(inline)    no_merge   */   distinct category  from   area_sales
 ) sttv,  dim where  dim.id = sttv.category
=========================================================================

-------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |    300 |00:00:00.02 |     348 |
|   1 |  NESTED LOOPS SEMI            |       |      1 |    300 |    300 |00:00:00.02 |     348 |
|   2 |   TABLE ACCESS FULL           | DIM   |      1 |    300 |    300 |00:00:00.01 |      35 |
|   3 |   VIEW PUSHED PREDICATE       |       |    300 |   3333 |    300 |00:00:00.01 |     313 |
|   4 |    BITMAP CONVERSION TO ROWIDS|       |    300 |   3333 |    300 |00:00:00.01 |     313 |
|*  5 |     BITMAP INDEX SINGLE VALUE | AS_BI |    300 |        |    300 |00:00:00.01 |     313 |
-------------------------------------------------------------------------------------------------

Bottom line on all this – check your execution plans that use bitmap indexes – if you see a “bitmap conversion to rowids” in cases where you don’t then visit the table it may be a redundant conversion, and it may be expensive. If you suspect that this is happening then dbms_xplan.display_cursor() may confirm that you are doing a lot of CPU-intensive work to produce a very large number of rowids that you don’t need.

January 16, 2015

Spatial space

Filed under: Infrastructure,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 1:00 pm BST Jan 16,2015

One thing you (ought to) learn very early on in an Oracle career is that there are always cases you haven’t previously considered. It’s a feature that is frequently the downfall of “I found it on the internet” SQL.  Here’s one (heavily paraphrased) example that appeared on the OTN database forum a few days ago:

select table_name,round((blocks*8),2)||’kb’ “size” from user_tables where table_name = ‘MYTABLE';

select table_name,round((num_rows*avg_row_len/1024),2)||’kb’ “size” from user_tables where table_name = ‘MYTABLE';

The result from the first query is 704 kb,  the result from the second is 25.4 kb … fragmentation, rebuild, CTAS etc. etc.

The two queries are perfectly reasonable approximations (for an 8KB block size, with pctfree of zero) for the allocated space and actual data size for a basic heap table – and since the two values here don’t come close to matching it’s perfectly reasonable to consider doing something like a rebuild or shrink space to reclaim space and (perhaps) to improve performance.

In this case it doesn’t look as if the space reclaimed is likely to be huge (less than 1MB), on the other hand it’s probably not going to take much time to rebuild such a tiny table; it doesn’t seem likely that the rebuild could make a significant difference to performance (though apparently it did), but the act of rebuilding might cause execution plans to change for the better because new statistics might appear as the rebuild took place. The figures came from a test system, though, so maybe the table on the production system was much larger and the impact would be greater.

Being cautious about wasting time and introducing risk, I made a few comments about the question –  and learned that one of the columns was of type SDO_GEOMETRY. This makes a big difference about what to do next, because dbms_stats.gather_table_stats() doesn’t process such columns correctly, which results in a massive under-estimate for the avg_row_len (which is basically the sum of avg_col_len for the table). Here’s an example (run on 12c, based on some code taken from the 10gR2 manuals):


drop table cola_markets purge;

CREATE TABLE cola_markets (
  mkt_id NUMBER,
  name VARCHAR2(32),
  shape SDO_GEOMETRY);

INSERT INTO cola_markets VALUES(
  1,
  'cola_a',
  SDO_GEOMETRY(
    2003,  -- two-dimensional polygon
    NULL,
    NULL,
    SDO_ELEM_INFO_ARRAY(1,1003,3), -- one rectangle (1003 = exterior)
    SDO_ORDINATE_ARRAY(1,1, 5,7) -- only 2 points needed to
          -- define rectangle (lower left and upper right) with
          -- Cartesian-coordinate data
  )
);

insert into cola_markets select * from cola_markets;
/
/
/
/
/
/
/
/
/

execute dbms_stats.gather_table_stats(user,'cola_markets')
select
	avg_row_len, num_rows, blocks,
	round(avg_row_len * num_rows / 7200,0) expected_blocks
from user_tables where table_name = 'COLA_MARKETS';

analyze table cola_markets compute statistics;
select
	avg_row_len, num_rows, blocks,
	round(avg_row_len * num_rows / 7200,0) expected_blocks
from user_tables where table_name = 'COLA_MARKETS';

If you care to count the number of times I execute the “insert as select” it’s 10, so the table ends up with 2^10 = 1024 rows. The 7,200 in the calculated column converts bytes to approximate blocks on the assumption of 8KB blocks and pctfree = 10. Here are the results following the two different methods for generating object statistics:


PL/SQL procedure successfully completed.

AVG_ROW_LEN   NUM_ROWS     BLOCKS EXPECTED_BLOCKS
----------- ---------- ---------- ---------------
         14       1024        124               2

Table analyzed.

AVG_ROW_LEN   NUM_ROWS     BLOCKS EXPECTED_BLOCKS
----------- ---------- ---------- ---------------
        109       1024        124              16

Where does the difference in Expected_blocks come from ? (The Blocks figures is 124 because I’ve used 1MB uniform extents – 128 block – under ASSM (which means 4 space management blocks at the start of the first extent.)

Here are the column lengths after the call to dbms_stats: as you can see the avg_row_len is the sum of avg_col_len.


select column_name, data_type, avg_col_len
from   user_tab_cols
where  table_name = 'COLA_MARKETS'
order by
        column_id
;

COLUMN_NAME          DATA_TYPE                AVG_COL_LEN
-------------------- ------------------------ -----------
MKT_ID               NUMBER                             3
NAME                 VARCHAR2                           7
SYS_NC00010$         SDO_ORDINATE_ARRAY
SHAPE                SDO_GEOMETRY
SYS_NC00008$         NUMBER                             0
SYS_NC00004$         NUMBER                             4
SYS_NC00005$         NUMBER                             0
SYS_NC00006$         NUMBER                             0
SYS_NC00007$         NUMBER                             0
SYS_NC00009$         SDO_ELEM_INFO_ARRAY

The figures from the analyze command are only slightly different, but fortunately the analyze command uses the row directory pointers to calculate the actual row allocation, so picks up information about the impact of inline varrays, LOBs, etc. that the dbms_stats call might not be able to handle.


COLUMN_NAME          DATA_TYPE                AVG_COL_LEN
-------------------- ------------------------ -----------
MKT_ID               NUMBER                             2
NAME                 VARCHAR2                           6
SYS_NC00010$         SDO_ORDINATE_ARRAY
SHAPE                SDO_GEOMETRY
SYS_NC00008$         NUMBER                             1
SYS_NC00004$         NUMBER                             3
SYS_NC00005$         NUMBER                             1
SYS_NC00006$         NUMBER                             1
SYS_NC00007$         NUMBER                             1
SYS_NC00009$         SDO_ELEM_INFO_ARRAY

As a basic reminder – whenever you do anything slightly non-trivial (e.g. something you couldn’t have done in v5, say) then remember that all those dinky little script things you find on the Internet might not actually cover your particular case.

January 14, 2015

Bind Effects

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 1:24 pm BST Jan 14,2015

A couple of days ago I highlighted an optimizer anomaly caused by the presence of an index with a descending column. This was a minor (unrelated) detail that appeared in a problem on OTN where the optimizer was using an index FULL scan when someone was expecting to see an index RANGE scan. My earlier posting supplies the SQL to create the table and indexes I used to model the problem – and in this posting I’ll explain the problem and answer the central question.

Here’s the query and execution plan (from 11.2.0.x) as supplied by the OP – the odd appearance of the sys_op_descend() function calls is the minor detail that I explained in the previous post, but that’s not really relevant to the question of why Oracle is using an index full scan rather than an index range scan. The /*+ first_rows */ hint isn’t something you should be using but it was in the OP’s query, so I’ve included it in my model:


select /*+ FIRST_ROWS gather_plan_statistics scanned */ count(1)      FROM  XXX
where  (((((COL1 = '003' and COL2 >= '20150120') and COL3 >= '00000000') and COL4>= '000000000000' )
or ((COL1 = '003' and COL2 >= '20150120') and COL3> '00000000' )) or (COL1= '003' and COL2> '20150120'))
order by COL1,COL2,COL3,COL4  

Plan hash value: 919851669  

---------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |      1 |        |  18533 |00:01:47.04 |    156K |  70286 |
|   1 |  TABLE ACCESS BY INDEX ROWID| XXX    |      1 |  7886K |  18533 |00:01:47.04 |    156K |  70286 |
|*  2 |   INDEX FULL SCAN           | XXXXPK |      1 |  7886K |  18533 |00:01:30.36 |    131K |  61153 |
---------------------------------------------------------------------------------------------------------  

Predicate Information (identified by operation id):
---------------------------------------------------
  2 - filter((("COL2">:B2 AND "COL1"=:B1 AND
              SYS_OP_DESCEND("COL2")<SYS_OP_DESCEND(:B2)) OR ("COL1"=:B1 AND "COL2">=:B2
              AND "COL3">:B3 AND SYS_OP_DESCEND("COL2")<=SYS_OP_DESCEND(:B2)) OR
              ("COL1"=:B1 AND "COL2">=:B2 AND "COL3">=:B3 AND "COL4">=:B4 AND
              SYS_OP_DESCEND("COL2")<=SYS_OP_DESCEND(:B2))))  

If you look closely you’ll see that the OP has NOT supplied the output from a call to dbms_xplan.display_cursor() – the column and table names are highly suspect (but that’s allowable cosmetic change for confidentiality reasons) the giveaway is that the SQL statement uses literals but the execution plan is using bind variables (which are of the form B{number}, suggesting that the real SQL is embedded in PL/SQL with PL/SQL variables being used to supply values): the bind variables make a difference.

Let’s go back to my model to demonstrate the problem. Here’s a query with the same predicate structure as the problem query (with several pairs of brackets eliminated to improve readability) showing the actual run-time plan (from 11.2.0.4) when using literals:


select
        /*+ first_rows */
        *
from t1
where
        (C1 = 'DE' and C2 >  'AB')
or      (C1 = 'DE' and C2 >= 'AB' and C3 > 'AA' )
or      (C1 = 'DE' and C2 >= 'AB' and C3 >= 'AA' and C4 >= 'BB')
order by
        C1, C2, C3, C4
;

---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |       |       |     4 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |    21 |  2478 |     4  (25)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IASC |    21 |       |     3  (34)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C1"='DE')
       filter(((SYS_OP_DESCEND("C2")<SYS_OP_DESCEND('AB') AND "C2">'AB') OR
              (SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C3">'AA' AND "C2">='AB') OR
              (SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C4">='BB' AND "C2">='AB' AND
              "C3">='AA')))

As you can see, the optimizer has managed to “factor out” the predicate C1 = ‘DE’ from the three disjuncts and has then used it as an access() predicate for an index range scan. Now let’s see what the code and plan look like if we replace the four values by four bind variables:


variable B1 char(2)
variable B2 char(2)
variable B3 char(2)
variable B4 char(2)

begin
        :b1 := 'DE';
        :b2 := 'AB';
        :b3 := 'AA';
        :b4 := 'BB';
end;
/

select
        /*+ first_rows */
        *
from t1
where
        (C1 = :B1  and C2 >  :B2 )
or      (C1 = :B1  and C2 >= :B2 and C3 >  :B3 )
or      (C1 = :B1  and C2 >= :B2 and C3 >= :B3 and C4 >= :B4)
order by C1, C2, C3, C4
;

---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |       |       |    31 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |   437 | 51566 |    31   (4)| 00:00:01 |
|*  2 |   INDEX FULL SCAN           | T1_IASC |   437 |       |    27   (4)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter((("C1"=:B1 AND "C2">:B2) OR ("C1"=:B1 AND "C3">:B3 AND
              "C2">=:B2) OR ("C1"=:B1 AND "C4">=:B4 AND "C2">=:B2 AND "C3">=:B3)))

The optimizer hasn’t factored out the common expression C1 = :B1. The reason for this, I think, is that though WE know that :B1 is supposed to be the same thing in all three occurrences the optimizer isn’t able to assume that that’s the case; in principle :B1 could be the place holder for 3 different values – so the optimizer plays safe and optimizes for that case. This leaves it with three options: Full tablescan with filter predicates, index full scan with filter predicates, three-part concatenation with index range scans in all three parts. The combination of the /*+ first_rows */ hint and the “order by” clause which matches the t1_1asc index has left the optimizer choosing the index full scan path – presumably to avoid the need to collect all the rows and sort them before returning the first row.

Given our understanding of the cause of the problem we now have a clue about how we might make the query more efficient – we have to eliminate the repetition of (at least) the :B1 bind variable. In fact we can get some extra mileage by modifying the repetition of the :B2 bind variable. Here’s a rewrite that may help:


select
        /*+ first_rows */
        *
from t1
where
        (C1 = :B1 and C2 >= :B2)
and     (
             C2 > :B2
         or  C3 > :B3
         or (C3 >= :B3 and C4 > :B4)
        )
order by C1, C2, C3, C4
;

---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |       |       |     4 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |   148 | 17464 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IASC |   148 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C1"=:B1 AND "C2">=:B2 AND "C2" IS NOT NULL)
       filter(("C2">:B2 OR "C3">:B3 OR ("C4">:B4 AND "C3">=:B3)))

I’ve factored out as much of the C1 and C2 predicates as I can – and the optimizer has used the resulting conditions as the access() predicate on the index (adding in a “not null” predicate on C2 that looks redundant to me – in fact the index was on the primary key in the original, but I hadn’t included that constraint in my model). You’ll notice, by the way, that the cardinality is now 148; compare this with the previous cardinality of 437 and you might (without bothering to look closely as the 10053 trace) do some hand-waving around the fact that 437 = (approximately) 148 * 3, which fits the idea that the optimizer was treating the three :B1 appearances as if they were three different possible values accessing three sets of data.

Miscellaneous.

This isn’t the end of the story; there are always more complications and surprises in store as you look further into the detail. For example, on the upgrade to 12c the execution plan for the query with bind variables was the same (ignoring the sys_op_descend() functions) as the query using literals – the optimizer managed to factor out the C1 predicate: does this mean SQL*Plus got smarter about telling the optimizer about the bind variables, or does it mean the optimizer got smarter about something that SQL*Plus has been doing all along ?

This change might make you think that the optimizer is supposed to assume that bind variables of the same name represent the same thing – but that’s not correct, and it’s easy to show; here’s a trivial example (accessing the same table with a query that, for my data, identifies the first row):


declare
        m_id number := 1;
        m_c1 char(2) := 'BV';
        m_c2 char(2) := 'GF';
        m_n number := 0;
begin
        execute immediate
                'SELECT /*+ FIND THIS */ COUNT(*) FROM T1 WHERE ID = :B1 AND C1 = :B1 AND C2 = :B1'
                into m_n
                using m_id, m_c1, m_c2
        ;
end;
/

select sql_id, sql_text from V$sql where sql_text like 'SELECT%FIND THIS%';

SQL_ID        SQL_TEXT
------------- ----------------------------------------------------------------------------------
9px3nuv54maxp SELECT /*+ FIND THIS */ COUNT(*) FROM T1 WHERE ID = :B1 AND C1 = :B1 AND C2 = :B1

If you were looking at the contents of v$sql, or a trace file, or an AWR report, you might easily be fooled into thinking that this was a query where the same value had been used three times – when we know that it wasn’t.

So, as we upgrade from 11g to 12c my model of the original problem suggests that the problem is going to go away – but, actually, I don’t really know why that’s the case (yet). On the other hand, I have at least recognised a pattern that the 11g optimizer currently has a problem with, and I have a method for helping the optimizer to be a little more efficient.

 

January 12, 2015

FBI Bug reprise

Filed under: Execution plans,Oracle,Troubleshooting,Uncategorized — Jonathan Lewis @ 12:52 pm BST Jan 12,2015

I’ve just had cause to resurrect a blog note I wrote three years ago. The note says that an anomaly I discovered in 9.2.0.8 wasfixed in 10.2.0.3 – and this is true for the simple example in the posting; but a recent question on the OTN database forum has shown that the bug still appears in more complex cases.  Here’s some code to create a table and two indexes:


create table t1
nologging
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	rownum			id,
	cast(dbms_random.string('U',2) as char(2))	c1,
	cast(dbms_random.string('U',2) as char(2))	c2,
	cast(dbms_random.string('U',2) as char(2))	c3,
	cast(dbms_random.string('U',2) as char(2))	c4,
	rpad('x',100)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

create index t1_iasc  on t1(c1, c2,      c3, c4) nologging;
create index t1_idesc on t1(c1, c2 desc, c3, c4) nologging;

I’ve designed the table to model the way a problem was presented on OTN, it’s possible that the anomaly would appear in simpler circumstance; note that I have two indexes on the same four columns, but the second column of one of the indexes is declared as descending. To identify the indexes easily in execution plans the latter index has the text “desc” in its name. So here’s a query, with execution plan, that should use one of these indexes:


select
        *
from t1
where
        (C1 = 'DE' and C2 >  'AB')
or      (C1 = 'DE' and C2 >= 'AB' and C3 > 'AA' )
or      (C1 = 'DE' and C2 >= 'AB' and C3 >= 'AA' and C4 >= 'BB')
order by
        C1, C2, C3, C4
;

---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |       |       |     4 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |    21 |  2478 |     4  (25)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IASC |    21 |       |     3  (34)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C1"='DE')
       filter(((SYS_OP_DESCEND("C2")<SYS_OP_DESCEND('AB') AND "C2">'AB') OR
              (SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C3">'AA' AND "C2">='AB') OR
              (SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C4">='BB' AND "C2">='AB' AND
              "C3">='AA')))

Sure enough – the query has used the t1_iasc index – but why has the optimizer introduced all those predicates with the sys_op_descend() function calls in them when we’re not using an index with a descending column ? Somewhere in the code path the optimizer has picked up the other index, and decided to use some information from it that is totally redundant. One side effect of this is that the cardinality prediction is 21 – if I drop the index t1_idesc the sys_op_descend() calls disappear and the cardinality changes to 148.

Oracle 12c behaves differently – it uses concatenation to turn the query into three separate index probes unless I add the hint /*+ first_rows */ (which I shouldn’t be using, but it’s the hint supplied by the OP on OTN). With the hint in place we get an example of the actual execution plan differing from the prediction made through explain plan:


12c execution plan unhinted (concatenation, and lots of sys_op_descend()):

--------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |          |       |       |    12 (100)|          |
|   1 |  SORT ORDER BY                        |          |   149 | 17582 |    12   (9)| 00:00:01 |
|   2 |   CONCATENATION                       |          |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1       |   142 | 16756 |     4   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                  | T1_IDESC |     1 |       |     2   (0)| 00:00:01 |
|   5 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1       |     6 |   708 |     4   (0)| 00:00:01 |
|*  6 |     INDEX RANGE SCAN                  | T1_IDESC |     1 |       |     2   (0)| 00:00:01 |
|   7 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1       |     1 |   118 |     3   (0)| 00:00:01 |
|*  8 |     INDEX RANGE SCAN                  | T1_IASC  |     1 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("C1"='DE' AND "C3">='AA' AND "C4">='BB' AND
              SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB'))
       filter("C4">='BB' AND SYS_OP_UNDESCEND(SYS_OP_DESCEND("C2"))>='AB' AND "C3">='AA')
   6 - access("C1"='DE' AND "C3">'AA' AND SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB'))
       filter("C3">'AA' AND SYS_OP_UNDESCEND(SYS_OP_DESCEND("C2"))>='AB' AND
              (LNNVL("C4">='BB') OR LNNVL(SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB')) OR
              LNNVL(SYS_OP_UNDESCEND(SYS_OP_DESCEND("C2"))>='AB') OR LNNVL("C3">='AA')))
   8 - access("C1"='DE' AND "C2">'AB' AND "C2" IS NOT NULL)
       filter((LNNVL("C3">'AA') OR LNNVL(SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB')) OR
              LNNVL("C2">='AB')) AND (LNNVL("C4">='BB') OR
              LNNVL(SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB')) OR LNNVL("C2">='AB') OR
              LNNVL("C3">='AA')))

12c Execution plan with first_rows hint (and the sys_op_descend have gone)
---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |       |       |   150 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |   148 | 17464 |   150   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IASC |   148 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C1"='DE')
       filter(("C2">'AB' OR ("C3">'AA' AND "C2">='AB') OR ("C4">='BB' AND
              "C2">='AB' AND "C3">='AA')))

12c Execution plan with first_rows according to Explain Plan (and they're back again)
---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |   148 | 17464 |   150   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |   148 | 17464 |   150   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IASC |   148 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C1"='DE')
       filter("C2">'AB' AND SYS_OP_DESCEND("C2")<SYS_OP_DESCEND('AB') OR
              "C3">'AA' AND SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C2">='AB' OR
              "C4">='BB' AND SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C2">='AB' AND
              "C3">='AA')

On the plus side the upgrade to 12c has removed some of the sys_op_descend() appearances and seems to have fixed the cardinality anomaly when the sys_op_descend() calls do appear – but things are still going wrong in odd ways. (And this looks like another case where you could drop an index that you’re not using and seeing an execution plan change.)

Footnote:

When playing around with 12c, I had to be very careful to avoid confusing the issue as SQL Plan Directives were appearing from time to time, and giving me unexpected results as I repeated some of the experiments.

Footnote 2:

I haven’t actually addressed the question originally posed on OTN, but that can wait for another day.

January 9, 2015

count(*) – again !

Filed under: bitmaps,humour,Indexing,Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 12:56 pm BST Jan 9,2015

Because you can never have enough of a good thing.

Here’s a thought – The optimizer doesn’t treat all constants equally.  No explanations, just read the code – execution plans at the end:


SQL> drop table t1 purge;
SQL> create table t1 nologging as select * from all_objects;
SQL> create bitmap index t1_b1 on t1(owner);

SQL> alter session set statistics_level = all;

SQL> set serveroutput off
SQL> select count(*) from t1;
SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

SQL> select count(1) from t1;
SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

SQL> select count(-1) from t1;
SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

SQL> alter session set cursor_sharing = force;
SQL> alter system flush shared_pool;

SQL> select count(1) from t1;
SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

So, are you expecting to see the same results and performance from every single one of those queries ?


select count(*) from t1
----------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |      1 |00:00:00.01 |       9 |      5 |
|   1 |  SORT AGGREGATE               |       |      1 |      1 |      1 |00:00:00.01 |       9 |      5 |
|   2 |   BITMAP CONVERSION COUNT     |       |      1 |  84499 |     31 |00:00:00.01 |       9 |      5 |
|   3 |    BITMAP INDEX FAST FULL SCAN| T1_B1 |      1 |        |     31 |00:00:00.01 |       9 |      5 |
----------------------------------------------------------------------------------------------------------

select count(1) from t1
-------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |      1 |00:00:00.01 |       9 |
|   1 |  SORT AGGREGATE               |       |      1 |      1 |      1 |00:00:00.01 |       9 |
|   2 |   BITMAP CONVERSION COUNT     |       |      1 |  84499 |     31 |00:00:00.01 |       9 |
|   3 |    BITMAP INDEX FAST FULL SCAN| T1_B1 |      1 |        |     31 |00:00:00.01 |       9 |
-------------------------------------------------------------------------------------------------

select count(-1) from t1
-------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |      1 |00:00:00.43 |       9 |
|   1 |  SORT AGGREGATE               |       |      1 |      1 |      1 |00:00:00.43 |       9 |
|   2 |   BITMAP CONVERSION TO ROWIDS |       |      1 |  84499 |  84499 |00:00:00.22 |       9 |
|   3 |    BITMAP INDEX FAST FULL SCAN| T1_B1 |      1 |        |     31 |00:00:00.01 |       9 |
-------------------------------------------------------------------------------------------------

SQL> alter session set cursor_sharing = force;
SQL> alter system flush shared_pool;

select count(1) from t1
select count(:"SYS_B_0") from t1    -- effect of cursor-sharing
-------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |      1 |00:00:00.46 |       9 |
|   1 |  SORT AGGREGATE               |       |      1 |      1 |      1 |00:00:00.46 |       9 |
|   2 |   BITMAP CONVERSION TO ROWIDS |       |      1 |  84499 |  84499 |00:00:00.23 |       9 |
|   3 |    BITMAP INDEX FAST FULL SCAN| T1_B1 |      1 |        |     31 |00:00:00.01 |       9 |
-------------------------------------------------------------------------------------------------

Check operation 2 in each plan – with the bitmap index in place there are two possible ways to count the rows referenced in the index – and one of them converts to rowids and does a lot more work.

The only “real” threat in this set of examples, of course, is the bind variable one – there are times when count(*) WILL be faster than count(1). Having said that, there is a case where a redundant “conversion to rowids” IS a threat – and I’ll write that up some time in the near future.

Trick question: when is 1+1 != 2 ?
Silly answer: compare the plan for: “select count (2) from t1″ with the plan for “select count(1+1) from t1″

Note: All tests above run on 12.1.0.2

December 12, 2014

push_pred – evolution

Filed under: CBO,Hints,Oracle,Troubleshooting — Jonathan Lewis @ 2:22 pm BST Dec 12,2014

Here’s a query (with a few hints to control how I want Oracle to run it) that demonstrates the difficulty of trying to solve problems by hinting (and the need to make sure you know where all your hinted code is):


select
	/*+
		qb_name(main)
		leading (@main t1@main v1@main t4@main)
		push_pred(v1@main)
	*/
	t1.*,v1.*,t4.*
from
	t1,
	(
	select	/*+ qb_name(inline) no_merge */
		t2.n1, t3.n2, count(*)
	from	t2, t3
	where exists (
		select	/*+ qb_name(subq) no_unnest push_subq */
			null
		from	t5
		where	t5.object_id = t2.n1
		)
	and	t3.n1 = t2.n2
	group by t2.n1, t3.n2
	)	v1,
	t4
where
	v1.n1 = t1.n1
and	t4.n1(+) = v1.n1
;

Nominally it’s a three-table join, except the second table is an in-line view which joins two tables and includes an existence subquery. Temporarily I have made the join to t4 an outer join – but that’s just to allow me to make a point, I don’t want an outer join in the final query. I’ve had to include the no_merge() hint in the inline view to stop Oracle using complex view merging to “join then aggregate” when I want it to “aggregate then join”; I’ve included the no_unnest and push_subq hints to make sure that the subquery is operated as a subquery, but operates at the earliest possible moment in the inline view. Ignoring the outer join (which would make operation 1 a nested loop outer), this is the execution plan I want to see:


-------------------------------------------------------------------------------------------
| Id  | Operation                         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |       |    50 | 12850 |  4060   (1)| 00:00:21 |
|   1 |  NESTED LOOPS                     |       |    50 | 12850 |  4060   (1)| 00:00:21 |
|   2 |   NESTED LOOPS                    |       |    50 | 12850 |  4060   (1)| 00:00:21 |
|   3 |    NESTED LOOPS                   |       |    50 |  7400 |  4010   (1)| 00:00:21 |
|   4 |     TABLE ACCESS FULL             | T1    |  1000 |   106K|     3   (0)| 00:00:01 |
|   5 |     VIEW PUSHED PREDICATE         |       |     1 |    39 |     4   (0)| 00:00:01 |
|   6 |      SORT GROUP BY                |       |     1 |    16 |     4   (0)| 00:00:01 |
|   7 |       NESTED LOOPS                |       |     1 |    16 |     3   (0)| 00:00:01 |
|   8 |        TABLE ACCESS BY INDEX ROWID| T2    |     1 |     8 |     2   (0)| 00:00:01 |
|*  9 |         INDEX UNIQUE SCAN         | T2_PK |     1 |       |     1   (0)| 00:00:01 |
|* 10 |          INDEX RANGE SCAN         | T5_I1 |     1 |     4 |     1   (0)| 00:00:01 |
|  11 |        TABLE ACCESS BY INDEX ROWID| T3    |     1 |     8 |     1   (0)| 00:00:01 |
|* 12 |         INDEX UNIQUE SCAN         | T3_PK |     1 |       |     0   (0)| 00:00:01 |
|* 13 |    INDEX UNIQUE SCAN              | T4_PK |     1 |       |     0   (0)| 00:00:01 |
|  14 |   TABLE ACCESS BY INDEX ROWID     | T4    |     1 |   109 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - access("T2"."N1"="T1"."N1")
       filter( EXISTS (SELECT /*+ PUSH_SUBQ NO_UNNEST QB_NAME ("SUBQ") */ 0 FROM
              "T5" "T5" WHERE "T5"."OBJECT_ID"=:B1))
  10 - access("T5"."OBJECT_ID"=:B1)
  12 - access("T3"."N1"="T2"."N2")
  13 - access("T4"."N1"="V1"."N1")

Note, particularly, operation 5: VIEW PUSHED PREDICATE, and the associated access predicate at line 9 “t2.n1 = t1.n1″ where the predicate based on t1 has been pushed inside the inline view: so Oracle will evaluate a subset view for each selected row of t1, which is what I wanted. Then you can see operation 10 is an index range scan of t5_i1, acting as a child to the index unique scan of t2_pk of operation 9 – that’s Oracle keeping the subquery as a subquery and executing it as early as possible.

So what happens when I try to get this execution plan using the SQL and hints I’ve got so far ?

Here’s the plan I got from 10.2.0.5:


--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |    50 | 12750 |    62   (4)| 00:00:01 |
|   1 |  NESTED LOOPS                |       |    50 | 12750 |    62   (4)| 00:00:01 |
|*  2 |   HASH JOIN                  |       |    50 |  7350 |    12  (17)| 00:00:01 |
|   3 |    TABLE ACCESS FULL         | T1    |  1000 |   105K|     3   (0)| 00:00:01 |
|   4 |    VIEW                      |       |    50 |  1950 |     9  (23)| 00:00:01 |
|   5 |     HASH GROUP BY            |       |    50 |   800 |     9  (23)| 00:00:01 |
|*  6 |      HASH JOIN               |       |    50 |   800 |     7  (15)| 00:00:01 |
|*  7 |       TABLE ACCESS FULL      | T2    |    50 |   400 |     3   (0)| 00:00:01 |
|*  8 |        INDEX RANGE SCAN      | T5_I1 |     1 |     4 |     1   (0)| 00:00:01 |
|   9 |       TABLE ACCESS FULL      | T3    |  1000 |  8000 |     3   (0)| 00:00:01 |
|  10 |   TABLE ACCESS BY INDEX ROWID| T4    |     1 |   108 |     1   (0)| 00:00:01 |
|* 11 |    INDEX UNIQUE SCAN         | T4_PK |     1 |       |     0   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("V1"."N1"="T1"."N1")
   6 - access("T3"."N1"="T2"."N2")
   7 - filter( EXISTS (SELECT /*+ PUSH_SUBQ NO_UNNEST QB_NAME ("SUBQ") */ 0
              FROM "T5" "T5" WHERE "T5"."OBJECT_ID"=:B1))
   8 - access("T5"."OBJECT_ID"=:B1)
  11 - access("T4"."N1"="V1"."N1")

In 10g the optimizer has not pushed the join predicate down into the view (the t1 join predicate appears in the hash join at line 2); I think this is because the view has been declared non-mergeable through a hint. So let’s upgrade to 11.1.0.7:

------------------------------------------------------------------------------------------
| Id  | Operation                        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |       |    50 | 12950 |  4008K  (1)| 05:34:04 |
|   1 |  NESTED LOOPS                    |       |    50 | 12950 |  4008K  (1)| 05:34:04 |
|   2 |   MERGE JOIN CARTESIAN           |       |  1000K|   205M|  2065   (3)| 00:00:11 |
|   3 |    TABLE ACCESS FULL             | T1    |  1000 |   105K|     3   (0)| 00:00:01 |
|   4 |    BUFFER SORT                   |       |  1000 |   105K|  2062   (3)| 00:00:11 |
|   5 |     TABLE ACCESS FULL            | T4    |  1000 |   105K|     2   (0)| 00:00:01 |
|   6 |   VIEW PUSHED PREDICATE          |       |     1 |    43 |     4   (0)| 00:00:01 |
|   7 |    SORT GROUP BY                 |       |     1 |    16 |     4   (0)| 00:00:01 |
|*  8 |     FILTER                       |       |       |       |            |          |
|   9 |      NESTED LOOPS                |       |     1 |    16 |     3   (0)| 00:00:01 |
|  10 |       TABLE ACCESS BY INDEX ROWID| T2    |     1 |     8 |     2   (0)| 00:00:01 |
|* 11 |        INDEX UNIQUE SCAN         | T2_PK |     1 |       |     1   (0)| 00:00:01 |
|* 12 |         INDEX RANGE SCAN         | T5_I1 |     1 |     4 |     1   (0)| 00:00:01 |
|  13 |       TABLE ACCESS BY INDEX ROWID| T3    |  1000 |  8000 |     1   (0)| 00:00:01 |
|* 14 |        INDEX UNIQUE SCAN         | T3_PK |     1 |       |     0   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - filter("T4"."N1"="T1"."N1")
  11 - access("T2"."N1"="T4"."N1")
       filter( EXISTS (SELECT /*+ PUSH_SUBQ NO_UNNEST QB_NAME ("SUBQ") */ 0 FROM
              "T5" "T5" WHERE "T5"."OBJECT_ID"=:B1))
  12 - access("T5"."OBJECT_ID"=:B1)
  14 - access("T3"."N1"="T2"."N2")

Excellent – at operation 6 we see VIEW PUSHED PREDICATE, and at operation 11 we can see that the join predicate “t2.n1 = t1.n1″.

Less excellent – we have a Cartesian Merge Join between t1 and t4 before pushing predicates. Of course, we told the optimizer to push join predicates into the view, and there are two join predicates, one from t1 and one from t4 – and we didn’t tell the optimizer that we only wanted to push the t1 join predicate into the view. Clearly we need a way of specifying where predicates should be pushed FROM as well as a way of specifying where they should be pushed TO.

If we take a look at the outline information from the execution plan there’s a clue in one of the outline hints: PUSH_PRED(@”MAIN” “V1″@”MAIN” 3 2) – the hint has a couple of extra parameters to it – perhaps the 2 and 3 refer in some way to the 2nd and 3rd tables in the query. If I test with an outer join to t4 (which means the optimizer won’t be able to use my t4 predicate as a join INTO the view) I get the plan I want (except it’s an outer join, of course), and the hint changes to: PUSH_PRED(@”MAIN” “V1″@”MAIN” 2) – so maybe the 2 refers to t1 and the 3 referred to t4, so let’s try the following hints:


push_pred(v1@main 2)
no_push_pred(v1@main 3)

Unfortunately this gives us the following plan:


--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |    50 | 12300 |    62   (4)| 00:00:01 |
|   1 |  NESTED LOOPS OUTER          |       |    50 | 12300 |    62   (4)| 00:00:01 |
|*  2 |   HASH JOIN                  |       |    50 |  6900 |    12  (17)| 00:00:01 |
|   3 |    TABLE ACCESS FULL         | T1    |  1000 |   105K|     3   (0)| 00:00:01 |
|   4 |    VIEW                      |       |    50 |  1500 |     9  (23)| 00:00:01 |
|   5 |     HASH GROUP BY            |       |    50 |   800 |     9  (23)| 00:00:01 |
|*  6 |      HASH JOIN               |       |    50 |   800 |     7  (15)| 00:00:01 |
|*  7 |       TABLE ACCESS FULL      | T2    |    50 |   400 |     3   (0)| 00:00:01 |
|*  8 |        INDEX RANGE SCAN      | T5_I1 |     1 |     4 |     1   (0)| 00:00:01 |
|   9 |       TABLE ACCESS FULL      | T3    |  1000 |  8000 |     3   (0)| 00:00:01 |
|  10 |   TABLE ACCESS BY INDEX ROWID| T4    |     1 |   108 |     1   (0)| 00:00:01 |
|* 11 |    INDEX UNIQUE SCAN         | T4_PK |     1 |       |     0   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("V1"."N1"="T1"."N1")
   6 - access("T3"."N1"="T2"."N2")
   7 - filter( EXISTS (SELECT /*+ PUSH_SUBQ NO_UNNEST QB_NAME ("SUBQ") */ 0
              FROM "T5" "T5" WHERE "T5"."OBJECT_ID"=:B1))
   8 - access("T5"."OBJECT_ID"=:B1)
  11 - access("T4"."N1"(+)="V1"."N1")

We don’t have join predicate pushdown; on the other hand we’ve got the join order we specified with our leading() hint – and that didn’t appear previously when we got the Cartesian Merge Join with predicate pushdown (our hints were incompatible, so something had to fail). So maybe the numbering has changed because the join order has changed and I should push_pred(v1 1) and no_push_pred(v1 3). Alas, trying all combinations of 2 values from 1,2, and 3 I can’t get the plan I want.

So let’s upgrade to 11.2.0.4. As hinted we get the pushed predicate with Cartesian merge join, but this time the push_pred() hint that appears in the outline looks like this: PUSH_PRED(@”MAIN” “V1″@”MAIN” 2 1) – note how the numbers have changed between 11.1.0.7 and 11.2.0.4. So let’s see what happens when I try two separate hints again, fiddling with the third parameter, e.g.:


push_pred(v1@main 1)
no_push_pred(v1@main 2)

With the values set as above I got the plan I want – it’s just a pity that I’m not 100% certain how the numbering in the push_pred() and no_push_pred() hints is supposed to work. In this case, though, it no longer matters as all I have to do now is create an SQL Baseline for my query, transferring the hinted plan into the the SMB with the unhinted SQL.

In passing, I did manage to get the plan I wanted in 11.1.0.7 by adding the hint /*+ outline_leaf(@main) */ to the original SQL. I’m even less keen on doing that than I am on adding undocumented parameters to the push_pred() and no_push_pred() hints, of course; but having done it I did wonder if there are any SQL Plan Baslines in 11.1.0.7 production systems that include the push_pred() hint that are going to change plan on the upgrade to 11.2.0.4 because the numbering inside the hint is supposed to change with version.

Footnote:

Loosely speaking, this blog note is the answer to a question posted about five years ago.

Next Page »

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 5,189 other followers