Oracle Scratchpad

June 17, 2015

Reverse Key

Filed under: Indexing,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:11 pm BST Jun 17,2015

A question came up on the OTN database forum recently asking if you could have a partitioned index on a non-partitioned table.

(Aside: I’m not sure whether it would be quicker to read the manuals or try the experiment – either would probably be quicker than posing the question to the forum. As so often happens in these RTFM questions the OP didn’t bother to acknowledge any of the responses)

The answer to the question is yes – you can create a globally partitioned index, though if it uses range partitioning you have to specify a MAXVALUE partition. The interesting thing about the question, though, is that several people tried to guess why it had been asked and then made suggestions based on the most likely guess (and wouldn’t it have been nice to see some response from the OP). The common guess was that there was a performance problem with the high-value block of a sequence-based (or time-based) index – a frequent source of “buffer busy wait” events and other nasty side effects.
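For reference, a minimal sketch of the syntax – the object names are invented for illustration, and the MAXVALUE partition is the mandatory part:

create index big_t_gri on big_t(id)
global partition by range (id) (
	partition p1	values less than (1000000),
	partition p_max	values less than (maxvalue)
);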

Unfortunately too many people suggested reverse key indexes as a solution to this “right-hand” problem. If you’re licensed for partitioning it’s almost certain that a better option would simply be to use global hash partitioning (with 2^N partitions for some suitable N). Using reverse keys can result in a bigger performance problem than the one you’re trying to avoid – you may end up turning a little time spent on buffer busy waits into a large amount of time spent on db file sequential reads. To demonstrate the issue I’ve created a sample script – and adjusted my buffer cache down to the appropriate scale:

create table t1(
	id	not null
)
nologging
as
with generator as (
	select	--+ materialize
		rownum id 
	from dual 
	connect by 
		rownum <= 1e4
)
select
	1e7 + rownum	id
from
	generator	v1,
	generator	v2
where
	rownum <= 1e7 
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1'
	);
end;
/

alter table t1 add constraint t1_pk primary key(id) 
using index 
	reverse 
	nologging 
;

alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 8';

begin
	for i in 20000001..20010000 loop
		insert into t1 values(i);
	end loop;
end;
/

I’ve created a table with 10,000,000 rows using a sequential value as the primary key, then inserted “the next” 10,000 rows into the table in order. The index occupied about 22,000 blocks, so to make my demonstration reflect the type of effect you could get from a busy production system (with more tables and many indexes) I ran my test with the buffer cache limited to 6,000 blocks – a fair fraction of the total index size. Here’s a small section of the trace file from the test running 10.2.0.3 on an elderly machine:


WAIT #43: nam='db file sequential read' ela= 13238 file#=6 block#=12653 blocks=1 obj#=63623 tim=3271125590
WAIT #43: nam='db file sequential read' ela=  7360 file#=6 block#=12749 blocks=1 obj#=63623 tim=3271133150
WAIT #43: nam='db file sequential read' ela=  5793 file#=6 block#=12844 blocks=1 obj#=63623 tim=3271139110
WAIT #43: nam='db file sequential read' ela=  5672 file#=6 block#=12940 blocks=1 obj#=63623 tim=3271145028
WAIT #43: nam='db file sequential read' ela= 15748 file#=5 block#=13037 blocks=1 obj#=63623 tim=3271160998
WAIT #43: nam='db file sequential read' ela=  8080 file#=5 block#=13133 blocks=1 obj#=63623 tim=3271169314
WAIT #43: nam='db file sequential read' ela=  8706 file#=5 block#=13228 blocks=1 obj#=63623 tim=3271178240
WAIT #43: nam='db file sequential read' ela=  7919 file#=5 block#=13325 blocks=1 obj#=63623 tim=3271186372
WAIT #43: nam='db file sequential read' ela= 15553 file#=6 block#=13549 blocks=1 obj#=63623 tim=3271202115
WAIT #43: nam='db file sequential read' ela=  7044 file#=6 block#=13644 blocks=1 obj#=63623 tim=3271209420
WAIT #43: nam='db file sequential read' ela=  6062 file#=6 block#=13741 blocks=1 obj#=63623 tim=3271215648
WAIT #43: nam='db file sequential read' ela=  6067 file#=6 block#=13837 blocks=1 obj#=63623 tim=3271221887
WAIT #43: nam='db file sequential read' ela= 11516 file#=5 block#=13932 blocks=1 obj#=63623 tim=3271234852
WAIT #43: nam='db file sequential read' ela=  9295 file#=5 block#=14028 blocks=1 obj#=63623 tim=3271244368
WAIT #43: nam='db file sequential read' ela=  9466 file#=5 block#=14125 blocks=1 obj#=63623 tim=3271254002
WAIT #43: nam='db file sequential read' ela=  7704 file#=5 block#=14221 blocks=1 obj#=63623 tim=3271261991
WAIT #43: nam='db file sequential read' ela= 16319 file#=6 block#=14444 blocks=1 obj#=63623 tim=3271278492
WAIT #43: nam='db file sequential read' ela=  7416 file#=6 block#=14541 blocks=1 obj#=63623 tim=3271286129
WAIT #43: nam='db file sequential read' ela=  5748 file#=6 block#=14637 blocks=1 obj#=63623 tim=3271292163
WAIT #43: nam='db file sequential read' ela=  7131 file#=6 block#=14732 blocks=1 obj#=63623 tim=3271299489
WAIT #43: nam='db file sequential read' ela= 16126 file#=5 block#=14829 blocks=1 obj#=63623 tim=3271315883
WAIT #43: nam='db file sequential read' ela=  7746 file#=5 block#=14925 blocks=1 obj#=63623 tim=3271323845
WAIT #43: nam='db file sequential read' ela=  9208 file#=5 block#=15020 blocks=1 obj#=63623 tim=3271333239
WAIT #43: nam='db file sequential read' ela=  7708 file#=5 block#=15116 blocks=1 obj#=63623 tim=3271341141
WAIT #43: nam='db file sequential read' ela= 15484 file#=6 block#=15341 blocks=1 obj#=63623 tim=3271356807
WAIT #43: nam='db file sequential read' ela=  5488 file#=6 block#=15437 blocks=1 obj#=63623 tim=3271362623
WAIT #43: nam='db file sequential read' ela= 10447 file#=6 block#=15532 blocks=1 obj#=63623 tim=3271373342
WAIT #43: nam='db file sequential read' ela= 12565 file#=6 block#=15629 blocks=1 obj#=63623 tim=3271386741
WAIT #43: nam='db file sequential read' ela= 17168 file#=5 block#=15725 blocks=1 obj#=63623 tim=3271404135
WAIT #43: nam='db file sequential read' ela=  7542 file#=5 block#=15820 blocks=1 obj#=63623 tim=3271411882
WAIT #43: nam='db file sequential read' ela=  9400 file#=5 block#=15917 blocks=1 obj#=63623 tim=3271421514
WAIT #43: nam='db file sequential read' ela=  7804 file#=5 block#=16013 blocks=1 obj#=63623 tim=3271429519
WAIT #43: nam='db file sequential read' ela= 14470 file#=6 block#=16237 blocks=1 obj#=63623 tim=3271444168
WAIT #43: nam='db file sequential read' ela=  5788 file#=6 block#=16333 blocks=1 obj#=63623 tim=3271450154
WAIT #43: nam='db file sequential read' ela=  9630 file#=6 block#=16429 blocks=1 obj#=63623 tim=3271460008
WAIT #43: nam='db file sequential read' ela= 10910 file#=6 block#=16525 blocks=1 obj#=63623 tim=3271471174
WAIT #43: nam='db file sequential read' ela= 15683 file#=5 block#=16620 blocks=1 obj#=63623 tim=3271487065
WAIT #43: nam='db file sequential read' ela=  8094 file#=5 block#=16717 blocks=1 obj#=63623 tim=3271495454
WAIT #43: nam='db file sequential read' ela=  6670 file#=5 block#=16813 blocks=1 obj#=63623 tim=3271502293
WAIT #43: nam='db file sequential read' ela=  7852 file#=5 block#=16908 blocks=1 obj#=63623 tim=3271510360
WAIT #43: nam='db file sequential read' ela= 10500 file#=6 block#=17133 blocks=1 obj#=63623 tim=3271521039
WAIT #43: nam='db file sequential read' ela= 11038 file#=6 block#=17229 blocks=1 obj#=63623 tim=3271532275
WAIT #43: nam='db file sequential read' ela= 12432 file#=6 block#=17325 blocks=1 obj#=63623 tim=3271544974
WAIT #43: nam='db file sequential read' ela=  7784 file#=6 block#=17421 blocks=1 obj#=63623 tim=3271553331
WAIT #43: nam='db file sequential read' ela=  7774 file#=5 block#=17517 blocks=1 obj#=63623 tim=3271561346
WAIT #43: nam='db file sequential read' ela=  6583 file#=5 block#=17613 blocks=1 obj#=63623 tim=3271568146
WAIT #43: nam='db file sequential read' ela=  7901 file#=5 block#=17708 blocks=1 obj#=63623 tim=3271576231
WAIT #43: nam='db file sequential read' ela=  6667 file#=5 block#=17805 blocks=1 obj#=63623 tim=3271583259
WAIT #43: nam='db file sequential read' ela=  9427 file#=6 block#=18029 blocks=1 obj#=63623 tim=3271592988
WAIT #43: nam='db file sequential read' ela= 52334 file#=6 block#=18125 blocks=1 obj#=63623 tim=3271646055
WAIT #43: nam='db file sequential read' ela= 50512 file#=6 block#=18221 blocks=1 obj#=63623 tim=3271697284
WAIT #43: nam='db file sequential read' ela= 10095 file#=6 block#=18317 blocks=1 obj#=63623 tim=3271708095

Check the block numbers in this list of single block reads – we’re jumping through the index about 100 blocks at a time to read the next block where an index entry has to go. The jumps are the expected (and designed) effect of reverse key indexes: the fact that the jumps turn into physical disc reads is the (possibly unexpected) side effect. Reversing an index makes adjacent values look very different (by reversing the bytes) and sends them to different index leaf blocks: the purpose of the exercise is to scatter concurrent similar inserts across multiple blocks, but if you scatter the index entries you need to buffer a lot more of the index to keep the most recently used values in memory. Reversing the index may eliminate buffer busy waits, but it may dramatically increase the time lost on db file sequential reads.
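You can see why the scattering happens by looking at the internal representation of a few consecutive values – dump() is standard SQL, though treat the commented output as indicative rather than guaranteed:

select
	rownum + 2e7		id,
	dump(rownum + 2e7, 16)	byte_rep
from	dual
connect by
	rownum <= 3
;

--	20000001	Typ=2 Len=5: c4,15,1,1,2
--	20000002	Typ=2 Len=5: c4,15,1,1,3
--	20000003	Typ=2 Len=5: c4,15,1,1,4

Consecutive values differ only in their final byte, so after byte-reversal they differ in their first byte – which is why they end up in leaf blocks scattered right across the index.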

Here’s a short list of interesting statistics from this test – this time running on 11.2.0.4 on a machine with SSDs – comparing the effects of reversing the index with those of not reversing the index. Normal index first:


Normal index
------------
CPU used by this session               83
DB time                                97
db block gets                      40,732
physical reads                         51
db block changes                   40,657
redo entries                       20,174
redo size                       5,091,436
undo change vector size         1,649,648

Repeat with reverse key index
-----------------------------
CPU used by this session              115
DB time                               121
db block gets                      40,504
physical reads                     10,006
db block changes                   40,295
redo entries                       19,973
redo size                       4,974,820
undo change vector size         1,639,232

Because of the SSDs there’s little difference in timing between the two sets of data and, in fact, all the other measures of work done are very similar except for the physical reads; the increase in reads is probably the cause of the extra CPU time, thanks to both the LRU manipulation and the interaction with the operating system.

If you want to check the effect of index reversal you can take advantage of the sys_op_lbid() function to sample a little of your data – in my case I’ve queried the last 10,000 rows (values) in the table:


select 
	/*+ 
		cursor_sharing_exact 
		dynamic_sampling(0) 
		no_monitoring 
		no_expand 
		index_ffs(t1,t1_pk) 
		noparallel_index(t1,t1_pk) 
	*/ 
	count (distinct sys_op_lbid( &m_ind_id ,'L',t1.rowid)) as leaf_blocks
from 
	t1
where 
	id between 2e7 + 1 and 2e7 + 1e4
;

The &m_ind_id substitution variable is the object_id of the index t1_pk.
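If you’re running the test from SQL*Plus you can capture the object_id automatically – a minimal sketch, using the index name from the script above:

column object_id new_value m_ind_id

select	object_id
from	user_objects
where	object_name = 'T1_PK'
;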

In my case, with an index of 22,300 leaf blocks, my 10,000 consecutive values were scattered over 9,923 leaf blocks. If I want access to “recent data” to be as efficient as possible I need to keep that many blocks of the index cached, compared with an (absolute) worst case of about 100 leaf blocks for my data if the index had not been reversed. When you reverse key an index you have to think about how much bigger you have to make your buffer cache to keep the performance constant.
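And if you are licensed for partitioning, here’s a hedged sketch of the global hash partitioning alternative I mentioned at the start – the partition count is illustrative, but do make it a power of two:

alter table t1 drop constraint t1_pk drop index;

create unique index t1_pk on t1(id)
global partition by hash (id) partitions 4
nologging;

alter table t1 add constraint t1_pk primary key(id) using index;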

June 15, 2015

Dynamic Sampling

Filed under: CBO,Oracle — Jonathan Lewis @ 9:41 pm BST Jun 15,2015

Following on from an OTN posting about dynamic sampling difficulties I had planned to write a blog post about the difference between “not sampling when hinted” and “ignoring the sample” – but Mohamed Houri got there before me.

It’s just worth highlighting a little detail that is often overlooked, though: there are two versions of the dynamic_sampling() hint, the cursor level and the table level, and the number of blocks sampled at a particular level depends on which version you are using. Level 4 at the cursor level, for example, will sample 64 blocks if and only if a certain condition is met, but at the table level it will sample 256 blocks unconditionally.

So try to be a little more specific when you say “I told the optimizer to use dynamic sampling …”; it’s either:

“I told the optimizer to use cursor level dynamic sampling at level X …”

or

“I told the optimizer to use table level dynamic sampling at level Y for table A and …”

Note – apart from the changes to dynamic sampling that allow for a level 11, there’s also a change introduced (I think) in 10g for the sample() clause applied to the table during sampling – it’s the addition of a seed() clause which ensures that when you repeat the same level you generate the same set of random rows.
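The seed() clause is visible in ordinary SQL too, so you can check the effect for yourself – a quick sketch (using the t1 table from the addendum below; re-running either statement with the same seed should return the same count):

select count(*) from t1 sample block (1) seed (42);
select count(*) from t1 sample       (1) seed (42);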

Addendum

Here’s a little code I wrote some time ago to check the effect of the two options at different levels. I started by creating a (nologging) table from the first 50,000 rows of all_objects, then doubled it up a few times to 400,000 rows in total, and ensured that there were no stats on the table. Then I executed in turn each variant of the following anonymous pl/sql block (note that I have execute privilege on the dbms_system package):


declare
	m_ct number;
begin
	execute immediate 'alter session set events ''10053 trace name context forever''';
	for i in 1..10 loop
		sys.dbms_system.ksdwrt(1,'=============');
		sys.dbms_system.ksdwrt(1,'Level ' || i);
		sys.dbms_system.ksdwrt(1,'=============');

		execute immediate 
			'select /*+ dynamic_sampling('    || i || ') */ count(*) from t1 ' ||
--			'select /*+ dynamic_sampling(t1 ' || i || ') */ count(*) from t1 ' ||
			'where owner = ''SYS'' and object_type = ''SYNONYM'''
			into m_ct;
	end loop;
end;
/

Obviously I could examine the resulting trace file to pick out bits of each optimisation, but for a quick check a simple grep for “sample block cnt” is almost all I need to do – with the following (slightly decorated) results from 11.2.0.4:


Table level
===========
Level 1
    max. sample block cnt. : 32
    sample block cnt. : 31
    max. sample block cnt. : 64
    sample block cnt. : 63
    max. sample block cnt. : 128
    sample block cnt. : 127
    max. sample block cnt. : 256
    sample block cnt. : 255
    max. sample block cnt. : 512
    sample block cnt. : 511
    max. sample block cnt. : 1024
    sample block cnt. : 1023
    max. sample block cnt. : 2048
    sample block cnt. : 2047
    max. sample block cnt. : 4096
    sample block cnt. : 4095
    max. sample block cnt. : 8192
    sample block cnt. : 8191
Level 10
    max. sample block cnt. : 4294967295
    sample block cnt. : 11565

Cursor level
============
No sampling at level 1
Level 2
    max. sample block cnt. : 64
    sample block cnt. : 63
    max. sample block cnt. : 64
    sample block cnt. : 63
    max. sample block cnt. : 64
    sample block cnt. : 63
    max. sample block cnt. : 64
    sample block cnt. : 63
    max. sample block cnt. : 128
    sample block cnt. : 127
    max. sample block cnt. : 256
    sample block cnt. : 255
    max. sample block cnt. : 1024
    sample block cnt. : 1023
    max. sample block cnt. : 4096
    sample block cnt. : 4095
Level 10
    max. sample block cnt. : 4294967295
    sample block cnt. : 11565


You’ll notice that the cursor level example didn’t do any sampling at level 1. Although the manual doesn’t quite make it clear, sampling will only occur if three conditions are met:

  • The table has no statistics
  • The table has no indexes
  • The table is involved in a join so that a sample could affect the join order and method

If only the first two conditions are met then the execution path will be a full tablescan whatever the sample looks like, and the number of rows returned has no further impact as far as the optimizer is concerned – hence the third requirement (which doesn’t get mentioned explicitly in the manuals). If you do have a query that meets all three requirements then the sample size is 32 (31) blocks.
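So a hedged sketch of a statement that could qualify at cursor level 1 looks like this – t2 is a second, invented, table, and both tables must have no stats and no indexes for the sample to happen:

select	/*+ dynamic_sampling(1) */
	count(*)
from	t1, t2
where	t1.object_id = t2.object_id
;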

 

CBO Series

Filed under: CBO,Oracle — Jonathan Lewis @ 9:19 pm BST Jun 15,2015

About a year ago I came across a couple of useful articles from Stefan Koehler, which is when I added his name to my blog roll. As an introduction for other readers I’ve compiled an index for a series of articles he wrote about the CBO viewed, largely, from the perspective of using Oracle to run SAP. Today I realised I hadn’t got around to publishing it, and there have been a couple of additions since I first started to compile the list.

 

June 2, 2015

Predicate Order

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 7:10 pm BST Jun 2,2015

A recent OTN post demonstrated a very important point about looking at execution plans – especially when you don’t use the right data types. The question was:

We’ve this query which throws invalid number

SELECT * FROM table A
WHERE A.corporate_id IN (59375,54387) AND TRUNC(created_dt) BETWEEN '19-DEC-14' AND '25-DEC-14';

However it works fine if we use not in instead of in

SELECT * FROM table A  
WHERE A.corporate_id  NOT IN (59375,54387) AND TRUNC(created_dt) BETWEEN '19-DEC-14' AND '25-DEC-14';

Please assist.

A follow-up post told us that corporate_id was a varchar() type – so the root cause of the ORA-01722: invalid number error is simply that you shouldn’t be mixing data types. Either the corporate_id should have been defined as numeric or the in-list should have been a list of varchar2() values. (And, of course, the character strings that look like dates should have been converted explicitly to date data types using either the to_date() function with a 4-digit year or the date ‘yyyy-mm-dd’ syntax; and using “created_dt >=  19th Dec and created_dt < 26th Dec” would have given the optimizer a chance to get a better cardinality estimate)
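Pulling those corrections together, a hedged sketch of what the statement might look like – I’ve invented table_a for the OP’s “table A”, and I’m assuming corporate_id really should be compared as a string:

SELECT	*
FROM	table_a A
WHERE	A.corporate_id IN ('59375', '54387')
AND	A.created_dt >= DATE '2014-12-19'
AND	A.created_dt <  DATE '2014-12-26'
;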

The answer to the slightly more specific problem – why does changing NOT IN to IN allow the query to run rather than crashing – is (probably) one that I first addressed in an article in Oracle Magazine just over eleven years ago: with CPU costing enabled Oracle can change the order in which it applies filter predicates to a table. It’s also a question that can easily be answered by my commonest response to many of the optimizer questions that appear on OTN – look at the execution plan.

In this example it’s a fairly safe bet that there’s a reasonably small volume of data (according to the optimizer’s estimate) where to_number(corporate_id) is one of the required values, and a much larger volume of data where it is not; with some intermediate volume of data where the created_dt falls in the required date range. With CPU costing enabled (optional in 9i, enabled by default in 10g) the optimizer would then do some arithmetic to calculate the most cost-effective order of applying the filter predicates, based on things like: the number of CPU cycles it takes to walk along a row to find a particular column; the number of CPU cycles it takes to convert a character column to a number and compare it with a number; the number of CPU cycles it takes to truncate a date column and compare it with a string; and the number of rows that would pass the numeric test (hence requiring the first-applied date test), compared with the number of rows that would survive the first-applied date test (hence requiring either the second date test or the numeric test to take place).

Here’s some code to demonstrate the point. It may require the system stats to be set to particular values to ensure that it is repeatable, but there’s probably some flexibility in the range, which is why I’ve called dbms_stats.set_system_stats() in the first few lines of the test (the calls themselves didn’t make it into the extract below).
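A minimal sketch of the sort of thing I mean – the parameter names are the standard dbms_stats ones, but the values are purely illustrative, not the ones from my test:

begin
	dbms_stats.set_system_stats('MBRC', 8);
	dbms_stats.set_system_stats('SREADTIM', 5);
	dbms_stats.set_system_stats('MREADTIM', 30);
	dbms_stats.set_system_stats('CPUSPEED', 1000);
end;
/

And then the script itself: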

drop table t1 purge;

create table t1 (
        v1      varchar2(10),
        d1      date
)
;

insert into t1 values(1,'01-Jan-2015');

insert into t1 values('x','02-Jan-2015');

insert into t1 values(3,'03-Jan-2015');
insert into t1 values(4,'04-Jan-2015');
insert into t1 values(5,'05-Jan-2015');
insert into t1 values(6,'06-Jan-2015');
insert into t1 values(7,'07-Jan-2015');
insert into t1 values(8,'08-Jan-2015');
insert into t1 values(9,'09-Jan-2015');
insert into t1 values(10,'10-Jan-2015');

execute dbms_stats.gather_table_stats(user,'t1');

First we create a table, load some data, and gather stats. You’ll notice that I’ve got a varchar2(10) column into which I’ve inserted numbers for all rows except one where it holds the value ‘x’. Now we just run some code to check the execution plans for a couple of queries.


explain plan for
select
        *
from    t1
where   v1 in (4,6)
and     d1 between '03-Jan-2015' and '09-Jan-2015'
;

select * from table(dbms_xplan.display);

explain plan for
select
        *
from    t1
where   v1 not in (4,6)
and     d1 between '03-Jan-2015' and '&1-Jan-2015'
;

select * from table(dbms_xplan.display);

As with the original question I’ve taken a query with an IN operator and changed it to NOT IN. The in-list is numeric even though the relevant column is varchar2(10). The first query crashes with ORA-01722: invalid number, the second one runs and returns the correct result. You’ll notice, of course, that the “bad” value for v1 is not in the set of rows where d1 is between 3rd and 9th Jan 2015. You’ll also notice that in my code I’ve used &1 for the end day in the query with the NOT IN clause so that I can re-run the query a few times to show the effects of changing the date range. Here are the execution plans – first with the IN clause:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     2 |    20 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     2 |    20 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter((TO_NUMBER("V1")=4 OR TO_NUMBER("V1")=6) AND
              "D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "D1"<=TO_DATE(' 2015-01-09 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

The optimizer arithmetic predicts 2 rows returned using a full tablescan – it doesn’t know the query is going to crash. Notice the predicate information, though. The first predicate says Oracle will attempt to convert v1 to a number and compare it with 4 and then (if the first test fails) with 6. The query will crash as soon as it hits a row with a non-numeric value for v1. In outline, the optimizer has decided that the numeric conversion and test is very cheap (on CPU) and only a few rows will survive to take the more expensive date comparison; whereas either of the (expensive) date comparisons would leave a lot of rows that would still have to be checked with the numeric test. It makes sense to do the numeric comparison first.

Here’s the plan for the query with the NOT IN clause when I set the date range to be 3rd Jan to 7th Jan.


Execution plan for NOT IN:  7th Jan
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     5 |    50 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     5 |    50 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("D1"<=TO_DATE(' 2015-01-07 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND TO_NUMBER("V1")<>4 AND TO_NUMBER("V1")<>6)

The plan is still a full tablescan – there are no indexes available – and the estimated number of rows has gone up to 5. The important thing, though, is the predicate section. In this case the optimizer has decided that the first things it will apply are the (relatively expensive) date predicates – “d1 <= 7th Jan” then “d1 >= 3rd Jan” – before worrying about the “NOT IN” numeric predicate. The optimizer has worked out that almost all the data will survive the NOT IN predicate, so it’s not efficient to apply it before using other predicates that eliminate more data.

By a stroke of luck my simple example happened to be a very good example. Here’s what happened when I set the end date to 8th Jan:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     6 |    60 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     6 |    60 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "D1"<=TO_DATE(' 2015-01-08 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND TO_NUMBER("V1")<>4 AND TO_NUMBER("V1")<>6)

The estimated rows has gone up to 6 – but the interesting thing, as before, is the predicate section: in the previous example Oracle did the tests in the order “upper bound”, “lower bound”, “numeric”; in this test it has done “lower bound”, “upper bound”, “numeric”.

And this is what I got when I ran the test with 9th Jan:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     7 |    70 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     7 |    70 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("D1">=TO_DATE(' 2015-01-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND TO_NUMBER("V1")<>4 AND TO_NUMBER("V1")<>6 AND
              "D1"<=TO_DATE(' 2015-01-09 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Again the estimated rows has gone up by one, but the ever-interesting predicate section now shows the evaluation order as: “lower bound”, “numeric”, “upper bound”.

There are only 6 possible orders for the predicate evaluation for the query with the NOT IN clause, and we’ve seen three of them. Three will fail, three will succeed – and I got all three of the successful orders. It wouldn’t take much fiddling around with the data (careful choice of duplicate values, variation in ranges of low and high values, and so on) to find a data set where small changes in the requested date range would allow me to reproduce all six variations. In fact when I changed my nls_date_format to “dd-mon-yy” and used a 2-digit year for testing I got two of the three possible failing predicate evaluation orders – “numeric”, “lower bound”, “upper bound” and “upper bound”, “numeric”, “lower bound” – without changing the data set. (To be able to get all six orders with a single data set I’d probably need a data set where the “bad” v1 value corresponded to a d1 value somewhere near the middle of the d1 range.)

The bottom line – use the correct data types; make sure your date literals are dates with a 4-digit year; check the predicate section to see if the optimizer did any implicit conversions with your predicates and what order it used them in. If you don’t do this you may find that a query can work perfectly for months, then crash because you finally got unlucky with the arithmetic.
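For the record, here’s the demonstration query re-written with consistent data types – string literals for the varchar2 column and date literals for the dates – so that no implicit conversion (and hence no ORA-01722) is possible:

select	*
from	t1
where	v1 in ('4', '6')
and	d1 between date '2015-01-03' and date '2015-01-09'
;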

May 21, 2015

Understanding SQL

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 6:12 pm BST May 21,2015

From time to time someone publishes a query on the OTN database forum and asks how to make it go faster, and you look at it and think it’s a nice example to explain a couple of principles because it’s short, easy to understand, obvious what sort of things might be wrong, and easy to fix. Then, after you’ve made a couple of suggestions and explained a couple of ideas the provider simply fades into the distance and doesn’t tell you any more about the query, or whether they’ve taken advantage of your advice, or found some other way to address the problem.

Such a query, with its execution plan, appeared a couple of weeks ago:

UPDATE FACETS_CUSTOM.MMR_DTL
SET
	CAPITN_PRCS_IND = 2,
	FIL_RUN_DT = Current_fil_run_dt,
	ROW_UPDT_DT = dta_cltn_end_dttm
WHERE
	CAPITN_PRCS_IND = 5
AND	HSPC_IND ='Y'
AND	EXISTS (
		SELECT	1
		FROM	FACETS_STAGE.CRME_FUND_DTL_STG STG_CRME
		WHERE	STG_CRME.MBR_CK = MMR_DTL.MBRSHP_CK
		AND	MMR_DTL.PMT_MSA_STRT_DT BETWEEN STG_CRME.ERN_FROM_DT AND STG_CRME.ERN_THRU_DT
		AND	STG_CRME.FUND_ID IN ('AAB1', '1AA2', '1BA2', 'AAB2', '1AA3', '1BA3', '1B80', '1A80')
	)
AND	EXISTS (
		SELECT	1
		FROM	FACETS_CUSTOM.FCTS_TMS_MBRID_XWLK XWLK
		WHERE	XWLK.MBR_CK = MMR_DTL.MBRSHP_CK
		AND	MMR_DTL.PMT_MSA_STRT_DT BETWEEN XWLK.HSPC_EVNT_EFF_DT AND XWLK.HSPC_EVNT_TERM_DT
	)
;

-------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT              |                       |     1 |   148 | 12431   (2)| 00:02:30 |
|   1 |  UPDATE                       | MMR_DTL               |       |       |            |          |
|   2 |   NESTED LOOPS SEMI           |                       |     1 |   148 | 12431   (2)| 00:02:30 |
|*  3 |    HASH JOIN RIGHT SEMI       |                       |    49 |  5488 | 12375   (2)| 00:02:29 |
|   4 |     TABLE ACCESS FULL         | FCTS_TMS_MBRID_XWLK   |  6494 | 64940 |    24   (0)| 00:00:01 |
|*  5 |     TABLE ACCESS FULL         | MMR_DTL               |   304K|    29M| 12347   (2)| 00:02:29 |
|*  6 |    TABLE ACCESS BY INDEX ROWID| CRME_FUND_DTL_STG     |     1 |    36 |     5   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN          | IE1_CRME_FUND_DTL_STG |     8 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("XWLK"."MBR_CK"="MMR_DTL"."MBRSHP_CK")
       filter("XWLK"."HSPC_EVNT_EFF_DT"<=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT") AND
              "XWLK"."HSPC_EVNT_TERM_DT">=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT"))
   5 - filter("CAPITN_PRCS_IND"=5 AND "HSPC_IND"='Y')
   6 - filter(("STG_CRME"."FUND_ID"='1A80' OR "STG_CRME"."FUND_ID"='1AA2' OR
              "STG_CRME"."FUND_ID"='1AA3' OR "STG_CRME"."FUND_ID"='1B80' OR "STG_CRME"."FUND_ID"='1BA2' OR
              "STG_CRME"."FUND_ID"='1BA3' OR "STG_CRME"."FUND_ID"='AAB1' OR "STG_CRME"."FUND_ID"='AAB2') AND
              "STG_CRME"."ERN_FROM_DT"<=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT") AND
              "STG_CRME"."ERN_THRU_DT">=INTERNAL_FUNCTION("MMR_DTL"."PMT_MSA_STRT_DT"))
   7 - access("STG_CRME"."MBR_CK"="MMR_DTL"."MBRSHP_CK")

The most informative bit of narrative that went with this query said:

“The table MMR_DTL doesnt have index on these columns CAPITN_PRCS_IND , HSPC_IND .. Since this is an update stmt which will update 85k records, In a dilema whether to index these columns or not .. And this table MMR_DTL is an ever growing table. Worried about the update performance. “

This was in response to an observation that there was a full tablescan on MMR_DTL at operation 5 despite the predicate “CAPITN_PRCS_IND”=5 AND “HSPC_IND”=’Y’. You’ll note that the predicted cardinality for that scan is 304K and the update statement is going to change CAPITN_PRCS_IND from the value 5 to the value 2 – so it’s not entirely unreasonable to be worried about the impact of creating an index that included the column capitn_prcs_ind.

What more can we say about this query, given the limited information? Quite a lot – unfortunately the owner of the query isn’t giving anything else away.

I’m going to leave this note unfinished to give people a little chance to think about the clues in the request, the questions they might ask, reasons why there might be a performance problem, and strategies they might investigate, then I’ll update the posting with a few ideas some time in the next 24 hours.

Update 1 – 24th May

There are so many ideas that spring up from a small amount of information that it’s very hard to write a concise and coherent description of what you’ve noticed, when and how far you pursued it, and how relevant the ideas might be to the problem in hand – especially when most of the thoughts require you to ask for more information. Unfortunately the time I had available to write this note has just disappeared, so I’m just going to have to complete it in rapidly written installments. The first bit is an outline of the immediate response I had to the initial presentation of the problem and the execution plan that went with it.

The only comment from the OP on this statement and plan was: “I couldn’t optimize this query for better performance and optimized cost.. Can some one guide me on this.”

We have no idea how many rows would be updated, how long it took, or how long the OP thinks it ought to take; it’s not until a subsequent post that we learn that the number of rows targeted for update is 85,000 – which tells us that the optimizer has run into some problems with its cardinality estimates. This suggests that IF there’s a serious performance problem then POSSIBLY there’s a better execution plan and we might get the optimizer to find it automatically if we could tell it how to adjust its cardinality estimates. It would be nice, however, to know where the query spent its time (i.e. can we re-run it with rowsource execution stats or monitoring enabled, and see the actual run-time work in the plan).

If it took a couple of minutes to update that 85,000 rows, I probably wouldn’t want to spend time making it go faster; if it took 2 hours, of which 1 hour 50 minutes was spent waiting for a transaction (row) lock then I’d want to look at why the update collision could happen and see if that problem could be avoided – it might then be the case that the last 10 minutes was spent rolling back and restarting an update that ought to have taken 2 minutes “in vacuo”. Underlying all this, I would want to be sure (as I’ve implicitly, and I think reasonably, assumed) that it’s an update that runs only occasionally, perhaps once per day or once per week.

In the absence of hard information – let’s resort to a few hypotheticals; looking at the plan itself (and knowing the target 85,000 rows) I am prepared to make a few guesses about the run-time activity.

  1. We build an in-memory hash table from the whole of FCTS_TMS_MBRID_XWLK, a step for which the optimizer ought to be able to give a reasonable cost and cardinality – assuming (as I will from now on) that the basic stats are reasonably accurate.
  2. We scan the (fairly large) MMR_DTL table, applying a couple of filters; again the optimizer ought to do a reasonable job of estimating the cost of such a tablescan, and we might expect a significant fraction of the time to be spent on multiblock (possibly direct path) reads of the table. The cardinality reported is 304,000 but we note there are two predicates and both are for columns which we might guess have a small number of distinct values – one of which we are changing. Perhaps there’s a bad cardinality error there and maybe a couple of single column histograms would help, but maybe column group stats with a histogram on the pair would be even better (see the sketch after this list). I also wonder when (if) HSPC_IND ever changes from Y to N, and contemplate the possibility of creating a function-based index that records ONLY the rows that match this predicate pair (see the note on indexing that will appear some time over the next week further down the page). It’s at this point that we might ask whether the number of rows returned by this scan should be very similar to the number of rows updated, or whether the scan identifies far too many rows and the two existence tests do a lot of work to eliminate the excess and, if the latter, which test should we apply first and how should we apply it.
  3. Having scanned the MMR_DTL we probe the in-memory hash table copy of FCTS_TMS_MBRID_XWLK for the first match, using an equality predicate (which will be the access predicate) and a range-based (filter) predicate which looks as if it is checking that some “start date” is between an “effective date” and a “termination date”. The estimated size of the result set is FAR too small at 49 rows when we know we have to have at least 85,000 rows survive this test; moreover, this tiny estimate comes from inputs of 6,500 and 304,000 rows being joined so we ought to wonder how such a tiny estimate could appear. A possible explanation is that the application has used some extreme dates to represent NULL values. If that’s the case then it’s possible that suitable histograms might help the optimizer recognise the extreme distribution; alternatively virtual columns that change the extreme values back to NULL and a predicate referencing the virtual columns may help.
  4. After estimating the cardinality of the intermediate result so badly, the optimizer decides that the second existence test can be performed as a semi-join using a nested loop. The thing to note here is that the optimizer “knows” that this is an expensive step – the cost of each table access operation is 5 (4 + 1) – but it’s a step that the optimizer “thinks” shouldn’t happen very frequently so the cost is considered acceptable. We know, however, that this step has to execute at least 85,000 times, so the optimizer’s prediction of visiting 4 blocks in the table to identify (on average) 8 rows and discard (on average) 7 of them looks nasty. Again we note that one of the predicates is range-based on a pair of dates – and in this case we might wonder whether or not most of the rows we discard are outside the date range, and whether we ought to consider (as a general point, and not just for this query) adding one or both of ERN_FROM_DT and ERN_THRU_DT to the IE1_CRME_FUND_DTL_STG index. It’s at this point in the query that we are most ignorant of time spent with the present data set and how that time will change in the future as the data in the MMR_DTL table grows – on one hand it’s possible that the rows for each MMR_DTL are widely scattered across the CRME_FUND_DTL_STG and this step could do a lot of random I/O; on the other hand the (assumed) time-dependent nature of the data may mean that the only MMR_DTL rows we look at are recently entered and the associated CRME_FUND_DTL_STG rows are therefore also recently entered and closely clustered – leading to a beneficial “self-caching” effect at the “high” end of the table as the query runs, which introduces an efficiency that the optimizer won’t notice. There is one numerical highlight in this join – we have a cost of 5 for each probe and 49 rows to test, so we might expect the incremental cost of the query to be around 250 (i.e. 49 * 5), but the difference between operations 3 and 2 is only 56 – suggesting that the optimizer does have some “self-caching” information, possibly based on there being a significant difference between the two tables in the number of distinct values of the join column. (See, for example: http://oracle-randolf.blogspot.co.uk/2012/05/nested-loop-join-costing.html )
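A footnote to point 2 (the sketch promised above): in 11g the column group could be created and given a histogram with something like the following – the 254 simply means “allow a frequency histogram”, and the exact bucket count is a detail to check against your data:

select
	dbms_stats.create_extended_stats(
		user, 'MMR_DTL', '(CAPITN_PRCS_IND, HSPC_IND)'
	)
from	dual
;

begin
	dbms_stats.gather_table_stats(
		user, 'MMR_DTL',
		method_opt => 'for columns (capitn_prcs_ind, hspc_ind) size 254'
	);
end;
/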

Update 2 – 25th May

Scalability is a question that should always be considered – and there’s a scalability threat in the information we have so far. The plan shows a full tablescan of the MMR_DTL table, and while tablescans are not necessarily a bad thing we’ve been told that “this table MMR_DTL is an ever growing table“. It’s possible that Oracle can be very quick and efficient when doing the existence tests on the rows it selects from the table – but it is inevitable that the tablescan will take longer to complete as time passes. Whether or not this is likely to matter is something we can’t decide from the information given: we don’t know what fraction of the total time is due to the tablescan, and we don’t know how much larger the table will grow each day (although, taking a wild guess, if we assume that the query runs once per day to manipulate recently arrived data the table would appear to be growing fairly rapidly).

Another scalability detail we ought to ask about is the volume of data that we expect to update each time we run this statement. As time passes do we expect to see the same number of rows waiting to be updated, or are we expecting the business (whatever that may be) to grow steadily each month with an increase of a few percent in the number of rows to be updated on each execution. Our coding strategy may vary depending on the answer to that question – we might, for example, try to pre-empt a future problem by introducing some partitioning now.

The final scalability issue is one I’ve raised already and comes from the CRME_FUND_DTL_STG table. According to the plan there are about 8 rows (the number of rowids reported by the index range scan in operation 7) in this table for each distinct value of MMR_DTL.MBRSHP_CK; if MMR_DTL is large and growing, is CRME_FUND_DTL_STG very large and growing at 8 times the speed – or worse, as time passes will there be more rows for each distinct value of MMR_DTL.MBRSHP_CK? Answers to these questions will help us decide whether we should use a hash join or a nested loop in the join to this table, and how to index the table to minimise random I/O.

Update 3 – 1st June

Since we know that the optimizer has come up with some very bad estimates, and given that the statement is short and simple, I’d be happy to pass the statement to the Tuning Advisor to see what suggestions it made. At the least it ought to come up with a suggestion for an SQL profile (i.e. a set of opt_estimate() hints) to address the extremely poor cardinality estimates; it’s also likely to make some suggestions about potentially helpful indexes. I can think of three basic warnings, though:

  • Make sure you are licensed to use the advisor
  • Don’t run the advisor right after you’ve just done the big update – wait until the next big batch of data is ready for update
  • Be cautious about the indexes – the indexing suggestions may point you at the problem, but you may do better by using function-based indexes and modified SQL

The most obvious “simple” index to create (assuming a lot of time goes on the tablescan) is one of:


create index dtl_cpi on mmr_dtl(capitn_prcs_ind) compress;
create index dtl_cpi on mmr_dtl(capitn_prcs_ind, hspc_ind) compress;

If almost all the rows with capitn_prcs_ind = 5 also had hspc_ind = ‘Y’ then I’d probably choose the former; if a large number of rows (actually index rowids) could be eliminated before visiting the table I’d choose the latter. I would also make sure I had a frequency – or Top-N frequency in 12c – histogram on capitn_prcs_ind, and may even “fake” it to ensure that an automatic call to gather table stats didn’t do something that made the histogram a threat instead of a benefit. If I created the two-column index I would also create a frequency histogram on hspc_ind. The benefit of the histograms is based on my assumption that both columns have a very small number of values and a highly skewed distribution. I might go one step further and create a column group on the pair of columns (keeping an eye open for an “out of range” anomaly) if there was a strong degree of correlation between the two columns – in particular if there were a relatively small number of ‘Y’ rows and most of those also had capitn_prcs_ind = 5.
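A hedged sketch of the histogram gathering – again 254 just means “big enough to allow a frequency histogram” on what I’m assuming are low-cardinality columns:

begin
	dbms_stats.gather_table_stats(
		user, 'MMR_DTL',
		method_opt => 'for columns capitn_prcs_ind size 254 hspc_ind size 254'
	);
end;
/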

A simple index, though, might get used for other statements where it was not appropriate; there’s also the extra overhead of maintenance as the capitn_prcs_ind value changes from 5 to 2 – to update the index we delete the old entry and insert the new entry. Just to make matters a little worse, when an index has a large number of rows for the same key value it’s fairly common to find that its space utilisation averages about 50% per leaf block. On top of everything else, if 2 is a “final state” value for most of the rows in this very large table then a very large fraction of the index we’re building is probably a total waste of space. So we might look at maximising efficiency and minimising space (and undo and redo wastage) with a function-based index. Again, one or two columns as appropriate.


create index dtl_cpi_1 on mmr_dtl(case when capitn_prcs_ind = 5 then 0 end) compress;

create index dtl_cpi_2 on mmr_dtl(case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then 0 end) compress;

execute dbms_stats.gather_table_stats(user, 'mmr_dtl',method_opt=>'for all hidden columns size 1')

Again my choice of index might be affected by the extra benefit of having two columns to eliminate the maximum number of rows from the table, but since the index would be very small either way I’d probably go for the two-column index. Note that once you’ve created a function-based index you have to create stats on the underlying hidden column, which I’ve done a little lazily in this case by simply gathering “for all hidden columns”. I don’t need a histogram since there’s only a single value in the index. Apart from the size benefit, the other efficiency benefit is that when I update the table I only have to worry about deleting a row from the index, I don’t have to insert a replacement row – getting rid of the third undo record and redo vector is likely to save about 30% on the index maintenance background costs.

I now have to change the original SQL to match the index definition, of course, so my driving where clause would become one of:

WHERE
	(case when CAPITN_PRCS_IND = 5 then 0 end) = 0
AND	HSPC_IND ='Y'
AND	EXISTS (...

WHERE
	(case when CAPITN_PRCS_IND = 5 AND HSPC_IND ='Y' then 0 end ) = 0
AND	EXISTS (...

Update 4 – 2nd June 2015

At this point we really do need better information to make sensible decisions about what to do next; however, as I indicated earlier on, the hash semi join to the small table FCTS_TMS_MBRID_XWLK does look like a sensible choice: it’s small, so it’s a good target for a build table, and small tables joined to big tables tend to be used to eliminate data rapidly (at least, in the case of existence tests).

The final big question is what to do about the table CRME_FUND_DTL_STG which looks as if it might be quite big (several rows for each MMR_DTL row) and therefore be contributing a lot of random I/O. We don’t really have many options here – and we’ve seen what they are in a previous post about “NOT EXISTS” subqueries. We could try to make the nested loop semi-join more efficient by adding columns to the index so that we can do more filtering in the index; we could consider forcing Oracle to stick with a filter subquery and see if we can introduce some indexing that could make the subquery run a very small number of times, or we could consider the impact of switching to a Hash semi-join (possibly switch the build and probe data sets to make it a “hash join right semi”).

A full tablescan on the CRME_FUND_DTL_STG to do the hash join might be rather expensive, though, so we might look at the significance of the predicate on STG_CRME.FUND_ID which lists 8 different literal values. Perhaps there’s some scope for creating an index on that column if those fund_ids represent a small fraction of the total data; maybe even consider list-partitioning the table (not necessarily one fund per partition) on the FUND_ID.
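If those eight fund_ids really do identify a small fraction of the data, the minimal first step might be nothing more sophisticated than the following (a sketch, of course – the selectivity needs checking first):

create index stg_fund_idx on facets_stage.crme_fund_dtl_stg(fund_id) compress;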

The only other significant “pre-processing” predicate on the CRME_FUND_DTL_STG table is the one that requires the MMR_DTL.PMT_MSA_STRT_DT to be between STG_CRME.ERN_FROM_DT and STG_CRME.ERN_THRU_DT; so we might look at how many rows in the table could be eliminated by finding the maximum and minimum values for MMR_DTL.PMT_MSA_STRT_DT and eliminating any rows where STG_CRME.ERN_FROM_DT was greater than the maximum PMT_MSA_STRT_DT or STG_CRME.ERN_THRU_DT was less than the minimum PMT_MSA_STRT_DT.

We might play around with CTEs (subquery factoring / common table expressions) to make this happen – I’m not going to work out the exact SQL, but it might start something like:


with starting_view as (
       select *
       from   mmr_dtl
       where  ...
       and exists (
                select null
                from   FACETS_CUSTOM.FCTS_TMS_MBRID_XWLK XWLK
                where  ...
       )
),
minmax_view as (
        select  /*+ materialize */
                min(PMT_MSA_STRT_DT) min_date, 
                max(PMT_MSA_STRT_DT) max_date
        from    starting_view
),
crme_fund_view as (
        select
                stg.*
        from
                minmax_view,
                CRME_FUND_DTL_STG stg
        where   ...
)
select
        ...
from
        starting_view   sv,
        crme_fund_view  cv
where
        ...

An alternative strategy to get the minimum and maximum dates more efficiently might be to include them in the function-based index, and then include TWO inline views, or maybe scalar subqueries in the where clause, that depend on the min/max range scan to find the values that can be used to eliminate as much data from CRME_FUND_DTL_STG as early as possible.
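That’s another idea I’m only going to sketch – the index below is the two-column variant from Update 3 extended with the date column, the name is invented, and the commented fragment is merely indicative of where the scalar subqueries would go:

create index dtl_cpi_3 on mmr_dtl(
	case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then 0 end,
	case when capitn_prcs_ind = 5 and hspc_ind = 'Y' then pmt_msa_strt_dt end
) compress 1;

-- and then, in the join to CRME_FUND_DTL_STG, something like:
--
--	and	stg_crme.ern_thru_dt >= (
--			select	min(case when capitn_prcs_ind = 5 and hspc_ind = 'Y'
--				         then pmt_msa_strt_dt end)
--			from	mmr_dtl
--		)
--	and	stg_crme.ern_from_dt <= (
--			select	max(case when capitn_prcs_ind = 5 and hspc_ind = 'Y'
--				         then pmt_msa_strt_dt end)
--			from	mmr_dtl
--		)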

Footnote

I think it took me about 10 minutes looking at the original posting to come up with a few thoughts that might be appropriate – but it’s taken five sessions spread over nearly two weeks to write those ideas down. The same difference can appear between theoretical strategy and final action – everything I’ve said revolves around the two simple ideas of “how much data” (absolute and relative) and “how hard do we work to collect it”, so it’s easy to come up with ideas; but it may take time to learn what the data looks like and to find a SAFE way of making it possible to collect it efficiently.

May 18, 2015

Migrated rows

Filed under: Infrastructure,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 6:43 pm BST May 18,2015

I received an email recently describing a problem with a query which was running a full tablescan but: “almost all the waits are on ‘db file sequential read’ and the disk read is 10 times the table blocks”.  Some further information supplied was that the tablespace was using ASSM and 16KB block size; the table had 272 columns (ouch!) and the Oracle version was 11.2.0.4.

In his researches he had read my article on wide rows, and had picked out of one of the comments the line: “the very bad thing about chained rows and direct reads that is that finding the rest of row by ‘db file sequential read’ is never cached”, but he wasn’t sure that this was the problem he was seeing so, very sensibly, he had re-run the query with extended tracing available, and dumped (and formatted/edited) a couple of blocks from the table.

He then sent me the trace file and block dump. Generally this is a mistake – especially when the trace file is several megabytes – but he had prepared the ground well and had linked it back to one of my blog notes, and I thought there might be an opportunity for publishing a few more comments, so I took a look. Here’s a carefully edited subset of the block dump – showing all the pertinent information:


Start dump data blocks tsn: 99 file#:100 minblk 2513181 maxblk 2513181

Block header dump:  0x1926591d
 Object id on Block? Y
 seg/obj: 0x1652a7  csc: 0x53.891880b8  itc: 12  flg: E  typ: 1 - DATA
     brn: 1  bdba: 0x1965b70c ver: 0x01 opc: 0
     inc: 84  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0010.01d.0000dff9  0x2b442286.3469.09  C---    0  scn 0x0053.891880b1
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x04   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x05   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x06   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x07   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x08   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x09   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x0a   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x0b   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
0x0c   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.00000000
bdba: 0x1926591d

data_block_dump,data header at 0x11083f154
===============
tsiz: 0x1ea8
hsiz: 0x26
pbl: 0x11083f154
     76543210
flag=--------
ntab=1
nrow=10
frre=-1
fsbo=0x26
fseo=0x4c5
avsp=0x49f
tosp=0x49f
0xe:pti[0]	nrow=10	offs=0
0x12:pri[0]	offs=0x1c15
0x14:pri[1]	offs=0x197b
0x16:pri[2]	offs=0x16e1
0x18:pri[3]	offs=0x1448
0x1a:pri[4]	offs=0x11b8
0x1c:pri[5]	offs=0xf1f
0x1e:pri[6]	offs=0xc85
0x20:pri[7]	offs=0x9ec
0x22:pri[8]	offs=0x752
0x24:pri[9]	offs=0x4c5
block_row_dump:
tab 0, row 0, @0x1c15
tl: 659 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 1, @0x197b
tl: 666 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 2, @0x16e1
tl: 666 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 3, @0x1448
tl: 665 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 4, @0x11b8
tl: 656 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 5, @0xf1f
tl: 665 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 7, @0x9ec
tl: 665 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 8, @0x752
tl: 666 fb: -----L-- lb: 0x0  cc: 255
tab 0, row 9, @0x4c5
tl: 653 fb: -----L-- lb: 0x0  cc: 255

In the ITL you can see 10 entries with the flag set to “C—-” (committed) with no XID or SCN – that’s consistent with 10 rows migrating into the block in a single transaction. In the row directory you can see the block holds 10 rows, and in the body of the block you can see the header for each of those 10 rows with 255 columns (presumably the 2nd section of each row of 272 columns), and the flag bytes set to “—–L–” (the Last piece of a chained – as opposed to simply migrated – row).

So the block dump is consistent with the possibility of a direct path read of a block somewhere (holding 10 row head pieces) being followed shortly afterwards by 10 separate reads of this block. Can we find further corroboration in the trace file? The block dump was for block 0x1926591d = 421943581 decimal.
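(If you don’t want to do the hex arithmetic by hand, dbms_utility will decode a data block address for you – a generic technique, not part of the original note:)

select
	dbms_utility.data_block_address_file(
		to_number('1926591d','xxxxxxxx')
	)	file#,
	dbms_utility.data_block_address_block(
		to_number('1926591d','xxxxxxxx')
	)	block#
from	dual
;

--	FILE#	 BLOCK#
--	  100	2513181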


PARSE #4573135368:c=29,e=48,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4116693033,tim=79008343283418
EXEC #4573135368:c=53,e=93,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4116693033,tim=79008343283607
WAIT #4573135368: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=15477650 tim=79008343283636
WAIT #4573135368: nam='Disk file operations I/O' ela= 38 FileOperation=2 fileno=101 filetype=2 obj#=1462951 tim=79008343283973
WAIT #4573135368: nam='direct path read' ela= 8991 file number=100 first dba=947580 block cnt=13 obj#=1462951 tim=79008343293041

WAIT #4573135368: nam='db file sequential read' ela= 4934 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298032
WAIT #4573135368: nam='db file sequential read' ela= 155 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298216
WAIT #4573135368: nam='db file sequential read' ela= 127 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298378
WAIT #4573135368: nam='db file sequential read' ela= 125 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298526
WAIT #4573135368: nam='db file sequential read' ela= 128 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298677
WAIT #4573135368: nam='db file sequential read' ela= 123 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298826
WAIT #4573135368: nam='db file sequential read' ela= 134 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343298983
WAIT #4573135368: nam='db file sequential read' ela= 129 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343299135
WAIT #4573135368: nam='db file sequential read' ela= 180 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343299341
WAIT #4573135368: nam='db file sequential read' ela= 133 file#=100 block#=2513181 blocks=1 obj#=1462951 tim=79008343299497

WAIT #4573135368: nam='db file sequential read' ela= 11039 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343310565
WAIT #4573135368: nam='db file sequential read' ela= 133 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343310730
WAIT #4573135368: nam='db file sequential read' ela= 139 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343310895
WAIT #4573135368: nam='db file sequential read' ela= 124 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311045
WAIT #4573135368: nam='db file sequential read' ela= 122 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311190
WAIT #4573135368: nam='db file sequential read' ela= 127 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311339
WAIT #4573135368: nam='db file sequential read' ela= 125 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311490
WAIT #4573135368: nam='db file sequential read' ela= 134 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311647
WAIT #4573135368: nam='db file sequential read' ela= 128 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311797
WAIT #4573135368: nam='db file sequential read' ela= 124 file#=100 block#=2513245 blocks=1 obj#=1462951 tim=79008343311947

WAIT #4573135368: nam='db file sequential read' ela= 10592 file#=100 block#=2513309 blocks=1 obj#=1462951 tim=79008343322564
WAIT #4573135368: nam='db file sequential read' ela= 142 file#=100 block#=2513309 blocks=1 obj#=1462951 tim=79008343322740
WAIT #4573135368: nam='db file sequential read' ela= 126 file#=100 block#=2513309 blocks=1 obj#=1462951 tim=79008343322889

There are a couple of interesting details in this trace file.

First we note (as the OP said) that there are very few direct path reads – but direct path reads can be asynchronous, with several running concurrently, which means we may report the wait for one direct path read while the data from several others returns with no recorded wait time. (You’ll have to take my word for the sparseness of direct path reads – there were 5 of them, reading a total of 58 blocks from the object, compared to 50,000 db file sequential reads.)

Then you can see that although each block that was subject to “db file sequential read” is reported 10 times, the first read is much slower than the subsequent ones – a fairly good indication that the later reads are coming from a cache somewhere. (The 50,000 reads consisted of roughly 5,300 blocks being read 10 times, 1,400 blocks being read 9 times, 460 blocks being read 8 times, and a few blocks being read 7 or fewer times.)

You might also notice the “coincidental” jump of 64 blocks between the sets of 10 reads – this appears fairly frequently, and it’s the type of pattern you might expect to see when a serial process is allocating blocks for use in a clean ASSM tablespace after the extent sizes have become fairly large (possibly the 64MB size that eventually appears with system managed extent sizes). There’s a “pseudo-random” choice of block within extent, dictated by the process id, that spreads the work done by a single process steadily through the extent. Having filled 2513181, 2513245, 2513309 and so on for 16 steps, the trace file comes back to 2513182, 2513246, 2513310 and so on.

It’s interesting (and time-consuming) to check the patterns, but what we really need next – and don’t have – to check the theory is the set of 13 blocks dictated by the first direct path read:

WAIT #4573135368: nam='direct path read' ela= 8991 file number=100 first dba=947580 block cnt=13 obj#=1462951 tim=79008343293041

It’s likely that somewhere in the 13 blocks in the range 947580 onwards we would find the 10 row head pieces pointing to block 2513181; then the 10 row head pieces pointing to block 2513245, and so on – and I’d hope that we might see a pattern of many consecutive (or near-consecutive) rows in each originating block pointing to the same “next block”. In fact, with a few blocks in the early range, we might even get some idea of how the application was loading and updating data and be able to make some suggestions for changing the strategy to avoid row chaining.
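If we had had access to the system the check would have been easy – here’s a sketch of the dump command we might use, taking the file number 100 and block range from the wait line above:

alter system dump datafile 100 block min 947580 block max 947592;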

Footnote

The OP also had a follow-up question which was: “One question for the block dump is why there is no hrid in it since the row pieces are the second row pieces and the flag bit is ‘-----L--’?”  It would be nice to see the head rowid, of course – then we wouldn’t need to see the 947580-947592 range to see what had been happening to the data – but that’s not the way Oracle works, as I’ve pointed out above. Since the answer was in another posting of mine I simply emailed the relevant URL to the OP.

May 15, 2015

Quiz Night

Filed under: Oracle — Jonathan Lewis @ 8:34 pm BST May 15,2015

How many rows will you return from a single table query if you include the predicate


        rownum > 2

in the where clause?

Warning: this IS a catch question

To make it easier and avoid ambiguity, you may assume the table is the standard SCOTT.EMP table.

 

Part 2:

This posting was prompted by noticing a note that Dominic Brooks posted a few months ago.

Can you supply a workaround for the little oddity he’s described?

Answers

Part 1

I did say that the first part was a catch question. The semi-automatic answer that I would expect most people to give would be “none” – because putting a specific limit on a result is (probably) the reason most people use rownum, and therefore it’s easy to forget exactly what it’s really doing. The best reply (because it’s also a nice demonstration of an important feature of the run-time engine) came from comment 8 by Balazs Papp with the predicate “1 = 1 or rownum > 2″. A disjunct (OR) evaluates to TRUE if any of its components evaluates to TRUE; since 1 is always equal to 1 the check against rownum is irrelevant and the entire data set will be returned. (Conversely, thinking back to an old post of mine about NOT IN subqueries, a conjunct (AND) evaluates to TRUE if and only if every individual component evaluates to TRUE.)

It’s always instructive to look at execution plans (ideally pulled from memory, and including the rowsource execution stats) when you experiment. First the query with just the rownum predicate, then with the “1=1″ included.


select * from emp where rownum > 2;

no rows returned

--------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  COUNT              |      |      1 |        |      0 |00:00:00.01 |       2 |
|*  2 |   FILTER            |      |      1 |        |      0 |00:00:00.01 |       2 |
|   3 |    TABLE ACCESS FULL| EMP  |      1 |     14 |     14 |00:00:00.01 |       2 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM>2)

=======================================================================================

select * from emp where rownum > 2 or 1=1;

14 rows selected.

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |     14 |00:00:00.01 |       3 |
|   1 |  COUNT             |      |      1 |        |     14 |00:00:00.01 |       3 |
|   2 |   TABLE ACCESS FULL| EMP  |      1 |     14 |     14 |00:00:00.01 |       3 |
-------------------------------------------------------------------------------------

(This space really was blank – no predicates reported.)

Note how the first query had Oracle acquire every row (A-rows) and then discard it because the check “rownum > 2″ would be false for each row in turn.
Note how the entire predicate section of the second query has disappeared because “1=1″ is always true, so the filter is irrelevant. It is an interesting oddity, though, that the redundant COUNT operator still appears in the second plan – presumably an automatic response to the presence of the rownum predicate.

The query I had originally planned to use in my answer modelled a typing error I had seen in a production system that might cause a double take on first sight.  (This is using the standard SCOTT.EMP table again on 11.2.0.4):

select
        rowid, rownum, job
from
        emp 
where   job = 'CLERK' or rownum > 2
;

ROWID                  ROWNUM JOB
------------------ ---------- ---------
AAAvgVAAFAAAAiBAAA          1 CLERK
AAAvgVAAFAAAAiBAAK          2 CLERK
AAAvgVAAFAAAAiBAAL          3 CLERK
AAAvgVAAFAAAAiBAAM          4 ANALYST
AAAvgVAAFAAAAiBAAN          5 CLERK


select
        rowid, rownum, job
from
        emp
where   job = 'ANALYST' or rownum > 2
;

ROWID                  ROWNUM JOB
------------------ ---------- ---------
AAAvgVAAFAAAAiBAAH          1 ANALYST
AAAvgVAAFAAAAiBAAM          2 ANALYST
AAAvgVAAFAAAAiBAAN          3 CLERK

I’ve included the rowid for each row because (although it’s not an assumption you should generally make) it shows you the order the rows appeared in the block. As you can see, after two CLERKS (first query) anything is accepted; after two ANALYSTS (second query) anything is accepted.

Part 2

As Dominic points out in his comments below, the anomaly comes from a predicate “rownum <= nvl(:2, rownum)” which the optimizer applies to every row rather than recognising that when :2 is not null it could use its STOPKEY optimisation. This is a variant of the old problem of the optimizer finding cases where the only way it can get the correct answer in all circumstances is to do something inefficient – a problem I described some time ago in a note on “OR with subquery”. We recognise, of course, that in general we should either use the front end to choose one of two possible queries (depending on whether the bind variable is null or not) or rewrite the query as a UNION ALL – taking care that the code eliminates from the later query blocks any rows that have already appeared in the earlier query blocks, and being very careful about dealing with NULLs in the data. For example (setting the bind variable :n to the value 10, and with a table t1 that is a copy of the data from view all_objects, indexed on object_id):


select  *
from    t1
where   object_id <= 100
and     :n is not null
and     rownum <= :n
union all
select  *
from    t1
where   object_id <= 100
and     :n is null
;

--------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |       |      1 |        |     10 |00:00:00.01 |       5 |
|   1 |  UNION-ALL                     |       |      1 |        |     10 |00:00:00.01 |       5 |
|*  2 |   COUNT STOPKEY                |       |      1 |        |     10 |00:00:00.01 |       5 |
|*  3 |    FILTER                      |       |      1 |        |     10 |00:00:00.01 |       5 |
|   4 |     TABLE ACCESS BY INDEX ROWID| T1    |      1 |     37 |     10 |00:00:00.01 |       5 |
|*  5 |      INDEX RANGE SCAN          | T1_I1 |      1 |     37 |     10 |00:00:00.01 |       3 |
|*  6 |   FILTER                       |       |      1 |        |      0 |00:00:00.01 |       0 |
|   7 |    TABLE ACCESS BY INDEX ROWID | T1    |      0 |     37 |      0 |00:00:00.01 |       0 |
|*  8 |     INDEX RANGE SCAN           | T1_I1 |      0 |     37 |      0 |00:00:00.01 |       0 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM<=:N)
   3 - filter(:N IS NOT NULL)
   5 - access("OBJECT_ID"<=100)
   6 - filter(:N IS NULL)
   8 - access("OBJECT_ID"<=100)

When the bind variable :n is null the optimizer will execute only the second query block; but when :n is not null it will take the first query block, and take advantage of the COUNT STOPKEY optimisation. (I was a little surprised that operations 2 and 3 were in the order shown; I had expected them to be in the opposite order.)

In this example we can engineer the code so that precisely one of the two query blocks executes, so we don’t need to worry about the threat of an overlap that I raised in the note on “Subquery with OR” with its solution of a call to lnnvl().
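(A quick reminder – my own example, not extracted from that note: lnnvl() returns TRUE when its argument evaluates to FALSE or UNKNOWN, which is what makes it a convenient tool for eliminating overlaps between query blocks.)

select  *
from    dual
where   lnnvl(dummy = 'Y')      -- dummy = 'X': the condition is FALSE, lnnvl() is TRUE, the row is returned
;

select  *
from    dual
where   lnnvl(dummy = 'X')      -- the condition is TRUE, lnnvl() is FALSE: no rows selected
;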

 

May 13, 2015

Instance stats

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 7:31 pm BST May 13,2015


While reading a posting by Martin Bach on a new buffering option for 12c I was prompted to take a look at another of his posts on the instance activity stats, which reminded me that the class column on v$statname is a bit flag that we can dissect using the bitand() function to pick out the statistics that belong to multiple classes. I’ve got 2 or 3 little scripts that do this: one, for example, picks out all the statistics relating to RAC; another is just a cross-tab of the class values used and their breakdown by class.  Originally this latter script used the “diagonal” method of decode() then sum() – but when the 11g pivot() option appeared I rewrote it as an experiment in pivoting.
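As a trivial example of the idea – a one-line sketch rather than one of my actual scripts – the following picks out every statistic whose class includes the RAC bit (32, as you can see in the cross-tab below):

select  name, class
from    v$statname
where   bitand(class, 32) = 32
order by
        class, name
;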

This is the script as it now stands, with the output from 12.1.0.2:




select
        *
from    (
        select
                st.class,
                pwr.class_id,
                case bitand(st.class, pwr.expn)
                        when 0 then to_number(null)
                               else 1
                end     class_flag
        from
                v$statname      st,
                (select
                        level                   class_id,
                        power(2,level - 1)      expn
                from
                        dual
                connect by level <= 8
                )       pwr
        where
                bitand(class,pwr.expn) = pwr.expn
        )
pivot   (
                sum(class_flag)
        for     class_id in (
                        1 as EndUser,
                        2 as Redo,
                        3 as Enqueue,
                        4 as Cache,
                        5 as OS,
                        6 as RAC,
                        7 as SQL,
                        8 as Debug
                )
        )
order by
        class
;


     CLASS    ENDUSER       REDO    ENQUEUE      CACHE         OS        RAC        SQL      DEBUG
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
         1        130
         2                    68
         4                                9
         8                                         151
        16                                                     16
        32                                                                35
        33          3                                                      3
        34                     1                                           1
        40                                          53                    53
        64                                                                          130
        72                                          15                               15
       128                                                                                     565
       192                                                                            2          2

13 rows selected.

The titles given to the columns come from Martin’s blog, but the definitive set is in the Oracle documentation in the reference manual for v$statname. (I’ve changed the first class from “User” to “EndUser” because of a reserved word problem, and I abbreviated the “RAC” class for tidiness.) It’s interesting to note how many of the RAC statistics are also about the Cache layer.

May 12, 2015

Parallel Query

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 7:22 pm BST May 12,2015

According to the Oracle Database VLDB and Partitioning Guide (10g version and 11g version):

A SELECT statement can be executed in parallel only if the following conditions are satisfied:

  • The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX) or the schema objects referred to in the query have a PARALLEL declaration associated with them.
  • At least one table specified in the query requires one of the following:
    • A full table scan
    • An index range scan spanning multiple partitions
  • No scalar subqueries are in the SELECT list.

Note, particularly, that last restriction. I was looking at a query recently that seemed to be breaking this rule so, after examining the 10053 trace file for a while, I decided that I would construct a simplified model of the client’s query to demonstrate how the manuals can tell you the truth while being completely deceptive or (conversely) be wrong while still giving a perfectly correct impression. So here’s a query, with execution plan, from 11.2.0.4:

select
        /*+ parallel(t1 2) */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 15700 |  1340   (3)| 00:00:07 |        |      |            |
|   1 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   2 |   TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  3 |    INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   4 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   5 |   TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  6 |    INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   7 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   8 |   PX SEND QC (RANDOM)        | :TQ10003 |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | P->S | QC (RAND)  |
|*  9 |    HASH JOIN                 |          |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 10 |     HASH JOIN                |          |   100 | 14700 |  1317   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 11 |      HASH JOIN               |          |   100 | 13300 |  1294   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  12 |       BUFFER SORT            |          |       |       |            |          |  Q1,03 | PCWC |            |
|  13 |        PX RECEIVE            |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |         PX SEND BROADCAST    | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 15 |          TABLE ACCESS FULL   | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  16 |       PX BLOCK ITERATOR      |          |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWC |            |
|* 17 |        TABLE ACCESS FULL     | T1       |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  18 |      BUFFER SORT             |          |       |       |            |          |  Q1,03 | PCWC |            |
|  19 |       PX RECEIVE             |          | 10000 |   136K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  20 |        PX SEND BROADCAST     | :TQ10001 | 10000 |   136K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  21 |         TABLE ACCESS FULL    | T2       | 10000 |   136K|    23   (5)| 00:00:01 |        |      |            |
|  22 |     BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|  23 |      PX RECEIVE              |          | 10000 |    97K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  24 |       PX SEND BROADCAST      | :TQ10002 | 10000 |    97K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  25 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    23   (5)| 00:00:01 |        |      |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=:B1)
   6 - access("N1"=:B1)
   9 - access("T3"."ID"="T1"."R3")
  10 - access("T2"."ID"="T1"."R2")
  11 - access("T1"."N1"="D1"."ID")
  15 - filter("D1"."N1"=1)
  17 - filter("T1"."N2"=10 AND "T1"."N3"=10)

Thanks to my hint the query has been given a parallel execution plan – and a check of v$pq_tqstat after running the query showed that it had run in parallel. Note, however, where the PX COORDINATOR and PX SEND QC operations appear – lines 7 and 8 – and above those lines we see the two scalar subqueries.

This means we’re running the basic select statement as a parallel query but the query co-ordinator has serialised on the scalar subqueries in the select list.  Is the manual “right but deceptive” or “wrong but giving the right impression”?  Serialising on (just) the scalar subqueries can have a huge impact on performance and effectively make the query behave like a serial query even though, technically, the statement has run as a parallel query.

You may recall that an example of this type of behaviour, and its side effects when the scalar subqueries executed independently as parallel queries, showed up some time ago. At the time I said I would follow up with a note about the change in behaviour in 12c; this seems to be an appropriate moment to show the 12c plan(s), first the default:


----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 19100 |  1364   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10005 |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED        |          |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  4 |     HASH JOIN OUTER          |          |   100 | 18100 |  1340   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  5 |      HASH JOIN               |          |   100 | 16400 |  1335   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  6 |       HASH JOIN OUTER        |          |   100 | 15000 |  1311   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  7 |        HASH JOIN             |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|   8 |         PX RECEIVE           |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,05 | PCWP |            |
|   9 |          PX SEND BROADCAST   | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |
|  10 |           PX SELECTOR        |          |       |       |            |          |  Q1,00 | SCWC |            |
|* 11 |            TABLE ACCESS FULL | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | SCWP |            |
|  12 |         PX BLOCK ITERATOR    |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWC |            |
|* 13 |          TABLE ACCESS FULL   | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|  14 |        PX RECEIVE            |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  15 |         PX SEND BROADCAST    | :TQ10001 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | S->P | BROADCAST  |
|  16 |          PX SELECTOR         |          |       |       |            |          |  Q1,01 | SCWC |            |
|  17 |           VIEW               | VW_SSQ_1 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  18 |            HASH GROUP BY     |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  19 |             TABLE ACCESS FULL| REF2     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,01 | SCWP |            |
|  20 |       PX RECEIVE             |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  21 |        PX SEND BROADCAST     | :TQ10002 | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | S->P | BROADCAST  |
|  22 |         PX SELECTOR          |          |       |       |            |          |  Q1,02 | SCWC |            |
|  23 |          TABLE ACCESS FULL   | T2       | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | SCWP |            |
|  24 |      PX RECEIVE              |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  25 |       PX SEND BROADCAST      | :TQ10003 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | S->P | BROADCAST  |
|  26 |        PX SELECTOR           |          |       |       |            |          |  Q1,03 | SCWC |            |
|  27 |         VIEW                 | VW_SSQ_2 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  28 |          HASH GROUP BY       |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  29 |           TABLE ACCESS FULL  | REF1     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,03 | SCWP |            |
|  30 |     PX RECEIVE               |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  31 |      PX SEND BROADCAST       | :TQ10004 | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | S->P | BROADCAST  |
|  32 |       PX SELECTOR            |          |       |       |            |          |  Q1,04 | SCWC |            |
|  33 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | SCWP |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T3"."ID"="T1"."R3")
   4 - access("ITEM_2"(+)="T2"."N21")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("ITEM_1"(+)="T1"."R1")
   7 - access("T1"."N1"="D1"."ID")
  11 - filter("D1"."N1"=1)
  13 - filter("T1"."N2"=10 AND "T1"."N3"=10)

The first thing to note is the location of the PX COORDINATOR and PX SEND QC operations – right at the top of the plan: there’s no serialisation at the query coordinator. Then we spot the views at operations 17 and 27 – VW_SSQ_1, VW_SSQ_2 (could SSQ be short for “scalar subquery”, perhaps?). The optimizer has unnested the scalar subqueries out of the select list into the join. When a scalar subquery in the select list returns no data its value is deemed to be NULL, so the joins (operations 4 and 6) have to be outer joins.

You’ll notice that there are a lot of PX SELECTOR operations – each feeding a PX SEND BROADCAST operation that reports its “IN-OUT” column as S->P (i.e. serial to parallel). Historically a serial to parallel operation started with the query coordinator doing the serial bit, but in 12c the optimizer can dictate that one of the PX slaves should take on that task (see Randolf Geist’s post here). Again my code to report v$pq_tqstat confirmed this behaviour in a way that we shall see shortly.

This type of unnesting is a feature of 12c and in some cases will be very effective. It is a cost-based decision, though, and the optimizer can make mistakes; fortunately we can control the feature. We could simply set optimizer_features_enable back to 11.2.0.4 (perhaps through the hint) and this would take us back to the original plan, but that isn’t the best option in this case. There is a hidden parameter _optimizer_unnest_scalar_sq enabling the feature, so we could, in principle, disable just this one feature by setting that parameter to false; a more appropriate strategy, though, would simply be to tell the optimizer that it should not unnest these particular subqueries. In my case I could put the /*+ no_unnest */ hint into both the subqueries, or use the qb_name() hint to give the two subquery blocks names and then use the /*+ no_unnest(@my_qb_name) */ format at the top of the main query.
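Here’s a sketch of how the hints might be placed – the query block names ssq1 and ssq2 are labels I’ve invented for the illustration:

select
        /*+ parallel(t1 2) no_unnest(@ssq1) no_unnest(@ssq2) */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select /*+ qb_name(ssq1) */ max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select /*+ qb_name(ssq2) */ max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

-- or, for the whole session (an undocumented parameter, so use with care):
-- alter session set "_optimizer_unnest_scalar_sq" = false;

Here’s the execution plan I get whether I use the hidden parameter or the /*+ no_unnest */ mechanism: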

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |       |       |  1554 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                 |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10003 |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    EXPRESSION EVALUATION        |          |       |       |            |          |  Q1,03 | PCWC |            |
|*  4 |     HASH JOIN                   |          |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  5 |      HASH JOIN                  |          |   100 | 14700 |  1330   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  6 |       HASH JOIN                 |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|   8 |         PX RECEIVE              |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|   9 |          PX SEND BROADCAST      | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL     | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  11 |        PX BLOCK ITERATOR        |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWC |            |
|* 12 |         TABLE ACCESS FULL       | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|  13 |       BUFFER SORT               |          |       |       |            |          |  Q1,03 | PCWC |            |
|  14 |        PX RECEIVE               |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  15 |         PX SEND BROADCAST       | :TQ10001 | 10000 |   136K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  16 |          TABLE ACCESS FULL      | T2       | 10000 |   136K|    24   (5)| 00:00:01 |        |      |            |
|  17 |      BUFFER SORT                |          |       |       |            |          |  Q1,03 | PCWC |            |
|  18 |       PX RECEIVE                |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  19 |        PX SEND BROADCAST        | :TQ10002 | 10000 |    97K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  20 |         TABLE ACCESS FULL       | T3       | 10000 |    97K|    24   (5)| 00:00:01 |        |      |            |
|  21 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  22 |      TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 23 |       INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|  24 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  25 |      TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 26 |       INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."ID"="T1"."R3")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("T1"."N1"="D1"."ID")
  10 - filter("D1"."N1"=1)
  12 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N2"=10 AND "T1"."N3"=10))
  23 - access("N1"=:B1)
  26 - access("N1"=:B1)

Note particularly that the PX SEND QC and PX COORDINATOR operations are operations 2 and 1, and we have a new operation – EXPRESSION EVALUATION – at operation 3. This has three child operations – the basic select starting at operation 4, and the two scalar subqueries starting at lines 21 and 24. We are operating the scalar subqueries as correlated subqueries, but we don’t leave all the work to the query coordinator – each slave runs its own subqueries before forwarding the final result to the coordinator. There is a little side effect that goes with this change – the “serial to parallel” operations are now, as they always used to be, driven by the query co-ordinator, and the PX SELECTOR operations have disappeared.

And finally

Just to finish off, let’s take a look at the results from v$pq_tqstat in 12.1.0.2. First from the default plan with the PX SELECTOR operations. Remember that this turned into a five-table join where two of the “tables” were non-correlated aggregate queries against the reference tables.
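The report is just a simple query against the view – a minimal sketch (the formatting is my own, and it has to run from the same session, immediately after the parallel query, because the view reports only the most recent parallel statement):

select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes, waits, timeouts, avg_latency
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
;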


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                   200       2428          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                   100       1238         59         27           0
                                             1 P001                   100       1238         41         24           0

                    1 Producer               1 P002                  2000      23830          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                  1000      11939         57         26           0
                                             1 P001                  1000      11939         41         24           0

                    2 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     339732          0          0           0
                      Consumer               1 P000                 10000     169890         49         22           0
                                             1 P001                 10000     169890         31         21           0

                    3 Producer               1 P002                     0         48          0          0           0
                                             1 P003                  2000      23830          0          0           0
                      Consumer               1 P000                  1000      11939         58         26           0
                                             1 P001                  1000      11939         38         23           0

                    4 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     239986          0          0           0
                      Consumer               1 P000                 10000     120017         50         22           0
                                             1 P001                 10000     120017         34         21           0

                    5 Producer               1 P000                     1        169          0          0           0
                                             1 P001                     1        169          1          0           0
                      Consumer               1 QC                       2        338          3          0           0

As you read down the table queues you can see that in the first five table queues (0 – 4) we seem to operate parallel to parallel, but only one of the two producers (p002 and p003) produces any data at each stage. A more traditional plan would show QC as the single producer in each of these stages.

Now with scalar subquery unnesting blocked – the plan with the three table join and EXPRESSION EVALUATION – we see the more traditional serial to parallel: the producer is QC in all three of the first table queues (the full scan and broadcast of tables driver, t2, and t3).

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 QC                     200       1726          0          0           0
                      Consumer               1 P000                   100       1614         28         15           0
                                             1 P001                   100       1614         34         13           0

                    1 Producer               1 QC                   20000     339732          0          0           0
                      Consumer               1 P000                 10000     169866         19         10           0
                                             1 P001                 10000     169866         25          8           0

                    2 Producer               1 QC                   20000     239986          0          0           0
                      Consumer               1 P000                 10000     119993         23         11           0
                                             1 P001                 10000     119993         31         11           0

                    3 Producer               1 P000                     1        155          1          0           0
                                             1 P001                     1        155          0          0           0
                      Consumer               1 QC                       2        310          3          1           0

It’s an interesting point that this last set of results is identical to the set produced in 11g – you can’t tell from v$pq_tqstat whether the parallel slaves or the query co-ordinator executed the subqueries – you have to look at the output from SQL trace (or similar) to see the individual Rowsource Execution Statistics for the slaves and coordinator to see which process actually ran the subqueries.

 

May 11, 2015

Efficiency

Filed under: Oracle,Performance — Jonathan Lewis @ 9:24 pm BST May 11,2015

Here’s a question to which I don’t know the answer, and for which I don’t think anyone is likely to need the answer; but it might be an entertaining little puzzle for the curious.

Assume you’re doing a full tablescan against a simple heap table of no more than 255 columns (and not using RAC, Exadata, In-memory, etc. etc. etc.), and the query is something like:


select  {columns 200 to 250}
from    t1
where   column_255 = {constant}
;

To test the predicate Oracle has to count its way along each row column by column to find column 255. Will it:

  1. transfer columns 200 to 250 to local memory as it goes, then check column_255 — which seems to be a waste of CPU if the predicate doesn’t evaluate to TRUE
  2. evaluate the predicate, then walk the row again to transfer columns 200 to 250 to local memory if the predicate evaluates to TRUE — which also seems to be a waste of CPU
  3. one or other of the above depending on which columns are needed, how long they are, and the selectivity of the predicate

How would you test your hypothesis?
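For what it’s worth, here’s the sort of harness I might start with – entirely my own sketch (the table name t255 and all the details are invented) – comparing “CPU used by this session” across tablescans where the predicate is almost never, or always, true:

declare
        v_sql   varchar2(32767) := 'create table t255 (col001 number';
begin
        -- generate the remaining 254 numeric columns
        for i in 2..255 loop
                v_sql := v_sql || ', col' || to_char(i,'fm000') || ' number';
        end loop;
        execute immediate v_sql || ')';
end;
/

-- col255 is always non-null, so every row physically holds 255 columns
insert into t255 (col001, col200, col250, col255)
select
        rownum, rownum, rownum, mod(rownum,2)
from
        dual
connect by
        level <= 1e5
;
commit;

-- capture this figure before and after each test query, and diff the values
select  st.value        cpu_csec
from    v$mystat st, v$statname sn
where   sn.statistic# = st.statistic#
and     sn.name = 'CPU used by this session'
;

select max(col200), max(col250) from t255 where col255 = 2;     -- predicate never true
select max(col200), max(col250) from t255 where col255 < 2;     -- predicate always true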

Parallel Execution

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 10:16 am BST May 11,2015

This is another little reference list I should have created some time ago. It covers a series of posts on interpreting parallel execution plans and understanding where the work happens.

I may add further links to this page in the future relating to other aspects of parallel execution.

 

May 8, 2015

Cost

Filed under: Uncategorized — Jonathan Lewis @ 8:21 am BST May 8,2015

I’ve just been checking “Cost Based Oracle – Fundamentals” (Apress 2005) to see what I said on a particular topic, and I couldn’t resist quoting the following from the opening page of Chapter 1:

One of the commonest questions about the CBO on the Internet is: “What does the cost represent?” This is usually followed by comments like: “According to explain plan the cost of doing a hash join for this query is seven million and the cost of a nested loop is forty-two – but the hash join completes in three seconds and the nested loop takes 14 hours.”

The answer is simple: the cost represents (and has always represented) the optimizer’s best estimate of the time it will take to execute the statement. But how can this be true when people can see oddities like the hash join / nested loop join example above? The answer can usually be found in that good old acronym GIGO: Garbage In, Garbage Out.

The CBO makes errors for six main reasons:

  • There are some inappropriate assumptions built into the cost model.
  • The relevant statistics about the data distribution are available, but misleading
  • The relevant statistics about the data distribution are not available
  • The performance characteristics of the hardware are not known
  • The current workload is not known
  • There are bugs in the code

Still true – although there are more features and cunning bits where inappropriate assumptions and bugs can appear.

 

 

April 29, 2015

Not Exists

Filed under: Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 8:21 pm BST Apr 29,2015

This whole thing about “not exists” subqueries can run and run. In the previous episode I walked through some ideas of how the following query might perform depending on the data, the indexes, and the transformation that the optimizer might apply:

select
        count(*)
from    t1 w1
where   not exists (
                select  1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);  

As another participant in the original OTN thread had suggested, however, it might be possible to find a completely different way of writing the query, avoiding the subquery approach altogether. In particular there are (probably) several ways that we could write an equivalent query where the table only appears once. In other words, if we restate the requirement we might be able to find a different SQL translation for that requirement.

Looking at the current SQL, it looks like the requirement is: “Count the number of rows in t1 that have values of X that only have one associated value of Y”.

Based on this requirement, the following SQL statements (supplied by two different people) look promising:


    WITH counts AS
       (SELECT x,y,count(*) xy_count
        FROM   t1
        GROUP BY x,y)
    SELECT SUM(x_count)
    FROM  (SELECT x, SUM(xy_count) x_count
           FROM   counts
           GROUP BY x
           HAVING COUNT(*) = 1);


SELECT SUM(COUNT(*))
  FROM t1
GROUP BY x HAVING COUNT(DISTINCT y)<=1

Logically they do seem to address the description of the problem – but there’s a critical difference between these statements and the original. The clue about the difference appears in the absence of any comparisons between columns in the new forms of the query – no t1.colX = t2.colX, no t1.colY != t2.colY – and this might give us an idea about how to test the code. Here’s some test data:


drop table t1 purge;

create table t1 (
        x       number(2,0),
        y       varchar2(10)
);

create index t1_i1 on t1(x,y);

--      Pick one of the three following pairs of rows

insert into t1(x,y) values(1,'a');
insert into t1(x,y) values(1,null);

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'b');

-- insert into t1(x,y) values(null,'a');
-- insert into t1(x,y) values(null,'a');

commit;

--      A pair to be skipped

insert into t1(x,y) values(2,'c');
insert into t1(x,y) values(2,'c');

--      A pair to be reported

insert into t1(x,y) values(3,'d');
insert into t1(x,y) values(3,'e');

commit;

execute dbms_stats.gather_table_stats(user,'t1')

Notice the NULLs – comparisons with NULL lead to rows disappearing, so might the new forms of the query get different results from the old?
The original query returns a count of 4 rows whichever pair we select from the top 6 inserts.

With the NULL in the Y column the new forms report 2 and 4 rows respectively – so only the second query looks viable.
With the NULLs in the X columns and differing Y columns the new forms report 2 and 2 rows respectively – so even the second query is broken.

However, if we add “or X is null” to the second query it reports 4 rows for both tests.
Finally, having added the “or x is null” predicate, we check that it returns the correct 4 rows for the final test pair – and it does.
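For the record, here’s how the amended second query might look (my reconstruction from the description above):

SELECT SUM(COUNT(*))
  FROM t1
GROUP BY x HAVING COUNT(DISTINCT y)<=1 OR x IS NULL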

It looks as if there is at least one solution to the problem that needs to access the table only once, though it then does two aggregates (hash group by in 11g). Depending on the data it’s quite likely that this single scan and double hash aggregation will be more efficient than any of the plans that do a scan and filter subquery or a scan and hash anti-join. On the other hand the difference in performance might be small, and the rewritten query is just a little harder to comprehend.

Footnote:

I can’t help thinking that the “real” requirement is probably as given in the textual restatement of the problem, and that the first rewrite of the query is probably the one that’s producing the “right” answers while the original query is probably producing the “wrong” answer.

April 26, 2015

Not Exists

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 7:17 pm BST Apr 26,2015

Another question on a seemingly simple “not exists” query has appeared on OTN just a few days after my last post about the construct. There are two little differences between the actual form of the two queries that make it worth repeating the analysis.

The first query was of the form:


select from big_table
where  not exists (select exact_matching_row from small_table);

while the new query is of the form:


select from big_table alias1
where not exists (select inexact_matching_row from big_table alias2)

In the absence of a complete explanation, we might guess that the intention of the first query is to: “check correctness of undeclared foreign key constraint” i.e. small_table is the parent table with unique values, and big_table is the child end with some data that may be invalid due to a recent bulk data load. (In my example for the previous posting I relaxed the uniqueness assumption in small_table to make the problem a little more expensive.)

Our guess for the second query ought to be different; we are using the same table twice, and we are checking for the non-existence of “imperfect matches”. This introduces two potential threats – first that the (possibly pseudo-)join between the two tables is between two large tables and therefore inherently likely to be expensive; second that we may be allowed to have very large numbers of “perfect matches” that will escalate the scale of the join quite dramatically. Here’s the actual query (second version) from the posting; this will make it easier to explain why the structure of the query introduces the second threat and requires us (as so often) to understand the data in order to optimise the query execution:


select  count(*)
from    ubl_stg.wk_sap_fat w1
where   not exists (
                select  1
                from    ubl_stg.wk_sap_fat w2
                where   w2.mes_mese_id =  w1.mes_mese_id
                and     w2.sistema     <> w1.sistema
        )
;

Note the schema name ubl_stg – doesn’t that hint at “staging tables” for a data load?
Note the column name mes_mese_id – an “id” column, but clearly not one that’s supposed to be unique, so possibly a foreign key column that allows repetitions.

The code is looking for, then counting, cases where for a given value of mes_mese_id there is only one corresponding value used for sistema. Since we don’t know the application we don’t know whether this count should be low or high relative to the number of rows in the table, nor do we know how many distinct values there might be of mes_mese_id – and these pieces of information are critical to identifying the best execution plan for the query.

The owner of this query told us that the table held 2 million rows which are “deleted and reloaded” every day, showed us that the (default) execution plan was a hash anti-join which took two hours to complete, told us that he (or she) didn’t think that an index would help because of the “not equal” predicate, and finally reported back that the problem was solved (no timing information supplied) by the addition of the /*+ no_unnest */ hint.

The question is – what has happened, and why is there such a difference in performance?

The first observation is that the 2 hours does seem an unreasonably long time for the query to run – and since it’s only a select statement it would be good to run it again and take a snapshot of the session statistics and session events (v$sesstat, v$session_event) for the session to see what the workload was and where the time was spent. Given the “deleted and reloaded” comment it’s possible that there may be some unexpected overhead due to some strange effects of read-consistency or delayed block cleanout, so we might also take a snapshot of tablespace or file I/O to check for lots of I/O on the undo tablespace (which might also show an I/O problem on the table’s tablespace or the user’s TEMP tablespace). The snapshots give us a very cheap, non-invasive option for getting some summary stats – but the tablespace/file stats are system-wide, of course, so may not tell us anything about our specific task, so we might even enable extended tracing (event 10046 level 8 / dbms_monitor with waits) to see in detail where we are losing time.
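As a starting point, here’s a minimal sketch of the sort of before/after capture I have in mind (normally I’d use a pair of scripts to take two snapshots and report the difference):

select
        sn.name, st.value
from
        v$mystat        st,
        v$statname      sn
where
        sn.statistic# = st.statistic#
and     st.value != 0
order by
        sn.name
;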

Since we don’t currently have the information we need to explain the two hours (it may have appeared by the time I post this note) it might be instructive to make some guesses about where the time could go in the hash anti-join, and why the /*+ no_unnest */ hint could make a difference.

Ignoring the possibility of strange undo/read-consistency/cleanout effects, the first possibility is simply that the hash join is large and turns into a multi-pass I/O thrash. The mes_mese_id column looks like it might be a number (id columns so often are) but the sistema column has the flavour of a reasonably large character column – so maybe our hash table has to be a couple of hundred megabytes – that could certainly be enough to spill to disc, though you’d have to have a really small PGA availability for it to turn into a multi-pass hash join.

Another possibility is that the pattern in the data makes the hash join burn up a huge amount of CPU – that should be easy to see on a re-run.  If there are relatively few distinct sets of values for (mes_mese_id, sistema) and there are very few cases where a mes_mese_id is associated with more than one sistema, then a large fraction of the hash table probes would have to follow a very long chain of matches to the very end, and that would take a large amount of CPU.

Pursue that “long chain” hypothesis to a slight extreme – what if there’s one mes_mese_id that appears in 250,000 of the 2M rows, and the sistema value is the same for every one of those quarter million rows, which would require Oracle to walk a chain of 250,000 elements 250,000 times, for a total of 62.5 billion pointers to follow and comparisons to make – how much CPU might that take? Worse still, since having the same mes_mese_id (the hashing is only on the equality predicate) means the quarter of a million rows would have to go into the same hash bucket, we might end up doing a multipass operation because that one bucket was very much larger than anything Oracle had anticipated when it did its internal partitioning calculations for the hash table. (There are some related notes in this article and in chapter 12 of Cost Based Oracle – Fundamentals.)

Why would the /*+ no_unnest */ hint address these performance problems? My first guess would be that the OP may have created the index on (mes_mese_id, sistema), in which case a full scan of that index (or even a fast full scan if the index were newly created, or even a tablescan if the data had been loaded in the right order) followed by the filter subquery being driven by an index range scan would result in a relatively efficient subquery being executed once per distinct value of (mes_mese_id, sistema) rather than once per row. This could be much more efficient than a really badly skewed data set doing the hash anti-join. In fact, even in the absence of the index, if the number of distinct combinations was really quite small, that many tablescans – if the table were cached, whether in Oracle or the filesystem or the SAN cache – might still be a lot faster than the hash anti-join. (We were told that the problem was solved – but not told how much time constituted a viable solution.)
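If that first guess is right, the supporting index would be something like the following (the index name is my own invention):

create index wk_sap_fat_i1 on ubl_stg.wk_sap_fat(mes_mese_id, sistema);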

Models

Hand-waving and talk is fine – but a few modelled results might make it easier to comprehend, so here’s a way of generating a few versions of data sets and testing:


drop table t1 purge;

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                          id,
/*
        --
        --      Random generation nearly guaranteeing no random
        --      duplicates of (x,y) but quite a lot of mismatches
        --      Count 735,715 in 3.36 seconds.
        --
        trunc(dbms_random.value(0,2e6))                 x,
        lpad(trunc(dbms_random.value(0,1e6)),64,'0')    y
*/
/*
        --
        --      One specific pair repeated many times with no mismatch the rest
        --      as previously generated above. Times with different repeat counts:
        --       10,000            14 seconds
        --       20,000            52 seconds
        --       40,000           202 seconds
        --      CPU time quadruples as the count doubles (twice the number of
        --      probes walking twice the number of steps in the hash chain)
        --       80,000 =>        800 seconds
        --      160,000 =>      3,200 seconds
        --      240,000 =>      ca. 2 hours.
*/
        case
                when rownum <= 40000
                        then 2e6
                        else trunc(dbms_random.value(0,2e6))
        end                                             x,
        case
                when rownum <= 40000
                        then
                                lpad(0,64,'0')
                        else
                                lpad(
                                        trunc((rownum - 1)/1000) +
                                                case mod(rownum-1,1000) when 0 then 0 else 0 end,
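                                                -- NB: this case expression always adds zero as written; presumably it's a hook for injecting occasional mismatched y values by editing the 'else 0'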
                                        64,'0'
                                )
        end                                             y
/*
        --
        --      2,000 distinct values repeated 1,000 times each
        --      Query result: 2,000,000 in 235 seconds.
        --
        trunc((rownum - 1)/1000)                x,
        lpad(
                trunc((rownum - 1)/1000) +
                        case mod(rownum-1,1000) when 0 then 0 else 0 end,
                64,'0'
        )                                       y
*/
from
        generator       v1,
        generator       v2
where
        rownum <= 2e6
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        count(*)
from    t1 w1
where   not exists (
                select  1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);  

As you can see I’ve changed the table and column names – this keeps them in line with the original SQL statement presented on OTN before we got the graphic display of a similar statement and plan. The SQL to create the data includes three variants and five sets of results (plus three conjectures), running 11.2.0.4. These results appeared when the optimizer took the hash anti-join execution plan, which also spilled to disc as a one-pass workarea operation that first extended to about 200MB of PGA then dropped back to about 8MB.

To summarise the results recorded in the SQL – if we use the term “bad data” to describe rows where more than one Y value appears for a given X value then:

  1. With a large number of distinct pairs and a lot of bad data: the anti-join is pretty fast at 3.36 seconds.
  2. With no bad data, a small number of distinct pairs (2,000) and lots of rows per pair (1,000): the anti-join takes 235 CPU seconds.
  3. As for #1 above, but with one extreme “good” pair that appears a large number of times: CPU time is proportional to the square of the number of duplicates of that pair.

I didn’t actually test beyond 40,000 duplicates for the last case, but you can see the double/quadruple pattern very clearly, and a simple extrapolation from the measured 202 seconds at 40,000 duplicates gives 202 x (240,000/40,000)^2 = 202 x 36, i.e. roughly 7,270 seconds – so the CPU time would have hit 2 hours at around 240,000 identical copies of one (x,y) pair.

/*+ no_unnest */

So what happens if you try using the /*+ no_unnest */ hint? The point here is that the more repetitive the data, the smaller the number of times you may have to run the subquery – the FILTER operation caches subquery results for input values it has already seen – and if you can get the driving data ordered you can guarantee the smallest possible number of runs of the subquery. I haven’t worked through all the possibilities, but to give you the flavour of what can happen: when I added the /*+ no_unnest */ hint to the query (and ensured that the table would be cached rather than read using direct path reads into the PGA) the test with 1,000 copies of 2,000 pairs took 181 seconds to do 2,001 tablescans (compared with 235 seconds to do two tablescans and a hash anti-join) with the following execution path:


----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    69 |  8256K  (4)| 11:28:04 |
|   1 |  SORT AGGREGATE     |      |     1 |    69 |            |          |
|*  2 |   FILTER            |      |       |       |            |          |
|   3 |    TABLE ACCESS FULL| T1   |  2000K|   131M|  2877   (4)| 00:00:15 |
|*  4 |    TABLE ACCESS FULL| T1   |     2 |   138 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST NO_INDEX ("W2") */ 0
              FROM "T1" "W2" WHERE "W2"."X"=:B1 AND "W2"."Y"<>:B2))
   4 - filter("W2"."X"=:B1 AND "W2"."Y"<>:B2)
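For reference, here’s what the hinted query looked like (reconstructed from the hints echoed in the Predicate Information above):

select
        count(*)
from    t1 w1
where   not exists (
                select  /*+ no_unnest no_index(w2) */
                        1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);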

More significantly, when I created the optimum index the execution time dropped to 0.9 seconds. Here’s the create index statement and the subsequent plan. The extreme benefit appears because the data was effectively loaded in sorted order; if this had not been the case I would have forced an index full scan for the driving data set (with a “not null” predicate or constraint to make it possible for the optimizer to use the index to drive the query – there’s a sketch of that variation after the plan).


create index t1_i1 on t1(x,y) compress pctfree 0;

-----------------------------------------------------------------------------
| Id  | Operation           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |       |     1 |    69 |  6008K  (1)| 08:20:45 |
|   1 |  SORT AGGREGATE     |       |     1 |    69 |            |          |
|*  2 |   FILTER            |       |       |       |            |          |
|   3 |    TABLE ACCESS FULL| T1    |  2000K|   131M|  2877   (4)| 00:00:15 |
|*  4 |    INDEX RANGE SCAN | T1_I1 |     2 |   138 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T1" "W2"
              WHERE "W2"."X"=:B1 AND "W2"."Y"<>:B2))
   4 - access("W2"."X"=:B1)
       filter("W2"."Y"<>:B1)
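To make that earlier aside about unsorted data concrete, here’s the sort of thing I had in mind (a sketch only – the key details are the index() hint on the driving table and the “x is not null” predicate that lets the optimizer use the index to drive the query):

select  /*+ index(w1 t1_i1) */
        count(*)
from    t1 w1
where   x is not null
and     not exists (
                select  /*+ no_unnest */
                        1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);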

Similarly, when I had the index in place and the 40,000 repetitions of a “good pair”, Oracle took a total of 9 seconds even though it had to run the subquery 1,960,000 times for the non-repetitive data (and once for the repetitive pair). I have to say I was a little surprised at how rapidly it managed to get through those 2M subquery executions – but then I keep forgetting how ridiculously overpowered my new laptop is.

With these figures in mind you can appreciate that if the OP had lots of pairs with tens of thousands of repetitions then, even without creating the index on the table, the query time might drop from 2 hours for the hash anti-join to “a few minutes” for the filter subquery – a time that was good enough to look like “problem solved”.

Summary

If you have a few “long hash chains” in the build table – i.e. rows from the “first” table in the join that hash to the same value – then the amount of work Oracle has to do to check a single row from the probe (“second”) table that matches on the hash value can become significant. If a large number of rows from the probe table hit a long hash chain then the CPU time for the whole join can climb dramatically and you may want to force Oracle away from the hash join.

If the long chains are the result of a skewed distribution – a small number of values appearing very frequently in a table where a large number of other distinct values each appear infrequently – then the optimizer may not notice the threat and may choose the hash plan when there is a much less resource-intensive alternative available.

Footnote:

There was another interesting observation I made while doing the experiments, relating to whether the chains are due to hash collisions or require exact matches in the data – but I’ve spent an hour on designing and running tests, and nearly 4 hours writing up the results so far. I need to do a few more tests to work out whether I’m seeing a very clever optimisation or a lucky coincidence in a certain scenario – so I’m going to save that for another day.

April 23, 2015

Golden Oldies

Filed under: bitmaps,Indexing,Oracle — Jonathan Lewis @ 8:45 am BST Apr 23,2015

I’ve just been motivated to resurrect three articles I wrote for DBAZine about 12 years ago on the topic of bitmap indexes. All three are Word 97 documents which I posted on my old website in September 2003. Despite their age they’re still surprisingly good.

Update: 26th April 2015

My reply to comment #2 below prompted me to look at what I said about bitmap indexes in Practical Oracle 8i (published 15 years ago), and I found this gem:

An interesting feature of bitmap indexes is that it is rather hard to predict how large the index segment will be. The size of a B-tree index is based very closely on the number of rows and the typical size of the entries in the index column. The size of a bitmap index is dictated by a fairly small number of bit-strings which may have been compressed to some degree depending upon the number of consecutive 1’s and 0’s.

To pick an extreme example, imagine a table of one million rows that has one column that may contain one of eight values, ‘A’ to ‘H’ say, which has been generated in one of the two following extreme patterns:

  • All the rows for a given value appear together, so scanning down the table we get 125,000 rows with ‘A’ followed by 125,000 rows of ‘B’ and so on.
  • The rows cycle through the values in turn, so scanning down the table we get ‘A’,’B’,…,’H’ repeated 125,000 times.

What will the bitmap indexes look like in the two cases?

For the first example, the basic map for the ‘A’ value will be 125,000 one-bits, followed by 875,000 zero-bits – which will be trimmed off. Splitting the 125,000 bits into bytes (15,625 bytes) and adding the necessary overhead of about 12% we get an entry for the ‘A’ rows of about 18K. A similar argument applies for each of the values ‘B’ to ‘H’, so we get a total index size of around 8 x 18K – giving 144K.

For the second example, the basic map for the ‘A’ value will be a one followed by 7 zeros, repeated 125,000 times. There is no chance of compression here, so the ‘A’ entry will start at 125,000 bytes. Adding the overhead, this goes up to 140K, and repeating the argument for the values ‘B’ to ‘H’ we get a total index of 1.12 MB.

This wild variation in size looks like a threat, but to put this into perspective, a standard B-tree index on this column would run to about 12MB irrespective of the pattern of the data. It would probably take about ten times as long to build as well.

As we can see, the size of a bitmap index can be affected dramatically by the packing of the column it depends upon as well as the number of different possible values the column can hold and the number of rows in the table. The compression that is applied before the index is stored, and the amazing variation in the resulting index does mean that the number of different values allowed in the column can be much larger than you might first expect. In fact it is often better to think of bitmap indexes in terms of how many occurrences of each value there are, rather than in terms of how many different values exist. Viewing the issue from this direction, a bitmap is often better than a B-tree when each value occurs more than a few hundred times in the table (but see the note below following the description of bitmap index entries).
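Footnote to the quote: it takes only a couple of minutes to model the two extremes if you want to check the arithmetic on a modern version of Oracle. Here’s a sketch (my code, not the book’s – the object names are my own, and the exact sizes will vary with version and block size):

drop table t_clustered purge;
drop table t_scattered purge;

create table t_clustered nologging as
select  chr(65 + trunc((rownum - 1)/125000))    v       -- 125,000 'A's, then 125,000 'B's, and so on
from    dual
connect by level <= 1e6
;

create table t_scattered nologging as
select  chr(65 + mod(rownum - 1, 8))            v       -- 'A' to 'H' cycling 125,000 times
from    dual
connect by level <= 1e6
;

create bitmap index t_clustered_bi on t_clustered(v);
create bitmap index t_scattered_bi on t_scattered(v);

select  segment_name, bytes/1024 kb
from    user_segments
where   segment_name in ('T_CLUSTERED_BI', 'T_SCATTERED_BI')
;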

