Oracle Scratchpad

March 5, 2014

12c pq_replicate

Filed under: 12c,Exadata,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 5:39 pm BST Mar 5,2014

Another day, another airport lounge – another quick note: one of the changes that appeared in 12c was a tweak to the “broadcast” distribution option of parallel queries. I mentioned this in a footnote to a longer article a couple of months ago; this note simply expands on that brief comment with an example. We’ll start with a simple two-table hash join – which I’ll first construct and demonstrate in 11.2.0.4:


create table t1
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	rownum				n1,
	lpad(rownum,6,'0')		small_vc,
	lpad(rownum,200,'0')		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1000
;

create table t2
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	1 + mod(rownum,10000)			n1,
	lpad(1 + mod(rownum,10000),6,'0')	small_vc,
	lpad(rownum,500,'0')			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 20000
;

-- collect stats, no histograms.

select
  	/*+
  		leading(t1 t2)
 		parallel(t1 2)
 		parallel(t2 2)
 		use_hash(t2)
 	*/
 	t1.padding,
 	t2.padding
from 	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;

-------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001 |  1000 |   707K|   135 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN            |          |  1000 |   707K|   135 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |  1000 |   207K|     4 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST  | :TQ10000 |  1000 |   207K|     4 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR |          |  1000 |   207K|     4 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR   |          | 20000 |     9M|   131 |  Q1,01 | PCWC |            |
|   9 |      TABLE ACCESS FULL  | T2       | 20000 |     9M|   131 |  Q1,01 | PCWP |            |
-------------------------------------------------------------------------------------------------

In this plan slave set 2 scans table t1 in parallel and broadcasts the result set to slave set 1 (lines 5 – 7). The significance of the broadcast option is that each slave in slave set 2 sends all the rows it has read to every slave in slave set 1. For a fairly large table with a high degree of parallelism this could be a lot of inter-process communication; the total number of rows passing through the PX message pool is “DOP x number of rows filtered from t1”.

After a slave in slave set 1 has received the whole of the t1 result set it builds an in-memory hash table and starts scanning rowid ranges (PX BLOCK ITERATOR) from table t2, probing the in-memory hash table to effect the join (lines 3, 4, 8, 9). Since each slave has a copy of the whole result set from t1 it can scan any chunk of t2 and handle the contents locally. Moreover, because slave set 1 isn’t reading its second input from a virtual table it is able to write its output immediately to the virtual table (:TQ10001) that feeds the query coordinator with the result (lines 1, 2) – we don’t have to do a “hash join buffered” operation and buffer the entire second input before starting to execute the join.

So how does 12c change things? With the same starting data and query, here’s the execution plan:

-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR       |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  1000 |   707K|   135 |  Q1,00 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN          |          |  1000 |   707K|   135 |  Q1,00 | PCWP |            |
|   4 |     TABLE ACCESS FULL | T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   5 |     PX BLOCK ITERATOR |          | 20000 |     9M|   131 |  Q1,00 | PCWC |            |
|   6 |      TABLE ACCESS FULL| T2       | 20000 |     9M|   131 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------

Notice, in particular, that we only have one virtual table (or table queue :TQ10000) rather than two – and that’s from a parallel query slave set to the query co-ordinator, parallel to serial; the query only uses one set of parallel query slaves. Until you run the query with rowsource execution statistics enabled and look at the output from v$pq_tqstat it’s not going to be immediately obvious what has happened, but we should see that somehow Oracle is no longer broadcasting the first table even though it’s still doing something in parallel with both tables.
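
For reference, this is the sort of query I mean – just a sketch: v$pq_tqstat is only populated for your own session, so you have to run it from the same session immediately after executing the parallel query:

select
	dfo_number, tq_id, server_type, instance, process, num_rows, bytes
from
	v$pq_tqstat
order by
	dfo_number, tq_id, server_type desc, instance, process
;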

The run-time statistics confirm that we’ve only used one set of slaves, and each slave in the slave set has scanned the whole of table t1. This means each slave can build the full hash table and then go on to read rowid ranges from table t2. We’ve managed to get the benefit of broadcasting t1 (every slave has the whole of t1 so we don’t have to scan and distribute the big table t2 through the PX message pool) but we haven’t had to clone it multiple times through the PX message pool.

Clearly there’s a trade-off here that Oracle Corp. has decided is worth considering. I’m guessing it’s biased towards Exadata where you might run queries with a very high degree of parallelism. In that case the overhead of task switching as large numbers of messages are passed around may (and this is pure supposition) be greater than the added cost of loading the table into the buffer cache (of each instance) and having each slave scan it from there. (Reminder – 11g introduced two “opposite” changes to tablescans: “serial direct reads” and “parallel in-memory scans”.)

There’s one little oddity in this replication feature – a pair of hints, pq_replicate and no_pq_replicate, to control the effect if you think the optimizer is making the wrong choice. I would have guessed that in my example the hint would read: /*+ pq_replicate(t1) */ as it’s table t1 that is read by every single slave. Strangely, though, this is what the outline section of the execution plan showed:


  /*+
      BEGIN_OUTLINE_DATA
      PQ_REPLICATE(@"SEL$1" "T2"@"SEL$1")
      PQ_DISTRIBUTE(@"SEL$1" "T2"@"SEL$1" BROADCAST NONE)
      USE_HASH(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      OUTLINE_LEAF(@"SEL$1")
      ALL_ROWS
      OPT_PARAM('_optimizer_cost_model' 'io')
      DB_VERSION('12.1.0.1')
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Notice how the hint specifies table t2, not table t1 !
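
A consequence of this is that if you wanted to disable the replication for a query like mine and get back the 11g broadcast behaviour, it’s (apparently) t2 you would have to name in the hint. Something like the following ought to do it – a sketch based on the outline above rather than a test I’ve shown here:

select
	/*+
		leading(t1 t2)
		parallel(t1 2)
		parallel(t2 2)
		use_hash(t2)
		no_pq_replicate(t2)
	*/
	t1.padding,
	t2.padding
from	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;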

Footnote

Here’s a little anomaly, and a generic warning about “optimizer_features_enable”: I found that if I used the hint /*+ optimizer_features_enable('11.2.0.4') */ in 12c I could still get the pq_replicate() hint to work. Unfortunately there are a few places where the hint (or parameter) isn’t guaranteed to take the optimizer code backwards the full 100%.

March 3, 2014

Flashback Fail ?

Filed under: Bugs,Oracle,Troubleshooting — Jonathan Lewis @ 4:19 pm BST Mar 3,2014

Sitting in an airport, waiting for a plane, I decided to read a note (pdf) about Flashback data archive written by Beat Ramseier from Trivadis.  I’d got about three quarters of the way through it when I paused for thought and figured out that on the typical database implementation something nasty is going to happen after approximately 3 years and 9 months.  Can you guess why ?

It’s all about smon_scn_time – which normally records one row every five minutes (created by smon), recycling continuously through roughly five days of history – typically giving you about 1,440 rows in the table. The table is in a cluster, and the cluster key is the instance (thread) number. Clearly this was originally a clever idea from someone who realised that a cluster key of thread number would be beneficial if you had a RAC system with multiple instances – each instance gets its own blocks and the data for any one instance is as well clustered as possible.

The trouble is, when you enable flashback data archive smon no longer sticks to its usual cycle, it just keeps adding rows. Now on my 8KB block tablespace I see 6 rows per block in the table/cluster – which means I get through 48 blocks per day, 17,520 blocks per year, and in 3 years and 9 months I’ll get to roughly 65,700 blocks – and that’s the problem. An index entry in a cluster index points to a chain of cluster blocks, and the last two bytes of the “rowid” in the index entry identify which block within the chain the cluster key scan should start at – and two bytes means you can only have 65,536 blocks for a single cluster key.

I don’t know what’s going to happen when smon tries to insert a row into the 65,535th (-ish) block for the current thread – but it ought to raise an Oracle error, and then you’ll probably have to take emergency action to make sure that the flashback mechanisms carry on running.

Although oraus.msg indicates that it’s an error message about hash clusters, it’s possible that the first sign of trouble will be Oracle error: “ORA-02475 maximum cluster chain block count of %s has been exceeded”. If you’re using a 16KB block size then you’ve got roughly 7.5 years, and 32KB block sizes give you about 15 years (not that that’s a good argument for selecting larger block sizes, of course.)
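
If you want to check how quickly your own system is working through the available blocks, a couple of simple queries will give you an idea – a sketch only: you need select privileges on the SYS-owned table and dba_segments, and the segment names may vary with version:

-- rows and distinct blocks currently used by smon_scn_time
select
	count(*)	row_count,
	count(distinct dbms_rowid.rowid_block_number(rowid))	block_count
from
	sys.smon_scn_time
;

-- size of the related segments
select
	segment_name, segment_type, blocks
from
	dba_segments
where
	segment_name like 'SMON_SCN%'
;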

Footnote:

Searching MoS for related topics (smon_scn_time flashback) I found doc ID: 1100993.1 from which we could possibly infer that the 1,440 rows was a fixed limit in 10g but that the number of rows allowed in smon_scn_time could increase in 11g if you enable automatic undo management. I also found a couple of bugs relating to index or cluster corruption – fixed by 11.2.0.4, though.

March 2, 2014

Auto Sample Size

Filed under: Function based indexes,Indexing,Infrastructure,IOT,LOBs,Oracle,Statistics — Jonathan Lewis @ 6:38 pm BST Mar 2,2014

In the past I have enthused mightily about the benefits of the approximate NDV mechanism and the benefit of using auto_sample_size to collect statistics in 11g; however, as so often happens with Oracle features, there’s a down-side or boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table increased dramatically after the addition of an index – where the index had this definition:


create bitmap index i_s_rmp_eval_csc_msg_actions on
    s_rmp_evaluation_csc_message (
        decode(instr(xml_message_text,' '),0,0,1)
    )
;

As you might guess from the column name, this is an index based on an XML column, which is stored as a CLOB.

In a similar vein, I showed you a few days ago an old example I had of indexing a CLOB column with a call to dbms_lob.getlength(). Both index examples suffer from the same problem – to support the index Oracle creates a hidden (virtual) column on the table that can be used to hold statistics about the values of the function; actual calculated values for the function call are stored in the index but not on the table itself – but it’s important that the optimizer has the statistics about the non-existent column values.

So what happens when Oracle collects table statistics? If you’ve enabled the approximate NDV feature Oracle does a 100% sample, which means it has to call the function for every single row in the table. You will appreciate that the decode(instr()) function on the LOB column is going to read every single LOB in turn from the table – it’s not surprising that the time taken to calculate stats on the table jumped from a few minutes to a couple of hours. What did surprise me was that my call to dbms_lob.getlength() also seemed to read every LOB in my example rather than reading the “LOB Locator” data that’s stored in the row – one day I’ll take a look into why that happened.

Looking at these examples it’s probably safe to conclude that if you really need to index some very small piece of “flag” information from a LOB it’s probably best to store it as a real column on the table – perhaps populated through a trigger so you don’t have to trust every single piece of front-end code to keep it up to date. (It would be quite nice if Oracle gave us the option for a “derived” column – i.e. one that could be defined in the same sort of way as a virtual column, with the difference that it should be stored in the table.)
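
Just to illustrate the sort of thing I mean, here’s a sketch (with hypothetical trigger and column names, based on the OTN example above, and untested) of replacing the function-based bitmap index with a real column maintained by a trigger:

alter table s_rmp_evaluation_csc_message add (msg_has_blank number);

create or replace trigger s_rmp_eval_csc_msg_briu
before insert or update
on s_rmp_evaluation_csc_message
for each row
begin
	-- same expression as the original index definition
	:new.msg_has_blank := decode(instr(:new.xml_message_text,' '),0,0,1);
end;
/

create bitmap index i_s_rmp_eval_csc_msg_actions on
	s_rmp_evaluation_csc_message(msg_has_blank)
;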

So virtual columns based on LOBs can create a performance problem for the approximate NDV mechanism;  but the story doesn’t stop there because there’s another “less commonly used” feature of Oracle that introduces a different threat – with no workaround – it’s the index organized table (IOT). Here’s a basic example:

create table iot1 (
        id1	number(7,0),
	id2	number(7,0),
	v1	varchar2(10),
	v2	varchar2(10),
	padding	varchar2(500),
        constraint iot1_pk primary key(id1, id2)
)
organization index
including id2
overflow
;

insert into iot1
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
        mod(rownum,20)                  id1,
        trunc(rownum,100)               id2,
        to_char(mod(rownum,20))         v1,
        to_char(trunc(rownum,100))      v2,
        rpad('x',500,'x')               padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5
;

commit;

alter system flush buffer_cache;

alter session set events '10046 trace name context forever';

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'IOT1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

alter session set events '10046 trace name context off';

You’ll notice I’ve created the table then inserted the data – if I did a “create table as select” Oracle would have sorted the data before inserting it, and that would have helped to hide the problem I’m trying to demonstrate. As it is my overflow segment is very badly ordered relative to the “top” (i.e. index) segment – in fact I can see after I’ve collected stats on the table that the clustering_factor on the index is 100,000 – an exact match for the rows in the table.

Running 11.2.0.4, with a 1MB uniform extent, freelist management, and 8KB block size the index segment held 279 leaf blocks, the overflow segment (reported in view user_tables as SYS_IOT_OVER_81594) held 7,144 data blocks.
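
For reference, the quick check behind those figures was along these lines – a sketch:

select
	index_name, leaf_blocks, clustering_factor
from
	user_indexes
where
	table_name = 'IOT1'
;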

So what interesting things do we find in a 10046 trace file after gathering stats – here are the key details from the tkprof results:

SQL ID: 7ak95sy9m1s4f Plan Hash: 1508788224

select /*+  full(t)    no_parallel(t) no_parallel_index(t) dbms_stats
  cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
  no_substrb_pad  */to_char(count("ID1")),to_char(substrb(dump(min("ID1"),16,
  0,32),1,120)),to_char(substrb(dump(max("ID1"),16,0,32),1,120)),
  to_char(count("ID2")),to_char(substrb(dump(min("ID2"),16,0,32),1,120)),
  to_char(substrb(dump(max("ID2"),16,0,32),1,120)),to_char(count("V1")),
  to_char(substrb(dump(min("V1"),16,0,32),1,120)),
  to_char(substrb(dump(max("V1"),16,0,32),1,120)),to_char(count("V2")),
  to_char(substrb(dump(min("V2"),16,0,32),1,120)),
  to_char(substrb(dump(max("V2"),16,0,32),1,120)),to_char(count("PADDING")),
  to_char(substrb(dump(min("PADDING"),16,0,32),1,120)),
  to_char(substrb(dump(max("PADDING"),16,0,32),1,120))
from
 "TEST_USER"."IOT1" t  /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,
  NIL,NIL*/

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.37       0.37       7423     107705          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.37       0.37       7423     107705          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=107705 pr=7423 pw=0 time=377008 us)
    100000     100000     100000   APPROXIMATE NDV AGGREGATE (cr=107705 pr=7423 pw=0 time=426437 us cost=10 size=23944 card=82)
    100000     100000     100000    INDEX FAST FULL SCAN IOT1_PK (cr=107705 pr=7423 pw=0 time=298380 us cost=10 size=23944 card=82)(object id 85913)

********************************************************************************

SQL ID: 1ca2ug8s3mm5z Plan Hash: 2571749554

select /*+  no_parallel_index(t, "IOT1_PK")  dbms_stats cursor_sharing_exact
  use_weak_name_resl dynamic_sampling(0) no_monitoring no_substrb_pad
  no_expand index(t,"IOT1_PK") */ count(*) as nrw,count(distinct
  sys_op_lbid(85913,'L',t.rowid)) as nlb,null as ndk,
  sys_op_countchg(sys_op_lbid(85913,'O',"V1"),1) as clf
from
 "TEST_USER"."IOT1" t where "ID1" is not null or "ID2" is not null

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.16       0.16          0     100280          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.16       0.16          0     100280          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT GROUP BY (cr=100280 pr=0 pw=0 time=162739 us)
    100000     100000     100000   INDEX FULL SCAN IOT1_PK (cr=100280 pr=0 pw=0 time=164597 us cost=6 size=5900000 card=100000)(object id 85913)

The first query collects table and column stats, and we can see that the approximate NDV method has been used because of the trailing text: /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/. In this statement the hint /*+ full(t) */ has been interpreted to mean an index fast full scan, which is what we see in the execution plan. Although there are only 279 blocks in the index and 7,144 blocks in the overflow we’ve done a little over 100,000 buffer visits because for every index entry in the IOT top we’ve done a “fetch by rowid” into the overflow segment (the session stats records these as “table fetch continued row”). Luckily I had a small table so all those visits were buffer gets; on a very large table it’s quite possible that a significant fraction of those buffer gets will turn into single block physical reads.

Not only have we done one buffer visit per row to allow us to calculate the approximate NDV for the table columns, we’ve done the same all over again so that we can calculate the clustering_factor of the index. This is a little surprising since the “rowid” for an item in the overflow section is stored in the index segment but (as you can see in the second query in the tkprof output) Oracle has used column v1 (the first in the overflow segment) in the call to the sys_op_countchg() function where the equivalent call for an ordinary index would use t.rowid so, presumably, the code HAS to access the overflow segment. The really strange thing about this is that the same SQL statement has a call to sys_op_lbid() which uses the (not supposed to exist in IOTs) rowid – so it looks as if it ought to be possible for sys_op_countchg() to do the same.

So – big warning on upgrading to 11g: if you’ve got IOTs with overflows and you switch to auto_sample_size and enable approximate NDV then the time taken to gather stats on those IOTs may (depending to a large extent on the data clustering) take much longer than it used to.
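
For reference, the combination I mean is set up something like this in 11g – a sketch: check the preference name and its default in your own version before relying on it:

begin
	dbms_stats.set_global_prefs('APPROXIMATE_NDV', 'TRUE');
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 => 'IOT1',
		estimate_percent => dbms_stats.auto_sample_size
	);
end;
/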

February 28, 2014

Empty Hash

Filed under: Bugs,CBO,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:45 pm BST Feb 28,2014

A little while ago I highlighted a special case with the MINUS operator (that one of the commentators extended to include the INTERSECT operator) relating to the way the second subquery would take place even if the first subquery produced no rows. I’ve since had an email from an Oracle employee letting me know that the developers looked at this case and decided that it wasn’t feasible to address it because – taking a wider view point – if the query were to run parallel they would need a mechanism that allowed some synchronisation between slaves so that every slave could find out that none of the slaves had received any rows from the first subquery, and this was going to lead to hanging problems.

The email reminded me that there’s another issue of the same kind that I discovered several years ago – I thought I’d written it up, but maybe it was on a newsgroup or forum somewhere (I can’t find it on my blog or old website). The problem can be demonstrated by this example:


create table t1
as
select
	rownum			id,
	mod(rownum,25)		rep_col,
	rpad('x',50)		padding
from
	all_objects
where
	rownum <= 3000
;

delete from t1 where rep_col = 12;

create table t2
as
select
	rownum			id,
	mod(rownum,25)		rep_col,
	rpad('x',50)		padding
from
	all_objects
where
	rownum <= 10000
;

-- collect stats, no histograms

select
	/*+
		leading(t1 t2)
		use_hash(t2)
		no_swap_join_inputs(t2)
		pq_distribute(t2, hash, hash)
	*/
	count(*)
from
	t1,t2
where
	t1.rep_col = 12
and	t2.id = t1.id
;

You’ll notice that I’ve created a data set for table t1 where the value 12 does not appear in column rep_col; so my query will return no rows. However, for the purposes of demonstration, I’ve hinted an execution path that will scan t1 and distribute it by hash, then scan t2 to distribute that by hash before doing the join. Here’s the plan – which I’ve generated with rowsource execution statistics enabled and pulled from memory after executing the query:


-------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Starts | A-Rows |   A-Time   | Buffers | Reads  |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |          |      1 |      1 |00:00:00.01 |       6 |      2 |
|   1 |  SORT AGGREGATE            |          |      1 |      1 |00:00:00.01 |       6 |      2 |
|   2 |   PX COORDINATOR           |          |      1 |      2 |00:00:00.01 |       6 |      2 |
|   3 |    PX SEND QC (RANDOM)     | :TQ10002 |      0 |      0 |00:00:00.01 |       0 |      0 |
|   4 |     SORT AGGREGATE         |          |      2 |      2 |00:00:00.01 |       0 |      0 |
|*  5 |      HASH JOIN             |          |      2 |      0 |00:00:00.01 |       0 |      0 |
|   6 |       JOIN FILTER CREATE   | :BF0000  |      2 |      0 |00:00:00.01 |       0 |      0 |
|   7 |        PX RECEIVE          |          |      2 |      0 |00:00:00.01 |       0 |      0 |
|   8 |         PX SEND HASH       | :TQ10000 |      0 |      0 |00:00:00.01 |       0 |      0 |
|   9 |          PX BLOCK ITERATOR |          |      2 |      0 |00:00:00.01 |      54 |     27 |
|* 10 |           TABLE ACCESS FULL| T1       |     27 |      0 |00:00:00.01 |      54 |     27 |
|  11 |       PX RECEIVE           |          |      0 |      0 |00:00:00.01 |       0 |      0 |
|  12 |        PX SEND HASH        | :TQ10001 |      0 |      0 |00:00:00.01 |       0 |      0 |
|  13 |         JOIN FILTER USE    | :BF0000  |      2 |      2 |00:00:00.01 |     116 |     87 |
|  14 |          PX BLOCK ITERATOR |          |      2 |      2 |00:00:00.01 |     116 |     87 |
|* 15 |           TABLE ACCESS FULL| T2       |     29 |      2 |00:00:00.01 |     116 |     87 |
-------------------------------------------------------------------------------------------------

Since this was 11.2.0.4 Oracle has used Bloom filtering to reduce the traffic between slave sets – but you can see that despite returning no rows from t1 (lines 7 – 10), Oracle still performed the parallel tablescan of t2 (lines 12 – 15). Thanks to the Bloom filter we don’t transfer 10,000 rows between slave sets, but we can see from the Buffers and Reads columns that we really did do the tablescan – and if we take a snapshot of instance activity we would have seen 10,000 rows fetched by tablescan at that point.
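
For anyone who wants to reproduce the figures, the plans in this note came from the usual mechanism – a sketch:

alter session set statistics_level = all;

-- run the query, then, from the same session:
select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));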

If I ran this query serially Oracle would stop after discovering that the first tablescan returned no rows – why bother scanning the probe table when the hash table is empty ? But because the query is running in parallel, a single slave that receives no data from the first tablescan cannot assume that every other slave in the same set has also received no data – there’s no cross-chat that allows the slaves to discover that every slave has no data – so the second scan goes ahead.

I was a little surprised by this when I first found it since I thought (from looking at some of the 1039x trace information) that the parallel slaves were telling the query coordinator how many rows they had acquired on each granule – which would allow the coordinator to spot the zero total. But it looks as if I was misinterpreting the trace.

On the plus side for this example – it’s probably the case that if zero is a possible volume of data returned by the query then the optimizer will have decided that it was going to get a “small” data set for the build table and therefore do a broadcast distribution – and if that happens the second tablescan won’t occur – as we see below (note the zero Reads, zero Buffers):


------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Starts | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |      1 |      1 |00:00:00.01 |       6 |      2 |
|   1 |  SORT AGGREGATE           |          |      1 |      1 |00:00:00.01 |       6 |      2 |
|   2 |   PX COORDINATOR          |          |      1 |      2 |00:00:00.01 |       6 |      2 |
|   3 |    PX SEND QC (RANDOM)    | :TQ10001 |      0 |      0 |00:00:00.01 |       0 |      0 |
|   4 |     SORT AGGREGATE        |          |      2 |      2 |00:00:00.01 |       0 |      0 |
|*  5 |      HASH JOIN            |          |      2 |      0 |00:00:00.01 |       0 |      0 |
|   6 |       PX RECEIVE          |          |      2 |      0 |00:00:00.01 |       0 |      0 |
|   7 |        PX SEND BROADCAST  | :TQ10000 |      0 |      0 |00:00:00.01 |       0 |      0 |
|   8 |         PX BLOCK ITERATOR |          |      2 |      0 |00:00:00.01 |      54 |     27 |
|*  9 |          TABLE ACCESS FULL| T1       |     27 |      0 |00:00:00.01 |      54 |     27 |
|  10 |       PX BLOCK ITERATOR   |          |      0 |      0 |00:00:00.01 |       0 |      0 |
|* 11 |        TABLE ACCESS FULL  | T2       |      0 |      0 |00:00:00.01 |       0 |      0 |
------------------------------------------------------------------------------------------------

Unfortunately the same type of argument can’t be used to dismiss the minus/intersect cases.

February 26, 2014

Parallel Execution – 4

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 1:58 pm BST Feb 26,2014

I’m aware that in the previous article in this series I said I’d continue “in a few days” and it has now been more like 11 weeks – but finally I’ve got the time. In this article I’m going to talk primarily about Bloom filters and their impact on performance, but I’ll need to say something about the “virtual tables” and “parallel execution message size” before I begin. Take a look at this fragment of a parallel execution plan:

-----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------

|   7 |        PX SEND BROADCAST    | :TQ10000 |     3 |    18 |     2 |  Q1,00 | P->P | BROADCAST  |
|   8 |         PX BLOCK ITERATOR   |          |     3 |    18 |     2 |  Q1,00 | PCWC |            |
|*  9 |          TABLE ACCESS FULL  | T3       |     3 |    18 |     2 |  Q1,00 | PCWP |            |

-----------------------------------------------------------------------------------------------------

Examining the Name and TQ columns, we can say (from the TQ column) that one set of parallel execution slaves has operated steps 7, 8, and 9 to populate a “virtual table” called :TQ10000 which is broadcast (every slave receives every row) to another set of parallel execution slaves. Each slave in the other slave set will read from that virtual table (at some point) and do something with the data.

I am fairly certain (and it is an important point in understanding the mechanics and rationale of parallel execution plans) that a parallel slave (set) cannot write one virtual table while reading from another virtual table; it has to read the whole of the “incoming” virtual table before it can start writing to an “outgoing” virtual table. This restriction probably keeps the code as simple as possible, partly because it limits the need for parallel execution slave sets to a maximum of 2 for a data flow operation. Bear in mind that there is also (probably) a “side-channel” that the query co-ordinator uses to pass instructions to the parallel query slaves, and this side-channel probably falls outside any “can’t read and write simultaneously” restriction.

So how does Oracle implement virtual tables ? A virtual table is simply a small number of “pages” of memory, whose size is dictated by the parameter _parallel_execution_message_size. The default value for this parameter is either 2,148 bytes, 4,096 bytes, or 16,384 bytes, depending on the Oracle version and the settings for various other parameters. When I say “small number”, though, this is per communication channel and every slave in one slave set has to be able to pass data to every slave in the other slave set, so if you’re running a query with degree N you need to allocate a total of N-squared times that small number of pages. (In fact, since both sets of slaves may also have to send data to the query co-ordinator at some stage (e.g. while “ranging”) you may need a further 2N sets of pages – N for each slave set – for those channels.)

I recall reading a document many years ago (in the era of 7.2 or 7.3 – when we had OCIS rather than MoS) which said the number of pages required was 3, though I’ve also seen a blog note recently that says it’s 4 in RAC systems. So, assume you have a query that is running parallel 5 in a single instance, and your _parallel_execution_message_size is 16,384 bytes, then the amount of memory you need for virtual tables for that query is something like: 5 * 5 * 3 * 16KB = 1,200KB (or maybe 5 * 7 * 3 * 16KB = 1,680KB if my comment about the query co-ordinator is correct).

Again depending on version and parameter settings, this memory will be located either in the shared pool or large pool with the name “PX msg pool”; if you run a series of tests that continuously increases the degree of parallelism on a fixed query you’ll see this pool size growing at roughly the right rate to match the formulae above (although the approximation was somewhat obscured when I checked as Oracle was allocating the pool in chunks of 120KB + 24 bytes).
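
You can watch the pool growing between tests with a query like this – a sketch: whether the memory appears in the shared pool or the large pool depends on your version and settings:

select
	pool, name, bytes
from
	v$sgastat
where
	name = 'PX msg pool'
;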

Given the small number of pages available on any channel for reading and writing, you can imagine that a writer will sometimes find that all the available pages for a given channel are full and it has to wait for the reader to empty them before it can write the next set of data. This leads to the two waits: “PX Deq Credit: send blkd” and “PX Deq Credit: need buffer”.  I’m not sure exactly why there are two different waits,  but it’s possible that one is an explicit block (e.g.: “send blkd” = the reader is now reading the buffers and has made them unavailable, and “need buffer” = all the buffers are full but the reader is currently doing something else and hasn’t noticed).
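
If you want a rough idea of how much time a system has lost to these waits the usual wait interface applies – a sketch (time_waited is in centiseconds):

select
	event, total_waits, time_waited
from
	v$system_event
where
	event in ('PX Deq Credit: send blkd', 'PX Deq Credit: need buffer')
;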

There is an important performance consideration that goes along with this blocking – if a reader starts spilling its data to disc that’s a (relatively) slow process – which means that ALL the writers trying to send data to that reader are likely to get blocked waiting each time that it spills a few more pages to disc, so one (unlucky) reader could produce far more delay to the query than just the time it spends writing (if I spill my pages to disc none of the slaves in the other slave set can send data to me until I’ve finished, which means they don’t send data to anyone else either). Moreover if a single writer tends to send most of its data to a single reader then that writer will effectively be causing a block on all the other writers because they may have to wait for that one reader to spill the data from that one writer every time they try to send data to that one reader.

Clearly, spilling to disc is a bad thing – it’s going to slow the query down whatever the circumstances, but an imbalance in the sharing of data between readers and writers could have an impact far greater than we might at first expect. Any strategy that minimises the risk of spilling to disc is likely to be a good idea – and that’s where Bloom filters come into play; if we can find a cheap way of telling one set of slaves about the data that the other set doesn’t need then we can reduce traffic through the virtual tables and reduce the time lost through “need buffer” and “send blkd” waits.

The time when it’s really important to find a way to eliminate data before it goes through a table queue is when we do a hash join. Remember a hash join takes place in two pieces – we extract data from one table and build an (in-memory) hash table, then we extract data from the other table and probe the hash table for each row extracted; but when we are actually extracting the data we don’t consider the join condition. When a hash join goes parallel we may be in a position where one slave set holds the in-memory hash table and the other slave set does the work of scanning the second table – in this case we would like the second slave set to discard as much data as possible before sending it through the table queue … but it doesn’t have any information about the join condition.

A Bloom filter is simply a small dataset created by the first slave set that is a “summary” of the values that are likely to be useful in the join. As it builds the in-memory hash table, the first slave set can construct a Bloom filter for the join values; when the hash table is complete the first slave set can pass the Bloom filter to the second slave set (possibly by way of the query co-ordinator, I haven’t checked the mechanism yet) allowing the second slave set to discard the data that definitely won’t be needed and passing only data that will probably be needed through the table queue.  (Note: the Bloom filter is not a perfect filter, it may allow some data that isn’t needed. For more detailed information about Bloom filters, see this article by Christian Antognini.)

Let’s take a look at a simple example to demonstrate the principle – I’ve created tables t1 and t2 as “select * from all_objects” (some 70,000 rows in 11.2.0.4) and gathered stats with no histograms. Now I’m going to run three queries, first an unfiltered join between the two tables, then a filtered join showing a Bloom filter magically appearing, then I’ll repeat the filtered query after disabling Bloom filtering so that we can see the benefits of the filter. In all three cases I’ll enable rowsource execution statistics so that you can see the volume of data that went through the table queues. Here’s the first, unfiltered, query:


select
	/*+
		parallel(t1 2)
		parallel(t2 2)
		leading(t1 t2)
		pq_distribute(t2 hash hash)
	*/
	t1.object_name, t2.object_name, rpad('x',100,'x') padding
from	t1, t2
where
	t2.object_id = t1.object_id
;

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem |  O/1/M   | Max-Tmp |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |          |        |      |            |  70101 |00:00:00.17 |      12 |      0 |      0 |       |       |          |         |
|   1 |  PX COORDINATOR         |          |      1 |          |        |      |            |  70101 |00:00:00.17 |      12 |      0 |      0 |       |       |          |         |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |      0 | 00:00:01 |  Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  3 |    HASH JOIN BUFFERED   |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70101 |00:00:00.28 |       0 |    310 |    310 |  5952K|  1953K|     2/0/0|    2048 |
|   4 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70101 |00:00:00.08 |       0 |      0 |      0 |       |       |          |         |
|   5 |      PX SEND HASH       | :TQ10000 |      0 | 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   6 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,00 | PCWC |            |  70101 |00:00:00.05 |    1031 |   1005 |      0 |       |       |          |         |
|*  7 |        TABLE ACCESS FULL| T1       |     26 | 00:00:01 |  Q1,00 | PCWP |            |  70101 |00:00:00.02 |    1031 |   1005 |      0 |       |       |          |         |
|   8 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70102 |00:00:00.06 |       0 |      0 |      0 |       |       |          |         |
|   9 |      PX SEND HASH       | :TQ10001 |      0 | 00:00:01 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|  10 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,01 | PCWC |            |  70102 |00:00:00.04 |    1031 |   1005 |      0 |       |       |          |         |
|* 11 |        TABLE ACCESS FULL| T2       |     26 | 00:00:01 |  Q1,01 | PCWP |            |  70102 |00:00:00.01 |    1031 |   1005 |      0 |       |       |          |         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So, following the table queues:

  • slave set 2 scans t1 and distributes it to slave set 1 through virtual table 0, and slave set 1 builds an in-memory hash table: (implied by lines 3 – 7).
  • slave set 2 scans t2 and distributes it to slave set 1 through virtual table 1, and slave set 1  buffers it until the scan has completed and all relevant data has arrived: (implied by lines 3, 8 – 11)
  • slave set 1 does the hash join (line 3) reading back the second data set from its buffer (which may have spilled to disk) and passes the results on to the query coordinator through virtual table 2.

Now let’s restrict the data we select from table t1.

select
	/*+
		parallel(t1 2)
		parallel(t2 2)
		leading(t1 t2)
		pq_distribute(t2 hash hash)
	 */
	t1.object_name, t2.object_name, rpad('x',100,'x') padding
from	t1, t2
where
	t2.object_id = t1.object_id
and	t1.owner = 'SYS'
;

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem |  O/1/M   | Max-Tmp |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |          |        |      |            |  34463 |00:00:00.10 |      12 |      0 |      0 |       |       |          |         |
|   1 |  PX COORDINATOR          |          |      1 |          |        |      |            |  34463 |00:00:00.10 |      12 |      0 |      0 |       |       |          |         |
|   2 |   PX SEND QC (RANDOM)    | :TQ10002 |      0 | 00:00:01 |  Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  3 |    HASH JOIN BUFFERED    |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.16 |       0 |    124 |    124 |  3611K|  1953K|     2/0/0|    1024 |
|   4 |     JOIN FILTER CREATE   | :BF0000  |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.05 |       0 |      0 |      0 |       |       |          |         |
|   5 |      PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.04 |       0 |      0 |      0 |       |       |          |         |
|   6 |       PX SEND HASH       | :TQ10000 |      0 | 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   7 |        PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,00 | PCWC |            |  34463 |00:00:00.03 |    1031 |   1005 |      0 |       |       |          |         |
|*  8 |         TABLE ACCESS FULL| T1       |     26 | 00:00:01 |  Q1,00 | PCWP |            |  34463 |00:00:00.01 |    1031 |   1005 |      0 |       |       |          |         |
|   9 |     PX RECEIVE           |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  38848 |00:00:00.05 |       0 |      0 |      0 |       |       |          |         |
|  10 |      PX SEND HASH        | :TQ10001 |      0 | 00:00:01 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|  11 |       JOIN FILTER USE    | :BF0000  |      2 | 00:00:01 |  Q1,01 | PCWP |            |  38848 |00:00:00.04 |    1031 |   1005 |      0 |       |       |          |         |
|  12 |        PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,01 | PCWC |            |  38848 |00:00:00.03 |    1031 |   1005 |      0 |       |       |          |         |
|* 13 |         TABLE ACCESS FULL| T2       |     26 | 00:00:01 |  Q1,01 | PCWP |            |  38848 |00:00:00.01 |    1031 |   1005 |      0 |       |       |          |         |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
   8 - access(:Z>=:Z AND :Z<=:Z)
        filter("T1"."OWNER"='SYS')
   13 - access(:Z>=:Z AND :Z<=:Z)
        filter(SYS_OP_BLOOM_FILTER(:BF0000,"T2"."OBJECT_ID"))

In this case the basic plan is the same although we’ve introduced two extra lines: a “JOIN FILTER CREATE” at line 4, which is slave set 1 creating a Bloom filter as it constructs the in-memory hash table, and a “JOIN FILTER USE” at line 11, which is slave set 2 discarding data from the tablescan of t2 before sending data through virtual table 1.

Looking at predicate information for line 13 we can see the filter predicate using a function called sys_op_bloom_filter() to eliminate data. Oracle has decided that the number of distinct values of object_id returned by the predicate OWNER = ‘SYS’ is sufficiently small that it’s worth building a Bloom filter describing it and passing it to the other slave set. (In fact the optimizer has been a little over-optimistic in this case because I didn’t gather any histograms on the table – there’s a lot more SYS-owned data than the optimizer is expecting).

We can see that the Bloom filter has been pretty effective. Since the two tables have the same data (plus or minus one row) the volume that would be returned by a perfect filter at line 13 ought to match the volume produced from the first table at line 8 – in fact producing about 12.5% more than the ideal, but that’s still a big improvement over sending the whole table through the virtual table.

Just to finish off, here are the results when I disable Bloom filtering using the /*+ no_px_join_filter(t2) */ hint. (This hint appeared, along with the opposite hint /*+ px_join_filter(alias) */ and Bloom filters, in 10gR2 but didn’t get into outlines and SQL Plan baselines until 11.1.0.6.)

Plan hash value: 3683239666

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem |  O/1/M   | Max-Tmp |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |          |        |      |            |  34463 |00:00:00.22 |      10 |      0 |      0 |       |       |          |         |
|   1 |  PX COORDINATOR         |          |      1 |          |        |      |            |  34463 |00:00:00.22 |      10 |      0 |      0 |       |       |          |         |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |      0 | 00:00:01 |  Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  3 |    HASH JOIN BUFFERED   |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.21 |       0 |     93 |     93 |  3611K|  1953K|     2/0/0|    1024 |
|   4 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.04 |       0 |      0 |      0 |       |       |          |         |
|   5 |      PX SEND HASH       | :TQ10000 |      0 | 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   6 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,00 | PCWC |            |  34463 |00:00:00.04 |    1031 |   1005 |      0 |       |       |          |         |
|*  7 |        TABLE ACCESS FULL| T1       |     26 | 00:00:01 |  Q1,00 | PCWP |            |  34463 |00:00:00.02 |    1031 |   1005 |      0 |       |       |          |         |
|   8 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70102 |00:00:00.07 |       0 |      0 |      0 |       |       |          |         |
|   9 |      PX SEND HASH       | :TQ10001 |      0 | 00:00:01 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|  10 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,01 | PCWC |            |  70102 |00:00:00.06 |    1031 |   1005 |      0 |       |       |          |         |
|* 11 |        TABLE ACCESS FULL| T2       |     26 | 00:00:01 |  Q1,01 | PCWP |            |  70102 |00:00:00.02 |    1031 |   1005 |      0 |       |       |          |         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
   7 - access(:Z>=:Z AND :Z<=:Z)
        filter("T1"."OWNER"='SYS')
   11 - access(:Z>=:Z AND :Z<=:Z)

You’ll notice we’re back to 11 lines: there’s no reference to JOIN FILTER and the predicate section no longer has a call to sys_op_bloom_filter().
More significantly, the actual rows reported in line 11 is the full 70,000 for the whole table – the entire data set will be passing through virtual table 1, increasing the CPU load, the probability of “send blkd” and “need buffer” waits, and the chances of the BUFFERED data spilling to disc.

That’s nearly half a day gone writing up the preamble to how to interpret the execution plan for my original query after I hinted it into a hash/hash distribution – but it’s a lovely day outside so I’m heading out to the garden with my pruning secateurs and it’s going to be a few more days (I hope) before the last part of the series comes out.

Predicate Order

Filed under: Bugs,CBO,Execution plans,Oracle — Jonathan Lewis @ 8:14 am BST Feb 26,2014

Common internet question: does the order of predicates in the where clause make a difference?
General answer: It shouldn’t, but sometimes it will thanks to defects in the optimizer.

There’s a nicely presented example on the OTN database forum where predicate order does matter (between 10.1.x.x and 11.1.0.7). Note, particularly, that there’s a script to recreate the issue; note, also, the significance of the predicate section of the execution plan.
It’s bug 6782665, fixed in 11.2.0.1.

February 25, 2014

FBI Skip Scan

Filed under: Bugs,Function based indexes,Indexing,Oracle — Jonathan Lewis @ 6:45 pm BST Feb 25,2014

A recent posting on the OTN database forum highlighted a bug (or defect, or limitation) in the way that the optimizer handles index skip scans with “function-based” indexes – it doesn’t do them. The defect has probably been around for a long time and demonstrates a common problem with testing Oracle – it’s very easy for errors in the slightly unusual cases to be missed; it also demonstrates a general principle that it can take some time for a (small) new feature to be applied consistently across the board.

The index definitions in the original posting included expressions like substr(nls_lower(colX), 1, 25), and it’s possible for all sorts of unexpected effects to appear when your code starts running into NLS  settings, so I’ve created a much simpler example. Here’s my table definition, with three index definitions:

create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	mod(rownum - 1,10)		mod1,
	rownum				id1,
	1 + mod(rownum - 1,10)		mod2,
	1 + rownum 			id2,
	lpad(rownum,7)			small_vc,
	rpad('x',100)			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5
;

create index t1_normal on t1(mod2, id2);
create index t1_split  on t1(mod2, id1 + 1);
create index t1_fbi    on t1(mod1 + 1, id1 + 1);

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

Note that I gathered stats after creating the indexes (generally you would be safe creating the indexes after gathering the stats from 10g onwards) because I needed to gather stats on the virtual columns that support the function-based indexes. As you can see, mod2 = 1 + mod1, and id2 = 1 + id1, and I’ve created three indexes which (internally) are the same although they are defined in three different ways.
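
You can see the supporting (hidden) columns and their defining expressions with a query like this – a sketch:

select
	column_name, hidden_column, virtual_column, data_default
from
	user_tab_cols
where
	table_name = 'T1'
order by
	internal_column_id
;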

So here are three queries which select the same data although, again, the queries are not all exactly the same. Since the queries are doing the same things with structures that are the same we might hope to see the optimizer using the same strategy for all three. In fact, to make things easier for the optimizer, I’ve even told it exactly what to do in two of the cases:


select	*
from	t1
where
	id2 = 50000
;

select	/*+ index_ss(t1(mod2)) */
	*
from	t1
where
	id1 + 1 = 50000
;

select	/*+ index_ss(t1 t1_fbi) */
	*
from	t1
where
	id1 + 1 = 50000
;

In the first query I’ve referenced only the second column of the “normal” index with a high precision query – expecting the optimizer to find the index skip scan path. In the second and third queries I’ve referenced the (id1 + 1) index expression which appears as the second column in two different indexes; to help the optimizer along I’ve hinted the index skip scan for each of the indexes in turn.

If the first query can do an index skip scan then the second and third queries should be able to do the same because the physical structures and the statistics (and the actual stored and requested values) are all the same. Here are the three plans (pulled from memory in an 11.2.0.4 instance).


SQL_ID  bvjstz9xa3n7a, child number 0
-------------------------------------
select * from t1 where  id2 = 50000

Plan hash value: 2078113469

-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |       |       |    12 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |     1 |   125 |    12   (0)| 00:00:01 |
|*  2 |   INDEX SKIP SCAN           | T1_NORMAL |     1 |       |    11   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID2"=50000)
       filter("ID2"=50000)

SQL_ID  95573qx3w62q2, child number 0
-------------------------------------
select /*+ index_ss(t1(mod2)) */  * from t1 where  id1 + 1 = 50000

Plan hash value: 1422030023

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |       |       |   271 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1       |     1 |   130 |   271   (2)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | T1_SPLIT |     1 |       |   270   (2)| 00:00:02 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."SYS_NC00007$"=50000)
       filter("T1"."SYS_NC00007$"=50000)

SQL_ID  d4t48va4zrk1q, child number 0
-------------------------------------
select /*+ index_ss(t1 t1_fbi) */  * from t1 where  id1 + 1 = 50000

Plan hash value: 3151334857

--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |       |       |   271 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1     |     1 |   130 |   271   (2)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | T1_FBI |     1 |       |   270   (2)| 00:00:02 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."SYS_NC00007$"=50000)
       filter("T1"."SYS_NC00007$"=50000)

The first query automatically does an index skip scan on the normal index.

The second and third queries calculate the cardinality correctly, reference the right “virtual” column correctly, and come close to obeying the hint by using the required index – and then spoil things by doing an index full scan instead of an index skip scan. Checking the 10053 trace file I found that the optimizer hadn’t even considered the possibility of a skip scan for the last two queries – it had jumped straight to the full scan. (If the optimizer had ignored the hint completely I would have said that this behaviour was simply a limitation of FBIs – but since the optimizer has got partway there I think it’s probably a bug.)
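
Footnote for anyone repeating the test – this is just the standard event syntax for the CBO trace, nothing specific to this example:

alter session set events '10053 trace name context forever, level 1';

-- hard-parse the problem query here, then:

alter session set events '10053 trace name context off';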

Once you’ve got this far, of course, you might wonder if you could work around the problem in 11g by using virtual columns – so that’s the next test.

drop index t1_split;
drop index t1_fbi;

alter table t1 add (mod_virtual generated always as (mod1 + 1) virtual);
alter table t1 add (id_virtual generated always as (id1 + 1) virtual);

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

create index t1_virtual on t1(mod_virtual, id_virtual);

select
	/*+ index_ss(t1) */
	*
from	t1
where
	id_virtual = 50000
;

I started by dropping the two function-based indexes (because I’m about to create “proper” virtual columns which are identical to the hidden “virtual columns” that were supporting the function-based indexes – if I didn’t do this I’d hit Oracle error “ORA-54015: Duplicate column expression was specified”). Then I declared my virtual columns, collected stats on the whole table (I could have used method_opt => ’for all hidden columns size 1′) so that there would be stats on those columns, created an index using the virtual columns, and tried to get a skip scan on that index. Any bets?

select  /*+ index_ss(t1) */  * from t1 where  id_virtual = 50000

Plan hash value: 4250443541

------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |       |       |   271 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1         |     1 |   133 |   271   (2)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | T1_VIRTUAL |     1 |       |   270   (2)| 00:00:02 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID_VIRTUAL"=50000)
       filter("ID_VIRTUAL"=50000)

Bad luck – even though “proper” virtual columns can do wonders, they don’t get around this problem. Anyone who’s used multi-column function-based indexes, or added virtual columns to indexes, might want to look at their code and execution paths to see if there are opportunities for skip scans that the optimizer is failing to take. If there are, you might like to raise an SR with Oracle – and point them to this blog as a repeatable test case.

I tried several searches for a relevant bug on MoS – unfortunately it was one of those occasions where the search results were either very small and didn’t include what I wanted, or so large that I wasn’t going to check all the hits. If anyone does find a bug number, please tell us about it in the comments.

February 21, 2014

Indexing LOBs

Filed under: Function based indexes,Indexing,Infrastructure,LOBs,Oracle — Jonathan Lewis @ 6:42 pm BST Feb 21,2014

Many years ago, possibly when most sites were still using Oracle 8i, a possible solution to a particular customer problem was to create a function-based index on a CLOB column using the dbms_lob.getlength() function call. I can’t find the notes explaining why this was necessary (I usually have some sort of clue – such as the client name – in the script, but in this case all I had was a comment that “the manuals say you can’t do this, but it works provided you wrap the dbms_lob call inside a deterministic function”).

I never worked out why the dbms_lob.getlength() function wasn’t declared as deterministic – especially since it came complete with a most restrictive restricts_references pragma – so I had just assumed there was probably some good reason based on strange side effects when national language charactersets came into play. But here’s a little detail I noticed recently about the dbms_lob.getlength() function: it became deterministic in 11g, so if the client decided to implement my suggestion (which included the usual sorts of warnings) it’s now legal!
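
For anyone who hasn’t seen the trick, here’s a sketch of the sort of thing I mean – the table and column names (t1, clob_col) are hypothetical, and from 11g onwards you could simply index dbms_lob.getlength() directly now that it’s deterministic:

create or replace function clob_length(i_lob in clob)
return number
deterministic
as
begin
	return dbms_lob.getlength(i_lob);
end;
/

create index t1_lob_len on t1(clob_length(clob_col));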

Footnote – the length() function has been deterministic and usable with LOBs for a long time, certainly since late 9i, but in 8i length(lob_col) will produce Oracle error “ORA-00932: inconsistent datatypes”.

Index Compression – aargh

Filed under: Bugs,compression,Indexing,Infrastructure,Oracle — Jonathan Lewis @ 7:57 am BST Feb 21,2014

The problem with telling people that some feature of Oracle is a “good thing” is that some of those people will go ahead and use it; and if enough people use it some of them will discover a hitherto undiscovered defect. Almost inevitably the bug will turn out to be one of those “combinations” bugs that leaves you thinking: “Why the {insert preferred expression of disbelief here} should {feature X} have anything to do with {feature Y}”.

Here – based on index compression, as you may have guessed from the title – is one such bug. I got it first on 11.1.0.7, but it’s still there on 11.2.0.4 and 12.1.0.1.


create table t1 (
	id1	number not null,
	id2	number not null,
	n1	number,
	v1	varchar2(10)
)
rowdependencies
;

alter table t1 add constraint t1_pk primary key(id1, id2) using index (
	create index t1_pk on t1(id1, id2) compress 1
);

create table t2(
	id1	number not null,
	id2	number not null,
	id3	number not null,
	n1	number,
	v1	varchar2(10)
)
rowdependencies
;

alter table t2 add constraint t2_fk_t1 foreign key(id1, id2) references t1;

It’s quite simple – I’ve got a multi-column primary key, and it’s worth compressing on the first column because that column is fairly repetitive. Then I’ve created another table that has a foreign key constraint referencing my first table. Because I’ve got some replication going on and want to enable parallelism I’ve enabled rowdependencies all over the place. So let’s insert a couple of rows and see what happens – the next bit of text is cut-n-pasted from an 11.2.0.4 SQL*Plus session running a script after a call to set echo on:


SQL> insert into t1 values(1,1,1,'x');

1 row created.

SQL> commit;

Commit complete.

SQL>
SQL> insert into t2 values(1,1,1,1,'x');
insert into t2 values(1,1,1,1,'x')
            *
ERROR at line 1:
ORA-00600: internal error code, arguments: [25027], [5], [394178], [], [], [], [], [], [], [], [], []

For further details, and before you get completely thrilled at the possibility of compressing lots of indexes, keep an eye on:

Bug 18125878 : ORA-600 [25027] ON INSERT IF TABLE HAS ROWDEPENDENCIES AND COMPRESS ON PK

“Inserting into a table with a foreign key where the base table has a primary key using index key compression and the table also has row dependencies enabled. Stack will include kdsgrds and kdiexi0 (in 12) / kdiexi (in 11, 10)”

As implied by that note from the bug, it also affects 10g.  The bug note reports it as fixed in 12.2.
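
Until you’re on a patched version the obvious (if unwelcome) workaround is to give up the index compression – a sketch, which I haven’t tested against the bug itself:

alter index t1_pk rebuild nocompress;

(The alternative – giving up rowdependencies – is harder, since that’s a create-time attribute you can’t simply alter away.)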

February 16, 2014

Recursive subquery factoring

Filed under: Hints,Ignoring Hints,Oracle,Subquery Factoring,Tuning — Jonathan Lewis @ 6:11 pm BST Feb 16,2014

This is possibly my longest title to date – I try to keep them short enough to fit the right hand column of the blog without wrapping – but I couldn’t think of a good way to shorten it (Personally I prefer to use the expression CTE – common table expression – over “factored subquery” or “subquery factoring” or “with subquery”, and that would have achieved my goal, but might not have meant anything to most people.)

If you haven’t come across them before, recursive CTEs appeared in 11.2, are in the ANSI standard, and are (probably) viewed by Oracle as the strategic replacement for “connect by” queries. Here’s a simple (and silly) example:


with data(p) as (
	select 1 p from dual
	union all
	select p + 1 from data where p < 100
)
select	p
from	data
where	rownum <= 10
;

         P
----------
         1
         2
         3
         4
         5
         6
         7
         8
         9
        10

10 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 37253879

---------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |      |     2 |    26 |     4   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY                             |      |       |       |            |          |
|   2 |   VIEW                                     |      |     2 |    26 |     4   (0)| 00:00:01 |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|      |       |       |            |          |
|   4 |     FAST DUAL                              |      |     1 |       |     2   (0)| 00:00:01 |
|*  5 |     RECURSIVE WITH PUMP                    |      |       |       |            |          |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   5 - filter("P"<100)
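
For comparison, a sketch of the old-style “connect by” equivalent of this silly number generator:

select	p
from	(
	select	level p
	from	dual
	connect by
		level <= 100
	)
where	rownum <= 10
;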

A recursive CTE has three features that identify it: first, the query alias (“data” in this case) is followed by a list of column aliases; second, the query includes a UNION ALL; and third, the second subquery in the UNION ALL references the query alias – i.e. it’s the recursive bit. There are other optional bits but I’m not planning to go into those – all I want to talk about is how to control the materialization (or not) of a recursive CTE through hinting.

The reason I wrote this note was because Jeff Jacobs, in his presentation on “Performance Anti-patterns” at RMOUG last week, raised the question of whether or not the /*+ materialize */ and /*+ inline */ hints worked with recursive CTEs and gave an example of a UNION ALL query where the CTE always materialized, no matter how you applied the /*+ inline */ hint. The CTE seemed to be following the basic guideline for CTEs – if you use it once in the main query it goes inline, if you use it more than once it will (almost invariably) materialize.

I’m always interested in examples where “the hint is ignored”, so I exchanged a couple of email messages with Jeff and he sent me an example (which I’ve simplified for this blog) of a query that demonstrated the issue. I spent a little while thinking about it and decided that it simply wasn’t possible to hint the code the way we wanted to – it was just one of those cases where it takes a bit of time for new features to catch up and fit into the standard framework. Here’s a simplified version of the query, with its execution plan:

with data(p) as (
	select 1 p from dual
	union all
	select p + 1 from data where p < 100
)
select	p
from	data
where	rownum <= 10
union all
select	p
from	data
where	rownum <= 10
;

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |                            |     4 |    52 |     4   (0)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION                 |                            |       |       |            |          |
|   2 |   LOAD AS SELECT                           | SYS_TEMP_0FD9D6608_7391CD7 |       |       |            |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|                            |       |       |            |          |
|   4 |     FAST DUAL                              |                            |     1 |       |     2   (0)| 00:00:01 |
|*  5 |     RECURSIVE WITH PUMP                    |                            |       |       |            |          |
|   6 |   UNION-ALL                                |                            |       |       |            |          |
|*  7 |    COUNT STOPKEY                           |                            |       |       |            |          |
|   8 |     VIEW                                   |                            |     2 |    26 |     2   (0)| 00:00:01 |
|   9 |      TABLE ACCESS FULL                     | SYS_TEMP_0FD9D6608_7391CD7 |     2 |    12 |     2   (0)| 00:00:01 |
|* 10 |    COUNT STOPKEY                           |                            |       |       |            |          |
|  11 |     VIEW                                   |                            |     2 |    26 |     2   (0)| 00:00:01 |
|  12 |      TABLE ACCESS FULL                     | SYS_TEMP_0FD9D6608_7391CD7 |     2 |    12 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - filter("P"<100)
   7 - filter(ROWNUM<=10)
  10 - filter(ROWNUM<=10)

The following morning I woke up with one of those “overnight insights” where you seem to have worked out the answer in your sleep. To make a hint work you have to put it in the right query block, or you have to name the right query block in the main query block: in this case the right query block doesn’t exist in the text, and it’s not possible to figure out what the name of the right query block would be if it came into existence.

If you try putting the /*+ inline */ hint into the query after the select at line 2 above, you’ve put the hint into the first query block of a union all, NOT into the query block of the recursive CTE.

Having identified the problem, the solution (or at least, a possible solution) was obvious – create the query block you need. This (with its execution plan from 11.2.0.4) is what worked:

with data(p) as (
	select 1 p from dual
	union all
	select p + 1 from data where p < 100
),
data1 as (
	select /*+ inline */ * from data
)
select	p
from	(
	select * from data1 where rownum <= 10
	union all
	select * from data1 where rownum <= 10
	)
;

-----------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |      |     4 |    52 |     8   (0)| 00:00:01 |
|   1 |  VIEW                                        |      |     4 |    52 |     8   (0)| 00:00:01 |
|   2 |   UNION-ALL                                  |      |       |       |            |          |
|*  3 |    COUNT STOPKEY                             |      |       |       |            |          |
|   4 |     VIEW                                     |      |     2 |    26 |     4   (0)| 00:00:01 |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|      |       |       |            |          |
|   6 |       FAST DUAL                              |      |     1 |       |     2   (0)| 00:00:01 |
|*  7 |       RECURSIVE WITH PUMP                    |      |       |       |            |          |
|*  8 |    COUNT STOPKEY                             |      |       |       |            |          |
|   9 |     VIEW                                     |      |     2 |    26 |     4   (0)| 00:00:01 |
|  10 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|      |       |       |            |          |
|  11 |       FAST DUAL                              |      |     1 |       |     2   (0)| 00:00:01 |
|* 12 |       RECURSIVE WITH PUMP                    |      |       |       |            |          |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(ROWNUM<=10)
   7 - filter("P"<100)
   8 - filter(ROWNUM<=10)
  12 - filter("P"<100)

All I’ve done is create a second CTE that selects from the first CTE. This is now a simple select, so I can add a perfectly placed hint to it – in the hope that this would effectively require the dependent recursive CTE to be inlined inside it. It seems to be sufficient.
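
Presumably – and I’ll stress that this is a sketch I haven’t pushed hard – the same wrapper would let you force the opposite decision, materializing a recursive CTE that is referenced only once:

with data(p) as (
	select 1 p from dual
	union all
	select p + 1 from data where p < 100
),
data1 as (
	select /*+ materialize */ * from data
)
select	p
from	data1
where	rownum <= 10
;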

I haven’t tested the strategy exhaustively – so I can give you no guarantee that this has to work. Unfortunately I did have another example that I applied the method to, and after several seconds of no response it crashed with an ORA-00600 error :( – but that might have been a side effect of the nature of the query (it included a couple of the optional extras) rather than a specific feature of inlining.

[Further reading on "ignoring hints"]

[Further reading on "subquery factoring"]

February 14, 2014

12c Subquery Factoring

Filed under: 12c,Oracle,Subquery Factoring,Tuning — Jonathan Lewis @ 11:44 am BST Feb 14,2014

From time to time I’ve posted a reminder that subquery factoring (“with subquery”) can give you changes in execution plans even if the subquery that you’ve taken out of line is written back inline by Oracle rather than being materialized. This can still happen in 12c – here’s a sample query in the two forms with the result sets and execution plans.  First, the “factored” version:


with e as (
	select
		deptno
	from	emp
	order by
		deptno
)
select
	deptno, count(*)
from
	e
group by deptno
;

    DEPTNO   COUNT(*)
---------- ----------
        10          3
        20          5
        30          6

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1536940522

-------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost  |
-------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     3 |     9 |     6 |
|   1 |  SORT GROUP BY NOSORT|      |     3 |     9 |     6 |
|   2 |   VIEW               |      |    14 |    42 |     6 |
|   3 |    SORT ORDER BY     |      |    14 |    42 |     6 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |    42 |     2 |
-------------------------------------------------------------

Note that the view has not been materialized (no “TEMP TABLE TRANSFORMATION”). There is a “sort group by” with NOSORT, and the data has been reported in sorted order as a side effect of the “sort order by” in line 3.

Now with the inline view manually written in place:


select   deptno, count(*)
from (
	select
		/*+ gather_plan_statistics */
		deptno
	from	emp
	order by
		deptno
) e
group by deptno
;

    DEPTNO   COUNT(*)
---------- ----------
        30          6
        20          5
        10          3

3 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 4067220884

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     3 |     9 |     6 |
|   1 |  HASH GROUP BY     |      |     3 |     9 |     6 |
|   2 |   TABLE ACCESS FULL| EMP  |    14 |    42 |     2 |
-----------------------------------------------------------

In principle, the optimizer should have rewritten the first query in the second form and optimized it – but there’s clearly something out of order in the optimisation. As you can see, the result is no longer in the same order (but we don’t expect any specific order, of course, since we didn’t end with an “order by” clause) and the aggregation step is a hash group by rather than a sort group by.
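
Of course, if the ordering actually matters then the only safe strategy is to say so explicitly in the final query (the order by inside the CTE was never a guarantee) – for example:

with e as (
	select
		deptno
	from	emp
)
select
	deptno, count(*)
from
	e
group by
	deptno
order by
	deptno
;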

Footnote:

I know that I got this example off someone else, but all I’ve written into my script is a URL, and it’s the wrong URL – my clipboard buffer seems to lag a little sometimes when I’m copying from the host to a virtual machine. The URL I’ve got is to an article by Steven Pinker in the New Republic, and it had nothing to do with Oracle (though it was comparing science and the humanities).

If you’re the original source of the example, please let me know so that I can add an acknowledgement.

[Further reading on "subquery factoring"]

Test case

If you would like to reproduce the example, here’s the script – it’s a simple hack of the Scott emp/dept schema.

CREATE TABLE EMP (
	EMPNO		NUMBER(4) NOT NULL,
        ENAME		VARCHAR2(10),
        JOB		VARCHAR2(9),
        MGR		NUMBER(4),
        HIREDATE	DATE,
        SAL		NUMBER(7, 2),
        COMM		NUMBER(7, 2),
        DEPTNO		NUMBER(2)
);

insert into emp values
        (7369, 'SMITH',  'CLERK',     7902,
        to_date('17-DEC-1980', 'DD-MON-YYYY'),  800, NULL, 20);

insert into emp values
        (7499, 'ALLEN',  'SALESMAN',  7698,
        to_date('20-FEB-1981', 'DD-MON-YYYY'), 1600,  300, 30);

insert into emp values
        (7521, 'WARD',   'SALESMAN',  7698,
        to_date('22-FEB-1981', 'DD-MON-YYYY'), 1250,  500, 30);

insert into emp values
        (7566, 'JONES',  'MANAGER',   7839,
        to_date('2-APR-1981', 'DD-MON-YYYY'),  2975, NULL, 20);

insert into emp values
        (7654, 'MARTIN', 'SALESMAN',  7698,
        to_date('28-SEP-1981', 'DD-MON-YYYY'), 1250, 1400, 30);

insert into emp values
        (7698, 'BLAKE',  'MANAGER',   7839,
        to_date('1-MAY-1981', 'DD-MON-YYYY'),  2850, NULL, 30);

insert into emp values
        (7782, 'CLARK',  'MANAGER',   7839,
        to_date('9-JUN-1981', 'DD-MON-YYYY'),  2450, NULL, 10);

insert into emp values
        (7788, 'SCOTT',  'ANALYST',   7566,
        to_date('09-DEC-1982', 'DD-MON-YYYY'), 3000, NULL, 20);

insert into emp values
        (7839, 'KING',   'PRESIDENT', NULL,
        to_date('17-NOV-1981', 'DD-MON-YYYY'), 5000, NULL, 10);

insert into emp values
        (7844, 'TURNER', 'SALESMAN',  7698,
        to_date('8-SEP-1981', 'DD-MON-YYYY'),  1500,    0, 30);

insert into emp values
        (7876, 'ADAMS',  'CLERK',     7788,
        to_date('12-JAN-1983', 'DD-MON-YYYY'), 1100, NULL, 20);

insert into emp values
        (7900, 'JAMES',  'CLERK',     7698,
        to_date('3-DEC-1981', 'DD-MON-YYYY'),   950, NULL, 30);

insert into emp values
        (7902, 'FORD',   'ANALYST',   7566,
        to_date('3-DEC-1981', 'DD-MON-YYYY'),  3000, NULL, 20);

insert into emp values
        (7934, 'MILLER', 'CLERK',     7782,
        to_date('23-JAN-1982', 'DD-MON-YYYY'), 1300, NULL, 10);

CREATE TABLE DEPT(
	DEPTNO	NUMBER(2),
        DNAME	VARCHAR2(14),
        LOC	VARCHAR2(13)
);

insert into dept values (10, 'ACCOUNTING', 'NEW YORK');
insert into dept values (20, 'RESEARCH',   'DALLAS');
insert into dept values (30, 'SALES',      'CHICAGO');
insert into dept values (40, 'OPERATIONS', 'BOSTON');

commit;

alter table emp add constraint emp_pk primary key(empno);
alter table dept add constraint dept_pk primary key(deptno);

February 12, 2014

Caution – hints

Filed under: Hints,Oracle — Jonathan Lewis @ 6:57 pm BST Feb 12,2014

Here’s a little example of why you should be very cautious about implementing undocumented discoveries. If you take a look at the view v$sql_hints in 11.2.0.4 you’ll discover a hint (no_)cluster_by_rowid; and if you look in v$parameter you’ll discover two new parameters _optimizer_cluster_by_rowid and _optimizer_cluster_by_rowid_control.

It doesn’t take much imagination to guess that the parameters and hint have something to do with the costs of accessing compressed data by rowid on an Exadata system (see, for example, this posting) and it’s very easy to check what the hint does:


create table t1
as
select
	rownum					id,
	trunc(dbms_random.value(0,1000))	n1,
	trunc(dbms_random.value(0,1000))	n2,
	rpad('x',100,'x')			padding
from
	all_objects
where
	rownum <= 10000
;

create index t1_i1 on t1(n1,n2);

-- collect stats

set autotrace on explain

select
	/*+
		index(t1)
		cluster_by_rowid(t1)
	*/
	id
from
	t1
where
	n1 = 50
;

I’ve created an index on the two randomly generated columns (n1, n2) and then forced Oracle to walk the index searching for n1 = 50. Because of the presence of n2 in the index, the order in which Oracle visits the rows will be randomised relative to the order in which they are stored in the table – but if I can get Oracle to sort the rowids into order before visiting the table I could (in theory) get a performance benefit by minimising the number of consistent gets I do (some will turn into “buffer is pinned count”), and in the Exadata case I could get a big improvement by ensuring that I decompressed each compression unit no more than once. That’s what the hint does – and to prove the point I can select the id column: if the rows are being visited in order of rowid then the output will be in order of id:


        ID
----------
       198
       681
      1257
      1318
      5593
      7094
      7471
      8048
      8798
      8855
      9670

11 rows selected.

---------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    10 |    80 |    12 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    10 |    80 |    12 |
|   2 |   SORT CLUSTER BY ROWID     |       |    10 |       |     2 |
|*  3 |    INDEX RANGE SCAN         | T1_I1 |    10 |       |     2 |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=50)

Sure enough – with the hint the data arrives in sorted order; without it the results are not sorted. And, of course, we can check the execution plan and see that the optimizer has inserted a SORT operation between the index range scan and the table access by rowid.

So here’s the fun bit – the parameters are still there in 12.1.0.1 (and _optimizer_cluster_by_rowid is now true by default when it used to be false) but the hint is no longer available. So if you want the sort to take place, you have to depend (presumably) on the luck of the statistics – when does Oracle think it’s worth doing automatically – or you still have to write the query as a two-part job with an inline, non-mergeable view that accesses and sorts the rowids before visiting the table.
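
For reference, here’s a sketch of the sort of two-part rewrite I mean (I haven’t run this exact text) – collect and sort the rowids in a non-mergeable inline view, then visit the table by rowid:

select
	/*+ leading(v) use_nl(t1) rowid(t1) */
	t1.id
from	(
	select	/*+ index(t1) no_merge */
		rowid	rid
	from	t1
	where	n1 = 50
	order by
		rowid
	) v,
	t1
where
	t1.rowid = v.rid
;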

February 10, 2014

Row Migration

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 6:55 pm BST Feb 10,2014

At one of the presentations I attended at RMOUG this year the presenter claimed that if a row kept increasing in size and had to migrate from block to block as a consequence then each migration of that row would leave a pointer in the previous block so that an indexed access to the row would start at the original table block and have to follow an ever growing chain of pointers to reach the data.

This is not correct, and it’s worth making a little fuss about the error since it’s the sort of thing that can easily become an urban legend that results in people rebuilding tables “for performance” when they don’t need to.

Oracle behaves quite intelligently with migrated rows. First, the migrated row has a pointer back to the original location and if the row has to migrate a second time the first place that Oracle checks for space is the original block, so the row might “de-migrate” itself; however, even if it can’t migrate back to the original block, it will still revisit the original block to change the pointer in that block to refer to the block it has moved on to – so the row is never more than one step away from its original location. As a quick demonstration, here’s some code to generate and manipulate some data:

create table t1 (
	id	number(6,0),
	v1	varchar2(1200)
)
pctfree 0
;

prompt	==========================================
prompt	The following code fits 74 rows to a block
prompt	==========================================

insert into t1 select rownum - 1, rpad('x',100) from all_objects where rownum <= 75;
commit;

prompt	======================================
prompt	Make the first row migrate and dump it
prompt	======================================

update t1 set v1 = rpad('x',400) where id = 0;
commit;

alter system flush buffer_cache;
execute dump_seg('t1',2)

prompt	===========================================================
prompt	Fill the block the long row is now in, force it to migrate,
prompt	then dump it again.
prompt	===========================================================

insert into t1 select rownum + 75, rpad('x',100) from all_objects where rownum <= 75;
commit;

update t1 set v1 = rpad('x',800) where id = 0;
commit;

alter system flush buffer_cache;
execute dump_seg('t1',3)

prompt	========================================================
prompt	Fill the block the long row is now in and shrink the row
prompt	to see if it returns to its original block. (It won't.)
prompt	========================================================

insert into t1 select rownum + 150, rpad('x',100) from all_objects where rownum <= 75;
commit;

update t1 set v1 = rpad('x',50) where id = 0;
commit;

alter system flush buffer_cache;
execute dump_seg('t1',3)

prompt	========================================================
prompt	Make a lot of space in the first block and force the row
prompt	to migrate again to see if it migrates back. (It does.)
prompt	========================================================

delete from t1 where id between 1 and 20;
commit;

update t1 set v1 = rpad('x',1200) where id = 0;
commit;

alter system flush buffer_cache;
execute dump_seg('t1',3)
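
The dump_seg() procedure isn’t published in this note; purely as an indication of the sort of thing it does, here’s a minimal sketch covering just the first two parameters – it assumes the owner has alter system and select on dba_extents granted directly, and a freelist-managed segment whose data blocks start one block past the start of the first extent:

create or replace procedure dump_seg(
	i_seg		in varchar2,
	i_blocks	in number default 1
) as
	m_file	number;
	m_block	number;
begin
	-- find the first extent of the segment
	select	relative_fno, block_id
	into	m_file, m_block
	from	dba_extents
	where	owner = user
	and	segment_name = upper(i_seg)
	and	extent_id = 0;

	-- skip the segment header, dump the first i_blocks data blocks
	execute immediate
		'alter system dump datafile ' || m_file ||
		' block min ' || (m_block + 1) ||
		' block max ' || (m_block + i_blocks);
end;
/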

My test database was using 8KB blocks (hence the 74 rows per block), and 1MB uniform extents with freelist management. The procedure dump_seg() takes a segment name as its first parameter and a number of blocks as the second (then the segment type and starting block as the third and fourth) and dumps the first N data blocks of the segment. To demonstrate what goes on, I’ve extracted the content of the first row (id = 0) after each of the four dumps:

After the first update - the column count (cc) is zero and the "next rowid" (nrid) is row 1 of block 0x0140000b

tab 0, row 0, @0xb3
tl: 9 fb: --H----- lb: 0x2  cc: 0
nrid:  0x0140000b.1

After the second update - the next rowid is row 7 of block 0x0140000c

tab 0, row 0, @0xb3
tl: 9 fb: --H----- lb: 0x1  cc: 0
nrid:  0x0140000c.7

After the third update (shrinking the row) the row hasn't moved from block 0x0140000c

tab 0, row 0, @0xb3
tl: 9 fb: --H----- lb: 0x2  cc: 0
nrid:  0x0140000c.7

After the fourth update (making space, and growing the row too much) the row moves back home

tab 0, row 0, @0x4c1
tl: 1208 fb: --H-FL-- lb: 0x2  cc: 2
col  0: [ 1]  80
col  1: [1200]
78 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

My calls to dump blocks included the blocks where the row migrated to, so we’ll have a look at the target locations (as given by the original row location’s nrid) in those blocks over time. First we check block 0x0140000b, row 1 after the first two migrations:

tab 0, row 1, @0x1d7f
tl: 414 fb: ----FL-- lb: 0x2  cc: 2
hrid: 0x0140000a.0
col  0: [ 1]  80
col  1: [400]

tab 0, row 1, @0x1d7f
tl: 2 fb: ---DFL-- lb: 0x1

After the first migration (the row arrives here) we have a “head rowid” (hrid) pointer telling us where the row came from. After the second migration, when the row has moved on, we simply have a typical “deleted stub” – two bytes reserving the row directory entry until the commit has been done and cleaned out.

Then we can examine the second target (0x140000c, row 7) after the second, third, and fourth updates:


tab 0, row 7, @0x1966
tl: 814 fb: ----FL-- lb: 0x2  cc: 2
hrid: 0x0140000a.0
col  0: [ 1]  80
col  1: [800]

tab 0, row 7, @0xb1
tl: 62 fb: ----FL-- lb: 0x1  cc: 2
hrid: 0x0140000a.0
col  0: [ 1]  80
col  1: [50]

tab 0, row 7, @0xb1
tl: 2 fb: ---DFL-- lb: 0x2

As you can see, on arrival this location gets the original rowid as its “head rowid” (hrid), and it knows nothing about the intermediate block where the row was briefly in transit. I’ve copied the length byte (in square brackets) of column 1 in the dumps so that you can see that the row stayed put as it shrank. We can then see on the last update that we are left with a deleted stub in this block as the row migrates back to its original location when we try to extend it beyond the free space in this block.

Migrated rows are only ever one step away from home. It’s not nice to have too many of them, but it’s not necessarily a disaster.
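
If you want to check how bad things are, the traditional approach is the one below – bearing in mind that it counts migrated rows and genuinely chained rows together:

-- the chained_rows table is created by $ORACLE_HOME/rdbms/admin/utlchain.sql

analyze table t1 list chained rows into chained_rows;

select	count(*)
from	chained_rows
where	table_name = 'T1';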

RAC Plans

Filed under: Execution plans,Hints,Oracle,RAC,Troubleshooting — Jonathan Lewis @ 1:12 pm BST Feb 10,2014

Recently appeared on MoS – “Bug 18219084 : DIFFERENT EXECUTION PLAN ACROSS RAC INSTANCES”

Now, I’m not going to claim that the following applies to this particular case – but it’s perfectly reasonable to expect to see different plans for the same query on RAC, and it’s perfectly possible for the two different plans to have amazingly different performance characteristics; and in this particular case I can see an obvious reason why the two nodes could have different plans.

Here’s the query reported in the bug:

SELECT /*+ INDEX(C IDX3_GOD_USE_LOT)*/
   PATTERN_ID, STB_TIME
    FROM mfgdev.MTR_AUTO_GOD_AGENT_BT C
   WHERE 1 = 1
     AND EXISTS (SELECT /*+ INDEX(B IDX_MTR_STB_LOT_EQP)*/ 1
            FROM MFGDEV.MTR_STB_BTH B
           WHERE B.PATTERN_ID = C.PATTERN_ID
             AND B.STB_TIME = C.STB_TIME
             AND B.ACTUAL_START_TIME < SYSDATE
             AND EXISTS (SELECT /*+ INDEX(D CW_LOTID)*/
                   1
                    FROM F14DM.DM_CW_WIP_BT D
                   WHERE D.LOT_ID = B.LOT_ID
                     AND D.SS = 'BNKI'));

See the reference to “sysdate”. I can show you a system where you had a 15-minute window each day (until the problem was addressed) to optimize a particular query if you wanted a good execution plan; if you optimized it at any other time of day you got a bad plan – and the problem was sysdate: it acts like a peeked bind variable.

Maybe, on this system, if you optimize the query at 1:00 am you get one plan, and at 2:00 am you get another – and if those two optimizations occur on different nodes you’ve just met the conditions of this bug report.

Here’s another thought to go with this query: apparently it’s caused enough problems in the past that someone’s written a couple of hints into the code. With three tables and two correlated subqueries in the code, a total of three index() hints is not enough. If you’re going to hard-code hints into a query then take a look at the outline it generates when it does the right thing – that will tell you about the 15 or so hints you’ve missed out. (Better still, consider generating a SQL Plan Baseline from the hinted code and attaching it to the unhinted code.)
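
For reference – the ‘advanced’ format option of dbms_xplan.display_cursor() includes an “Outline Data” section that lists the full set of hints (the sql_id and child number here are placeholders):

select	*
from	table(dbms_xplan.display_cursor('&m_sql_id', 0, 'advanced'));

And dbms_spm.load_plans_from_cursor_cache() is the function for capturing the good cached plan as a baseline – it has an overload taking a sql_handle, which is how you’d attach the plan captured from the hinted text to the unhinted statement.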

February 9, 2014

IOT Catalogue

Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 6:52 pm BST Feb 9,2014

I’ve just added a catalogue of Richard Foote’s articles on IOTs to the list I made a couple of years ago of Martin Widlake’s articles, so this is just a temporary note to point people to the updated list.
