Oracle Scratchpad

November 7, 2016

Reorg

Filed under: fragmentation,humour,Oracle — Jonathan Lewis @ 5:31 pm GMT Nov 7,2016

A current question on the OTN database forum asks: “What’s the difference between object and tablespace reorganization?” Here’s an analogy to address the question.

I have three crates of Guiness in the boot (trunk) of my car, one crate has 4 bottles left, one has 7 bottles left and one has 2 bottles. I also have two cases of Louis Roederer Brut NV champagne, one case has 2 bottles left and one has only one. (I have two objects in my tablespace – one of type Beer, one of type Champagne – and my boot requires manual free space management .)

I move all the Guiness bottles into a single crate and all the champagne bottles into a single case. That’s a couple of “shrink space compact” calls – I’ve re-organised the objects to get all the bottles in each object close to each other, but the crates are still taking up space in the boot.

I take the two empty crates and the empty case out of the boot. That’s a couple of “resize” (or “shrink space” without “compact”) calls that free up space in the boot.

I now want to load a port barrel into car, but it won’t fit until I slide the remaining beer crate and champagne case together at one side of the boot. That’s a couple of “move” commands that have reorganized the boot (tablespace) to make the free space usable.

 

September 29, 2016

IOT limitation

Filed under: Execution plans,Infrastructure,IOT,Oracle — Jonathan Lewis @ 10:17 am GMT Sep 29,2016

In the right circumstances Index Organized Tables (IOTs) give us tremendous benefits – provided you use them in the ideal fashion. Like so many features in Oracle, though, you often have to compromise between the benefit you really need and the cost of the side effect that a feature produces.

The fundamental design targets for an IOT are that you have short rows and only want to access them through index range scans of primary key. The basic price you pay for optimised access is the extra work you have to do as you insert the data. Anything you do outside the two specific targets is likely to lead to increased costs of using the IOT – and there’s one particular threat that I’ve mentioned twice in the past (here and here). I want to mention it one more time with a focus on client code and reporting.


create table iot1 (
        id1     number(7.0),
        id2     number(7.0),
        v1      varchar2(10),
        v2      varchar2(10),
        padding varchar2(500),
        constraint iot1_pk primary key(id1, id2)
)
organization index
including id2
overflow
;

insert into iot1
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        mod(rownum,311)                 id1,
        mod(rownum,337)                 id2,
        to_char(mod(rownum,20))         v1,
        to_char(trunc(rownum/100))      v2,
        rpad('x',500,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 ; commit; begin dbms_stats.gather_table_stats( ownname => user,
                tabname          => 'IOT1'
                method_opt       => 'for all columns size 1'
        );
end;
/

alter system flush buffer_cache;

select table_name, blocks from user_tables where table_name = 'IOT1' or table_name like 'SYS_IOT_OVER%';
select index_name, leaf_blocks from user_indexes where table_name = 'IOT1';

set autotrace traceonly
select max(v2) from iot1;
set autotrace off

I’ve created an index organized table with an overflow. The table definition places all columns after the id2 column into the overflow segment. After collecting stats I’ve then queried the table with a query that, for a heap table, would produce a tablescan as the execution plan. But there is no “table”, there is only an index for an IOT. Here’s the output I get (results from 11g and 12c are very similar):

TABLE_NAME               BLOCKS
-------------------- ----------
SYS_IOT_OVER_151543        8074
IOT1

INDEX_NAME           LEAF_BLOCKS
-------------------- -----------
IOT1_PK                      504

---------------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |     4 | 99793   (1)| 00:00:04 |
|   1 |  SORT AGGREGATE       |         |     1 |     4 |            |          |
|   2 |   INDEX FAST FULL SCAN| IOT1_PK |   100K|   390K| 99793   (1)| 00:00:04 |
---------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
     100376  consistent gets
       8052  physical reads

The index segment has 504 leaf blocks, the overflow segment has 8,074 used blocks below the high water mark. The plan claims an index fast full scan of the index segment – but the physical reads statistic looks more like a “table” scan of the overflow segment. What’s actually happening ?

The 100,000+ consistent reads should tell you what’s happening – we really are doing an index fast full scan on the index segment, and for each index entry we go to the overflow segment to find the v2 value. Oracle doesn’t have a mechanism for doing a “tablescan” of just the overflow segment – even though the definition of the IOT ought (apparently) to be telling Oracle exactly which columns are in the overflow.

In my particular test Oracle reported a significant number of “db file scattered read” waits against the overflow segment, but these were for “prefetch warmup”; in a normal system with a buffer cache full of other data this wouldn’t have happened. The other interesting statistic that showed up was “table fetch continued row” – which was (close to) 100,000, again highlighting that we weren’t doing a normal full tablescan.

In terms of normal query processing this anomaly of attempted “tablescans” being index driven probably isn’t an issue but, as I pointed out in one of earlier posts on the topic, when Oracle gathers stats on the “table” it will do a “full tablescan”. If you have a very large table with an overflow segment it could be a very slow process – especially if you’ve engineered the IOT for the right reason, viz: the data arrives in the wrong order relative to the order you want to query it, and you’ve kept the rows in the IOT_TOP short by dumping the rarely used data in the overflow. With this in mind you might want to make sure that you write a bit of special code that gathers stats only on the columns you know to be in the IOT_TOP, creates representative numbers for the other columns, then locks the stats until the next time you want to refresh them.

 

September 13, 2016

Securefile space

Filed under: fragmentation,Infrastructure,LOBs,Oracle,Troubleshooting — Jonathan Lewis @ 7:29 am GMT Sep 13,2016

Here’s a little script I hacked together a couple of years ago from a clone of a script I’d been using for checking space usage in the older types of segments. Oracle Corp. eventually put together a routine to peer inside securefile LOBs:

rem
rem	Script:		dbms_space_use_sf.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Dec 2013
rem	Purpose:	
rem
rem	Last tested 
rem		12.1.0.2
rem		11.2.0.4
rem	Not tested
rem		11.1.0.7
rem	Not relevant
rem		10.2.0.5
rem		 9.2.0.8
rem		 8.1.7.4
rem
rem	Notes:
rem	See also dbms_space_use.sql
rem
rem	11g introduced securefiles lobs and two overloads of 
rem	dbms_space_usage to report space used by their segments
rem
rem	Valid values for suoption are:
rem		SPACEUSAGE_EXACT (16): Computes space usage exhaustively
rem		SPACEUSAGE_FAST  (17): Retrieves values from in-memory statistics
rem
rem	This version allows for partitioned objects, could delete
rem	lines to parameter 4 and partition names to eliminate
rem	the complaints about substitution variables.
rem


define m_seg_owner	= &1
define m_seg_name	= &2
define m_seg_type	= '&3'
define m_part_name	= &4

define m_segment_owner	= &m_seg_owner
define m_segment_name	= &m_seg_name
define m_segment_type	= '&m_seg_type'
define m_partition_name = &m_part_name

@@setenv

execute snap_enqueues.start_snap
execute snap_events.start_snap
execute snap_my_stats.start_snap

spool dbms_space_use_sf

prompt	============
prompt	Secure files
prompt	============

declare
	wrong_ssm	exception;
	pragma exception_init(wrong_ssm, -10614);

	m_segment_size_blocks	number(12,0);
	m_segment_size_bytes	number(12,0);
	m_used_blocks		number(12,0);
	m_used_bytes		number(12,0);
	m_expired_blocks	number(12,0);
	m_expired_bytes		number(12,0);
	m_unexpired_blocks	number(12,0);
	m_unexpired_bytes	number(12,0);

begin
	dbms_space.space_usage(
		upper('&m_segment_owner'),
		upper('&m_segment_name'),
		upper('&m_segment_type'),
		suoption		=> dbms_space.spaceusage_exact,	
--		suoption		=> dbms_space.spaceusage_fast,
		segment_size_blocks	=> m_segment_size_blocks,
		segment_size_bytes	=> m_segment_size_bytes,
		used_blocks		=> m_used_blocks,
		used_bytes		=> m_used_bytes,
		expired_blocks		=> m_expired_blocks,
		expired_bytes		=> m_expired_bytes,
		unexpired_blocks	=> m_unexpired_blocks,
		unexpired_bytes		=> m_unexpired_bytes,
		partition_name		=> upper('&m_partition_name')
	);

	dbms_output.new_line;
	dbms_output.put_line(' Segment Blocks:   ' || to_char(m_segment_size_blocks,'999,999,990') || ' Bytes: ' || to_char(m_segment_size_bytes,'999,999,999,990')); 
	dbms_output.put_line(' Used Blocks:      ' || to_char(m_used_blocks,'999,999,990')         || ' Bytes: ' || to_char(m_used_bytes,'999,999,999,990')); 
	dbms_output.put_line(' Expired Blocks:   ' || to_char(m_expired_blocks,'999,999,990')      || ' Bytes: ' || to_char(m_expired_bytes,'999,999,999,990')); 
	dbms_output.put_line(' Unexpired Blocks: ' || to_char(m_unexpired_blocks,'999,999,990')    || ' Bytes: ' || to_char(m_unexpired_bytes,'999,999,999,990')); 

exception
	when wrong_ssm then
		dbms_output.put_line('Segment not ASSM');
end;
/

prompt	===============
prompt	Generic details
prompt	===============

declare
	m_total_blocks			number;
	m_total_bytes			number;
	m_unused_blocks			number;
	m_unused_bytes			number;
	m_last_used_extent_file_id	number;
	m_last_used_extent_block_id	number;
	m_last_used_block		number;
begin
	dbms_space.unused_space(
		segment_owner		=> upper('&m_segment_owner'),
		segment_name		=> upper('&m_segment_name'),
		segment_type		=> upper('&m_segment_type'),
		total_blocks		=> m_total_blocks,
		total_bytes 		=> m_total_bytes, 
		unused_blocks		=> m_unused_blocks,  
		unused_bytes		=> m_unused_bytes,
		last_used_extent_file_id 	=> m_last_used_extent_file_id, 
		last_used_extent_block_id	=> m_last_used_extent_block_id,
		last_used_block		=> m_last_used_block,
		partition_name		=> upper('&m_partition_name')
	);

	dbms_output.put_line('Segment Total blocks: ' || to_char(m_total_blocks,'999,999,990'));
	dbms_output.put_line('Object Unused blocks: ' || to_char(m_unused_blocks,'999,999,990'));

end;
/

-- execute snap_my_stats.end_snap
-- execute snap_events.end_snap
-- execute snap_enqueues.end_snap

spool off

Sample of output (from a slightly older version of the code):


============
Secure files
============

 Segment Blocks:    168960 Bytes: 1384120320
 Used Blocks:       151165 Bytes: 1238343680
 Expired Blocks     17795 Bytes: 145776640
 Unexpired Blocks   0 Bytes: 0

PL/SQL procedure successfully completed.

===============
Generic details
===============
Segment Total blocks: 168960
Object Unused blocks: 0

PL/SQL procedure successfully completed.


September 11, 2016

Basicfile LOBs 6

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 7:48 pm GMT Sep 11,2016

One of the nice things about declaring your (basicfile) LOBs as “enable storage in row” is that the block addresses of the first 12 chunks will be listed in the row and won’t use the LOB index, so if your LOBs are larger than 3960 bytes but otherwise rather small the LOB index will hold only the timestamp entries for deleted LOBs. This makes it just a little easier to pick out the information you need when things behave strangely, so in this installment of my series I’m going to take about an example with with storage enabled in row.

I’m going to demonstrate an issue that is causing a major problem. First I’m going to build a table with a LOB column with multiple (4) freepools – because that’s what you do to handle concurrency – then I’m going to start 4 sessions (carefully checking that I have one associated with each free pool) and do a few thousand inserts with commits from each session. The size of the LOB value I insert will be 20KB so it will be “indexed” in the row but stored out of the row taking 3 LOB blocks.

Once I’ve got the data in place I’m going to use three of the sessions to delete three quarters of the rows from the table then use a call to the dbms_space package to show you that the segment contains virtually no free space. I’ve engineered the code so that it will take just three more rows in the table to fill the available free space and force Oracle either to allocate a new extent or to start reclaiming some of the delete reusable LOB space – and I’m going to run that insert from the session that DIDN’T delete any rows.

I’ve been running these tests on 11.2.0.4, but get similar behaviour on 12c.

create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as 
    basicfile 
    text_lob(
            enable storage in row
            chunk 8k
            retention
            nocache
            logging
            freepools 4
            tablespace test_8k_assm
)
;

declare
        m_v1 varchar2(32767) := rpad('X',20000,'X');
begin
        for i in 0..0 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/

truncate table t1
;

You’ll notice I’ve used the retention keyword.  Before I built the LOB I set my undo_retention to 10 seconds so that the space from deleted LOBs should become available for reuse very quickly. The name of the tablespace I’ve used for the LOB is a clue that I’m using an 8KB block size and ASSM (the latter is a requirement of the retention option).

Here’s the code to check which freepool (0 to 3) a session will be associated with (this isn’t documented, but seems to be correct);


select mod(pid,4) from v$process where addr = (
        select paddr from v$session where sid = (
                select sid from v$mystat where rownum = 1
        )
)
;

So I can keep starting sessions and running that query until I’ve got a session covering each freepool. (The first time I tried this I had to start 7 sessions before I got all 4 freepools covered). Now I can run the following from all 4 sessions concurrently:

define m_loop_counter = 12027

lock table t1 in row share mode;
commit;

declare
        m_v1 varchar2(32767) := rpad('x',20000,'x');
begin
        for i in 1..&m_loop_counter loop
                insert into t1 values (s1.nextval, m_v1);
                commit;
        end loop;
end;
/

The purpose of the lock table command is to ensure that all 4 processes start running simultaneously. From a fifth session I execute a “lock table t1 in exclusive mode” before starting the other four sessions running, so they all queue on the exclusive lock; then I commit from the fifth session and away we go. The whole thing took about 30 seconds to run. The rather random-looking value 12,027 was a careful choice to ensure that the last extent in the segment had just a few blocks left – and I used my “dbms_space_use.sql” script to check this, getting the following output:

====
ASSM
====

Unformatted                   :        7 /       57,344
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :  144,324 / ############

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145536
Object Unused blocks: 0

I’ve got 7 “unformatted” blocks in the segment – though in fact these might be “formatted but free” from the perspective of the LOB code.

After going to sessions 0, 1, and 3 and deleting 12,000 rows from each in turn (and committing, leaving a total of 12,108 rows in the table) the report doesn’t change: I haven’t made any space free I’ve simply flagged it in the LOB index as “reusable”. So now we go to session 2 and run the following code 3 times – with “set timing on”:


SQL> l
  1  declare
  2     m_v1 varchar2(32767) := rpad('x',20000,'x');
  3  begin
  4     for i in 1..1 loop
  5             insert into t1 values (s1.nextval, m_v1);
  6             commit;
  7     end loop;
  8* end;

The first run took 0.02 seconds – and the unformatted count dropped to 4

The second run took 0.01 seconds – and the unformatted count dropped to 1

The third run took 10.74 seconds, of which 9 seconds was CPU. The session generated 500,000 redo entries totalling 100MB of redo from 1 million db block changes after doing 8.4 million logical I/Os, issuing 108,000 enqueue (lock) requests and running 108,000 index range scans. The report of space usage ended up looking like this:


Unformatted                   :  108,125 /  885,760,000
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :   36,333 /  297,639,936

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145664
Object Unused blocks: 0

My session has cleared every single piece of re-usable space from the LOB and made it free (unformatted) before allocating space for its one LOB. (That’s going to hurt when the client has 2 million LOBs on the reusable list and isn’t running everything on SSDs – which is why I’m working on this problem).

If you’re wondering why it takes so much redo and so many buffer visits to free 36,000 LOBs this (roughly) is what Oracle does to free up one reusable LOB of 3 blocks – which corresponds to a single index entry carrying three block ids:

  • Find the lowest index entry in the freepool, pin the index leaf block
  • Identify the last block in the list of 3
  • Lock the relevant L1 space management block for the segment and set relevant “bit” to “unformatted”
  • Delete the index entry
  • Re-insert the index entry with one block id removed
  • Commit and unlock the L1 bitmap block
  • Repeat delete/insert the cycle for 2nd block id
  • Repeat the cycle for 3rd (or 1st since we’re going backwards) block id – but don’t re-insert the index entry

Oracle reclaims one block (chunk) at a time. And that’s a bit of a clue to a possible workaround because event 44951 gets mentioned a couple of times in MoS and on the internet as a workaround to a particular problem of HW enqueue waits for LOBS. MoS note 740075.1 tells us:

When using Automatic Segment Space Management (ASSM), and the fix for Bug 6376915 has been applied in your database (Included in 10.2.0.4 +) it is possible to adjust the number of chunks that are cleaned up when the chunk cleanup operation is required.

This can be enabled by setting event 44951 to a value between 1 and 1024 (default is 1). With the value between 1 and 1024 setting the number of chunks to be cleaned up each time a chunk reclaimation operation occurs. This can therefore reduce the number of requests for the High Watermark Enqueue.

Other notes explain that by default only one chunk is cleaned up at a time – which is exactly the behaviour I’m seeing. So what happens when I bounce the database with this event set at level 5 (an arbitrary choice, but larger than the LOBs I’ve been inserting) in the parameter file and repeat the experiment ? On the first attempt it made no difference, but then I changed the experiment slightly and started again. Initially I had done my first re-insert from the one session that hadn’t deleted any rows – which made it an extreme boundary condition; on the second attempt I deleted two rows from the session that had not yet deleted any data (and waited for the retention time to elapse) before doing the inserts from that session.

Deleting two rows would put 6 blocks (in two index entries) onto my re-usable list, so I was starting the inserts with 7 free blocks, 6 reusable blocks and the event set to level 5. Here’s what I saw as I inserted rows one by one.

  • Insert one row: “Unformatted” blocks went up to 9:  I had freed 5 of the reusable blocks then used 3 of them for my lob (7 + 5 – 3 = 9)
  • Insert one row: “Unformatted” blocks went down to 7: I had freed the last reusable block then used 3 blocks for my lob (9 + 1 – 3 = 7)
  • Insert one row: “Unformatted” blocks went down to 4
  • Insert one row: “Unformatted” blocks went down to 1
  • Insert one row: Oracle cleared all the reusable space (11 seconds, 500MB redo), then added an extent (!) to the segment and used 2 of its blocks for part of the new LOB.

So the event isn’t really connected with my problem – though it adds some efficiency to the processing – and my  “boundary condition” is one that’s likely to occur fairly frequently if you’ve got a basicfile LOB defined with multiple freepools. Fortunately it’s probably going to require two pre-conditions before it’s a big problem: first that you’re handling a large number of LOBs and second that your pattern of inserting and deleting is not symmetric – it’s when you use a large number of concurrent sessions for small batches of inserts but a single session for large bulk deletes that all hell can break loose shortly after a delete.

tl;dr

As with many other features of Oracle, skew plays a part in making things break. If you’re doing lots of inserts and deletes of basicfile lobs make sure the mechanisms you use for inserting and deleting look similar: in particular similar numbers of processes to do similar amounts of work for both operations.

 

P.S. It gets worse.

P.P.S. Don’t even start to think that you can work around this by using securefiles.

P.P.P.S. I got an hint from one test that if a reusable LOB is exactly the same size as the LOB being inserted then Oracle very cleverly takes the entry index entry and rewrites it to be the LOB index entry rather than freeing (and then potentially using) the space it identifies.

 

Basicfile LOBS 5

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 6:49 pm GMT Sep 11,2016

At the end of the last installment we had seen a test case that caused Oracle to add a couple of redundant new extents to a LOB segment after one process deleted 3,000 LOBs and another four concurrent processes inserted 750 LOBs each a few minutes later (after the undo retention period had elapsed). To add confusion the LOBINDEX seemed to show that all the “reusable” chunks had been removed from the index which suggests that they should have been re-used. Our LOB segment started at 8,192 blocks, is currently at 8,576 blocks and is only using 8,000 of them.

How will things look if I now connect a new session (which might be associated with a different freepool), delete the oldest 3,000 LOBs, wait a little while, then get my original four sessions to do their concurrent inserts again ? And what will things look like after I’ve repeated this cycle several times ?

I had to drop the tables from my original test since writing the previous article, so the following results start from recreating the whole test from scratch and won’t align perfectly with the previous sets of results. Here’s what the index treedump looked like after going through the serial delete / concurrent insert cycle 12 times:

----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 71, level: 1)
   leaf: 0x1800223 25166371 (-1: nrow: 0 rrow: 0)
   leaf: 0x1800227 25166375 (0: nrow: 0 rrow: 0)
   leaf: 0x1800236 25166390 (1: nrow: 0 rrow: 0)
   leaf: 0x180023d 25166397 (2: nrow: 63 rrow: 63)
   leaf: 0x1800206 25166342 (3: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (4: nrow: 81 rrow: 81)
   leaf: 0x1800229 25166377 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 81)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (11: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (12: nrow: 126 rrow: 126)

   leaf: 0x1800266 25166438 (13: nrow: 0 rrow: 0)
   leaf: 0x180025e 25166430 (14: nrow: 39 rrow: 39)
   leaf: 0x1800262 25166434 (15: nrow: 81 rrow: 81)
   leaf: 0x1800243 25166403 (16: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (17: nrow: 76 rrow: 76)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x180024a 25166410 (24: nrow: 118 rrow: 118)

   leaf: 0x1800263 25166435 (25: nrow: 0 rrow: 0)
   leaf: 0x180024c 25166412 (26: nrow: 0 rrow: 0)
   leaf: 0x1800254 25166420 (27: nrow: 0 rrow: 0)
   leaf: 0x1800264 25166436 (28: nrow: 1 rrow: 0)
   leaf: 0x1800274 25166452 (29: nrow: 2 rrow: 0)
   leaf: 0x180027c 25166460 (30: nrow: 2 rrow: 0)
   leaf: 0x180025d 25166429 (31: nrow: 2 rrow: 0)
   leaf: 0x1800241 25166401 (32: nrow: 2 rrow: 0)
   leaf: 0x1800245 25166405 (33: nrow: 2 rrow: 0)
   leaf: 0x1800265 25166437 (34: nrow: 1 rrow: 0)
   leaf: 0x1800251 25166417 (35: nrow: 3 rrow: 0)
   leaf: 0x1800249 25166409 (36: nrow: 4 rrow: 0)
   leaf: 0x1800242 25166402 (37: nrow: 1 rrow: 0)
   leaf: 0x1800255 25166421 (38: nrow: 2 rrow: 0)
   leaf: 0x1800259 25166425 (39: nrow: 3 rrow: 0)
   leaf: 0x1800246 25166406 (40: nrow: 1 rrow: 0)

   leaf: 0x1800214 25166356 (41: nrow: 38 rrow: 0)
   leaf: 0x1800218 25166360 (42: nrow: 81 rrow: 0)
   leaf: 0x180021c 25166364 (43: nrow: 81 rrow: 0)
   leaf: 0x1800220 25166368 (44: nrow: 0 rrow: 0)
   leaf: 0x180022d 25166381 (45: nrow: 26 rrow: 26)
   leaf: 0x1800231 25166385 (46: nrow: 81 rrow: 81)
   leaf: 0x1800219 25166361 (47: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (48: nrow: 81 rrow: 81)
   leaf: 0x1800239 25166393 (49: nrow: 81 rrow: 81)
   leaf: 0x180022c 25166380 (50: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (51: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (52: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (53: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (54: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (55: nrow: 81 rrow: 81)
   leaf: 0x180023b 25166395 (56: nrow: 79 rrow: 79)
   leaf: 0x180023f 25166399 (57: nrow: 81 rrow: 81)
   leaf: 0x1800208 25166344 (58: nrow: 81 rrow: 81)
   leaf: 0x180020c 25166348 (59: nrow: 81 rrow: 81)
   leaf: 0x1800210 25166352 (60: nrow: 120 rrow: 120)

   leaf: 0x180021d 25166365 (61: nrow: 0 rrow: 0)

   leaf: 0x1800248 25166408 (62: nrow: 21 rrow: 21)
   leaf: 0x1800268 25166440 (63: nrow: 81 rrow: 81)
   leaf: 0x180026c 25166444 (64: nrow: 152 rrow: 152)
   leaf: 0x180026b 25166443 (65: nrow: 152 rrow: 152)
   leaf: 0x180026f 25166447 (66: nrow: 152 rrow: 152)
   leaf: 0x1800273 25166451 (67: nrow: 152 rrow: 152)
   leaf: 0x1800277 25166455 (68: nrow: 152 rrow: 152)
   leaf: 0x180027b 25166459 (69: nrow: 66 rrow: 66)
----- end tree dump

As usual I’ve split the treedump into the sections that reflect the freepools, each of which could consist of two parts the LOBs (key values starting with even numbers) and the “reusable chunks” (key values starting with odd numbers). The dump suggests that things have worked well: as you can see it’s grown a few blocks after my 12 cycles but there are only 6 sections (not the full 8 that might be there), and only a few leaf blocks showing “empty” (rrows = 0). As “reusable” sections have appeared the index has grown a little, then the reusable entries have been taken off the index and the index has shrunk a bit; you can even see that freepool 3 (the highest numbered one) is still showing a pattern of 152 LOBs indexed per block – this is despite the fact that at one point a reusable section for freepool 3 (00 07) appeared above this section and then disappeared as those reusable chunks were reclaimed.

All in all the index seems to be behaving extremely well, with only a little growth and (probably temporarily) a couple of little glitches of empty leaf blocks.

Here’s the dump of the (slightly edited) “col 0” values to confirm where the freepool breaks were –

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 2c cc
 1:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 31 61
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 33 61
 3:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 36
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 37 1e
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 38 37
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 39 1e
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3a 37
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3b 50
 9:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 3c
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3d 4b
11:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 3e
12:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3e e7

13:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 25 9b
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 32 a0
15:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 34 1d
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 36 c6
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 39 3d
18:     col 0; len  9; ( 9):  00 02 00 01 00 00 09 df 3d
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 3f 52
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 40 cf
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 20
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 71
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 c2
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 42 13

25:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
26:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
27:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
28:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
29:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
30:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
31:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
32:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
33:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
34:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
35:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
36:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
37:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
38:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
39:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
40:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00

41:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 26 52
42:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2a 27
43:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2a dc
44:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2b 2d
45:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 31 34
46:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 33 15
47:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 34 92
48:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 34 e3
49:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 34
50:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 85
51:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 d6
52:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 36 27
53:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 38 6c
54:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 38 ef
55:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3a d0
56:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3c 4d
57:     col 0; len  9; ( 9):  00 04 00 01 00 00 09 df 3d
58:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3e 4b
59:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3f 96
60:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 40 7d

61:     col 0; len 10; (10):  00 05 57 bc b9 db 00 00 00 00

62:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 32 5b
63:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 33 a6
64:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 36 4f
65:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 38 13
66:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3a 09
67:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3b cd
68:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3d 91
69:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3f 23

As you can see at leaf block 61 we didn’t quite empty the reusable list from freepool 2 (00 05 = 2 * 2 + 1), and leaf blocks 25 to 40 tell us that freepool 1 (00 03 = 2 * 1 + 1) was the freepool used on the last big delete. Despite the odd little glitches it looks as if this strategy of “deleted LOBs go to my process’ freepool” seems to do a good job of reusing index space.

But there IS a problem. Here’s the output from a script I wrote using the dbms_space package to show how space in the LOB segment has been used:


Unformatted                   :    4,508 /   36,929,536
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :    8,000 /   65,536,000

Segment Total blocks: 12672
Object Unused blocks: 0

The LOB segment has grown from an initial 8,192 blocks with a few unformatted blocks and 8,000 used blocks (2 blocks per LOB, 4,000 LOBs) to 12,672 blocks with 4,508 blocks unformatted. (The difference between Full + Unformatted and Segment Total blocks is the set of bitmap space management blocks for the segment). After only 12 cycles we have “leaked” an overhead of 50% of our real data space – maybe this helps to explain why the client that started me down this set of blogs has seen Oracle allocate 700GB to hold just 200GB of LOBs.

The tablespace is declared as locally managed with 1MB uniform extents and automatic segment space management. By writing a simple script I can get Oracle to write a script to dump the first block of each extent – and they will all be Level 1 bitmap space management blocks. Running grep against the trace file I can pick out the lines that tell me how many data blocks are mapped by the bitmap and how many of them have been formatted or not. This is the result (I have 99 extents in the segment – 99 * 128 = 12,672):

   unformatted: 0       total: 64        first useful block: 4
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 24      total: 64        first useful block: 2
   unformatted: 52      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 26      total: 64        first useful block: 2
   unformatted: 42      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 42      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 26      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 127     total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 108     total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 105     total: 128       first useful block: 1
   unformatted: 96      total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 38      total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 65      total: 128       first useful block: 1
   unformatted: 98      total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1

You’ll notice that for the first 63 extents Oracle says there are 64 blocks mapped by the bitmap with the first useful block at block 2, thereafter it says there are 128 blocks and the first useful block is block 1 (Oracle is counting from zero). While Oracle thinks the segment is “quite small” it allocates two level 1 bitmap blocks per 1MB extent and I’ve only dumped the first block from each extent; but when the segment reaches 64MB Oracle decides that it’s getting pretty big and there’s no point in wasting space so changes to using a single level 1 bitmap block per 1MB extent. It’s just one of those tiny details you discover when you happen to look a little closely and how things work. (On a much larger test with 8MB uniform extents Oracle got to the point where it was using one L1 bitmap block for the whole 8MB.)

There’s a fascinating pattern in the later extents of 3 full extents followed by 3 empty extents – my first guess had been that Oracle was allocating new extents but not using them, but clearly that’s not right, it’s removing “reusable chunks” from the index and then not re-using them but using the new extents instead (some of the time). Something is seriously wrong with the way Oracle is handling the “reusable chunks” part of the index. With a little luck it’s some nasty side effect of the “one process delete / multiple process insert” strategy we have adopted, so: (a) we need to repeat the entire experiment with a concurrent delete mechanism and (b) we need to think about how we might re-engineer a REALLY BIG system that has followed this unfortunate strategy for a long time. Of course if (a) turns out to be a disaster as well we don’t really need to think too hard about (b) until we have discovered a good way of dealing with our rolling pattern of inserts and deletes.

Some (minimum effort, we hope) ideas we will have to look at for (b):

  • Oracle has an option to change the freepools count on a LOB segment – do we need to use it, how much work would it entail, would it require downtime
  • Oracle has an option to “rebuild” the freepools on a LOB segment
  • We can always try the “shrink space compact” option on the LOB
  • Should we just rebuild (move) the LOB segment – and use a larger extent size while we’re at it
  • Should we recreate the table and change the LOB to Securefiles as we do so – and do all the testing all over again
  • If we’re deleting old data on such a regular pattern should we try to bypass the deletes by partitioning the table in some clever way

TO BE CONTINUED.

 

Space Usage

Filed under: fragmentation,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 6:09 pm GMT Sep 11,2016

Here’s a simple script that I’ve used for many years to check space usage inside segments.  The comment about freelist groups may be out of date  – I’ve not had to worry about that for a very long time. There is a separate script for securefile lobs.


rem
rem	Script:		dbms_space_use.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Nov 2002
rem	Purpose:	
rem
rem	Last tested 
rem		12.1.0.2
rem		11.2.0.4
rem	Not tested
rem		11.1.0.7
rem		10.2.0.5
rem		 9.2.0.8
rem	Not relevant
rem		 8.1.7.4
rem
rem	Notes:
rem	For accuracy in free space you (once) needed to set the
rem	scan limit; and for those rare objects cases where you 
rem	had defined multiple freelist groups you still have to
rem	work through each free list group in turn
rem
rem	For the ASSM calls:
rem		FS1	=> 0% - 25% free space
rem		FS2	=> 25% - 50% free space
rem		FS3	=> 50% - 75% free space
rem		FS4	=> 75% - 100% free space
rem		Bytes = blocks * block size
rem
rem	Expected errors:
rem		ORA-10614: Operation not allowed on this segment
rem			(MSSM segment, ASSM call)
rem		ORA-10618: Operation not allowed on this segment
rem			(ASSM segment, MSSM call)
rem		ORA-03200: the segment type specification is invalid
rem			(e.g. for LOBINDEX or LOBSEGMENT)
rem			11g - "LOB" is legal for LOB segments
rem			    - use "INDEX" for the LOBINDEX
rem
rem	For indexes
rem		Blocks are FULL or FS2 (re-usable)
rem
rem	Special case: LOB segments.
rem	The number of blocks reported by FS1 etc. is actually the
rem	number of CHUNKS in use (and they're full or empty). So 
rem	if your CHUNK size is not the same as your block size the
rem	total "blocks" used doesn't match the number of blocks 
rem	below the HWM.
rem
rem	The package dbms_space is created by dbmsspu.sql
rem	and the body is in prvtspcu.plb
rem
rem	11.2 overloads dbms_space.space_usage for securefile lobs
rem	See dbms_space_use_sf.sql
rem
rem	When supplying details about partitions the segment type
rem	can consist of two words (e.g. LOB PARTITION), these 
rem	must be surrounded by quotes to survive the script.
rem
rem	You might want to set up two versions of this code with
rem	all references to partitions removed from one of them
rem	or you have to keep pressing return to bypass the 
rem	requests for substitution variables
rem

define m_seg_owner	= &1
define m_seg_name	= &2
define m_seg_type	= '&3'
define m_part_name	= &4

define m_segment_owner	= &m_seg_owner
define m_segment_name	= &m_seg_name
define m_segment_type	= '&m_seg_type'
define m_partition_name = &m_part_name

@@setenv

spool dbms_space_use

prompt	===================
prompt	Freelist management
prompt	===================

declare
	wrong_ssm	exception;
	pragma exception_init(wrong_ssm, -10618);

	m_free	number(10);
begin
	dbms_space.free_blocks(
		segment_owner		=> upper('&m_segment_owner'),
		segment_name		=> upper('&m_segment_name'),
		segment_type		=> upper('&m_segment_type'),
		partition_name		=> upper('&m_partition_name'),
--		scan_limit		=> 50,
		freelist_group_id	=> 0,
		free_blks		=> m_free
	);
	dbms_output.put_line('Free blocks below HWM: ' || m_free);
exception
	when wrong_ssm then
		dbms_output.put_line('Segment not freelist managed');
end;
/


prompt	====
prompt	ASSM
prompt	====

declare
	wrong_ssm	exception;
	pragma exception_init(wrong_ssm, -10614);

	m_unformatted_blocks	number;
	m_unformatted_bytes	number;
	m_fs1_blocks		number;
	m_fs1_bytes		number;
	m_fs2_blocks		number;  
	m_fs2_bytes		number;

	m_fs3_blocks		number;
	m_fs3_bytes		number;
	m_fs4_blocks		number; 
	m_fs4_bytes		number;
	m_full_blocks		number;
	m_full_bytes		number;

begin
	dbms_space.SPACE_USAGE(
		segment_owner		=> upper('&m_segment_owner'),
		segment_name		=> upper('&m_segment_name'),
		segment_type		=> upper('&m_segment_type'),
		unformatted_blocks	=> m_unformatted_blocks,
		unformatted_bytes	=> m_unformatted_bytes, 
		fs1_blocks		=> m_fs1_blocks , 
		fs1_bytes		=> m_fs1_bytes,
		fs2_blocks		=> m_fs2_blocks,  
		fs2_bytes		=> m_fs2_bytes,
		fs3_blocks		=> m_fs3_blocks,  
		fs3_bytes		=> m_fs3_bytes,
		fs4_blocks		=> m_fs4_blocks,  
		fs4_bytes		=> m_fs4_bytes,
		full_blocks		=> m_full_blocks, 
		full_bytes		=> m_full_bytes,
		partition_name		=> upper('&m_partition_name')
	);


	dbms_output.new_line;
	dbms_output.put_line('Unformatted                   : ' || to_char(m_unformatted_blocks,'999,999,990') || ' / ' || to_char(m_unformatted_bytes,'999,999,999,990'));
	dbms_output.put_line('Freespace 1 (  0 -  25% free) : ' || to_char(m_fs1_blocks,'999,999,990') || ' / ' || to_char(m_fs1_bytes,'999,999,999,990'));
	dbms_output.put_line('Freespace 2 ( 25 -  50% free) : ' || to_char(m_fs2_blocks,'999,999,990') || ' / ' || to_char(m_fs2_bytes,'999,999,999,990'));
	dbms_output.put_line('Freespace 3 ( 50 -  75% free) : ' || to_char(m_fs3_blocks,'999,999,990') || ' / ' || to_char(m_fs3_bytes,'999,999,999,990'));
	dbms_output.put_line('Freespace 4 ( 75 - 100% free) : ' || to_char(m_fs4_blocks,'999,999,990') || ' / ' || to_char(m_fs4_bytes,'999,999,999,990'));
	dbms_output.put_line('Full                          : ' || to_char(m_full_blocks,'999,999,990') || ' / ' || to_char(m_full_bytes,'999,999,999,990'));

exception
	when wrong_ssm then
		dbms_output.put_line('Segment not ASSM');
end;
/


prompt	=======
prompt	Generic
prompt	=======

declare
	m_total_blocks			number;
	m_total_bytes			number;
	m_unused_blocks			number;
	m_unused_bytes			number;
	m_last_used_extent_file_id	number;
	m_last_used_extent_block_id	number;
	m_last_used_block		number;
begin
	dbms_space.unused_space(
		segment_owner		=> upper('&m_segment_owner'),
		segment_name		=> upper('&m_segment_name'),
		segment_type		=> upper('&m_segment_type'),
		total_blocks		=> m_total_blocks,
		total_bytes 		=> m_total_bytes, 
		unused_blocks		=> m_unused_blocks,  
		unused_bytes		=> m_unused_bytes,
		last_used_extent_file_id 	=> m_last_used_extent_file_id, 
		last_used_extent_block_id	=> m_last_used_extent_block_id,
		last_used_block		=> m_last_used_block,
		partition_name		=> upper('&m_partition_name')
	);

	dbms_output.put_line('Segment Total blocks: '  || to_char(m_total_blocks,'999,999,990'));
	dbms_output.put_line('Object Unused blocks: '  || to_char(m_unused_blocks,'999,999,990'));

end;
/

undefine 1
undefine 2
undefine 3
undefine 4

undefine m_seg_owner
undefine m_seg_name
undefine m_seg_type
undefine m_part_name

undefine m_segment_owner
undefine m_segment_name
undefine m_segment_type
undefine m_partition_name

spool off



Here’s a sample of output (from a segment using ASSM):


===================
Freelist management
===================
Segment not freelist managed

PL/SQL procedure successfully completed.

====
ASSM
====

Unformatted                   :  132,385 / ############
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :   12,327 /  100,982,784

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145920
Object Unused blocks: 0

PL/SQL procedure successfully completed.

(I’ve increased the length of the byte-count output since I produced that report ;)

August 24, 2016

Basicfile LOBs 4

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 7:02 pm GMT Aug 24,2016

At the end of the previous installment we saw that a single big batch delete would (apparently) attach all the “reusable” chunks into a single freepool, and asked the questions:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

(Okay, I’ll admit it, the third question is a clue about the answer to the second question.)

I find that this process of asking “what’s good, what’s bad, what could possibly go wrong” is an excellent way to prompt thoughts about why Oracle Corp. might have chosen a particular strategy and what that means in terms of the best (or expected) use of the feature and worst threats from misuse of the feature. So lets’s see what thoughts we can come up with.

  • Good idea: The only alternative to using a single freepool when you make chunks reusable is to spread the chunks uniformly across all the freepools – either putting the chunks onto the same free pool that the LOB was previously attached to or doing some sort of round-robin. If you go for either of these fair-share strategies you increase the amount of contention on LOB deletes if many users are deleting at the same time – which sounds like someething you might want to avoid, but LOBs are supposed to be fairly static (somewhere on MoS there’s a note that says the expected behaviour is pretty much: “we thought you’d write once, read many, and not update”) so surely a small amount of contention shouldn’t be a big problem
  • Bad idea: As mentioned in a previous post, it looks like the freepool picked by a process is dependent on the process id – so if you happen to have just a couple of processes doing large deletes they might, coincidentally, pick the same freepool and end up constantly contending with each other rather than drifting in and out of collisions. If, as often happens with archive-like processes, you use one or two processes to delete a large fraction of the data you end up with one or two freepools holding lots of reusable space and all the other freepools holding no freespace – which brings us to the third question.
  • What happens next: Let’s say 3% of your LOB (one day out of a month) is currently “reusable chunks” and the chunks are all attached to the same freepool; your process connects to insert some new LOBs and its process id identifies the wrong freepool. There are no free blocks below the highwater mark and the retention limit is long gone. Does your process (a) add an extent to create some more free space (this is the type of thing that used to happen with manual segment space management, freelist groups and freelists for non-LOB tables and indexes) or (b) start stealing from another freepool that has reusable chunks. In either case what’s going to happen in the longer term ?
  • What happens even later: Imagine you have 28 days of data and use a single process to delete data on the 29th day. For reasons of concurrency you have been running with freepools 20. If option (a) applies then (assuming everything works perfectly) at steady state you will end up with roughly 20 days worth of reusable chunks spread across your 20 freepools before the system stabilises and stops adding unnecessary extents; if option (b) applies then (assuming everything works perfectly) every night you put a load of reusable chunks on one freepool and all through the day your 20 processes are fighting (at the oldest end of the index) to reuse those chunks. I said in an earlier installment that multiple freepools got rid of “the two hot spots” – this single thread deletion strategy has just brought one of them back.

So what really happens ? By the end of the last installment I had deleted the oldest 3,000 LOBs and found them attached as reusable chunks in freepool 2 with several consecutive “empty”  (nrows=81, rrows=0) leaf blocks at the low end of all the other pools.  After running my 4 concurrent processes to insert 750 rows each (i.e. insert the replacements for the 3,000 rows I’ve deleted) this is what the index treedump looks like (with a little editing to show the main breaks between freepools):


----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 60, level: 1)
   leaf: 0x180020e 25166350 (-1: nrow: 22 rrow: 22)
   leaf: 0x1800212 25166354 (0: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (1: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (2: nrow: 74 rrow: 74)
   leaf: 0x1800239 25166393 (3: nrow: 81 rrow: 81)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (7: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (8: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (9: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (10: nrow: 78 rrow: 78)
   leaf: 0x1800232 25166386 (11: nrow: 151 rrow: 151)
---
   leaf: 0x1800226 25166374 (12: nrow: 0 rrow: 0)
   leaf: 0x180020f 25166351 (13: nrow: 64 rrow: 64)
   leaf: 0x1800213 25166355 (14: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (15: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (16: nrow: 81 rrow: 81)
   leaf: 0x1800265 25166437 (17: nrow: 81 rrow: 81)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x1800242 25166402 (24: nrow: 122 rrow: 122)
---
   leaf: 0x1800229 25166377 (25: nrow: 0 rrow: 0)
   leaf: 0x1800214 25166356 (26: nrow: 36 rrow: 36)
   leaf: 0x1800230 25166384 (27: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (28: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (29: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (30: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (31: nrow: 75 rrow: 75)
   leaf: 0x1800231 25166385 (32: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (33: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (34: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (35: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (36: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (37: nrow: 134 rrow: 134)
---
   leaf: 0x1800215 25166357 (38: nrow: 1 rrow: 0)
   leaf: 0x180026e 25166446 (39: nrow: 4 rrow: 0)
   leaf: 0x180021b 25166363 (40: nrow: 1 rrow: 0)
   leaf: 0x180024b 25166411 (41: nrow: 2 rrow: 0)
   leaf: 0x1800276 25166454 (42: nrow: 2 rrow: 0)
   leaf: 0x180024f 25166415 (43: nrow: 0 rrow: 0)
   leaf: 0x180027e 25166462 (44: nrow: 4 rrow: 0)
   leaf: 0x1800221 25166369 (45: nrow: 0 rrow: 0)
   leaf: 0x180027a 25166458 (46: nrow: 0 rrow: 0)
---
   leaf: 0x1800218 25166360 (47: nrow: 0 rrow: 0)
   leaf: 0x180021c 25166364 (48: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (49: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (50: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (51: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (52: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (53: nrow: 152 rrow: 152)
   leaf: 0x1800253 25166419 (54: nrow: 152 rrow: 152)
   leaf: 0x1800257 25166423 (55: nrow: 152 rrow: 152)
   leaf: 0x180025b 25166427 (56: nrow: 152 rrow: 152)
   leaf: 0x180025f 25166431 (57: nrow: 152 rrow: 152)
   leaf: 0x1800263 25166435 (58: nrow: 1 rrow: 1)
----- end tree dump

Highlights

The number of leaf blocks has dropped from 72 to 60 – I didn’t think that this could happen without an index coalesce or rebuild, but maybe it’s a special feature of LOBINDEXes or maybe it’s a new feature of B-trees in general that I hadn’t noticed. Some of the “known empty” leaf blocks seem to have been taken out of the structure.

We still see the half full / full split between the leaf blocks for the first 3 freepools when compared to the top freepool.

There are still some empty leaf blocks (rrow = 0), but apart from the top freepool no more than one per freepool for the other sections that are indexing LOBs.

The section of index that is the freepool 2 section for “reusable” chunks shows an interesting anomaly. There are some leafblocks that are now empty (rrow=0) but were only holding a few index entries (nrow=1-4 rather than the 75 – 140 entries that we saw in the previous installment) at the moment they were last updated; this suggests a certain level of contention with problems of read-consistency, cleanout, and locking between processes trying to reclaim reusable blocks.

It’s just slightly surprising the the top freepool shows several empty leaf blocks – is this just a temporary coincidence, or a boundary case that means the blocks will never be cleaned and re-used; if it’s a fluke will a similar fluke also reappear (eventually) on the other freepools. Is it something to do with the fact that freepool 2 happened to be the freepool that got the first lot of reusable chunks ? Clearly we need to run a few more cycles of deletes and inserts to see what happens.

We have one important conclusion to make but before we make it let’s look at the partial key “col 0” values in the row directory of the root block just to confirm that the breaks I’ve listed above do correspond to each of the separate freepool sections:

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
 1:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0b
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc
 3:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0d
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0d 51
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 db bf f4
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c0 77
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c1 90
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 77
 9:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 fa
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c4 45
11:     col 0; len ..; (..):  00 00 00 01 00 00 09 db c5

12:     col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
13:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
15:     col 0; len ..; (..):  00 02 00 01 00 00 09 db 0b
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 0b ee
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 db bf 8b
18:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c0 a4
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c2 21
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c3 6c
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c4 21
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c5 9e
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 53
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 d6

25:     col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
26:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
27:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
28:     col 0; len ..; (..):  00 04 00 01 00 00 09 db 0b
29:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d
30:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc 64
31:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc b5
32:     col 0; len ..; (..):  00 04 00 01 00 00 09 db bd
33:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd 51
34:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd a2
35:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd f3
36:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 44
37:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 95

38:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
39:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
40:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
41:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
42:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
43:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
44:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
45:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
46:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

47:     col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
48:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
49:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
50:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
51:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
52:     col 0; len ..; (..):  00 06 00 01 00 00 09 db 0a
53:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a
54:     col 0; len 10; (10):  00 06 00 01 00 00 09 db bf da
55:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c1 6c
56:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c2 cc
57:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c4 90
58:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c6 22

I’ve broken the list and numbered the entries to match the treedump above, so it’s each to check that leaf blocks 38 to 46 are the now empty blocks for the reusable chunks. We started the reload with 3,001 entries for reusable chunks all in one freepool; we’ve ended it with none. Something has “stolen” the reusable chunks from freepool 2 so that they could be used for creating new LOBs that were distributed across all the freepools.

Oracle has been very efficient about re-using the index space, with a little bit of wastage creeping in, perhaps caused by coincidences in timing, perhaps by some code that avoids waiting too long when trying to lock index entries. We have a contention point because of the single threaded delete – but it doesn’t appear to be a disaster for space utilisation. Of course we need to look at the level of contention, and repeat the cycle a few times, changing the freepool used for deletion fairly randomly to see if we just got lucky or if the first few deletes are special cases. We can also ask questions about how the “stealing” takes place – does a process steal one index entry at a time, or does it take several consecutive index entries from the same block while it’s got the leaf block locked – but perhaps we don’t really need to know the fine details, the amount of time spent in contention (typically TX waits of some sort) could tell use whether or not we had a significant problem.

Contention and Resources

For each of the processes running the inserts I took a couple of snapshots – session stats and wait events – to see if anything interesting showed up. Naturally, the closer you look the more strange things you find. Here are a few sets of numbers from v$session_event and v$sesstat (in my snapshot format – with the four sessions always reported in the same order);

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: HW - contention                                985           0          93.15        .095           1
enq: HW - contention                                 10           0           5.46        .546           1
enq: HW - contention                              1,001           0         102.27        .102           1
enq: HW - contention                              1,010           0         106.12        .105           1

db file sequential read                           1,038           0          40.75        .039           2
db file sequential read                              39           0           3.21        .082           1
db file sequential read                           1,038           0          28.33        .027           1
db file sequential read                           1,046           0          34.89        .033           1

Name                                                                     Value
----                                                                     -----
physical reads                                                           1,038
physical reads direct                                                      979

physical reads                                                              39
physical reads direct                                                       19

physical reads                                                           1,038
physical reads direct                                                      998

physical reads                                                           1,046
physical reads direct                                                    1,005

session logical reads                                                  114,060
session logical reads                                                   22,950
session logical reads                                                  104,555
session logical reads                                                   93,173

data blocks consistent reads - undo records applied                      2,165
data blocks consistent reads - undo records applied                        119
data blocks consistent reads - undo records applied                      1,222
data blocks consistent reads - undo records applied                        193

My first thought when looking at the wait events was to get an idea of where most of the time went, and I had expected the HW enqueue to be the most likely contender: this enqueue is held not only when the high water mark for a segment is moved, it’s also held when a process is doing any space management for inserting a LOB. So my first suprise was that one session was hardly waiting at all compared to the other sessions.

Then I noticed that this one session was also suffering a very small number of “db file sequential read” waits compared to every other session – but why were ANY sessions doing lots of db file sequential reads: the LOB was declared as nocache so any reads ought to be direct path reads and although Oracle doesn’t always have to wait for EVERY direct path read we should have read (and rewritten) 1,500 “reusable” LOB chunks by direct path reads in each session – I refuse to believe we never waited for ANY of them. So take a look at the session stats: which show us the that the “db file sequential read” waits match exactly with the “physical reads” count but most of the “physical reads” are recorded “physical reads direct” – Oracle is recording the wrong wait event while reading the “reusable” chunks.

Okay, so our direct path read waits are being recorded incorrectly: but one session does hardly any physical reads anyway – so what does that mean ? It means the process ISN’T reusing the chunks – you can’t be reusing chunks if you haven’t t read them. But the dumps from the index tell us that all the reusable chunks have been reused – so how do we resolve that contradiction ? Something is reading the index to identify some reusable chunks, wiping the reference from the index, then not using the chunks so (a) we’ve got some reusable chunks “going missing” and (b) we must be allocating some new chunks from somewhere – maybe bumping the high water mark of the segment, maybe allocating new extents.

Fortunately I had used the dbms_space package to check what the lob segment looked like after I had loaded it. It was 8192 blocks long, with 66 blocks shown as unused and 8,000 (that’s exactly 2 blocks/chunks per LOB) marked as full. After the delete/insert cycle is was 8,576 blocks long, with 8,000 blocks marked as full and 444 marked as unused. We had added three extents of 1MB each that we didn’t need, and one session seems to have avoided some contention by using the new extents for (most of) its LOBs rather than competing for the reusable space with the other LOBs.

Was this a one-off, or a repeatable event. How bad could it get ?

TO BE CONTINUED.

Post-script

Is there a way of discovering from SQL (perhaps with a low-cost PL/SQL function) the freepool for a LOB when it’s defined as Basicfile. You can get the LOBid for a Securefiles LOB using the dbms_lobutil package and the LOBid includes the critical first two bytes – but the package is only relevant to Securefiles. I rather fancy the idea of a process knowing which freepool it is associated with and only deleting LOBs that come out of that freepool.

Update

A thought about knowing the freepool – you could always add the freepool number as a column to the base table and have a “before row insert” trigger (or some other coding strategy) that populated the column with the freepool id. This would require the application to know a little too much about the implementation and about the internals, and would require a code change somewhere if anyone decided to change freepools. (Possibly VPD/RLS/FGAC could be used to hide the details from the outside world). The code to delete LOBs could then delete only those rows that matched its current freepool. WARNING – this is an idle thought, probably not relevant to (most) people, and there may be better strategies to make the effort irrelevant.

Following a little accident with another test I noticed a pattern that suggested that you would only reuse space on your own freepool and might, at the same time, at the same time moving some of your freepool into segment freespace. After a couple of experiments though it looks as if you use the reusable space in your freepool; if you have no reusable space left in your freepool you start using the segment freespace (ignoring the reusable space in other freepools); if there is no free space in the segment you raid EVERY OTHER FREEPOOL and transfer all their reusable space to segment freespace – which is bad news if there are 200,000 reusable chunks in their freepools. Unfortunately, at the same time, you add an extent to the segment and insert your new LOB into that extent.  There’s still plenty of scope for refining details here; I’ve only tried pushing one session into raiding the other freepools, so I don’t know what happens if there are multiple sessions trying to create LOBs while the raid is going on, who blocks whom, for how long, and what happens about new extents etc. Clearly, though, there’s plenty of scope for nasty things to happen.

 

 

August 22, 2016

Basicfile LOBS 3

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 5:56 pm GMT Aug 22,2016

In the previous article in this mini-series I described how the option for setting freepools N when defining Basicfile LOBs was a feature aimed at giving you improved concurrency for inserts and deletes that worked by splitting the LOBINDEX into 2N sections: N sections to index the current LOB chunks by LOB id, alternating with N sections to map the reusable LOB chunks by deletion time.

In this article we’ll look a little further into the lifecycle of the LOB segment but before getting into the details I’ll just throw out a couple of consequences of the basic behaviour of LOBs that might let you pick the best match for the workload you have to deal with.

  • If you have enabled storage in row the first 12 chunks of a lob will be identified by the LOB Locator stored in the row, so if all your LOBs are sized between 4KB and 96KB (approximately) the LOB Index will consist only of entries for the reusable LOB space due to deleted LOBs even though the LOBs themselves will be stored out of line. This makes it look like a good idea to enable storage in row even when you expect all of your (smallish) LOBs to be stored out of row.
  • It’s quite nice to cache LOBs (at least for a little while) if your pattern of use means you access a specific LOB for a little while before it ceases to be interesting; but LOBs can swamp a large fraction of the buffer cache if you’re not careful. If you expect to follow this pattern of behaviour you might define a RECYCLE cache and then assign the LOB to that cache so that you get the benefits of caching while still protecting the main volume of your buffer cache.
  • Depending on the expected size of your LOBs you may have a good justification for creating a tablespace of a non-standard size for the LOB segment so that it takes fewer block reads to read the entire LOB. If (for example) you have a LOB which is always in the range of 62KB then a tablespace with a blocksize of 32KB would be a good choice because the LOB could be read with just two block reads. A fringe benefit of the non-standard block size, of course, is that you have to define a non-standard cache, which separates the LOB activity from the rest of the buffer cache. (Note: Oracle reads LOBs one chunk at a time, so the number of LOB reads – as opposed to block reads – for a 32KB chunk is the same whether the block size is 8KB or 32KB)
  • If you’re going to be logging your LOBs then remember that nocache LOBs will write entire chunks into the redo logs – think about how much extra redo this might generate: it might be better to have a small recycle cache and cache your LOBS as cached LOBs are logged at the byte level. (You don’t want a 32KB block size, nocache, logging if your LOBs are all 33KB).

The LOB lifetime

Before deciding on the suitability of a feature the first thing to do is define what you’re trying to achieve so that you can think realistically about where the threats may be and what tests are going to be important – so I’m going to describe a scenario, then talk about what threats might appear based on the current details I’ve given about Basicfile LOBs and freepools.

  • We have many processes inserting “small” (16KB to 24KB) LOBs concurrently in bursts during the day.
  • Typically we peak at about 20 processes inserting at the same moment, and we end up with about 100K new LOBs per day though this varies between 50K and 200K.
  • The inserts are all “insert one row; commit”.
  • The LOBs have to be kept for 28 days, after which they (the rows that hold them) are deleted by an overnight batch job.
  • The LOBs have to be logged and the database is running in archivelog mode

As soon as you see the “aged 28 days” you might immediately think “partitioning” (though perhaps your first thought might be that restaurant in Cincinnati airport that hangs its beef to air-dry for 28 days before cooking). Unfortunately not everyone has licensed the partitioning option, so what do you have to worry about when you start to design for this requirement. (We’re also going to assume that securefiles are going to be saved for another blog mini-series).

Clearly we should make use of multple freepools to avoid the insert contention on the LOBINDEX. With about 20 concurrent processes we might immediate go for freepools 20, but we might decide that a smaller number like 4 or 8 is sufficient. We probably ought to do some tests to see if we can discover any penalties for larger numbers of freepools, and to see what sort of contention we get with a smaller number of freepools.

We got a hint from the previous article that when a process deletes a LOB it indexes the reusable chunks in the same freepool as it inserts LOBs – at least, that’s what seemed to happen in our little test case in the previous article. Does Oracle always follow this pattern, or will a multi-row delete, or a large number of single “delete;commt;” cycles spread the reusable chunks evenly across all the available freepools ? If you do a single large delete do you end up with all the reusable space in one freepool – if so, does it matter or should we have multiple processes do our “big batch delete” ?

On second thoughts, my little demo showed that when you insert a LOB into freepool X and then delete it the reusable space goes into freepool X. Maybe I’ve misinterpreted the test and need to do a better test; maybe the reusable space goes into the freepool that the LOB was originally attached to, not into the freepool dictated by the process id. That would mean that a bulk delete would tend to spread the LOBs across all the freepools – which means if you used multiple processes to delete data they might cause contention on the “reusable” segments of the LOBINDEX.

If we do a single large delete and all the reusable chunks go into the same freepool what happens when we start inserting new LOBs ? If the LOB segment is “full” is it only the processes associated with that one freepool that can use the reusable space, or will EVERY process start to raid the freepool that has the only reusable space If the latter then all we’ve done by using multiple freepools is postpone (by roughly 28 days) the moment when we start to get contention on our LOBINDEX ?

Fortunately if we’ve made some poor choices in the orginal design Oracle does allow us to “rebuild freepools”, and even change the number of freepools:


alter table t1 modify lob (c1) (rebuild freepools);
alter table t1 modify lob (c1) (freepools (3));

Mind you, there is a little note on MoS that rebuilding freepools “may take some time” and locks the table in exclusive mode while it’s going on. So perhaps we should check to see how the rebuild works, and try to figure out how long it might take. A maxim for dealing with very large objects is that you really want to get it right first time because it’s hard to test the effects of change especially since you probably end up wanting to do your final tests on a backup copy of the production system.

Getting Started

I’ve specified 100K LOBs per day, sized between 16KB and 24KB, kept for 28 days – that’s about 50 GB, and I don’t really want to sit waiting for Oracle to build that much data while running 20 concurrent processes that are logging and generating archived redo log. (Especially since I may want to repeat the exercise two or three times with different numbers of freepools.) I’m going to start small and grow the scale when it’s necessary.

I’ll start with 4 concurrent processes inserting 1,000 LOBs each, sized at 12KB, with freepools 4, and I’ll rig the system very carefully so that each process uses a different freepool. After that I’ll run a single batch delete to delete the first 3,000 LOBs – I’ll pick a process that ought to use freepool 1 or 2 (i.e. not 0 or 3, the “end” freepools); then I’ll repeat the insert cycle but insert just 750 LOBs per process. At various points in this sequence of events I’ll stop and dump some index blocks and look at some stats to see if I can spot any important patterns emerging.

Once I’ve got through that cycle I’ll decide what to do next – the first set of results may produce some important new questions – but I’m guessing that I’ll probably end up repeating the “delete / insert” cycle at least one more time.

Here’s a little code to create a suitable table,


create sequence s1 cache 10000;

create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as basicfile
    text_lob(
            disable storage in row
            chunk 8k
            retention
            nocache
            freepools 4
            tablespace test_8k_assm
)
;

declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 0..0 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
;

I’ve inserted a row to make sure that all the objects appear in all the right places. The code I’ve used to do this insert is a version of the code that I’m going to use for the concurrency testing but restricted to insert one row with an id of zero. In the concurrency test I’ll make use of the sequence I’ve created to act as the primary key to the table.

Having created the table I then start four more sessions, carefully ensuring that they will each pick a different freepool. To make sure I had one session per freepool I just kept connecting sessions and running a silly little check for each session’s process id (pid) until I had four that returned each of the values from 0 to 3:

select mod(pid,4) from v$process where addr = (
        select paddr from v$session where sid = (
                select sid from v$mystat where rownum = 1
        )
)
;

Once I had the four extra sessions set up, I issued a simple “lock table t1 in exclusive mode” from my original session then started the following script in each of the other four:


spool temp&1

declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 1..1000 loop
                insert into t1 values (s1.nextval, m_v1);
                commit;
        end loop;
end;
/

spool off

(I supplied A, B, C, and D as the first parameter to the script so that I got four sets of output, but I haven’t included the code I used to get a snapshot of the session stats, session waits, and system enqueues recorded by each session.)

First check – did I get all four freepools evenly used (which is what I had assumed would happen when I chose the 4 process ids so carefully. I can check this by doing a block dump of the LOBINDEX root block because with 4,001 entries I’m (almost certainly) going to get a root block, no further branch levels, and a few dozen leaf blocks.

As with all B-tree indexes the “row directory” of the root block will contain a list of “truncated” key values that allow Oracle to search down to the correct block in the next layer of the index so I’m going to extract just the key values, and only the first column of those keys in the same way that I did with the previous article. This means every line in the following output shows you, in order, the first LOB id (with a few of them truncated) in each leaf block:

col 0; len 10; (10):  00 00 00 01 00 00 09 da fe a7
col 0; len 10; (10):  00 00 00 01 00 00 09 db 00 24
col 0; len 10; (10):  00 00 00 01 00 00 09 db 01 6f
col 0; len 10; (10):  00 00 00 01 00 00 09 db 02 ec
col 0; len  9; ( 9):  00 00 00 01 00 00 09 db 04
col 0; len 10; (10):  00 00 00 01 00 00 09 db 05 7c
col 0; len 10; (10):  00 00 00 01 00 00 09 db 07 2b
col 0; len 10; (10):  00 00 00 01 00 00 09 db 07 e0
col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
col 0; len  9; ( 9):  00 00 00 01 00 00 09 db 0b
col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc

col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
col 0; len 10; (10):  00 02 00 01 00 00 09 da fe 81
col 0; len 10; (10):  00 02 00 01 00 00 09 db 00 62
col 0; len 10; (10):  00 02 00 01 00 00 09 db 01 ad
col 0; len 10; (10):  00 02 00 01 00 00 09 db 02 94
col 0; len 10; (10):  00 02 00 01 00 00 09 db 04 11
col 0; len 10; (10):  00 02 00 01 00 00 09 db 04 f8
col 0; len 10; (10):  00 02 00 01 00 00 09 db 06 11
col 0; len 10; (10):  00 02 00 01 00 00 09 db 07 f2
col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
col 0; len  9; ( 9):  00 02 00 01 00 00 09 db 0b

col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
col 0; len 10; (10):  00 04 00 01 00 00 09 da fe 4c
col 0; len 10; (10):  00 04 00 01 00 00 09 da ff c9
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 01
col 0; len 10; (10):  00 04 00 01 00 00 09 db 01 f8
col 0; len 10; (10):  00 04 00 01 00 00 09 db 03 75
col 0; len 10; (10):  00 04 00 01 00 00 09 db 04 5c
col 0; len 10; (10):  00 04 00 01 00 00 09 db 06 3d
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 07
col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 0b

col 0; len  2; ( 2):  00 06
col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
col 0; len  9; ( 9):  00 06 00 01 00 00 09 db 0a
col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a

As you can see, we have the expected pattern (for 4 freepools) of entries starting with (00 00), (00 02), (00 04), and (00 06); but you might wonder why there are 11 leaf blocks for 00, 12 leaf blocks for 02 and 04, and only 8 leaf blocks for 06. We can answer the 11/12 anomaly by remembering that any branch blocks will have a “leftmost child” entry that won’t appear in the row directory – so the 12th leaf (or rather the 1st leaf) block for 00 is being pointed to by the “LMC”. But what about the missing blocks for 06 ? A treedump shows the answer:


branch: 0x1800204 25166340 (0: nrow: 44, level: 1)
   leaf: 0x1800225 25166373 (-1: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (0: nrow: 81 rrow: 81)
   leaf: 0x1800231 25166385 (1: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (2: nrow: 81 rrow: 81)
   leaf: 0x1800239 25166393 (3: nrow: 75 rrow: 75)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 81)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 132 rrow: 132)

   leaf: 0x1800226 25166374 (11: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (12: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (13: nrow: 81 rrow: 81)
   leaf: 0x1800232 25166386 (14: nrow: 81 rrow: 81)
   leaf: 0x1800236 25166390 (15: nrow: 81 rrow: 81)
   leaf: 0x180023a 25166394 (16: nrow: 81 rrow: 81)
   leaf: 0x180023e 25166398 (17: nrow: 81 rrow: 81)
   leaf: 0x1800207 25166343 (18: nrow: 81 rrow: 81)
   leaf: 0x180020b 25166347 (19: nrow: 81 rrow: 81)
   leaf: 0x180020f 25166351 (20: nrow: 81 rrow: 81)
   leaf: 0x1800213 25166355 (21: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (22: nrow: 111 rrow: 111)

   leaf: 0x1800229 25166377 (23: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (24: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (25: nrow: 78 rrow: 78)
   leaf: 0x1800237 25166391 (26: nrow: 81 rrow: 81)
   leaf: 0x180023b 25166395 (27: nrow: 81 rrow: 81)
   leaf: 0x180023f 25166399 (28: nrow: 81 rrow: 81)
   leaf: 0x1800208 25166344 (29: nrow: 81 rrow: 81)
   leaf: 0x180020c 25166348 (30: nrow: 76 rrow: 76)
   leaf: 0x1800210 25166352 (31: nrow: 81 rrow: 81)
   leaf: 0x1800214 25166356 (32: nrow: 81 rrow: 81)
   leaf: 0x1800230 25166384 (33: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (34: nrow: 107 rrow: 107)

   leaf: 0x180022b 25166379 (35: nrow: 78 rrow: 78)
   leaf: 0x1800218 25166360 (36: nrow: 152 rrow: 152)
   leaf: 0x180021c 25166364 (37: nrow: 152 rrow: 152)
   leaf: 0x1800220 25166368 (38: nrow: 152 rrow: 152)
   leaf: 0x1800224 25166372 (39: nrow: 152 rrow: 152)
   leaf: 0x1800228 25166376 (40: nrow: 152 rrow: 152)
   leaf: 0x180022c 25166380 (41: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (42: nrow: 11 rrow: 11)

Although there are little glitches along the way the leaf blocks for freepools 0, 1, and 2 have 81 entries per leaf block, the leaf blocks for freepool 3 have 152 – it’s the difference between inserting rows at the end of an index and getting a “90/10” split compared with inserting somewhere in the middle and getting a “50/50” split. In fact, although Oracle uses the idgen1$ sequence to drive the generation of LOB ids, the way it “batches” IDs (the sequence cache size is 50) means that LOB ids can appear out of order when generated by multiple sessions – even in the same freepool: it is possible for the top freepool to suffer 50/50 splits though these are likely to appear relatively rarely – at least, that is, until I start deleting LOBs when a “00 07” section of reusable chunks may appear. (As a side effect, the LOB ids in my table and the id column on the underlying table are out of order with respect to each other.)

You’ll notice that I’ve left blank links in the treedump list corresponding to the breaks between the free pools (remember there is a “-1” entry in the treedump for the “leftmost child” that doesn’t appear in the row directory). This helps to explain the few leaf blocks with unusual row counts – they’re the ones that at (or very near) the boundaries between freepools.

Bulk Delete

After loading some data in with concurrent inserts and multiple freepools I’m going to do a single big delete from the table to get rid of a lot of “old” data. In fact all I’ll do is delete the rows where id <= 3000. The big question is this – will a single delete put all the reusable blocks into a single freepool, or will it put the reusable space for each LOB into the freepool that the LOB was originally in, or will it find some other way to spread the reusable space evenly across all the free pools ? One freepool or many – both options have good points, both options have bad points.

Here’s what I got as the treedump after the delete:


branch: 0x1800204 25166340 (0: nrow: 72, level: 1)
   leaf: 0x1800225 25166373 (-1: nrow: 81 rrow: 0)
   leaf: 0x180022d 25166381 (0: nrow: 81 rrow: 0)
   leaf: 0x1800231 25166385 (1: nrow: 81 rrow: 0)
   leaf: 0x1800235 25166389 (2: nrow: 81 rrow: 0)
   leaf: 0x1800239 25166393 (3: nrow: 75 rrow: 0)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 0)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 0)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 0)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 22)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 132 rrow: 120)

   leaf: 0x1800226 25166374 (11: nrow: 81 rrow: 0)
   leaf: 0x180022a 25166378 (12: nrow: 81 rrow: 0)
   leaf: 0x180022e 25166382 (13: nrow: 81 rrow: 0)
   leaf: 0x1800232 25166386 (14: nrow: 81 rrow: 0)
   leaf: 0x1800236 25166390 (15: nrow: 81 rrow: 0)
   leaf: 0x180023a 25166394 (16: nrow: 81 rrow: 0)
   leaf: 0x180023e 25166398 (17: nrow: 81 rrow: 0)
   leaf: 0x1800207 25166343 (18: nrow: 81 rrow: 0)
   leaf: 0x180020b 25166347 (19: nrow: 81 rrow: 0)
   leaf: 0x180020f 25166351 (20: nrow: 81 rrow: 64)
   leaf: 0x1800213 25166355 (21: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (22: nrow: 111 rrow: 101)

   leaf: 0x1800229 25166377 (23: nrow: 81 rrow: 0)
   leaf: 0x180022f 25166383 (24: nrow: 81 rrow: 0)
   leaf: 0x1800233 25166387 (25: nrow: 78 rrow: 0)
   leaf: 0x1800237 25166391 (26: nrow: 81 rrow: 0)
   leaf: 0x180023b 25166395 (27: nrow: 81 rrow: 0)
   leaf: 0x180023f 25166399 (28: nrow: 81 rrow: 0)
   leaf: 0x1800208 25166344 (29: nrow: 81 rrow: 0)
   leaf: 0x180020c 25166348 (30: nrow: 76 rrow: 0)
   leaf: 0x1800210 25166352 (31: nrow: 81 rrow: 0)
   leaf: 0x1800214 25166356 (32: nrow: 81 rrow: 36)
   leaf: 0x1800230 25166384 (33: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (34: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (35: nrow: 139 rrow: 139)

   leaf: 0x1800227 25166375 (36: nrow: 138 rrow: 138)
   leaf: 0x1800205 25166341 (37: nrow: 126 rrow: 126)
   leaf: 0x1800219 25166361 (38: nrow: 82 rrow: 82)
   leaf: 0x1800272 25166450 (39: nrow: 95 rrow: 95)
   leaf: 0x1800209 25166345 (40: nrow: 118 rrow: 118)
   leaf: 0x180021f 25166367 (41: nrow: 143 rrow: 143)
   leaf: 0x180020d 25166349 (42: nrow: 81 rrow: 81)
   leaf: 0x1800243 25166403 (43: nrow: 90 rrow: 90)
   leaf: 0x1800222 25166370 (44: nrow: 147 rrow: 147)
   leaf: 0x1800211 25166353 (45: nrow: 81 rrow: 81)
   leaf: 0x1800247 25166407 (46: nrow: 73 rrow: 73)
   leaf: 0x1800223 25166371 (47: nrow: 98 rrow: 98)
   leaf: 0x180026a 25166442 (48: nrow: 98 rrow: 98)
   leaf: 0x180021d 25166365 (49: nrow: 127 rrow: 127)
   leaf: 0x1800266 25166438 (50: nrow: 131 rrow: 131)
   leaf: 0x1800215 25166357 (51: nrow: 133 rrow: 133)
   leaf: 0x180026e 25166446 (52: nrow: 141 rrow: 141)
   leaf: 0x180021b 25166363 (53: nrow: 82 rrow: 82)
   leaf: 0x180024b 25166411 (54: nrow: 93 rrow: 93)
   leaf: 0x1800276 25166454 (55: nrow: 109 rrow: 109)
   leaf: 0x180024f 25166415 (56: nrow: 77 rrow: 77)
   leaf: 0x180021e 25166366 (57: nrow: 143 rrow: 143)
   leaf: 0x180027e 25166462 (58: nrow: 126 rrow: 126)
   leaf: 0x1800221 25166369 (59: nrow: 93 rrow: 93)
   leaf: 0x1800253 25166419 (60: nrow: 82 rrow: 82)
   leaf: 0x180027a 25166458 (61: nrow: 97 rrow: 97)
   leaf: 0x1800257 25166423 (62: nrow: 84 rrow: 84)

   leaf: 0x180022b 25166379 (63: nrow: 78 rrow: 0)
   leaf: 0x1800218 25166360 (64: nrow: 152 rrow: 0)
   leaf: 0x180021c 25166364 (65: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (66: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (67: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (68: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (69: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (70: nrow: 11 rrow: 11)


The number of leaf blocks has gone up from 44 to 72 (but that shouldn’t be too much of a surprise – index leaf block space can’t be reused until after the commit, so we were bound to grow the index to insert the entries for reusable chunks).

As before I’ve inserted a few blank lines to break the list into the separate index sections, and you can see that the first few blocks in each of the first three freepools has nrow = 81 and (typically) rrow = 0. These are the leaf blocks where all the LOB entries have been marked as deleted. There are a couple of variations – leaf block 10, for example, shows nrow = 132, rrow = 120: this is the leaf block where freepool 0 (LOB section) overlapped with freepool 1 (LOB section), and the first 10 LOBs in freepool 1 have been marked as deleted. The LOB section for freepool 4 follows the same sort of pattern, though nrow = 152 in most of the blocks.

The important detail is in leaf blocks 36 to 62 – which show nrow = rrow throughout, but with a degree of randomness as to the actual number of index entries. These are the leaf blocks that record the “reusable chunks”, and they’ve all been associated with freepool 2 (counting from zero). There are several details that combine to explain why the numbers of entries per leaf block vary so much, but I don’t want to get too distracted by them now; remember, though, that I pointed out that the LOB ids and table id column weren’t in synch with each other so part of what you’re seeing here is 50/50 leaf node splits followed by a little back-filling.

Again I’ve extracted the “col 0” values from the block dump of the root block – I won’t show all of them, I’ll just show you the entries from entries 35 to 63 so that you can see the leaf block pointers for the “reusable” section of freepool 2, and the LOB section of freepools 2 and 3:


col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d

col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

col 0; len 02; (02):  00 06

The starting “00 05” tells us that this is the reusable chunks associated with freepool 2 (2 * 2 + 1 = 5), and the fact that the next four bytes are identical across the entries tells you that I managed to delete my 3,000 LOBs in less than one second.

After seeing the effect of this monolithic delete you should now be asking yourself a few questions, such as:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

TO BE CONTINUED …

 

August 18, 2016

Basicfile LOBs 2

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 1:21 pm GMT Aug 18,2016

There are probably quite a lot of people still using Basicfile LOBs, although Oracle Corp. would like everyone to migrate to the (now default) Securefile LOBs. If you’re on Basicfile, though, and don’t want (or aren’t allowed) to change just yet here are a few notes that may help you understand some of the odd performance and storage effects.

Of course, there are a lot of variations in how you declare the LOB – pctversion vs. retention, cache vs. nocache, logging vs. nologging, enable vs. disable storage in row, and I can’t cover all the combinations – so what I’ll be looking at is a general strategy for handling a large number of small LOBs that are being inserted into the database at a fairly high degree of concurrency, and then being deleted a few weeks later so, in theory, the LOB segment should end up at a steady state with “N” days worth of data stored. The driver behind this modelling is a problem I was asked to examine a little while ago.

Some background details on Basicfile LOBs

If the LOB column is defined as “enable storage in row” then a very small LOB (up to 3,960 bytes) will be stored almost as if it were an ordinary column in the row; if the size of a LOB is a little larger than this limit then the LOB will be stored in chunks in the LOB segment and pointers to the first 12 chunks will be stored in the row, with pointers for further chunks stored in the LOBINDEX. The chunk size defined for a LOB column can be up to 32KB – though the default chunk size is the block size for the containing tablespace –  so it’s possible to store a LOB of nearly 384KB before Oracle needs to create index entries in the LOBINDEX, though most people use 8KB blocks and will start using the LOBINDEX when a LOB gets close to 96KB.

If the LOB column is defined as “disable storage in row” then, no matter how small it really is, it will always take up at least one chunk in the LOB segment and will have a corresponding index entry in the LOBINDEX.  For reasons of efficiency an entry in LOBINDEX always “carries” 32 bytes of pointer data, allowing it to list up to 8 chunks.

When a LOB is deleted (replaced by a null, an empty_lob(), or a new LOB value) the previous state of the base table row and the LOBINDEX will be preserved in the undo segment in the ordinary manner but the previous version of the LOB data itself is simply left in the segment (and a new version of the LOB created if the operation is an “update”). The chunks making up the old version are added to the LOBINDEX with a key based on the time (seconds since 1st Jan 1970) the delete took place – this means that when Oracle wants to re-use space in the LOB segment it can walk the LOBINDEX in order to find the chunks that were marked as available for reuse the greatest time into the past. (It also means that the LOBINDEX is one of the strangest in the Oracle pantheon – part of it indexes “reusable chunks keyed by time” part of it indexes “current chunks keyed by LOB id”.

There are two options for how long old versions of LOBs will be kept: PCTVERSION specifies the percentage of space below the segment’s highwater mark that may be used to keep old versions, and (until 12c, where things change) RETENTION specifies that Oracle should try to keep old versions for the length of time given by the system parameter undo_retention (which defaults to 900 seconds). If enough versions of LOBs have been kept Oracle can create a read-consistent version of a given LOB by using the normal undo mechanisms to take the base table row and LOBINDEX back to the correct point in time which will then ensure that the LOB pointers will be pointing to the correct chunks.  (If the LOB chunks have been over-written this is the point where you will get an Oracle error: “ORA-22924 Snapshot too old”, followed by a misleading “ORA-01555 Snapshot too old ….”)

One final interesting point from a performance perspective is that if you define the LOB to be “nocache”, which means that typical reads and writes of the lob will use direct path, and then specified “nologging” then reads and writes of the LOB will generate tiny amounts of redo log.  Two special points to go with this, though: if you specify “nocache logging” the direct path writes will be logged, but the log content will be by chunk – so if you store 4,000 bytes of data in a LOB with a 32KB chunk size you will write 32KB of redo log; secondly if you are testing the effects of logging and nologging, make sure your test database is running in archivelog mode if your production database is going to be archiving – otherwise Oracle will fool you by taking a short cut and NOT logging a nocache LOB even if you specify logging! The LOBINDEX is always cached and logged, by the way, and even if the LOB is defined as nocache there are circumstances where LOB blocks are read into the buffer cache (remember my previous note describing how we saw 6 billion buffer gets on a nocache LOB).

The last detail I want to mention is the FREEPOOLS parameter. The description in the developers guide for 11.2 describes this as:  “Specifies the number of FREELIST groups for BasicFiles LOBs, if the database is in automatic undo mode.” Unfortunately freelists and freelist groups are things that happen in manual segment space management so this definition requires an alternative meaning for the expression “FREELIST groups”. The purpose of FREEPOOLS is to help deal with concurrency problems but there’s not much information around to help you understand the mechanisms and pitfalls of freepools and the available documents on MoS don’t really do anything to clarify the position – and that’s what this article is (finally) going to talk about.

Basicfile FREEPOOLs – the truth is out there

When you specify FREEPOOLs you affect the way Oracle uses the LOBINDEX – not the space management information about the segment holding the index but the actual content of (in fact the KEY values held by) the index.

You can do a treedump of a LOBINDEX by object_id in the standard way that you do a treedump of any B-tree (or bitmap) index, and you can dump blocks from a LOBINDEX in the same way you dump any other data block in the database, by file number and block number (or block range), so it’s easy to see what happens in a LOBINDEX when you start using multiple freepools. I’ve created a table holding a LOB defined with “disable storage in row” so that I always use the LOBINDEX, inserted three rows then deleted one of them and dumped the one index block (which happens to be both the root and a leaf). Here’s the SQL to create the table and do the data handling:


create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as basicfile text_lob(
        disable storage in row
        chunk 8k
        retention
        nocache
        tablespace test_8k_assm
)
;


declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 1..3 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/


delete from t1 where id = 1;
commit;

alter system flush buffer_cache;

I’ve ended by flushing the buffer cache so that I don’t get a huge trace file when I try to dump the index to disc. Here’s the next bit of processing:


SQL> select object_id from user_objects where object_type = 'INDEX' and object_name like 'SYS_IL%';

 OBJECT_ID
----------
    241599

SQL> alter session set events 'immediate trace name treedump level 241599';

----- begin tree dump
leaf: 0x1800204 25166340 (0: nrow: 4 rrow: 3)
----- end tree dump

SQL> alter session dump datafile 6 block 516;

I’ve included in the above the treedump that I extracted from the tracefile and this shows that the index consists of a single leaf block (0x1800204 = file 6 block 516) with 4 row directory entries of which one has been deleted. Here’s the row dump from that leaf block – the first three entries are the index entries identifying the three LOBs I created (and, as shown by the flag value “—D–“, the first has been marked as deleted) the fourth entry points to a set of free chunks (corresponding to the chunks that will become available for re-use after a delay corresponding to the undo retention time).


row#0[7982] flag: ---D--, lock: 2, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 a6 01 80 01 aa 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 85
col 1; len 4; (4):  00 00 00 00

row#1[7932] flag: ------, lock: 0, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 ae 01 80 01 b2 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 86
col 1; len 4; (4):  00 00 00 00

row#2[7882] flag: ------, lock: 0, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 b6 01 80 01 ba 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 87
col 1; len 4; (4):  00 00 00 00

row#3[7832] flag: ------, lock: 2, len=50, data:(32):
 01 80 01 a6 01 80 01 aa 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 01 57 b3 32 9b 00 00 00 00
col 1; len 4; (4):  01 80 01 a6

If you examine the entries closely you will see that despite the common structure of all four of them there are two patterns. Ignoring the “data (32):” portion and looking at just “col 0” the last few bytes of the first three entries hold consecutive numbers which are actually the LOB Ids for the three LOBs (derived from the sequences idgen1$). The fourth entry breaks that pattern and if you examine bytes 3 to 6 you will find that that is (approximately, by the time I publish this article) the number of seconds since 1st Jan 1970.

To a large degree you need only examine “col 0” to get a good idea of how Oracle handles the LOBINDEX, but I will say just a few things about the rest of the entry.  For the “reusable space” index entries “col 1” is the first of a list of up to 8 chunks that were released from the same LOB at that moment, and the “data(32)” is the full list of those 8 chunks – each chunk is identified by the block address of the first block of the chunk. If I had created and deleted a LOB of roughly 128KB I would have used 16 chunks to create it and made 16 chunks available on deletion, so there would have been two index entries with the same “col 0” value, each identifying 8 of the chunks – hence the re-appearance of the first chunk as “col 1”.  (You can’t help wondering why Oracle doesn’t squeeze 9 chunks per index entry rather than repeating the first of the list – maybe there’s a peripheral effect that makes 8 easier, maybe it’s simply a good sanity check mechanism.)

For index entries about current LOBs “col 1” is a counter for the index entries that identify the entire LOBs. Our LOBs were all very small so we only needed one index entry (which Oracle starts counting from zero).  The “data (32)” entry for the “zeroth” entry starts with 16 bytes of metadata then holds up to 4 pointers to chunks; subsequent entries don’t need the metadata and can hold up to 8 pointers each and “col 1” stores the chunk number that the index entry starts with, so “col 1” in consecutive index entries for a given LOB id will have values 0, 4, 12, 20, etc.

You might note, by the way, that my LOBs are not made up of consecutive blocks even though my chunk size is exactly one block. This is a side effect of ASSM (automatic segment space management) and nothing specifically to do with LOBs.

With this sketch in place you now have some idea of how a LOBINDEX works. Apart from the convenience of knowing roughly what information is stored in the index, and how it has this strange dual purpose, you can now view it just like any other B-tree index in Oracle. When you insert a LOB you insert some index entries into the middle of the index (the high-value point of the LOB Id bit), when you delete a LOB you mark some (consecutive) index entries as deleted and insert some index entries at the high end of the index the high_value point of the “reusable chunks” bit) because each delete is the most recent delete.

As soon as you’ve got that far you realise that if you have some degree of concurrency of inserts and deletes then you have two hot spots in the index – the mid point where you’re going to get lots of 50/50 splits as LOBs are inserted and the end point where all the re-usable chunks are indexed. So how do you configure Oracle, and what does Oracle do, to reduce the contention ?

Take another look at the “col 0” values – which I’ve cut out and listed in isolation below:


col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 85
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 86
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 87

col 0; len 10; (10):  00 01 57 b3 32 9b 00 00 00 00

Apart from making it easy to see the sequencing in the 3 LOB ids it’s now easy to note that the first three (LOB) entries start with “00 00” while the last (reusable space) entry starts with “00 01”. It’s really this starting two bytes that makes it easy for Oracle to separate the current LOBs section of the index from the reusable space section. The two bytes are the freepool identifier – it’s the first (and only, in my example) free pool – but Oracle is counting from zero, doubling the counter for the current LOBs, and doubling and adding one for the reusable space.

Here are some results when I drop and recreate the table with freepools 4 and repeat the experiment. (I’ve removed the “data(32)” content to make the output a little cleaner, and then extracted the “col 0” values).


row#0[7982] flag: ---D--, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 55
col 1; len 4; (4):  00 00 00 00

row#1[7932] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 56
col 1; len 4; (4):  00 00 00 00

row#2[7882] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 57
col 1; len 4; (4):  00 00 00 00

row#3[7832] flag: ------, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 07 57 b3 3b a5 00 00 00 00
col 1; len 4; (4):  01 80 01 df

===

col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 55
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 56
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 57

col 0; len 10; (10): 00 07 57 b3 3b a5 00 00 00 00

It just happened that with 4 freepools available my session picked freepool 3 so its LOBINDEX entries are preceded with 00 06 (2 * 3), and it’s reusable space index entries are preceded with 00 07 (2 * 3 + 1). At present I think the freepool chosen by a session (counting from zero) is derived from the session’s process id (pid) by a simple mod(pid , freepools).

So what happens if I start a second session, and adjust my little PL/SQL procedure to insert rows 4, 5, and 6.

I expect to see two things. First, the “ordinary” B-tree event – the index entry that’s marked for deletion will be cleared out of the index; secondly I should see four new index entries (one marked as deleted) which, with a little luck (one chance in four), will show that they are associated with a different freepool.

Here’s the dump (again with the “data(32)” deleted, and the “col 0” extracted at the end):


row#0[7782] flag: ---D--, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 87
col 1; len 4; (4):  00 00 00 00

row#1[7732] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 88
col 1; len 4; (4):  00 00 00 00

row#2[7682] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 89
col 1; len 4; (4):  00 00 00 00

row#3[7632] flag: ------, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 01 57 b3 3b ad 00 00 00 00
col 1; len 4; (4):  01 80 01 a4

row#4[7932] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 56
col 1; len 4; (4):  00 00 00 00

row#5[7882] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 57
col 1; len 4; (4):  00 00 00 00

row#6[7832] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 07 57 b3 3b a5 00 00 00 00
col 1; len 4; (4):  01 80 01 df

===

col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 87
col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 88
col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 89

col 0; len 10; (10): 00 01 57 b3 3b ad 00 00 00 00

col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 56
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 57

col 0; len 10; (10): 00 07 57 b3 3b a5 00 00 00 00

The index entry previously marked as deleted has disappeared (it was LOB id “09 da 36 55”).

We have four new index entries – the first 4 in the list above – and we can see that our second session has been allocated to freepool 0, the LOBINDEX entries are preceded by “00 00”, and the reusable space index entry is preceded by “00 01”.

So by declaring freepools N, we effectively break the index up into 2N nearly discrete sections. Half the sections get inserts at their high end as we insert new LOBs (with ever increasing LOB ids) and the other half (apart, sometimes, from the very top section) get inserts at the high end as time passes and we make LOB space available for reuse by deleting existing LOBs. (Note – if two LOBs of more than 8 chunks each are deleted in the same hundredth of a second then their index entries may end up interleaving as the full key is (timestamp, first chunk address) and the chunks may be scattered widely across the tablespace). Freepools allow Oracle to remove the two killer hot spots in the index.

There are side effects, of course: apart from the section for reusable space in the top freepool each section of the index will generally be subject to 50/50 block splits so you can expect the index to be roughly twice the optimum size – and even bigger than that due to other side effects of how it’s used if you’re constantly deleting and inserting LOBs. But size isn’t really the big problem; I’ll be examining further side effects of the LOBINDEX, and the mechanism that Oracle has for using the index, and the performance threats this introduces, in the next installment.


August 5, 2016

Basicfile LOBs 1

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 4:12 pm GMT Aug 5,2016

I got a call to a look at a performance problem involving LOBs a little while ago. The problem was with an overnight batch that had about 40 sessions inserting small LOBs (12KB to 22KB) concurrently, for a total of anything between 100,000 and 1,000,000 LOBs per night. You can appreciate that this would eventually become a very large LOB segment – so before the batch started all LOBs older than one month were deleted.

The LOB column had the following (camouflaged) declaration:

 LOB (little_lob) STORE AS BASICFILE (
        TABLESPACE lob_ts
        ENABLE STORAGE IN ROW
        RETENTION
        NOCACHE
        LOGGING
)

The database was 11gR2, the tablespace was defined with ASSM with uniform 1MB extents and a blocksize of 8KB (so the LOBs were all 2 or 3 chunks) and the undo retention time was 900 seconds. The effect of the “enable storage in row” is that the LOBINDEX didn’t have to hold any details of current LOB chunks (for in-row, the first 12 chunks are listed in the LOB Locator in the base table).

So, examining an AWR report covering the critical interval, reviewing the captured ASH data, and checking the database, a few questions came to mind:

  • With 200 GB of current LOB data in the segment, why was the segment roughly 800GB ?
  • With no need for current LOBs to be indexed, how had the LOB Index reached 500,000 blocks in size ?
  • There had been 500,000 inserts that night – so why had Oracle done 6 Billion (cached) buffer gets on the (nocache) LOB segment ?
  • Given that the LOB Segment had not changed size during the night, why had there been millions of HW enqueue wait on the inserts ?

Knowing the stuff that I did know about basicfile LOBs it seemed likely that the most significant problem was that the segment hadn’t been created with multiple freepools which, according to the very sparse (and not entirely self-consistent) documentation, exist to allow improved concurrency. So I thought I’d search the Internet for any useful information about freepools, how they worked, what impact they might have on this problem, why their absence might produce the symptoms I’d seen, and what the best course of action would be to address the problem.

Of course the “correct” solution according to MoS would be to convert from basicfile to securefile – with a strange insistence on using online redefinition, but no explanation of why a simple CTAS or alter table move is undesirable or dangerous. Unfortunately there are a couple of notes on MoS describing performance issues with “high” levels of concurrent inserts that need to be addressed by setting hidden parameters so I’m not (yet) keen on rebuilding 700GB of a production system to produce a change that might still not work quickly enough; especially since I couldn’t find anything on MoS that could quantify the time needed to do the conversion.

To my surprise I couldn’t find a single useful piece of information about the problem. The only articles I could find seemed to be little bits of cut-n-paste from the Oracle manual pages about using multiple freepools, and the best of those actually demonstrated rebuilding or changing the freepools settings on a LOB of a few megabytes. The most significant MoS note did say that the process “could be slow” and would lock the table. But surely someone, somewhere, must have tried it on a big system and had some idea of “how slow”.

In the end I had to start building some simple models and doing a few experiments to find out what happens and where the time goes and what causes the strange results and – most importantly – how freepools might help. Fortunately, following a call to the Oak Table for any ideas or links to useful documents, I got a pointer to the original Oracle patents which were enormously helpful in showing why freepools could help and why, in the wrong circumstances, you could still end up with a (slightly smaller) disaster on your hands.

 

To be continued …

Footnote

If you’re interested, the patent numbers are: 5,999,943 and 6,061,678.  Now I just need someone to tell me the numbers for the securefile LOBs patents.

 

July 12, 2016

Union All MV

Filed under: Infrastructure,Materialized view,Oracle — Jonathan Lewis @ 10:10 am GMT Jul 12,2016

In an article I wrote last week about Bloom filters disappearing as you changed a SELECT to a (conventional) INSERT/SELECT I suggested using the subquery_pruning() hint to make the optimizer fall back to an older strategy of partition pruning. My example showed this working with a range partitioned table but one of the readers reported a problem when trying to apply the strategy to a composite range/hash partitioned table and followed this up with an execution plan of a select statement with a Bloom filter where the subquery_pruning() hint didn’t introduced subquery pruning when the select was used for an insert.

A couple of standard ways to work around this probelm are to embed the select statement in a pipeline function so that we can “insert into table select from table(pipeline_function)”, or to write a pl/sql block that opens a cursor to do a select with bulk collect and loops through an array insert. The overhead in both cases is likely to be relatively small (especially when compared with the overhead of failing to filter). In this case, however, the reader suggested that maybe the problem appeared because the driving table (i.e. the one that would have been query to derive the pruning values) was actually an inline view with a union all.

After modifying my working model to try a couple of different tests I was inclined to agree. Since the two tables in the view looked as if they were likely to be relatively tiny and static I suggested that it would be safe to create a materialized view defined to “refresh on commit” and then use the materialized view explicitly in the query. This, finally, brings me to the point of today’s article – how do you create such a materialized view ?

I’m going to start by creating a couple of small base tables from a familiar object:


create table tt as select * from all_objects where object_type = 'TABLE';
create table tv as select * from all_objects where object_type = 'VIEW';

alter table tt add constraint tt_pk primary key (object_id);
alter table tv add constraint tv_pk primary key (object_id);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'TT',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'TV',
                method_opt       => 'for all columns size 1'
        );
end;
/

Assume, now, that I need an inline view that is interested in the things you will recognise from the above as the tables owned by OUTLN (which will apper in tt) and the views owned by SYSTEM (which will appear in tv) – in the 11.2.0.4 system I’m playing on at the moment that’s three rows from each of the two tables). Here’s the SQL I’d put into the inline view:


select
        object_id, object_type, object_name
from    tt
where   owner = 'OUTLN'
union all
select
        object_id, object_type, object_name
from    tv
where   owner = 'SYSTEM'
;

Since this view won’t give me partition pruning I have to replace it with a table and because I want to ensure that the table is always up to date I have to generate it as the container for a materialized view with refresh on commit. First I need some materialized view logs so that I can do a fast refresh:


create materialized view log on tt
with
        rowid, primary key
        (object_type, object_name, owner)
including new values
;

create materialized view log on tv
with
        rowid, primary key
        (object_type, object_name, owner)
including new values
;

I’ve included the primary key in the definition because I happen to want the object_id column in the log – but I could just have included it as a column in the filter list. I’ve included the rowid in the definition because Oracle needs the rowid if it’s going to be able to do a fast refresh. I can now create a materialized view:


create materialized view mv_t
        build immediate
        refresh fast on commit
as
select
        'T' mv_marker,
        rowid rid,
        object_id, object_type, object_name
from    tt
where   owner = 'OUTLN'
union all
select
        'V' mv_marker,
        rowid rid,
        object_id, object_type, object_name
from    tv
where   owner = 'SYSTEM'
;

I’ve taken the option to “build immediate” and specified – most importantly for my needs – “refresh on commit”. You’ll notice I haven’t chosen to “enable query rewrite”; for the purposes of this demo I don’t need that particular feature.

There are two key features to the materialized view that are a little special – first I’ve included the rowid of each source table as a named column in the materialized view; as I mentioned above Oracle will not allow the view to be fast refreshable without the rowid. The second feature is that I’ve introduced a literal value into the view which I’ve named mv_marker; this makes it easy to see which table a row comes from when you query the materialized view … and Oracle needs to see this.

That’s the job done. Just to demonstrate that my materialized view is working as required here’s a little more SQL (following by the output):


select * from mv_t;

delete from tt where object_name = 'OL$';
update tv set object_name = 'PRODUCT_PRIVILEGES' where object_name = 'PRODUCT_PRIVS';

commit;

select * from mv_t;

=======================================

M RID                 OBJECT_ID OBJECT_TYPE         OBJECT_NAME
- ------------------ ---------- ------------------- --------------------------------
T AAA6tXAAFAAAAEBAAI        471 TABLE               OL$
T AAA6tXAAFAAAAEBAAJ        474 TABLE               OL$HINTS
T AAA6tXAAFAAAAEBAAK        478 TABLE               OL$NODES
V AAA6tWAAFAAAACgABI       8260 VIEW                SCHEDULER_PROGRAM_ARGS
V AAA6tWAAFAAAACgABJ       8261 VIEW                SCHEDULER_JOB_ARGS
V AAA6tWAAFAAAACuAA7      14233 VIEW                PRODUCT_PRIVS

6 rows selected.

2 rows deleted.


1 row updated.


Commit complete.


M RID                 OBJECT_ID OBJECT_TYPE         OBJECT_NAME
- ------------------ ---------- ------------------- --------------------------------
T AAA6tXAAFAAAAEBAAJ        474 TABLE               OL$HINTS
T AAA6tXAAFAAAAEBAAK        478 TABLE               OL$NODES
V AAA6tWAAFAAAACgABI       8260 VIEW                SCHEDULER_PROGRAM_ARGS
V AAA6tWAAFAAAACgABJ       8261 VIEW                SCHEDULER_JOB_ARGS
V AAA6tWAAFAAAACuAA7      14233 VIEW                PRODUCT_PRIVILEGES

5 rows selected.

If you’re wondering why you see “2 rows deleted” but a reduction by just one row in the final output, remember that we’re deleting from table tt but the materialized view holds information about just the subset of tables owned by OUTLN – I happen to have a row in tt that says SYSTEM also owns a table called OL$.

Assistance

If you have trouble working out why your attempts to create a particular materialized view aren’t working the dbms_mview package has a procedure called explain_mview that may give you enough ideas to work out what you’re doing wrong. For example, here’s how I could find out that I needed a literal column to tag the two parts of my union all view:


@$ORACLE_HOME/rdbms/admin/utlxmv.sql

begin
        dbms_mview.explain_mview (
                q'{
                create materialized view mv_t
                        build immediate
                        refresh fast
                        enable query rewrite
                as
                select  -- 'T' mv_marker,
                        rowid rid,
                        object_id, object_type, object_name from tt
                union all
                select  -- 'V' mv_marker,
                        rowid rid,
                        object_id, object_type, object_name from tv
                }'
        );
end;
/

column cap_class noprint
column related_text format a7
column short_msg format a72
break on cap_class skip 1

select
        substr(capability_name,1,3) cap_class,
        capability_name, possible, related_text, substr(msgtxt,1,70) short_msg
from
        mv_capabilities_table
where
        mvname = 'MV_T'
order by
        substr(capability_name,1,3), related_num, seq
;

The first line calls a supplied script to create a table called mv_capabilities_table in the current schema. The call to dbms_mview.explain_mview passes the text of a “create materialized view” statement to the procedure (there are a couple of variations possible) then, after a couple of SQL*Plus formatting commands I’ve queried the table to see Oracle’s analysis for the statement. (You can tag each call to this procedure using a second parameter that I haven’t bothered to use.)

Here’s the output for the failed attempt above, which has commented out the literals that tag the two parts of the UNION ALL:

CAPABILITY_NAME                POS RELATED SHORT_MSG
------------------------------ --- ------- ------------------------------------------------------------------------
PCT_TABLE                      N   TT      relation is not a partitioned table
PCT_TABLE_REWRITE              N   TT      relation is not a partitioned table
PCT_TABLE                      N   TV      relation is not a partitioned table
PCT_TABLE_REWRITE              N   TV      relation is not a partitioned table
PCT                            N

REFRESH_COMPLETE               Y
REFRESH_FAST                   N
REFRESH_FAST_AFTER_INSERT      N           the materialized view does not have a UNION ALL marker column
REFRESH_FAST_AFTER_INSERT      N           set operator in a context not supported for fast refresh
REFRESH_FAST_AFTER_ONETAB_DML  N           see the reason why REFRESH_FAST_AFTER_INSERT is disabled
REFRESH_FAST_AFTER_ANY_DML     N           see the reason why REFRESH_FAST_AFTER_ONETAB_DML is disabled
REFRESH_FAST_PCT               N           PCT FAST REFRESH is not possible if query has set operand query blocks

REWRITE                        Y
REWRITE_FULL_TEXT_MATCH        Y
REWRITE_PARTIAL_TEXT_MATCH     N           set operator encountered in mv
REWRITE_GENERAL                N           set operator encountered in mv
REWRITE_PCT                    N           general rewrite is not possible or PCT is not possible on any of the d


17 rows selected.

The query manages to split the output into three sections (but that depends on a side-effect in a way that I would normally call bad design): elements relating to “Partition Change Tracking”, elements relating to “Materialized View Refresh” and elements relating to “Query Rewrite”. You’ll notice that the rewrite section tells me that (even though I haven’t chosen to enable it) my view could be enabled to do query rewrite.

Critically, though, this version of the materialized view can’t be fast refreshed, and we see the key reason in the first “Refresh fast after insert” line: “the materialized view does not have a UNION ALL marker column”. That’s how I know I have to include a literal column that has a different value in each of the two parts of the UNION ALL.

June 24, 2016

Never …

Filed under: Infrastructure,Oracle,Philosophy — Jonathan Lewis @ 1:15 pm GMT Jun 24,2016

From time to time a question comes up on OTN that results in someone responding with the mantra: “Never do in PL/SQL that which can be done in plain  SQL”. It’s a theme I’ve critqued a couple of times before on this blog, most recently with regard to Bryn Llewellyn’s presentation on transforming one table into another and Stew Ashton’s use of Analytic functions to solve a problem that I got stuck with.

Here’s a different question that challenges that mantra. What’s the obvious reason why someone might decide to produce something like the following pattern of code rather than writing a simple “insert into t1 select * from t2;”:


declare

        cursor c1 is
        select * from t2
        ;

        type c1_array is table of c1%rowtype index by binary_integer;
        m_tab c1_array;

begin

        open c1;
        loop
                fetch c1
                bulk collect into m_tab limit 100;

                begin
                        forall i in 1..m_tab.count
                                insert into t1 values m_tab(i);
                exception
                        when others
                                then begin
                                        --  proper exception handling should go here
                                        dbms_output.put_line(m_tab(1).id);
                                        dbms_output.put_line(sqlerrm);
                                end;
                end;

                exit when c1%notfound;

        end loop;
        close c1;
end;
/

There is a very good argument for this approach.

Follow-up (Saturday 25th)

As Andras Gabor pointed out in one of the comments, there are documented scenarios where the execution plan for a simple select statement is not legal for the select part of an “insert into .. select …” statement. In particular, if you have a distributed query the most efficient execution plan may require the remote site to be the driving site but the plan for a CTAS or insert/select is required to use the local site as the driving site. There are workarounds – if you’re allowed to use them – such as creating a view at the remote site and selecting from the view, or you could create a pipelined function locally and select from the pipelined function (but that’s going to be writing PL/SQL anyway, and you’d have to create one or two object types in the database to implement it).

Another example of plan limitations, that I had not seen before (but have now found documented as “not a bug” in MoS note 20112932), showed up in a comment from Louis: a select statement may run efficiently because the plan uses a Bloom filter, but the filter disappears when the statement is used in insert/select.

These limitations, however, were not the point I had in mind. The “obvious” reason for taking the pl/sql approach is error handling. What happens if one of the rows in your insert statement raises an Oracle exception ? The entire statement has to rollback. If you adopt the PL/SQL array processing approach then you can trap each error as it occurs and decide what to do about it – and there’s an important detail behind that statement that is really important: a PL/SQL array insert can operate at virtually the same speed as the simple SQL statement once you’ve set the arraysize to a value which allows each insert to populate a couple of blocks.

Let me emphasise the critical point of the last sentence:  array inserts in PL/SQL operate at (virtually) the speed of the standard SQL insert / select.

As it stands I don’t think the exception handler in my code above could detect which row in the batch had caused the error – I’ve just printed the ID from the first row in the batch as a little debug detail that’s only useful to me because of my knowledge of the data. Realistically the PL/SQL block to handle the inserts might look more like the following:


-- In program declaration section

        dml_errors      exception;
        pragma exception_init(dml_errors, -24381);

        m_error_pos     number(6,0)     := 0;

-- ------------------------------

                begin
                        forall i in 1..m_tab.count save exceptions
                                insert into t1 values m_tab(i);
                exception
                        when dml_errors then begin

                                for i in 1..sql%bulk_exceptions.count loop

                                        dbms_output.put_line(
                                                'Array element: ' ||
                                                        sql%bulk_exceptions(i).error_index || ' ' ||
                                                        sqlerrm(-sql%bulk_exceptions(i).error_code)
                                        );

                                        m_error_pos := sql%bulk_exceptions(i).error_index;
                                        dbms_output.put_line(
                                                'Content: ' || m_tab(m_error_pos).id || ' ' || m_tab(m_error_pos).n1
                                        );

                                end loop;
                        end;

                        when others then raise;
                end;


You’ll notice that I’ve added the save exceptions clause to the forall statement. This allows Oracle to trap any errors that occur in the array processing step and record details of the guilty array element as it goes along, storing those details in an array calls sql%bulk_exceptions. My exception handler then handles the array processing exception by walking through that array.

I’ve also introduced an m_error_pos variable (which I could have declared inside the specific exception handler) to remove a little of the clutter from the line that shows I can identify exactly which row in the source data caused the problem. With a minimum of wasted resources this code now inserts all the valid rows and reports the invalid rows (and, if necessary, could take appropriate action on each invalid row as it appears).

If you’ve got a data loading requirement where almost all the data is expected to be correct but errors occasionally happen, this type of coding strategy is likely to be the most efficient thing you could do to get your data into the database. It may be slightly slower when there are no errors, but that’s a good insurance premium when compared with the crash and complete rollback that occurs if you take the simple, though slightly faster on good days, approach – and there are bound to be cases where a pre-emptive check of all the data (that would – we hope – make the pure SQL insert safe) would add far more overhead than the little bit of PL/SQL processing shown here.

Results

It’s obviously a little difficult to produce any time-based rates that demonstrate the similarity in performance of the SQL and PL/SQL approaches – the major time component in a little demo I built was about the I/O rather than the the CPU (which, in itself, rather validates the claim anyway). But if you want to do some testing here’s my data model with some results in the following section:


rem
rem     Script: plsql_loop_insert.sql
rem     Author: Jonathan Lewis
rem

execute dbms_random.seed(0)

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        cast(rownum as number(8,0))                     id,
        2 * trunc(dbms_random.value(1e10,1e12))         n1,
        cast(lpad('x',100,'x') as varchar2(100))        padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;
create table t2
nologging
noparallel
as
select
        /*+ no_parallel(t1) */
        id + 1e6        id,
        n1 - 1          n1,
        rpad('x',100,'x') padding
from t1 
;

-- update t2 set n1 = n1 + 1 where id = 2e6;
-- update t2 set n1 = n1 + 1 where id = 2e6 - 10;
-- update t2 set n1 = n1 + 1 where id = 2e6 - 20;
-- update t2 set n1 = n1 + 1 where id = 1750200;
-- update t2 set n1 = n1 + 1 where id = 1500003;
-- update t2 set n1 = n1 + 1 where id = 1500001;
commit;

alter system checkpoint;
alter system switch logfile;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

create unique index t1_i1 on t1(n1) nologging;
create unique index t1_pk on t1(id) nologging;
alter table t1 add constraint t1_pk primary key(id);


I’ve generated 1 million rows with an id column and a random integer – picking the range of the random numbers to give me a very good chance (that worked) of getting a unique set of values. I’ve doubled the random values I use for t1 so that I can substract 1 and still guarantee uniqueness when I generate the t2 values (I’ve also added 1 million to the id value for t2 for the same uniqueness reasons).

The optional update to add 1 to a scattering of rows in t2 ensures that those values go back to their original t1 values so that they can cause “duplicate key” errors. The SQL insert was a simple insert into t1 select * from t2 (ensuring that parallel query didn’t come into play), and the PL/SQL detail I used was as follows:

declare

        cursor c1 is
        select /*+ no_parallel(t2) */ * from t2
        ;

        type c1_array is table of c1%rowtype index by binary_integer;
        m_tab c1_array;

        dml_errors      exception;
        pragma exception_init(dml_errors, -24381);

        m_error_pos     number(6,0)     := 0;

begin

        open c1;
        loop
                fetch c1
                bulk collect
                into m_tab limit 100;

                begin
                        forall i in 1..m_tab.count save exceptions
                                insert into t1 values m_tab(i);

                exception
                        when dml_errors then begin

                                for i in 1..sql%bulk_exceptions.count loop

                                        dbms_output.put_line(
                                                'Array element: ' ||
                                                        sql%bulk_exceptions(i).error_index || ' ' ||
                                                        sqlerrm(-sql%bulk_exceptions(i).error_code)
                                        );

                                        m_error_pos := sql%bulk_exceptions(i).error_index;
                                        dbms_output.put_line(
                                                'Content: ' || m_tab(m_error_pos).id || ' ' || m_tab(m_error_pos).n1
                                        );

                                end loop;
                        end;

                        when others then raise;

                end;

                exit when c1%notfound;  -- when fetch < limit

        end loop;
        close c1;
end;
/

The PL/SQL output with one bad row (2e6 – 20) looked like this:

Array element: 80 ORA-00001: unique constraint (.) violated
Content: 1999980 562332925640

Here are some critical session statistics for different tests in 11g:


No bad data, insert select
--------------------------
Name                                                 Value
----                                                 -----
CPU used when call started                             944
CPU used by this session                               944
DB time                                              1,712
redo entries                                     1,160,421
redo size                                      476,759,324
undo change vector size                        135,184,996

No bad data, PL/SQL loop
------------------------
Name                                                 Value
----                                                 -----
CPU used when call started                             990
CPU used by this session                               990
DB time                                              1,660
redo entries                                     1,168,022
redo size                                      478,337,320
undo change vector size                        135,709,056


Duplicate Key (2e6-20), insert select (with huge rollback)
----------------------------------------------------------
Name                                                 Value
----                                                 -----
CPU used when call started                           1,441
CPU used by this session                             1,440
DB time                                              2,427
redo entries                                     2,227,412
redo size                                      638,505,684
undo change vector size                        134,958,012
rollback changes - undo records applied          1,049,559

Duplicate Key (2e6-20), PL/SQL loop - bad row reported
------------------------------------------------------
Name                                                 Value
----                                                 -----
CPU used when call started                             936
CPU used by this session                               936
DB time                                              1,570
redo entries                                     1,168,345
redo size                                      478,359,528
undo change vector size                        135,502,488
rollback changes - undo records applied                 74

Most of the difference between CPU time and DB time in all the tests was file I/O time (in my case largely checkpoint wait time because I happened to have small log files) but in larger systems it’s quite common to see a lot of time spent on db file sequential reads as index blocks are read for update). You can see that there’s some “unexpected” variation in CPU time – I wasn’t expecting the PL/SQL loop that failed after nearly 1M inserts to use less CPU than anything else – but the CPU numbers fluctuated a few hundredths of a second across tests, this just happened to be particularly noticeable with the first one I did – so to some extent this was probably affected by background activity relating to space management, job queue processing and all the other virtual machines on the system.

Critically I think it’s fair to say that the differences in CPU timing are not hugely significant across a reasonably sized data set, and most importantly the redo and undo hardly vary at all between the successful SQL and both PL/SQL tests. The bulk processing PL/SQL approach doesn’t add a dramatic overhead – but it clearly does bypass the threat of a massive rollback cost.

Footnote:

You might want to argue the case for using basic SQL with the log errors clause. The code method is simple and it gives you a table of rows which have caused exceptions as the insert executed – and that may be sufficient for your purposes; but there’s a problem until you upgrade to 12c.

Here’s how I had to modify my test case to demonistrate the method:


begin
        dbms_errlog.create_error_log('t1');
end;
/

insert into t1 select * from t2
log errors
reject limit unlimited
;

The procedure call creates a table to hold the bad rows, by default it’s name will be err$_t1, and it will be a clone of the t1 table with changes to column types (which might be interesting if you’ve enabled 32K columns in 12c — to be tested) and a few extra columns:


SQL> desc err$_t1
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ORA_ERR_NUMBER$                        NUMBER
 ORA_ERR_MESG$                          VARCHAR2(2000)
 ORA_ERR_ROWID$                         ROWID
 ORA_ERR_OPTYP$                         VARCHAR2(2)
 ORA_ERR_TAG$                           VARCHAR2(2000)
 ID                                     VARCHAR2(4000)
 N1                                     VARCHAR2(4000)
 PADDING                                VARCHAR2(4000)

SQL> execute print_table('select * from err$_t1')
ORA_ERR_NUMBER$               : 1
ORA_ERR_MESG$                 : ORA-00001: unique constraint (TEST_USER.T1_I1) violated

ORA_ERR_ROWID$                :
ORA_ERR_OPTYP$                : I
ORA_ERR_TAG$                  :
ID                            : 1999980
N1                            : 562332925640
PADDING                       : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

SO what’s the problem with logging errors ? Here are the sets of session stats corresponding to the ones that I reported above for the SQL and PL/SQL options. The first set comes from running this test on 11.2.0.4, the second from 12.1.0.2:


11g results
===========
Name                                                 Value
----                                                 -----
CPU used when call started                           1,534
CPU used by this session                             1,534
DB time                                              2,816
redo entries                                     3,113,105
redo size                                      902,311,860
undo change vector size                        269,307,108

12c results
===========
Name                                                 Value
----                                                 -----
CPU used when call started                             801
CPU used by this session                               801
DB time                                              3,061  -- very long checkpoint waits !!
redo entries                                     1,143,342
redo size                                      492,615,336
undo change vector size                        135,087,044


The 12c stats are very sinilar to the stats from the perfect SQL run and the two PL/SQL runs – but if you look at the 11g stats you’ll see that they’re completely different from all the other stats. The number of redo entries (if nothing else) tells you that Oracle has dropped back from array processing to single row processing in order to be able to handle the error logging (1 million rows, one entry for each row, it’s PK index entry, and the unique key index entry.)

Until 12c “error logging” is just row by row processing.

Footnote:

As far as I can tell, I first pointed out this “single row processing” aspect of the log errors option some time around December 2005.

Late Entry:

While looking for a posting about efficient updates  I came across another of my posting that compares SQL with PL/SQL for updates – it’s worth a read.

 

May 23, 2016

Virtual Partitions

Filed under: 12c,CBO,Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 1:16 pm GMT May 23,2016

Here’s a story of (my) failure prompted by a recent OTN posting.

The OP wants to use composite partitioning based on two different date columns – the table should be partitioned by range on the first date and subpartitioned by month on the second date. Here’s the (slightly modified) table creation script he supplied:


rem
rem     Script: virtual_partition.sql
rem     Dated:  May 2016
rem

CREATE TABLE M_DTX
(
        R_ID    NUMBER(3),
        R_AMT   NUMBER(5),
        DATE1   DATE,
        DATE2   DATE,
        VC GENERATED ALWAYS AS (EXTRACT(MONTH FROM DATE2))
)
PARTITION BY RANGE (DATE1) interval (numtoyminterval(1,'MONTH'))
SUBPARTITION BY LIST (VC)
        SUBPARTITION TEMPLATE (
                SUBPARTITION M1 VALUES (1),
                SUBPARTITION M2 VALUES (2),
                SUBPARTITION M3 VALUES (3),
                SUBPARTITION M4 VALUES (4),
                SUBPARTITION M5 VALUES (5),
                SUBPARTITION M6 VALUES (6),
                SUBPARTITION M7 VALUES (7),
                SUBPARTITION M8 VALUES (8),
                SUBPARTITION M9 VALUES (9),
                SUBPARTITION M10 VALUES (10),
                SUBPARTITION M11 VALUES (11),
                SUBPARTITION M12 VALUES (12)
        )
        (
        PARTITION M_DTX_2015060100 VALUES LESS THAN (TO_DATE('2015-06-01 00:00:01', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
        )
;

There’s nothing particularly exciting about this – until you get to the query requirement – the user wants to query on date1 and date2, and doesn’t know about the virtual month column, e.g. (and, I know that there should be a to_date() or ANSI equivalent here):

SELECT * FROM m_dtx WHERE date1 = trunc(sysdate) AND date2 = '01-Jun-2016';

Now, as a general rule, you don’t expect partition elimination to occur unless the partitioning column appears with a predicate that make elimination possible, so your first response to this query is that it could eliminate on date1, but can’t possibly eliminiate on vc because vc isn’t in the where clause. However it’s possible that the partitioning code might be coded to recognise that the subpartition is on a virtual column that is derived from date2, so perhaps it could generate a new predicate before optimising, for example:

date2 = '01-Jun-2016'  => vc = 6

Unfortunately, your first response is correct – the optimizer doesn’t get this clever, and doesn’t do the sub-partition elimination. Here’s the execution plan from 12.1.0.2 for the sample query, followed by the execution plan when I explicitly add the predicate vc = 6.


SQL_ID  8vk1a05uv16mb, child number 0
-------------------------------------
SELECT /*+ dynamic_sampling(0) */  * FROM m_dtx WHERE date1 =
trunc(sysdate) AND date2 = to_date('01-Jun-2016','dd-mon-yyyy')

Plan hash value: 3104206240

------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |       |       |    15 (100)|          |       |       |
|   1 |  PARTITION RANGE SINGLE|       |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
|   2 |   PARTITION LIST ALL   |       |     1 |    57 |    15   (7)| 00:00:01 |     1 |    12 |
|*  3 |    TABLE ACCESS FULL   | M_DTX |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("DATE2"=TO_DATE(' 2016-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DATE1"=TRUNC(SYSDATE@!)))



SQL_ID  33q012bdhjrpn, child number 0
-------------------------------------
SELECT /*+ dynamic_sampling(0) */  * FROM m_dtx WHERE date1 =
trunc(sysdate) AND date2 = to_date('01-Jun-2016','dd-mon-yyyy') and vc
= 6

Plan hash value: 938710559

------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |       |       |    15 (100)|          |       |       |
|   1 |  PARTITION RANGE SINGLE|       |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
|   2 |   PARTITION LIST SINGLE|       |     1 |    57 |    15   (7)| 00:00:01 |     6 |     6 |
|*  3 |    TABLE ACCESS FULL   | M_DTX |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("DATE2"=TO_DATE(' 2016-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DATE1"=TRUNC(SYSDATE@!)))


Note how the predicate vc = 6  doesn’t show up in the predicate section in either case, but the execution plan shows PARTITION LIST ALL at operation 2 when we omit the predicate and PARTITION LIST SINGE when we include it (with suitable values also appearing for Pstart and Pstop). (The cost, by the way, is the cost of scanning a whole (range)partition whether or not the optimizer expects to restrict that scan to just one sub-partition.)

So the optimizer isn’t quite clever enough (yet). BUT … the optimizer can be very clever with constraints, combining constraints with predicates and applying transitive closure to produce new predicates – so maybe we could get the optimizer to do this if we helped it a little bit. Given the table definition supplied I’m going to assume that the date2 column is supposed to be non-null, so let’s add some truthful constraints/declarations to the table definition:


alter table m_dtx modify date2 not null;
alter table m_dtx modify vc  not null;
alter table m_dtx add constraint md_ck_vc check (vc = extract(month from date2));

Alas, this didn’t make any difference to the execution plan. But it did do something surprising to my attempts to load data into the table:


insert into m_dtx (r_id, r_amt, date1, date2)
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        mod(rownum, 1000),
        rownum,
        trunc(sysdate,'yyyy') + dbms_random.value(0,365),
        trunc(sysdate,'yyyy') + dbms_random.value(0,365)
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4
;

insert into m_dtx (r_id, r_amt, date1, date2)
*
ERROR at line 1:
ORA-01400: cannot insert NULL into (???)

So the array insert with the virtual column doesn’t like the NOT NULL constraint on the virtual column because vc is, presumably, still null when the constraint is checked (though there’s no problem with single row inserts with the values() clause – I wonder what happens with the PL/SQL “FORALL” clause) – so let’s remove the not null constraint on vc and see what happens.


insert into m_dtx (r_id, r_amt, date1, date2)
*
ERROR at line 1:
ORA-02290: check constraint (TEST_USER.MD_CK_VC) violated

Unsurprisingly, given the fact that Oracle didn’t like the not null constraint, the critical check constraint also fails. This, by the way, is odd because a check constraint should accept a row when the constraint doesn’t evaluate to FALSE, so (a) vc can’t have been evaluated at this point or the constraint would evaluate to TRUE – which is not FALSE, and (b) vc at this point can no longer be null or the constraint would evaluate to NULL – which is not FALSE: so what “value” has vc got that makes the constraint check return FALSE ?

Bottom line:

I can see some scope for an optimizer enhancement that tries to find eliminating predicates from virtual columns; and I think there’s a need for ensuring that we can safely add constraints to virtual columns – after all we might want to create an index on a virtual column and sometimes we need a NOT NULL declaration to ensure that an index-only execution path can be found. Unfortunately I have to end this blog without finding an immediate solution for the OP.

Despite this failure, though, there are cases (as I showed a couple of years ago) where the optimizer in 12c can get clever enough to recognize the connection between a queried date column and the virtual partitioning column based on that date column.

May 9, 2016

RI Locks

Filed under: deadlocks,Indexing,IOT,Locks,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 12:24 pm GMT May 9,2016

RI = Referential Integrity: also known informally as parent/child integrity, and primary (or unique) key/foreign key checking.

I’m on a bit of a roll with things that I must have explained dozens or even hundreds of times in different environments without ever formally explaining them on my blog. Here’s a blog item I could have done with to response to  a question that came up on the OTN database forum over the weekend.

What happens in the following scenario:


-- session 1

create table parent (
        id        number(8,0),
        constraint par_pk primary key(id)
);

create table child  (
        id_p      number(8,0) not null references parent,
        id_c      number(8,0) not null,
        constraint child_pk primary key(id_p, id_c)
)
;

insert into parent values(1);

-- session 2
insert into child values(1,1);

Since the parent row corresponding to the child row doesn’t (yet) seem to exist as far as session 2 is concerned you might expect session 2 to respond immediately with an error message like:

ERROR at line 1:
ORA-02291: integrity constraint (TEST_USER.SYS_C0017926) violated - parent key not found

In fact, although the end-user is not allowed to see the uncommitted parent row, the user’s process can see the uncommitted row and will wait until session 1 commits or rolls back – so if you examine v$lock for the current locks for the two sessions you’d see something like this:

  1  select  sid, type, id1, id2, lmode, request, ctime, block
  2  from    V$lock
  3  where   sid in (select sid from V$session where username = 'TEST_USER')
  4  and     type != 'AE'
  5  order by
  6*         sid, type desc
  7  /

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
         3 TX     327709      12584          6          0        283          1
           TM     143734          0          2          0        283          0
           TM     143732          0          3          0        283          0

       250 TX     589829      12877          6          0        240          0
           TX     327709      12584          0          4        240          0
           TM     143734          0          3          0        240          0
           TM     143732          0          3          0        240          0


7 rows selected.

In the above, SID 250 is session 2: it’s holding a transaction lock (TX) in mode 6 because it has acquired an undo segment and has generated some undo, it’s also waiting for a transaction lock in mode 4 (share) and – checking id1 and id2 – we can see that the transaction table entry it’s waiting for is held by session 3 in mode 6 (and we also note that the lock held by session 3 is marked as a blocker).

If session 3 commits (thus releasing the transaction lock) session 250 will continue processing the insert; if session 3 rolls back session 250 will raise error ORA-02291 and roll back its insert statement. (Note: if this were a multi-statement transaction it would only be the insert into child that would be rolled back; that’s another one of those details that is important but often isn’t stated explicitly, leaving people believing that the entire transaction would be rolled back.)

Updates and deletes can produce the same effects. Imagine that we have just created the two tables, and then run the following:


-- session 1
insert into parent values(1);
commit;
delete from parent where id = 1;

-- session 2
insert into child values(1,1);

Again session 2 will wait for session 1 to commit or roll back. In this case if session 1 commits session 2 will raise Oracle error ORA-02291, if session 1 rolls back session 2 will continue with the insert.

Deadlocks

Whenever you can demonstrate a way of producing a wait chain you can also manage to produce a deadlock. Consider the following (starting, again, from empty tables);


-- (1) session 1
insert into parent values(1);

-- (2) session 2
insert into parent values(2);

-- (3) session 1
insert into child values(2,2);

-- (4)session 2
insert into child values(1,1);

Session 1 will start waiting for session 2 to commit (or rollback) at step 3, then session 2 will start to wait for session 1 at step 4 – with the result that session 1 will recognise the deadlock after about three seconds and rollback its last statement, raising exception ORA-00060 and dumping a trace file. (Note: session 1 will not, as many people think, roll back the entire transaction, it will only roll back the statement that allowed the deadlock to develop). Session 2 will still be waiting for session 1 to commit or rollback its insert into parent. Contrary to the popular claim, Oracle will not “resolve” the deadlock, it will simply break the deadlock leaving one session waiting for the other session to respond appropriately to the deadlock error.

For reference, here’s the deadlock graph (from a 12c trace file) produced by session 1 (SID = 3) for this demo:


Deadlock graph:
                                          ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name                             process session holds waits  process session holds waits
TX-00010017-000026C7-00000000-00000000          6       3     X             33     250           S
TX-000A000D-000026F8-00000000-00000000         33     250     X              6       3           S

session 3: DID 0001-0006-00000004       session 250: DID 0001-0021-00000041
session 250: DID 0001-0021-00000041     session 3: DID 0001-0006-00000004

Rows waited on:
  Session 3: no row
  Session 250: no row

When you see a deadlock graph with TX waits of type S (share, mode 4) it’s a very good bet that the wait has something to do with indexes – which may mean referential integrity as discussed here, but may mean collisions on primary keys, and may mean something to do with simple collisions on index-organized tables. You’ll notice that the “Rows waited on:” section shows no row – unfortunately in earlier versions of Oracle you may find a spurious row entry here because the wait information from some other (block) wait has been left in the relevant columns in v$session.

May 3, 2016

Debugging

Filed under: CBO,compression,Execution plans,Infrastructure,Oracle — Jonathan Lewis @ 8:11 am GMT May 3,2016

The OTN database forum supplied a little puzzle a few days ago – starting with the old, old, question: “Why is the plan with the higher cost taking less time to run?”

The standard (usually correct) answer to this question is that the optimizer doesn’t know all it needs to know to predict what’s going to happen, and even if it had perfect information about your data the model used isn’t perfect anyway. This was the correct answer in this case, but with a little twist in the tail that made it a little more entertaining. Here’s the query, with the two execution plans and the execution statistics from autotrace:


SELECT  /* INDEX(D XPKCLIENT_ACCOUNT) */ 
        E.ECID,A.acct_nb
FROM    
        client_account d, 
        client         e, 
        account        a
where
        A.acct_nb ='00000000000000722616216'</li>


AND     D.CLNT_ID = E.CLNT_ID
AND     D.ACCT_ID=A.ACCT_ID;

Plan (A) with a full tablescan of client_account – cost 808, runtime 1.38 seconds, buffer gets 17,955


-------------------------------------------------------------------------------------------------
| Id | Operation                      | Name           | Rows  | Bytes  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |                |     1 |    59  |   808 (14) | 00:00:10 |
|  1 |  NESTED LOOPS                  |                |     1 |    59  |   808 (14) | 00:00:10 |
|  2 |   NESTED LOOPS                 |                |     1 |    59  |   808 (14) | 00:00:10 |
|* 3 |    HASH JOIN                   |                |     1 |    42  |   806 (14) | 00:00:10 |
|  4 |     TABLE ACCESS BY INDEX ROWID| ACCOUNT        |     1 |    30  |     5  (0) | 00:00:01 |
|* 5 |      INDEX RANGE SCAN          | XAK1ACCOUNT    |     1 |        |     4  (0) | 00:00:01 |
|  6 |     TABLE ACCESS FULL          | CLIENT_ACCOUNT |  9479K|   108M |   763 (10) | 00:00:09 |
|* 7 |    INDEX UNIQUE SCAN           | XPKCLIENT      |     1 |        |     1  (0) | 00:00:01 |
|  8 |   TABLE ACCESS BY INDEX ROWID  | CLIENT         |     1 |    17  |     2  (0) | 00:00:01 |
-------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
     0  recursive calls
     0  db block gets
 17955  consistent gets
     0  physical reads
     0  redo size
   623  bytes sent via SQL*Net to client
   524  bytes received via SQL*Net from client
     2  SQL*Net roundtrips to/from client
     0  sorts (memory)
     0  sorts (disk)
     1  rows processed

Plan (B) with an index fast full scan on a client_account index – cost 1,190, runtime 0.86 seconds, buffer gets 28696


----------------------------------------------------------------------------------------------------
| Id | Operation                      | Name              | Rows  | Bytes  | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |                   |     1 |    59  |  1190  (8) | 00:00:14 |
|  1 |  NESTED LOOPS                  |                   |     1 |    59  |  1190  (8) | 00:00:14 |
|  2 |   NESTED LOOPS                 |                   |     1 |    59  |  1190  (8) | 00:00:14 |
|* 3 |    HASH JOIN                   |                   |     1 |    42  |  1188  (8) | 00:00:14 |
|  4 |     TABLE ACCESS BY INDEX ROWID| ACCOUNT           |     1 |    30  |     5  (0) | 00:00:01 |
|* 5 |      INDEX RANGE SCAN          | XAK1ACCOUNT       |     1 |        |     4  (0) | 00:00:01 |
|  6 |     INDEX FAST FULL SCAN       | XPKCLIENT_ACCOUNT | 9479K |   108M |  1145  (5) | 00:00:13 |
|* 7 |    INDEX UNIQUE SCAN           | XPKCLIENT         |     1 |        |     1  (0) | 00:00:01 |
|  8 |   TABLE ACCESS BY INDEX ROWID  | CLIENT            |     1 |    17  |     2  (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
     0  recursive calls
     0  db block gets
 28696  consistent gets
     0  physical reads
     0  redo size
   623  bytes sent via SQL*Net to client
   524  bytes received via SQL*Net from client
     2  SQL*Net roundtrips to/from client
     0  sorts (memory)
     0  sorts (disk)
     1  rows processed

Note, particularly, that the two plans are the same apart from operation 6 where a full tablescan changes to an index fast full scan, predicting the same number of rows but with an increase of 50% in the cost; the increase in cost is matched by an increase in the reported workload – a 60% increase in the number of consistent reads and no disk reads or recursive SQL in either case. Yet the execution time (on multiple repeated executions) dropped by nearly 40%.

So what’s interesting and informative about the plan ?

The cost of a tablescan or an index fast full scan is easy to calculate; broadly speaking it’s “size of object” / “multiblock read count” * k, where k is some constant relating to the hardware capability. The costs in these plans and the autotrace statistics seem to be telling us that the index is bigger than the table, while the actual run times seem to be telling us that the index has to be smaller than the table.

It’s easy for an index to be bigger than its underlying table, of course; for example, if this table consisted of nothing but two short columns the index could easily be bigger (even after a rebuild) because it would be two short columns plus a rowid. If that were the case here, though, we would expect the time to fast full scan the index to be higher than the time to scan the table.

So two thoughts crossed my mind as I looked at operation 6:

  • Mixing block sizes in a database really messes up the optimizer costing, particularly for tablescans and index fast full scans. Maybe the table had been built in a tablespace using 32KB  blocks while the index had been built in a tablespace using the more common 8KB blocksize – I didn’t want to start working out the arithmetic but that might be just enough to produce the contradiction.
  • Maybe the table was both bigger AND smaller than the index – bigger because it held more data, smaller because it had been compressed. If so then the difference in run-time would be the overhead of decompressing the rows before projecting and comparing the data.

Conveniently the OP has included an extract from the 10053 trace:


Table Stats::
  Table: CLIENT_ACCOUNT  Alias:  D
    #Rows: 9479811  #Blks:  18110  AvgRowLen:  71.00  ChainCnt:  0.00
  Column (#1): CLNT_ID(
    AvgLen: 6 NDV: 1261035 Nulls: 0 Density: 0.000001 Min: 0 Max: 4244786
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 239
  Column (#2): ACCT_ID(
    AvgLen: 6 NDV: 9479811 Nulls: 0 Density: 0.000000 Min: 1 Max: 22028568
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 255

Index Stats::
  Index: XPKCLIENT_ACCOUNT  Col#: 1 2
    LVLS: 2  #LB: 28543  #DK: 9479811  LB/K: 1.00  DB/K: 1.00  CLUF: 1809449.00

Note that the index is called xpclient_account – which suggests “primary key” –  and the number of distinct keys in the index (#DK) matches the number of rows in the table(#Rows). The index and table stats seem to be consistent so we’re not looking at a problem of bad statistics.

Now to do some simple (ballpark) arithmetic: for the table can we check if  “rows * average row length / 8K =  blocks”. We can read the numbers directly from the trace file:  9,500,000 * 71 / 8,000 = 84,000.  It’s wrong by a factor of about 4 (so maybe it’s a 32K block, and maybe I could rule out that possibility by including more detail in the arithmetic – like allowing properly for the block header, row overheads, pctfree etc).

For the index – we believe it’s the primary key, so we know the number of rows in the index – it’s the same as the number of distinct keys. As for the length of an index entry, we have the index definition (col#: 1 2) and we happen to have the column stats about those columns so we know their average length. Allowing for the rowid and length bytes we can say that the average index entry is (6 +1) + (6 + 1) + 6 = 20 bytes.  So the number of leaf blocks should be roughy 9,500,000 * 20 / 8,000 = 23,750. That’s close enough given the reported 28,543 and the fact that I haven’t bothered to worry about row overheads, block overheads and pctfree.

The aritmetic provides an obvious guess – which turned out to be correct: the table is compressed, the index isn’t. The optimizer hasn’t allowed for the CPU cost of decompressing the compressed rows, so the time required to decompress 9.5M rows doesn’t appear in the execution plan.

Footnote.

Looking at the column stats, it looks like there are roughly 8 acct_ids for each clnt_id, so it would probably be sensible to compress the primary key index (clnt_id, acct_id) on the first column as this would probably reduce the size of the index by about 20%.

Better still – the client_account table has very short rows – it looks like a typical intersection table with a little extra data carried. Perhaps this is a table that should be an index-organized table with no overflow. It looks like there should also be an index (acct_id, clnt_id) on this table to optimse the path from account to client and this would become a secondary index – interestingly being one of those rare cases where the secondary index on an IOT might actually be a tiny bit smaller than the equivalent index on a heap table because (in recent versions of Oracle) primary key columns that are included in the secondary key are not repeated in the index structure. (It’s a little strange that this index doesn’t seem to exist already – you might have expected it to be there given the OP’s query, and given that it’s an “obvious” requirement as an index to protect the foreign key.)

The only argument against the IOT strategy is that the table clearly compresses very well as a heap table, so a compressed heap table plus two B-tree indexes might be more cost-effective than an IOT with a single secondary index.

 

Next Page »

Blog at WordPress.com.