Oracle Scratchpad

September 26, 2016

PK Histogram

Filed under: Histograms,Oracle,Statistics — Jonathan Lewis @ 10:25 am BST Sep 26,2016

One of the little myths of Oracle appeared on the Oracle-L list server a few days ago – the one that says: “you don’t need a histogram on a single column unique/primary key”.

Not only can a histogram be helpful on a column that’s declared to hold unique values, the optimizer may even spot the need automatically. It’s a little unusual (and probably the result of poor programming practice) but it does happen. Here’s an example demonstrating the principle:


rem
rem     Script:         pk_histogram.sql
rem     Author:         Jonathan Lewis
rem
rem     Last tested
rem             12.1.0.2
rem             11.2.0.4
rem             11.1.0.7
rem             10.2.0.5
rem

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum  id
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4
;

insert into t1 select id + 1e6 from t1;
insert into t1 select id + 1e7 from t1;

alter table t1 add constraint t1_pk primary key(id);

select
        /*+ dynamic_sampling(0) */
        *
from    t1
where
        id between 12000 and 13000
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1'
        );
end;
/


select
        column_name, sample_size,
        num_distinct, num_nulls, density,
        histogram, num_buckets
from   
        user_tab_cols
where
        table_name = 'T1'
order by
        column_name
;

I’ve created a small data set with some large gaps in the ranges of values used, then queried the data with a range-based predicate that will return no rows. After that I’ve let Oracle do its own thing with gathering stats (I think all the preferences are at their defaults). This is the result I got:

COLUMN_NAME          SAMPLE_SIZE NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM       NUM_BUCKETS
-------------------- ----------- ------------ ---------- ---------- --------------- -----------
ID                          5555        40000          0 .000024752 HEIGHT BALANCED         254

Oracle will gather a histogram automatically if it can see that the distribution of the data you are querying is skewed; but people tend to think of “skewed” as meaning there is wide variation in the number of rows for a given value and forget that it can also mean a variation in the number of rows within a given size of range. All I’ve done with my example is cue Oracle to the fact that I have big gaps in the range of known values – so it has produced a histogram that lets it know where the gaps are.

Note: the results above are from 11.2.0.4; in 12.1.0.2 I got a HYBRID histogram, and on 10.2.0.5 the sample size was 40,000.
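If you want to compare the optimizer’s arithmetic with and without the histogram, here’s a quick sketch: regather with a method_opt of ‘for all columns size 1’ to remove the histogram, check the estimated Rows for the original query under autotrace, then regather with the default method_opt and repeat.

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

set autotrace traceonly explain

select
        /*+ dynamic_sampling(0) */
        *
from    t1
where
        id between 12000 and 13000
;

set autotrace off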

September 20, 2016

Why Bother

Filed under: Uncategorized — Jonathan Lewis @ 11:16 am BST Sep 20,2016

This note comes to you prompted by “Noons” in a recent twitter exchange.

In response to a complaint by Lukas Eder about having to educate people in 2016 that there is no (performance) difference between count(*) and count(1), Nuno asked me to blog about my claim that this non-difference is a good educational example on at least three different counts.

One thing I won’t count in the list of three is an actual demonstration of the non-difference. I would be surprised if anyone reading this blog needed a proof of the point, but if you’re interested here are a few examples from my past writings – including a couple where the problem becomes subtler and the “non-difference” stops happening:

You’ll notice that the earliest note I’ve published on the blog dates back 8 years or more; and one of the notes references back to a FAQ dated 2001 published on my old website. So why bother making a fuss about this non-difference when it ought to be common knowledge and really isn’t worth the effort?

  • There is a huge amount of garbage on the internet about how Oracle works and what you can do to make it work better, so when you demonstrate to someone that claims can be backed up by proof and that an idea they’ve picked up from the internet is wrong, it might make them pause for thought the next time they read some unjustified claim that really matters.
  • This is a delightfully simple example to use to demonstrate how you can get a little extra information from an execution plan that could help you understand the cause of a performance problem – see the sketch after this list. (I could make the same comment about being able to get something helpful from a 10053 trace file, but I won’t claim that as reason number 3.)
  • Starting even with this extremely simple example, which proves one point, we can see what happens as we push the boundaries of what we know. Thinking about this problem teaches us how to come up with new questions and modified examples which enhance our understanding of the product.
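Here’s a minimal sketch of the sort of demonstration I mean, run against any convenient table (I’m assuming the t1 from the previous post); the plans, costs and cardinalities for the two statements should be identical because the optimizer simply treats count(1) as count(*):

explain plan for select count(*) from t1;
select * from table(dbms_xplan.display);

explain plan for select count(1) from t1;
select * from table(dbms_xplan.display);

If you add ‘+projection’ to the dbms_xplan format you may even see the count(1) reported as count(*) in the column projection information.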

Frankly I find it extremely tedious how often I have to explain mechanisms that I first described in Practical Oracle 8i – it can be terribly depressing to see how ignorant people can be – but there are always new people starting their careers in Oracle, the manuals they look to are more concerned with describing how to use GUIs than with explaining the technology, and the old garbage on the internet won’t go away. New arrivals have little previous experience to help them tell the difference between fact and fiction until someone provides the demonstration – so we keep repeating the simple examples.

My favourite quote about the internet came by way of Terry Pratchett (in “The Truth”): “A lie can run round the world before the truth has got its boots on.”

You just have to keep hammering away at the truth.


September 16, 2016

Stats collection time

Filed under: Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 12:58 pm BST Sep 16,2016

Yesterday I posted a note about querying dba_optstat_operations to get a quick report of how long calls to dbms_stats had been taking, but said I had another script that helped to fill some of the gaps it left. One of my readers pointed out fairly promptly that 12c enhances the feature considerably, with a view dba_optstat_operation_tasks that (for example) lists all the tables processed during a single call to gather_schema_stats.
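For anyone on 12c, this is roughly the sort of query that becomes possible (a sketch only – I’m assuming the obvious column names, so check the view definition on your version):

select
        opid, target, start_time, end_time, status
from
        dba_optstat_operation_tasks
where
        start_time > sysdate - 1
order by
        start_time
;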

Well, I wrote my script years (if not decades) before 12c came out, so I’m going to publish it anyway.

It definitely falls into the quick and dirty category. It’s not trying to be clever, makes a critical assumption about the recent past, and isn’t a generally useful tool, but I’ve found it a quick way to highlight a possible problem with expensive stats collection routines (especially those written by 3rd parties who have disabled Oracle’s automatic job.)

The concept is very simple: list all the tables in the order they were last analyzed and assume that for any “obviously continuous” stretch of time the start of stats collections for one table coincided with the end of stats collection for the previous one. A little fiddling with the lag() function then allows you to report the time spent on gathering the stats.

rem
rem     stats_check.sql
rem
rem     Quick and dirty to get an idea of
rem     where the time went on the most
rem     recent stats collection job.
rem
rem     Weakness: It assumes that the start
rem     of stats collection of a table is the
rem     end of stats collection for the previous
rem     table.  This may be far from true.
rem

define m_schema = '&1'

set linesize 144
set pagesize 60
set trimspool on

column a_date format a20

spool stats_check

select
        table_name,
        blocks, num_rows, sample_size,
        round(100 * sample_size/nullif(num_rows,0),2)   sample_pct,
        to_char(last_analyzed,'dd-mon-yyyy hh24:mi:ss') a_date,
        to_char(
                round(1440 *
                        (
                        last_analyzed -
                        lag(last_analyzed, 1) over (order by last_analyzed)
                        ), 2
                ),'9,999.00'
        )                        a_time
from
        dba_tables
where
        owner = upper('&m_schema')
order by
        last_analyzed
;

spool off

It’s a very simple script – one detail that’s quite useful is the presence of the data about table size and sample size. But that’s really only the starting point. If you’re gathering stats on a table then most of the time might be spent on gathering stats for histograms or indexes, so there’s plenty of scope to add bits to the script to report number of histograms per table, number of indexes per table, and so on. Personally I tend to do “incremental” trouble-shooting and keep my scripts (and their run time) short, so I’d only worry about specific details if I saw a headline figure that seemed worth a little more effort.
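If you do want to add those details, this is a sketch of the sort of extra counts I mean – they could be merged into the main query as scalar subqueries or an outer join:

select
        table_name,
        sum(case when histogram != 'NONE' then 1 else 0 end)   histogram_cols
from
        dba_tab_cols
where
        owner = upper('&m_schema')
group by
        table_name
;

select
        table_name, count(*) index_count
from
        dba_indexes
where
        table_owner = upper('&m_schema')
group by
        table_name
;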

Here’s some (camouflaged) output – which happens to highlight one of the weaknesses of the approach:

TABLE_NAME              BLOCKS   NUM_ROWS SAMPLE_SIZE SAMPLE_PCT A_DATE                  A_TIME
--------------------- --------  --------- ----------- ---------- --------------------  --------
TABLE_AAA                    0          0           0            14-jul-2016 01:27:55       .00
TABLE_BBB              6320666  540356865   108071373         20 14-jul-2016 02:30:45     62.83
TABLE_CCC              9587372  930770535   186154107         20 14-jul-2016 03:36:13     65.47
TABLE_DDD              2272319  104698080    20939616         20 14-jul-2016 04:19:45     43.53
TABLE_EEE              4413696  258746720    51749344         20 14-jul-2016 04:49:13     29.47
TABLE_FFF                    0          0           0            14-jul-2016 05:02:28     13.25
TABLE_GGG              4377656  258740995    51748199         20 14-jul-2016 05:02:28       .00
TABLE_HHH                    0          0           0            14-jul-2016 05:02:28       .00
TABLE_III                   60         52          52        100 14-jul-2016 05:02:29       .02
TABLE_JJJ                   60        170         170        100 14-jul-2016 05:02:30       .02
TABLE_KKK                   60        100         100        100 14-jul-2016 05:02:30       .00
TABLE_LLL                   60       5548        5021       90.5 14-jul-2016 05:02:31       .02
TABLE_MMM                 4660     686575      137315         20 14-jul-2016 05:02:32       .02
...
TABLE_TTT                    0          0           0            14-jul-2016 05:02:38       .00
TABLE_UUU                   60        659         659        100 14-jul-2016 05:02:38       .00
TABLE_VVV               325324   18618685     3723737         20 14-jul-2016 05:04:14      1.60
TABLE_WWW                   60       1106        1106        100 14-jul-2016 05:06:01      1.78
TABLE_XXX                24001    2534810      506962         20 14-jul-2016 05:06:13       .20
TABLE_YYY                    0          0           0            14-jul-2016 05:06:20       .12
TABLE_ZZZ                    0          0           0            14-jul-2016 05:06:20       .00

You’ll notice that the zero row FFF table is reported to have taken 13.25 minutes for stats gathering, while the much larger GGG table (4.4M blocks, 259M rows) took no time at all. That’s the problem of sorting by the last_analyzed time when it’s accurate only to the second and some tables take less than a second to gather stats. Clearly these two time estimates are the wrong way round. (WWW and XXX look fairly suspect too, but their timestamps are different so there must be some other explanation – but the time is too small to worry about at the moment.)

Tables BBB and CCC are also an example of how variable the results are when you compare cost with time. Although table CCC (and its sample) had nearly twice the rows of BBB the gather time barely changed – CCC has fewer histograms and indexes. The same variation shows up clearly between EEE and GGG (after allowing for the ordering error), and slightly less obviously between DDD and all the other large tables.

That brings me to one more threat of misinterpretation (and a reason why I’m not terribly keen on publishing scripts that I’ve hacked together). If you see Oracle claiming (through this script) that it took no more than a couple of minutes to gather stats on a 5 billion row table it may be true – but maybe that stats collection was only gathering stats for a single, fairly new, partition of a partitioned table. There are, no doubt, several boundary cases that a simple script like this will have overlooked – and when I run it I’ll be at a client site with knowledge of the client database, so I’ll spot the anomalies and omissions.

Update

One simple enhancement to the script above – indexes take time to analyze as well, and if you query only for tables the time spent analyzing the indexes on a table will be attributed to the NEXT table in the output. All we have to do is change the reference to dba_tables to be an inline view referencing dba_tables and dba_indexes in a UNION ALL:


select
        table_name, index_name,
        blocks, num_rows, sample_size,
        round(100 * sample_size/nullif(num_rows,0),2)   sample_pct,
        to_char(last_analyzed,'dd-mon-yyyy hh24:mi:ss') a_date,
        to_char(
                round(1440 *
                        (
                        last_analyzed -
                        lag(last_analyzed, 1) over (order by last_analyzed)
                        ), 2
                ),'9,999.99'
        )                        a_time
from
        (
        select
                owner, table_name, null as index_name,
                blocks, num_rows, sample_size, last_analyzed
        from
                dba_tables
        union all
        select
                table_owner, table_name, index_name,
                leaf_blocks, num_rows, sample_size, last_analyzed
        from
                dba_indexes
        )
where
        owner = upper('&m_schema')
order by
        last_analyzed
;

You’ll notice that I’ve left the ‘&m_schema’ substitution outside the inline view so that I only have to supply a value once; the optimizer is smart enough to push this predicate inside the union all view so that the plan shows the two separate subqueries being operated only for the given schema.
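If you want to check the predicate push for yourself, here’s a quick sketch using a cut-down version of the view – the Predicate Information section should show the owner/table_owner filter repeated inside each branch of the UNION ALL rather than applied once above it:

explain plan for
select
        table_name, index_name, last_analyzed
from    (
        select owner, table_name, null as index_name, last_analyzed from dba_tables
        union all
        select table_owner, table_name, index_name, last_analyzed from dba_indexes
        )
where
        owner = upper('&m_schema')
;

select * from table(dbms_xplan.display(null,null,'basic +predicate'));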

Footnote

A reader has emailed me to point out that one of the cases where this script won’t help (or, at the least, would produce highly misleading results) is when the CONCURRENT preference has been set to true, allowing multiple jobs to run and gather stats on different objects at the same time.
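You can check the current setting of that preference with a one-liner (the legal values changed between 11.2 and 12c, so interpret the result accordingly):

select dbms_stats.get_prefs('CONCURRENT') from dual;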


September 15, 2016

Stats time

Filed under: Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 9:41 am BST Sep 15,2016

I don’t really remember how long it’s been since Oracle created an automatic log of how long a call to the dbms_stats package took, though it was probably some time in the 10g time-line. It wasn’t until it had been around for several years, though, that I wrote a little script (possibly prompted by a comment from Martin Widlake) that I’ve used occasionally since to see what’s been going on in the past, how variable stats collection times have been, and what unexpected dbms_stats calls an application may have been making. Here’s what it currently looks like:

rem
rem     Script:         optimizer_operations.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2010
rem     Purpose:
rem
rem     Last tested
rem             11.2.0.4
rem     Not tested
rem             12.1.0.2
rem             10.2.0.5
rem
rem     Notes:
rem     Simple script to list calls to dbms_stats showing
rem     what they did and how long it took.
rem
rem     Notice that gather_database_stats and gather_schema_stats
rem     don't tell you anything about individual tables.
rem

spool optimizer_operations

column start_time       format a35
column operation        format a28
column target           format a28

column duration         format a12
column max_duration     format a12

column utc_start_time   format a28
column first_utc_time   format a28

select
        operation,
        cast(sys_extract_utc(min(start_time)) as timestamp(0))                  first_utc_time,
        --
        to_char(extract(hour from max(end_time - start_time)),'FM00') ||
        ':' ||
        to_char(extract(minute from max(end_time - start_time)),'FM00') ||
        ':' ||
        to_char(round(extract(second from max(end_time - start_time)),2),'FM00.00')     max_duration,
        --
        count(*)        Frequency
from
        dba_optstat_operations
group by
        operation
order by
        operation
;

select
        operation,
        target,
        -- start_time,
        cast(sys_extract_utc(start_time) as timestamp(0))                               utc_start_time,
        to_char(extract(hour from (end_time - start_time)),'FM00') ||
        ':' ||
        to_char(extract(minute from (end_time - start_time)),'FM00') ||
        ':' ||
        to_char(round(extract(second from (end_time - start_time)),2),'FM00.00')        duration
from
        dba_optstat_operations
where
        end_time - start_time >= numtodsinterval(2,'SECOND')
and     target is not null
order by
        start_time
;

spool off

This version of the script first prints a summary showing how many calls to each operation have occurred, the worst case duration for each, and when the first one happened – in case a recent change has introduced a problem. You might find it useful to include other bits of information, like the min(duration), in the first section; it’s not difficult to modify the SQL. The second part of the report is then just a list of the individual calls, with a minimum interesting time included.
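For example, mirroring the max_duration column, the extra expression for the select list of the first query would look like this (plus a matching “column min_duration format a12” line):

        to_char(extract(hour from min(end_time - start_time)),'FM00') ||
        ':' ||
        to_char(extract(minute from min(end_time - start_time)),'FM00') ||
        ':' ||
        to_char(round(extract(second from min(end_time - start_time)),2),'FM00.00')     min_duration,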

And here’s some (camouflaged) output:


OPERATION                    FIRST_UTC_TIME               MAX_DURATION  FREQUENCY
---------------------------- ---------------------------- ------------ ----------
copy_table_stats             14-JUL-16 11.12.55 PM        00:00:03.43         320
gather_database_stats(auto)  14-JUL-16 06.00.01 AM        01:13:17.25          67
gather_schema_stats          14-JUL-16 04.00.02 AM        05:16:01.22          35
gather_table_stats           14-JUL-16 01.15.59 AM        00:04:07.98         153


OPERATION                    TARGET                             UTC_START_TIME               DURATION
---------------------------- ---------------------------------- ---------------------------- ------------
gather_schema_stats          MY_SCHEMA                          01-AUG-16 04.00.02 AM        04:40:33.03
copy_table_stats             SYS.WRH$_SQLSTAT.WRH$_SQLSTA_24706 01-AUG-16 11.18.15 PM        00:00:02.04
                             69947_32443

gather_schema_stats          MY_SCHEMA                          02-AUG-16 04.00.02 AM        04:57:39.61
gather_schema_stats          MY_SCHEMA                          03-AUG-16 04.00.02 AM        04:40:10.88
gather_schema_stats          MY_SCHEMA                          04-AUG-16 04.00.02 AM        05:00:48.05
gather_schema_stats          MY_SCHEMA                          05-AUG-16 04.00.03 AM        05:07:42.33
gather_schema_stats          MY_SCHEMA                          06-AUG-16 04.00.02 AM        04:50:39.93
gather_table_stats           SYS.WRI$_ADV_ACTIONS               06-AUG-16 12.42.38 PM        00:00:03.01
gather_table_stats           SYS.WRI$_ADV_FINDINGS              06-AUG-16 12.42.41 PM        00:00:02.94
gather_table_stats           SYS.WRI$_ADV_RECOMMENDATIONS       06-AUG-16 12.42.46 PM        00:00:02.03
gather_table_stats           SYS.WRI$_ADV_SQLT_PLANS            06-AUG-16 12.42.49 PM        00:00:04.45
gather_schema_stats          MY_SCHEMA                          07-AUG-16 04.00.02 AM        04:48:57.23
gather_schema_stats          MY_SCHEMA                          08-AUG-16 04.00.01 AM        04:40:57.81
gather_table_stats           MY_SCHEMA.TABLE_A                  09-AUG-16 01.16.21 AM        00:00:03.61
gather_schema_stats          MY_SCHEMA                          09-AUG-16 04.00.03 AM        04:51:14.48
gather_table_stats           MY_SCHEMA.TABLE_A                  10-AUG-16 01.16.44 AM        00:00:02.07
gather_schema_stats          MY_SCHEMA                          10-AUG-16 04.00.02 AM        04:36:19.35
gather_schema_stats          MY_SCHEMA                          11-AUG-16 04.00.02 AM        04:43:30.77
copy_table_stats             SYS.WRH$_FILESTATXS.WRH$_FILEST_24 11-AUG-16 11.24.21 PM        00:00:02.65
                             70669947_32683

gather_schema_stats          MY_SCHEMA                          12-AUG-16 04.00.02 AM        04:42:15.92
gather_schema_stats          MY_SCHEMA                          12-AUG-16 04.00.02 AM        04:42:15.89
gather_schema_stats          MY_SCHEMA                          13-AUG-16 04.00.02 AM        04:41:06.68
gather_table_stats           SYS.WRI$_ADV_PARAMETERS            13-AUG-16 10.31.35 AM        00:00:03.88
gather_table_stats           SYS.WRI$_ADV_RATIONALE             13-AUG-16 10.31.39 AM        00:00:03.03
gather_table_stats           MY_SCHEMA.TABLE_B                  13-AUG-16 01.24.33 PM        00:04:07.98
gather_table_stats           MY_SCHEMA.TABLE_C.SYS_P12423       13-AUG-16 02.06.34 PM        00:03:22.33
copy_table_stats             SYS.WRH$_ACTIVE_SESSION_HISTORY.WR 13-AUG-16 11.03.35 PM        00:00:02.11
                             H$_ACTIVE_2470669947_32731


There is an important limitation with this script – it’s all very well seeing that gather_schema_stats has been running at a fairly uniform 4.5 to 5 hours, but it doesn’t tell us where the time went inside that call and it’s a call that could have been covering a lot of tables. There’s not a lot you can do about that from this set of data, but there’s another script I wrote a long time ago (before I wrote this one) that tried to do something about that question, and I’ll be publishing that tomorrow.


September 13, 2016

Securefile space

Filed under: fragmentation,Infrastructure,LOBs,Oracle,Troubleshooting — Jonathan Lewis @ 7:29 am BST Sep 13,2016

Here’s a little script I hacked together a couple of years ago from a clone of a script I’d been using for checking space usage in the older types of segments. Oracle Corp. eventually put together a routine to peer inside securefile LOBs:

rem
rem	Script:		dbms_space_use_sf.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Dec 2013
rem	Purpose:	
rem
rem	Last tested 
rem		12.1.0.2
rem		11.2.0.4
rem	Not tested
rem		11.1.0.7
rem	Not relevant
rem		10.2.0.5
rem		 9.2.0.8
rem		 8.1.7.4
rem
rem	Notes:
rem	See also dbms_space_use.sql
rem
rem	11g introduced securefile LOBs and two overloads of
rem	dbms_space.space_usage to report space used by their segments
rem
rem	Valid values for suoption are:
rem		SPACEUSAGE_EXACT (16): Computes space usage exhaustively
rem		SPACEUSAGE_FAST  (17): Retrieves values from in-memory statistics
rem
rem	This version allows for partitioned objects; you could delete
rem	the lines relating to parameter 4 and the partition name to
rem	eliminate the complaints about substitution variables.
rem


define m_seg_owner	= &1
define m_seg_name	= &2
define m_seg_type	= '&3'
define m_part_name	= &4

define m_segment_owner	= &m_seg_owner
define m_segment_name	= &m_seg_name
define m_segment_type	= '&m_seg_type'
define m_partition_name = &m_part_name

@@setenv

execute snap_enqueues.start_snap
execute snap_events.start_snap
execute snap_my_stats.start_snap

spool dbms_space_use_sf

prompt	============
prompt	Secure files
prompt	============

declare
	wrong_ssm	exception;
	pragma exception_init(wrong_ssm, -10614);

	m_segment_size_blocks	number(12,0);
	m_segment_size_bytes	number(12,0);
	m_used_blocks		number(12,0);
	m_used_bytes		number(12,0);
	m_expired_blocks	number(12,0);
	m_expired_bytes		number(12,0);
	m_unexpired_blocks	number(12,0);
	m_unexpired_bytes	number(12,0);

begin
	dbms_space.space_usage(
		upper('&m_segment_owner'),
		upper('&m_segment_name'),
		upper('&m_segment_type'),
		suoption		=> dbms_space.spaceusage_exact,	
--		suoption		=> dbms_space.spaceusage_fast,
		segment_size_blocks	=> m_segment_size_blocks,
		segment_size_bytes	=> m_segment_size_bytes,
		used_blocks		=> m_used_blocks,
		used_bytes		=> m_used_bytes,
		expired_blocks		=> m_expired_blocks,
		expired_bytes		=> m_expired_bytes,
		unexpired_blocks	=> m_unexpired_blocks,
		unexpired_bytes		=> m_unexpired_bytes,
		partition_name		=> upper('&m_partition_name')
	);

	dbms_output.new_line;
	dbms_output.put_line(' Segment Blocks:   ' || to_char(m_segment_size_blocks,'999,999,990') || ' Bytes: ' || to_char(m_segment_size_bytes,'999,999,999,990')); 
	dbms_output.put_line(' Used Blocks:      ' || to_char(m_used_blocks,'999,999,990')         || ' Bytes: ' || to_char(m_used_bytes,'999,999,999,990')); 
	dbms_output.put_line(' Expired Blocks:   ' || to_char(m_expired_blocks,'999,999,990')      || ' Bytes: ' || to_char(m_expired_bytes,'999,999,999,990')); 
	dbms_output.put_line(' Unexpired Blocks: ' || to_char(m_unexpired_blocks,'999,999,990')    || ' Bytes: ' || to_char(m_unexpired_bytes,'999,999,999,990')); 

exception
	when wrong_ssm then
		dbms_output.put_line('Segment not ASSM');
end;
/

prompt	===============
prompt	Generic details
prompt	===============

declare
	m_total_blocks			number;
	m_total_bytes			number;
	m_unused_blocks			number;
	m_unused_bytes			number;
	m_last_used_extent_file_id	number;
	m_last_used_extent_block_id	number;
	m_last_used_block		number;
begin
	dbms_space.unused_space(
		segment_owner		=> upper('&m_segment_owner'),
		segment_name		=> upper('&m_segment_name'),
		segment_type		=> upper('&m_segment_type'),
		total_blocks		=> m_total_blocks,
		total_bytes 		=> m_total_bytes, 
		unused_blocks		=> m_unused_blocks,  
		unused_bytes		=> m_unused_bytes,
		last_used_extent_file_id 	=> m_last_used_extent_file_id, 
		last_used_extent_block_id	=> m_last_used_extent_block_id,
		last_used_block		=> m_last_used_block,
		partition_name		=> upper('&m_partition_name')
	);

	dbms_output.put_line('Segment Total blocks: ' || to_char(m_total_blocks,'999,999,990'));
	dbms_output.put_line('Object Unused blocks: ' || to_char(m_unused_blocks,'999,999,990'));

end;
/

-- execute snap_my_stats.end_snap
-- execute snap_events.end_snap
-- execute snap_enqueues.end_snap

spool off

Sample of output (from a slightly older version of the code):


============
Secure files
============

 Segment Blocks:    168960 Bytes: 1384120320
 Used Blocks:       151165 Bytes: 1238343680
 Expired Blocks     17795 Bytes: 145776640
 Unexpired Blocks   0 Bytes: 0

PL/SQL procedure successfully completed.

===============
Generic details
===============
Segment Total blocks: 168960
Object Unused blocks: 0

PL/SQL procedure successfully completed.


September 11, 2016

Basicfile LOBs 6

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 7:48 pm BST Sep 11,2016

One of the nice things about declaring your (basicfile) LOBs as “enable storage in row” is that the block addresses of the first 12 chunks will be listed in the row and won’t use the LOB index, so if your LOBs are larger than 3960 bytes but otherwise rather small the LOB index will hold only the timestamp entries for deleted LOBs. This makes it just a little easier to pick out the information you need when things behave strangely, so in this installment of my series I’m going to talk about an example with storage enabled in row.

I’m going to demonstrate an issue that is causing a major problem. First I’m going to build a table with a LOB column with multiple (4) freepools – because that’s what you do to handle concurrency – then I’m going to start 4 sessions (carefully checking that I have one associated with each free pool) and do a few thousand inserts with commits from each session. The size of the LOB value I insert will be 20KB so it will be “indexed” in the row but stored out of the row taking 3 LOB blocks.

Once I’ve got the data in place I’m going to use three of the sessions to delete three quarters of the rows from the table, then use a call to the dbms_space package to show you that the segment contains virtually no free space. I’ve engineered the code so that it will take just three more rows in the table to fill the available free space and force Oracle either to allocate a new extent or to start reclaiming some of the deleted (reusable) LOB space – and I’m going to run that insert from the session that DIDN’T delete any rows.

I’ve been running these tests on 11.2.0.4, but get similar behaviour on 12c.

create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as 
    basicfile 
    text_lob(
            enable storage in row
            chunk 8k
            retention
            nocache
            logging
            freepools 4
            tablespace test_8k_assm
)
;

declare
        m_v1 varchar2(32767) := rpad('X',20000,'X');
begin
        for i in 0..0 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/

truncate table t1
;

You’ll notice I’ve used the retention keyword.  Before I built the LOB I set my undo_retention to 10 seconds so that the space from deleted LOBs should become available for reuse very quickly. The name of the tablespace I’ve used for the LOB is a clue that I’m using an 8KB block size and ASSM (the latter is a requirement of the retention option).
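For the record, the relevant setup and check look something like this (a sketch – the tablespace name is the one from my test):

alter system set undo_retention = 10;

select
        tablespace_name, block_size, segment_space_management
from
        dba_tablespaces
where
        tablespace_name = 'TEST_8K_ASSM'
;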

Here’s the code to check which freepool (0 to 3) a session will be associated with (this isn’t documented, but seems to be correct):


select mod(pid,4) from v$process where addr = (
        select paddr from v$session where sid = (
                select sid from v$mystat where rownum = 1
        )
)
;

So I can keep starting sessions and running that query until I’ve got a session covering each freepool. (The first time I tried this I had to start 7 sessions before I got all 4 freepools covered). Now I can run the following from all 4 sessions concurrently:

define m_loop_counter = 12027

lock table t1 in row share mode;
commit;

declare
        m_v1 varchar2(32767) := rpad('x',20000,'x');
begin
        for i in 1..&m_loop_counter loop
                insert into t1 values (s1.nextval, m_v1);
                commit;
        end loop;
end;
/

The purpose of the lock table command is to ensure that all 4 processes start running simultaneously. From a fifth session I execute a “lock table t1 in exclusive mode” before starting the other four sessions running, so they all queue on the exclusive lock; then I commit from the fifth session and away we go. The whole thing took about 30 seconds to run. The rather random-looking value 12,027 was a careful choice to ensure that the last extent in the segment had just a few blocks left – and I used my “dbms_space_use.sql” script to check this, getting the following output:

====
ASSM
====

Unformatted                   :        7 /       57,344
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :  144,324 / ############

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145536
Object Unused blocks: 0

I’ve got 7 “unformatted” blocks in the segment – though in fact these might be “formatted but free” from the perspective of the LOB code.

After going to sessions 0, 1, and 3 and deleting 12,000 rows from each in turn (and committing, leaving a total of 12,108 rows in the table) the report doesn’t change: I haven’t made any space free, I’ve simply flagged it in the LOB index as “reusable”. So now we go to session 2 and run the following code 3 times – with “set timing on”:


SQL> l
  1  declare
  2     m_v1 varchar2(32767) := rpad('x',20000,'x');
  3  begin
  4     for i in 1..1 loop
  5             insert into t1 values (s1.nextval, m_v1);
  6             commit;
  7     end loop;
  8* end;

The first run took 0.02 seconds – and the unformatted count dropped to 4

The second run took 0.01 seconds – and the unformatted count dropped to 1

The third run took 10.74 seconds, of which 9 seconds was CPU. The session generated 500,000 redo entries totalling 100MB of redo from 1 million db block changes after doing 8.4 million logical I/Os, issuing 108,000 enqueue (lock) requests and running 108,000 index range scans. The report of space usage ended up looking like this:


Unformatted                   :  108,125 /  885,760,000
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :   36,333 /  297,639,936

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145664
Object Unused blocks: 0

My session has cleared every single piece of re-usable space from the LOB and made it free (unformatted) before allocating space for its one LOB. (That’s going to hurt when the client has 2 million LOBs on the reusable list and isn’t running everything on SSDs – which is why I’m working on this problem).
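For reference, the workload figures I’ve been quoting come from comparing before/after snapshots of the session statistics; one low-tech way of getting the same sort of numbers (a sketch, not the snapshot packages my scripts call) is to run a query like the following just before and just after the insert and subtract:

select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in (
                'redo entries', 'redo size', 'db block changes',
                'session logical reads', 'enqueue requests'
        )
order by
        sn.name
;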

If you’re wondering why it takes so much redo and so many buffer visits to free 36,000 LOBs this (roughly) is what Oracle does to free up one reusable LOB of 3 blocks – which corresponds to a single index entry carrying three block ids:

  • Find the lowest index entry in the freepool, pin the index leaf block
  • Identify the last block in the list of 3
  • Lock the relevant L1 space management block for the segment and set the relevant “bit” to “unformatted”
  • Delete the index entry
  • Re-insert the index entry with one block id removed
  • Commit and unlock the L1 bitmap block
  • Repeat the delete/insert cycle for the 2nd block id
  • Repeat the cycle for the 3rd (or 1st, since we’re going backwards) block id – but don’t re-insert the index entry

Oracle reclaims one block (chunk) at a time. And that’s a bit of a clue to a possible workaround, because event 44951 gets mentioned a couple of times on MoS and on the internet as a workaround to a particular problem of HW enqueue waits for LOBs. MoS note 740075.1 tells us:

When using Automatic Segment Space Management (ASSM), and the fix for Bug 6376915 has been applied in your database (Included in 10.2.0.4 +) it is possible to adjust the number of chunks that are cleaned up when the chunk cleanup operation is required.

This can be enabled by setting event 44951 to a value between 1 and 1024 (default is 1). With the value between 1 and 1024 setting the number of chunks to be cleaned up each time a chunk reclaimation operation occurs. This can therefore reduce the number of requests for the High Watermark Enqueue.
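For reference, in an old-style parameter file the entry would look something like the first line below (a sketch – check the note for the syntax recommended for your version); the second line is the equivalent way of setting it without a bounce:

event="44951 trace name context forever, level 5"

alter system set events '44951 trace name context forever, level 5';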

Other notes explain that by default only one chunk is cleaned up at a time – which is exactly the behaviour I’m seeing. So what happens when I bounce the database with this event set at level 5 (an arbitrary choice, but larger than the LOBs I’ve been inserting) in the parameter file and repeat the experiment ? On the first attempt it made no difference, but then I changed the experiment slightly and started again. Initially I had done my first re-insert from the one session that hadn’t deleted any rows – which made it an extreme boundary condition; on the second attempt I deleted two rows from the session that had not yet deleted any data (and waited for the retention time to elapse) before doing the inserts from that session.

Deleting two rows would put 6 blocks (in two index entries) onto my re-usable list, so I was starting the inserts with 7 free blocks, 6 reusable blocks and the event set to level 5. Here’s what I saw as I inserted rows one by one.

  • Insert one row: “Unformatted” blocks went up to 9:  I had freed 5 of the reusable blocks then used 3 of them for my lob (7 + 5 – 3 = 9)
  • Insert one row: “Unformatted” blocks went down to 7: I had freed the last reusable block then used 3 blocks for my lob (9 + 1 – 3 = 7)
  • Insert one row: “Unformatted” blocks went down to 4
  • Insert one row: “Unformatted” blocks went down to 1
  • Insert one row: Oracle cleared all the reusable space (11 seconds, 500MB redo), then added an extent (!) to the segment and used 2 of its blocks for part of the new LOB.

So the event isn’t really connected with my problem – though it adds some efficiency to the processing – and my  “boundary condition” is one that’s likely to occur fairly frequently if you’ve got a basicfile LOB defined with multiple freepools. Fortunately it’s probably going to require two pre-conditions before it’s a big problem: first that you’re handling a large number of LOBs and second that your pattern of inserting and deleting is not symmetric – it’s when you use a large number of concurrent sessions for small batches of inserts but a single session for large bulk deletes that all hell can break loose shortly after a delete.

tl;dr

As with many other features of Oracle, skew plays a part in making things break. If you’re doing lots of inserts and deletes of basicfile lobs make sure the mechanisms you use for inserting and deleting look similar: in particular similar numbers of processes to do similar amounts of work for both operations.


P.S. It gets worse.

P.P.S. Don’t even start to think that you can work around this by using securefiles.

P.P.P.S. I got a hint from one test that if a reusable LOB is exactly the same size as the LOB being inserted then Oracle very cleverly takes the existing index entry and rewrites it to be the LOB index entry, rather than freeing (and then potentially re-using) the space it identifies.


Basicfile LOBs 5

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 6:49 pm BST Sep 11,2016

At the end of the last installment we had seen a test case that caused Oracle to add a couple of redundant new extents to a LOB segment after one process deleted 3,000 LOBs and another four concurrent processes inserted 750 LOBs each a few minutes later (after the undo retention period had elapsed). To add to the confusion the LOBINDEX seemed to show that all the “reusable” chunks had been removed from the index, which suggests that they should have been re-used. Our LOB segment started at 8,192 blocks, is currently at 8,576 blocks and is only using 8,000 of them.

How will things look if I now connect a new session (which might be associated with a different freepool), delete the oldest 3,000 LOBs, wait a little while, then get my original four sessions to do their concurrent inserts again ? And what will things look like after I’ve repeated this cycle several times ?

I had to drop the tables from my original test since writing the previous article, so the following results start from recreating the whole test from scratch and won’t align perfectly with the previous sets of results. Here’s what the index treedump looked like after going through the serial delete / concurrent insert cycle 12 times:

----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 71, level: 1)
   leaf: 0x1800223 25166371 (-1: nrow: 0 rrow: 0)
   leaf: 0x1800227 25166375 (0: nrow: 0 rrow: 0)
   leaf: 0x1800236 25166390 (1: nrow: 0 rrow: 0)
   leaf: 0x180023d 25166397 (2: nrow: 63 rrow: 63)
   leaf: 0x1800206 25166342 (3: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (4: nrow: 81 rrow: 81)
   leaf: 0x1800229 25166377 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 81)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (11: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (12: nrow: 126 rrow: 126)

   leaf: 0x1800266 25166438 (13: nrow: 0 rrow: 0)
   leaf: 0x180025e 25166430 (14: nrow: 39 rrow: 39)
   leaf: 0x1800262 25166434 (15: nrow: 81 rrow: 81)
   leaf: 0x1800243 25166403 (16: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (17: nrow: 76 rrow: 76)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x180024a 25166410 (24: nrow: 118 rrow: 118)

   leaf: 0x1800263 25166435 (25: nrow: 0 rrow: 0)
   leaf: 0x180024c 25166412 (26: nrow: 0 rrow: 0)
   leaf: 0x1800254 25166420 (27: nrow: 0 rrow: 0)
   leaf: 0x1800264 25166436 (28: nrow: 1 rrow: 0)
   leaf: 0x1800274 25166452 (29: nrow: 2 rrow: 0)
   leaf: 0x180027c 25166460 (30: nrow: 2 rrow: 0)
   leaf: 0x180025d 25166429 (31: nrow: 2 rrow: 0)
   leaf: 0x1800241 25166401 (32: nrow: 2 rrow: 0)
   leaf: 0x1800245 25166405 (33: nrow: 2 rrow: 0)
   leaf: 0x1800265 25166437 (34: nrow: 1 rrow: 0)
   leaf: 0x1800251 25166417 (35: nrow: 3 rrow: 0)
   leaf: 0x1800249 25166409 (36: nrow: 4 rrow: 0)
   leaf: 0x1800242 25166402 (37: nrow: 1 rrow: 0)
   leaf: 0x1800255 25166421 (38: nrow: 2 rrow: 0)
   leaf: 0x1800259 25166425 (39: nrow: 3 rrow: 0)
   leaf: 0x1800246 25166406 (40: nrow: 1 rrow: 0)

   leaf: 0x1800214 25166356 (41: nrow: 38 rrow: 0)
   leaf: 0x1800218 25166360 (42: nrow: 81 rrow: 0)
   leaf: 0x180021c 25166364 (43: nrow: 81 rrow: 0)
   leaf: 0x1800220 25166368 (44: nrow: 0 rrow: 0)
   leaf: 0x180022d 25166381 (45: nrow: 26 rrow: 26)
   leaf: 0x1800231 25166385 (46: nrow: 81 rrow: 81)
   leaf: 0x1800219 25166361 (47: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (48: nrow: 81 rrow: 81)
   leaf: 0x1800239 25166393 (49: nrow: 81 rrow: 81)
   leaf: 0x180022c 25166380 (50: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (51: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (52: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (53: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (54: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (55: nrow: 81 rrow: 81)
   leaf: 0x180023b 25166395 (56: nrow: 79 rrow: 79)
   leaf: 0x180023f 25166399 (57: nrow: 81 rrow: 81)
   leaf: 0x1800208 25166344 (58: nrow: 81 rrow: 81)
   leaf: 0x180020c 25166348 (59: nrow: 81 rrow: 81)
   leaf: 0x1800210 25166352 (60: nrow: 120 rrow: 120)

   leaf: 0x180021d 25166365 (61: nrow: 0 rrow: 0)

   leaf: 0x1800248 25166408 (62: nrow: 21 rrow: 21)
   leaf: 0x1800268 25166440 (63: nrow: 81 rrow: 81)
   leaf: 0x180026c 25166444 (64: nrow: 152 rrow: 152)
   leaf: 0x180026b 25166443 (65: nrow: 152 rrow: 152)
   leaf: 0x180026f 25166447 (66: nrow: 152 rrow: 152)
   leaf: 0x1800273 25166451 (67: nrow: 152 rrow: 152)
   leaf: 0x1800277 25166455 (68: nrow: 152 rrow: 152)
   leaf: 0x180027b 25166459 (69: nrow: 66 rrow: 66)
----- end tree dump

As usual I’ve split the treedump into the sections that reflect the freepools, each of which could consist of two parts: the LOBs (key values starting with even numbers) and the “reusable chunks” (key values starting with odd numbers). The dump suggests that things have worked well: as you can see it’s grown a few blocks after my 12 cycles but there are only 6 sections (not the full 8 that might be there), and only a few leaf blocks showing “empty” (rrow: 0). As “reusable” sections have appeared the index has grown a little, then the reusable entries have been taken off the index and the index has shrunk a bit; you can even see that freepool 3 (the highest numbered one) is still showing a pattern of 152 LOBs indexed per block – this is despite the fact that at one point a reusable section for freepool 3 (00 07) appeared above this section and then disappeared as those reusable chunks were reclaimed.

All in all the index seems to be behaving extremely well, with only a little growth and (probably temporarily) a couple of little glitches of empty leaf blocks.

Here’s the dump of the (slightly edited) “col 0” values to confirm where the freepool breaks were –

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 2c cc
 1:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 31 61
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 33 61
 3:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 36
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 37 1e
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 38 37
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 39 1e
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3a 37
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3b 50
 9:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 3c
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3d 4b
11:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 3e
12:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3e e7

13:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 25 9b
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 32 a0
15:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 34 1d
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 36 c6
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 39 3d
18:     col 0; len  9; ( 9):  00 02 00 01 00 00 09 df 3d
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 3f 52
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 40 cf
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 20
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 71
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 c2
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 42 13

25:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
26:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
27:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
28:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
29:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
30:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
31:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
32:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
33:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
34:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
35:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
36:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
37:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
38:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
39:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
40:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00

41:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 26 52
42:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2a 27
43:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2a dc
44:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2b 2d
45:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 31 34
46:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 33 15
47:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 34 92
48:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 34 e3
49:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 34
50:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 85
51:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 d6
52:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 36 27
53:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 38 6c
54:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 38 ef
55:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3a d0
56:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3c 4d
57:     col 0; len  9; ( 9):  00 04 00 01 00 00 09 df 3d
58:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3e 4b
59:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3f 96
60:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 40 7d

61:     col 0; len 10; (10):  00 05 57 bc b9 db 00 00 00 00

62:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 32 5b
63:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 33 a6
64:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 36 4f
65:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 38 13
66:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3a 09
67:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3b cd
68:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3d 91
69:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3f 23

As you can see at leaf block 61 we didn’t quite empty the reusable list from freepool 2 (00 05 = 2 * 2 + 1), and leaf blocks 25 to 40 tell us that freepool 1 (00 03 = 2 * 1 + 1) was the freepool used on the last big delete. Despite the odd little glitches it looks as if this strategy of “deleted LOBs go to my process’ freepool” does a good job of reusing index space.

But there IS a problem. Here’s the output from a script I wrote using the dbms_space package to show how space in the LOB segment has been used:


Unformatted                   :    4,508 /   36,929,536
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :    8,000 /   65,536,000

Segment Total blocks: 12672
Object Unused blocks: 0

The LOB segment has grown from an initial 8,192 blocks with a few unformatted blocks and 8,000 used blocks (2 blocks per LOB, 4,000 LOBs) to 12,672 blocks with 4,508 blocks unformatted. (The difference between Full + Unformatted and Segment Total blocks is the set of bitmap space management blocks for the segment). After only 12 cycles we have “leaked” an overhead of 50% of our real data space – maybe this helps to explain why the client that started me down this set of blogs has seen Oracle allocate 700GB to hold just 200GB of LOBs.

The tablespace is declared as locally managed with 1MB uniform extents and automatic segment space management. By writing a simple script I can get Oracle to write a script to dump the first block of each extent – and they will all be Level 1 bitmap space management blocks. Running grep against the trace file I can pick out the lines that tell me how many data blocks are mapped by the bitmap and how many of them have been formatted or not. This is the result (I have 99 extents in the segment – 99 * 128 = 12,672):

   unformatted: 0       total: 64        first useful block: 4
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 24      total: 64        first useful block: 2
   unformatted: 52      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 26      total: 64        first useful block: 2
   unformatted: 42      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 42      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 26      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 127     total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 108     total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 105     total: 128       first useful block: 1
   unformatted: 96      total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 38      total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 65      total: 128       first useful block: 1
   unformatted: 98      total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1

You’ll notice that for the first 63 extents Oracle says there are 64 blocks mapped by the bitmap with the first useful block at block 2, thereafter it says there are 128 blocks and the first useful block is block 1 (Oracle is counting from zero). While Oracle thinks the segment is “quite small” it allocates two level 1 bitmap blocks per 1MB extent and I’ve only dumped the first block from each extent; but when the segment reaches 64MB Oracle decides that it’s getting pretty big and there’s no point in wasting space so changes to using a single level 1 bitmap block per 1MB extent. It’s just one of those tiny details you discover when you happen to look a little closely at how things work. (On a much larger test with 8MB uniform extents Oracle got to the point where it was using one L1 bitmap block for the whole 8MB.)

There’s a fascinating pattern in the later extents of 3 full extents followed by 3 empty extents – my first guess had been that Oracle was allocating new extents but not using them, but clearly that’s not right: it’s removing “reusable chunks” from the index and then not re-using them, but using the new extents instead (some of the time). Something is seriously wrong with the way Oracle is handling the “reusable chunks” part of the index. With a little luck it’s some nasty side effect of the “one process delete / multiple process insert” strategy we have adopted, so: (a) we need to repeat the entire experiment with a concurrent delete mechanism and (b) we need to think about how we might re-engineer a REALLY BIG system that has followed this unfortunate strategy for a long time. Of course if (a) turns out to be a disaster as well we don’t really need to think too hard about (b) until we have discovered a good way of dealing with our rolling pattern of inserts and deletes.

Some (minimum effort, we hope) ideas we will have to look at for (b) – with a rough syntax sketch for the first few after the list:

  • Oracle has an option to change the freepools count on a LOB segment – do we need to use it, how much work would it entail, would it require downtime
  • Oracle has an option to “rebuild” the freepools on a LOB segment
  • We can always try the “shrink space compact” option on the LOB
  • Should we just rebuild (move) the LOB segment – and use a larger extent size while we’re at it
  • Should we recreate the table and change the LOB to Securefiles as we do so – and do all the testing all over again
  • If we’re deleting old data on such a regular pattern should we try to bypass the deletes by partitioning the table in some clever way
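
To put some rough syntax against the first few of those options – the table, column and tablespace names here are purely illustrative, and every one of these would need testing for locking, duration and redo before being let loose on a big production LOB:

alter table big_tab modify lob (doc) (freepools (8));

alter table big_tab modify lob (doc) (rebuild freepools);

alter table big_tab modify lob (doc) (shrink space compact);

-- move the LOB segment to a (pre-created) tablespace with a larger extent size
alter table big_tab move lob (doc) store as (tablespace new_lob_ts);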

TO BE CONTINUED.

 

September 11, 2016

Space Usage

Filed under: fragmentation,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 6:09 pm BST Sep 11,2016

Here’s a simple script that I’ve used for many years to check space usage inside segments.  The comment about freelist groups may be out of date  – I’ve not had to worry about that for a very long time. There is a separate script for securefile lobs.


rem
rem	Script:		dbms_space_use.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Nov 2002
rem	Purpose:	
rem
rem	Last tested 
rem		12.1.0.2
rem		11.2.0.4
rem	Not tested
rem		11.1.0.7
rem		10.2.0.5
rem		 9.2.0.8
rem	Not relevant
rem		 8.1.7.4
rem
rem	Notes:
rem	For accuracy in free space you (once) needed to set the
rem	scan limit; and for those rare cases where you
rem	had defined multiple freelist groups you still have to
rem	work through each free list group in turn
rem
rem	For the ASSM calls:
rem		FS1	=> 0% - 25% free space
rem		FS2	=> 25% - 50% free space
rem		FS3	=> 50% - 75% free space
rem		FS4	=> 75% - 100% free space
rem		Bytes = blocks * block size
rem
rem	Expected errors:
rem		ORA-10614: Operation not allowed on this segment
rem			(MSSM segment, ASSM call)
rem		ORA-10618: Operation not allowed on this segment
rem			(ASSM segment, MSSM call)
rem		ORA-03200: the segment type specification is invalid
rem			(e.g. for LOBINDEX or LOBSEGMENT)
rem			11g - "LOB" is legal for LOB segments
rem			    - use "INDEX" for the LOBINDEX
rem
rem	For indexes
rem		Blocks are FULL or FS2 (re-usable)
rem
rem	Special case: LOB segments.
rem	The number of blocks reported by FS1 etc. is actually the
rem	number of CHUNKS in use (and they're full or empty). So 
rem	if your CHUNK size is not the same as your block size the
rem	total "blocks" used doesn't match the number of blocks 
rem	below the HWM.
rem
rem	The package dbms_space is created by dbmsspu.sql
rem	and the body is in prvtspcu.plb
rem
rem	11.2 overloads dbms_space.space_usage for securefile lobs
rem	See dbms_space_use_sf.sql
rem
rem	When supplying details about partitions the segment type
rem	can consist of two words (e.g. LOB PARTITION), these 
rem	must be surrounded by quotes to survive the script.
rem
rem	You might want to set up two versions of this code with
rem	all references to partitions removed from one of them
rem	or you have to keep pressing return to bypass the 
rem	requests for substitution variables
rem

define m_seg_owner	= &1
define m_seg_name	= &2
define m_seg_type	= '&3'
define m_part_name	= &4

define m_segment_owner	= &m_seg_owner
define m_segment_name	= &m_seg_name
define m_segment_type	= '&m_seg_type'
define m_partition_name = &m_part_name

@@setenv

spool dbms_space_use

prompt	===================
prompt	Freelist management
prompt	===================

declare
	wrong_ssm	exception;
	pragma exception_init(wrong_ssm, -10618);

	m_free	number(10);
begin
	dbms_space.free_blocks(
		segment_owner		=> upper('&m_segment_owner'),
		segment_name		=> upper('&m_segment_name'),
		segment_type		=> upper('&m_segment_type'),
		partition_name		=> upper('&m_partition_name'),
--		scan_limit		=> 50,
		freelist_group_id	=> 0,
		free_blks		=> m_free
	);
	dbms_output.put_line('Free blocks below HWM: ' || m_free);
exception
	when wrong_ssm then
		dbms_output.put_line('Segment not freelist managed');
end;
/


prompt	====
prompt	ASSM
prompt	====

declare
	wrong_ssm	exception;
	pragma exception_init(wrong_ssm, -10614);

	m_unformatted_blocks	number;
	m_unformatted_bytes	number;
	m_fs1_blocks		number;
	m_fs1_bytes		number;
	m_fs2_blocks		number;  
	m_fs2_bytes		number;

	m_fs3_blocks		number;
	m_fs3_bytes		number;
	m_fs4_blocks		number; 
	m_fs4_bytes		number;
	m_full_blocks		number;
	m_full_bytes		number;

begin
	dbms_space.SPACE_USAGE(
		segment_owner		=> upper('&m_segment_owner'),
		segment_name		=> upper('&m_segment_name'),
		segment_type		=> upper('&m_segment_type'),
		unformatted_blocks	=> m_unformatted_blocks,
		unformatted_bytes	=> m_unformatted_bytes, 
		fs1_blocks		=> m_fs1_blocks , 
		fs1_bytes		=> m_fs1_bytes,
		fs2_blocks		=> m_fs2_blocks,  
		fs2_bytes		=> m_fs2_bytes,
		fs3_blocks		=> m_fs3_blocks,  
		fs3_bytes		=> m_fs3_bytes,
		fs4_blocks		=> m_fs4_blocks,  
		fs4_bytes		=> m_fs4_bytes,
		full_blocks		=> m_full_blocks, 
		full_bytes		=> m_full_bytes,
		partition_name		=> upper('&m_partition_name')
	);


	dbms_output.new_line;
	dbms_output.put_line('Unformatted                   : ' || to_char(m_unformatted_blocks,'999,999,990') || ' / ' || to_char(m_unformatted_bytes,'999,999,999,990'));
	dbms_output.put_line('Freespace 1 (  0 -  25% free) : ' || to_char(m_fs1_blocks,'999,999,990') || ' / ' || to_char(m_fs1_bytes,'999,999,999,990'));
	dbms_output.put_line('Freespace 2 ( 25 -  50% free) : ' || to_char(m_fs2_blocks,'999,999,990') || ' / ' || to_char(m_fs2_bytes,'999,999,999,990'));
	dbms_output.put_line('Freespace 3 ( 50 -  75% free) : ' || to_char(m_fs3_blocks,'999,999,990') || ' / ' || to_char(m_fs3_bytes,'999,999,999,990'));
	dbms_output.put_line('Freespace 4 ( 75 - 100% free) : ' || to_char(m_fs4_blocks,'999,999,990') || ' / ' || to_char(m_fs4_bytes,'999,999,999,990'));
	dbms_output.put_line('Full                          : ' || to_char(m_full_blocks,'999,999,990') || ' / ' || to_char(m_full_bytes,'999,999,999,990'));

exception
	when wrong_ssm then
		dbms_output.put_line('Segment not ASSM');
end;
/


prompt	=======
prompt	Generic
prompt	=======

declare
	m_total_blocks			number;
	m_total_bytes			number;
	m_unused_blocks			number;
	m_unused_bytes			number;
	m_last_used_extent_file_id	number;
	m_last_used_extent_block_id	number;
	m_last_used_block		number;
begin
	dbms_space.unused_space(
		segment_owner		=> upper('&m_segment_owner'),
		segment_name		=> upper('&m_segment_name'),
		segment_type		=> upper('&m_segment_type'),
		total_blocks		=> m_total_blocks,
		total_bytes 		=> m_total_bytes, 
		unused_blocks		=> m_unused_blocks,  
		unused_bytes		=> m_unused_bytes,
		last_used_extent_file_id 	=> m_last_used_extent_file_id, 
		last_used_extent_block_id	=> m_last_used_extent_block_id,
		last_used_block		=> m_last_used_block,
		partition_name		=> upper('&m_partition_name')
	);

	dbms_output.put_line('Segment Total blocks: '  || to_char(m_total_blocks,'999,999,990'));
	dbms_output.put_line('Object Unused blocks: '  || to_char(m_unused_blocks,'999,999,990'));

end;
/

undefine 1
undefine 2
undefine 3
undefine 4

undefine m_seg_owner
undefine m_seg_name
undefine m_seg_type
undefine m_part_name

undefine m_segment_owner
undefine m_segment_name
undefine m_segment_type
undefine m_partition_name

spool off



Here’s a sample of output (from a segment using ASSM):


===================
Freelist management
===================
Segment not freelist managed

PL/SQL procedure successfully completed.

====
ASSM
====

Unformatted                   :  132,385 / ############
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :   12,327 /  100,982,784

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145920
Object Unused blocks: 0

PL/SQL procedure successfully completed.

(I’ve increased the length of the byte-count output since I produced that report ;)
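
For reference, typical calls from SQL*Plus would look something like the following – the owner, segment and partition names are made up, the two-word segment type has to be quoted (as the header notes say), and for a non-partitioned segment you can pass an empty string (or just keep pressing return at the prompts) for the final parameter. Expect the “expected errors” listed in the header from whichever calls don’t apply to the segment type:

SQL> @dbms_space_use test_user t1 table ""
SQL> @dbms_space_use test_user t1_lob lob ""
SQL> @dbms_space_use test_user pt1 "table partition" p_201609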

August 31, 2016

Parallel_index hint

Filed under: Bugs,CBO,Oracle,Parallel Execution — Jonathan Lewis @ 1:28 pm BST Aug 31,2016

Prompted by a recent OTN posting I’ve dug out from my library the following demonstration of an anomaly with the parallel_index() hint. This note is a warning about how little we understand hints and what they’re supposed to mean, and how we can be caught out by an upgrade. We’ll start with a data set which, to match a comment made in the original posting rather than being a necessity for the demonstration, has an index that I’ve manipulated to be larger than the underlying table:


rem
rem     Script:         parallel_index_hint_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          December 2005
rem

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum,1e4)         modded,
        lpad(rownum,10,'0')     v1,
        lpad('x',30,'x')        padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

create index t1_i1 on t1(modded) pctfree 75;
alter table t1 modify modded not null;
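
A quick sanity check of the resulting sizes – just a sketch querying the dictionary stats (the index stats are collected automatically as the index is created):

select  table_name, blocks
from    user_tables
where   table_name = 'T1';

select  index_name, leaf_blocks
from    user_indexes
where   index_name = 'T1_I1';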

Your figures may vary slightly if you try to recreate this model, but according to my stats blocks=7876 for the table and leaf_blocks=8054 for the index. Now I’m going to generate the execution plans for a simple query – count(modded) from the table – with a varying selection of hints.


select /*+ index_ffs(t1 t1_i1) */ count(modded) from t1;

select /*+ parallel_index(t1 t1_i1 20) */ count(modded) from t1;

select /*+ index_ffs(t1 t1_i1) parallel_index(t1 t1_i1 20) */ count(modded) from t1;

select /*+ parallel_index(t1 t1_i1 20) parallel(t1 10) */ count(modded) from t1;

Pause for thought

  • which of the four statements will have a plan that uses an index fast full scan ?
  • which of the four statements will have a plan that indicates parallel execution ?
  • which of the four statements will indicate a parallel index fast full scan ?
  • why are two of the plans going to be identical but with different costs ?

The most interesting bit of this note is in the last question because it’s also the answer to a more subtle “why didn’t Oracle do what I thought it should” question. Here are the four plans I got from an instance of 11.2.0.4 (with a little labelling to remind us about the hinting):


==================================================
index_ffs() on its own - get index fast full scan
==================================================

-----------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Cost (%CPU)| Time     |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |  1043   (4)| 00:00:06 |
|   1 |  SORT AGGREGATE       |       |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|  1043   (4)| 00:00:06 |
-----------------------------------------------------------------------

=====================================================
parallel_index() on its own
Get serial tablescan which is cheaper than serial FFS
=====================================================

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |  1027   (4)| 00:00:06 |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| T1   |  1000K|  1027   (4)| 00:00:06 |
-------------------------------------------------------------------

========================================================
parallel_index() with index_ffs()
Get parallel fast full scan - at same cost as SERIAL FFS
========================================================

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |     1 |  1043   (4)| 00:00:06 |        |      |            |
|   1 |  SORT AGGREGATE           |          |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |          |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000 |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR    |          |  1000K|  1043   (4)| 00:00:06 |  Q1,00 | PCWC |            |
|   6 |       INDEX FAST FULL SCAN| T1_I1    |  1000K|  1043   (4)| 00:00:06 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------

===============================================
parallel_index() with parallel()
Get parallel fast full scan - costed correctly.
(Not costed at the degree given for table).
===============================================

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |     1 |    58   (4)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE           |          |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |          |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000 |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR    |          |  1000K|    58   (4)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       INDEX FAST FULL SCAN| T1_I1    |  1000K|    58   (4)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------

First plan – we supplied the index_ffs() hint, it’s valid: so Oracle did an index fast full scan. The serial cost was 1,043.

Second plan – we supplied the parallel_index() hint: but the optimizer apparently didn’t consider the cost of a parallel index fast full scan (which ought to have cost something like (1043/20)/0.9 = 58); instead it chose the serial tablescan at a cost of 1027 (cheaper than a serial index fast full scan because of the way I had defined a very large index).

Third plan – we supplied the parallel_index() hint with an explicit index_ffs() hint: both were legal so the optimizer obeyed the hints and produced a parallel index fast full scan (which is what we might have hoped would happen automatically for the second query). But the cost of the query is 1,043 – the cost of the serial index fast full scan.

Final plan – we didn’t hint an index_ffs(), we hinted parallel() and parallel_index(): “strangely” Oracle has selected the parallel index fast full scan – and this time the cost is 58 (which, as I pointed out above, is (1043/20)/0.9, the value it’s “supposed” to be). It looks as if Oracle uses the parallel scaling factor on the index only if the table is ALSO hinted (or defined) to run parallel.

Note, by the way, that I’ve hinted the table parallel 10 so that the cost of the parallel tablescan (ca. (1027/10)/0.9 = 114) will be larger than the cost of the parallel index fast full scan. Normally you’re likely to have the same degree hinted for table and index, and the index is likely to be smaller than the table which means that if you see this anomaly in production it’s more likely to be a case of Oracle using a parallel tablescan when you’re expecting a parallel index fast full scan. Oracle will have compared the serial cost of an index fast full scan against the parallel cost of the table.

Footnote:

When I ran the script on 12c there was one important change. The cost of the “correctly costed” index fast full scan came out at 103 rather than 58. Oracle has used the degree from the parallel() hint for the tablescan to calculate the cost of the parallel index fast full scan. If you are unlucky enough to have some code that has conflicting degrees hinted (or defined) across tables and indexes then you may see some plans change because parallel index fast full scans suddenly change their cost.

Addendum (prompted by Randolf Geist’s comment): The plan in 12c also had a Note: “Degree of Parallelism is 4 because of table property”. This was a little odd because the degree of the table was 1 and the hint said 10, but the arithmetic of the tablescan had (as indicated) used a DOP of 4 – despite reporting Degree=0 in the 10053 trace file. The only way I could get the number 4 was to look at the CPU parameters (cpu_count * parallel_threads_per_cpu = 2 * 2 = 4) – bearing in mind the critical parallel parameters:

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     2
parallel_threads_per_cpu             integer     2

parallel_adaptive_multi_user         boolean     TRUE
parallel_automatic_tuning            boolean     FALSE
parallel_degree_limit                string      CPU
parallel_degree_policy               string      MANUAL
parallel_io_cap_enabled              boolean     FALSE

The anomaly of “degree 10” appearing in the costing for the index fast full scan was echoed in the 10053 trace: the file reported “Degree: 20”, then showed the cost being divided by 10.

At run-time, though, the query DID run parallel 20 when hinted with both the parallel() and parallel_index() hints, and ran parallel 10 (while costing and repeating the note about parallel 4) when modified to have only the /*+ parallel(t1 10) */ hint.
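
If you want to check the run-time degree for yourself rather than trusting the Notes and the costing, one option (a sketch – it has to be run in the same session, immediately after the parallel query) is to look at v$pq_tqstat, which reports one row per parallel execution server per table queue; counting the “Producer” rows for the single table queue in this plan tells you how many PX servers actually did the scan:

select
        dfo_number, tq_id, server_type,
        count(*)        px_servers,
        sum(num_rows)   total_rows
from
        v$pq_tqstat
group by
        dfo_number, tq_id, server_type
order by
        dfo_number, tq_id, server_type
;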

August 24, 2016

Basicfile LOBs 4

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 7:02 pm BST Aug 24,2016

At the end of the previous installment we saw that a single big batch delete would (apparently) attach all the “reusable” chunks into a single freepool, and asked the questions:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

(Okay, I’ll admit it, the third question is a clue about the answer to the second question.)

I find that this process of asking “what’s good, what’s bad, what could possibly go wrong” is an excellent way to prompt thoughts about why Oracle Corp. might have chosen a particular strategy and what that means in terms of the best (or expected) use of the feature and worst threats from misuse of the feature. So let’s see what thoughts we can come up with.

  • Good idea: The only alternative to using a single freepool when you make chunks reusable is to spread the chunks uniformly across all the freepools – either putting the chunks onto the same free pool that the LOB was previously attached to or doing some sort of round-robin. If you go for either of these fair-share strategies you increase the amount of contention on LOB deletes if many users are deleting at the same time – which sounds like something you might want to avoid, but LOBs are supposed to be fairly static (somewhere on MoS there’s a note that says the expected behaviour is pretty much: “we thought you’d write once, read many, and not update”) so surely a small amount of contention shouldn’t be a big problem.
  • Bad idea: As mentioned in a previous post, it looks like the freepool picked by a process is dependent on the process id – so if you happen to have just a couple of processes doing large deletes they might, coincidentally, pick the same freepool and end up constantly contending with each other rather than drifting in and out of collisions. If, as often happens with archive-like processes, you use one or two processes to delete a large fraction of the data you end up with one or two freepools holding lots of reusable space and all the other freepools holding no freespace – which brings us to the third question.
  • What happens next: Let’s say 3% of your LOB (one day out of a month) is currently “reusable chunks” and the chunks are all attached to the same freepool; your process connects to insert some new LOBs and its process id identifies the wrong freepool. There are no free blocks below the highwater mark and the retention limit is long gone. Does your process (a) add an extent to create some more free space (this is the type of thing that used to happen with manual segment space management, freelist groups and freelists for non-LOB tables and indexes) or (b) start stealing from another freepool that has reusable chunks ? In either case what’s going to happen in the longer term ?
  • What happens even later: Imagine you have 28 days of data and use a single process to delete data on the 29th day. For reasons of concurrency you have been running with freepools 20. If option (a) applies then (assuming everything works perfectly) at steady state you will end up with roughly 20 days worth of reusable chunks spread across your 20 freepools before the system stabilises and stops adding unnecessary extents; if option (b) applies then (assuming everything works perfectly) every night you put a load of reusable chunks on one freepool and all through the day your 20 processes are fighting (at the oldest end of the index) to reuse those chunks. I said in an earlier installment that multiple freepools got rid of “the two hot spots” – this single thread deletion strategy has just brought one of them back.

So what really happens ? By the end of the last installment I had deleted the oldest 3,000 LOBs and found them attached as reusable chunks in freepool 2 with several consecutive “empty” (nrow=81, rrow=0) leaf blocks at the low end of all the other pools.  After running my 4 concurrent processes to insert 750 rows each (i.e. insert the replacements for the 3,000 rows I’ve deleted) this is what the index treedump looks like (with a little editing to show the main breaks between freepools):


----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 60, level: 1)
   leaf: 0x180020e 25166350 (-1: nrow: 22 rrow: 22)
   leaf: 0x1800212 25166354 (0: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (1: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (2: nrow: 74 rrow: 74)
   leaf: 0x1800239 25166393 (3: nrow: 81 rrow: 81)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (7: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (8: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (9: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (10: nrow: 78 rrow: 78)
   leaf: 0x1800232 25166386 (11: nrow: 151 rrow: 151)
---
   leaf: 0x1800226 25166374 (12: nrow: 0 rrow: 0)
   leaf: 0x180020f 25166351 (13: nrow: 64 rrow: 64)
   leaf: 0x1800213 25166355 (14: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (15: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (16: nrow: 81 rrow: 81)
   leaf: 0x1800265 25166437 (17: nrow: 81 rrow: 81)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x1800242 25166402 (24: nrow: 122 rrow: 122)
---
   leaf: 0x1800229 25166377 (25: nrow: 0 rrow: 0)
   leaf: 0x1800214 25166356 (26: nrow: 36 rrow: 36)
   leaf: 0x1800230 25166384 (27: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (28: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (29: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (30: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (31: nrow: 75 rrow: 75)
   leaf: 0x1800231 25166385 (32: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (33: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (34: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (35: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (36: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (37: nrow: 134 rrow: 134)
---
   leaf: 0x1800215 25166357 (38: nrow: 1 rrow: 0)
   leaf: 0x180026e 25166446 (39: nrow: 4 rrow: 0)
   leaf: 0x180021b 25166363 (40: nrow: 1 rrow: 0)
   leaf: 0x180024b 25166411 (41: nrow: 2 rrow: 0)
   leaf: 0x1800276 25166454 (42: nrow: 2 rrow: 0)
   leaf: 0x180024f 25166415 (43: nrow: 0 rrow: 0)
   leaf: 0x180027e 25166462 (44: nrow: 4 rrow: 0)
   leaf: 0x1800221 25166369 (45: nrow: 0 rrow: 0)
   leaf: 0x180027a 25166458 (46: nrow: 0 rrow: 0)
---
   leaf: 0x1800218 25166360 (47: nrow: 0 rrow: 0)
   leaf: 0x180021c 25166364 (48: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (49: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (50: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (51: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (52: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (53: nrow: 152 rrow: 152)
   leaf: 0x1800253 25166419 (54: nrow: 152 rrow: 152)
   leaf: 0x1800257 25166423 (55: nrow: 152 rrow: 152)
   leaf: 0x180025b 25166427 (56: nrow: 152 rrow: 152)
   leaf: 0x180025f 25166431 (57: nrow: 152 rrow: 152)
   leaf: 0x1800263 25166435 (58: nrow: 1 rrow: 1)
----- end tree dump

Highlights

The number of leaf blocks has dropped from 72 to 60 – I didn’t think that this could happen without an index coalesce or rebuild, but maybe it’s a special feature of LOBINDEXes or maybe it’s a new feature of B-trees in general that I hadn’t noticed. Some of the “known empty” leaf blocks seem to have been taken out of the structure.

We still see the half full / full split between the leaf blocks for the first 3 freepools when compared to the top freepool.

There are still some empty leaf blocks (rrow = 0), but apart from the top freepool no more than one per freepool for the other sections that are indexing LOBs.

The section of index that is the freepool 2 section for “reusable” chunks shows an interesting anomaly. There are some leaf blocks that are now empty (rrow=0) but were only holding a few index entries (nrow = 1 to 4, rather than the 75 to 140 entries that we saw in the previous installment) at the moment they were last updated; this suggests a certain level of contention with problems of read-consistency, cleanout, and locking between processes trying to reclaim reusable blocks.

It’s just slightly surprising that the top freepool shows several empty leaf blocks – is this just a temporary coincidence, or a boundary case that means the blocks will never be cleaned and re-used ? If it’s a fluke will a similar fluke also reappear (eventually) on the other freepools ? Is it something to do with the fact that freepool 2 happened to be the freepool that got the first lot of reusable chunks ? Clearly we need to run a few more cycles of deletes and inserts to see what happens.

We have one important conclusion to make but before we make it let’s look at the partial key “col 0” values in the row directory of the root block just to confirm that the breaks I’ve listed above do correspond to each of the separate freepool sections:

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
 1:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0b
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc
 3:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0d
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0d 51
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 db bf f4
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c0 77
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c1 90
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 77
 9:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 fa
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c4 45
11:     col 0; len ..; (..):  00 00 00 01 00 00 09 db c5

12:     col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
13:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
15:     col 0; len ..; (..):  00 02 00 01 00 00 09 db 0b
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 0b ee
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 db bf 8b
18:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c0 a4
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c2 21
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c3 6c
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c4 21
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c5 9e
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 53
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 d6

25:     col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
26:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
27:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
28:     col 0; len ..; (..):  00 04 00 01 00 00 09 db 0b
29:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d
30:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc 64
31:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc b5
32:     col 0; len ..; (..):  00 04 00 01 00 00 09 db bd
33:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd 51
34:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd a2
35:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd f3
36:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 44
37:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 95

38:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
39:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
40:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
41:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
42:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
43:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
44:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
45:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
46:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

47:     col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
48:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
49:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
50:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
51:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
52:     col 0; len ..; (..):  00 06 00 01 00 00 09 db 0a
53:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a
54:     col 0; len 10; (10):  00 06 00 01 00 00 09 db bf da
55:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c1 6c
56:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c2 cc
57:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c4 90
58:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c6 22

I’ve broken the list and numbered the entries to match the treedump above, so it’s easy to check that leaf blocks 38 to 46 are the now empty blocks for the reusable chunks. We started the reload with 3,001 entries for reusable chunks all in one freepool; we’ve ended it with none. Something has “stolen” the reusable chunks from freepool 2 so that they could be used for creating new LOBs that were distributed across all the freepools.

Oracle has been very efficient about re-using the index space, with a little bit of wastage creeping in, perhaps caused by coincidences in timing, perhaps by some code that avoids waiting too long when trying to lock index entries. We have a contention point because of the single threaded delete – but it doesn’t appear to be a disaster for space utilisation. Of course we need to look at the level of contention, and repeat the cycle a few times, changing the freepool used for deletion fairly randomly to see if we just got lucky or if the first few deletes are special cases. We can also ask questions about how the “stealing” takes place – does a process steal one index entry at a time, or does it take several consecutive index entries from the same block while it’s got the leaf block locked – but perhaps we don’t really need to know the fine details, the amount of time spent in contention (typically TX waits of some sort) could tell us whether or not we had a significant problem.

Contention and Resources

For each of the processes running the inserts I took a couple of snapshots – session stats and wait events – to see if anything interesting showed up. Naturally, the closer you look the more strange things you find. Here are a few sets of numbers from v$session_event and v$sesstat (in my snapshot format – with the four sessions always reported in the same order):

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: HW - contention                                985           0          93.15        .095           1
enq: HW - contention                                 10           0           5.46        .546           1
enq: HW - contention                              1,001           0         102.27        .102           1
enq: HW - contention                              1,010           0         106.12        .105           1

db file sequential read                           1,038           0          40.75        .039           2
db file sequential read                              39           0           3.21        .082           1
db file sequential read                           1,038           0          28.33        .027           1
db file sequential read                           1,046           0          34.89        .033           1

Name                                                                     Value
----                                                                     -----
physical reads                                                           1,038
physical reads direct                                                      979

physical reads                                                              39
physical reads direct                                                       19

physical reads                                                           1,038
physical reads direct                                                      998

physical reads                                                           1,046
physical reads direct                                                    1,005

session logical reads                                                  114,060
session logical reads                                                   22,950
session logical reads                                                  104,555
session logical reads                                                   93,173

data blocks consistent reads - undo records applied                      2,165
data blocks consistent reads - undo records applied                        119
data blocks consistent reads - undo records applied                      1,222
data blocks consistent reads - undo records applied                        193

My first thought when looking at the wait events was to get an idea of where most of the time went, and I had expected the HW enqueue to be the most likely contender: this enqueue is held not only when the high water mark for a segment is moved, it’s also held when a process is doing any space management for inserting a LOB. So my first surprise was that one session was hardly waiting at all compared to the other sessions.

Then I noticed that this one session was also suffering a very small number of “db file sequential read” waits compared to every other session – but why were ANY sessions doing lots of db file sequential reads ? The LOB was declared as nocache so any reads ought to be direct path reads, and although Oracle doesn’t always have to wait for EVERY direct path read we should have read (and rewritten) 1,500 “reusable” LOB chunks by direct path reads in each session – I refuse to believe we never waited for ANY of them. So take a look at the session stats, which show us that the “db file sequential read” waits match exactly with the “physical reads” count but most of the “physical reads” are recorded as “physical reads direct” – Oracle is recording the wrong wait event while reading the “reusable” chunks.

Okay, so our direct path read waits are being recorded incorrectly: but one session does hardly any physical reads anyway – so what does that mean ? It means the process ISN’T reusing the chunks – you can’t be reusing chunks if you haven’t read them. But the dumps from the index tell us that all the reusable chunks have been reused – so how do we resolve that contradiction ? Something is reading the index to identify some reusable chunks, wiping the reference from the index, then not using the chunks so (a) we’ve got some reusable chunks “going missing” and (b) we must be allocating some new chunks from somewhere – maybe bumping the high water mark of the segment, maybe allocating new extents.

Fortunately I had used the dbms_space package to check what the lob segment looked like after I had loaded it. It was 8192 blocks long, with 66 blocks shown as unused and 8,000 (that’s exactly 2 blocks/chunks per LOB) marked as full. After the delete/insert cycle it was 8,576 blocks long, with 8,000 blocks marked as full and 444 marked as unused. We had added three extents of 1MB each that we didn’t need, and one session seems to have avoided some contention by using the new extents for (most of) its LOBs rather than competing for the reusable space with the other LOBs.
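
The check itself is just the sort of call the dbms_space_use.sql script (from the Space Usage post above) makes – something like the following, where the owner name is made up and text_lob is the LOB segment name used in this test:

SQL> @dbms_space_use test_user text_lob lob ""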

Was this a one-off, or a repeatable event ? How bad could it get ?

TO BE CONTINUED.

Post-script

Is there a way of discovering from SQL (perhaps with a low-cost PL/SQL function) the freepool for a LOB when it’s defined as Basicfile ? You can get the LOBid for a Securefiles LOB using the dbms_lobutil package and the LOBid includes the critical first two bytes – but the package is only relevant to Securefiles. I rather fancy the idea of a process knowing which freepool it is associated with and only deleting LOBs that come out of that freepool.

Update

A thought about knowing the freepool – you could always add the freepool number as a column to the base table and have a “before row insert” trigger (or some other coding strategy) that populated the column with the freepool id. This would require the application to know a little too much about the implementation and about the internals, and would require a code change somewhere if anyone decided to change freepools. (Possibly VPD/RLS/FGAC could be used to hide the details from the outside world). The code to delete LOBs could then delete only those rows that matched its current freepool. WARNING – this is an idle thought, probably not relevant to (most) people, and there may be better strategies to make the effort irrelevant.
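
For what it’s worth, here’s a minimal sketch of that idea, assuming freepools 4 and a spare freepool_id column on t1 – it just re-uses the mod(pid,N) query from the previous installment, and the trigger owner would need direct grants on v_$process and v_$session for it to compile:

create or replace trigger t1_bri
before insert on t1
for each row
begin
        -- record the freepool that this session's process id maps to
        select  mod(prc.pid, 4)
        into    :new.freepool_id
        from    v$process prc,
                v$session ses
        where   prc.addr = ses.paddr
        and     ses.sid  = to_number(sys_context('userenv','sid'))
        ;
end;
/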

Following a little accident with another test I noticed a pattern that suggested that you would only reuse space on your own freepool and might, at the same time, move some of your freepool into segment freespace. After a couple of experiments though it looks as if you use the reusable space in your freepool; if you have no reusable space left in your freepool you start using the segment freespace (ignoring the reusable space in other freepools); if there is no free space in the segment you raid EVERY OTHER FREEPOOL and transfer all their reusable space to segment freespace – which is bad news if there are 200,000 reusable chunks in their freepools. Unfortunately, at the same time, you add an extent to the segment and insert your new LOB into that extent. There’s still plenty of scope for refining details here; I’ve only tried pushing one session into raiding the other freepools, so I don’t know what happens if there are multiple sessions trying to create LOBs while the raid is going on, who blocks whom, for how long, and what happens about new extents etc. Clearly, though, there’s plenty of scope for nasty things to happen.

 

 

August 22, 2016

Basicfile LOBS 3

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 5:56 pm BST Aug 22,2016

In the previous article in this mini-series I described how the option for setting freepools N when defining Basicfile LOBs was a feature aimed at giving you improved concurrency for inserts and deletes that worked by splitting the LOBINDEX into 2N sections: N sections to index the current LOB chunks by LOB id, alternating with N sections to map the reusable LOB chunks by deletion time.

In this article we’ll look a little further into the lifecycle of the LOB segment but before getting into the details I’ll just throw out a couple of consequences of the basic behaviour of LOBs that might let you pick the best match for the workload you have to deal with.

  • If you have enabled storage in row the first 12 chunks of a lob will be identified by the LOB Locator stored in the row, so if all your LOBs are sized between 4KB and 96KB (approximately) the LOB Index will consist only of entries for the reusable LOB space due to deleted LOBs even though the LOBs themselves will be stored out of line. This makes it look like a good idea to enable storage in row even when you expect all of your (smallish) LOBs to be stored out of row.
  • It’s quite nice to cache LOBs (at least for a little while) if your pattern of use means you access a specific LOB for a little while before it ceases to be interesting; but LOBs can swamp a large fraction of the buffer cache if you’re not careful. If you expect to follow this pattern of behaviour you might define a RECYCLE cache and then assign the LOB to that cache so that you get the benefits of caching while still protecting the main volume of your buffer cache.
  • Depending on the expected size of your LOBs you may have a good justification for creating a tablespace of a non-standard size for the LOB segment so that it takes fewer block reads to read the entire LOB. If (for example) you have a LOB which is always in the range of 62KB then a tablespace with a blocksize of 32KB would be a good choice because the LOB could be read with just two block reads. A fringe benefit of the non-standard block size, of course, is that you have to define a non-standard cache, which separates the LOB activity from the rest of the buffer cache. (Note: Oracle reads LOBs one chunk at a time, so the number of LOB reads – as opposed to block reads – for a 32KB chunk is the same whether the block size is 8KB or 32KB)
  • If you’re going to be logging your LOBs then remember that nocache LOBs will write entire chunks into the redo logs – think about how much extra redo this might generate: it might be better to have a small recycle cache and cache your LOBs, since cached LOBs are logged at the byte level. (You don’t want a 32KB block size, nocache, logging if your LOBs are all 33KB). A combined DDL sketch follows this list.
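
Pulling a few of those points together, the sort of definition you might end up with could look like the following sketch – every name here is made up, lob_ts_32k is assumed to be a pre-created 32KB blocksize tablespace (which in turn requires db_32k_cache_size to be set), and the choice between cache, cache reads and nocache depends on the caching/logging trade-off above:

create table doc_tab (
        id      number          constraint doc_tab_pk primary key,
        doc     clob
)
lob (doc) store as basicfile doc_lob (
        enable storage in row           -- the locator in the row holds the first 12 chunk pointers
        chunk 32k                       -- one chunk = one block in the 32KB tablespace
        retention
        cache reads                     -- weigh this against the logging note above
        freepools 4                     -- concurrency for inserts and deletes
        tablespace lob_ts_32k
)
;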

The LOB lifetime

Before deciding on the suitability of a feature the first thing to do is define what you’re trying to achieve so that you can think realistically about where the threats may be and what tests are going to be important – so I’m going to describe a scenario, then talk about what threats might appear based on the current details I’ve given about Basicfile LOBs and freepools.

  • We have many processes inserting “small” (16KB to 24KB) LOBs concurrently in bursts during the day.
  • Typically we peak at about 20 processes inserting at the same moment, and we end up with about 100K new LOBs per day though this varies between 50K and 200K.
  • The inserts are all “insert one row; commit”.
  • The LOBs have to be kept for 28 days, after which they (the rows that hold them) are deleted by an overnight batch job.
  • The LOBs have to be logged and the database is running in archivelog mode.

As soon as you see the “aged 28 days” you might immediately think “partitioning” (though perhaps your first thought might be that restaurant in Cincinnati airport that hangs its beef to air-dry for 28 days before cooking). Unfortunately not everyone has licensed the partitioning option, so what do you have to worry about when you start to design for this requirement ? (We’re also going to assume that securefiles are going to be saved for another blog mini-series).

Clearly we should make use of multiple freepools to avoid the insert contention on the LOBINDEX. With about 20 concurrent processes we might immediately go for freepools 20, but we might decide that a smaller number like 4 or 8 is sufficient. We probably ought to do some tests to see if we can discover any penalties for larger numbers of freepools, and to see what sort of contention we get with a smaller number of freepools.

We got a hint from the previous article that when a process deletes a LOB it indexes the reusable chunks in the same freepool as it inserts LOBs – at least, that’s what seemed to happen in our little test case in the previous article. Does Oracle always follow this pattern, or will a multi-row delete, or a large number of single “delete; commit;” cycles spread the reusable chunks evenly across all the available freepools ? If you do a single large delete do you end up with all the reusable space in one freepool – if so, does it matter or should we have multiple processes do our “big batch delete” ?

On second thoughts, my little demo showed that when you insert a LOB into freepool X and then delete it the reusable space goes into freepool X. Maybe I’ve misinterpreted the test and need to do a better test; maybe the reusable space goes into the freepool that the LOB was originally attached to, not into the freepool dictated by the process id. That would mean that a bulk delete would tend to spread the LOBs across all the freepools – which means if you used multiple processes to delete data they might cause contention on the “reusable” segments of the LOBINDEX.

If we do a single large delete and all the reusable chunks go into the same freepool what happens when we start inserting new LOBs ? If the LOB segment is “full” is it only the processes associated with that one freepool that can use the reusable space, or will EVERY process start to raid the freepool that has the only reusable space ? If the latter, have we done anything more by using multiple freepools than postpone (by roughly 28 days) the moment when we start to get contention on our LOBINDEX ?

Fortunately if we’ve made some poor choices in the original design Oracle does allow us to “rebuild freepools”, and even change the number of freepools:


alter table t1 modify lob (c1) (rebuild freepools);
alter table t1 modify lob (c1) (freepools (3));

Mind you, there is a little note on MoS that rebuilding freepools “may take some time” and locks the table in exclusive mode while it’s going on. So perhaps we should check to see how the rebuild works, and try to figure out how long it might take. A maxim for dealing with very large objects is that you really want to get it right first time because it’s hard to test the effects of change especially since you probably end up wanting to do your final tests on a backup copy of the production system.

Getting Started

I’ve specified 100K LOBs per day, sized between 16KB and 24KB, kept for 28 days – that’s about 50 GB, and I don’t really want to sit waiting for Oracle to build that much data while running 20 concurrent processes that are logging and generating archived redo log. (Especially since I may want to repeat the exercise two or three times with different numbers of freepools.) I’m going to start small and grow the scale when it’s necessary.

I’ll start with 4 concurrent processes inserting 1,000 LOBs each, sized at 12KB, with freepools 4, and I’ll rig the system very carefully so that each process uses a different freepool. After that I’ll run a single batch delete to delete the first 3,000 LOBs – I’ll pick a process that ought to use freepool 1 or 2 (i.e. not 0 or 3, the “end” freepools); then I’ll repeat the insert cycle but insert just 750 LOBs per process. At various points in this sequence of events I’ll stop and dump some index blocks and look at some stats to see if I can spot any important patterns emerging.

Once I’ve got through that cycle I’ll decide what to do next – the first set of results may produce some important new questions – but I’m guessing that I’ll probably end up repeating the “delete / insert” cycle at least one more time.

Here’s a little code to create a suitable table,


create sequence s1 cache 10000;

create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as basicfile
    text_lob(
            disable storage in row
            chunk 8k
            retention
            nocache
            freepools 4
            tablespace test_8k_assm
)
;

declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 0..0 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/

I’ve inserted a row to make sure that all the objects appear in all the right places. The code I’ve used to do this insert is a version of the code that I’m going to use for the concurrency testing but restricted to insert one row with an id of zero. In the concurrency test I’ll make use of the sequence I’ve created to act as the primary key to the table.

Having created the table I then start four more sessions, carefully ensuring that they will each pick a different freepool. To make sure I had one session per freepool I just kept connecting sessions and running a silly little check for each session’s process id (pid) until I had four that returned each of the values from 0 to 3:

select mod(pid,4) from v$process where addr = (
        select paddr from v$session where sid = (
                select sid from v$mystat where rownum = 1
        )
)
;

Once I had the four extra sessions set up, I issued a simple “lock table t1 in exclusive mode” from my original session then started the following script in each of the other four:


spool temp&1

declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 1..1000 loop
                insert into t1 values (s1.nextval, m_v1);
                commit;
        end loop;
end;
/

spool off

(I supplied A, B, C, and D as the first parameter to the script so that I got four sets of output, but I haven’t included the code I used to get a snapshot of the session stats, session waits, and system enqueues recorded by each session.)
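
If you want to improvise something similar, the simplest approach is to capture the following sort of queries (just a sketch – not the snapshot code referred to in the previous paragraph) before and after each run and diff the two sets of results:

-- session stats for the current session
select  sn.name, ms.value
from    v$mystat ms, v$statname sn
where   sn.statistic# = ms.statistic#
and     ms.value != 0
order by
        sn.name
;

-- wait events for the current session
select  event, total_waits, total_timeouts, time_waited, max_wait
from    v$session_event
where   sid = (select sid from v$mystat where rownum = 1)
order by
        time_waited desc
;

-- system-level enqueue activity
select  eq_type, total_req#, total_wait#, cum_wait_time
from    v$enqueue_stat
where   total_wait# != 0
order by
        cum_wait_time desc
;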

First check – did I get all four freepools evenly used (which is what I had assumed would happen when I chose the 4 process ids so carefully) ? I can check this by doing a block dump of the LOBINDEX root block because with 4,001 entries I’m (almost certainly) going to get a root block, no further branch levels, and a few dozen leaf blocks.

As with all B-tree indexes the “row directory” of the root block will contain a list of “truncated” key values that allow Oracle to search down to the correct block in the next layer of the index so I’m going to extract just the key values, and only the first column of those keys in the same way that I did with the previous article. This means every line in the following output shows you, in order, the first LOB id (with a few of them truncated) in each leaf block:

col 0; len 10; (10):  00 00 00 01 00 00 09 da fe a7
col 0; len 10; (10):  00 00 00 01 00 00 09 db 00 24
col 0; len 10; (10):  00 00 00 01 00 00 09 db 01 6f
col 0; len 10; (10):  00 00 00 01 00 00 09 db 02 ec
col 0; len  9; ( 9):  00 00 00 01 00 00 09 db 04
col 0; len 10; (10):  00 00 00 01 00 00 09 db 05 7c
col 0; len 10; (10):  00 00 00 01 00 00 09 db 07 2b
col 0; len 10; (10):  00 00 00 01 00 00 09 db 07 e0
col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
col 0; len  9; ( 9):  00 00 00 01 00 00 09 db 0b
col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc

col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
col 0; len 10; (10):  00 02 00 01 00 00 09 da fe 81
col 0; len 10; (10):  00 02 00 01 00 00 09 db 00 62
col 0; len 10; (10):  00 02 00 01 00 00 09 db 01 ad
col 0; len 10; (10):  00 02 00 01 00 00 09 db 02 94
col 0; len 10; (10):  00 02 00 01 00 00 09 db 04 11
col 0; len 10; (10):  00 02 00 01 00 00 09 db 04 f8
col 0; len 10; (10):  00 02 00 01 00 00 09 db 06 11
col 0; len 10; (10):  00 02 00 01 00 00 09 db 07 f2
col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
col 0; len  9; ( 9):  00 02 00 01 00 00 09 db 0b

col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
col 0; len 10; (10):  00 04 00 01 00 00 09 da fe 4c
col 0; len 10; (10):  00 04 00 01 00 00 09 da ff c9
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 01
col 0; len 10; (10):  00 04 00 01 00 00 09 db 01 f8
col 0; len 10; (10):  00 04 00 01 00 00 09 db 03 75
col 0; len 10; (10):  00 04 00 01 00 00 09 db 04 5c
col 0; len 10; (10):  00 04 00 01 00 00 09 db 06 3d
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 07
col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 0b

col 0; len  2; ( 2):  00 06
col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
col 0; len  9; ( 9):  00 06 00 01 00 00 09 db 0a
col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a

As you can see, we have the expected pattern (for 4 freepools) of entries starting with (00 00), (00 02), (00 04), and (00 06); but you might wonder why there are 11 leaf blocks for 00, 12 leaf blocks for 02 and 04, and only 8 leaf blocks for 06. We can answer the 11/12 anomaly by remembering that any branch blocks will have a “leftmost child” entry that won’t appear in the row directory – so the 12th leaf (or rather the 1st leaf) block for 00 is being pointed to by the “LMC”. But what about the missing blocks for 06 ? A treedump shows the answer:


branch: 0x1800204 25166340 (0: nrow: 44, level: 1)
   leaf: 0x1800225 25166373 (-1: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (0: nrow: 81 rrow: 81)
   leaf: 0x1800231 25166385 (1: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (2: nrow: 81 rrow: 81)
   leaf: 0x1800239 25166393 (3: nrow: 75 rrow: 75)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 81)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 132 rrow: 132)

   leaf: 0x1800226 25166374 (11: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (12: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (13: nrow: 81 rrow: 81)
   leaf: 0x1800232 25166386 (14: nrow: 81 rrow: 81)
   leaf: 0x1800236 25166390 (15: nrow: 81 rrow: 81)
   leaf: 0x180023a 25166394 (16: nrow: 81 rrow: 81)
   leaf: 0x180023e 25166398 (17: nrow: 81 rrow: 81)
   leaf: 0x1800207 25166343 (18: nrow: 81 rrow: 81)
   leaf: 0x180020b 25166347 (19: nrow: 81 rrow: 81)
   leaf: 0x180020f 25166351 (20: nrow: 81 rrow: 81)
   leaf: 0x1800213 25166355 (21: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (22: nrow: 111 rrow: 111)

   leaf: 0x1800229 25166377 (23: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (24: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (25: nrow: 78 rrow: 78)
   leaf: 0x1800237 25166391 (26: nrow: 81 rrow: 81)
   leaf: 0x180023b 25166395 (27: nrow: 81 rrow: 81)
   leaf: 0x180023f 25166399 (28: nrow: 81 rrow: 81)
   leaf: 0x1800208 25166344 (29: nrow: 81 rrow: 81)
   leaf: 0x180020c 25166348 (30: nrow: 76 rrow: 76)
   leaf: 0x1800210 25166352 (31: nrow: 81 rrow: 81)
   leaf: 0x1800214 25166356 (32: nrow: 81 rrow: 81)
   leaf: 0x1800230 25166384 (33: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (34: nrow: 107 rrow: 107)

   leaf: 0x180022b 25166379 (35: nrow: 78 rrow: 78)
   leaf: 0x1800218 25166360 (36: nrow: 152 rrow: 152)
   leaf: 0x180021c 25166364 (37: nrow: 152 rrow: 152)
   leaf: 0x1800220 25166368 (38: nrow: 152 rrow: 152)
   leaf: 0x1800224 25166372 (39: nrow: 152 rrow: 152)
   leaf: 0x1800228 25166376 (40: nrow: 152 rrow: 152)
   leaf: 0x180022c 25166380 (41: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (42: nrow: 11 rrow: 11)

Although there are little glitches along the way, the leaf blocks for freepools 0, 1, and 2 have 81 entries per leaf block while the leaf blocks for freepool 3 have 152 – it's the difference between inserting rows at the end of an index and getting a "90/10" split compared with inserting somewhere in the middle and getting a "50/50" split. In fact, although Oracle uses the idgen1$ sequence to drive the generation of LOB ids, the way it "batches" IDs (the sequence cache size is 50) means that LOB ids can appear out of order when generated by multiple sessions – even in the same freepool: it is possible for the top freepool to suffer 50/50 splits, though these are likely to appear relatively rarely – at least until I start deleting LOBs, at which point a "00 07" section of reusable chunks may appear. (As a side effect, the LOB ids in my table and the id column on the underlying table are out of order with respect to each other.)

You'll notice that I've left blank lines in the treedump list corresponding to the breaks between the freepools (remember there is a "-1" entry in the treedump for the "leftmost child" that doesn't appear in the row directory). This helps to explain the few leaf blocks with unusual row counts – they're the ones at (or very near) the boundaries between freepools.

Bulk Delete

After loading some data in with concurrent inserts and multiple freepools I'm going to do a single big delete from the table to get rid of a lot of "old" data. In fact all I'll do is delete the rows where id <= 3000. The big question is this – will a single delete put all the reusable chunks into a single freepool, will it put the reusable space for each LOB into the freepool that the LOB was originally in, or will it find some other way to spread the reusable space evenly across all the freepools ? One freepool or many – both options have good points, both options have bad points.
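The delete itself is as simple as it sounds – a minimal sketch (I'm assuming the base table is still called t1 with a numeric id column, as in the earlier posts in this series):

delete from t1 where id <= 3000;
commit;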

Here’s what I got as the treedump after the delete:


branch: 0x1800204 25166340 (0: nrow: 72, level: 1)
   leaf: 0x1800225 25166373 (-1: nrow: 81 rrow: 0)
   leaf: 0x180022d 25166381 (0: nrow: 81 rrow: 0)
   leaf: 0x1800231 25166385 (1: nrow: 81 rrow: 0)
   leaf: 0x1800235 25166389 (2: nrow: 81 rrow: 0)
   leaf: 0x1800239 25166393 (3: nrow: 75 rrow: 0)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 0)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 0)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 0)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 22)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 132 rrow: 120)

   leaf: 0x1800226 25166374 (11: nrow: 81 rrow: 0)
   leaf: 0x180022a 25166378 (12: nrow: 81 rrow: 0)
   leaf: 0x180022e 25166382 (13: nrow: 81 rrow: 0)
   leaf: 0x1800232 25166386 (14: nrow: 81 rrow: 0)
   leaf: 0x1800236 25166390 (15: nrow: 81 rrow: 0)
   leaf: 0x180023a 25166394 (16: nrow: 81 rrow: 0)
   leaf: 0x180023e 25166398 (17: nrow: 81 rrow: 0)
   leaf: 0x1800207 25166343 (18: nrow: 81 rrow: 0)
   leaf: 0x180020b 25166347 (19: nrow: 81 rrow: 0)
   leaf: 0x180020f 25166351 (20: nrow: 81 rrow: 64)
   leaf: 0x1800213 25166355 (21: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (22: nrow: 111 rrow: 101)

   leaf: 0x1800229 25166377 (23: nrow: 81 rrow: 0)
   leaf: 0x180022f 25166383 (24: nrow: 81 rrow: 0)
   leaf: 0x1800233 25166387 (25: nrow: 78 rrow: 0)
   leaf: 0x1800237 25166391 (26: nrow: 81 rrow: 0)
   leaf: 0x180023b 25166395 (27: nrow: 81 rrow: 0)
   leaf: 0x180023f 25166399 (28: nrow: 81 rrow: 0)
   leaf: 0x1800208 25166344 (29: nrow: 81 rrow: 0)
   leaf: 0x180020c 25166348 (30: nrow: 76 rrow: 0)
   leaf: 0x1800210 25166352 (31: nrow: 81 rrow: 0)
   leaf: 0x1800214 25166356 (32: nrow: 81 rrow: 36)
   leaf: 0x1800230 25166384 (33: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (34: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (35: nrow: 139 rrow: 139)

   leaf: 0x1800227 25166375 (36: nrow: 138 rrow: 138)
   leaf: 0x1800205 25166341 (37: nrow: 126 rrow: 126)
   leaf: 0x1800219 25166361 (38: nrow: 82 rrow: 82)
   leaf: 0x1800272 25166450 (39: nrow: 95 rrow: 95)
   leaf: 0x1800209 25166345 (40: nrow: 118 rrow: 118)
   leaf: 0x180021f 25166367 (41: nrow: 143 rrow: 143)
   leaf: 0x180020d 25166349 (42: nrow: 81 rrow: 81)
   leaf: 0x1800243 25166403 (43: nrow: 90 rrow: 90)
   leaf: 0x1800222 25166370 (44: nrow: 147 rrow: 147)
   leaf: 0x1800211 25166353 (45: nrow: 81 rrow: 81)
   leaf: 0x1800247 25166407 (46: nrow: 73 rrow: 73)
   leaf: 0x1800223 25166371 (47: nrow: 98 rrow: 98)
   leaf: 0x180026a 25166442 (48: nrow: 98 rrow: 98)
   leaf: 0x180021d 25166365 (49: nrow: 127 rrow: 127)
   leaf: 0x1800266 25166438 (50: nrow: 131 rrow: 131)
   leaf: 0x1800215 25166357 (51: nrow: 133 rrow: 133)
   leaf: 0x180026e 25166446 (52: nrow: 141 rrow: 141)
   leaf: 0x180021b 25166363 (53: nrow: 82 rrow: 82)
   leaf: 0x180024b 25166411 (54: nrow: 93 rrow: 93)
   leaf: 0x1800276 25166454 (55: nrow: 109 rrow: 109)
   leaf: 0x180024f 25166415 (56: nrow: 77 rrow: 77)
   leaf: 0x180021e 25166366 (57: nrow: 143 rrow: 143)
   leaf: 0x180027e 25166462 (58: nrow: 126 rrow: 126)
   leaf: 0x1800221 25166369 (59: nrow: 93 rrow: 93)
   leaf: 0x1800253 25166419 (60: nrow: 82 rrow: 82)
   leaf: 0x180027a 25166458 (61: nrow: 97 rrow: 97)
   leaf: 0x1800257 25166423 (62: nrow: 84 rrow: 84)

   leaf: 0x180022b 25166379 (63: nrow: 78 rrow: 0)
   leaf: 0x1800218 25166360 (64: nrow: 152 rrow: 0)
   leaf: 0x180021c 25166364 (65: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (66: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (67: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (68: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (69: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (70: nrow: 11 rrow: 11)


The number of leaf blocks has gone up from 44 to 72 (but that shouldn’t be too much of a surprise – index leaf block space can’t be reused until after the commit, so we were bound to grow the index to insert the entries for reusable chunks).

As before I've inserted a few blank lines to break the list into the separate index sections, and you can see that the first few blocks in each of the first three freepools have nrow = 81 and (typically) rrow = 0. These are the leaf blocks where all the LOB entries have been marked as deleted. There are a couple of variations – leaf block 10, for example, shows nrow = 132, rrow = 120: this is the leaf block where freepool 0 (LOB section) overlapped with freepool 1 (LOB section), and the first 12 LOBs in freepool 1 have been marked as deleted. The LOB section for freepool 3 follows the same sort of pattern, though nrow = 152 in most of its blocks.

The important detail is in leaf blocks 36 to 62 – which show nrow = rrow throughout, but with a degree of randomness as to the actual number of index entries. These are the leaf blocks that record the “reusable chunks”, and they’ve all been associated with freepool 2 (counting from zero). There are several details that combine to explain why the numbers of entries per leaf block vary so much, but I don’t want to get too distracted by them now; remember, though, that I pointed out that the LOB ids and table id column weren’t in synch with each other so part of what you’re seeing here is 50/50 leaf node splits followed by a little back-filling.

Again I've extracted the "col 0" values from the block dump of the root block – I won't show all of them, I'll just show you entries 35 to 63 so that you can see the leaf block pointers for the "reusable" section of freepool 2, and the LOB sections of freepools 2 and 3:


col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d

col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

col 0; len 02; (02):  00 06

The starting "00 05" tells us that these are the reusable chunk entries associated with freepool 2 (2 * 2 + 1 = 5), and the fact that the next four bytes (the timestamp) are identical across the entries tells you that I managed to delete my 3,000 LOBs in less than one second.

After seeing the effect of this monolithic delete you should now be asking yourself a few questions, such as:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

TO BE CONTINUED …

 

August 18, 2016

Basicfile LOBs 2

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 1:21 pm BST Aug 18,2016

There are probably quite a lot of people still using Basicfile LOBs, although Oracle Corp. would like everyone to migrate to the (now default) Securefile LOBs. If you’re on Basicfile, though, and don’t want (or aren’t allowed) to change just yet here are a few notes that may help you understand some of the odd performance and storage effects.

Of course, there are a lot of variations in how you declare the LOB – pctversion vs. retention, cache vs. nocache, logging vs. nologging, enable vs. disable storage in row, and I can’t cover all the combinations – so what I’ll be looking at is a general strategy for handling a large number of small LOBs that are being inserted into the database at a fairly high degree of concurrency, and then being deleted a few weeks later so, in theory, the LOB segment should end up at a steady state with “N” days worth of data stored. The driver behind this modelling is a problem I was asked to examine a little while ago.

Some background details on Basicfile LOBs

If the LOB column is defined as “enable storage in row” then a very small LOB (up to 3,960 bytes) will be stored almost as if it were an ordinary column in the row; if the size of a LOB is a little larger than this limit then the LOB will be stored in chunks in the LOB segment and pointers to the first 12 chunks will be stored in the row, with pointers for further chunks stored in the LOBINDEX. The chunk size defined for a LOB column can be up to 32KB – though the default chunk size is the block size for the containing tablespace –  so it’s possible to store a LOB of nearly 384KB before Oracle needs to create index entries in the LOBINDEX, though most people use 8KB blocks and will start using the LOBINDEX when a LOB gets close to 96KB.

If the LOB column is defined as “disable storage in row” then, no matter how small it really is, it will always take up at least one chunk in the LOB segment and will have a corresponding index entry in the LOBINDEX.  For reasons of efficiency an entry in LOBINDEX always “carries” 32 bytes of pointer data, allowing it to list up to 8 chunks.

When a LOB is deleted (replaced by a null, an empty_lob(), or a new LOB value) the previous state of the base table row and the LOBINDEX will be preserved in the undo segment in the ordinary manner, but the previous version of the LOB data itself is simply left in the segment (and a new version of the LOB created if the operation is an "update"). The chunks making up the old version are added to the LOBINDEX with a key based on the time (seconds since 1st Jan 1970) the delete took place – this means that when Oracle wants to re-use space in the LOB segment it can walk the LOBINDEX in order to find the chunks that were marked as available for reuse the greatest time into the past. (It also means that the LOBINDEX is one of the strangest in the Oracle pantheon – part of it indexes "reusable chunks keyed by time", part of it indexes "current chunks keyed by LOB id".)

There are two options for how long old versions of LOBs will be kept: PCTVERSION specifies the percentage of space below the segment’s highwater mark that may be used to keep old versions, and (until 12c, where things change) RETENTION specifies that Oracle should try to keep old versions for the length of time given by the system parameter undo_retention (which defaults to 900 seconds). If enough versions of LOBs have been kept Oracle can create a read-consistent version of a given LOB by using the normal undo mechanisms to take the base table row and LOBINDEX back to the correct point in time which will then ensure that the LOB pointers will be pointing to the correct chunks.  (If the LOB chunks have been over-written this is the point where you will get an Oracle error: “ORA-22924 Snapshot too old”, followed by a misleading “ORA-01555 Snapshot too old ….”)
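If you want to check how an existing LOB column has been declared, most of these details (including the pctversion/retention choice and the freepools setting I'll come to shortly) are visible in the data dictionary – a minimal sketch for the current schema:

select
        table_name, column_name, segment_name, index_name,
        chunk, pctversion, retention, freepools,
        cache, logging, in_row
from
        user_lobs
order by
        table_name, column_name
;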

One final interesting point from a performance perspective is that if you define the LOB to be "nocache" – which means that typical reads and writes of the LOB will use direct path – and then specify "nologging", reads and writes of the LOB will generate only tiny amounts of redo log. Two special points to go with this, though: first, if you specify "nocache logging" the direct path writes will be logged, but the logging is done by chunk – so if you store 4,000 bytes of data in a LOB with a 32KB chunk size you will write 32KB of redo log; secondly, if you are testing the effects of logging and nologging, make sure your test database is running in archivelog mode if your production database is going to be archiving – otherwise Oracle will fool you by taking a short cut and NOT logging a nocache LOB even if you specify logging! The LOBINDEX is always cached and logged, by the way, and even if the LOB is defined as nocache there are circumstances where LOB blocks are read into the buffer cache (remember my previous note describing how we saw 6 billion buffer gets on a nocache LOB).

The last detail I want to mention is the FREEPOOLS parameter. The 11.2 developers' guide describes this as: "Specifies the number of FREELIST groups for BasicFiles LOBs, if the database is in automatic undo mode." Unfortunately freelists and freelist groups are things that happen in manual segment space management, so this definition requires an alternative meaning for the expression "FREELIST groups". The purpose of FREEPOOLS is to help deal with concurrency problems, but there's not much information around to help you understand the mechanisms and pitfalls of freepools, and the available documents on MoS don't really do anything to clarify the position – and that's what this article is (finally) going to talk about.

Basicfile FREEPOOLs – the truth is out there

When you specify FREEPOOLs you affect the way Oracle uses the LOBINDEX – not the space management information about the segment holding the index but the actual content of (in fact the KEY values held by) the index.

You can do a treedump of a LOBINDEX by object_id in the standard way that you do a treedump of any B-tree (or bitmap) index, and you can dump blocks from a LOBINDEX in the same way you dump any other data block in the database, by file number and block number (or block range), so it’s easy to see what happens in a LOBINDEX when you start using multiple freepools. I’ve created a table holding a LOB defined with “disable storage in row” so that I always use the LOBINDEX, inserted three rows then deleted one of them and dumped the one index block (which happens to be both the root and a leaf). Here’s the SQL to create the table and do the data handling:


create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as basicfile text_lob(
        disable storage in row
        chunk 8k
        retention
        nocache
        tablespace test_8k_assm
)
;


declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 1..3 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/


delete from t1 where id = 1;
commit;

alter system flush buffer_cache;

I’ve ended by flushing the buffer cache so that I don’t get a huge trace file when I try to dump the index to disc. Here’s the next bit of processing:


SQL> select object_id from user_objects where object_type = 'INDEX' and object_name like 'SYS_IL%';

 OBJECT_ID
----------
    241599

SQL> alter session set events 'immediate trace name treedump level 241599';

----- begin tree dump
leaf: 0x1800204 25166340 (0: nrow: 4 rrow: 3)
----- end tree dump

SQL> alter session dump datafile 6 block 516;

I've included in the above the treedump that I extracted from the tracefile and this shows that the index consists of a single leaf block (0x1800204 = file 6 block 516) with 4 row directory entries, of which one has been deleted. Here's the row dump from that leaf block – the first three entries are the index entries identifying the three LOBs I created (and, as shown by the flag value "---D--", the first has been marked as deleted); the fourth entry points to a set of free chunks (corresponding to the chunks that will become available for re-use after a delay corresponding to the undo retention time).


row#0[7982] flag: ---D--, lock: 2, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 a6 01 80 01 aa 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 85
col 1; len 4; (4):  00 00 00 00

row#1[7932] flag: ------, lock: 0, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 ae 01 80 01 b2 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 86
col 1; len 4; (4):  00 00 00 00

row#2[7882] flag: ------, lock: 0, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 b6 01 80 01 ba 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 87
col 1; len 4; (4):  00 00 00 00

row#3[7832] flag: ------, lock: 2, len=50, data:(32):
 01 80 01 a6 01 80 01 aa 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 01 57 b3 32 9b 00 00 00 00
col 1; len 4; (4):  01 80 01 a6

If you examine the entries closely you will see that despite the common structure of all four of them there are two patterns. Ignoring the "data (32):" portion and looking at just "col 0", the last few bytes of the first three entries hold consecutive numbers which are actually the LOB Ids for the three LOBs (derived from the sequence idgen1$). The fourth entry breaks that pattern and if you examine bytes 3 to 6 you will find that they hold (approximately, by the time I publish this article) the number of seconds since 1st Jan 1970.
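As a quick cross-check (a sketch only – I'm assuming those four bytes are a plain hex representation of the epoch seconds, and ignoring any time zone effects) you can convert the value back to a date in SQL:

select
        to_date('01-Jan-1970','dd-mon-yyyy')
                + to_number('57b3329b','xxxxxxxx')/86400        approx_delete_time
from
        dual
;

For the entry above this should come out at around the middle of August 2016, which is consistent with when I was running the test.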

To a large degree you need only examine “col 0” to get a good idea of how Oracle handles the LOBINDEX, but I will say just a few things about the rest of the entry.  For the “reusable space” index entries “col 1” is the first of a list of up to 8 chunks that were released from the same LOB at that moment, and the “data(32)” is the full list of those 8 chunks – each chunk is identified by the block address of the first block of the chunk. If I had created and deleted a LOB of roughly 128KB I would have used 16 chunks to create it and made 16 chunks available on deletion, so there would have been two index entries with the same “col 0” value, each identifying 8 of the chunks – hence the re-appearance of the first chunk as “col 1”.  (You can’t help wondering why Oracle doesn’t squeeze 9 chunks per index entry rather than repeating the first of the list – maybe there’s a peripheral effect that makes 8 easier, maybe it’s simply a good sanity check mechanism.)

For index entries about current LOBs “col 1” is a counter for the index entries that identify the entire LOBs. Our LOBs were all very small so we only needed one index entry (which Oracle starts counting from zero).  The “data (32)” entry for the “zeroth” entry starts with 16 bytes of metadata then holds up to 4 pointers to chunks; subsequent entries don’t need the metadata and can hold up to 8 pointers each and “col 1” stores the chunk number that the index entry starts with, so “col 1” in consecutive index entries for a given LOB id will have values 0, 4, 12, 20, etc.

You might note, by the way, that my LOBs are not made up of consecutive blocks even though my chunk size is exactly one block. This is a side effect of ASSM (automatic segment space management) and nothing specifically to do with LOBs.

With this sketch in place you now have some idea of how a LOBINDEX works. Apart from the convenience of knowing roughly what information is stored in the index, and how it has this strange dual purpose, you can now view it just like any other B-tree index in Oracle. When you insert a LOB you insert some index entries into the middle of the index (the high-value point of the LOB id bit); when you delete a LOB you mark some (consecutive) index entries as deleted and insert some index entries at the high end of the index (the high-value point of the "reusable chunks" bit) because each delete is the most recent delete.

As soon as you’ve got that far you realise that if you have some degree of concurrency of inserts and deletes then you have two hot spots in the index – the mid point where you’re going to get lots of 50/50 splits as LOBs are inserted and the end point where all the re-usable chunks are indexed. So how do you configure Oracle, and what does Oracle do, to reduce the contention ?

Take another look at the “col 0” values – which I’ve cut out and listed in isolation below:


col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 85
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 86
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 87

col 0; len 10; (10):  00 01 57 b3 32 9b 00 00 00 00

Apart from making it easy to see the sequencing in the 3 LOB ids it’s now easy to note that the first three (LOB) entries start with “00 00” while the last (reusable space) entry starts with “00 01”. It’s really this starting two bytes that makes it easy for Oracle to separate the current LOBs section of the index from the reusable space section. The two bytes are the freepool identifier – it’s the first (and only, in my example) free pool – but Oracle is counting from zero, doubling the counter for the current LOBs, and doubling and adding one for the reusable space.
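When I want multiple freepools the only change I need to make to the earlier script is one extra line in the LOB storage clause – a minimal sketch of the revised clause (reusing the names from the script above):

lob (c1)
store as basicfile text_lob(
        disable storage in row
        chunk 8k
        retention
        freepools 4
        nocache
        tablespace test_8k_assm
)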

Here are some results when I drop and recreate the table with freepools 4 and repeat the experiment. (I’ve removed the “data(32)” content to make the output a little cleaner, and then extracted the “col 0” values).


row#0[7982] flag: ---D--, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 55
col 1; len 4; (4):  00 00 00 00

row#1[7932] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 56
col 1; len 4; (4):  00 00 00 00

row#2[7882] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 57
col 1; len 4; (4):  00 00 00 00

row#3[7832] flag: ------, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 07 57 b3 3b a5 00 00 00 00
col 1; len 4; (4):  01 80 01 df

===

col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 55
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 56
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 57

col 0; len 10; (10): 00 07 57 b3 3b a5 00 00 00 00

It just happened that with 4 freepools available my session picked freepool 3, so its LOBINDEX entries are preceded with 00 06 (2 * 3), and its reusable space index entries are preceded with 00 07 (2 * 3 + 1). At present I think the freepool chosen by a session (counting from zero) is derived from the session's process id (pid) by a simple mod(pid , freepools).
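If you want to check the guess for your own session, here's a sketch of a quick query (the 4 is the freepools setting for this test, and of course the mod() mapping itself is only a conjecture):

select
        p.pid, mod(p.pid, 4)    guessed_freepool
from
        v$session s, v$process p
where
        s.sid  = to_number(sys_context('userenv','sid'))
and     p.addr = s.paddr
;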

So what happens if I start a second session and adjust my little PL/SQL block to insert rows 4, 5, and 6 ?

I expect to see two things. First, the “ordinary” B-tree event – the index entry that’s marked for deletion will be cleared out of the index; secondly I should see four new index entries (one marked as deleted) which, with a little luck (one chance in four), will show that they are associated with a different freepool.

Here’s the dump (again with the “data(32)” deleted, and the “col 0” extracted at the end):


row#0[7782] flag: ---D--, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 87
col 1; len 4; (4):  00 00 00 00

row#1[7732] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 88
col 1; len 4; (4):  00 00 00 00

row#2[7682] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 89
col 1; len 4; (4):  00 00 00 00

row#3[7632] flag: ------, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 01 57 b3 3b ad 00 00 00 00
col 1; len 4; (4):  01 80 01 a4

row#4[7932] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 56
col 1; len 4; (4):  00 00 00 00

row#5[7882] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 57
col 1; len 4; (4):  00 00 00 00

row#6[7832] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 07 57 b3 3b a5 00 00 00 00
col 1; len 4; (4):  01 80 01 df

===

col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 87
col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 88
col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 89

col 0; len 10; (10): 00 01 57 b3 3b ad 00 00 00 00

col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 56
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 57

col 0; len 10; (10): 00 07 57 b3 3b a5 00 00 00 00

The index entry previously marked as deleted has disappeared (it was LOB id “09 da 36 55”).

We have four new index entries – the first 4 in the list above – and we can see that our second session has been allocated to freepool 0, the LOBINDEX entries are preceded by “00 00”, and the reusable space index entry is preceded by “00 01”.

So by declaring freepools N, we effectively break the index up into 2N nearly discrete sections. Half the sections get inserts at their high end as we insert new LOBs (with ever increasing LOB ids) and the other half (apart, sometimes, from the very top section) get inserts at the high end as time passes and we make LOB space available for reuse by deleting existing LOBs. (Note – if two LOBs of more than 8 chunks each are deleted in the same hundredth of a second then their index entries may end up interleaving as the full key is (timestamp, first chunk address) and the chunks may be scattered widely across the tablespace). Freepools allow Oracle to remove the two killer hot spots in the index.

There are side effects, of course: apart from the section for reusable space in the top freepool each section of the index will generally be subject to 50/50 block splits so you can expect the index to be roughly twice the optimum size – and even bigger than that due to other side effects of how it’s used if you’re constantly deleting and inserting LOBs. But size isn’t really the big problem; I’ll be examining further side effects of the LOBINDEX, and the mechanism that Oracle has for using the index, and the performance threats this introduces, in the next installment.


August 16, 2016

Month End

Filed under: audit,CBO,Non-technical — Jonathan Lewis @ 1:04 pm BST Aug 16,2016

A question about parallel query and cardinality estimates appeared on OTN a little while ago that prompted me to write this note about helping the optimizer do the best job with the least effort.  (A critical point in the correct answer to the original question is that parallel query may lead to “unexpected” dynamic sampling, which can make a huge difference to the choice of execution plans, but that’s another matter.)

The initial cardinality error in the plan came from the following predicate on a “Date dimension” table:


      AR_DATE.CALENDAR_DT   = AR_DATE.MONTH_END_DT 
AND   AR_DATE.CALENDAR_DT  >= ADD_MONTHS(TRUNC(SYSDATE,'fmyy'), -36) 
AND   AR_DATE.MONTH_END_DT >= ADD_MONTHS(TRUNC(SYSDATE,'fmyy'), -36)

In the parallel plan the estimated number of rows on a full tablescan of the table was 742, while on the serial plan the same tablescan produced a cardinality of 1. You will appreciate that having an estimate of 1 (or less) that is nearly three orders of magnitude wrong is likely to lead to a very bad execution plan.

My first thought when I saw this was (based on a purely intuitive interpretation): "there's one day every month that's the last day of the month and we're looking at roughly the last 36 months so we might predict a cardinality of about 36". That's still a long way off the 742 estimate and 1,044 actual for the parallel query, but it's a warning flag that the serial estimate is probably an important error – it's also an example of the very simple "sanity checking" mental exercises that can accompany almost any execution plan analysis.
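That sort of sanity check is also easy to turn into a quick query against the dimension table if you have access to the system – a sketch using the (camouflaged) names from the original posting:

select  count(*)
from    ar_date
where   ar_date.calendar_dt  = ar_date.month_end_dt
and     ar_date.calendar_dt >= add_months(trunc(sysdate,'fmyy'), -36)
;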

My second thought (which happened to be wrong, and would only have been right some time well before version 10.2.0.5) was that the optimizer would treat the add_months() expressions as unknown values and assign a selectivity of 5% to each of the predicates, reducing the combined selectivity to 1/400th of the selectivity it gave to the first predicate. In fact the optimizer evaluates the expressions and would have used the normal (required range / total range) calculation for those two predicates.

It’s the first predicate that I want to examine, though – how does the optimizer calculate a selectivity for it ? Here’s some code to generate sample data to use for testing.


rem
rem     Script:         month_end.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2016
rem

create table t1
nologging
as
select
        rownum                                                   id,
        to_date('01-Jan-2010','dd-mon-yyyy') + rownum - 1       calendar_date,
        add_months(
                trunc(to_date('01-Jan-2010','dd-mon-yyyy') + rownum - 1 ,'MM' ),
                1
        ) - 1                                                   month_end_date
from
        dual
connect by
        level <= trunc(sysdate) - to_date('01-Jan-2010','dd-mon-yyyy') + 1
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

This clunky bit of code gives me consecutive dates from 1st Jan 2010 up to "today" with the month_end_date column holding the month end date corresponding to the row's calendar_date. So now we can check what the optimizer makes of the predicate calendar_date = month_end_date:


set autotrace on explain

select count(*) from t1 where calendar_date = month_end_date;

  COUNT(*)
----------
        79

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    16 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    16 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     1 |    16 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("CALENDAR_DATE"="MONTH_END_DATE")

Looking at operation 2 we can see that, in effect, the optimizer has considered two independent predicates “calendar_date = {unknown}” and “month_end_date = {unknown}” and taken the lower of the two selectivities – which means the cardinality estimate is 1 because the calendar_date column is unique across this table.
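You can see where that estimate of 1 comes from by checking the column stats and doing the arithmetic yourself – a quick sketch (the figures will vary with the date you created the table):

select
        t.num_rows, c.column_name, c.num_distinct,
        round(t.num_rows / c.num_distinct)      equality_card
from
        user_tables t, user_tab_cols c
where
        t.table_name  = 'T1'
and     c.table_name  = 'T1'
and     c.column_name in ('CALENDAR_DATE','MONTH_END_DATE')
order by
        c.column_name
;

The num_rows/num_distinct figure for calendar_date is 1 (the column is unique) while the figure for month_end_date would be around 30, and it's the smaller of the two that has dictated the final estimate.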

There are various ways to work around problems like this. One of the simplest would be to tell Oracle to sample this table with the (table-level) hint /*+ dynamic_sampling(t1 1) */; in fact, since this predicate is effectively treated as an examination of two predicates the (cursor-level) hint /*+ dynamic_sampling(4) */ would also cause sampling to take place – note that level 4 or higher is required to trigger sampling for “multiple” predicates on a single table. As a general guideline we always try to minimise the risk of side effects so if this problem were embedded in a much larger query I would prefer the table-level hint over the cursor-level hint.
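For example, sketches of the two hinted forms of the query above:

select  /*+ dynamic_sampling(t1 1) */
        count(*)
from    t1
where   calendar_date = month_end_date
;

select  /*+ dynamic_sampling(4) */
        count(*)
from    t1
where   calendar_date = month_end_date
;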

There are other options, though, that would allow you to bypass sampling – provided you can modify the SQL. The script I used to create this table also included the following statement:


alter table t1 add (
        date_offset1 generated always as (calendar_date - month_end_date) virtual,
        date_flag generated always as (case when calendar_date - month_end_date = 0 then 'Y' end) virtual
);

In 12c I would declare these virtual columns to be invisible to avoid problems with any SQL that didn’t use explicit column lists. For demonstration purposes I’ve set up two options – I can find the rows I want with one of two obvious predicates:

    date_offset1 = 0
    date_flag = 'Y'

In fact there’s a third predicate I could use that doesn’t need to know about the virtual columns:

    calendar_date - month_end_date = 0

Unfortunately I can’t arbitrarily swap the order of the two dates in the last predicate, and the optimizer won’t spot that it is also equivalent to “calendar_date = month_end_date”. Here are a few execution plans – for which the only significant bit is the cardinality estimate of the full tablescans:


select count(*) from t1 where date_flag = 'Y';

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     2 |     4  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     2 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |    79 |   158 |     4  (25)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DATE_FLAG"='Y')



select count(*) from t1 where date_offset1 = 0;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |    78 |   312 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DATE_OFFSET1"=0)



select count(*) from t1 where calendar_date - month_end_date = 0;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |    78 |   312 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."DATE_OFFSET1"=0)

It’s interesting to note that the optimizer has transformed the last predicate into the equivalent virtual column expression to do the arithmetic. You might also note that the date_flag option is slightly more accurate, but that’s because it’s based on an expression which is null for the rows we don’t want while the date_offset1 column has a value for every row and a little bit of arithmetical rounding comes into play. You might also note that there’s a small cost difference – which I’d ascribe to the CPU cost that the optimizer has added for the CASE expression being applied on top of the simple date arithmetic.

Of course, whatever else you might play around with when working around a cardinality problem like this, I think the strategic aim for a data warehouse system would be to get a REAL flag column on the table and populate it at data loading time if month-end dates played an important part in the proceedings – though I have to say that the virtual flag column is one I do rather fancy.
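A minimal sketch of the "real flag column" strategy (the column name is mine, and in a real warehouse you'd set the flag as part of the data load rather than with a catch-up update):

alter table t1 add (month_end_flag varchar2(1));

update  t1
set     month_end_flag = 'Y'
where   calendar_date  = month_end_date
;

commit;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for columns month_end_flag size 1')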

 

August 5, 2016

Basicfile LOBs 1

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 4:12 pm BST Aug 5,2016

I got a call to look at a performance problem involving LOBs a little while ago. The problem was with an overnight batch that had about 40 sessions inserting small LOBs (12KB to 22KB) concurrently, for a total of anything between 100,000 and 1,000,000 LOBs per night. You can appreciate that this would eventually become a very large LOB segment – so before the batch started all LOBs older than one month were deleted.

The LOB column had the following (camouflaged) declaration:

 LOB (little_lob) STORE AS BASICFILE (
        TABLESPACE lob_ts
        ENABLE STORAGE IN ROW
        RETENTION
        NOCACHE
        LOGGING
)

The database was 11gR2, the tablespace was defined with ASSM with uniform 1MB extents and a blocksize of 8KB (so the LOBs were all 2 or 3 chunks) and the undo retention time was 900 seconds. The effect of the “enable storage in row” is that the LOBINDEX didn’t have to hold any details of current LOB chunks (for in-row, the first 12 chunks are listed in the LOB Locator in the base table).

So, examining an AWR report covering the critical interval, reviewing the captured ASH data, and checking the database, a few questions came to mind:

  • With 200 GB of current LOB data in the segment, why was the segment roughly 800GB ?
  • With no need for current LOBs to be indexed, how had the LOB Index reached 500,000 blocks in size ?
  • There had been 500,000 inserts that night – so why had Oracle done 6 Billion (cached) buffer gets on the (nocache) LOB segment ?
  • Given that the LOB segment had not changed size during the night, why had there been millions of HW enqueue waits on the inserts ?

Knowing the stuff that I did know about basicfile LOBs it seemed likely that the most significant problem was that the segment hadn’t been created with multiple freepools which, according to the very sparse (and not entirely self-consistent) documentation, exist to allow improved concurrency. So I thought I’d search the Internet for any useful information about freepools, how they worked, what impact they might have on this problem, why their absence might produce the symptoms I’d seen, and what the best course of action would be to address the problem.

Of course the “correct” solution according to MoS would be to convert from basicfile to securefile – with a strange insistence on using online redefinition, but no explanation of why a simple CTAS or alter table move is undesirable or dangerous. Unfortunately there are a couple of notes on MoS describing performance issues with “high” levels of concurrent inserts that need to be addressed by setting hidden parameters so I’m not (yet) keen on rebuilding 700GB of a production system to produce a change that might still not work quickly enough; especially since I couldn’t find anything on MoS that could quantify the time needed to do the conversion.

To my surprise I couldn’t find a single useful piece of information about the problem. The only articles I could find seemed to be little bits of cut-n-paste from the Oracle manual pages about using multiple freepools, and the best of those actually demonstrated rebuilding or changing the freepools settings on a LOB of a few megabytes. The most significant MoS note did say that the process “could be slow” and would lock the table. But surely someone, somewhere, must have tried it on a big system and had some idea of “how slow”.

In the end I had to start building some simple models and doing a few experiments to find out what happens and where the time goes and what causes the strange results and – most importantly – how freepools might help. Fortunately, following a call to the Oak Table for any ideas or links to useful documents, I got a pointer to the original Oracle patents which were enormously helpful in showing why freepools could help and why, in the wrong circumstances, you could still end up with a (slightly smaller) disaster on your hands.

 

To be continued …

Footnote

If you’re interested, the patent numbers are: 5,999,943 and 6,061,678.  Now I just need someone to tell me the numbers for the securefile LOBs patents.

 

August 2, 2016

Adaptive mayhem

Filed under: 12c,Oracle — Jonathan Lewis @ 4:29 pm BST Aug 2,2016

So you run a query and it gives you a plan with a note that says “This is an adaptive plan”.

So you run it again and the plan changes, with a note that says "Statistics feedback used for this statement".

So you pause to think for a bit, then run the query again and the plan changes, with a note that says “One SQL Directive used, dynamic statistics used”. (You waited too long and the internal re-optimization hints got flushed down into an SQL directive.)

So you decide to think about it the following morning when you’re feeling bright and fresh, and when you run it you get another plan because overnight the automatic stats job gathered stats on the critical table and created a column group that was indicated by the (now defunct) directive.
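If you want to check whether this chain of events has happened on one of your own systems you can look for the directives and any column groups that have appeared – a sketch of the sort of queries you might run (privileges permitting):

select
        d.directive_id, o.owner, o.object_name, o.subobject_name,
        d.type, d.state, d.reason
from
        dba_sql_plan_directives d,
        dba_sql_plan_dir_objects o
where
        o.directive_id = d.directive_id
and     o.owner        = user
order by
        d.directive_id, o.object_name
;

select  table_name, extension_name, extension
from    dba_stat_extensions
where   owner = user
;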

Happy optimisation!


UPDATE:

This just in from fellow Oak Table member Stefan Koehler: if the automatic stats collection task doesn't finish inside its window you may get unlucky and find that Oracle has created extended stats on a table, failed to complete the stats collection, and dropped the extended stats. Bug 19450314 mentioned above avoids the problems of unnecessary invalidation when the extended stats are added and collected – but it doesn't deal with the problem if the stats are dropped.

Update 2:

Fresh news from OpenWorld 2016 on 12.2, courtesy of DBI Services' Franck Pachot: the parameter controlling adaptive optimisation has been split into two parts, one for adaptive plans, the other for adaptive statistics – so you can officially switch off the bit that's more likely to cause unexpected side effects while leaving the bit that's more likely to be beneficial active.
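For reference – and this is based on the 12.2 documentation rather than anything I've tested myself – the two parameters are optimizer_adaptive_plans and optimizer_adaptive_statistics, so the "safer" configuration would simply be something like:

alter system set optimizer_adaptive_statistics = false;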

 
