Oracle Scratchpad

August 3, 2017

Rebuilding Indexes

Filed under: Indexing,Infrastructure,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:00 pm BST Aug 3,2017

One of the special events that can make it necessary to rebuild an index is the case of the “massive DML”, typically a bulk delete that purges old data from a table. You may even find cases where it’s a good idea to mark a couple of your indexes as unusable before doing a massive delete and then rebuild them after the delete.

Despite the fact that a massive delete is an obvious special case it’s still not necessary in many cases to worry about a rebuild afterwards because the space made free by the delete will be smoothly reused over time with very little variation in performance. There is, however, one particular feature that increases the probability of a rebuild becoming necessary – global (or globally partitioned) indexes on partitioned tables. The problem (and the absence of problem in non-partitioned tables) is in the nature of the rowid.

For non-partitioned tables, and partitioned tables with local indexes, the rowid stored in an index is (assuming we’re thinking only of heap tables) stored as a sequence of 6 bytes consisting, in order, of: (tablespace relative file number, block number within file, row number within block). If the table is non-partitioned, or if this is an index segment from a locally partitioned index, all the index entries will be pointing to the same table segment and Oracle knows which segment that is from the data dictionary information – so Oracle can derive the data_object_id of the table segment and convert the tablespace relative file number into the absolute file number to navigate to the right row in the table.

When the index is global or globally partitioned any index entry may point to any of the table’s segments, so the rowid that is stored in the index is expanded by a further 4 bytes to hold the data_object_id of the table segment it is pointing to – and the data_object_id is the leading component: (data_object_id, tablespace relative file number, block number within file, row number within block). Think about what this means when you start to drop “old” partitions and add new partitions. Compare this with what happens when you simply delete a large volume of old data from a table and starting inserting new data. There’s an important difference to the way in which indexes will evolve.

Purging data

When you delete a large volume of data from a (simple, heap) table you will create a lot of empty space in a lot of existing table blocks. If that delete is supposed to get rid of “old” data (and to keep the description simple we’ll assume it’s the first time you’ve done this) it’s very likely that the delete will result in lots of empty blocks near the start of the table – rows that were inserted at the same time will tend to be physically close to each other in the table. This means that future inserts will soon start to reuse those table blocks. Think about what this means for index entries – especially for non-unique keys.

Assume you have 100 rows with value ‘XXX’ for an indexed column. Breaking the rowid into its component parts the index entries will be (‘XXX’,{file_id, block_id, row_number}).  Now assume you delete the oldest 10 rows then, over time, insert 10 more rows with the same key value. You’ve deleted the 10 index entries with the lowest values for (file_id, block_id) but the space that’s become available in the table will be in and around exactly that range of blocks – so the new index entries will probably end up looking very similar to the deleted index entries and inserted in and around the existing index entries for value ‘XXX’, so over time the index is unlikely to allocate much new space.

Now think about what happens when your table it partitioned but the index is global; your index entries are (‘XXX’,{data_object_id, file_id, block_id, row_number}). When you drop the oldest partition you will probably[1] delete all the index entries with the lowest data_object_id. When you start inserting new rows for ‘XXX’ the new table partition will have a data_object_id that is going to be higher than any previous data_object_id – which means you’re going to be inserting rows into the right-hand (high-value) edge of this section of the index. In some cases – typically those where you have a couple of leaf blocks per key value – the index may end up growing significantly because the insertion point for rows in the new partition isn’t in the leaf block with the available space, and it won’t be until you’ve done a few more bulk deletes and the leaf blocks relating to the oldest table partitions become completely empty that the space can be reused.

An example of this type of behaviour probably appeared on the OTN database forum quite recently.  Of course, there are various reasons why indexes can become inefficient, and the degree of inefficiency may only become noticeable over a fairly long period of time; moreover there are various reasons why global indexes are a little problematic, and various reasons why a bulk delete (which is what executing “alter table drop partition” does to a global index) has unpleasant side effects dependent somewhat on the number (and size) of the partitions and on how many you try to drop in one go.

There’s not  a lot you can do about this quirk of global indexes, but it’s always worth taking extra care with partitioned tables and focusing even more carefully on a strategic review of indexes:

  • Does this index really need to exist at all
  • Could this index be replaced by a selective function-based index
  • Does this index really need to be global / globally partitioned
  • How big is this index compared to the size it ought to be
  • Should this index be (basic) compressed
  • Is this index likely to be disrupted by a historic purge – or is there another reason for its undesirable behaviour

 

[1] probably delete entries with the lowest data_object_id” – I have to say this because if you’ve executed a “move partition” at any time a new data_object_id will have been generated for the partition, so the oldest partition could, in principal, have the highest data_object_id. The issue of changing data_object_ids brings a whole new level of complexity to global indexes – but only in a few special cases, fortunately.

 

 

July 25, 2017

Redo OP Codes:

Filed under: Infrastructure,Oracle,redo — Jonathan Lewis @ 6:17 pm BST Jul 25,2017

This posting was prompted by a tweet from Kamil Stawiarski in response to a question about how he’d discovered the meaning of Redo Op Codes 5.1 and 11.6 – and credited me and Julian Dyke with “the hardest part”.

Over the years I’ve accumulated (from Julian Dyke, or odd MoS notes, etc.) and let dribble out the occasional interpretation of a few op codes – typically in response to a question on the OTN database forum or the Oracle-L listserver, and sometimes as a throwaway comment in a blog post, but I’ve never published the full set of codes that I’ve acquired (or guessed) to date.

It’s been some time since I’ve looked closely at a redo stream, and there are many features of Oracle that I’ve never had to look at at the level of the redo so there are plenty of gaps in the list – and maybe a few people will use the comments to help fill the gaps.

It’s possible that I may be able to add more op codes over the new days – I know that somewhere I have some op codes relating to space management, and a batch relating to LOB handling, but it looks like I forgot to add them to the master list – so here’s what I can offer so far:


1	Transaction Control

2	Transaction read

3	Transaction update

4	Block cleanout
		4.1	Block cleanout record
		4.2	Physical cleanout
		4.3	Single array change
		4.4	Multiple array changes
		4.5	Format block
		4.6	ktbcc redo -  Commit Time Block Cleanout Change (?RAC, ?recursive, ?SYS objects)

5	Transaction undo management
		5.1	Update undo block
		5.2	Get undo header
		5.3	Rollout a transaction begin
		5.4	On a rollback or commit
		5.5	Create rollback segmenr
		5.6	On a rollback of an insert
		5.7	In the ktubl for 'dbms_transaction.local_transaction_id'
			(begin transaction) - also arrives for incoming distributed
			tx, no data change but TT slot acquired. Also for recursive
			transaction (e.g. truncate). txn start scn:  0xffff.ffffffff
		5.8	Mark transaction as dead
		5.9	Rollback extension of rollback seg
		5.10	Rollback segment header change for extension of rollback seg
		5.11	Mark undo as applied during rollback
		5.19	Transaction audit record - first
		5.20	Transaction audit record - subsequent
		5.23	ktudbr redo: disable block level recovery (reports XID)
		5.24	ktfbhundo - File Space Header Undo

6	Control file

10	Index
		10.1	SQL load index block
		10.2	Insert Leaf Row
		10.3	Purge Leaf Row
		10.4	Delete Leaf Row
		10.5	Restore Leaf during rollback
		10.6	(kdxlok) Lock block (pre-split?)
		10.7	(kdxulo) unlock block during undo
		10.8	(kdxlne) initialize leaf block being split
		10.9	(kdxair) apply XAT do to ITL 1	-- related to leaf block split
		10.10	Set leaf block next pointer
		10.11	(kdxlpr) (UNDO) set kdxleprv (previous pointer)
		10.12 	Initialize root block after split
		10.13	index redo (kdxlem): (REDO) make leaf block empty,
		10.14	Restore block before image
		10.15	(kdxbin) Insert branch block row
		10.16	Purge branch row
		10.17	Initialize new branch block
		10.18	Update key data in row -- index redo (kdxlup): update keydata
		10.19	Clear split flag
		10.20	Set split flag
		10.21	Undo branch operation
		10.22	Undo leaf operation
		10.23	restore block to tree
		10.24	Shrink ITL
		10.25	format root block
		10.26	format root block (undo)
		10.27	format root block (redo)
		10.28	Migrating block (undo)
		10.29	Migrating block (redo)
		10.30	Update nonkey value
		10.31	index root block redo (kdxdlr):  create/load index
		10.34 	make branch block empty
		10.35	index redo (kdxlcnu): update nonkey
		10.37	undo index change (kdxIndexlogicalNonkeyUpdate) -- bitmap index
		10.38	index change (kdxIndexlogicalNonkeyUpdate) -- bitmap index
		10.39	index redo (kdxbur) :  branch block update range
		10.40	index redo (kdxbdu) :  branch block DBA update,

11	Table
		11.1  undo row operation
		11.2  insert row  piece
		11.3  delete row piece
		11.4  lock row piece
		11.5  update row piece
		11.6  overwrite row piece
		11.7  manipulate first column
		11.8  change forwarding address - migration
		11.9  change cluster key index
		11.10 Set Cluster key pointers
		11.11 Insert multiple rows
		11.12 Delete multiple rows
		11.13 toggle block header flags
		11.17 Update multiple rows
		11.19 Array update ?
		11.20 SHK (mark as shrunk?)
		11.24 HCC update rowid map ?

12	Cluster

13	Segment management
		13.1	ktsfm redo: -- allocate space ??
		13.5	KTSFRBFMT (block format) redo
		13.6	(block link modify) (? index )  (options: lock clear, lock set)
		13.7	KTSFRGRP (fgb/shdr modify freelist) redo: (options unlink block, move HWM)
		13.13	ktsbbu undo - undo operation on bitmap block
		13.14	ktsbbu undo - undo operation on bitmap block
		13.17	ktsphfredo - Format Pagetable Segment Header
		13.18	ktspffredo - Format Level1 Bitmap Block
		13.19	ktspsfredo - Format Level2 Bitmap Block
		13.21	ktspbfredo - Format Pagetable Datablock
		13.22	State change on level 1 bitmap block
		13.23	Undo on level 1 bitmap block
		13.24	Bitmap block (BMB) state change (level 2 ?)
		13.25	Undo on level 2 bitmap block
		13.26	?? Level 3 bitmap block state change ??
		13.27	?? Level 3 bitmap block undo ??
		13.28	Update LHWM and HHWM on segment header
		13.29	Undo on segment header
		13.31	Segment shrink redo for L1 bitmap block
		13.32	Segment shrink redo for segment header block

14	Extent management
		14.1	ktecush redo: clear extent control lock
		14.2	ktelk redo - lock extent (map)
		14.3	Extent de-allocate
		14.4	kteop redo - redo operation on extent map
		14.5	kteopu undo - undo operation on extent map
		14.8	kteoputrn - undo operation for flush for truncate

15	Tablespace

16	Row cache

17	Recovery management
		17.1	End backup mode marker
		17.3	Crash Recovery at scn:  0x0000.02429111
		17.28	STANDBY METADATA CACHE INVALIDATION

18	Block image (hot backups)
		18.1	Block image
		18.3	Reuse redo entry
				   (Range reuse: tsn=1 base=8388753 nblks=8)
				or (Object reuse: tsn=2 objd=76515)

19	Direct loader
		19.1	Direct load block record
		19.2	Nologging invalidate block range
			Direct Loader invalidate block range redo entry

20	Compatibility segment

21	LOB segment
		21.1	kdlop (Long Feild) redo:  [sic]
				(insert basicfile clob)

22	Locally managed tablespace
		22.2	ktfbhredo - File Space Header Redo:
		22.3	ktfbhundo - File Space Header Undo:
		22.5	ktfbbredo - File BitMap Block Redo:
		22.16	File Property Map Block (FPM)

23	Block writes
		23.1	Block written record
		23.2	Block read record (BRR) -- reference in Doc ID: 12423475.8

24	DDL statements
		24.1	DDL
		24.2	Direct load block end mark
		24.4	?? Media recovery marker
		24.10	??
		24.11	??

(E & O.E) – you’ll notice that some of the descriptions include question marks – those are guesses – and some are little more than the raw text extracted from a redo change vector with no interpretation of what they might mean.

Update

It didn’t take long for someone to email me a much longer list that has been published elsewhere on the Internet. The results don’t have the hierarchical style display I have above, so I may copy the extra entries into the list above when I get a little time.

July 24, 2017

Fast Now, Fast Later

Filed under: Infrastructure,Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 1:39 pm BST Jul 24,2017

The following is the text of an article I published in the UKOUG magazine several years ago (2010), but I came across it recently while writing up some notes for a presentation and thought it would be worth repeating here.

Fast Now, Fast Later

The title of this piece came from a presentation by Cary Millsap and captures an important point about trouble-shooting as a very memorable aphorism. Your solution to a problem may look good for you right now but is it a solution that will still be appropriate when the database has grown in volume and has more users.

I was actually prompted to write this article by a question on the OTN database forum that demonstrated the need for the basic combination of problem solving and forward planning. Someone had a problem with a fairly sudden change in performance of his system from November to December, and he had some samples from trace files and Statspack of a particular query that demonstrated the problem.

The query was very simple:

select  *
from    tph
where   pol_num = :b0
order by
        pm_dt, snum

When the query was operating fast enough the trace file from a sample run showed the following (edited) tkprof output, with an the optimizer taking advantage of the primary key of (pol_num, pm_dt, snum) on table TPH to avoid a sort for the order by clause. (Note that the heading on the plan is “Row Source Operation” – which means it’s the execution plan that really was used)

call    count    cpu  elapsed  disk  query  current  rows
---------------------------------------------------------
Parse       1   0.01     0.13     0    106        0     0
Execute     1   0.03     0.03     0      0        0     0
Fetch       4   0.01     0.22    46     49        0    43
---------------------------------------------------------
total       6   0.06     0.39    46    155        0    43

Rows Row Source Operation
---- --------------------
  43 TABLE ACCESS BY INDEX ROWID TPH (cr=49 pr=46 pw=0 time=226115 us)
  43   INDEX RANGE SCAN TPH_PK (cr=6 pr=3 pw=0 time=20079 us)(object id 152978)

Elapsed times include waiting on following events:
Event waited on               Times  Max. Wait Total Waited
                             Waited 
-----------------------------------------------------------
db file sequential read          46       0.01         0.21

When the query was running less efficiently the change in the trace didn’t immediately suggest any fundamental problems:


call    count    cpu  elapsed  disk  query  current  rows
---------------------------------------------------------
Parse       1   0.00     0.00     0     51        0     0
Execute     1   0.01     0.01     0      0        0     0
Fetch       4   0.00     0.59    47     51        0    45
---------------------------------------------------------
total       6   0.01     0.61    47    102        0    45

Rows Row Source Operation
---- --------------------
45 TABLE ACCESS BY INDEX ROWID TPH (cr=51 pr=47 pw=0 time=593441 us)
45 INDEX RANGE SCAN TPH_PK (cr=6 pr=2 pw=0 time=33470 us)(object id 152978)

Elapsed times include waiting on following events:
Event waited on               Times  Max. Wait Total Waited
                             Waited 
-----------------------------------------------------------
db file sequential read          47       0.03         0.58

The plan is the same, the number of rows returned is roughly the same, and the number of disc reads and buffer gets has hardly changed. Clearly the overall change in performance comes from the slower average disk read times (a total of 0.21 seconds with a maximum of one hundredth of a second, compared to a total 0.58 seconds with a maximum of 3 hundredths), but why has the disk I/O time changed?

The figures give us a couple of preliminary ideas. An average read time of 4.5 milliseconds ( 0.21 seconds / 46 reads) is pretty good for a “small” random read of a reasonably loaded disc subject to a degree of concurrent access [ed: bearing in mind this was 2010], so the waits for “db file sequential read” in the first tkprof output are probably getting some help from a cache somewhere – possibly a SAN cache at the end of a fibre link or maybe from a local file system buffer (we might get a better idea if we could see the complete list of individual read times).

In the second case an average of 12.3 milliseconds ( 0.58 seconds / 45 reads) looks much more like a reasonable amount of genuine disc I/O is taking place – and the maximum of 30 milliseconds suggests that the disc(s) in question are subject to an undesirable level of concurrent access: our session is spending some of its time in a disk queue. Again, it would be nice to see the wait times for all the reads, but at this point it’s not really necessary.

There are couple more clues about what’s going on – one is the text of the query itself (and I’ll be coming back to that later) and the other is in the detail of the disk I/Os. If you check the “Row Source Operation” details you’ll see that in the first case the sample query selected 43 rows from the table and requested 43 (46 – 3) physical reads (pr) of the table to do so. In the second case it was 45 rows and 45 (47 – 2) physical reads. Is this simply a case of the same query needing a little more data and having to do a little more work as time passes?

So now we come to the Statspack data. Based on my observations (or guesses) about the nature of the query and the work going on, I asked if we could see some summary information for a couple of comparative intervals, and also to see if this particular query appeared in the “SQL ordered by reads” section of the Statspack reports. Here are the results, first for a snapshot taken in October:


Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                      3,816,939      58,549     15   79.4
CPU time                                                     7,789          10.6
db file parallel write                         371,865       2,005      5    2.7
log file parallel write                         75,554       1,552     21    2.1
log file sync                                   17,198       1,228     71    1.7

                                                     CPU      Elapsd     Old
 Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
        775,166       43,228           17.9   24.3   212.58  12449.98  1505833981
Module: javaw.exe
SELECT * FROM TPH WHERE POL_NUM = :B1 ORDER BY PM_DT ,SNUM FOR UPDATE NOWAIT}

You might notice that the critical query is actually a ‘select for update’ rather than the simple select that we had originally been told about; this doesn’t affect the execution plan, but is going to have some significance as far as undo and redo are concerned.

Now look at the corresponding figures for an interval in December:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                      7,000,428      92,970     13   89.8
CPU time                                                     6,780           6.5
db file parallel write                         549,286       1,450      3    1.4
db file scattered read                          84,127         720      9     .7
log file parallel write                         41,197         439     11     .4


                                                     CPU      Elapsd     Old
 Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
      2,444,437       43,363           56.4   25.2   221.31  23376.07 1505833981
Module: javaw.exe
SELECT * FROM TPH WHERE POL_NUM = :B1 ORDER BY PM_DT ,SNUM FOR UPDATE NOWAIT

You’ll see in both cases that a huge fraction of the total database time is spent in single block reads (the “db file sequential read” time), but in December the number of reads has gone up by about 3.2 million. You can also see that about 1.7 million of the “extra” reads could be attributed to the critical query even though the number of executions of that query has hardly changed. The average number of reads per execution has gone up from 18 to 56. (I did ask if I could see the section of Statspack titled “SQL ordered by Executions” as this includes the average number of rows per execution and it would have been nice to know whether this average had gone up just a little bit, or whether it had gone up in line with the average reads per execution. Unfortunately the request was ignored, so I am going to proceed as if the change in the average result set was small.)

This, perhaps, tells us exactly what the problem is (and even if it doesn’t, the figures are symptomatic of one of the common examples of non-scalable queries).

Look at the query again – are we reporting all the rows for a “policy number”, ordered by “payment date”. If so, the number of payments recorded is bound to increase with time, and inevitably there will be lots of payments for policies belonging to other people between each pair of payments I make on my policy – and that would put each of my payments in a different table block (if I use a normal heap table).

Initially the payments table may be sufficiently small that a significant fraction of it stays in Oracle’s data cache or even in the file-system or SAN cache; but as time passes and the table grows the probability of me finding most of my blocks cached decreases – moreover, as time passes, I want increasing numbers of blocks which means that as I read my blocks I’m more likely to knock your blocks out of the cache. Given the constantly increasing numbers of competing reads, it is also no surprise that eventually the average single block read time also increases.

In scenarios like this it is inevitable that performance will degrade over time; in fact it is reasonably likely that the performance profile will degrade slowly to start with and then show an increasingly dramatic plunge. The only question really is how much damage limitation you can do.

One strategy, of course, is to increase the memory available for the critical object(s). This may mean assigning the table to a generously sized KEEP cache. (The cache need not be the same size as the table to improve things, but the bigger the better – for this query, at least). But such a strategy is only postponing the inevitable – you really need to find an approach which is less susceptible to the passage of time.

In this case, there are a few options to consider. First – note that we are selecting all the rows for a policy: do we really need to, or could we select the rows within a given date range, thus setting an upper limit on the average volume of data we need to acquire for any one policy. If we do that, we may want to think about strategies for summarizing and deleting older data, or using partitioning to isolate older data in separate segments.

If we can’t deal with the problem by changing the code (and, in this case, the apparent business requirement) can we avoid the need to visit so many data blocks for single policy. There are two obvious options to consider here – we could create the table as an “index cluster” clustered on the policy number; in this way we pay a penalty each time we insert a new row for a policy because we have to find the correct block in the cluster but when we run a query against that policy we will only need to read one or two blocks (probably) to get all the data. Alternatively we could consider setting the table up as an index-organized table (IOT) – again we do more work inserting data into the correct leaf block in the index but again we reap the benefit as we query the data because all the rows we want are in the same two or three leaf blocks (and stored in the order we want them).

Of course we are still subject to the same basic problem of the result set increasing in size as time passes, but at least we have managed to reduce (dramatically) the number of blocks we have to visit and the rate of growth of the number of blocks per query, thereby improving the scalability of the queries significantly.

Introducing new structures to an existing system is difficult, of course, and we may have to work out variations on this theme (like creating an index that includes all the table columns if we can’t switch to an IOT!). The key point is this, though: sometimes we can look at our data and the critical queries and recognize that the volume of data we have to process (even if we don’t return it, as we did in this example) is always going to increase over time, then we need to consider ways of minimizing the volume of data, or improving the packing of data so that the work we do doesn’t change (much) over time. Don’t just think ‘fast now’, think ‘will it still be fast later’.

 

May 23, 2017

255 Again!

Filed under: 12c,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 1:10 pm BST May 23,2017

There are so many things that can go wrong when you start using tables with more than 255 columns – here’s one I discovered partly because I was thinking about a client requirement, partly because I had a vague memory of a change in behaviour in 12c and Stefan Koehler pointed me to a blog note by Sayan Malakshinov when I asked the Oak Table if anyone remembered seeing the relevant note. Enough of the roundabout route, I’m going to start with a bit of code to create a table, stick a row in it, then update that row:

rem
rem     Script: wide_table_4.sql
rem     Author: Jonathan Lewis
rem     Dated:  May 2017
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0,4
rem

set pagesize 0
set feedback off

spool temp.sql

prompt create table t1(

select
        'col' || to_char(rownum,'fm0000') || '  varchar2(10),'
from
        all_objects
where   rownum <= 320
;

prompt col0321 varchar2(10)
prompt )
prompt /

spool off

@temp

set pagesize 40
set feedback on

insert into t1 (col0010, col0280) values ('0010','0280');
commit;

update t1 set col0320 ='0320';
commit;

column file_no  new_value m_file_no
column block_no new_value m_block_no

select
        dbms_rowid.rowid_relative_fno(rowid)    file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        dbms_rowid.rowid_row_number(rowid)      row_no
from
        t1
;

alter system flush buffer_cache;
alter system dump datafile &m_file_no block &m_block_no;

So I’ve written one of those horrible scripts that write a script and then run it. The script creates a table with 320 columns and inserts a row that populates columns 10 and 280. That gets me two row pieces, one consisting of the 255 columns from columns 26 to 280 that goes in as row piece 0, the other consisting of the first 25 columns that goes in as row piece 1; the remaining 40 columns are not populated so Oracle “forgets” about them (“trailing nulls take no space”). The script then updates the row by setting column 320 to a non-null value.

For convenience I’ve then generated the file and block number (and row number, just to show its head piece went in as row 1 rather than row 0) of the row and done a symbolic block dump. The question is: what am I going to see in that block dump ?

Answers (part 1)

Here’s an extract from the block dump from 11.2.0.4 (12.1.0.2 is similar) – though I’ve cut out a lot of lines reporting the NULL columns:


ntab=1
nrow=2
frre=-1
fsbo=0x16
fseo=0x1e54
avsp=0x1e3e
tosp=0x1f13
0xe:pti[0]      nrow=2  offs=0
0x12:pri[0]     offs=0x1e7a
0x14:pri[1]     offs=0x1e54
block_row_dump:
tab 0, row 0, @0x1e7a
tl: 49 fb: -------- lb: 0x2  cc: 40
nrid:  0x014000a7.0
col  0: *NULL*
col  1: *NULL*
col  2: *NULL*
...
col 37: *NULL*
col 38: *NULL*
col 39: *NULL*
tab 0, row 1, @0x1e54
tl: 38 fb: --H-F--- lb: 0x2  cc: 25
nrid:  0x014000a3.0
col  0: *NULL*
col  1: *NULL*
col  2: *NULL*
...
col 22: *NULL*
col 23: *NULL*
col 24: *NULL*
end_of_block_dump

The block holds two row pieces, and the piece stored as “row 1” is the starting row piece (the H in the flag byte (fb) tells us this). This row piece consists of 25 columns. The next rowpiece (identified by nrid:) is row zero in block 0x014000a3 – that’s block 163 of file 5 – which is the same block as the first row piece. When we look at row zero we see that it holds 40 columns, all null; it’s pointing to a third row piece at row zero in block 0x014000a7 (file 5, block 167), and as corroboration we can also see that the flag byte has no bits set and that tells us that this is just a boring “somewhere in the middle” bit. So it looks like we have to follow the pointer to find the last 255 columns of the table. So let’s take a look at the dump of file 5 block 167:


fsbo=0x14
fseo=0x1e76
avsp=0x1e62
tosp=0x1e62
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1e76
block_row_dump:
tab 0, row 0, @0x1e76
tl: 266 fb: -----L-- lb: 0x1  cc: 255
col  0: *NULL*
col  1: *NULL*
col  2: *NULL*
...
col 251: *NULL*
col 252: *NULL*
col 253: *NULL*
col 254: [ 4]  30 33 32 30
end_of_block_dump

Take note of the L in the flag byte – that tells us that we’re looking at the last row piece of a multi-piece row. It’s that last 255 columns we were looking for. The mechanics have worked as follows

  • On the simple insert Oracle split the used 280 columns into (25, 255)
  • On the update we grew the used column count from 280 to 320, adding 40 columns. Oracle extended the 255 column row piece to 295, then split it (40, 255) leaving 40 in the original block and migrating the 255 to a new block. So a row that could be only 2 pieces is now

So a row that could be two pieces in one block is now three pieces spread over two blocks; and there’s worse to come. Go back to the original block dump and check the used space. A good first approximation would be to check the “tl:” (total length) value for each row – this gives you: 49 + 38 bytes; add on a couple of hundred for the general block overhead and stuff like the transaction table and you find you’ve used less than 300 bytes in the block. But I’ve got a little procedure (I published this version of it some time ago) to check for free and used space – and this is what it said about the (ASSM) segment that holds this table:


Unformatted                   :           44 /          360,448
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            0 /                0
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           15 /          122,880
Full                          :            1 /            8,192

Take particular note of the “Full” block at the end of the report – that’s the block where we’ve used up rather less than 300 bytes. In fact if you look again at the first block dump you’ll see the avsp (available space) and tosp (total space) figures of 0x1e3e and 0x1f13 bytes (7,742 and 7,955 bytes). There’s loads of space in the block – but the block is marked in the bitmap space management map as full. That’s really bad news.

On the plus side 12.2 behaves differently, as noted by Sayan in his blog note. We still get the third row piece, but it’s in the same block as the first two and the block doesn’t marked as full in the bitmap.

And there’s still more to come – but it will have to wait a little longer.

 

May 19, 2017

255 columns

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 5:49 pm BST May 19,2017

This is one of my “black hole” articles – I drafted it six months ago, but forgot to publish it.

A recent post on OTN highlighted some of the interesting oddities that appear when you create tables with more than 255 columns. In fact this was a more subtle case than usual because it reminded us that it’s possible to have a partitioned table which appears to have less than the critical 255 columns while actually having more than 255 columns thanks to the anomaly of how Oracle handles dropping columns in a partitioned table.  (For a useful insight see this note from Dominic Brooks – and for a nice thought about preparing simple tables for an exchange with such a partitioned tables look at the 12.2 feature of “create table for exchange” in Maria Colgan’s recent article)

The thread took me down the path of trying to recreate some notes I wrote a long time ago and can no longer find and the OTN problem wasn’t the basic one I had assumed anyway, but I thought I’d publish a bit of the work I had done so that you can see another of the funny effects that appear when your table definition has too many columns (and you use them).

The OP told us about a table with more than 350 columns, so here’s a little script I wrote to generate a table with 365 columns and some data. (It turned out that the OP had more than 390 columns in the table, but 30+ had been “dropped”.)


rem
rem	Script:		wide_table_2.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Nov 2016
rem	Purpose:
rem
rem	Last tested
rem		11.2.0.4
rem

create sequence s1;

declare
	m_statement_1	varchar2(32767) :=
		'create table t1(col0001 varchar2(10),';
	m_statement_2	varchar2(32767) :=
		'insert into t1 values(lpad(s1.nextval,10),';
begin
	for i in 2 .. 365 loop
		m_statement_1 := m_statement_1 ||
			'col' || to_char(i,'FM0000') || ' varchar2(100),'
		;
		m_statement_2 := m_statement_2 ||
			case when i in (2,3,4)
			-- case when i in (122,123,124)
			-- case when i in (262,263,264)
				then 'dbms_random.string(''U'',ceil(dbms_random.value(0,100))),'
			     when i = 365
				then 'lpad(s1.currval,7))'
				else '''COL' || to_char(i,'FM0000') || ''','
			end
		;
	end loop;

	m_statement_1 := substr(m_statement_1, 1, length(m_statement_1) - 1);
	m_statement_1 := m_statement_1 || ') pctfree 25';

	execute immediate m_statement_1;

	for i in 1..10000 loop
		execute immediate m_statement_2;
	end loop;

end;
/

I’ve taken a fairly simple approach to building a string that creates a table – and it’s easy to adjust the number of columns – and a string to insert some values into that table. The insert statement will insert a row number into the first and last columns of the table and generate a random length string for a few of the columns. I’ve picked three possible sets of three columns for the random length string; one set is definitely going to be in the first row piece, one set is definitely going to be in the last row piece, and (since the row will split 110/255) one will be somewhere inside whichever is the larger row piece.

If I wanted to do something more sophisticated I’d probably have to switch to a PL/SQL array for the two statements strings – 32,767 characters doesn’t give me much freedom to play if I wanted to test a table with 1,000 columns.

Having created and populated my table, I performed the following three tests on it:


analyze table t1 compute statistics;

prompt	====
prompt	CTAS
prompt	====

create table t1a pctfree 25 as select * from t1;
analyze table t1a compute statistics;

select	table_name, num_rows, avg_row_len, blocks, chain_cnt
from	user_tables
where	table_name like 'T1%'
;

prompt	======
prompt	Insert
prompt	======

truncate table t1a;
insert into t1a select * from t1;
analyze table t1a compute statistics;

select	table_name, num_rows, avg_row_len, blocks, chain_cnt
from	user_tables
where	table_name like 'T1%'
;

prompt	=============
prompt	Insert append
prompt	=============

truncate table t1a;
insert /*+ append */ into t1a select * from t1;
analyze table t1a compute statistics;

select	table_name, num_rows, avg_row_len, blocks, chain_cnt
from	user_tables
where	table_name like 'T1%'
;

The first test creates a new table (t1a, at pctfree 25, matching the original) copying the original table with a simple “create as select”.

The second test truncates this table and does a basic “insert as select” to repopulate it.

Third test truncates the table again and does an “insert as select” with the /*+ append */ hint to repopulate it.

In all three cases (and with three variations of where the longer random strings went) I used the analyze command to gather stats on the tables so that I could get a count of the number of chained rows; and I dumped a couple of blocks from the tables to see what the inserted rows looked like.

Here’s a summary of the results from 11.2.0.4 when the random-length columns are near the start of the row (the position didn’t really affect the outcome, and the results for 12.1.0.2 and 12.2.0.1 were very similar):

====
CTAS
====

TABLE_NAME             NUM_ROWS AVG_ROW_LEN     BLOCKS  CHAIN_CNT
-------------------- ---------- ----------- ---------- ----------
T1                        10000        3062       6676       3313
T1A                       10000        3062       9504        237

======
Insert
======

TABLE_NAME             NUM_ROWS AVG_ROW_LEN     BLOCKS  CHAIN_CNT
-------------------- ---------- ----------- ---------- ----------
T1                        10000        3062       6676       3313
T1A                       10000        3062       6676       3287

=============
Insert append
=============

TABLE_NAME             NUM_ROWS AVG_ROW_LEN     BLOCKS  CHAIN_CNT
-------------------- ---------- ----------- ---------- ----------
T1                        10000        3062       6676       3313
T1A                       10000        3062       9504        237

As you can see we get two significantly different results: the CTAS and the “insert append” produce tables reporting 9,504 blocks and 237 chained rows, while the original table (single row inserts) and the regular “insert as select” produce tables with 6,676 blocks and 3,133 chained rows. It seems that the CTAS (which would also cover “alter table move”) and direct path insert have minimised the number of chained rows at a cost of a dramatically increased number of blocks. (The scale of the difference happens to be particularly extreme in this case – I didn’t do this deliberately it was simply a consequence of the way I happened to generate the data and the length of the rows.)

We know, of course, that every row in this table will consist of two row pieces, one of 110 columns and one of 255 columns; so every row is in some respects chained due to the potential for intra-block chaining of those two pieces, but the analyze command reports only inter-block chaining i.e. only those rows that start in one block and end in another block – intra-block chaining doesn’t count as “proper” chaining (at least in this version of Oracle).

There are two questions to address in these results: the first is “What’s happening?”, the second, which we ask when we get the answer to the first, is “How come the direct path method still gives us some chained rows?”

I believe the answer to the first question is that the direct path method attempts to avoid chaining unavoidable row-pieces. Even if it means leaving a huge amount of empty space in a block Oracle starts a new row in a new block if there isn’t enough space for both of the anticipated row-pieces to fit in the current block. I think this may be a feature to help Exadata and its use of direct path reads for smart scans, where a relatively small number of chained rows (which might be outside the current Exadata storage unit – and even in the disk space managed by another cell server) could have a catastrophic impact on performance because the system would have to do a single block read to pick up the extra piece – which could have a devastating impact on the performance.

So why do some rows still see chaining under this strategy – I think it’s because there’s a small error in the arithmetic somewhere (possibly visible only in ASSM tablespaces, perhaps related to row-piece headers) where Oracle thinks there’s enough space for both row pieces but there isn’t quite so it tries to insert both pieces and has to chain the second one. (I’ll corroborate this hypothesis with some analysis of block dumps in a moment).

So you have a choice – lots of wasted space and a little row-chaining, or maximum packing of data and (potentially) lots of row-chaining. But there’s more: I’ve said we get one row piece of 110 columns and one of 255 columns for each row, but the point at which the split occurs and the order in which the pieces are inserted depends on the method used.

  • Single row inserts (initial table creation, typical OLTP processing): The split occurred at column 111 – so the leading 110 columns are in one row piece and the trailing 255 columns are in the other – and the row piece with the trailing columns is inserted first.
  • Array inserts (normal): Exactly the same as the single row inserts.
  • Direct path inserts / CTAS: The split occurred at column 256, with the leading 255 column row-piece inserted first and the trailing 110 column row-piece inserted second.

I’m not sure that this particular detail matters very much in normal circumstances when you consider the dramatic difference in size that can appear in the comparison between direct path and normal inserts, but maybe there’s someone who will notice a performance (or even space) side effect because of this inconsistency. We will see in a later post, though, that this difference can have an enormous impact if you subsequently add columns to the table and populate them.

I said I’d come back to the row-chaining anomaly. One of the little details that I didn’t include in my code listing was the call to “analyze table report chained rows” that I did (after executing $ORACLE_HOME/rdbms/admin/utlchain.sql) to list the head rowids of the chained rows into the chained_rows table. After doing this I ran a simple pl/sql loop to dump all the relevant blocks to the trace file:

begin
	for r in (
		select
			dbms_rowid.rowid_relative_fno(head_rowid) file#,
			dbms_rowid.rowid_block_number(head_rowid) block#
			from	chained_rows
		) loop
			execute immediate 'alter system dump datafile ' || r.file# || ' block ' || r.block#;
	end loop;
end;
/

Here’s a little extract from the resulting trace file showing you what the start of a row piece looks like when dumped:

tab 0, row 0, @0x1765
tl: 2075 fb: --H-F--- lb: 0x0  cc: 255
nrid:  0x01401bc4.1
col  0: [10]  20 20 20 20 20 20 20 33 36 32
col  1: [19]  41 52 43 4a 4a 42 4e 55 46 4b 48 4c 45 47 4c 58 4c 4e 56
col  2: [ 8]  59 4b 51 46 4a 50 53 55
col  3: [17]  53 58 59 4e 4a 49 54 4a 41 5a 5a 51 44 44 4b 58 4d
col  4: [ 7]  43 4f 4c 30 30 30 35
col  5: [ 7]  43 4f 4c 30 30 30 36

A convenient thing to check is the cc: (“column count”) entry (end of 2nd line). You can see that this row piece has 255 columns, and if you look at the first six columns dumped you can see that it’s the row numbered 362 (33 36 32), then there are three columns of different length strings, then two columns with the values ‘COL00005’ and ‘COL0007’ respectively. It’s the “cc:” entry that’s useful though, so I’m going to do a bit of simple unix hackery:

grep " cc: " test_ora_24398.trc | sed "s/^.*cc: //"  | sort | uniq -c | sort -n
      1 1
      5 2
    112 108
    125 109
    237 110
    474 255

In my 237 blocks with chained rows I had 474 row pieces of 255 columns and 237 row pieces of 110 columns; then I had 125 row pieces that had lost (and therefore chained) one column and 112 row pieces that had lost and therefore chained 2 columns. I also had a couple of small “tail-end” pieces from earlier blocks scattered in these blocks. These figures suggest that there’s a small error (actually no more than about 20 bytes) in the calculation Oracle does to decide if it can fit a whole row into the current block or whether it has to go on to the next empty block.

Conclusions

When copying a table defined with more than 255 columns there’s the potential for a huge variation in the space usage and chain count depending on whether you do a CTAS (or insert /*+ append */) or a simple insert. You have to decide which option is the biggest threat to your available resources.

There is a little anomaly with the way in which rows are split that is also dependent on the method used for copying – this may also have some effect, though perhaps small enough to be ignored when compared with the space/chaining difference as far as ordinary OLTP processing is concerned. But there are  some important side effects we will consider in a later post.

Even though CTAS/direct path insert can eliminate a lot of row chaining it is still possible to find some row chaining in the resulting data. This may be the result of a calculation error (or possibly a deliberate space saving compromise).

Note that any comments about using CTAS to copy a table also apply to “alter table move” and to using expdp/impdp.

 

April 21, 2017

Undo Understood

Filed under: Infrastructure,Oracle,Read Consistency,Troubleshooting,undo — Jonathan Lewis @ 10:45 am BST Apr 21,2017

It’s hard to understand all the ramifications of Oracle’s undo handling, and it’s not hard to find cases where the resulting effects are very confusing. In a recent post on the OTN database forum resulted in one response insisting that the OP was obviously updating a table with frequent commits from one session while querying it from another thereby generating a large number of undo reads in the querying session.

It’s a possible cause of the symptoms that had been described – although not the only possible cause, especially since the symptoms hadn’t been described completely. It’s actually possible to see this type of activity when there are no updates and no outstanding commits taking place at all on the target table. Unfortunately it’s quite hard to demonstrate this with a quick, simple, script in recent versions of Oracle unless you do some insanely stupid things to make the problem appear – but I know how to do “insanely stupid” in Oracle, so here we go; first some data creation:

rem
rem     Script:         undo_rec_apply_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2017
rem

create table t2(v1 varchar2(100));
insert into t2 values(rpad('x',100));
commit;

create table t1
nologging
pctfree 99 pctused 1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        cast(rownum as number(8,0))                     id,
        cast(lpad(rownum,10,'0') as varchar2(10))       v1,
        cast(lpad('x',100,'x') as varchar2(100))        padding
from
        generator       v1,
        generator       v2
where
        rownum <= 8e4 -- > comment to bypass WordPress formatting issue
;

alter table t1 add constraint t1_pk primary key(id)
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

The t2 table is there as a target for a large of updates from a session other than the one demonstrating the problem. The t1 table has been defined and populated in a way that puts one row into each of 80,000 blocks (though, with ASSM and my specific tablespace definition of uniform 1MB extents, the total space is about 80,400 blocks). I’ve got a primary key declaration that allows me to pick single rows/blocks from the table if I want to.

At this point I’m going to do a lot of updates to the main table using a very inefficient strategy to emulate the type of thing that can happen on a very large table with lots of random updates and many indexes to maintain:


begin
        for i in 1..800 loop
                update t1 set v1 = upper(v1) where id = 100 * i;
                execute immediate 'alter system switch logfile';
                execute immediate 'alter system flush buffer_cache';
                commit;
                dbms_lock.sleep(0.01);
        end loop;
end;
/

set transaction read only;

I’m updating every 100th row/block in the table with single row commits, but before each commit I’m switching log files and flushing the buffer cache.

This is NOT an experiment to try on a production system, or even a development system if there are lots of busy developers or testers around – and if you’re running your dev/test in archivelog mode (which, for some of your systems you should be) you’re going to end up with a lot of archived redo logs. I have to do this switch to ensure that the updated blocks are unpinned so that they will be written to disc and flushed from the cache by the flush buffer cache. (This extreme approach would not have been necessary in earlier versions of Oracle, but the clever developers at Oracle Corp. keep adding “damage limitation” touches to the code that I have to work around to create small tests.) Because the block has been flushed from memory before the commit the session will record a “commit cleanout failures: block lost” on each commit. By the time this loop has run to completion there will be 800 blocks from the table on disc needing a “delayed block cleanout”.

Despite the extreme brute force I use in this loop, there is a further very important detail that has to be set before this test will work (at least in 11.2.0.4, which is what I’ve used in my test runs). I had to start the database with the hidden parameter _db_cache_pre_warm set to false. If I don’t have the database started with this feature disabled Oracle would notice that the buffer cache had a lot of empty space and would “pre-warm” the cache by loading a few thousand blocks from t1 as I updated one row – with the side effect that the update from the previous cycle of the loop would be cleaned out on the current cycle of the loop. If you do run this experiment, remember to reset the parameter and restart the instance when you’ve finished.

I’ve finished this chunk of code with a call to “set transaction read only” – this emulates the start of a long-running query: it captures a point in time (through the current SCN) and any queries that run in the session from now on have to be read-consistent with that point in time. After doing this I need to use a second session to do a bit of hard work – in my case the following:

execute snap_rollstats.start_snap

begin
        for i in 1..10000 loop
                update t2 set v1 = upper(v1);
                update t2 set v1 = lower(v1);
                commit;
        end loop;
end;
/

execute snap_rollstats.end_snap

The calls to the snap_rollstats package simply read v$rollstat and give me a report of the changes in the undo segment statistics over the period of the loop. I’ve executed 10,000 transactions in the interval, which was sufficient on my system to use each undo segment header at least 1,000 times and (since there are 34 transaction table slots in each undo segment header) overwrite each transaction table slot about 30 times. You can infer from these comments that I have only 10 undo segments active at the time, your system may have many more (check the number of rows in v$rollstat) so you may want to scale up that 10,000 loop count accordingly.

At this point, then, the only thing I’ve done since the start of my “long running query” is to update another table from another session. What happens when I do a simple count() from t1 that requires a full tablescan ?

alter system flush buffer_cache;

execute snap_filestat.start_snap
execute snap_my_stats.start_snap

select count(v1) from t1;

execute snap_my_stats.end_snap
execute snap_filestat.end_snap

I’ve flushed the buffer cache to get rid of any buffered undo blocks – again an unreasonable thing to do in production but a valid way of emulating the aging out of undo blocks that would take place in a production system – and surrounded my count() with a couple of packaged call to report the session stats and file I/O stats due to my query. (If you’re sharing your database then the file I/O stats will be affected by the activity of other users, of course, but in my case I had a private database.)

Here are the file stats:


--------------
Datafile Stats
--------------
file#       Reads      Blocks    Avg Size   Avg Csecs     S_Reads   Avg Csecs     M_Reads   Avg Csecs    Max      Writes      Blocks   Avg Csecs    Max
File name
-----       -----      ------    --------   ---------     -------   ---------     -------   ---------    ---      ------      ------   ---------    ---
    1          17          17       1.000        .065          17        .065           0        .000      6           0           0        .000     15
/u01/app/oracle/oradata/TEST/datafile/o1_mf_system_938s4mr3_.dbf
    3         665         665       1.000        .020         665        .020           0        .000      6           0           0        .000     15
/u01/app/oracle/oradata/TEST/datafile/o1_mf_undotbs1_938s5n46_.dbf
    5         631      80,002     126.786        .000           2        .045         629        .000      6           0           0        .000     17
/u01/app/oracle/oradata/TEST/datafile/o1_mf_test_8k__cz1w7tz1_.dbf

As expected I’ve done a number of multiblock reads of my data tablespace for a total of roughly 80,000 blocks read. What you may not have expected is that I’ve done 665 single block reads of the undo tablespace.

What have I been doing with all those undo blocks ? Check the session stats:


Session stats
-------------
Name                                                                     Value
----                                                                     -----
transaction tables consistent reads - undo records applied              10,014
transaction tables consistent read rollbacks                                10

We’ve been reading undo blocks so that we can create read-consistent copies of the 10 undo segment headers that were active in my instance. We haven’t (and you’ll have to trust me on this, I can’t show you the stats that aren’t there!) reported any “data blocks consistent reads – undo records applied”.

If you want to see a detailed explanation of what has happened you’ll need to read Oracle Core (UK source), chapter 3 (and possibly chapter 2 to warm yourself up for the topic). In outline the following type of thing happens:

  • Oracle gets to the first block updated in t1 and sees that there’s an ITL (interested transaction list) entry that hasn’t been marked as committed (we flushed the block from memory before the commit cleanout could take place so the relevant transaction is, apparently, still running and the row is still marked as locked).
  • Let’s say the ITL entry says the transaction was for undo segment 34, transaction table slot 11, sequence 999. Oracle reads the undo segment header block for undo segment 34 and checks transaction table slot 11, which is now at sequence 1032. Oracle can infer from this that the transaction that updated the table has committed – but can’t yet know whether it committed before or after the start of our “long running query”.
  • Somehow Oracle has to get slot 11 back to sequence 999 so that it can check the commit SCN recorded in the slot at that sequence number. This is where we see “undo records applied” to make the “transaction table read consistent”. It can do this because the undo segment header has a “transaction control” section in it that records some details of the most recent transaction started in that segment. When a transaction starts it updates this information, but saves the old version of the transaction control and the previous version of its transaction table slot in its first undo record, consequently Oracle can clone the undo segment header block, identify the most recent transaction, find its first undo record and apply it to unwind the transaction table information. As it does so it has also wound the transaction control section backwards one step, so it can use that (older) version to go back another step … and so on, until it takes the cloned undo segment header so far back that it takes our transaction table slot back to sequence 999 – and the job is done, we can now check the actual commit SCN.  (Or, if we’re unlucky, we might receive an ORA-01555 before we get there)

So – no changes to the t1 table during the query, but lots of undo records read because OTHER tables have been changing.

Footnote:

In my example the tablescan used direct path reads – so the blocks that went through delayed block cleanout were in private memory, which means they weren’t in the buffer cache and didn’t get written out to disc. When I flushed the buffer cache (again to emulate aging out of undo blocks etc.) and repeated the tablescan Oracle had to go through all that work of creating read consistent transaction tables all over again.

Footnote 2:

Despite the grand claim implied by the title I hope everyone realises that this is just one little detail of how undo handling can interfere with their expectations of performance.

 

 

March 9, 2017

Join Elimination

Filed under: CBO,Execution plans,Infrastructure,Oracle — Jonathan Lewis @ 6:39 pm BST Mar 9,2017

A question has just appeared on OTN describing a problem where code that works in 11g doesn’t work in 12c (exact versions not specified). The code in question is a C-based wrapper for some SQL, and the problem is a buffer overflow problem. The query supplied is as follows:


select T1.C1 from T1, T2 where T1.C1 = T2.D1;

The problem is that this works in 11g where the receiving (C) variable is declared as

char myBuffer [31];

but it doesn’t work in 12c unless the receiving variable is declared as:

char myBuffer [51];

There’s an important bit of background information that might be giving us a clue about what’s happened (although what I’m about to describe isn’t actually the problem unless the SQL provided is a simplified version of the problem SQL that is expected to display the problem). Column C1 is defined as char(30) and column D1 is defined as char(50). Here’s some sample code showing why you might need a buffer of 50+1 bytes to hold something that ought to be 30+1 bytes long. (This may be nothing to do with the anomaly described in the original posting – it’s just something I thought of when I first saw the question.)


rem     Script:         join_elimination_oddity.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2017

create table t1(
        c30     char(30) primary key
);

create table t2(
        d50     char(50) references t1
);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

explain plan for
select
        t1.c30
from
        t1, t2
where
        t1.c30 = t2.d50
;

select * from table(dbms_xplan.display(null,null,'projection'));

So we’re selecting c30 – the 30 byte character column – from t1; what do we actually get ? Here’s the plan with the projection:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    52 |     1   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T2   |     1 |    52 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("T2"."D50" IS NOT NULL)

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - "T2"."D50"[CHARACTER,50]

Table t1 has been eliminated and the projected column is the “equivalent” column from t2 – which is too long for the expected output. To work around this problem you can disable join elimination either by parameter (_optimizer_join_elimination_enabled=false) or by hinting /*+ no_eliminate_join(t1) */ in which case the plan (with my data) became a nested loop join from t2 to t1 with column c1 projected as expected.

Footnote:

Two things to note about my demonstration

  • If you’re going to create a referential integrity constraint between columns they do need to be of exactly the same type. (If you want to get technical you might say “in the same domain” – but Oracle doesn’t do domains.)
  • This extremely simple case demonstrates the problem in 11.2.0.4 as well as 12.1.0.2. Possibly a more complex query could be produced where (thanks to limitations in query transformations) 11g doesn’t spot the option for join elimination while 12c does; alternatively, a very simple two-column example in 11g won’t do join elimination while a two-column example in 12.2 can (though it doesn’t always) – so upgrading to 12.2 MIGHT cause more people to see this anomaly appearing.

Update (14/3/17)

A follow-up to the original posting has identified source of the problem. Rather than a side effect of join elimination the anomaly has appeared because of the “partial join evaluation” mechanism introduced in 12c. There is a hint to disable the mechanism /*+ no_partial_join(alias) */ or – if you can’t get hints into place for all the necessary code – it would be possible to disable the feature completely through the use of a hidden parameter “_optimizer_partial_join_eval”=false.  The usual rule of checking with Oracle support before setting hidden parameter applies, of course.

 

March 7, 2017

Guesswork

Filed under: Infrastructure,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 7:57 pm BST Mar 7,2017

A recent posting on the OTN database forum described a problem with an insert (as select) statement that sometimes ran extremely slowly: nothing interesting yet, there could be plenty of boring reasons for that to happen. The same SQL statement (by SQL_ID) might take 6 hours to insert 300K rows one night while taking just a few minutes to insert 900K another night (still nothing terribly interesting).

An analysis of the ASH data about the statement showed that the problem was on the “LOAD TABLE CONVENTIONAL” operation – which starts to get interesting if you also assume that someone who was competent to look at ASH would probably have noticed whether or not the time spent was on (the first obvious guess) some variant of “TX enqueue” waits or “log file” waits or something amazingly different. It’s also a little more interesting if you’ve noticed that the title of the posting is about “consuming a lot of CPU time” – so where could that be going if we see most of the excess time going on the insert rather than on the select.

Here’s an enormous clue that this might be a “non-standard” problem – the SQL statement starts like this:


INSERT  /*+ ignore_row_on_dupkey_index(tgt OTC_DAT_TV_PROC_STATUS_UK) */
INTO OTC_DAT_TV_PROC_STATUS TGT (
    {list of columns}
)
SELECT  /*+ parallel(8) */
    ...

See the hint which says “ignore rows if they raise duplicate key errors against index otc_dat_tv_proc_status_uk” ? Think what it might take to implement code that obeys the hint – Oracle can’t know that a row is duplicating a value until it has inserted the row (to get a rowid) then tried to maintain the index and discovered the pre-existing value, at which point it raises an exception, then handles it by undoing the single row insert (I wonder what that does to array inserts, and the logic of triggers), then carries on with the next row. (I suppose the code could check every relevant index before doing the insert – but that would mean a very big, and possibly very resource-intensive, change to all existing “insert a row” code.)

Surely we’re going to see some dramatic effects if a large fraction of our rows result in duplicate values.

Warning – the guess might not be right but it’s much more interesting than all the other guesses you might make at this point and a good excuse for doing a bit of experimentation and learning. So here’s a simple model to test the “excess work” hypothesis – running against 11.2.0.4:

rem
rem     Script:         ignore_dupkey.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2017
rem

drop table t2;
drop table t1;

create table t1
nologging
as
select  *
from    all_objects
where   rownum <= 50000 ; -- > comment to protect WordPress formatting

create table t2
nologging
as
select  *
from    all_objects
where   rownum <= 50000 ; -- > comment to protect WordPress formatting

update t1 set object_id = object_id + (select max(object_id) from t1)
;

commit;

create unique index t1_i1 on t1(object_id);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

All I’ve done is create two copies of the same data – with an update in place that changes one set of data so that the object_id doesn’t overlap with the object_id in the other set. The update statement is something that I’m going to include in one test but exclude from the second. Now for the statement that tests for the effects of the hint:


execute snap_events.start_snap
execute snap_my_stats.start_snap

insert
        /*+ ignore_row_on_dupkey_index(t1 (object_id)) */
        into t1
select  *
from    t2
;

execute snap_my_stats.end_snap
execute snap_events.end_snap


There are several different things I could do once I’ve got this basic test set up if I want to refine what I’m testing and isolate certain aspects of the mechanism, but all I’ll do for this note is a simple report of some figures from the two tests – one where the object_id values don’t overlap and one where t1 and t2 are exact copies of each other. The two procedures wrapping the statement are just my standard routines for capturing changes in v$mystat and v$session_event for my session.

For the insert where I’ve done the update to avoid any duplicates appearing the insert completed in about 0.3 seconds, generating 10MB of redo and 2MB of undo.

When I removed the update statement the (continuously failing) insert took 35.5 seconds to complete, of which almost all the time was CPU time. The redo jumped to 478MB with 14MB of undo. The extreme level of redo surprised me slightly especially since the scale of the change was so much greater than that of the undo – I think it may be due to a problem with Oracle needing to unwind one row from an (internal) array insert before retrying. Here, taken from the session stats of the problem run, is a little indication of why the time (and especially the CPU time) increased so much:

Name                                                                     Value
----                                                                     -----
rollback changes - undo records applied                                 95,014
active txn count during cleanout                                        24,627
cleanout - number of ktugct calls                                       24,627
HSC Heap Segment Block Changes                                          97,509
Heap Segment Array Inserts                                              97,509
recursive calls                                                        682,574
recursive cpu usage                                                      2,193
session logical reads                                                1,341,077

Obviously there’s a lot of work done rolling back changes that should not have been made (though why it’s reported as 95,000 rather than 100,000 I don’t know and I’d rather not do a trace of buffer activity to find out) with an associated extra load of data block activity. Most visible, though, is the huge number of recursive calls with, perhaps associated, a very large number of session logical reads. Clearly it’s worth enabling extended tracing to see what’s going on – if you haven’t already guessed what some of those calls are about. Here’s an extract from the top of an appropriate tkprof output:


tkprof test_ora_32298.trc ignore_dupkey sort=execnt

select /*+ rule */ c.name, u.name
from
 con$ c, cdef$ cd, user$ u  where c.con# = cd.con# and cd.enabled = :1 and
  c.owner# = u.user#


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    48753      1.86       2.08          0          0          0           0
Execute  48753      3.75       4.17          0          0          0           0
Fetch    48753      2.86       3.12          0      97506          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   146259      8.47       9.38          0      97506          0           0

select o.name, u.name
from
 obj$ o, user$ u  where o.obj# = :1 and o.owner# = u.user#

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    48753      1.87       1.99          0          0          0           0
Execute  48753      3.60       3.63          0          0          0           0
Fetch    48753      7.35       7.52          0     243765          0       48753
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   146259     12.84      13.14          0     243765          0       48753

The top two statements in the trace file show Oracle first trying to find the name of the constraint that has been breached, then falling back to searching for the name of the unique index that has caused the error to appear. If I had created a unique constraint rather than just a unique index then the second of these two statement would not have appeared in the trace file (and the run would have been a little quicker – hint: constraints are a good thing).

You’ll notice that the total number of calls from the two statement is roughly 292,500 – far short of the 682,000 reported in the session stats. Unfortunately there was nothing else in the trace files that could be blamed for the outstanding 400,000 missing calls. It’s not really necessary to chase up all the details, though; clearly we can see that this feature is very expensive if lots of duplicates appear – like DML error logging it has probably been created as a way of dealing with occasional errors when handling large volumes of data.

Footnote

Notice that my example uses the “index description” method for specifying the index in the hint; it’s legal with either the name or the description. Interestingly (but, perhaps, unsurprisingly) this is a hint that Oracle has to parse for correctness before optimisation. I made a mistake in my first attempt at writing the update statement leaving me with duplicates in the data so Oracle couldn’t create the unique index; as a consequence of the missing unique index the insert statement reported the Oracle error “ORA-38913: Index specified in the index hint is invalid”.

 

February 16, 2017

Truncate 12c

Filed under: 12c,Infrastructure,Oracle — Jonathan Lewis @ 12:52 pm BST Feb 16,2017

Here’s one of those little improvements in 12c (including 12.1) that will probably end up being described as “little known features” in about 3 years time. Arguably it’s one of those little things that no-one should care about because it’s not the sort of thing you should do on a production system, but that doesn’t mean it won’t be seen in the wild.

Rather than simply state the feature I’m going to demonstrate it, starting with a little code to build a couple of tables with referential integrity:


create table parent (
        id      number(4),
        name    varchar2(10),
        constraint par_pk primary key (id)
)
;

create table child(
        id_p    number(4)
                        constraint chi_fk_par
                        references parent
                        on delete cascade,
        id      number(4),
        name    varchar2(10),
        constraint chi_pk primary key (id_p, id)
)
;

insert into parent values (1,'Smith');
insert into parent values (2,'Jones');

insert into child values(1,1,'Sally');
insert into child values(1,2,'Simon');

insert into child values(2,1,'Jack');
insert into child values(2,2,'Jill');

commit;


There’s one important detail in this code that isn’t taking the default and isn’t used very frequently – it’s the option on the foreign key to take the action “on delete cascade”. If you delete a row from the parent table then Oracle will automatically delete any referenced rows from the child table first thus avoiding the error ORA-02292: integrity constraint (TEST_USER.CHI_FK_PAR) violated – child record found. (Conveniently I have a suitable index on the child table that will bypass the problem of a mode 4 (or, where child rows already exist, mode 5) TM lock being taken on the child as the parent row is deleted.)

And here’s the demonstration of the new feature – working in 12.1 onwards:


truncate table parent;

truncate table parent cascade;

The first command will raise Oracle error ORA-02266: unique/primary keys in table referenced by enabled foreign keys, but the second command will truncate the parent and child tables “simultaneously”: but only if the referential integrity constraint is set to “on delete cascade”. If the referential integrity constraint is left to its default action then the second command will raise error: ORA-14705: unique or primary keys referenced by enabled foreign keys in table “TEST_USER”.”CHILD”

This feature (and several broadly similar features) also works with matching partitions of equi-partitioned (or ref partitioned) tables – and that’s a context where the requirement  is much more likely to appear than with non-partitioned tables.

 

January 26, 2017

Basicfile LOBs

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 12:03 pm BST Jan 26,2017

I wrote a short series a little while ago about some of the nasty things that can happen (and can’t really be avoided) with Basicfile LOBs and recently realised that it needed a directory entry so that I didn’t have to supply 6 URLs if I wanted to point someone to it; so here’s the catalogue:

At some stage I may also write a similar series about Securefile LOBs – because you do hit problems if you have a system that does a lot of work modifying a LOB segment whether or not it’s Basicfile or Securefile, and you need a strategy for damage limitation.

Footnote

At the time of creating this catalogue I’ve had an SR open with Oracle for about 4 months on the problem that triggered this series, basically asking if there was a way to limit the number of chunks that could be taken off the reusable part of the index. So far I haven’t had an answer to that question; however the client was able to switch the table into a partitioned table and now drops old partitions rather than deleting old data.

The problem is now recorded against version 12.1.0.2 as bug 22973766 so I’ve closed the SR and will keep an occasional eye on bug number:

Bug 22973766 : LARGE DELETE/INSERT CYCLE FOR LOBS AND EXTENDED VARCHARS WASTES A LOT OF SPACE

November 7, 2016

Reorg

Filed under: fragmentation,humour,Oracle — Jonathan Lewis @ 5:31 pm BST Nov 7,2016

A current question on the OTN database forum asks: “What’s the difference between object and tablespace reorganization?” Here’s an analogy to address the question.

I have three crates of Guiness in the boot (trunk) of my car, one crate has 4 bottles left, one has 7 bottles left and one has 2 bottles. I also have two cases of Louis Roederer Brut NV champagne, one case has 2 bottles left and one has only one. (I have two objects in my tablespace – one of type Beer, one of type Champagne – and my boot requires manual free space management .)

I move all the Guiness bottles into a single crate and all the champagne bottles into a single case. That’s a couple of “shrink space compact” calls – I’ve re-organised the objects to get all the bottles in each object close to each other, but the crates are still taking up space in the boot.

I take the two empty crates and the empty case out of the boot. That’s a couple of “resize” (or “shrink space” without “compact”) calls that free up space in the boot.

I now want to load a port barrel into car, but it won’t fit until I slide the remaining beer crate and champagne case together at one side of the boot. That’s a couple of “move” commands that have reorganized the boot (tablespace) to make the free space usable.

 

September 29, 2016

IOT limitation

Filed under: Execution plans,Infrastructure,IOT,Oracle — Jonathan Lewis @ 10:17 am BST Sep 29,2016

In the right circumstances Index Organized Tables (IOTs) give us tremendous benefits – provided you use them in the ideal fashion. Like so many features in Oracle, though, you often have to compromise between the benefit you really need and the cost of the side effect that a feature produces.

The fundamental design targets for an IOT are that you have short rows and only want to access them through index range scans of primary key. The basic price you pay for optimised access is the extra work you have to do as you insert the data. Anything you do outside the two specific targets is likely to lead to increased costs of using the IOT – and there’s one particular threat that I’ve mentioned twice in the past (here and here). I want to mention it one more time with a focus on client code and reporting.


create table iot1 (
        id1     number(7.0),
        id2     number(7.0),
        v1      varchar2(10),
        v2      varchar2(10),
        padding varchar2(500),
        constraint iot1_pk primary key(id1, id2)
)
organization index
including id2
overflow
;

insert into iot1
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        mod(rownum,311)                 id1,
        mod(rownum,337)                 id2,
        to_char(mod(rownum,20))         v1,
        to_char(trunc(rownum/100))      v2,
        rpad('x',500,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 ; commit; begin dbms_stats.gather_table_stats( ownname => user,
                tabname          => 'IOT1'
                method_opt       => 'for all columns size 1'
        );
end;
/

alter system flush buffer_cache;

select table_name, blocks from user_tables where table_name = 'IOT1' or table_name like 'SYS_IOT_OVER%';
select index_name, leaf_blocks from user_indexes where table_name = 'IOT1';

set autotrace traceonly
select max(v2) from iot1;
set autotrace off

I’ve created an index organized table with an overflow. The table definition places all columns after the id2 column into the overflow segment. After collecting stats I’ve then queried the table with a query that, for a heap table, would produce a tablescan as the execution plan. But there is no “table”, there is only an index for an IOT. Here’s the output I get (results from 11g and 12c are very similar):

TABLE_NAME               BLOCKS
-------------------- ----------
SYS_IOT_OVER_151543        8074
IOT1

INDEX_NAME           LEAF_BLOCKS
-------------------- -----------
IOT1_PK                      504

---------------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |     4 | 99793   (1)| 00:00:04 |
|   1 |  SORT AGGREGATE       |         |     1 |     4 |            |          |
|   2 |   INDEX FAST FULL SCAN| IOT1_PK |   100K|   390K| 99793   (1)| 00:00:04 |
---------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
     100376  consistent gets
       8052  physical reads

The index segment has 504 leaf blocks, the overflow segment has 8,074 used blocks below the high water mark. The plan claims an index fast full scan of the index segment – but the physical reads statistic looks more like a “table” scan of the overflow segment. What’s actually happening ?

The 100,000+ consistent reads should tell you what’s happening – we really are doing an index fast full scan on the index segment, and for each index entry we go to the overflow segment to find the v2 value. Oracle doesn’t have a mechanism for doing a “tablescan” of just the overflow segment – even though the definition of the IOT ought (apparently) to be telling Oracle exactly which columns are in the overflow.

In my particular test Oracle reported a significant number of “db file scattered read” waits against the overflow segment, but these were for “prefetch warmup”; in a normal system with a buffer cache full of other data this wouldn’t have happened. The other interesting statistic that showed up was “table fetch continued row” – which was (close to) 100,000, again highlighting that we weren’t doing a normal full tablescan.

In terms of normal query processing this anomaly of attempted “tablescans” being index driven probably isn’t an issue but, as I pointed out in one of earlier posts on the topic, when Oracle gathers stats on the “table” it will do a “full tablescan”. If you have a very large table with an overflow segment it could be a very slow process – especially if you’ve engineered the IOT for the right reason, viz: the data arrives in the wrong order relative to the order you want to query it, and you’ve kept the rows in the IOT_TOP short by dumping the rarely used data in the overflow. With this in mind you might want to make sure that you write a bit of special code that gathers stats only on the columns you know to be in the IOT_TOP, creates representative numbers for the other columns, then locks the stats until the next time you want to refresh them.

 

September 13, 2016

Securefile space

Filed under: fragmentation,Infrastructure,LOBs,Oracle,Troubleshooting — Jonathan Lewis @ 7:29 am BST Sep 13,2016

Here’s a little script I hacked together a couple of years ago from a clone of a script I’d been using for checking space usage in the older types of segments. Oracle Corp. eventually put together a routine to peer inside securefile LOBs:

rem
rem	Script:		dbms_space_use_sf.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Dec 2013
rem	Purpose:	
rem
rem	Last tested 
rem		12.1.0.2
rem		11.2.0.4
rem	Not tested
rem		11.1.0.7
rem	Not relevant
rem		10.2.0.5
rem		 9.2.0.8
rem		 8.1.7.4
rem
rem	Notes:
rem	See also dbms_space_use.sql
rem
rem	11g introduced securefiles lobs and two overloads of 
rem	dbms_space_usage to report space used by their segments
rem
rem	Valid values for suoption are:
rem		SPACEUSAGE_EXACT (16): Computes space usage exhaustively
rem		SPACEUSAGE_FAST  (17): Retrieves values from in-memory statistics
rem
rem	This version allows for partitioned objects, could delete
rem	lines to parameter 4 and partition names to eliminate
rem	the complaints about substitution variables.
rem


define m_seg_owner	= &1
define m_seg_name	= &2
define m_seg_type	= '&3'
define m_part_name	= &4

define m_segment_owner	= &m_seg_owner
define m_segment_name	= &m_seg_name
define m_segment_type	= '&m_seg_type'
define m_partition_name = &m_part_name

@@setenv

execute snap_enqueues.start_snap
execute snap_events.start_snap
execute snap_my_stats.start_snap

spool dbms_space_use_sf

prompt	============
prompt	Secure files
prompt	============

declare
	wrong_ssm	exception;
	pragma exception_init(wrong_ssm, -10614);

	m_segment_size_blocks	number(12,0);
	m_segment_size_bytes	number(12,0);
	m_used_blocks		number(12,0);
	m_used_bytes		number(12,0);
	m_expired_blocks	number(12,0);
	m_expired_bytes		number(12,0);
	m_unexpired_blocks	number(12,0);
	m_unexpired_bytes	number(12,0);

begin
	dbms_space.space_usage(
		upper('&m_segment_owner'),
		upper('&m_segment_name'),
		upper('&m_segment_type'),
		suoption		=> dbms_space.spaceusage_exact,	
--		suoption		=> dbms_space.spaceusage_fast,
		segment_size_blocks	=> m_segment_size_blocks,
		segment_size_bytes	=> m_segment_size_bytes,
		used_blocks		=> m_used_blocks,
		used_bytes		=> m_used_bytes,
		expired_blocks		=> m_expired_blocks,
		expired_bytes		=> m_expired_bytes,
		unexpired_blocks	=> m_unexpired_blocks,
		unexpired_bytes		=> m_unexpired_bytes,
		partition_name		=> upper('&m_partition_name')
	);

	dbms_output.new_line;
	dbms_output.put_line(' Segment Blocks:   ' || to_char(m_segment_size_blocks,'999,999,990') || ' Bytes: ' || to_char(m_segment_size_bytes,'999,999,999,990')); 
	dbms_output.put_line(' Used Blocks:      ' || to_char(m_used_blocks,'999,999,990')         || ' Bytes: ' || to_char(m_used_bytes,'999,999,999,990')); 
	dbms_output.put_line(' Expired Blocks:   ' || to_char(m_expired_blocks,'999,999,990')      || ' Bytes: ' || to_char(m_expired_bytes,'999,999,999,990')); 
	dbms_output.put_line(' Unexpired Blocks: ' || to_char(m_unexpired_blocks,'999,999,990')    || ' Bytes: ' || to_char(m_unexpired_bytes,'999,999,999,990')); 

exception
	when wrong_ssm then
		dbms_output.put_line('Segment not ASSM');
end;
/

prompt	===============
prompt	Generic details
prompt	===============

declare
	m_total_blocks			number;
	m_total_bytes			number;
	m_unused_blocks			number;
	m_unused_bytes			number;
	m_last_used_extent_file_id	number;
	m_last_used_extent_block_id	number;
	m_last_used_block		number;
begin
	dbms_space.unused_space(
		segment_owner		=> upper('&m_segment_owner'),
		segment_name		=> upper('&m_segment_name'),
		segment_type		=> upper('&m_segment_type'),
		total_blocks		=> m_total_blocks,
		total_bytes 		=> m_total_bytes, 
		unused_blocks		=> m_unused_blocks,  
		unused_bytes		=> m_unused_bytes,
		last_used_extent_file_id 	=> m_last_used_extent_file_id, 
		last_used_extent_block_id	=> m_last_used_extent_block_id,
		last_used_block		=> m_last_used_block,
		partition_name		=> upper('&m_partition_name')
	);

	dbms_output.put_line('Segment Total blocks: ' || to_char(m_total_blocks,'999,999,990'));
	dbms_output.put_line('Object Unused blocks: ' || to_char(m_unused_blocks,'999,999,990'));

end;
/

-- execute snap_my_stats.end_snap
-- execute snap_events.end_snap
-- execute snap_enqueues.end_snap

spool off

Sample of output (from a slightly older version of the code):


============
Secure files
============

 Segment Blocks:    168960 Bytes: 1384120320
 Used Blocks:       151165 Bytes: 1238343680
 Expired Blocks     17795 Bytes: 145776640
 Unexpired Blocks   0 Bytes: 0

PL/SQL procedure successfully completed.

===============
Generic details
===============
Segment Total blocks: 168960
Object Unused blocks: 0

PL/SQL procedure successfully completed.


September 11, 2016

Basicfile LOBs 6

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 7:48 pm BST Sep 11,2016

One of the nice things about declaring your (basicfile) LOBs as “enable storage in row” is that the block addresses of the first 12 chunks will be listed in the row and won’t use the LOB index so, if your LOBs are larger than 3960 bytes but otherwise rather small, the LOB index will hold only the timestamp entries for deleted LOBs. This makes it just a little easier to pick out the information you need when things behave strangely so in this installment of my series I’m going to take about an example with storage enabled in row.

I’m going to demonstrate an issue that is causing a major problem to a client of mine. First I’m going to build a table with a LOB column with multiple (4) freepools – because that’s the sort of thing you do to handle concurrency – then I’m going to start 4 sessions (carefully checking that I have one associated with each free pool) and do a few thousand single row inserts with commits from each session. The size of each LOB value I insert will be 20KB so it will be “indexed” in the row but stored out of the row taking 3 LOB blocks.

Once I’ve got the data in place I’m going to use three of the sessions to delete three quarters of the rows from the table then use a call to the dbms_space package to show you that the segment contains virtually no free space. I’ve engineered the code so that it will take just three more rows in the table to fill the available free space and force Oracle either to allocate a new extent or to start reclaiming some of the delete reusable LOB space – and I’m going to run that insert from the session that DIDN’T delete any rows.

I’ve been running these tests on 11.2.0.4, but get similar behaviour on 12c.

rem
rem     Script: lob_cycle_setup.sql
rem     Dated:  Aug 2016
rem     Author: Jonathan Lewis
rem

create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as 
    basicfile 
    text_lob(
            enable storage in row
            chunk 8k
            retention
            nocache
            logging
            freepools 4
            tablespace test_8k_assm
)
;

declare
        m_v1 varchar2(32767) := rpad('X',20000,'X');
begin
        for i in 0..0 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/

truncate table t1
;

You’ll notice I’ve used the retention keyword.  Before I built the LOB I set my undo_retention to 10 seconds so that the space from deleted LOBs should become available for reuse very quickly. The name of the tablespace I’ve used for the LOB is a clue that I’m using an 8KB block size and ASSM (the latter is a requirement of the retention option).

Here’s the code to check which freepool (0 to 3) a session will be associated with (this isn’t documented, but seems to be correct);


select mod(pid,4) from v$process where addr = (
        select paddr from v$session where sid = (
                select sid from v$mystat where rownum = 1
        )
)
;

So I can keep starting sessions and running that query until I’ve got a session covering each freepool. (The first time I tried this I had to start 7 sessions before I got all 4 freepools covered). Now I can run the following from all 4 sessions concurrently:

rem
rem     Script: lob_cycle_run.sql
rem     Author: Jonathan Lewis
rem     Dated:  Aug 2016
rem

define m_loop_counter = 12027

lock table t1 in row share mode;
commit;

declare
        m_v1 varchar2(32767) := rpad('x',20000,'x');
begin
        for i in 1..&m_loop_counter loop
                insert into t1 values (s1.nextval, m_v1);
                commit;
        end loop;
end;
/

The purpose of the lock table command is to ensure that all 4 processes start running simultaneously. From a fifth session I execute a “lock table t1 in exclusive mode” before starting the other four sessions running, so they all queue on the exclusive lock; then I commit from the fifth session and away we go. The whole thing took about 30 seconds to run. The rather random-looking value 12,027 was a careful choice to ensure that the last extent in the segment had just a few blocks left – and I used my “dbms_space_use.sql” script to check this, getting the following output:

====
ASSM
====

Unformatted                   :        7 /       57,344
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :  144,324 / ############

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145536
Object Unused blocks: 0

I’ve got 7 “unformatted” blocks in the segment – though in fact these might be “formatted but free” from the perspective of the LOB code.

After going to sessions 0, 1, and 3 and deleting 12,000 rows from each in turn (and committing, leaving a total of 12,108 rows in the table) the report doesn’t change: I haven’t made any space free I’ve simply flagged it in the LOB index as “reusable”. So now we go to session 2 and run the following code 3 times – with “set timing on”:

rem
rem     Script: lob_cycle_one.sql
rem     Author: Jonathan Lewis
rem     Dated:  Aug 2016
rem

SQL> l
  1  declare
  2     m_v1 varchar2(32767) := rpad('x',20000,'x');
  3  begin
  4     for i in 1..1 loop
  5             insert into t1 values (s1.nextval, m_v1);
  6             commit;
  7     end loop;
  8* end;

The first run took 0.02 seconds – and the unformatted count dropped to 4

The second run took 0.01 seconds – and the unformatted count dropped to 1

The third run took 10.74 seconds, of which 9 seconds was CPU. The session generated 500,000 redo entries totalling 100MB of redo from 1 million db block changes after doing 8.4 million logical I/Os, issuing 108,000 enqueue (lock) requests and running 108,000 index range scans. The report of space usage ended up looking like this:


Unformatted                   :  108,125 /  885,760,000
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :   36,333 /  297,639,936

PL/SQL procedure successfully completed.

=======
Generic
=======
Segment Total blocks: 145664
Object Unused blocks: 0

My session has cleared every single piece of re-usable space from the LOB and made it free (unformatted) before allocating space for its one LOB. (That’s going to hurt when the client has 2 million LOBs on the reusable list and isn’t running everything on SSDs – which is why I’m working on this problem).

If you’re wondering why it takes so much redo and so many buffer visits to free 36,000 LOBs this (roughly) is what Oracle does to free up one reusable LOB of 3 blocks – which corresponds to a single index entry carrying three block ids:

  • Find the lowest index entry in the freepool, pin the index leaf block
  • Identify the last block in the list of 3
  • Lock the relevant L1 space management block for the segment and set relevant “bit” to “unformatted”
  • Delete the index entry
  • Re-insert the index entry with one block id removed
  • Commit and unlock the L1 bitmap block
  • Repeat delete/insert the cycle for 2nd block id
  • Repeat the cycle for 3rd (or 1st since we’re going backwards) block id – but don’t re-insert the index entry

Oracle reclaims one block (chunk) at a time. And that’s a bit of a clue to a possible workaround because event 44951 gets mentioned a couple of times in MoS and on the internet as a workaround to a particular problem of HW enqueue waits for LOBS. MoS note 740075.1 tells us:

When using Automatic Segment Space Management (ASSM), and the fix for Bug 6376915 has been applied in your database (Included in 10.2.0.4 +) it is possible to adjust the number of chunks that are cleaned up when the chunk cleanup operation is required.

This can be enabled by setting event 44951 to a value between 1 and 1024 (default is 1). With the value between 1 and 1024 setting the number of chunks to be cleaned up each time a chunk reclaimation operation occurs. This can therefore reduce the number of requests for the High Watermark Enqueue.

Other notes explain that by default only one chunk is cleaned up at a time – which is exactly the behaviour I’m seeing. So what happens when I bounce the database with this event set at level 5 (an arbitrary choice, but larger than the LOBs I’ve been inserting) in the parameter file and repeat the experiment ? On the first attempt it made no difference, but then I changed the experiment slightly and started again. Initially I had done my first re-insert from the one session that hadn’t deleted any rows – which made it an extreme boundary condition; on the second attempt I deleted two rows from the session that had not yet deleted any data (and waited for the retention time to elapse) before doing the inserts from that session.

Deleting two rows would put 6 blocks (in two index entries) onto my re-usable list, so I was starting the inserts with 7 free blocks, 6 reusable blocks and the event set to level 5. Here’s what I saw as I inserted rows one by one.

  • Insert one row: “Unformatted” blocks went up to 9:  I had freed 5 of the reusable blocks then used 3 of them for my lob (7 + 5 – 3 = 9)
  • Insert one row: “Unformatted” blocks went down to 7: I had freed the last reusable block then used 3 blocks for my lob (9 + 1 – 3 = 7)
  • Insert one row: “Unformatted” blocks went down to 4
  • Insert one row: “Unformatted” blocks went down to 1
  • Insert one row: Oracle cleared all the reusable space (11 seconds, 500MB redo), then added an extent (!) to the segment and used 2 of its blocks for part of the new LOB.

So the event isn’t really connected with my problem – though it adds some efficiency to the processing – and my  “boundary condition” is one that’s likely to occur fairly frequently if you’ve got a basicfile LOB defined with multiple freepools. Fortunately it’s probably going to require two pre-conditions before it’s a big problem: first that you’re handling a large number of LOBs and second that your pattern of inserting and deleting is not symmetric – it’s when you use a large number of concurrent sessions for small batches of inserts but a single session for large bulk deletes that all hell can break loose shortly after a delete.

tl;dr

As with many other features of Oracle, skew plays a part in making things break. If you’re doing lots of inserts and deletes of basicfile lobs make sure the mechanisms you use for inserting and deleting look similar: in particular similar numbers of processes to do similar amounts of work for both operations.

P.S. It gets worse.

P.P.S. Don’t even start to think that you can work around this by using securefiles.

P.P.P.S. I got an hint from one test that if a reusable LOB is exactly the same size as the LOB being inserted then Oracle very cleverly takes the entry index entry and rewrites it to be the LOB index entry for the new LOB rather than freeing (and then potentially using) the space it identifies.

Basicfile LOBS 5

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 6:49 pm BST Sep 11,2016

At the end of the last installment we had seen a test case that caused Oracle to add a couple of redundant new extents to a LOB segment after one process deleted 3,000 LOBs and another four concurrent processes inserted 750 LOBs each a few minutes later (after the undo retention period had elapsed). To add confusion the LOBINDEX seemed to show that all the “reusable” chunks had been removed from the index which suggests that they should have been re-used. Our LOB segment started at 8,192 blocks, is currently at 8,576 blocks and is only using 8,000 of them.

How will things look if I now connect a new session (which might be associated with a different freepool), delete the oldest 3,000 LOBs, wait a little while, then get my original four sessions to do their concurrent inserts again ? And what will things look like after I’ve repeated this cycle several times ?

I had to drop the tables from my original test since writing the previous article, so the following results start from recreating the whole test from scratch and won’t align perfectly with the previous sets of results. Here’s what the index treedump looked like after going through the serial delete / concurrent insert cycle 12 times:

----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 71, level: 1)
   leaf: 0x1800223 25166371 (-1: nrow: 0 rrow: 0)
   leaf: 0x1800227 25166375 (0: nrow: 0 rrow: 0)
   leaf: 0x1800236 25166390 (1: nrow: 0 rrow: 0)
   leaf: 0x180023d 25166397 (2: nrow: 63 rrow: 63)
   leaf: 0x1800206 25166342 (3: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (4: nrow: 81 rrow: 81)
   leaf: 0x1800229 25166377 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 81)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (11: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (12: nrow: 126 rrow: 126)

   leaf: 0x1800266 25166438 (13: nrow: 0 rrow: 0)
   leaf: 0x180025e 25166430 (14: nrow: 39 rrow: 39)
   leaf: 0x1800262 25166434 (15: nrow: 81 rrow: 81)
   leaf: 0x1800243 25166403 (16: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (17: nrow: 76 rrow: 76)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x180024a 25166410 (24: nrow: 118 rrow: 118)

   leaf: 0x1800263 25166435 (25: nrow: 0 rrow: 0)
   leaf: 0x180024c 25166412 (26: nrow: 0 rrow: 0)
   leaf: 0x1800254 25166420 (27: nrow: 0 rrow: 0)
   leaf: 0x1800264 25166436 (28: nrow: 1 rrow: 0)
   leaf: 0x1800274 25166452 (29: nrow: 2 rrow: 0)
   leaf: 0x180027c 25166460 (30: nrow: 2 rrow: 0)
   leaf: 0x180025d 25166429 (31: nrow: 2 rrow: 0)
   leaf: 0x1800241 25166401 (32: nrow: 2 rrow: 0)
   leaf: 0x1800245 25166405 (33: nrow: 2 rrow: 0)
   leaf: 0x1800265 25166437 (34: nrow: 1 rrow: 0)
   leaf: 0x1800251 25166417 (35: nrow: 3 rrow: 0)
   leaf: 0x1800249 25166409 (36: nrow: 4 rrow: 0)
   leaf: 0x1800242 25166402 (37: nrow: 1 rrow: 0)
   leaf: 0x1800255 25166421 (38: nrow: 2 rrow: 0)
   leaf: 0x1800259 25166425 (39: nrow: 3 rrow: 0)
   leaf: 0x1800246 25166406 (40: nrow: 1 rrow: 0)

   leaf: 0x1800214 25166356 (41: nrow: 38 rrow: 0)
   leaf: 0x1800218 25166360 (42: nrow: 81 rrow: 0)
   leaf: 0x180021c 25166364 (43: nrow: 81 rrow: 0)
   leaf: 0x1800220 25166368 (44: nrow: 0 rrow: 0)
   leaf: 0x180022d 25166381 (45: nrow: 26 rrow: 26)
   leaf: 0x1800231 25166385 (46: nrow: 81 rrow: 81)
   leaf: 0x1800219 25166361 (47: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (48: nrow: 81 rrow: 81)
   leaf: 0x1800239 25166393 (49: nrow: 81 rrow: 81)
   leaf: 0x180022c 25166380 (50: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (51: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (52: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (53: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (54: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (55: nrow: 81 rrow: 81)
   leaf: 0x180023b 25166395 (56: nrow: 79 rrow: 79)
   leaf: 0x180023f 25166399 (57: nrow: 81 rrow: 81)
   leaf: 0x1800208 25166344 (58: nrow: 81 rrow: 81)
   leaf: 0x180020c 25166348 (59: nrow: 81 rrow: 81)
   leaf: 0x1800210 25166352 (60: nrow: 120 rrow: 120)

   leaf: 0x180021d 25166365 (61: nrow: 0 rrow: 0)

   leaf: 0x1800248 25166408 (62: nrow: 21 rrow: 21)
   leaf: 0x1800268 25166440 (63: nrow: 81 rrow: 81)
   leaf: 0x180026c 25166444 (64: nrow: 152 rrow: 152)
   leaf: 0x180026b 25166443 (65: nrow: 152 rrow: 152)
   leaf: 0x180026f 25166447 (66: nrow: 152 rrow: 152)
   leaf: 0x1800273 25166451 (67: nrow: 152 rrow: 152)
   leaf: 0x1800277 25166455 (68: nrow: 152 rrow: 152)
   leaf: 0x180027b 25166459 (69: nrow: 66 rrow: 66)
----- end tree dump

As usual I’ve split the treedump into the sections that reflect the freepools, each of which could consist of two parts the LOBs (key values starting with even numbers) and the “reusable chunks” (key values starting with odd numbers). The dump suggests that things have worked well: as you can see it’s grown a few blocks after my 12 cycles but there are only 6 sections (not the full 8 that might be there), and only a few leaf blocks showing “empty” (rrows = 0). As “reusable” sections have appeared the index has grown a little, then the reusable entries have been taken off the index and the index has shrunk a bit; you can even see that freepool 3 (the highest numbered one) is still showing a pattern of 152 LOBs indexed per block – this is despite the fact that at one point a reusable section for freepool 3 (00 07) appeared above this section and then disappeared as those reusable chunks were reclaimed.

All in all the index seems to be behaving extremely well, with only a little growth and (probably temporarily) a couple of little glitches of empty leaf blocks.

Here’s the dump of the (slightly edited) “col 0” values to confirm where the freepool breaks were –

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 2c cc
 1:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 31 61
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 33 61
 3:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 36
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 37 1e
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 38 37
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 39 1e
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3a 37
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3b 50
 9:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 3c
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3d 4b
11:     col 0; len  9; ( 9):  00 00 00 01 00 00 09 df 3e
12:     col 0; len 10; (10):  00 00 00 01 00 00 09 df 3e e7

13:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 25 9b
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 32 a0
15:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 34 1d
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 36 c6
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 39 3d
18:     col 0; len  9; ( 9):  00 02 00 01 00 00 09 df 3d
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 3f 52
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 40 cf
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 20
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 71
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 41 c2
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 df 42 13

25:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
26:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
27:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
28:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
29:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
30:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
31:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
32:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
33:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
34:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
35:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
36:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
37:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
38:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
39:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00
40:     col 0; len 10; (10):  00 03 57 bc ba 2f 00 00 00 00

41:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 26 52
42:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2a 27
43:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2a dc
44:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 2b 2d
45:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 31 34
46:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 33 15
47:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 34 92
48:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 34 e3
49:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 34
50:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 85
51:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 35 d6
52:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 36 27
53:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 38 6c
54:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 38 ef
55:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3a d0
56:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3c 4d
57:     col 0; len  9; ( 9):  00 04 00 01 00 00 09 df 3d
58:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3e 4b
59:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 3f 96
60:     col 0; len 10; (10):  00 04 00 01 00 00 09 df 40 7d

61:     col 0; len 10; (10):  00 05 57 bc b9 db 00 00 00 00

62:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 32 5b
63:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 33 a6
64:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 36 4f
65:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 38 13
66:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3a 09
67:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3b cd
68:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3d 91
69:     col 0; len 10; (10):  00 06 00 01 00 00 09 df 3f 23

As you can see at leaf block 61 we didn’t quite empty the reusable list from freepool 2 (00 05 = 2 * 2 + 1), and leaf blocks 25 to 40 tell us that freepool 1 (00 03 = 2 * 1 + 1) was the freepool used on the last big delete. Despite the odd little glitches it looks as if this strategy of “deleted LOBs go to my process’ freepool” seems to do a good job of reusing index space.

But there IS a problem. Here’s the output from a script I wrote using the dbms_space package to show how space in the LOB segment has been used:


Unformatted                   :    4,508 /   36,929,536
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :    8,000 /   65,536,000

Segment Total blocks: 12672
Object Unused blocks: 0

The LOB segment has grown from an initial 8,192 blocks with a few unformatted blocks and 8,000 used blocks (2 blocks per LOB, 4,000 LOBs) to 12,672 blocks with 4,508 blocks unformatted. (The difference between Full + Unformatted and Segment Total blocks is the set of bitmap space management blocks for the segment). After only 12 cycles we have “leaked” an overhead of 50% of our real data space – maybe this helps to explain why the client that started me down this set of blogs has seen Oracle allocate 700GB to hold just 200GB of LOBs.

The tablespace is declared as locally managed with 1MB uniform extents and automatic segment space management. By writing a simple script I can get Oracle to write a script to dump the first block of each extent – and they will all be Level 1 bitmap space management blocks. Running grep against the trace file I can pick out the lines that tell me how many data blocks are mapped by the bitmap and how many of them have been formatted or not. This is the result (I have 99 extents in the segment – 99 * 128 = 12,672):

   unformatted: 0       total: 64        first useful block: 4
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 24      total: 64        first useful block: 2
   unformatted: 52      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 26      total: 64        first useful block: 2
   unformatted: 42      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 42      total: 64        first useful block: 2
   unformatted: 0       total: 64        first useful block: 2
   unformatted: 26      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 62      total: 64        first useful block: 2
   unformatted: 127     total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 108     total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 105     total: 128       first useful block: 1
   unformatted: 96      total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 38      total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 1       total: 128       first useful block: 1
   unformatted: 2       total: 128       first useful block: 1
   unformatted: 65      total: 128       first useful block: 1
   unformatted: 98      total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 0       total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1
   unformatted: 125     total: 128       first useful block: 1

You’ll notice that for the first 63 extents Oracle says there are 64 blocks mapped by the bitmap with the first useful block at block 2, thereafter it says there are 128 blocks and the first useful block is block 1 (Oracle is counting from zero). While Oracle thinks the segment is “quite small” it allocates two level 1 bitmap blocks per 1MB extent and I’ve only dumped the first block from each extent; but when the segment reaches 64MB Oracle decides that it’s getting pretty big and there’s no point in wasting space so changes to using a single level 1 bitmap block per 1MB extent. It’s just one of those tiny details you discover when you happen to look a little closely and how things work. (On a much larger test with 8MB uniform extents Oracle got to the point where it was using one L1 bitmap block for the whole 8MB.)

There’s a fascinating pattern in the later extents of 3 full extents followed by 3 empty extents – my first guess had been that Oracle was allocating new extents but not using them, but clearly that’s not right, it’s removing “reusable chunks” from the index and then not re-using them but using the new extents instead (some of the time). Something is seriously wrong with the way Oracle is handling the “reusable chunks” part of the index. With a little luck it’s some nasty side effect of the “one process delete / multiple process insert” strategy we have adopted, so: (a) we need to repeat the entire experiment with a concurrent delete mechanism and (b) we need to think about how we might re-engineer a REALLY BIG system that has followed this unfortunate strategy for a long time. Of course if (a) turns out to be a disaster as well we don’t really need to think too hard about (b) until we have discovered a good way of dealing with our rolling pattern of inserts and deletes.

Some (minimum effort, we hope) ideas we will have to look at for (b):

  • Oracle has an option to change the freepools count on a LOB segment – do we need to use it, how much work would it entail, would it require downtime
  • Oracle has an option to “rebuild” the freepools on a LOB segment
  • We can always try the “shrink space compact” option on the LOB
  • Should we just rebuild (move) the LOB segment – and use a larger extent size while we’re at it
  • Should we recreate the table and change the LOB to Securefiles as we do so – and do all the testing all over again
  • If we’re deleting old data on such a regular pattern should we try to bypass the deletes by partitioning the table in some clever way

TO BE CONTINUED.

 

Next Page »

Powered by WordPress.com.