Oracle Scratchpad

February 28, 2018

255 Columns

Filed under: Infrastructure,Oracle,Troubleshooting,undo — Jonathan Lewis @ 12:27 pm BST Feb 28,2018

It’s the gift that keeps on giving – no matter how many problems you find there are always more waiting to be found. It’s been some time since I last wrote about tables with more than 255 columns, and I said then that there was more to come. In the last article I described how adding a few columns to a table, or updating a trailing column in a way that made the table’s used column count exceed 255, could result in some strange row-splitting behaviour – in this article I’m going to look at a critical side effect of that behaviour.

We’ll start with a simple model and a question – I’ll create a table with a thousand rows of data, then I’ll add a few columns to that table and update the last column in every row and check the undo and redo generated by the update.  Eventually I’m going to need a table with more than 255 columns so the script to create the table is rather long and I’ve posted it at the end of the article in case you want to experiment – in the following text I’ve omitted a few hundred lines of column declarations and values.


rem
rem     Script:         wide_table_5.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1(
        col000,
        col001,
        col002,
        col003,
...
        col199,
/*
        col200,
        col201,
        col202,
        col203,
        col204,
        col205,
        col206,
        col207,
        col208,
        col209,
*/
        col210,
        col220,
...
        col247,
        col248,
        col249
)
as
with generator as (
        select 
                rownum id
        from dual
        connect by
                level <= 1e3 -- > comment to avoid WordPress format issue
)
select
        lpad(000,10,'0'),
        lpad(001,10,'0'),
        lpad(002,10,'0'),
        lpad(003,10,'0'),
...
        lpad(247,10,'0'),
        lpad(248,10,'0'),
        lpad(249,10,'0')
from
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add(
        col250 varchar2(10),
        col251 varchar2(10),
        col252 varchar2(10),
        col253 varchar2(10),
        col254 varchar2(10),
        col255 varchar2(10),
        col256 varchar2(10),
        col257 varchar2(10),
        col258 varchar2(10),
        col259 varchar2(10)
)
;

update t1 set col259 = lpad('259',10,'0');
commit;


The table I’ve created has columns named from col000 to col249, populated with text matching the column’s numeric id – a total of 250 columns, except that for the first part of the test I’ve commented out the creation and population of 10 of those columns, giving me a total of 240 columns. Then I’ve added 10 more columns and updated the last of those 10. So, for the first part of the test, I’ve grown my table from 240 used columns to 250 used columns. Here are a few critical stats for that update from the session statistics (v$sesstat joined to v$statname):

 

Name                                                 Value
----                                                 -----
db block gets                                        1,023
consistent gets                                        555
db block changes                                     2,012
redo entries                                           907
redo size                                          262,692
undo change vector size                             76,052
table scan rows gotten                               1,002
table scan blocks gotten                               501
HSC Heap Segment Block Changes                       1,000

That’s pretty much the sort of thing we might expect. For a small update to a row it’s reasonable to see 250 – 300 bytes of redo, of which about half is the redo for the undo. We can see that we’ve scanned 1,000 rows and made 2,000 block changes (one to the table block and one to an undo block for each row in the table). The table was 500 blocks of data (the avg_row_len is about 2,640 bytes from 240 columns at 10 bytes plus a length byte, which makes two rows per block with lots of spare space at the default 10 pctfree). You might have been expecting the number of redo entries to be a close match to the number of rows, but it’s a little short because the redo records for the first few updates would have been stored in private redo and turned into a single large redo entry.
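If you want to capture the same figures for your own test, here’s a minimal sketch of the sort of query I use: run it from the session doing the update, once before and once after, and subtract the two sets of values.

select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in (
                'db block gets', 'consistent gets', 'db block changes',
                'redo entries', 'redo size', 'undo change vector size',
                'table scan rows gotten', 'table scan blocks gotten',
                'HSC Heap Segment Block Changes'
        )
order by
        sn.name
;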

So what do the stats look like when we start with 250 columns and grow to 260 columns, breaking through the dreaded 255 barrier?


Name                                                 Value
----                                                 -----
db block gets                                        9,503
consistent gets                                      1,894
db block changes                                     9,384
redo size                                        8,110,584
redo entries                                         5,778
undo change vector size                          3,780,260
table scan rows gotten                               1,002
table scan blocks gotten                               501
HSC Heap Segment Block Changes                       3,000

Such a simple change – with such a large impact.

The average undo is now nearly 3.8KB per row (and the rows were only about 2,860 bytes each anyway), the number of redo entries is more than six times the original, we’re averaging 3 “HSC Heap Segment Block Changes” per row instead of 1, and in total we’ve managed to do an extra 7,000 db block changes.

To get an idea of what was going on behind the scenes I dumped the redo log file. For the first part of the test most of the redo entries consisted of a pair of redo change vectors with OP codes 5.1 (modify undo block) and 11.5 (update row piece). The 5.1 corresponded to the undo needed to reverse out the effect of its paired 11.5 and both change vectors were pretty small.
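(If you want to reproduce the dump: the dump_log call near the end of the footnote script is a little helper procedure, but on a single-instance sandbox you can do it by hand with something like the following. The dump ends up in your session trace file, and the substitution variable is merely one way of picking up the current log file name.)

alter system switch logfile;            -- just before the update

update t1 set col259 = lpad('259',10,'0');
commit;

column member new_value m_logfile

select  lf.member
from    v$log           l,
        v$logfile       lf
where   lf.group# = l.group#
and     l.status  = 'CURRENT'
and     rownum    = 1
;

alter system dump logfile '&m_logfile';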

For the second part of the test I found a frequently repeated sequence of three consecutive redo records of paired redo change vectors: (5.1, 11.4), (5.1, 11.2) and (5.1, 11.6). Again each 5.1 corresponds to the undo needed to reverse out the effect of its associated 11.x, and in this case the three “table” (level 11) OP codes are, in order: “lock row piece”, “insert row piece”, “overwrite row piece”. These three pairs occurred 1,000 times each, accounting for roughly 3,000 of the redo entries reported.

On top of this there were 2,500 redo records holding redo change vectors with op code 13.22 (“State change on level 1 bitmap block”), and a few hundred with op code 13.21 (“Format page table block”), with just a few others bringing the total up to the 5,800 redo entries reported. (You might note that the last OP code is a clue that we added quite a lot of new blocks to the table as we executed the update – in fact the number of used table blocks grew by about 50%.)

We also had 500 redo change vectors of type 5.2 (“get undo header”). This number was significantly higher than in the first part of the test because we had a lot more undo block changes (with bigger undo change vectors), which means we used a lot more undo blocks, and each time we move to a new undo block we update our transaction slot in the undo header with the current undo block address. I mentioned the pairing of 5.1 and 11.6 above – in fact 50% of those records recorded just two changes (5.1, 11.6) while the other 50% recorded three changes (5.2, 5.1, 11.6); in effect every other row update resulted in Oracle demanding a new undo block.

I’ll explain in a moment why we have a very large number of “State change on level 1 bitmap block”; first let’s examine the key steps of how Oracle is processing a single row update – the sequence of 11.4, 11.2, 11.6:

  • 11.4: lock the head row piece – this seems to happen when the row is in (or going to turn into) multiple pieces; presumably because the piece that is going to change might not be the head piece. This is a tiny operation that generates undo and redo in the order of tens of bytes.
  • 11.2: our code extends the row to 260 columns, which means Oracle has to split it into two pieces of 5 columns and 255 columns respectively – so one of those row-pieces has to cause an insert to take place. Inserting a row may require a lot of redo, of course, but the undo for a (table-only) insert is, again, tiny.
  • 11.6: When Oracle has to split a wide row (>255 columns) it counts columns from the end, so the first row piece of our row is 5 columns and the second row piece (which is the one inserted by the 11.2 operation) is 255 columns. This means we have to overwrite the original row piece with a much shorter row piece. So we’re replacing 2,750 bytes (250 columns) with 55 bytes (5 columns), which means we have to write the contents of the “deleted” 250 columns to the undo tablespace – and that’s where most of the huge volume of undo comes from. (There’s a block dump sketch below if you want to see the two resulting row pieces for yourself.)
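Here’s a minimal sketch of that block dump check, to be run after the update and commit. The first query just picks up a file and block number for one of the rows; the dump appears in your session trace file, where you should find a short head row piece whose nrid points to the piece holding the last 255 columns (possibly in a different block).

column rel_file_no new_value m_file_no
column block_no    new_value m_block_no

select
        dbms_rowid.rowid_relative_fno(rowid)    rel_file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no
from
        t1
where
        rownum = 1
;

alter system dump datafile &m_file_no block &m_block_no;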

There are two extra points to note about the way Oracle handles the insert/overwrite steps. The length of our row exaggerates the effect, of course, but when we insert the trailing 255 columns the block they go to will probably change from nearly empty to half full, or from half full to full – which means its bitmap entry has to be updated; similarly, when the original 250 columns are overwritten with just 5 columns a huge amount of free space appears in the block, which, again, means that the block’s bitmap entry has to be updated. This gives us a good idea of why we see so many 13.22 (“L1 bitmap state change”) redo change vectors.

The second point is that the numbers still don’t add up. Allowing a couple of hundred bytes of undo per row for the lock row and insert row, then 2,750 plus a few more for the overwrite, I’ve accounted for about 3,000 bytes per row updated – which leaves me short by about 800 bytes per row.  If I dump undo blocks I can see that the undo change vector for the overwrite is actually 3,628 bytes long rather than being “slightly over” the 2,750 for the before image of the overwritten row. Strangely I can see a couple of hundred bytes of what looks like damage in the row image, and there’s also a couple of hundred bytes of what looks like garbage (but probably isn’t) after the row image, but I’ve got no idea why there’s so much excess data in the record.

One day I (or someone else) may get around to looking very closely at why that particular undo record is a lot bigger than an observer might expect – but at least we can see that the space is there. Even if some of that volume could be made to disappear, the problem of the massive undo relating to the overwrite (and Oracle’s odd strategy of counting columns from the end) will still be present – and there are probably some occasions when you need to know about it.

tl;dr

Once the number of columns in a table goes over 255, a simple update to a “trailing” null column (i.e. one past the current end of the row) is likely to generate a much larger volume of undo and redo than you might expect. In particular the size of the undo record is likely to be equivalent to the volume of the last 255 columns of the table – and then a large fraction more.

The reason why this particular anomaly came to my attention is that a client had a problem upgrading an application; the upgrade required them to add a few columns to a table and then update a couple of them. The undo tablespace was 300GB, the update (the only thing running at the time) should have added about 30 bytes to the length of each row, and it was going to affect 250 million rows. The process crashed after a few hours with “ORA-30036: unable to extend segment by %s in undo tablespace ‘%s'”. Even allowing for a “reasonable” overhead it seemed rather surprising that Oracle needed more than 1,200 bytes of undo space per row (300GB / 250M rows) – but then I found the table was defined with 350 columns.

Solutions for this user: it’s an upgrade that’s allowed to take quite a long time, and the nature of the update is such that it would be possible to update in batches, committing after each batch (there’s a sketch of the idea below). It would also be nice to review how the table is used to see if it could be rebuilt with a different column order, moving all the unused columns to the end of the row – with a little luck the resulting table might end up with almost all its rows fitting into a single row piece, even after the upgrade.
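As an indication of what I mean by batching, here’s a minimal sketch based on my test table. The driving predicate assumes the new column starts out null (which is what makes the loop safely restartable after a failure), and the batch size of 10,000 is just a placeholder:

declare
        m_done          pls_integer := 0;
begin
        loop
                update  t1
                set     col259 = lpad('259',10,'0')
                where   col259 is null
                and     rownum <= 10000;

                exit when sql%rowcount = 0;

                m_done := m_done + sql%rowcount;
                commit;
        end loop;

        commit;
        dbms_output.put_line('Rows updated: ' || m_done);
end;
/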

Footnote

If you want to experiment, here’s the whole script to create the table, insert some rows, then add a few more columns and update one of them. It’s very long, and not in the least bit exciting, but it may save you a little typing time if you want to use it.


create table t1(
        col000,
        col001,
        col002,
        col003,
        col004,
        col005,
        col006,
        col007,
        col008,
        col009,
        col010,
        col011,
        col012,
        col013,
        col014,
        col015,
        col016,
        col017,
        col018,
        col019,
        col020,
        col021,
        col022,
        col023,
        col024,
        col025,
        col026,
        col027,
        col028,
        col029,
        col030,
        col031,
        col032,
        col033,
        col034,
        col035,
        col036,
        col037,
        col038,
        col039,
        col040,
        col041,
        col042,
        col043,
        col044,
        col045,
        col046,
        col047,
        col048,
        col049,
        col050,
        col051,
        col052,
        col053,
        col054,
        col055,
        col056,
        col057,
        col058,
        col059,
        col060,
        col061,
        col062,
        col063,
        col064,
        col065,
        col066,
        col067,
        col068,
        col069,
        col070,
        col071,
        col072,
        col073,
        col074,
        col075,
        col076,
        col077,
        col078,
        col079,
        col080,
        col081,
        col082,
        col083,
        col084,
        col085,
        col086,
        col087,
        col088,
        col089,
        col090,
        col091,
        col092,
        col093,
        col094,
        col095,
        col096,
        col097,
        col098,
        col099,
        col100,
        col101,
        col102,
        col103,
        col104,
        col105,
        col106,
        col107,
        col108,
        col109,
        col110,
        col111,
        col112,
        col113,
        col114,
        col115,
        col116,
        col117,
        col118,
        col119,
        col120,
        col121,
        col122,
        col123,
        col124,
        col125,
        col126,
        col127,
        col128,
        col129,
        col130,
        col131,
        col132,
        col133,
        col134,
        col135,
        col136,
        col137,
        col138,
        col139,
        col140,
        col141,
        col142,
        col143,
        col144,
        col145,
        col146,
        col147,
        col148,
        col149,
        col150,
        col151,
        col152,
        col153,
        col154,
        col155,
        col156,
        col157,
        col158,
        col159,
        col160,
        col161,
        col162,
        col163,
        col164,
        col165,
        col166,
        col167,
        col168,
        col169,
        col170,
        col171,
        col172,
        col173,
        col174,
        col175,
        col176,
        col177,
        col178,
        col179,
        col180,
        col181,
        col182,
        col183,
        col184,
        col185,
        col186,
        col187,
        col188,
        col189,
        col190,
        col191,
        col192,
        col193,
        col194,
        col195,
        col196,
        col197,
        col198,
        col199,
        col200,
        col201,
        col202,
        col203,
        col204,
        col205,
        col206,
        col207,
        col208,
        col209,
        col210,
        col211,
        col212,
        col213,
        col214,
        col215,
        col216,
        col217,
        col218,
        col219,
        col220,
        col221,
        col222,
        col223,
        col224,
        col225,
        col226,
        col227,
        col228,
        col229,
        col230,
        col231,
        col232,
        col233,
        col234,
        col235,
        col236,
        col237,
        col238,
        col239,
        col240,
        col241,
        col242,
        col243,
        col244,
        col245,
        col246,
        col247,
        col248,
        col249
)
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e3 -- > comment to avoid WordPress format issue
)
select
        lpad(000,10,'0'),
        lpad(001,10,'0'),
        lpad(002,10,'0'),
        lpad(003,10,'0'),
        lpad(004,10,'0'),
        lpad(005,10,'0'),
        lpad(006,10,'0'),
        lpad(007,10,'0'),
        lpad(008,10,'0'),
        lpad(009,10,'0'),
        lpad(010,10,'0'),
        lpad(011,10,'0'),
        lpad(012,10,'0'),
        lpad(013,10,'0'),
        lpad(014,10,'0'),
        lpad(015,10,'0'),
        lpad(016,10,'0'),
        lpad(017,10,'0'),
        lpad(018,10,'0'),
        lpad(019,10,'0'),
        lpad(020,10,'0'),
        lpad(021,10,'0'),
        lpad(022,10,'0'),
        lpad(023,10,'0'),
        lpad(024,10,'0'),
        lpad(025,10,'0'),
        lpad(026,10,'0'),
        lpad(027,10,'0'),
        lpad(028,10,'0'),
        lpad(029,10,'0'),
        lpad(030,10,'0'),
        lpad(031,10,'0'),
        lpad(032,10,'0'),
        lpad(033,10,'0'),
        lpad(034,10,'0'),
        lpad(035,10,'0'),
        lpad(036,10,'0'),
        lpad(037,10,'0'),
        lpad(038,10,'0'),
        lpad(039,10,'0'),
        lpad(040,10,'0'),
        lpad(041,10,'0'),
        lpad(042,10,'0'),
        lpad(043,10,'0'),
        lpad(044,10,'0'),
        lpad(045,10,'0'),
        lpad(046,10,'0'),
        lpad(047,10,'0'),
        lpad(048,10,'0'),
        lpad(049,10,'0'),
        lpad(050,10,'0'),
        lpad(051,10,'0'),
        lpad(052,10,'0'),
        lpad(053,10,'0'),
        lpad(054,10,'0'),
        lpad(055,10,'0'),
        lpad(056,10,'0'),
        lpad(057,10,'0'),
        lpad(058,10,'0'),
        lpad(059,10,'0'),
        lpad(060,10,'0'),
        lpad(061,10,'0'),
        lpad(062,10,'0'),
        lpad(063,10,'0'),
        lpad(064,10,'0'),
        lpad(065,10,'0'),
        lpad(066,10,'0'),
        lpad(067,10,'0'),
        lpad(068,10,'0'),
        lpad(069,10,'0'),
        lpad(070,10,'0'),
        lpad(071,10,'0'),
        lpad(072,10,'0'),
        lpad(073,10,'0'),
        lpad(074,10,'0'),
        lpad(075,10,'0'),
        lpad(076,10,'0'),
        lpad(077,10,'0'),
        lpad(078,10,'0'),
        lpad(079,10,'0'),
        lpad(080,10,'0'),
        lpad(081,10,'0'),
        lpad(082,10,'0'),
        lpad(083,10,'0'),
        lpad(084,10,'0'),
        lpad(085,10,'0'),
        lpad(086,10,'0'),
        lpad(087,10,'0'),
        lpad(088,10,'0'),
        lpad(089,10,'0'),
        lpad(090,10,'0'),
        lpad(091,10,'0'),
        lpad(092,10,'0'),
        lpad(093,10,'0'),
        lpad(094,10,'0'),
        lpad(095,10,'0'),
        lpad(096,10,'0'),
        lpad(097,10,'0'),
        lpad(098,10,'0'),
        lpad(099,10,'0'),
        lpad(100,10,'0'),
        lpad(101,10,'0'),
        lpad(102,10,'0'),
        lpad(103,10,'0'),
        lpad(104,10,'0'),
        lpad(105,10,'0'),
        lpad(106,10,'0'),
        lpad(107,10,'0'),
        lpad(108,10,'0'),
        lpad(109,10,'0'),
        lpad(110,10,'0'),
        lpad(111,10,'0'),
        lpad(112,10,'0'),
        lpad(113,10,'0'),
        lpad(114,10,'0'),
        lpad(115,10,'0'),
        lpad(116,10,'0'),
        lpad(117,10,'0'),
        lpad(118,10,'0'),
        lpad(119,10,'0'),
        lpad(120,10,'0'),
        lpad(121,10,'0'),
        lpad(122,10,'0'),
        lpad(123,10,'0'),
        lpad(124,10,'0'),
        lpad(125,10,'0'),
        lpad(126,10,'0'),
        lpad(127,10,'0'),
        lpad(128,10,'0'),
        lpad(129,10,'0'),
        lpad(130,10,'0'),
        lpad(131,10,'0'),
        lpad(132,10,'0'),
        lpad(133,10,'0'),
        lpad(134,10,'0'),
        lpad(135,10,'0'),
        lpad(136,10,'0'),
        lpad(137,10,'0'),
        lpad(138,10,'0'),
        lpad(139,10,'0'),
        lpad(140,10,'0'),
        lpad(141,10,'0'),
        lpad(142,10,'0'),
        lpad(143,10,'0'),
        lpad(144,10,'0'),
        lpad(145,10,'0'),
        lpad(146,10,'0'),
        lpad(147,10,'0'),
        lpad(148,10,'0'),
        lpad(149,10,'0'),
        lpad(150,10,'0'),
        lpad(151,10,'0'),
        lpad(152,10,'0'),
        lpad(153,10,'0'),
        lpad(154,10,'0'),
        lpad(155,10,'0'),
        lpad(156,10,'0'),
        lpad(157,10,'0'),
        lpad(158,10,'0'),
        lpad(159,10,'0'),
        lpad(160,10,'0'),
        lpad(161,10,'0'),
        lpad(162,10,'0'),
        lpad(163,10,'0'),
        lpad(164,10,'0'),
        lpad(165,10,'0'),
        lpad(166,10,'0'),
        lpad(167,10,'0'),
        lpad(168,10,'0'),
        lpad(169,10,'0'),
        lpad(170,10,'0'),
        lpad(171,10,'0'),
        lpad(172,10,'0'),
        lpad(173,10,'0'),
        lpad(174,10,'0'),
        lpad(175,10,'0'),
        lpad(176,10,'0'),
        lpad(177,10,'0'),
        lpad(178,10,'0'),
        lpad(179,10,'0'),
        lpad(180,10,'0'),
        lpad(181,10,'0'),
        lpad(182,10,'0'),
        lpad(183,10,'0'),
        lpad(184,10,'0'),
        lpad(185,10,'0'),
        lpad(186,10,'0'),
        lpad(187,10,'0'),
        lpad(188,10,'0'),
        lpad(189,10,'0'),
        lpad(190,10,'0'),
        lpad(191,10,'0'),
        lpad(192,10,'0'),
        lpad(193,10,'0'),
        lpad(194,10,'0'),
        lpad(195,10,'0'),
        lpad(196,10,'0'),
        lpad(197,10,'0'),
        lpad(198,10,'0'),
        lpad(199,10,'0'),
        lpad(200,10,'0'),
        lpad(201,10,'0'),
        lpad(202,10,'0'),
        lpad(203,10,'0'),
        lpad(204,10,'0'),
        lpad(205,10,'0'),
        lpad(206,10,'0'),
        lpad(207,10,'0'),
        lpad(208,10,'0'),
        lpad(209,10,'0'),
        lpad(210,10,'0'),
        lpad(211,10,'0'),
        lpad(212,10,'0'),
        lpad(213,10,'0'),
        lpad(214,10,'0'),
        lpad(215,10,'0'),
        lpad(216,10,'0'),
        lpad(217,10,'0'),
        lpad(218,10,'0'),
        lpad(219,10,'0'),
        lpad(220,10,'0'),
        lpad(221,10,'0'),
        lpad(222,10,'0'),
        lpad(223,10,'0'),
        lpad(224,10,'0'),
        lpad(225,10,'0'),
        lpad(226,10,'0'),
        lpad(227,10,'0'),
        lpad(228,10,'0'),
        lpad(229,10,'0'),
        lpad(230,10,'0'),
        lpad(231,10,'0'),
        lpad(232,10,'0'),
        lpad(233,10,'0'),
        lpad(234,10,'0'),
        lpad(235,10,'0'),
        lpad(236,10,'0'),
        lpad(237,10,'0'),
        lpad(238,10,'0'),
        lpad(239,10,'0'),
        lpad(240,10,'0'),
        lpad(241,10,'0'),
        lpad(242,10,'0'),
        lpad(243,10,'0'),
        lpad(244,10,'0'),
        lpad(245,10,'0'),
        lpad(246,10,'0'),
        lpad(247,10,'0'),
        lpad(248,10,'0'),
        lpad(249,10,'0')
from
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

select
        avg_row_len, num_rows,  blocks,
        num_rows / trunc(8000/avg_row_len) estimated_blocks
from
        user_tables
where
        table_name = 'T1'
;

prompt  =================
prompt  Add a few columns
prompt  =================

alter table t1 add(
        col250 varchar2(10),
        col251 varchar2(10),
        col252 varchar2(10),
        col253 varchar2(10),
        col254 varchar2(10),
        col255 varchar2(10),
        col256 varchar2(10),
        col257 varchar2(10),
        col258 varchar2(10),
        col259 varchar2(10)
)
;

-- alter system switch logfile;

update t1 set col259 = lpad('259',10,'0');
commit;

-- execute dump_log

P.S. If you do investigate and solve the question of the excess space in the undo record, and the odd content in the row “before image”, then do let me know. (Hint: part of the excess may be a “null columns” map – but that still leaves plenty to account for.)

January 19, 2018

Nested MVs

Filed under: 12c,Bugs,Infrastructure,Materialized view,Oracle — Jonathan Lewis @ 2:43 pm BST Jan 19,2018

A recent client was seeing a very large redo penalty from refreshing materialized views. Unfortunately they had to be refreshed very frequently, and were being handled with a complete refresh in atomic mode – which means delete every row from every MV then re-insert every row.  The total redo was running at about 5GB per hour, which wasn’t a problem for throughput, but the space for handling backup and recovery was getting a bit extreme.

The requirement consisted of two MVs which extracted and aggregated row and column subsets in two different ways from a single table; then two MVs that aggregated one of the first MVs in two different ways; then two MVs which each joined one of the first-level MVs to one of the second-level MVs.

No problem – join MVs are legal, aggregate MVs are legal, “nested” MVs are legal: all you have to do is create the right MV logs and pick the right refresh command.  Since the client was running Standard Edition (SE2) there was no need to worry about how to ensure that query rewrite would work (the feature is not implemented in SE).

So here, simplified and camouflaged, is a minimum subset of just the first few stages of the construction: a base table with MV log, one first-level aggregate MV with its own MV log, and two aggregate MVs based on the first MV.

drop materialized view log on req_line;
drop materialized view log on jpl_req_group_numlines;

drop materialized view jpl_req_group_numlines;
drop materialized view jpl_req_numsel;
drop materialized view jpl_req_basis;

drop table req_line;

-- ----------
-- Base Table
-- ----------

create table req_line(
        eventid         number(10,0),
        selected        number(10,0),
        req             number(10,0),
        basis           number(10,0),
        lnid            number(10,0),
        area            varchar2(10),
        excess          number(10,0),
        available       number(10,0),
        kk_id           number(10,0),
        eventdate       number(10,0),
        rs_id           number(10,0)
)
;

-- --------------------
-- MV log on base table
-- --------------------

create materialized view log 
on
req_line
with rowid(
        req, basis, lnid, eventid, selected, area,
        excess, available, kk_id, eventdate, rs_id
)
including new values
;

-- --------------------
-- Level 1 aggregate MV
-- --------------------

create materialized view jpl_req_group_numlines(
        eventid, selected, 
        row_ct, req_ct, basis_ct, req, basis, 
        maxlnid, excess, numsel, area, available, kk_id, 
        rs_id, eventdate
)
segment creation immediate
build immediate
refresh fast on demand 
as 
select 
        eventid,
        selected,
        count(*)        row_ct,
        count(req)      req_ct,
        count(basis)    basis_ct,
        sum(req)        req,
        sum(basis)      basis,
        max(lnid)       maxlnid,
        excess,
        count(selected) numsel,
        area,
        available,
        kk_id,
        rs_id,
        eventdate
from 
        req_line
group by 
        eventid, selected, area, excess,
        available, kk_id, eventdate, rs_id
;

-- ------------------------
-- MV log on first level MV
-- ------------------------

create materialized view log 
on
jpl_req_group_numlines
with rowid 
(
        eventid, area, selected, available,
        basis, req, maxlnid, numsel
)
including new values
;


-- ----------------------------
-- First "level 2" aggregate MV
-- ----------------------------

create materialized view jpl_req_numsel(
        eventid, selected, 
        row_ct, totalreq_ct, totalbasis_ct, totalreq, totalbasis, 
        maxlnid, numsel_ct, numsel, area
)
segment creation immediate
build immediate
refresh fast on demand
as 
select 
        eventid,
        selected,
        count(*)        row_ct,
        count(req)      req_ct,
        count(basis)    basis_ct,
        sum(req)        req,
        sum(basis)      basis,
        max(maxlnid)    maxlnid,
        count(numsel)   numsel_ct,
        sum(numsel)     numsel,
        area
from 
        jpl_req_group_numlines
group by 
        eventid, selected, area
;


-- -----------------------------
-- Second "level 2" aggregate MV
-- -----------------------------

create materialized view jpl_req_basis(
        eventid, 
        row_ct, totalbasis_ct, totalreq_ct, totalbasis, totalreq, 
        area, selected, available, maxlnid ,
        numsel_ct, numsel
)
segment creation immediate
build immediate
refresh fast on demand
as 
select 
        eventid,
        count(*)        row_ct,
        count(basis)    totalbasis_ct,
        count(req)      totalreq_ct,
        sum(basis)      totalbasis,
        sum(req)        totalreq,
        area,
        selected,
        available,
        max(maxlnid)    maxlnid,
        count(numsel)   numsel_ct,
        sum(numsel)     numsel
from
        jpl_req_group_numlines
group by 
        eventid, area, available, selected
;

Once the table, MV logs and MVs exist we can insert some data into the base table, then try refreshing the views. I tried three different calls from the dbms_mview package – refresh_all_mviews(), refresh_dependent(), and refresh() – specifying the ‘F’ (fast) refresh method, atomic refresh and (where the parameter exists) a nested refresh. All three fail in the same way on 12.2.0.1. The code below shows only the refresh_dependent() call; I’ve sketched the other two calls after the script.

I’ve included a query to report the current state of the materialized views before and after the calls, and set a two second sleep before the refresh so that changes in “last refresh” time will appear. The final queries are just to check that the expected volume of data has been transferred to the materialized views.


-- ------------------------------------
-- Insert some data into the base table
-- ------------------------------------

begin
        for i in 1..100 loop
                execute immediate 'insert into req_line values( :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx, :xxx)' 
                using i,i,i,i,i,i,i,i,i,i,i;
                commit;
        end loop;
end;
/

set linesize 144
column mview_name format a40

select
        mview_name, staleness, compile_state, last_refresh_type, 
        to_char(last_refresh_date,'dd-mon hh24:mi:ss')          ref_time
from
        user_mviews
ORDER by
        last_refresh_date, mview_name
;

prompt  Waiting for 2 seconds to allow refresh time to change

execute dbms_lock.sleep(2)

declare
        m_fail_ct       number(6,0);
begin
        dbms_mview.refresh_dependent(
                number_of_failures      => m_fail_ct,
                list                    => 'req_line',
                method                  => 'F',
                nested                  => true,
                atomic_refresh          => true
        );

        dbms_output.put_line('Failures: ' || m_fail_ct);
end;
/

select
        mview_name, staleness, compile_state, last_refresh_type, 
        to_char(last_refresh_date,'dd-mon hh24:mi:ss')          ref_time
from
        user_mviews
order by
        last_refresh_date, mview_name
;

-- --------------------------------
-- Should be 100 rows in each table
-- --------------------------------

select count(*) from jpl_req_basis;
select count(*) from jpl_req_group_numlines;
select count(*) from jpl_req_numsel;
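
For completeness, the other two calls I tried were of roughly this form (a sketch only: note that refresh_all_mviews() has no nested parameter, and the list for refresh() simply names the three MVs explicitly):

declare
        m_fail_ct       number(6,0);
begin
        dbms_mview.refresh_all_mviews(
                number_of_failures      => m_fail_ct,
                method                  => 'F',
                atomic_refresh          => true
        );
        dbms_output.put_line('Failures: ' || m_fail_ct);
end;
/

begin
        dbms_mview.refresh(
                list            => 'jpl_req_group_numlines,jpl_req_numsel,jpl_req_basis',
                method          => 'F',
                nested          => true,
                atomic_refresh  => true
        );
end;
/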

Both of the earlier versions of Oracle (11gR2 and 12cR1) are happy with this code and refresh all three materialized views without failing. Oracle 12.2.0.1 crashes the procedure call with a deadlock error which, when traced, shows itself to be a self-deadlock while attempting to select a data dictionary row for update:


MVIEW_NAME                               STALENESS           COMPILE_STATE       LAST_REF REF_TIME
---------------------------------------- ------------------- ------------------- -------- ------------------------
JPL_REQ_BASIS                            FRESH               VALID               COMPLETE 19-jan 14:03:01
JPL_REQ_GROUP_NUMLINES                   NEEDS_COMPILE       NEEDS_COMPILE       COMPLETE 19-jan 14:03:01
JPL_REQ_NUMSEL                           FRESH               VALID               COMPLETE 19-jan 14:03:01

3 rows selected.

Waiting for 2 seconds to allow refresh time to change

PL/SQL procedure successfully completed.

declare
*
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 2952
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 85
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 245
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 1243
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 2414
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 2908
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 3699
ORA-06512: at "SYS.DBMS_SNAPSHOT_KKXRCA", line 3723
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 75
ORA-06512: at line 4


MVIEW_NAME                               STALENESS           COMPILE_STATE       LAST_REF REF_TIME
---------------------------------------- ------------------- ------------------- -------- ------------------------
JPL_REQ_NUMSEL                           NEEDS_COMPILE       NEEDS_COMPILE       COMPLETE 19-jan 14:03:01
JPL_REQ_BASIS                            FRESH               VALID               FAST     19-jan 14:03:04
JPL_REQ_GROUP_NUMLINES                   FRESH               VALID               FAST     19-jan 14:03:04

The deadlock graph from the trace file, with a little extra surrounding information, looks like this:


Deadlock graph:
                                          ------------Blocker(s)-----------  ------------Waiter(s)------------
Resource Name                             process session holds waits serial  process session holds waits serial
TX-00020009-00000C78-A9B090F8-00000000         26      14     X        40306      26      14           X  40306


*** 2018-01-19T14:18:03.925859+00:00 (ORCL(3))
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
----- Current SQL Statement for this session (sql_id=2vnzfjzg6px33) -----
select log, oldest, oldest_pk, oldest_oid, oldest_new, youngest+1/86400,  flag, yscn, oldest_seq, oscn, oscn_pk, oscn_oid, oscn_new, oscn_seq  from sys.mlog$ where mowner = :1 and master = :2 for update
----- PL/SQL Stack -----

So far I haven’t been able to spot whether or not I’m doing something wrong, or prohibited, and I haven’t been able to find a matching problem on MoS. Since the code works on 11gR2 and 12cR1 I’m inclined to believe it’s a bug introduced in the 12cR2 timeline – which is a nuisance for my client, but if it is a bug then perhaps a fix will appear fairly promptly.

January 11, 2018

ASSM tangle

Filed under: ASSM,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 5:35 pm BST Jan 11,2018

Here’s a follow-on from Tuesday’s (serious) note about a bug in 12.1.0.2 that introduces random slowdown on large-scale inserts. The threat in this note, while real and potentially a nuisance, is much less likely to become visible because it depends on you doing something that you probably shouldn’t be doing.

There have always been problems with ASSM and large-scale deletes – in particular, when should Oracle mark a block as having free space on deletion? If your session does it immediately then other sessions will start trying to use free space that isn’t really there until you commit; if your session doesn’t do it immediately, when should it happen? You won’t want it done at commit time, but deferring it means the segment could “lose” a lot of free space if something doesn’t come along in a timely fashion and tidy up.

But here’s a quirky problem that takes things one step further. What happens if you try to delete a load of data and fail and your session rolls back? If we start with yesterday’s script (running on 11.2.0.4 or 12.2.0.1) we can create a table with 1M rows in it and the following space usage:


Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       41,666 /      341,327,872

You will recall that each “Full” block actually had the basic 10% free space, plus a couple of hundred extra bytes which Oracle had to “forget about” because the incoming rows were always 290 bytes long. Let’s take this table and delete the first 100,000 rows, then emulate a session error and roll back, and then check the space usage:


delete from t1 where rownum <= 100000;
rollback;

-- generate space usage report

Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :        4,167 /       34,136,064
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       37,499 /      307,191,808

We have 4,167 blocks which were full, and we know they are effectively full for the purposes of our data, but they’re now declared as having some free space. When Oracle rolled back the delete it wasn’t running code that would attempt to discover that the block was going to go over the limit; it simply calculated the byte change from re-inserting the row, added it to the total free space (tosp), and produced a number that hadn’t reached the limit set by pctfree – so it flagged the block accordingly. (Remember my comment in the earlier article that Oracle doesn’t generate undo for the state changes on the Level 1 bitmap blocks – this is, at least in part, a consequence of that strategy.)

Guideline

If you’re going to do large-scale deletes in an ASSM environment, make sure they don’t fail or subsequent inserts may take a long time.

 

January 10, 2018

ASSM Argh 2

Filed under: ASSM,Infrastructure,Oracle — Jonathan Lewis @ 1:24 pm BST Jan 10,2018

After yesterday’s post one of the obvious follow-up questions was whether the problem I demonstrated was a side effect of my use of PL/SQL arrays and loops to load data. What would happen with a pure “insert select” statement? It’s easy enough to check:


rem
rem     Script:         assm_argh2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

drop table t2 purge;
drop table t1 purge;

create table t2
segment creation immediate
tablespace test_8k_assm
as
select * from all_objects where rownum <= 50000 -- >comment to avoid WordPress anomaly
;

create table t1
segment creation immediate
tablespace test_8k_assm
as
select * from all_objects where rownum = 0
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

insert into t1
select t2.*
from
        (
         select /*+ cardinality(40) */ rownum id
         from dual connect by level <= 40 -- > comment to avoid WordPress anomaly
        ) d,
        t2
;

commit;

declare
        m_unformatted_blocks    number;
        m_unformatted_bytes     number;
        m_fs1_blocks            number;
        m_fs1_bytes             number;
        m_fs2_blocks            number;
        m_fs2_bytes             number;

        m_fs3_blocks            number;
        m_fs3_bytes             number;
        m_fs4_blocks            number;
        m_fs4_bytes             number;
        m_full_blocks           number;
        m_full_bytes            number;

begin
        dbms_space.SPACE_USAGE(
                segment_owner           => 'TEST_USER',
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks              => m_fs1_blocks ,
                fs1_bytes               => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,
                fs2_bytes               => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,
                fs3_bytes               => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,
                fs4_bytes               => m_fs4_bytes,
                full_blocks             => m_full_blocks,
                full_bytes              => m_full_bytes
        );

        dbms_output.put_line('Unformatted                   : ' || to_char(m_unformatted_blocks,'999,999,990') || ' / ' || to_char(m_unformatted_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 1 (  0 -  25% free) : ' || to_char(m_fs1_blocks,'999,999,990') || ' / ' || to_char(m_fs1_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 2 ( 25 -  50% free) : ' || to_char(m_fs2_blocks,'999,999,990') || ' / ' || to_char(m_fs2_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 3 ( 50 -  75% free) : ' || to_char(m_fs3_blocks,'999,999,990') || ' / ' || to_char(m_fs3_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 4 ( 75 - 100% free) : ' || to_char(m_fs4_blocks,'999,999,990') || ' / ' || to_char(m_fs4_bytes,'999,999,999,990'));
        dbms_output.put_line('Full                          : ' || to_char(m_full_blocks,'999,999,990') || ' / ' || to_char(m_full_bytes,'999,999,999,990'));
end;
/

I’ve copied the first 50,000 rows from all_objects as a way of generating data, then cloned them 40 times into the main table to give me a total of 2,000,000 rows.

A comment on yesterday’s blog reported that the behaviour I described has been fixed in the October bundle patch for 12.1.0.2, but I haven’t patched my copy yet. So here are the results (with a little cosmetic editing) from running the insert and reporting on space usage from 11.2.0.4, 12.1.0.2, and 12.2.0.1 in order:


11.2.0.4
========
2000000 rows created.

Unformatted                   :          764 /        6,258,688
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          133 /        1,089,536
Full                          :       28,579 /      234,119,168

12.1.0.2
========
2000000 rows created.

Unformatted                   :          256 /        2,097,152
Freespace 1 (  0 -  25% free) :       32,810 /      268,779,520
Freespace 2 ( 25 -  50% free) :            0 /                0
Freespace 3 ( 50 -  75% free) :            1 /            8,192
Freespace 4 ( 75 - 100% free) :           47 /          385,024
Full                          :          443 /        3,629,056

12.2.0.1
========
2000000 rows created.

Unformatted                   :          764 /        6,258,688
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          226 /        1,851,392
Full                          :       39,706 /      325,271,552

The total number of blocks involved changes from version to version, of course, thanks to the huge differences in the contents of all_objects, but the headline message is clear – 12.1.0.2 is broken for this basic requirement. On the plus side, though, this is what you get from 12.1.0.2 if you change that insert to include the /*+ append */ hint:


2000000 rows created.

Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            0 /                0
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :            0 /                0
Full                          :       33,380 /      273,448,960
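
For reference, the hinted version is simply the earlier insert statement with the append hint added; nothing else in the script changes:

insert /*+ append */ into t1
select t2.*
from
        (
         select /*+ cardinality(40) */ rownum id
         from dual connect by level <= 40 -- > comment to avoid WordPress anomaly
        ) d,
        t2
;

commit;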

Unsurprisingly, 11.2.0.4 and 12.2.0.1 also behave and report 100% Full. This is a slightly special case, of course, since there was no previous data in the table, but even when I started the big insert after inserting and committing a few rows all three versions behaved.

January 9, 2018

ASSM argh!

Filed under: 12c,ASSM,Bugs,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 5:53 pm BST Jan 9,2018

Here’s a problem with ASSM that used to exist in older versions of Oracle, had disappeared by 11.2.0.4, and then re-appeared in 12.1.0.2 – disappearing again by 12.2.0.1. It showed up on MoS a few days ago under the heading: “Insert is running long with more waits on db file sequential read”.

The obvious response to this heading is to question the number of indexes on the table – because big tables with lots of indexes tend to give you lots of random I/O as Oracle maintains the indexes – but this table had no indexes. The owner of the problem supplied several bits of information in the initial post, with further material in response to follow-up questions, including the tkprof summary of the 10046/level 12 trace of the insert and two extracts from the trace file showing some of the “db file sequential read” waits – the first extract made me wonder if there might be some issue involving 16KB blocks but the second one dispelled that illusion.

There are several buggy things that can appear with ASSM and large-scale DML operations, and sometimes the problems can appear long after the original operation did the dirty deed, so I thought I’d create a simple model based on the information supplied to date – and discovered what the problem (probably) was. Here’s how it starts – I’ve created a tablespace using ASSM, and in this tablespace I’ve created a table which has 48 columns with a row length of 290 bytes (roughly matching the OP’s table), and I’ve hacked out a simple PL/SQL block that loops around inserting arrays of 100 rows at a time into the table for a total of 1M rows before committing.


rem
rem     Script:         assm_cleanout.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2        Lots of blocks left "not full"
rem             11.2.0.4
rem

rem
rem     using OMF, so no file-name needed
rem     Ran this bit as SYS, and altered test user to have unlimited quota
rem

/*
create
        tablespace test_8k_assm
        datafile size 1G
        extent management local
        autoallocate
        segment space management auto
;
*/

rem
rem     Create the table, pre-allocate some space.
rem     This means we should get consistent 8M extents and not initial little ones
rem

create table t1 (
        v001 varchar2(5), v002 varchar2(5), v003 varchar2(5), v004 varchar2(5), v005 varchar2(5),
        v006 varchar2(5), v007 varchar2(5), v008 varchar2(5), v009 varchar2(5), v010 varchar2(5),
        v011 varchar2(5), v012 varchar2(5), v013 varchar2(5), v014 varchar2(5), v015 varchar2(5),
        v016 varchar2(5), v017 varchar2(5), v018 varchar2(5), v019 varchar2(5), v020 varchar2(5),
        v021 varchar2(5), v022 varchar2(5), v023 varchar2(5), v024 varchar2(5), v025 varchar2(5),
        v026 varchar2(5), v027 varchar2(5), v028 varchar2(5), v029 varchar2(5), v030 varchar2(5),
        v031 varchar2(5), v032 varchar2(5), v033 varchar2(5), v034 varchar2(5), v035 varchar2(5),
        v036 varchar2(5), v037 varchar2(5), v038 varchar2(5), v039 varchar2(5), v040 varchar2(5),
        v041 varchar2(5), v042 varchar2(5), v043 varchar2(5), v044 varchar2(5), v045 varchar2(5),
        v046 varchar2(5), v047 varchar2(5), v048 varchar2(5)
)
segment creation immediate
tablespace test_8k_assm
storage(initial 8M)
;

alter table t1 allocate extent (size 8M);
alter table t1 allocate extent (size 8M);

rem
rem     Simple anonymous pl/sql block
rem     Large insert, handled with array inserts
rem     Can modify loop count and array size very easily
rem

declare
        type tab_array is table of t1%rowtype;
        junk_array tab_array;
begin

        select
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx'
        bulk collect into
                junk_array
        from
                all_objects
        where
                rownum  <= 100 -- > comment to avoid WordPress format issue
        ;

        for i in 1..10000 loop
                forall j in 1..junk_array.count
                        insert into t1 values junk_array(j) ;
        end loop;

end;
/

commit;

The number of rows per block after this insert is 24, with 1,038 bytes of free space left (808 due to the pctfree = 10, plus the bit that was too small to take a 25th row without breaching the pctfree limit). This means we should report 1M/24 = 41,666 full blocks and one block with some free space. So we query the table using the dbms_space package:


declare
        m_unformatted_blocks    number;
        m_unformatted_bytes     number;
        m_fs1_blocks            number;
        m_fs1_bytes             number;
        m_fs2_blocks            number;
        m_fs2_bytes             number;

        m_fs3_blocks            number;
        m_fs3_bytes             number;
        m_fs4_blocks            number;
        m_fs4_bytes             number;
        m_full_blocks           number;
        m_full_bytes            number;

begin
        dbms_space.SPACE_USAGE(
                segment_owner           => 'TEST_USER',
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks              => m_fs1_blocks ,
                fs1_bytes               => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,
                fs2_bytes               => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,
                fs3_bytes               => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,
                fs4_bytes               => m_fs4_bytes,
                full_blocks             => m_full_blocks,
                full_bytes              => m_full_bytes
        );

        dbms_output.new_line;
        dbms_output.put_line('Unformatted                   : ' || to_char(m_unformatted_blocks,'999,999,990') || ' / ' || to_char(m_unformatted_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 1 (  0 -  25% free) : ' || to_char(m_fs1_blocks,'999,999,990') || ' / ' || to_char(m_fs1_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 2 ( 25 -  50% free) : ' || to_char(m_fs2_blocks,'999,999,990') || ' / ' || to_char(m_fs2_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 3 ( 50 -  75% free) : ' || to_char(m_fs3_blocks,'999,999,990') || ' / ' || to_char(m_fs3_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 4 ( 75 - 100% free) : ' || to_char(m_fs4_blocks,'999,999,990') || ' / ' || to_char(m_fs4_bytes,'999,999,999,990'));
        dbms_output.put_line('Full                          : ' || to_char(m_full_blocks,'999,999,990') || ' / ' || to_char(m_full_bytes,'999,999,999,990'));

end;
/

The results aren’t what we expect:


Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :       35,001 /      286,728,192
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :        6,665 /       54,599,680

We have one block marked as 25 – 50% free (that’s the one block with 16 rows in it, which means about 40% space currently free) but our 41,666 full blocks are actually reported as 6,665 full blocks and 35,001 blocks with some space available. That’s going to hurt eventually if some process wants to insert more rows and finds that it has to fail its way through 35,001 blocks before finding a block which has enough free space.

So what happens when I repeat the PL/SQL block (and commit)? Here are the results from calls to dbms_space after the next two cycles:


Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :       70,002 /      573,456,384
Freespace 2 ( 25 -  50% free) :            2 /           16,384
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          220 /        1,802,240
Full                          :       13,330 /      109,199,360

Unformatted                   :          256 /        2,097,152
Freespace 1 (  0 -  25% free) :      105,003 /      860,184,576
Freespace 2 ( 25 -  50% free) :            3 /           24,576
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          117 /          958,464
Full                          :       19,995 /      163,799,040

Every time we execute the PL/SQL block we leave a trail of 35,001 more blocks which are flagged as “not quite full”.

Looking at the session stats while running the insert loop I can tell that Oracle isn’t checking to see whether or not it should be using those blocks. (A quick way of proving this is to issue “alter system flush buffer_cache” before each execution of the PL/SQL block and note that Oracle doesn’t read back the 105,000 blocks before inserting any data.) So somehow, sometime, someone might get a nasty surprise – and here’s one way that it might happen:

Since I know my data fits 24 rows per block I’m going to modify my PL/SQL block to select one row into the array and then loop round the insert 25 times – so I know I’m inserting a little bit more than one block’s worth of data. Starting from the state with 105,003 blocks marked as “Freespace 1” this is what I saw – first, the free space report after inserting 25 rows:


Unformatted                   :          240 /        1,966,080
Freespace 1 (  0 -  25% free) :        1,074 /        8,798,208
Freespace 2 ( 25 -  50% free) :            0 /                0
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          133 /        1,089,536
Full                          :      123,927 /    1,015,209,984

Then a few wait events and session statistics for the insert:


---------------------------------------------------------
SID:    39:TEST_USER - jonathan
Session Events - 09-Jan 16:57:18
Interval:-      6 seconds
---------------------------------------------------------
Event                                             Waits   Time_outs        Csec    Avg Csec    Max Csec
-----                                             -----   ---------        ----    --------    --------
db file sequential read                          15,308           0         128        .008           3
db file scattered read                           20,086           0         271        .014           4

---------------------------------
Session stats - 09-Jan 16:57:18
Interval:-  6 seconds
---------------------------------
Name                                                                     Value
----                                                                     -----
session logical reads                                                  269,537
physical read total IO requests                                         35,401
db block gets                                                          229,522
consistent gets                                                         40,015
physical reads                                                         124,687
physical reads cache                                                   124,687
db block changes                                                       208,489
physical reads cache prefetch                                           89,293
redo entries                                                           207,892
redo size                                                           16,262,724
undo change vector size                                                  1,720
deferred (CURRENT) block cleanout applications                         103,932
table scan blocks gotten                                                20,797
HSC Heap Segment Block Changes                                              25

The session has read and updated almost all of the level 1 bitmap blocks. I don’t know exactly what triggered this re-read, but it seems to be related to the number of rows inserted (or, perhaps, the amount of space used rather than the row count) as an insert crosses the pctfree boundary and fails over to the next block. I’ve only done a couple of little tests to try to get a better idea of why an insert sometimes sweeps through the bitmap blocks – so I know that inserting 2 or 3 rows at a time will also trigger the cleanout – but there are probably several little details involved that need to be identified.

You might note a couple of details in the stats:

  • Because I had flushed the buffer cache before the insert Oracle did its “cache warmup” tablescanning trick – if this had not happened I would probably have done a single block read for every single bitmap block I touched.
  • There are 103,932 block cleanout applications – but 208,000 db block changes and redo entries. Roughly half the latter are for data block cleanouts (OP code 4.1) and half are the state changes on the level 1 bitmap blocks (OP code 13.22). You’ll notice that neither change produces any undo.
  • I’ve also included the HSC Heap Segment Block Changes statistics to show you that not all changes to Heap Segment Blocks show up where you might expect them.

And finally:

If you re-run the tests on 11.2.0.4 and 12.2.0.1 you get the following results after the initial script run – the problem doesn’t exist:


11.2.0.4
========
Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       41,666 /      341,327,872

12.2.0.1
========
Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       41,666 /      341,327,872

So how does this help the OP?

  • First, there may be a huge mess still waiting to be cleaned up in the table – at 34M blocks I’m not too keen to suggest running the dbms_space routine to find out what it looks like, but maybe that’s necessary.
  • Secondly – an upgrade to 12.2 will probably avoid the problem in future.
  • Thirdly – if the number of rows per block is very close to uniform, write a little code to do a loop that inserts (say) 2 * expected number of rows per block as single row inserts and rolls back; the inserts will probably trigger a massive, though perhaps not complete, cleanout (see the sketch after this list), so rinse and repeat until the cleanout is complete. Try to find a time when you don’t mind the extra load to get this job done.
  • Finally – on the big job that does the bulk insert – repeat the dummy insert/rollback at the end of the job to clean up the mess made by the job.
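Here’s a minimal sketch of that third suggestion. It uses my test table t1 and the 24-rows-per-block figure from this test as stand-ins for the OP’s real table and row size, so treat every name and number as an assumption to be adjusted:


declare
        m_rows_per_block        constant pls_integer := 24;     -- assumed (from this test)
begin
        for i in 1 .. 2 * m_rows_per_block loop
                insert into t1 (col000) values ('cleanout');    -- single row inserts
        end loop;
        rollback;       -- keep nothing; the inserts have already done the cleanout work
end;
/
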

Addenda

Prompted by comment #2 below, I should add that if the problem has been fixed in 12.2 then possibly there’s a bug report and patch for it already. If there isn’t then the OP could raise an SR (referencing this blog note), and request a bug fix or back-port from 12.2.

And within 24 hours of publication, comment #4 (from Yury Pudovchenko) tells us that the bug is fixed by the Oct 2017 Bundle Patch.

 

 

December 30, 2017

nvarchar2

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 12:08 pm BST Dec 30,2017

Here’s an odd little quirk that appeared when I was playing around with default values just recently. I think it’s one I’ve seen before, I may even have written about it many years ago but I can’t find any reference to it at present. Let’s start with a script that I’ll run on 12.2.0.1 (the effect does appear on earlier versions):


rem
rem     Script:         nvarchar2_anomaly.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2017
rem

create table t1 (
        column1  varchar2(10),
        column2  nvarchar2(10)
);

create table t2 (
        column1  varchar2(10)
);

alter table t2 add column2 nvarchar2(10);

create table t3 (
        column1  varchar2(10),
        column2  nvarchar2(10) default 'xxxxxxxx'
);

create table t4 (
        column1  varchar2(10)
);

alter table t4 add column2 nvarchar2(10) default 'xxxxxxxx';

insert into t1(column1) values('a');
insert into t2(column1) values('a');
insert into t3(column1) values('a');
insert into t4(column1) values('a');

All I’ve done is create 4 tables which, when described, will all look the same:


 Name                    Null?    Type
 ----------------------- -------- ----------------
 COLUMN1                          VARCHAR2(10)
 COLUMN2                          NVARCHAR2(10)

There is a significant difference between the first two and the last two, of course, thanks to the specification of a default value which means that the inserts will produce two possible results: the first two tables will have nulls in column2; the last two will have the nvarchar2 equivalent of ‘xxxxxxxx’ which, in my instance, will be a string of 16 bytes: “0,78,0,78,0,78,0,78,0,78,0,78,0,78,0,78”.
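If you want to see that encoding for yourself, here’s a quick (hedged) check with dump() against t3 or t4 – the exact Typ/Len details will depend on your national character set:


select  column2, dump(column2,16) dumped_value
from    t3
;
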

Surprisingly, though, there is a dramatic oddity between t3 and t4 which shows up when I query user_tab_cols:

select
        table_name, column_id, column_name,  segment_column_id, data_default
from    user_tab_cols
where   table_name like 'T_'
order by
        table_name, column_id
;

TABLE_NAME            COLUMN_ID COLUMN_NAME          SEGMENT_COLUMN_ID DATA_DEFAULT
-------------------- ---------- -------------------- ----------------- --------------------
T1                            1 COLUMN1                              1
                              2 COLUMN2                              2

T2                            1 COLUMN1                              1
                              2 COLUMN2                              2

T3                            1 COLUMN1                              1
                              2 COLUMN2                              2 'xxxxxxxx'

T4                            1 COLUMN1                              1
                              2 COLUMN2                              3 'xxxxxxxx'
                                SYS_NC00002$                         2

Table t4 has acquired two columns – a hidden column (which physically exists as the second column in the stored data and is declared as raw(126)) and the column which I had specified. You’ll note that the test shows two differences that may be significant: comparing t3/t4 we see that adding, rather than initially defining, the nvarchar2() column introduces the extra column; comparing t2/t4 we see that adding the nvarchar2() column without a default value doesn’t produce the same effect. The tentative assumption, therefore, is that there is something special about adding nvarchar2() columns with a default value. [Update: wrong, see comment 2 and the link it supplies]
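As a quick follow-up check (nothing more than the standard user_tab_cols view), you can also confirm that the extra column on t4 is a hidden, system-generated column rather than anything you could have declared yourself:


select  column_name, data_type, data_length, hidden_column, virtual_column
from    user_tab_cols
where   table_name = 'T4'
order by
        internal_column_id
;
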

Casting my mind back to various customers who have gone through multiple upgrades of 3rd party applications that invariably seem to add columns to tables, I wondered whether this extra column appeared every time you added an nvarchar2(). I’d not noticed anything in the past that suggested this might be the case, but it’s obviously worth checking: and in my simple tests it looked as if Oracle created just one extra column and used it to capture a value that seemed to be determined by the number and location of columns that had been added.

It’s a curiosity, and leaves room for further investigation – so if anyone has links to related articles please feel free to add them in the comments.

 

August 3, 2017

Rebuilding Indexes

Filed under: Indexing,Infrastructure,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:00 pm BST Aug 3,2017

One of the special events that can make it necessary to rebuild an index is the case of the “massive DML”, typically a bulk delete that purges old data from a table. You may even find cases where it’s a good idea to mark a couple of your indexes as unusable before doing a massive delete and then rebuild them after the delete.

Despite the fact that a massive delete is an obvious special case it’s still not necessary in many cases to worry about a rebuild afterwards because the space made free by the delete will be smoothly reused over time with very little variation in performance. There is, however, one particular feature that increases the probability of a rebuild becoming necessary – global (or globally partitioned) indexes on partitioned tables. The problem (and the absence of problem in non-partitioned tables) is in the nature of the rowid.

For non-partitioned tables, and partitioned tables with local indexes, the rowid stored in an index is (assuming we’re thinking only of heap tables) stored as a sequence of 6 bytes consisting, in order, of: (tablespace relative file number, block number within file, row number within block). If the table is non-partitioned, or if this is an index segment from a locally partitioned index, all the index entries will be pointing to the same table segment and Oracle knows which segment that is from the data dictionary information – so Oracle can derive the data_object_id of the table segment and convert the tablespace relative file number into the absolute file number to navigate to the right row in the table.

When the index is global or globally partitioned any index entry may point to any of the table’s segments, so the rowid that is stored in the index is expanded by a further 4 bytes to hold the data_object_id of the table segment it is pointing to – and the data_object_id is the leading component: (data_object_id, tablespace relative file number, block number within file, row number within block). Think about what this means when you start to drop “old” partitions and add new partitions. Compare this with what happens when you simply delete a large volume of old data from a table and starting inserting new data. There’s an important difference to the way in which indexes will evolve.
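As a minimal sketch of those components (using a hypothetical table t1 – it’s just the standard dbms_rowid package, so you can run it against any heap table) you can pull the pieces out of a rowid like this:


select
        dbms_rowid.rowid_object(rowid)          data_object_id,
        dbms_rowid.rowid_relative_fno(rowid)    rel_file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        dbms_rowid.rowid_row_number(rowid)      row_no
from
        t1
where
        rownum = 1
;

The extended rowid you see in SQL always carries the data_object_id, of course; the point above is about what’s physically stored in the index entry.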

Purging data

When you delete a large volume of data from a (simple, heap) table you will create a lot of empty space in a lot of existing table blocks. If that delete is supposed to get rid of “old” data (and to keep the description simple we’ll assume it’s the first time you’ve done this) it’s very likely that the delete will result in lots of empty blocks near the start of the table – rows that were inserted at the same time will tend to be physically close to each other in the table. This means that future inserts will soon start to reuse those table blocks. Think about what this means for index entries – especially for non-unique keys.

Assume you have 100 rows with value ‘XXX’ for an indexed column. Breaking the rowid into its component parts the index entries will be (‘XXX’,{file_id, block_id, row_number}).  Now assume you delete the oldest 10 rows then, over time, insert 10 more rows with the same key value. You’ve deleted the 10 index entries with the lowest values for (file_id, block_id) but the space that’s become available in the table will be in and around exactly that range of blocks – so the new index entries will probably end up looking very similar to the deleted index entries and inserted in and around the existing index entries for value ‘XXX’, so over time the index is unlikely to allocate much new space.

Now think about what happens when your table is partitioned but the index is global; your index entries are (‘XXX’,{data_object_id, file_id, block_id, row_number}). When you drop the oldest partition you will probably[1] delete all the index entries with the lowest data_object_id. When you start inserting new rows for ‘XXX’ the new table partition will have a data_object_id that is going to be higher than any previous data_object_id – which means you’re going to be inserting rows into the right-hand (high-value) edge of this section of the index. In some cases – typically those where you have a couple of leaf blocks per key value – the index may end up growing significantly because the insertion point for rows in the new partition isn’t in the leaf block with the available space, and it won’t be until you’ve done a few more bulk deletes and the leaf blocks relating to the oldest table partitions become completely empty that the space can be reused.

An example of this type of behaviour probably appeared on the OTN database forum quite recently.  Of course, there are various reasons why indexes can become inefficient, and the degree of inefficiency may only become noticeable over a fairly long period of time; moreover there are various reasons why global indexes are a little problematic, and various reasons why a bulk delete (which is what executing “alter table drop partition” does to a global index) has unpleasant side effects dependent somewhat on the number (and size) of the partitions and on how many you try to drop in one go.

There’s not a lot you can do about this quirk of global indexes, but it’s always worth taking extra care with partitioned tables and focusing even more carefully on a strategic review of indexes:

  • Does this index really need to exist at all?
  • Could this index be replaced by a selective function-based index?
  • Does this index really need to be global / globally partitioned?
  • How big is this index compared to the size it ought to be (see the sketch below)?
  • Should this index be (basic) compressed?
  • Is this index likely to be disrupted by a historic purge – or is there another reason for its undesirable behaviour?
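For the “how big should it be” question, one (hedged) approach is a validate structure call – the index name t1_idx is just a placeholder, and remember that this locks the index and reads it end to end, so it’s not something to run casually on a busy production system:


analyze index t1_idx validate structure;

select  name, height, lf_rows, lf_blks, used_space, btree_space, pct_used
from    index_stats
;
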

 

[1] “probably delete all the index entries with the lowest data_object_id” – I have to say this because if you’ve executed a “move partition” at any time a new data_object_id will have been generated for the partition, so the oldest partition could, in principle, have the highest data_object_id. The issue of changing data_object_ids brings a whole new level of complexity to global indexes – but only in a few special cases, fortunately.

 

 

July 25, 2017

Redo OP Codes:

Filed under: Infrastructure,Oracle,redo — Jonathan Lewis @ 6:17 pm BST Jul 25,2017

This posting was prompted by a tweet from Kamil Stawiarski in response to a question about how he’d discovered the meaning of Redo Op Codes 5.1 and 11.6 – and credited me and Julian Dyke with “the hardest part”.

Over the years I’ve accumulated (from Julian Dyke, or odd MoS notes, etc.) and let dribble out the occasional interpretation of a few op codes – typically in response to a question on the OTN database forum or the Oracle-L listserver, and sometimes as a throwaway comment in a blog post, but I’ve never published the full set of codes that I’ve acquired (or guessed) to date.

It’s been some time since I’ve looked closely at a redo stream, and there are many features of Oracle that I’ve never had to look at at the level of the redo so there are plenty of gaps in the list – and maybe a few people will use the comments to help fill the gaps.

It’s possible that I may be able to add more op codes over the next few days – I know that somewhere I have some op codes relating to space management, and a batch relating to LOB handling, but it looks like I forgot to add them to the master list – so here’s what I can offer so far:


1	Transaction Control

2	Transaction read

3	Transaction update

4	Block cleanout
		4.1	Block cleanout record
		4.2	Physical cleanout
		4.3	Single array change
		4.4	Multiple array changes
		4.5	Format block
		4.6	ktbcc redo -  Commit Time Block Cleanout Change (?RAC, ?recursive, ?SYS objects)

5	Transaction undo management
		5.1	Update undo block
		5.2	Get undo header
		5.3	Rollout a transaction begin
		5.4	On a rollback or commit
		5.5	Create rollback segment
		5.6	On a rollback of an insert
		5.7	In the ktubl for 'dbms_transaction.local_transaction_id'
			(begin transaction) - also arrives for incoming distributed
			tx, no data change but TT slot acquired. Also for recursive
			transaction (e.g. truncate). txn start scn:  0xffff.ffffffff
		5.8	Mark transaction as dead
		5.9	Rollback extension of rollback seg
		5.10	Rollback segment header change for extension of rollback seg
		5.11	Mark undo as applied during rollback
		5.19	Transaction audit record - first
		5.20	Transaction audit record - subsequent
		5.23	ktudbr redo: disable block level recovery (reports XID)
		5.24	ktfbhundo - File Space Header Undo

6	Control file

10	Index
		10.1	SQL load index block
		10.2	Insert Leaf Row
		10.3	Purge Leaf Row
		10.4	Delete Leaf Row
		10.5	Restore Leaf during rollback
		10.6	(kdxlok) Lock block (pre-split?)
		10.7	(kdxulo) unlock block during undo
		10.8	(kdxlne) initialize leaf block being split
		10.9	(kdxair) apply XAT do to ITL 1	-- related to leaf block split
		10.10	Set leaf block next pointer
		10.11	(kdxlpr) (UNDO) set kdxleprv (previous pointer)
		10.12 	Initialize root block after split
		10.13	index redo (kdxlem): (REDO) make leaf block empty,
		10.14	Restore block before image
		10.15	(kdxbin) Insert branch block row
		10.16	Purge branch row
		10.17	Initialize new branch block
		10.18	Update key data in row -- index redo (kdxlup): update keydata
		10.19	Clear split flag
		10.20	Set split flag
		10.21	Undo branch operation
		10.22	Undo leaf operation
		10.23	restore block to tree
		10.24	Shrink ITL
		10.25	format root block
		10.26	format root block (undo)
		10.27	format root block (redo)
		10.28	Migrating block (undo)
		10.29	Migrating block (redo)
		10.30	Update nonkey value
		10.31	index root block redo (kdxdlr):  create/load index
		10.34 	make branch block empty
		10.35	index redo (kdxlcnu): update nonkey
		10.37	undo index change (kdxIndexlogicalNonkeyUpdate) -- bitmap index
		10.38	index change (kdxIndexlogicalNonkeyUpdate) -- bitmap index
		10.39	index redo (kdxbur) :  branch block update range
		10.40	index redo (kdxbdu) :  branch block DBA update,

11	Table
		11.1  undo row operation
		11.2  insert row  piece
		11.3  delete row piece
		11.4  lock row piece
		11.5  update row piece
		11.6  overwrite row piece
		11.7  manipulate first column
		11.8  change forwarding address - migration
		11.9  change cluster key index
		11.10 Set Cluster key pointers
		11.11 Insert multiple rows
		11.12 Delete multiple rows
		11.13 toggle block header flags
		11.17 Update multiple rows
		11.19 Array update ?
		11.20 SHK (mark as shrunk?)
		11.24 HCC update rowid map ?

12	Cluster

13	Segment management
		13.1	ktsfm redo: -- allocate space ??
		13.5	KTSFRBFMT (block format) redo
		13.6	(block link modify) (? index )  (options: lock clear, lock set)
		13.7	KTSFRGRP (fgb/shdr modify freelist) redo: (options unlink block, move HWM)
		13.13	ktsbbu undo - undo operation on bitmap block
		13.14	ktsbbu undo - undo operation on bitmap block
		13.17	ktsphfredo - Format Pagetable Segment Header
		13.18	ktspffredo - Format Level1 Bitmap Block
		13.19	ktspsfredo - Format Level2 Bitmap Block
		13.21	ktspbfredo - Format Pagetable Datablock
		13.22	State change on level 1 bitmap block
		13.23	Undo on level 1 bitmap block
		13.24	Bitmap block (BMB) state change (level 2 ?)
		13.25	Undo on level 2 bitmap block
		13.26	?? Level 3 bitmap block state change ??
		13.27	?? Level 3 bitmap block undo ??
		13.28	Update LHWM and HHWM on segment header
		13.29	Undo on segment header
		13.31	Segment shrink redo for L1 bitmap block
		13.32	Segment shrink redo for segment header block

14	Extent management
		14.1	ktecush redo: clear extent control lock
		14.2	ktelk redo - lock extent (map)
		14.3	Extent de-allocate
		14.4	kteop redo - redo operation on extent map
		14.5	kteopu undo - undo operation on extent map
		14.8	kteoputrn - undo operation for flush for truncate

15	Tablespace

16	Row cache

17	Recovery management
		17.1	End backup mode marker
		17.3	Crash Recovery at scn:  0x0000.02429111
		17.28	STANDBY METADATA CACHE INVALIDATION

18	Block image (hot backups)
		18.1	Block image
		18.3	Reuse redo entry
				   (Range reuse: tsn=1 base=8388753 nblks=8)
				or (Object reuse: tsn=2 objd=76515)

19	Direct loader
		19.1	Direct load block record
		19.2	Nologging invalidate block range
			Direct Loader invalidate block range redo entry

20	Compatibility segment

21	LOB segment
		21.1	kdlop (Long Feild) redo:  [sic]
				(insert basicfile clob)

22	Locally managed tablespace
		22.2	ktfbhredo - File Space Header Redo:
		22.3	ktfbhundo - File Space Header Undo:
		22.5	ktfbbredo - File BitMap Block Redo:
		22.16	File Property Map Block (FPM)

23	Block writes
		23.1	Block written record
		23.2	Block read record (BRR) -- reference in Doc ID: 12423475.8

24	DDL statements
		24.1	DDL
		24.2	Direct load block end mark
		24.4	?? Media recovery marker
		24.10	??
		24.11	??

(E & O.E) – you’ll notice that some of the descriptions include question marks – those are guesses – and some are little more than the raw text extracted from a redo change vector with no interpretation of what they might mean.
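For anyone who wants to chase these codes themselves, the investigation usually comes down to symbolic dumps of online or archived log files. A hedged sketch – the file name is a placeholder, and the layer/opcode restriction is the undocumented but widely known variant of the command:


alter system dump logfile '/u01/app/oracle/oradata/orcl/redo01.log';

-- restrict the dump to a single op code, e.g. layer 11 opcode 6 ("overwrite row piece")
alter system dump logfile '/u01/app/oracle/oradata/orcl/redo01.log' layer 11 opcode 6;
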

Update

It didn’t take long for someone to email me a much longer list that has been published elsewhere on the Internet. The results don’t have the hierarchical style display I have above, so I may copy the extra entries into the list above when I get a little time.

July 24, 2017

Fast Now, Fast Later

Filed under: Infrastructure,Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 1:39 pm BST Jul 24,2017

The following is the text of an article I published in the UKOUG magazine several years ago (2010), but I came across it recently while writing up some notes for a presentation and thought it would be worth repeating here.

Fast Now, Fast Later

The title of this piece came from a presentation by Cary Millsap and captures an important point about trouble-shooting as a very memorable aphorism. Your solution to a problem may look good for you right now but is it a solution that will still be appropriate when the database has grown in volume and has more users.

I was actually prompted to write this article by a question on the OTN database forum that demonstrated the need for the basic combination of problem solving and forward planning. Someone had a problem with a fairly sudden change in performance of his system from November to December, and he had some samples from trace files and Statspack of a particular query that demonstrated the problem.

The query was very simple:

select  *
from    tph
where   pol_num = :b0
order by
        pm_dt, snum

When the query was operating fast enough the trace file from a sample run showed the following (edited) tkprof output, with the optimizer taking advantage of the primary key of (pol_num, pm_dt, snum) on table TPH to avoid a sort for the order by clause. (Note that the heading on the plan is “Row Source Operation” – which means it’s the execution plan that really was used)

call    count    cpu  elapsed  disk  query  current  rows
---------------------------------------------------------
Parse       1   0.01     0.13     0    106        0     0
Execute     1   0.03     0.03     0      0        0     0
Fetch       4   0.01     0.22    46     49        0    43
---------------------------------------------------------
total       6   0.06     0.39    46    155        0    43

Rows Row Source Operation
---- --------------------
  43 TABLE ACCESS BY INDEX ROWID TPH (cr=49 pr=46 pw=0 time=226115 us)
  43   INDEX RANGE SCAN TPH_PK (cr=6 pr=3 pw=0 time=20079 us)(object id 152978)

Elapsed times include waiting on following events:
Event waited on               Times  Max. Wait Total Waited
                             Waited 
-----------------------------------------------------------
db file sequential read          46       0.01         0.21

When the query was running less efficiently the change in the trace didn’t immediately suggest any fundamental problems:


call    count    cpu  elapsed  disk  query  current  rows
---------------------------------------------------------
Parse       1   0.00     0.00     0     51        0     0
Execute     1   0.01     0.01     0      0        0     0
Fetch       4   0.00     0.59    47     51        0    45
---------------------------------------------------------
total       6   0.01     0.61    47    102        0    45

Rows Row Source Operation
---- --------------------
  45 TABLE ACCESS BY INDEX ROWID TPH (cr=51 pr=47 pw=0 time=593441 us)
  45   INDEX RANGE SCAN TPH_PK (cr=6 pr=2 pw=0 time=33470 us)(object id 152978)

Elapsed times include waiting on following events:
Event waited on               Times  Max. Wait Total Waited
                             Waited 
-----------------------------------------------------------
db file sequential read          47       0.03         0.58

The plan is the same, the number of rows returned is roughly the same, and the number of disc reads and buffer gets has hardly changed. Clearly the overall change in performance comes from the slower average disk read times (a total of 0.21 seconds with a maximum of one hundredth of a second, compared to a total of 0.58 seconds with a maximum of 3 hundredths), but why has the disk I/O time changed?

The figures give us a couple of preliminary ideas. An average read time of 4.5 milliseconds ( 0.21 seconds / 46 reads) is pretty good for a “small” random read of a reasonably loaded disc subject to a degree of concurrent access [ed: bearing in mind this was 2010], so the waits for “db file sequential read” in the first tkprof output are probably getting some help from a cache somewhere – possibly a SAN cache at the end of a fibre link or maybe from a local file system buffer (we might get a better idea if we could see the complete list of individual read times).

In the second case an average of 12.3 milliseconds ( 0.58 seconds / 45 reads) looks much more like a reasonable amount of genuine disc I/O is taking place – and the maximum of 30 milliseconds suggests that the disc(s) in question are subject to an undesirable level of concurrent access: our session is spending some of its time in a disk queue. Again, it would be nice to see the wait times for all the reads, but at this point it’s not really necessary.

There are a couple more clues about what’s going on – one is the text of the query itself (and I’ll be coming back to that later) and the other is in the detail of the disk I/Os. If you check the “Row Source Operation” details you’ll see that in the first case the sample query selected 43 rows from the table and requested 43 (46 – 3) physical reads (pr) of the table to do so. In the second case it was 45 rows and 45 (47 – 2) physical reads. Is this simply a case of the same query needing a little more data and having to do a little more work as time passes?

So now we come to the Statspack data. Based on my observations (or guesses) about the nature of the query and the work going on, I asked if we could see some summary information for a couple of comparative intervals, and also to see if this particular query appeared in the “SQL ordered by reads” section of the Statspack reports. Here are the results, first for a snapshot taken in October:


Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                      3,816,939      58,549     15   79.4
CPU time                                                     7,789          10.6
db file parallel write                         371,865       2,005      5    2.7
log file parallel write                         75,554       1,552     21    2.1
log file sync                                   17,198       1,228     71    1.7

                                                     CPU      Elapsd     Old
 Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
        775,166       43,228           17.9   24.3   212.58  12449.98  1505833981
Module: javaw.exe
SELECT * FROM TPH WHERE POL_NUM = :B1 ORDER BY PM_DT ,SNUM FOR UPDATE NOWAIT

You might notice that the critical query is actually a ‘select for update’ rather than the simple select that we had originally been told about; this doesn’t affect the execution plan, but is going to have some significance as far as undo and redo are concerned.

Now look at the corresponding figures for an interval in December:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                      7,000,428      92,970     13   89.8
CPU time                                                     6,780           6.5
db file parallel write                         549,286       1,450      3    1.4
db file scattered read                          84,127         720      9     .7
log file parallel write                         41,197         439     11     .4


                                                     CPU      Elapsd     Old
 Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
      2,444,437       43,363           56.4   25.2   221.31  23376.07 1505833981
Module: javaw.exe
SELECT * FROM TPH WHERE POL_NUM = :B1 ORDER BY PM_DT ,SNUM FOR UPDATE NOWAIT

You’ll see in both cases that a huge fraction of the total database time is spent in single block reads (the “db file sequential read” time), but in December the number of reads has gone up by about 3.2 million. You can also see that about 1.7 million of the “extra” reads could be attributed to the critical query even though the number of executions of that query has hardly changed. The average number of reads per execution has gone up from 18 to 56. (I did ask if I could see the section of Statspack titled “SQL ordered by Executions” as this includes the average number of rows per execution and it would have been nice to know whether this average had gone up just a little bit, or whether it had gone up in line with the average reads per execution. Unfortunately the request was ignored, so I am going to proceed as if the change in the average result set was small.)

This, perhaps, tells us exactly what the problem is (and even if it doesn’t, the figures are symptomatic of one of the common examples of non-scalable queries).

Look at the query again – are we reporting all the rows for a “policy number”, ordered by “payment date”? If so, the number of payments recorded is bound to increase with time, and inevitably there will be lots of payments for policies belonging to other people between each pair of payments I make on my policy – and that would put each of my payments in a different table block (if I use a normal heap table).

Initially the payments table may be sufficiently small that a significant fraction of it stays in Oracle’s data cache or even in the file-system or SAN cache; but as time passes and the table grows the probability of me finding most of my blocks cached decreases – moreover, as time passes, I want increasing numbers of blocks which means that as I read my blocks I’m more likely to knock your blocks out of the cache. Given the constantly increasing numbers of competing reads, it is also no surprise that eventually the average single block read time also increases.

In scenarios like this it is inevitable that performance will degrade over time; in fact it is reasonably likely that the performance profile will degrade slowly to start with and then show an increasingly dramatic plunge. The only question really is how much damage limitation you can do.

One strategy, of course, is to increase the memory available for the critical object(s). This may mean assigning the table to a generously sized KEEP cache. (The cache need not be the same size as the table to improve things, but the bigger the better – for this query, at least). But such a strategy is only postponing the inevitable – you really need to find an approach which is less susceptible to the passage of time.
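A hedged sketch of that option – the table name comes from the example and the cache size is an arbitrary figure that you’d have to choose for your own system:


alter system set db_keep_cache_size = 2G scope = both;

alter table tph storage (buffer_pool keep);
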

In this case, there are a few options to consider. First – note that we are selecting all the rows for a policy: do we really need to, or could we select the rows within a given date range, thus setting an upper limit on the average volume of data we need to acquire for any one policy. If we do that, we may want to think about strategies for summarizing and deleting older data, or using partitioning to isolate older data in separate segments.
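A hedged sketch of the date-range idea, reusing the columns from the original query – the 18 month window is an invented example, not a business rule:


select  *
from    tph
where   pol_num = :b0
and     pm_dt  >= add_months(trunc(sysdate), -18)
order by
        pm_dt, snum
;
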

If we can’t deal with the problem by changing the code (and, in this case, the apparent business requirement) can we avoid the need to visit so many data blocks for a single policy? There are two obvious options to consider here – we could create the table as an “index cluster” clustered on the policy number; in this way we pay a penalty each time we insert a new row for a policy because we have to find the correct block in the cluster but when we run a query against that policy we will only need to read one or two blocks (probably) to get all the data. Alternatively we could consider setting the table up as an index-organized table (IOT) – again we do more work inserting data into the correct leaf block in the index but again we reap the benefit as we query the data because all the rows we want are in the same two or three leaf blocks (and stored in the order we want them).
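Here’s a minimal sketch of the IOT option – the column names come from the query, but the data types and the rest of the column list are assumptions:


create table tph_iot (
        pol_num         number          not null,
        pm_dt           date            not null,
        snum            number          not null,
        -- ... the rest of the policy history columns ...
        constraint tph_iot_pk primary key (pol_num, pm_dt, snum)
)
organization index
;
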

Of course we are still subject to the same basic problem of the result set increasing in size as time passes, but at least we have managed to reduce (dramatically) the number of blocks we have to visit and the rate of growth of the number of blocks per query, thereby improving the scalability of the queries significantly.

Introducing new structures to an existing system is difficult, of course, and we may have to work out variations on this theme (like creating an index that includes all the table columns if we can’t switch to an IOT!). The key point is this, though: if we can look at our data and the critical queries and recognize that the volume of data we have to process (even if we don’t return it all – though in this example we did) is always going to increase over time, then we need to consider ways of minimizing the volume of data, or improving the packing of data so that the work we do doesn’t change (much) over time. Don’t just think ‘fast now’, think ‘will it still be fast later’.

 

May 23, 2017

255 Again!

Filed under: 12c,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 1:10 pm BST May 23,2017

There are so many things that can go wrong when you start using tables with more than 255 columns – here’s one I discovered partly because I was thinking about a client requirement, partly because I had a vague memory of a change in behaviour in 12c and Stefan Koehler pointed me to a blog note by Sayan Malakshinov when I asked the Oak Table if anyone remembered seeing the relevant note. Enough of the roundabout route, I’m going to start with a bit of code to create a table, stick a row in it, then update that row:

rem
rem     Script: wide_table_4.sql
rem     Author: Jonathan Lewis
rem     Dated:  May 2017
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

set pagesize 0
set feedback off

spool temp.sql

prompt create table t1(

select
        'col' || to_char(rownum,'fm0000') || '  varchar2(10),'
from
        all_objects
where   rownum <= 320
;

prompt col0321 varchar2(10)
prompt )
prompt /

spool off

@temp

set pagesize 40
set feedback on

insert into t1 (col0010, col0280) values ('0010','0280');
commit;

update t1 set col0320 ='0320';
commit;

column file_no  new_value m_file_no
column block_no new_value m_block_no

select
        dbms_rowid.rowid_relative_fno(rowid)    file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        dbms_rowid.rowid_row_number(rowid)      row_no
from
        t1
;

alter system flush buffer_cache;
alter system dump datafile &m_file_no block &m_block_no;

So I’ve written one of those horrible scripts that write a script and then run it. The script creates a table with 320 columns and inserts a row that populates columns 10 and 280. That gets me two row pieces, one consisting of the 255 columns from columns 26 to 280 that goes in as row piece 0, the other consisting of the first 25 columns that goes in as row piece 1; the remaining 40 columns are not populated so Oracle “forgets” about them (“trailing nulls take no space”). The script then updates the row by setting column 320 to a non-null value.

For convenience I’ve then generated the file and block number (and row number, just to show its head piece went in as row 1 rather than row 0) of the row and done a symbolic block dump. The question is: what am I going to see in that block dump ?

Answers (part 1)

Here’s an extract from the block dump from 11.2.0.4 (12.1.0.2 is similar) – though I’ve cut out a lot of lines reporting the NULL columns:


ntab=1
nrow=2
frre=-1
fsbo=0x16
fseo=0x1e54
avsp=0x1e3e
tosp=0x1f13
0xe:pti[0]      nrow=2  offs=0
0x12:pri[0]     offs=0x1e7a
0x14:pri[1]     offs=0x1e54
block_row_dump:
tab 0, row 0, @0x1e7a
tl: 49 fb: -------- lb: 0x2  cc: 40
nrid:  0x014000a7.0
col  0: *NULL*
col  1: *NULL*
col  2: *NULL*
...
col 37: *NULL*
col 38: *NULL*
col 39: *NULL*
tab 0, row 1, @0x1e54
tl: 38 fb: --H-F--- lb: 0x2  cc: 25
nrid:  0x014000a3.0
col  0: *NULL*
col  1: *NULL*
col  2: *NULL*
...
col 22: *NULL*
col 23: *NULL*
col 24: *NULL*
end_of_block_dump

The block holds two row pieces, and the piece stored as “row 1” is the starting row piece (the H in the flag byte (fb) tells us this). This row piece consists of 25 columns. The next rowpiece (identified by nrid:) is row zero in block 0x014000a3 – that’s block 163 of file 5 – which is the same block as the first row piece. When we look at row zero we see that it holds 40 columns, all null; it’s pointing to a third row piece at row zero in block 0x014000a7 (file 5, block 167), and as corroboration we can also see that the flag byte has no bits set and that tells us that this is just a boring “somewhere in the middle” bit. So it looks like we have to follow the pointer to find the last 255 columns of the table. So let’s take a look at the dump of file 5 block 167:


fsbo=0x14
fseo=0x1e76
avsp=0x1e62
tosp=0x1e62
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1e76
block_row_dump:
tab 0, row 0, @0x1e76
tl: 266 fb: -----L-- lb: 0x1  cc: 255
col  0: *NULL*
col  1: *NULL*
col  2: *NULL*
...
col 251: *NULL*
col 252: *NULL*
col 253: *NULL*
col 254: [ 4]  30 33 32 30
end_of_block_dump

Take note of the L in the flag byte – that tells us that we’re looking at the last row piece of a multi-piece row. It’s that last 255 columns we were looking for. The mechanics have worked as follows

  • On the simple insert Oracle split the used 280 columns into (25, 255)
  • On the update we grew the used column count from 280 to 320, adding 40 columns. Oracle extended the 255 column row piece to 295, then split it (40, 255) leaving 40 in the original block and migrating the 255 to a new block.

So a row that could be two pieces in one block is now three pieces spread over two blocks; and there’s worse to come. Go back to the original block dump and check the used space. A good first approximation would be to check the “tl:” (total length) value for each row – this gives you: 49 + 38 bytes; add on a couple of hundred for the general block overhead and stuff like the transaction table and you find you’ve used less than 300 bytes in the block. But I’ve got a little procedure (I published this version of it some time ago) to check for free and used space – and this is what it said about the (ASSM) segment that holds this table:


Unformatted                   :           44 /          360,448
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            0 /                0
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           15 /          122,880
Full                          :            1 /            8,192

Take particular note of the “Full” block at the end of the report – that’s the block where we’ve used up rather less than 300 bytes. In fact if you look again at the first block dump you’ll see the avsp (available space) and tosp (total space) figures of 0x1e3e and 0x1f13 bytes (7,742 and 7,955 bytes). There’s loads of space in the block – but the block is marked in the bitmap space management map as full. That’s really bad news.
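A report in this shape is (presumably) built around the dbms_space.space_usage call – the procedure I published simply wraps a call like the following with some formatting. Here’s a minimal sketch of that call for the t1 segment (owner and name are placeholders to adjust):


declare
        m_unf_blocks    number;         m_unf_bytes     number;
        m_fs1_blocks    number;         m_fs1_bytes     number;
        m_fs2_blocks    number;         m_fs2_bytes     number;
        m_fs3_blocks    number;         m_fs3_bytes     number;
        m_fs4_blocks    number;         m_fs4_bytes     number;
        m_full_blocks   number;         m_full_bytes    number;
begin
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unf_blocks,
                unformatted_bytes       => m_unf_bytes,
                fs1_blocks              => m_fs1_blocks,  fs1_bytes  => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,  fs2_bytes  => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,  fs3_bytes  => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,  fs4_bytes  => m_fs4_bytes,
                full_blocks             => m_full_blocks, full_bytes => m_full_bytes
        );

        dbms_output.put_line('Unformatted : ' || m_unf_blocks  || ' / ' || m_unf_bytes);
        dbms_output.put_line('Freespace 1 : ' || m_fs1_blocks  || ' / ' || m_fs1_bytes);
        dbms_output.put_line('Freespace 2 : ' || m_fs2_blocks  || ' / ' || m_fs2_bytes);
        dbms_output.put_line('Freespace 3 : ' || m_fs3_blocks  || ' / ' || m_fs3_bytes);
        dbms_output.put_line('Freespace 4 : ' || m_fs4_blocks  || ' / ' || m_fs4_bytes);
        dbms_output.put_line('Full        : ' || m_full_blocks || ' / ' || m_full_bytes);
end;
/
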

On the plus side 12.2 behaves differently, as noted by Sayan in his blog note. We still get the third row piece, but it’s in the same block as the first two and the block isn’t marked as full in the bitmap.

And there’s still more to come – but it will have to wait a little longer.

 

May 19, 2017

255 columns

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 5:49 pm BST May 19,2017

This is one of my “black hole” articles – I drafted it six months ago, but forgot to publish it.

A recent post on OTN highlighted some of the interesting oddities that appear when you create tables with more than 255 columns. In fact this was a more subtle case than usual because it reminded us that it’s possible to have a partitioned table which appears to have less than the critical 255 columns while actually having more than 255 columns thanks to the anomaly of how Oracle handles dropping columns in a partitioned table.  (For a useful insight see this note from Dominic Brooks – and for a nice thought about preparing simple tables for an exchange with such a partitioned table look at the 12.2 feature of “create table for exchange” in Maria Colgan’s recent article)

The thread took me down the path of trying to recreate some notes I wrote a long time ago and can no longer find, and the OTN problem wasn’t the basic one I had assumed anyway, but I thought I’d publish a bit of the work I had done so that you can see another of the funny effects that appear when your table definition has too many columns (and you use them).

The OP told us about a table with more than 350 columns, so here’s a little script I wrote to generate a table with 365 columns and some data. (It turned out that the OP had more than 390 columns in the table, but 30+ had been “dropped”.)


rem
rem	Script:		wide_table_2.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Nov 2016
rem	Purpose:
rem
rem	Last tested
rem		11.2.0.4
rem

create sequence s1;

declare
	m_statement_1	varchar2(32767) :=
		'create table t1(col0001 varchar2(10),';
	m_statement_2	varchar2(32767) :=
		'insert into t1 values(lpad(s1.nextval,10),';
begin
	for i in 2 .. 365 loop
		m_statement_1 := m_statement_1 ||
			'col' || to_char(i,'FM0000') || ' varchar2(100),'
		;
		m_statement_2 := m_statement_2 ||
			case when i in (2,3,4)
			-- case when i in (122,123,124)
			-- case when i in (262,263,264)
				then 'dbms_random.string(''U'',ceil(dbms_random.value(0,100))),'
			     when i = 365
				then 'lpad(s1.currval,7))'
				else '''COL' || to_char(i,'FM0000') || ''','
			end
		;
	end loop;

	m_statement_1 := substr(m_statement_1, 1, length(m_statement_1) - 1);
	m_statement_1 := m_statement_1 || ') pctfree 25';

	execute immediate m_statement_1;

	for i in 1..10000 loop
		execute immediate m_statement_2;
	end loop;

end;
/

I’ve taken a fairly simple approach to building a string that creates a table – and it’s easy to adjust the number of columns – and a string to insert some values into that table. The insert statement will insert a row number into the first and last columns of the table and generate a random length string for a few of the columns. I’ve picked three possible sets of three columns for the random length string; one set is definitely going to be in the first row piece, one set is definitely going to be in the last row piece, and (since the row will split 110/255) one will be somewhere inside whichever is the larger row piece.

If I wanted to do something more sophisticated I’d probably have to switch to a PL/SQL array for the two statement strings – 32,767 characters doesn’t give me much freedom to play if I wanted to test a table with 1,000 columns.

Having created and populated my table, I performed the following three tests on it:


analyze table t1 compute statistics;

prompt	====
prompt	CTAS
prompt	====

create table t1a pctfree 25 as select * from t1;
analyze table t1a compute statistics;

select	table_name, num_rows, avg_row_len, blocks, chain_cnt
from	user_tables
where	table_name like 'T1%'
;

prompt	======
prompt	Insert
prompt	======

truncate table t1a;
insert into t1a select * from t1;
analyze table t1a compute statistics;

select	table_name, num_rows, avg_row_len, blocks, chain_cnt
from	user_tables
where	table_name like 'T1%'
;

prompt	=============
prompt	Insert append
prompt	=============

truncate table t1a;
insert /*+ append */ into t1a select * from t1;
analyze table t1a compute statistics;

select	table_name, num_rows, avg_row_len, blocks, chain_cnt
from	user_tables
where	table_name like 'T1%'
;

The first test creates a new table (t1a, at pctfree 25, matching the original) copying the original table with a simple “create as select”.

The second test truncates this table and does a basic “insert as select” to repopulate it.

Third test truncates the table again and does an “insert as select” with the /*+ append */ hint to repopulate it.

In all three cases (and with three variations of where the longer random strings went) I used the analyze command to gather stats on the tables so that I could get a count of the number of chained rows; and I dumped a couple of blocks from the tables to see what the inserted rows looked like.
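For completeness, here’s a hedged sketch of the chained-row listing that I’ll come back to further down – utlchain.sql creates the chained_rows table that analyze then populates with the head rowids:


@?/rdbms/admin/utlchain.sql

analyze table t1a list chained rows into chained_rows;

select count(*) from chained_rows;
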

Here’s a summary of the results from 11.2.0.4 when the random-length columns are near the start of the row (the position didn’t really affect the outcome, and the results for 12.1.0.2 and 12.2.0.1 were very similar):

====
CTAS
====

TABLE_NAME             NUM_ROWS AVG_ROW_LEN     BLOCKS  CHAIN_CNT
-------------------- ---------- ----------- ---------- ----------
T1                        10000        3062       6676       3313
T1A                       10000        3062       9504        237

======
Insert
======

TABLE_NAME             NUM_ROWS AVG_ROW_LEN     BLOCKS  CHAIN_CNT
-------------------- ---------- ----------- ---------- ----------
T1                        10000        3062       6676       3313
T1A                       10000        3062       6676       3287

=============
Insert append
=============

TABLE_NAME             NUM_ROWS AVG_ROW_LEN     BLOCKS  CHAIN_CNT
-------------------- ---------- ----------- ---------- ----------
T1                        10000        3062       6676       3313
T1A                       10000        3062       9504        237

As you can see we get two significantly different results: the CTAS and the “insert append” produce tables reporting 9,504 blocks and 237 chained rows, while the original table (single row inserts) and the regular “insert as select” produce tables with 6,676 blocks and roughly 3,300 chained rows. It seems that the CTAS (which would also cover “alter table move”) and direct path insert have minimised the number of chained rows at a cost of a dramatically increased number of blocks. (The scale of the difference happens to be particularly extreme in this case – I didn’t do this deliberately it was simply a consequence of the way I happened to generate the data and the length of the rows.)

We know, of course, that every row in this table will consist of two row pieces, one of 110 columns and one of 255 columns; so every row is in some respects chained due to the potential for intra-block chaining of those two pieces, but the analyze command reports only inter-block chaining i.e. only those rows that start in one block and end in another block – intra-block chaining doesn’t count as “proper” chaining (at least in this version of Oracle).

There are two questions to address in these results: the first is “What’s happening?”, the second, which we ask when we get the answer to the first, is “How come the direct path method still gives us some chained rows?”

I believe the answer to the first question is that the direct path method attempts to avoid chaining unavoidable row-pieces. Even if it means leaving a huge amount of empty space in a block Oracle starts a new row in a new block if there isn’t enough space for both of the anticipated row-pieces to fit in the current block. I think this may be a feature to help Exadata and its use of direct path reads for smart scans, where a relatively small number of chained rows (which might be outside the current Exadata storage unit – and even in the disk space managed by another cell server) could have a catastrophic impact on performance because the system would have to do a single block read to pick up the extra piece.

So why do some rows still see chaining under this strategy – I think it’s because there’s a small error in the arithmetic somewhere (possibly visible only in ASSM tablespaces, perhaps related to row-piece headers) where Oracle thinks there’s enough space for both row pieces but there isn’t quite so it tries to insert both pieces and has to chain the second one. (I’ll corroborate this hypothesis with some analysis of block dumps in a moment).

So you have a choice – lots of wasted space and a little row-chaining, or maximum packing of data and (potentially) lots of row-chaining. But there’s more: I’ve said we get one row piece of 110 columns and one of 255 columns for each row, but the point at which the split occurs and the order in which the pieces are inserted depends on the method used.

  • Single row inserts (initial table creation, typical OLTP processing): The split occurred at column 111 – so the leading 110 columns are in one row piece and the trailing 255 columns are in the other – and the row piece with the trailing columns is inserted first.
  • Array inserts (normal): Exactly the same as the single row inserts.
  • Direct path inserts / CTAS: The split occurred at column 256, with the leading 255 column row-piece inserted first and the trailing 110 column row-piece inserted second.

I’m not sure that this particular detail matters very much in normal circumstances when you consider the dramatic difference in size that can appear in the comparison between direct path and normal inserts, but maybe there’s someone who will notice a performance (or even space) side effect because of this inconsistency. We will see in a later post, though, that this difference can have an enormous impact if you subsequently add columns to the table and populate them.

I said I’d come back to the row-chaining anomaly. One of the little details that I didn’t include in my code listing was the call to “analyze table report chained rows” that I did (after executing $ORACLE_HOME/rdbms/admin/utlchain.sql) to list the head rowids of the chained rows into the chained_rows table. After doing this I ran a simple pl/sql loop to dump all the relevant blocks to the trace file:

begin
	for r in (
		select
			dbms_rowid.rowid_relative_fno(head_rowid) file#,
			dbms_rowid.rowid_block_number(head_rowid) block#
			from	chained_rows
		) loop
			execute immediate 'alter system dump datafile ' || r.file# || ' block ' || r.block#;
	end loop;
end;
/

Here’s a little extract from the resulting trace file showing you what the start of a row piece looks like when dumped:

tab 0, row 0, @0x1765
tl: 2075 fb: --H-F--- lb: 0x0  cc: 255
nrid:  0x01401bc4.1
col  0: [10]  20 20 20 20 20 20 20 33 36 32
col  1: [19]  41 52 43 4a 4a 42 4e 55 46 4b 48 4c 45 47 4c 58 4c 4e 56
col  2: [ 8]  59 4b 51 46 4a 50 53 55
col  3: [17]  53 58 59 4e 4a 49 54 4a 41 5a 5a 51 44 44 4b 58 4d
col  4: [ 7]  43 4f 4c 30 30 30 35
col  5: [ 7]  43 4f 4c 30 30 30 36

A convenient thing to check is the cc: (“column count”) entry (end of 2nd line). You can see that this row piece has 255 columns, and if you look at the first six columns dumped you can see that it’s the row numbered 362 (33 36 32), then there are three columns of different length strings, then two columns with the values ‘COL0005’ and ‘COL0006’ respectively. It’s the “cc:” entry that’s useful though, so I’m going to do a bit of simple unix hackery:

grep " cc: " test_ora_24398.trc | sed "s/^.*cc: //"  | sort | uniq -c | sort -n
      1 1
      5 2
    112 108
    125 109
    237 110
    474 255

In my 237 blocks with chained rows I had 474 row pieces of 255 columns and 237 row pieces of 110 columns; then I had 125 row pieces that had lost (and therefore chained) one column and 112 row pieces that had lost and therefore chained 2 columns. I also had a couple of small “tail-end” pieces from earlier blocks scattered in these blocks. These figures suggest that there’s a small error (actually no more than about 20 bytes) in the calculation Oracle does to decide if it can fit a whole row into the current block or whether it has to go on to the next empty block.

Conclusions

When copying a table defined with more than 255 columns there’s the potential for a huge variation in the space usage and chain count depending on whether you do a CTAS (or insert /*+ append */) or a simple insert. You have to decide which option is the biggest threat to your available resources.

There is a little anomaly with the way in which rows are split that is also dependent on the method used for copying – this may also have some effect, though perhaps small enough to be ignored when compared with the space/chaining difference as far as ordinary OLTP processing is concerned. But there are some important side effects we will consider in a later post.

Even though CTAS/direct path insert can eliminate a lot of row chaining it is still possible to find some row chaining in the resulting data. This may be the result of a calculation error (or possibly a deliberate space saving compromise).

Note that any comments about using CTAS to copy a table also apply to “alter table move” and to using expdp/impdp.

 

April 21, 2017

Undo Understood

Filed under: Infrastructure,Oracle,Read Consistency,Troubleshooting,undo — Jonathan Lewis @ 10:45 am BST Apr 21,2017

It’s hard to understand all the ramifications of Oracle’s undo handling, and it’s not hard to find cases where the resulting effects are very confusing. A recent post on the OTN database forum resulted in one response insisting that the OP was obviously updating a table with frequent commits from one session while querying it from another, thereby generating a large number of undo reads in the querying session.

It’s a possible cause of the symptoms that had been described – although not the only possible cause, especially since the symptoms hadn’t been described completely. It’s actually possible to see this type of activity when there are no updates and no outstanding commits taking place at all on the target table. Unfortunately it’s quite hard to demonstrate this with a quick, simple, script in recent versions of Oracle unless you do some insanely stupid things to make the problem appear – but I know how to do “insanely stupid” in Oracle, so here we go; first some data creation:

rem
rem     Script:         undo_rec_apply_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2017
rem

create table t2(v1 varchar2(100));
insert into t2 values(rpad('x',100));
commit;

create table t1
nologging
pctfree 99 pctused 1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        cast(rownum as number(8,0))                     id,
        cast(lpad(rownum,10,'0') as varchar2(10))       v1,
        cast(lpad('x',100,'x') as varchar2(100))        padding
from
        generator       v1,
        generator       v2
where
        rownum <= 8e4 -- > comment to bypass WordPress formatting issue
;

alter table t1 add constraint t1_pk primary key(id)
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

The t2 table is there as a target for a large number of updates from a session other than the one demonstrating the problem. The t1 table has been defined and populated in a way that puts one row into each of 80,000 blocks (though, with ASSM and my specific tablespace definition of uniform 1MB extents, the total space is about 80,400 blocks). I’ve got a primary key declaration that allows me to pick single rows/blocks from the table if I want to.
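
If you want to convince yourself that the rows really are spread one per block, a quick sketch of a check (ignoring the file number, which is fine if the tablespace has a single data file):

select
        count(distinct dbms_rowid.rowid_block_number(rowid))   blocks_used,
        count(*)                                                row_count
from
        t1
;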

At this point I’m going to do a lot of updates to the main table using a very inefficient strategy to emulate the type of thing that can happen on a very large table with lots of random updates and many indexes to maintain:


begin
        for i in 1..800 loop
                update t1 set v1 = upper(v1) where id = 100 * i;
                execute immediate 'alter system switch logfile';
                execute immediate 'alter system flush buffer_cache';
                commit;
                dbms_lock.sleep(0.01);
        end loop;
end;
/

set transaction read only;

I’m updating every 100th row/block in the table with single row commits, but before each commit I’m switching log files and flushing the buffer cache.

This is NOT an experiment to try on a production system, or even a development system if there are lots of busy developers or testers around – and if you’re running your dev/test in archivelog mode (which, for some of your systems, you should be) you’re going to end up with a lot of archived redo logs. I have to do the log switch to ensure that the updated blocks are unpinned so that they will be written to disc and flushed from the cache by the buffer cache flush. (This extreme approach would not have been necessary in earlier versions of Oracle, but the clever developers at Oracle Corp. keep adding “damage limitation” touches to the code that I have to work around to create small tests.) Because the block has been flushed from memory before the commit, the session will record a “commit cleanout failures: block lost” on each commit. By the time this loop has run to completion there will be 800 blocks from the table on disc needing a “delayed block cleanout”.

Despite the extreme brute force I use in this loop, there is a further very important detail that has to be set before this test will work (at least in 11.2.0.4, which is what I’ve used in my test runs). I had to start the database with the hidden parameter _db_cache_pre_warm set to false. If I didn’t have the database started with this feature disabled Oracle would notice that the buffer cache had a lot of empty space and would “pre-warm” the cache by loading a few thousand blocks from t1 as I updated one row – with the side effect that the update from the previous cycle of the loop would be cleaned out on the current cycle of the loop. If you do run this experiment, remember to reset the parameter and restart the instance when you’ve finished.
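
For completeness – and strictly as a test-instance-only sketch, with the usual caveat about fiddling with hidden parameters – the sort of sequence I mean is:

alter system set "_db_cache_pre_warm" = false scope = spfile;
shutdown immediate
startup

--      ... run the test ...

alter system reset "_db_cache_pre_warm" scope = spfile sid = '*';
shutdown immediate
startup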

I’ve finished this chunk of code with a call to “set transaction read only” – this emulates the start of a long-running query: it captures a point in time (through the current SCN) and any queries that run in the session from now on have to be read-consistent with that point in time. After doing this I need to use a second session to do a bit of hard work – in my case the following:

execute snap_rollstats.start_snap

begin
        for i in 1..10000 loop
                update t2 set v1 = upper(v1);
                update t2 set v1 = lower(v1);
                commit;
        end loop;
end;
/

execute snap_rollstats.end_snap

The calls to the snap_rollstats package simply read v$rollstat and give me a report of the changes in the undo segment statistics over the period of the loop. I’ve executed 10,000 transactions in the interval, which was sufficient on my system to use each undo segment header at least 1,000 times and (since there are 34 transaction table slots in each undo segment header) overwrite each transaction table slot about 30 times. You can infer from these comments that I had only 10 undo segments active at the time; your system may have many more (check the number of rows in v$rollstat), so you may want to scale up that 10,000 loop count accordingly.
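
If you don’t have the snap_rollstats package you can simply query v$rollstat before and after the loop and compare the figures – a minimal sketch:

select
        usn, xacts, gets, writes, wraps, extends
from
        v$rollstat
order by
        usn
;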

At this point, then, the only thing I’ve done since the start of my “long running query” is to update another table from another session. What happens when I do a simple count() from t1 that requires a full tablescan ?

alter system flush buffer_cache;

execute snap_filestat.start_snap
execute snap_my_stats.start_snap

select count(v1) from t1;

execute snap_my_stats.end_snap
execute snap_filestat.end_snap

I’ve flushed the buffer cache to get rid of any buffered undo blocks – again an unreasonable thing to do in production but a valid way of emulating the aging out of undo blocks that would take place in a production system – and surrounded my count() with a couple of packaged calls to report the session stats and file I/O stats due to my query. (If you’re sharing your database then the file I/O stats will be affected by the activity of other users, of course, but in my case I had a private database.)

Here are the file stats:


--------------
Datafile Stats
--------------
file#       Reads      Blocks    Avg Size   Avg Csecs     S_Reads   Avg Csecs     M_Reads   Avg Csecs    Max      Writes      Blocks   Avg Csecs    Max
File name
-----       -----      ------    --------   ---------     -------   ---------     -------   ---------    ---      ------      ------   ---------    ---
    1          17          17       1.000        .065          17        .065           0        .000      6           0           0        .000     15
/u01/app/oracle/oradata/TEST/datafile/o1_mf_system_938s4mr3_.dbf
    3         665         665       1.000        .020         665        .020           0        .000      6           0           0        .000     15
/u01/app/oracle/oradata/TEST/datafile/o1_mf_undotbs1_938s5n46_.dbf
    5         631      80,002     126.786        .000           2        .045         629        .000      6           0           0        .000     17
/u01/app/oracle/oradata/TEST/datafile/o1_mf_test_8k__cz1w7tz1_.dbf

As expected I’ve done a number of multiblock reads of my data tablespace for a total of roughly 80,000 blocks read. What you may not have expected is that I’ve done 665 single block reads of the undo tablespace.

What have I been doing with all those undo blocks ? Check the session stats:


Session stats
-------------
Name                                                                     Value
----                                                                     -----
transaction tables consistent reads - undo records applied              10,014
transaction tables consistent read rollbacks                                10

We’ve been reading undo blocks so that we can create read-consistent copies of the 10 undo segment headers that were active in my instance. We haven’t (and you’ll have to trust me on this, I can’t show you the stats that aren’t there!) reported any “data blocks consistent reads – undo records applied”.
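
If you don’t have my snapshot packages, the underlying figures come straight from v$mystat joined to v$statname – a minimal sketch, run (before and after the query) in the session doing the tablescan:

select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     (
            sn.name like 'transaction tables consistent%'
         or sn.name like 'data blocks consistent%'
        )
;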

If you want to see a detailed explanation of what has happened you’ll need to read Oracle Core (UK source), chapter 3 (and possibly chapter 2 to warm yourself up for the topic). In outline the following type of thing happens:

  • Oracle gets to the first block updated in t1 and sees that there’s an ITL (interested transaction list) entry that hasn’t been marked as committed (we flushed the block from memory before the commit cleanout could take place so the relevant transaction is, apparently, still running and the row is still marked as locked).
  • Let’s say the ITL entry says the transaction was for undo segment 34, transaction table slot 11, sequence 999. Oracle reads the undo segment header block for undo segment 34 and checks transaction table slot 11, which is now at sequence 1032. Oracle can infer from this that the transaction that updated the table has committed – but can’t yet know whether it committed before or after the start of our “long running query”.
  • Somehow Oracle has to get slot 11 back to sequence 999 so that it can check the commit SCN recorded in the slot at that sequence number. This is where we see “undo records applied” to make the “transaction table read consistent”. It can do this because the undo segment header has a “transaction control” section that records some details of the most recent transaction started in that segment. When a transaction starts it updates this information, but it saves the old version of the transaction control and the previous version of its transaction table slot in its first undo record; consequently Oracle can clone the undo segment header block, identify the most recent transaction, find its first undo record and apply it to unwind the transaction table information. As it does so it also winds the transaction control section backwards one step, so it can use that (older) version to go back another step … and so on, until it has taken the cloned undo segment header so far back that our transaction table slot is back at sequence 999 – and the job is done: we can now check the actual commit SCN. (Or, if we’re unlucky, we might receive an ORA-01555 before we get there.)

So – no changes to the t1 table during the query, but lots of undo records read because OTHER tables have been changing.

Footnote:

In my example the tablescan used direct path reads – so the blocks that went through delayed block cleanout were in private memory, which means they weren’t in the buffer cache and didn’t get written out to disc. When I flushed the buffer cache (again to emulate aging out of undo blocks etc.) and repeated the tablescan Oracle had to go through all that work of creating read consistent transaction tables all over again.

Footnote 2:

Despite the grand claim implied by the title I hope everyone realises that this is just one little detail of how undo handling can interfere with their expectations of performance.

 

 

March 9, 2017

Join Elimination

Filed under: CBO,Execution plans,Infrastructure,Oracle — Jonathan Lewis @ 6:39 pm BST Mar 9,2017

A question has just appeared on OTN describing a problem where code that works in 11g doesn’t work in 12c (exact versions not specified). The code in question is a C-based wrapper for some SQL, and the problem is a buffer overflow. The query supplied is as follows:


select T1.C1 from T1, T2 where T1.C1 = T2.D1;

The problem is that this works in 11g where the receiving (C) variable is declared as

char myBuffer [31];

but it doesn’t work in 12c unless the receiving variable is declared as:

char myBuffer [51];

There’s an important bit of background information that might be giving us a clue about what’s happened (although what I’m about to describe isn’t actually the problem unless the SQL provided is a simplified version of the problem SQL that is expected to display the problem). Column C1 is defined as char(30) and column D1 is defined as char(50). Here’s some sample code showing why you might need a buffer of 50+1 bytes to hold something that ought to be 30+1 bytes long. (This may be nothing to do with the anomaly described in the original posting – it’s just something I thought of when I first saw the question.)


rem     Script:         join_elimination_oddity.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2017

create table t1(
        c30     char(30) primary key
);

create table t2(
        d50     char(50) references t1
);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

explain plan for
select
        t1.c30
from
        t1, t2
where
        t1.c30 = t2.d50
;

select * from table(dbms_xplan.display(null,null,'projection'));

So we’re selecting c30 – the 30 byte character column – from t1; what do we actually get ? Here’s the plan with the projection:


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    52 |     1   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T2   |     1 |    52 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("T2"."D50" IS NOT NULL)

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - "T2"."D50"[CHARACTER,50]

Table t1 has been eliminated and the projected column is the “equivalent” column from t2 – which is too long for the expected output. To work around this problem you can disable join elimination, either by parameter (_optimizer_join_elimination_enabled = false) or by hinting /*+ no_eliminate_join(t1) */, in which case the plan (with my data) became a nested loop join from t2 to t1 with column c30 projected as expected.
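
For reference, here’s the hinted form of the demonstration query (the parameter approach would need an alter session instead):

select
        /*+ no_eliminate_join(t1) */
        t1.c30
from
        t1, t2
where
        t1.c30 = t2.d50
;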

Footnote:

Two things to note about my demonstration

  • If you’re going to create a referential integrity constraint between columns they do need to be of exactly the same type. (If you want to get technical you might say “in the same domain” – but Oracle doesn’t do domains.)
  • This extremely simple case demonstrates the problem in 11.2.0.4 as well as 12.1.0.2. Possibly a more complex query could be produced where (thanks to limitations in query transformations) 11g doesn’t spot the option for join elimination while 12c does; alternatively, a very simple two-column example in 11g won’t do join elimination while a two-column example in 12.2 can (though it doesn’t always) – so upgrading to 12.2 MIGHT cause more people to see this anomaly appearing.

Update (14/3/17)

A follow-up to the original posting has identified the source of the problem. Rather than being a side effect of join elimination, the anomaly has appeared because of the “partial join evaluation” mechanism introduced in 12c. There is a hint to disable the mechanism – /*+ no_partial_join(alias) */ – or, if you can’t get hints into place for all the necessary code, it would be possible to disable the feature completely through the use of a hidden parameter, “_optimizer_partial_join_eval” = false.  The usual rule of checking with Oracle support before setting hidden parameters applies, of course.
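
A sketch of the session-level version of that parameter change (I’m assuming it is session-modifiable, as most optimizer parameters are – and, again, check with Oracle support before using it in anger):

alter session set "_optimizer_partial_join_eval" = false;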

 

March 7, 2017

Guesswork

Filed under: Infrastructure,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 7:57 pm BST Mar 7,2017

A recent posting on the OTN database forum described a problem with an insert (as select) statement that sometimes ran extremely slowly: nothing interesting yet, there could be plenty of boring reasons for that to happen. The same SQL statement (by SQL_ID) might take 6 hours to insert 300K rows one night while taking just a few minutes to insert 900K another night (still nothing terribly interesting).

An analysis of the ASH data about the statement showed that the problem was on the “LOAD TABLE CONVENTIONAL” operation – which starts to get interesting if you also assume that someone who was competent to look at ASH would probably have noticed whether or not the time spent was on (the first obvious guess) some variant of “TX enqueue” waits or “log file” waits or something amazingly different. It’s also a little more interesting if you’ve noticed that the title of the posting is about “consuming a lot of CPU time” – so where could that be going if we see most of the excess time going on the insert rather than on the select?

Here’s an enormous clue that this might be a “non-standard” problem – the SQL statement starts like this:


INSERT  /*+ ignore_row_on_dupkey_index(tgt OTC_DAT_TV_PROC_STATUS_UK) */
INTO OTC_DAT_TV_PROC_STATUS TGT (
    {list of columns}
)
SELECT  /*+ parallel(8) */
    ...

See the hint which says “ignore rows if they raise duplicate key errors against index otc_dat_tv_proc_status_uk” ? Think what it might take to implement code that obeys the hint – Oracle can’t know that a row is duplicating a value until it has inserted the row (to get a rowid) then tried to maintain the index and discovered the pre-existing value, at which point it raises an exception, then handles it by undoing the single row insert (I wonder what that does to array inserts, and the logic of triggers), then carries on with the next row. (I suppose the code could check every relevant index before doing the insert – but that would mean a very big, and possibly very resource-intensive, change to all existing “insert a row” code.)

Surely we’re going to see some dramatic effects if a large fraction of our rows result in duplicate values.

Warning – the guess might not be right but it’s much more interesting than all the other guesses you might make at this point and a good excuse for doing a bit of experimentation and learning. So here’s a simple model to test the “excess work” hypothesis – running against 11.2.0.4:

rem
rem     Script:         ignore_dupkey.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2017
rem

drop table t2;
drop table t1;

create table t1
nologging
as
select  *
from    all_objects
where   rownum <= 50000 ; -- > comment to protect WordPress formatting

create table t2
nologging
as
select  *
from    all_objects
where   rownum <= 50000 ; -- > comment to protect WordPress formatting

update t1 set object_id = object_id + (select max(object_id) from t1)
;

commit;

create unique index t1_i1 on t1(object_id);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

All I’ve done is create two copies of the same data – with an update in place that changes one set of data so that the object_id doesn’t overlap with the object_id in the other set. The update statement is something that I’m going to include in one test but exclude from the second. Now for the statement that tests for the effects of the hint:


execute snap_events.start_snap
execute snap_my_stats.start_snap

insert
        /*+ ignore_row_on_dupkey_index(t1 (object_id)) */
        into t1
select  *
from    t2
;

execute snap_my_stats.end_snap
execute snap_events.end_snap


There are several different things I could do once I’ve got this basic test set up if I want to refine what I’m testing and isolate certain aspects of the mechanism, but all I’ll do for this note is a simple report of some figures from the two tests – one where the object_id values don’t overlap and one where t1 and t2 are exact copies of each other. The two procedures wrapping the statement are just my standard routines for capturing changes in v$mystat and v$session_event for my session.

For the insert where I’ve done the update to avoid any duplicates appearing the insert completed in about 0.3 seconds, generating 10MB of redo and 2MB of undo.

When I removed the update statement the (continuously failing) insert took 35.5 seconds to complete, of which almost all the time was CPU time. The redo jumped to 478MB with 14MB of undo. The extreme level of redo surprised me slightly, especially since the scale of the change was so much greater than that of the undo – I think it may be due to a problem with Oracle needing to unwind one row from an (internal) array insert before retrying. Here, taken from the session stats of the problem run, is a little indication of why the time (and especially the CPU time) increased so much:

Name                                                                     Value
----                                                                     -----
rollback changes - undo records applied                                 95,014
active txn count during cleanout                                        24,627
cleanout - number of ktugct calls                                       24,627
HSC Heap Segment Block Changes                                          97,509
Heap Segment Array Inserts                                              97,509
recursive calls                                                        682,574
recursive cpu usage                                                      2,193
session logical reads                                                1,341,077

Obviously there’s a lot of work done rolling back changes that should not have been made (though why it’s reported as 95,000 rather than 100,000 I don’t know, and I’d rather not do a trace of buffer activity to find out) with an associated extra load of data block activity. Most visible, though, is the huge number of recursive calls and, perhaps associated with them, a very large number of session logical reads. Clearly it’s worth enabling extended tracing to see what’s going on – if you haven’t already guessed what some of those calls are about. Below is a sketch of how you might switch the trace on, followed by an extract from the top of an appropriate tkprof output:
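
Nothing more sophisticated than the standard event is needed in the test session:

alter session set events '10046 trace name context forever, level 8';

insert
        /*+ ignore_row_on_dupkey_index(t1 (object_id)) */
        into t1
select  *
from    t2
;

alter session set events '10046 trace name context off';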


tkprof test_ora_32298.trc ignore_dupkey sort=execnt

select /*+ rule */ c.name, u.name
from
 con$ c, cdef$ cd, user$ u  where c.con# = cd.con# and cd.enabled = :1 and
  c.owner# = u.user#


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    48753      1.86       2.08          0          0          0           0
Execute  48753      3.75       4.17          0          0          0           0
Fetch    48753      2.86       3.12          0      97506          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   146259      8.47       9.38          0      97506          0           0

select o.name, u.name
from
 obj$ o, user$ u  where o.obj# = :1 and o.owner# = u.user#

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    48753      1.87       1.99          0          0          0           0
Execute  48753      3.60       3.63          0          0          0           0
Fetch    48753      7.35       7.52          0     243765          0       48753
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   146259     12.84      13.14          0     243765          0       48753

The top two statements in the trace file show Oracle first trying to find the name of the constraint that has been breached, then falling back to searching for the name of the unique index that has caused the error to appear. If I had created a unique constraint rather than just a unique index then the second of these two statements would not have appeared in the trace file (and the run would have been a little quicker – hint: constraints are a good thing).
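
As a sketch of what that would look like for my test case – backing the existing index with a declared constraint:

alter table t1 add constraint t1_uk unique(object_id) using index t1_i1;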

You’ll notice that the total number of calls from the two statements is roughly 292,500 – far short of the 682,000 reported in the session stats. Unfortunately there was nothing else in the trace files that could be blamed for the outstanding 400,000 missing calls. It’s not really necessary to chase up all the details, though; clearly we can see that this feature is very expensive if lots of duplicates appear – like DML error logging it has probably been created as a way of dealing with occasional errors when handling large volumes of data.

Footnote

Notice that my example uses the “index description” method for specifying the index in the hint; it’s legal with either the name or the description. Interestingly (but, perhaps, unsurprisingly) this is a hint that Oracle has to parse for correctness before optimisation. I made a mistake in my first attempt at writing the update statement leaving me with duplicates in the data so Oracle couldn’t create the unique index; as a consequence of the missing unique index the insert statement reported the Oracle error “ORA-38913: Index specified in the index hint is invalid”.
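
For reference, a quick sketch of the two forms (using the demo’s table and index names) – the description form that I used and the name form that the OP’s original hint used:

insert /*+ ignore_row_on_dupkey_index(t1 (object_id)) */ into t1 select * from t2;
insert /*+ ignore_row_on_dupkey_index(t1 t1_i1) */       into t1 select * from t2;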

 

February 16, 2017

Truncate 12c

Filed under: 12c,Infrastructure,Oracle — Jonathan Lewis @ 12:52 pm BST Feb 16,2017

Here’s one of those little improvements in 12c (including 12.1) that will probably end up being described as “little known features” in about 3 years time. Arguably it’s one of those little things that no-one should care about because it’s not the sort of thing you should do on a production system, but that doesn’t mean it won’t be seen in the wild.

Rather than simply state the feature I’m going to demonstrate it, starting with a little code to build a couple of tables with referential integrity:


create table parent (
        id      number(4),
        name    varchar2(10),
        constraint par_pk primary key (id)
)
;

create table child(
        id_p    number(4)
                        constraint chi_fk_par
                        references parent
                        on delete cascade,
        id      number(4),
        name    varchar2(10),
        constraint chi_pk primary key (id_p, id)
)
;

insert into parent values (1,'Smith');
insert into parent values (2,'Jones');

insert into child values(1,1,'Sally');
insert into child values(1,2,'Simon');

insert into child values(2,1,'Jack');
insert into child values(2,2,'Jill');

commit;


There’s one important detail in this code that isn’t taking the default and isn’t used very frequently – it’s the option on the foreign key to take the action “on delete cascade”. If you delete a row from the parent table then Oracle will automatically delete any referenced rows from the child table first thus avoiding the error ORA-02292: integrity constraint (TEST_USER.CHI_FK_PAR) violated – child record found. (Conveniently I have a suitable index on the child table that will bypass the problem of a mode 4 (or, where child rows already exist, mode 5) TM lock being taken on the child as the parent row is deleted.)
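
Just to demonstrate the cascading delete on its own (a throwaway sketch – roll it back before moving on to the truncate test):

delete from parent where id = 1;

select * from child;            -- only Jack and Jill (parent 2) should remain

rollback;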

And here’s the demonstration of the new feature – working in 12.1 onwards:


truncate table parent;

truncate table parent cascade;

The first command will raise Oracle error ORA-02266: unique/primary keys in table referenced by enabled foreign keys, but the second command will truncate the parent and child tables “simultaneously”: but only if the referential integrity constraint is set to “on delete cascade”. If the referential integrity constraint is left to its default action then the second command will raise error: ORA-14705: unique or primary keys referenced by enabled foreign keys in table “TEST_USER”.”CHILD”

This feature (and several broadly similar features) also works with matching partitions of equi-partitioned (or ref partitioned) tables – and that’s a context where the requirement is much more likely to appear than with non-partitioned tables.
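
I haven’t included a partitioned example here, but for a reference-partitioned parent/child pair the syntax you’d expect to use (with hypothetical table and partition names) is along the lines of:

alter table parent_part truncate partition p1 cascade;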

 
