Oracle Scratchpad

January 29, 2018

Case Study – 1

Filed under: Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 10:45 am GMT Jan 29,2018

It has been some time since I wrote an article walking through the analysis of information on an AWR report, but a nice example appeared a few weeks ago on Twitter that broke a big AWR picture into a sequence of bite-sized chunks that made a little story. So here it is, replayed in sync with my ongoing thoughts. The problem started with the (highly paraphrased) question – “How could I get these headline figures when all the ‘SQL ordered by’ sections of the report show captured SQL account for 0.0% of Total?”. The report was a 30 minute snapshot from 11.2.0.3, and here’s the Load Profile:

As you can see, the database was doing quite a lot of work as measured by the physical reads and writes, the number of block changes and size of redo, and the fact that the average CPU usage by the foreground sessions in this instance accounts for 9.5 CPUs. Generally speaking the 49 seconds per second of DB time is also a fairly good hint, when combined with the other numbers, that the instance is busy but, in the absence of any other information, that could be one session holding a blocking lock with 48 other sessions waiting for the lock to be released.

There are a couple of unusual (though not impossible) features to this profile. Given the amount of work, the figures for user calls and executes look very small – again not technically impossible, just unlikely in “normal” database processing given the apparent workload – and bear in mind that the “SQL ordered by ..” figures seem to account for none of the workload. Similarly the figures for logical reads and block changes are a little unusual (for “normal” processing) – on average this instance seems to have modified every block it visited (without any work showing up in the captured SQL).

Next in the set of images was the Time Model:

As you can see, the foreground time (DB time) is 85,944 seconds, of which foreground CPU usage (DB CPU) is 16,735 seconds – with about 69,000 seconds unaccounted for! There’s virtually no time spent on PL/SQL or SQL, and rman doesn’t even make an appearance (I mention rman specifically because there was at least one version of Oracle where the rman time was accidentally lost from this summary).

So what does the Top Timed Events look like:

It’s nice to see that this is consistent with the Load Profile: the DB CPU matches, and there’s a lot of physical reads (and a quick check says that 6,560,642/1,800 = 3,644 … which is pretty close to the 3,746 physical reads per second in the Load Profile). There’s one major anomaly here, though: the huge number of (and time spent on) latch: row cache objects waits. Even though it’s not the single largest component of time it’s the most obvious one to pursue, so the next bit of the report to check is the Dictionary Cache Statistics, with the Tablespace IO Stats and Segments by Physical Reads to follow. I don’t have an image for the dictionary cache stats, but the answer to the question “What were all the rowcache object gets for?” was: “dc_tablespaces (214,796,434)” – which (probably) told me everything I needed to know.
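For reference, the AWR’s Dictionary Cache Statistics are derived from v$rowcache, so a query like this sketch would answer the question directly on a live system (remember the dynamic view is cumulative since instance startup, where the AWR reports the delta between snapshots):

select
        parameter, gets, getmisses, modifications
from
        v$rowcache
order by
        gets desc
;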

I could show you the specific Instance Activity statistic that I wanted to see next, but I’ll postpone that for a moment and jump to the Tablespace IO Stats – which we were planning to do and might have done anyway if we hadn’t known the rocket science behind the massive number of gets on dc_tablespaces.

That’s a horrendous number of (single block) reads of the undo tablespace – and why would they be happening ? The instance is engaged in some massive rollback activity (and the transactions being rolled back are on objects in the GIRO tablespace – which is why it is also suffering a lot of single block reads) and this is the point where we jump to the relevant Instance Activity statistic to confirm the claim:

There are two other “undo records applied” statistics, but we don’t need to think about them – the match between the count of undo records applied and the rowcache gets on dc_tablespaces is close to perfect. Almost everything that this instance is doing is rolling back – there must have been some enormous data changes (possibly only one, possibly parallel-enabled) that failed in the previous half hour and now the entire mess is being cleaned up.
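If you want to check the figures on a live system, all three “undo records applied” statistics can be picked up with something like this sketch against v$sysstat (cumulative since startup, of course, where the AWR reports the interval delta):

select
        name, value
from
        v$sysstat
where
        name like '%undo records applied%'
;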

One little detail to note – the “undo records applied” per second is 122,355, but the Load Profile reported 247,885 “Block changes” per second. The two figures are consistent with each other. Each application of an undo record is two block changes – the first when you modify the source data block, the second when you update the undo record itself to flag it as “applied”: 122,355 * 2 = 244,710, which is a good match for 247,885.

Final Thoughts

There is a second reason why you could see lots of gets on dc_tablespaces – but I think it was a bug in 9i relating to temporary tablespace groups. The phenomenon as a side effect of rolling back was something I discovered in the 8i timeline and I’ve not actually checked what an AWR report really would look like if I forced a massive rollback to take place as the only workload across the interval – so here’s a quick test I constructed and ran to finish the thread off:


set timing on

create table t1 as
with g as (select rownum id from dual connect by level <= 1000)
select rownum id, rownum n1, rpad('x',150) padding from g,g
;

create index t1_i1 on t1(id);

begin
        for i in 1..1000000 loop
                update t1 set n1 = n1 where id = i;
        end loop;
end;
/

alter system flush buffer_cache;

pause take AWR snapshot from another session

rollback;

prompt take AWR snapshot from another session

On the small VM I used to run the test it took a little over 60 seconds to run the update and the same again to complete the rollback. The “DB time” shown in the Time Model section of the AWR report was 62.0 seconds, while the “sql execute elapsed time” was 3.9 seconds (which was the SQL executed while capturing the AWR data).

Conclusion

This was a problem where the database seemed to have done a lot of work that couldn’t be attributed to SQL. While I had a little knowledge of the rocket science up my sleeve that may have allowed me to identify the source more quickly and with more confidence than the average DBA, all I’ve done in this note is highlight a couple of oddities and big numbers that anyone could have spotted, and followed a couple of simple guesses:

  1. DB time was large, but sql (and pl/sql) time was non-existent
  2. Top timed events were I/O and latches – so identify the latch
  3. The specific latch pointed towards tablespaces – so check the Tablespace I/O and note the UNDO tablespace
  4. Look at any Instance Activity statistics about “undo”.
  5. Job done – but a quick test case is the icing on the cake.

 

January 25, 2018

gc buffer busy

Filed under: Oracle,RAC,Troubleshooting — Jonathan Lewis @ 2:12 pm GMT Jan 25,2018

I had to write this post because I can never remember which way round Oracle named the two versions of gc buffer busy when it started identifying them separately. There are two scenarios to cover when my session wants my instance to acquire a global cache lock on a block and some other session is already trying to acquire that lock or is holding it in an incompatible fashion:

  • The other session is in my instance
  • The other session is in a remote instance

One of these cases is reported as “gc buffer busy acquire”, the other as a “gc buffer busy release” and I always have to check which is which. I think I usually get it right first time when I see it but I always manage to convince myself that I might have got it wrong and end up searching the internet for Riyaj Shamsudeen’s blog posting about it.

The “release” is waiting for another instance to surrender the lock to my instance; the “acquire” is waiting for another session in my instance to finish acquiring the lock from the other instance.

I decided to jot down this note so I didn’t have to keep searching for Riyaj’s post, and also because a little problem on OTN at the moment showed a couple of AWR reports with an unlikely combination of acquire waits (180,000,000) and release waits (2,000).

If you’re wondering why this looks odd: if I’m waiting for an acquire, someone else in my instance must be waiting for a release. Obviously many sessions could be waiting for one release, and if acquirers time out very rapidly (though they’re not reported as doing so) then the ratio could get very high – but 90,000 acquires per release doesn’t look right.
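For reference, a quick way to compare the two figures on a live system is a sketch like this against v$system_event (cumulative since instance startup, unlike the AWR interval figures):

select
        event, total_waits, time_waited
from
        v$system_event
where
        event in ('gc buffer busy acquire', 'gc buffer busy release')
;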

 

January 15, 2018

Histogram Hassle

Filed under: Histograms,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 1:01 pm GMT Jan 15,2018

I came across a simple performance problem recently that ended up highlighting a problem with the 12c hybrid histogram algorithm. It was a problem that I had mentioned in passing a few years ago, but only in the context of Top-N histograms and without paying attention to the consequences. In fact I should have noticed the same threat in a recent article by Maria Colgan that mentioned the problems introduced in 12c by the option “for all columns size repeat”.

So here’s the context (note – all numbers used in this example are approximations to make the arithmetic obvious).  The client had a query with a predicate like the following:

    t4.columnA = :b1
and t6.columnB = :b2

The optimizer was choosing to drive the query through an indexed access path into t6, which returned ca. 1,000,000 rows before joining (two tables later) to t4, at which point all but a couple of rows remained – typical execution time was in the order of tens of minutes. A hint /*+ leading(t4) */ to start the join order with t4 meant that Oracle used an index to acquire two starting rows and reduced the response time to the classic “sub-second”.

The problem had arisen because the optimizer had estimated a cardinality of 2 rows for the index on t6 and the reason for this was that, on average, that was the correct number. There were 2,000,000 rows in the table with 1,000,000 distinct values. It was just very unlucky that one of the values appeared 1,000,000 times and that was the value the users always wanted to query – and there was no histogram on the column to tell the optimizer that there was a massive skew in the data distribution.

Problem solved – all I had to do was set a table preference for this table to add a histogram to this column and gather stats. Since there were so many distinct values and so much “non-popular” data in the table the optimizer should end up with a hybrid histogram that would highlight this value. I left instructions for the required test and waited for the email telling me that my suggestion was brilliant and the results were fantastic … I got an email telling me it hadn’t worked.
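For reference, the preference I had in mind was something along the lines of the following sketch – the owner and table names are, of course, placeholders matching the model below:

begin
        dbms_stats.set_table_prefs(
                ownname => 'TEST_USER',
                tabname => 'T1',
                pname   => 'METHOD_OPT',
                pvalue  => 'for all columns size 1 for columns bad_col size 254'
        );
end;
/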

Here’s a model of the situation – I’ve created a table with 2 million rows and a column where every other row contains the same value but otherwise contains the rownum. Because the client code was using a varchar2() column I’ve done the same here, converting the numbers to character strings left-padded with zeros. There are a few rows (about 20) where the column value is higher than the very popular value.


rem
rem     Script:         histogram_problem_12c.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2018
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 2e4
)
select
        rownum  as id,
        case
                when mod(rownum,2) = 0
                        then '999960'
                        else lpad(rownum,6,'0')
        end     as bad_col
from
        generator       v1,
        generator       v2
where
        rownum <= 2e6    -- typing error, ends up with 2 rows per non-popular value.
;

Having created the data I’m going to create a histogram on the bad_col – specifying 254 buckets – then query user_tab_histograms for the resulting histogram (from which I’ll delete a huge chunk of boring rows in the middle):


begin

        dbms_stats.gather_table_stats(
                ownname         => 'TEST_USER',
                tabname         => 'T1',
                method_opt      => 'for columns bad_col size 254'
        );

end;
/

select
        column_name, histogram, sample_size
from
        user_tab_columns
where
        table_name = 'T1'
;

column end_av format a12

select
        endpoint_number         end_pt,
        to_char(endpoint_value,'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') end_val,
        endpoint_actual_value   end_av,
        endpoint_repeat_count   end_rpt
from
        user_tab_histograms
where
        table_name = 'T1'
and     column_name = 'BAD_COL'
order by
        endpoint_number
;

COLUMN_NAME          HISTOGRAM             Sample
-------------------- --------------- ------------
BAD_COL              HYBRID                 5,513
ID                   NONE               2,000,000

    END_PT END_VAL                         END_AV          END_RPT
---------- ------------------------------- ------------ ----------
         1  303030303031001f0fe211e0800000 000001                1
        12  3030383938311550648a5e3d200000 008981                1
        23  303135323034f8f5cbccd2b4a00000 015205                1
        33  3032333035311c91ae91eb54000000 023051                1
        44  303239373236f60586ef3a0ae00000 029727                1
...
      2685  3938343731391ba0f38234fde00000 984719                1
      2695  39393235303309023378c0a1400000 992503                1
      2704  3939373537370c2db4ae83e2000000 997577                1
      5513  393939393938f86f9b35437a800000 999999                1

254 rows selected.

So we have a hybrid histogram, we’ve sampled 5,513 rows to build the histogram, we have 254 buckets in the histogram report, and the final row in the histogram is end point 5513 (matching the sample size). The first row of the histogram shows us the (real) low value in the column and the last row of the histogram reports the (real) high value. But there’s something very odd about the histogram – we know that ‘999960’ is the one popular value, occurring 50% of the time in the data, but it doesn’t appear in the histogram at all.

Looking more closely we see that every bucket covers a range of about 11 (sometimes 9 or 10) rows from the sample, and the highest value in each bucket appears just once; but the last bucket covers 2,809 rows from the sample with the highest value in the bucket appearing just once. We expect a hybrid histogram to have buckets which (at least initially) are all roughly the same size – i.e. “sample size”/”number of buckets” – with some buckets being larger by something like the amount that appears in their repeat count, so it doesn’t seem right that we have an enormous bucket with a repeat count of just 1. Something is broken.

The problem is that the sample didn’t find the low and high values for the column – although the initial full tablescan did, of course – so Oracle has “injected” the low and high values into the histogram by fiddling with the contents of the first and last buckets. At the bottom end of the histogram this hasn’t really caused any problems (in our case), but at the top end it has taken the big bucket for our very popular ‘999960’ and apparently simply replaced the value with the high value of ‘999999’ and a repeat count of 1.

As an indication of the truth of this claim, here are the last few rows of the histogram if I repeat the experiment but, before gathering the histogram, delete the rows where bad_col is greater than ‘999960’. (Oracle’s sample is random, of course, and has changed slightly for this run.)

    END_PT END_VAL                         END_AV          END_RPT
---------- ------------------------------- ------------ ----------
...
      2641  3938373731371650183cf7a0a00000 987717                1
      2652  3939353032310e65c1acf984a00000 995021                1
      2661  393938393433125319cc9f5ba00000 998943                1
      5426  393939393630078c23b063cf600000 999960             2764

Similarly, if I inserted a few hundred rows with a higher value than my popular value (in this case I thought 500 rows would be a fairly safe bet as the sample was about one in 360 rows) I got a histogram which kept a bucket for the popular value, so the problem of that bucket being hacked to the high value was no longer significant:


    END_PT END_VAL                         END_AV          END_RPT
---------- ------------------------------- ------------ ----------
...
      2718  393736313130fe68d8cfd6e4000000 976111                1
      2729  393836373630ebfe9c2b7b94c00000 986761                1
      2740  39393330323515efa3c99771600000 993025                1
      5495  393939393630078c23b063cf600000 999960             2747
      5497  393939393938f86f9b35437a800000 999999                1

Bottom line, then: if you have an important popular value in a column and there aren’t very many rows with a higher value, you may find that Oracle loses sight of the popular value as it fudges the column’s high value into the final bucket.

Workaround

I did consider writing a bit of PL/SQL for the client to fake a realistic frequency histogram, but decided that that wouldn’t be particularly friendly to future DBAs who might have to cope with changes. Luckily the site doesn’t gather stats using the automatic scheduler job and only rarely updates stats anyway, so I suggested we create a histogram on the column using an estimate_percent of 100. This took about 8 minutes to run – for reasons that I will go into in a moment – after which I suggested we lock stats on the table and document the fact that when stats are collected on this table it’s got to be a two-pass job – the normal gather with its auto_sample_size to start with, then a 100% sample for this column to gather the histogram:


begin
        dbms_stats.gather_table_stats(
                user,
                't1',
                method_opt       => 'for columns bad_col size 254',
                estimate_percent => 100,
                cascade          => false
        );
end;
/

    END_PT END_VAL                         END_AV          END_RPT
---------- ------------------------------- ------------ ----------
...
       125  39363839393911e01d15b75c600000 968999                0
       126  393834373530e98510b6f19a000000 984751                0
       253  393939393630078c23b063cf600000 999960                0
       254  393939393938f86f9b35437a800000 999999                0

129 rows selected.

This took a lot longer, of course, and produced an old-style height-balanced histogram. Part of the time came from the increased volume of data that had to be processed, part of it came from a surprise (which also appeared, in a different guise, in the code that created the original hybrid histogram).

I had specifically chosen the method_opt to gather for nothing but the single column. In fact whether I forced the “legacy” (height-balanced) code or the modern (hybrid) code, I got a full tablescan that did some processing of EVERY column in the table and then threw most of the results away. Here are fragments of the SQL – old version first:


select /*+
            no_parallel(t) no_parallel_index(t) dbms_stats
            cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
            xmlindex_sel_idx_tbl no_substrb_pad
       */
       count(*),
       count("ID"), sum(sys_op_opnsize("ID")),
       count("BAD_COL"), sum(sys_op_opnsize("BAD_COL"))
       ...
from
       "TEST_USER"."T1" t

select /*+
           full(t)    no_parallel(t) no_parallel_index(t) dbms_stats
           cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
           xmlindex_sel_idx_tbl no_substrb_pad
       */
       to_char(count("ID")),
       to_char(count("BAD_COL")),
       substrb(dump(min("BAD_COL"),16,0,64),1,240),
       substrb(dump(max("BAD_COL"),16,0,64),1,240),
       ...
       count(rowidtochar(rowid))
from
       "TEST_USER"."T1" t  /* ACL,TOPN,NIL,NIL,RWID,U,U254U*/

  • The new code only used the substrb() functions on the bad_col, but all other columns in the table were still subject to the to_char(count()).
  • The old code applied the count() and sys_op_opnsize() to every column in the table.

This initial scan was a bit expensive – and disappointing – for the client since their table had 290 columns (which means intra-block chaining as a minimum) and had been updated so much that 45% of the rows in the table had to be “continued fetches”. I can’t think why every column had to be processed like this, but if they hadn’t been that would have saved a lot of CPU and I/O since the client’s critical column was very near the start of the table.

Finally

This problem with the popular value going missing is a known issue, for which there is a bug number, but there is further work going on in the same area which means this particular detail is being rolled into another bug fix. More news when it becomes available.

Bear in mind that this problem also appears for Top-N (aka Top-Frequency) histograms – where both the lowest and highest buckets may be replaced with a bucket that reports the low-value and high-value for the column with a repeat-count of 1.

Update (Jan 2018)

This is now fixed under bug number “25994960: CARDINALITY MISESTIMATE FROM HYBRID HISTOGRAM”, with a patch (of the same number) for 12.1.0.2.

January 11, 2018

ASSM tangle

Filed under: ASSM,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 5:35 pm GMT Jan 11,2018

Here’s a follow-on from Tuesday’s (serious) note about a bug in 12.1.0.2 that introduces random slowdown on large-scale inserts. The threat in this note, while real and potentially a nuisance, is much less likely to become visible because it depends on you doing something that you probably shouldn’t be doing.

There have always been problems with ASSM and large-scale deletes – when should Oracle mark a block as having free space on deletion? If your session does it immediately then other sessions will start trying to use free space that isn’t really there until you commit; if your session doesn’t do it immediately, when can it happen (you won’t want it done on commit)? Deferring it means the segment could “lose” a lot of free space if something doesn’t come along in a timely fashion to tidy up.

But here’s a quirky problem that takes things one step further. What happens if you try to delete a load of data and fail and your session rolls back? If we start with yesterday’s script (running on 11.2.0.4 or 12.2.0.1) we can create a table with 1M rows in it and the following space usage:


Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       41,666 /      341,327,872

You will recall that each “Full” block actually had the basic 10% free space, plus a couple of hundred extra bytes which Oracle had to “forget about” because the incoming rows were always 290 bytes long. Let’s take this table and delete the first 100,000 rows, then emulate a session error and roll back, and then check the space usage:


delete from t1 where rownum <= 100000;
rollback;

-- generate space usage report

Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :        4,167 /       34,136,064
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       37,499 /      307,191,808

We have 4,167 blocks which were full, and we know they are effectively full for the purposes of our data, but they’re now declared as having some free space. When Oracle rolled back the delete it wasn’t running code that would attempt to discover that the block was going to go over the limit; it simply calculated the byte change from re-inserting the row, applied it to the total free space (tosp) and produced a number that hadn’t reached the limit set by pctfree – so it flagged the block accordingly. (Remember my comment in the earlier article that Oracle doesn’t generate undo for the state changes on the Level 1 bitmap blocks – this is, at least in part, a consequence of that strategy.)

Guideline

If you’re going to do large-scale deletes in an ASSM environment, make sure they don’t fail or subsequent inserts may take a long time.

 

January 9, 2018

ASSM argh!

Filed under: 12c,ASSM,Bugs,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 5:53 pm GMT Jan 9,2018

Here’s a problem with ASSM that used to exist in older versions of Oracle, had disappeared by 11.2.0.4, and then re-appeared in 12.1.0.2 – disappearing again by 12.2.0.1. It showed up on MoS a few days ago under the heading: “Insert is running long with more waits on db file sequential read”.

The obvious response to this heading is to question the number of indexes on the table – because big tables with lots of indexes tend to give you lots of random I/O as Oracle maintains the indexes – but this table had no indexes. The owner of the problem supplied several bits of information in the initial post, with further material in response to follow-up questions, including the tkprof summary of the 10046/level 12 trace of the insert and two extracts from the trace file to show us some of the “db file sequential read” waits – the first extract made me wonder if there might be some issue involving 16KB blocks but the second one dispelled that illusion.

There are several buggy things that can appear with ASSM and large-scale DML operations, and sometimes the problems can appear long after the original operation did the dirty deed, so I thought I’d create a simple model based on the information supplied to date – and discovered what the problem (probably) was. Here’s how it starts – I’ve created a tablespace using ASSM, and in this tablespace I’ve created a table which has 48 columns with a row length of 290 bytes (roughly matching the OP’s table), and I’ve hacked out a simple PL/SQL block that loops around inserting arrays of 100 rows at a time into the table for a total of 1M rows before committing.


rem
rem     Script:         assm_cleanout.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2        Lots of blocks left "not full"
rem             11.2.0.4
rem

rem
rem     using OMF, so no file-name needed
rem     Ran this bit as SYS, and altered test user to have unlimited quota
rem

/*
create
        tablespace test_8k_assm
        datafile size 1G
        extent management local
        autoallocate
        segment space management auto
;
*/

rem
rem     Create the table, pre-allocate some space.
rem     This means we should get consistent 8M extents and not initial little ones
rem

create table t1 (
        v001 varchar2(5), v002 varchar2(5), v003 varchar2(5), v004 varchar2(5), v005 varchar2(5),
        v006 varchar2(5), v007 varchar2(5), v008 varchar2(5), v009 varchar2(5), v010 varchar2(5),
        v011 varchar2(5), v012 varchar2(5), v013 varchar2(5), v014 varchar2(5), v015 varchar2(5),
        v016 varchar2(5), v017 varchar2(5), v018 varchar2(5), v019 varchar2(5), v020 varchar2(5),
        v021 varchar2(5), v022 varchar2(5), v023 varchar2(5), v024 varchar2(5), v025 varchar2(5),
        v026 varchar2(5), v027 varchar2(5), v028 varchar2(5), v029 varchar2(5), v030 varchar2(5),
        v031 varchar2(5), v032 varchar2(5), v033 varchar2(5), v034 varchar2(5), v035 varchar2(5),
        v036 varchar2(5), v037 varchar2(5), v038 varchar2(5), v039 varchar2(5), v040 varchar2(5),
        v041 varchar2(5), v042 varchar2(5), v043 varchar2(5), v044 varchar2(5), v045 varchar2(5),
        v046 varchar2(5), v047 varchar2(5), v048 varchar2(5)
)
segment creation immediate
tablespace test_8k_assm
storage(initial 8M)
;

alter table t1 allocate extent (size 8M);
alter table t1 allocate extent (size 8M);

rem
rem     Simple anonymous pl/sql block
rem     Large insert, handled with array inserts
rem     Can modify loop count and array size very easily
rem

declare
        type tab_array is table of t1%rowtype;
        junk_array tab_array;
begin

        select
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx',
                'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx'
        bulk collect into
                junk_array
        from
                all_objects
        where
                rownum  <= 100 -- > comment to avoid WordPress format issue
        ;

        for i in 1..10000 loop
                forall j in 1..junk_array.count
                        insert into t1 values junk_array(j) ;
        end loop;

end;
/

commit;

The number of rows per block after this insert is 24, with 1038 bytes free space left (808 due to the pctfree = 10, then the bit that was too small to take a 25th row before breaching the pctfree barrier). This means we should report 1M/24 = 41,666 full blocks and one block with some free space. So we query the table using the dbms_space package:


declare
        m_unformatted_blocks    number;
        m_unformatted_bytes     number;
        m_fs1_blocks            number;
        m_fs1_bytes             number;
        m_fs2_blocks            number;
        m_fs2_bytes             number;

        m_fs3_blocks            number;
        m_fs3_bytes             number;
        m_fs4_blocks            number;
        m_fs4_bytes             number;
        m_full_blocks           number;
        m_full_bytes            number;

begin
        dbms_space.SPACE_USAGE(
                segment_owner           => 'TEST_USER',
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks              => m_fs1_blocks ,
                fs1_bytes               => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,
                fs2_bytes               => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,
                fs3_bytes               => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,
                fs4_bytes               => m_fs4_bytes,
                full_blocks             => m_full_blocks,
                full_bytes              => m_full_bytes
        );

        dbms_output.new_line;
        dbms_output.put_line('Unformatted                   : ' || to_char(m_unformatted_blocks,'999,999,990') || ' / ' || to_char(m_unformatted_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 1 (  0 -  25% free) : ' || to_char(m_fs1_blocks,'999,999,990') || ' / ' || to_char(m_fs1_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 2 ( 25 -  50% free) : ' || to_char(m_fs2_blocks,'999,999,990') || ' / ' || to_char(m_fs2_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 3 ( 50 -  75% free) : ' || to_char(m_fs3_blocks,'999,999,990') || ' / ' || to_char(m_fs3_bytes,'999,999,999,990'));
        dbms_output.put_line('Freespace 4 ( 75 - 100% free) : ' || to_char(m_fs4_blocks,'999,999,990') || ' / ' || to_char(m_fs4_bytes,'999,999,999,990'));
        dbms_output.put_line('Full                          : ' || to_char(m_full_blocks,'999,999,990') || ' / ' || to_char(m_full_bytes,'999,999,999,990'));

end;
/

The results aren’t what we expect:


Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :       35,001 /      286,728,192
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :        6,665 /       54,599,680

We have one block marked as 25 – 50% free (that’s the one block with 16 rows in it, which means about 40% space currently free) but our 41,666 full blocks are actually reported as 6,665 full blocks and 35,001 blocks with some space available. That’s going to hurt eventually if some process wants to insert more rows and finds that it has to fail its way through 35,001 blocks before finding a block which has enough free space.

So what happens when I repeat the PL/SQL block (and commit)? Here are the results from calls to dbms_space after the next two cycles:


Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :       70,002 /      573,456,384
Freespace 2 ( 25 -  50% free) :            2 /           16,384
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          220 /        1,802,240
Full                          :       13,330 /      109,199,360

Unformatted                   :          256 /        2,097,152
Freespace 1 (  0 -  25% free) :      105,003 /      860,184,576
Freespace 2 ( 25 -  50% free) :            3 /           24,576
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          117 /          958,464
Full                          :       19,995 /      163,799,040

Every time we execute the PL/SQL block we leave a trail of 35,001 more blocks which are flagged as “not quite full”.

Looking at the session stats while running the insert loop I can tell that Oracle isn’t checking to see whether or not it should be using those blocks. (A quick way of proving this is to flush the buffer cache before each execution of the PL/SQL and note that Oracle doesn’t read back the 105,000 blocks before inserting any data). So somehow, sometime, someone might get a nasty surprise – and here’s one way that it might happen:

Since I know my data fits 24 rows per block I’m going to modify my PL/SQL block to select one row into the array then loop round the insert 25 times – so I know I’m inserting a little bit more than one block’s worth of data. The modified block looked something like the following sketch – I’ve cheated by copying an existing row rather than re-typing the 48 literals:
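
declare
        type tab_array is table of t1%rowtype;
        junk_array tab_array;
begin
        -- pick up a single existing row as the template row
        select * bulk collect into junk_array from t1 where rownum <= 1;

        -- 25 single-row inserts: a little more than one block (24 rows/block)
        for i in 1..25 loop
                forall j in 1..junk_array.count
                        insert into t1 values junk_array(j);
        end loop;
end;
/

commit;

Starting from the state with 105,003 blocks marked as “Freespace 1” this is what I saw – first, the free space report after inserting 25 rows: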


Unformatted                   :          240 /        1,966,080
Freespace 1 (  0 -  25% free) :        1,074 /        8,798,208
Freespace 2 ( 25 -  50% free) :            0 /                0
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :          133 /        1,089,536
Full                          :      123,927 /    1,015,209,984

Then a few wait events and session statistics for the insert:


---------------------------------------------------------
SID:    39:TEST_USER - jonathan
Session Events - 09-Jan 16:57:18
Interval:-      6 seconds
---------------------------------------------------------
Event                                             Waits   Time_outs        Csec    Avg Csec    Max Csec
-----                                             -----   ---------        ----    --------    --------
db file sequential read                          15,308           0         128        .008           3
db file scattered read                           20,086           0         271        .014           4

---------------------------------
Session stats - 09-Jan 16:57:18
Interval:-  6 seconds
---------------------------------
Name                                                                     Value
----                                                                     -----
session logical reads                                                  269,537
physical read total IO requests                                         35,401
db block gets                                                          229,522
consistent gets                                                         40,015
physical reads                                                         124,687
physical reads cache                                                   124,687
db block changes                                                       208,489
physical reads cache prefetch                                           89,293
redo entries                                                           207,892
redo size                                                           16,262,724
undo change vector size                                                  1,720
deferred (CURRENT) block cleanout applications                         103,932
table scan blocks gotten                                                20,797
HSC Heap Segment Block Changes                                              25

The session has read and updated almost all of the level 1 bitmap blocks. I don’t know exactly what triggered this re-read, but it seems to be related to the number of rows inserted (or, perhaps, the amount of space used rather than the row count) as an insert crosses the pctfree boundary and fails over to the next block. I’ve only done a couple of little tests to try and get a better idea of why an insert sometimes sweeps through the bitmap blocks – so I know that inserting 2 or 3 rows at a time will also trigger the cleanout – but there are probably several little details involved that need to be identified.

You might note a couple of details in the stats:

  • Because I had flushed the buffer cache before the insert Oracle did its “cache warmup” tablescanning trick – if this had not happened I would probably have done a single block read for every single bitmap block I touched.
  • There are 103,932 block cleanout applications – but 208,000 db block changes and redo entries. Roughly half the latter are for data block cleanouts (OP code 4.1) and half are the state changes on the level 1 bitmap blocks (OP code 13.22). You’ll notice that neither change produces any undo.
  • I’ve also included the HSC Heap Segment Block Changes statistic to show you that not all changes to Heap Segment Blocks show up where you might expect them.

And finally:

If you re-run the tests on 11.2.0.4 and 12.2.0.1 you get the following results after the initial script run – the problem doesn’t exist:


11.2.0.4
========
Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       41,666 /      341,327,872

12.2.0.1
========
Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           67 /          548,864
Full                          :       41,666 /      341,327,872

So how does this help the OP?

  • First, there may be a huge mess still waiting to be cleaned in the table – but at 34M blocks I’m not too keen to suggest running the dbms_space routine to find out what it looks like, though maybe that’s necessary.
  • Secondly – an upgrade to 12.2 will probably avoid the problem in future.
  • Thirdly – if the number of rows per block is very close to uniform, write a little code to do a loop that inserts (say) 2 * expected number of rows per block as single row inserts and rolls back; the inserts will probably trigger a massive though perhaps not complete cleanout, so rinse and repeat until the cleanout is complete. Try to find a time when you don’t mind the extra load to get this job done (there’s a sketch of the idea after this list).
  • Finally – on the big job that does the bulk insert – repeat the dummy insert/rollback at the end of the job to clean up the mess made by the job.
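
For the third option the code might look something like this sketch – the loop count assumes my test table’s 24 rows per block, and picking up an existing row as the template is just a convenience:

declare
        one_row t1%rowtype;
begin
        -- pick up one existing row as a template
        select * into one_row from t1 where rownum <= 1;

        -- 2 * expected rows per block (24) as single-row inserts
        for i in 1..48 loop
                insert into t1 values one_row;
        end loop;

        -- the rollback discards the rows, but the bitmap state changes stay
        rollback;
end;
/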

Addenda

Prompted by comment #2 below, I should add that if the problem has been fixed in 12.2 then possibly there’s a bug report and patch for it already. If there isn’t then the OP could raise an SR (referencing this blog note), and request a bug fix or back-port from 12.2.

And within 24 hours of publication, comment #4 (from Yury Pudovchenko) tells us that the bug is fixed by the Oct 2017 Bundle Patch.

 

 

January 2, 2018

Defaults

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 8:43 am GMT Jan 2,2018

Following on from a Twitter reference and an update to an old posting about a side effect of constraints on the work done inserting data, I decided to have a closer look at the more general picture of default values and inserts. Here’s a script that I’ve tested against 11.2.0.4, 12.1.0.2, 12.2.0.1 and most recently 18.3.0.0 (original install, no patches applied in all cases):


rem
rem     Script:         defaults_cost.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2017
rem

create table t1 (
        column1  varchar2(10),
        column2  varchar2(10),
        column3  varchar2(10),
        column4  varchar2(10),
        column32 varchar2(32)   default 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
        column33 varchar2(33)   default 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
        virtual1      generated always as (
                column4 || column3 || column2 || column1
        ) virtual
)
segment creation immediate
;

execute dbms_output.put_line(dbms_stats.create_extended_stats(user,'t1','(column1 || column2 || column3 || column4)'))
execute dbms_output.put_line(dbms_stats.create_extended_stats(user,'t1','(column1,column2,column3)'))

create or replace function plsqlfunction_with_30char_name(
        i_in varchar2
)
return varchar
deterministic
is
begin
        return initcap(i_in);
end;
/

create index t1_i1 on t1(substr(plsqlfunction_with_30char_name(column1),1,10));

When you create a function-based index you get a hidden virtual column supporting the index expression; when you create extended stats (of either type) you get a hidden virtual column holding the extension definition; and when you create any type of virtual column, including a “real” virtual column, you get a data dictionary entry holding the column name and the expression definition. All these options use the “data_default” column from user_tab_cols to display the defining information – as we can see when we run the following query:


select  column_name, data_default
from    user_tab_cols
where   table_name = 'T1'
order by
         column_id
;

COLUMN_NAME                      DATA_DEFAULT
-------------------------------- --------------------------------------------------------------------------------
COLUMN1
COLUMN2
COLUMN3
COLUMN4
COLUMN32                         'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
COLUMN33                         'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
VIRTUAL1                         "COLUMN4"||"COLUMN3"||"COLUMN2"||"COLUMN1"
SYS_STUDAN97VB_XDKRTR_NPFAB80P   "COLUMN1"||"COLUMN2"||"COLUMN3"||"COLUMN4"
SYS_STUTCBJ8OFFSY1D9ZXRYZ0W3W#   SYS_OP_COMBINED_HASH("COLUMN1","COLUMN2","COLUMN3")
SYS_NC00010$                     SUBSTR("TEST_USER"."PLSQLFUNCTION_WITH_30CHAR_NAME"("COLUMN1"),1,10)

Apart from the special cases I’ve just listed, you’ll also see the “default values” I specified for column32 and column33 – you’ll notice that I’ve supplied a 30 character string as the default for column32, and a 31 character string as the default for column33 – this is a convenience that means the used space in the data_default (which is a long column) corresponds to the name of the column once you include the single quotes in their character count.
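You can check the space used without dumping the long column – the default_length column of user_tab_cols reports it directly. A quick sketch against the table from the script above:

select
        column_name, default_length
from
        user_tab_cols
where
        table_name = 'T1'
order by
        column_id
;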

Having set my data up I’m going to emulate a bad application that uses lots of literal string SQL and leaves Oracle to fill in the default values (and, of course, derive the various virtual values it might need).


alter session set events '10046 trace name context forever, level 4';

begin
        for i in 1..10 loop
                execute immediate '
                        insert into t1 (column1, column2, column3, column4)
                        values( ' || i || ', ' || i || ', ' || i || ', ' || i || ')'
                ;
                commit;
        end loop;
end;
/

alter session set events '10046 trace name context off';

This code generates 10 strings that populate column1 through to column4 only. But you’ll notice the call to enable SQL tracing – and here’s the interesting bit of the output from applying tkprof to the trace file:


  SQL ID: 47r1y8yn34jmj Plan Hash: 2191121161

select default$
from
 col$ where rowid=:1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       50      0.00       0.00          0          0          0           0
Execute     50      0.00       0.00          0          0          0           0
Fetch       50      0.00       0.00          0        100          0          50
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      150      0.00       0.00          0        100          0          50

The summary is the same for every version of Oracle I tested*** – we’ve queried for a default value from col$ 5 times for each row we insert! (Technically that’s 5 times for each insert statement we’ve had to (hard-)parse; this anomaly wouldn’t appear if we had been using a bind-variable method and reusing the insert statement.)

*** There is one difference from 12.2 onwards – the number of parse calls reported for the statement was 1 rather than 50 but, judging by the various cursor cache hit stats, that may be due to a change in accounting rather than a change in workload.
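
For comparison, a bind-variable version of the loop would look something like the following sketch – one hard parse of the insert in total, so the col$ lookups would happen once rather than once per execution:

begin
        for i in 1..10 loop
                execute immediate '
                        insert into t1 (column1, column2, column3, column4)
                        values( :1, :2, :3, :4)'
                using i, i, i, i;
                commit;
        end loop;
end;
/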

Check the table definition: there are two “real defaults” and 4 expressions due to the various virtual columns – so why 5 calls per insert and not 6 ? The answer lies in the length of the actual value involved – if the text that appears in the (long) data_default column is 32 characters or shorter it will be stored in the dictionary cache (rowcache), but only one of our 6 values is that short, so Oracle looks up the other five on each insert (hard parse).

This is a little strange on two counts: first, given the enormous memories that have been available for years and the extreme usefulness of virtual columns and extended stats it seems something of an oversight to limit the memory available to the cache that holds such critical definitions; secondly, we don’t need to evaluate the virtual columns (unless they are indexed) or extended stats on inserts so why is Oracle querying the definitions anyway ? [Possible answer: perhaps Oracle is using generic code that allows for column check constraints – which might exist on users’ virtual columns – and hasn’t catered for bypassing system-generated virtual columns.]

A key point to remember before you start worrying too much about the impact of the execution count for this query against col$ is that it’s only relevant to “hard” parses – so you’re only likely to notice it if you have a large number of different “literal string” inserts that should be using bind variables; and that means you’ve probably got an important coding defect to address before you worry too much about the extra impact caused by this particular call. Nevertheless there are a couple of bug reports on MoS that have been raised against this query and, after writing this note, I did a quick Google search for the critical SQL_ID and found (among others) this production example from Franck Pachot.

 

October 3, 2017

Parsing

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 4:15 pm GMT Oct 3,2017

Here’s a quick quiz.

According to the Oracle 12.1 Database SQL Tuning Guide the first stage of parsing a statement is the Syntax Check, which is followed by the Semantic Check, followed by the Shared Pool Check. So where do you think the statement text will be while the Syntax Check is going on ?

 

 

 

 

 

 

And the answer looks like ….

 

 

 

… the shared pool. Here’s a simple test, cut-n-paste from SQL*Plus running under 12.1.0.2 in the SYS schema (I’ve also done this in the past with older versions):

 

 

 


SQL> select user frrom dual;
select user frrom dual
                  *
ERROR at line 1:
ORA-00923: FROM keyword not found where expected

SQL> select kglhdpar, kglhdadr, kglobt03, kglnaobj from x$kglob where kglnaobj like '%frrom%' and kglnaobj not like '%kgl%';

KGLHDPAR         KGLHDADR         KGLOBT03      KGLNAOBJ
---------------- ---------------- ------------- --------------------------------------------------------------------------------
00000000D1EE3880 00000000AEF43F40 bshhvh0ypcz6x select user frrom du
00000000D1EE3880 00000000D1EE3880 bshhvh0ypcz6x select user frrom du

2 rows selected.

SQL> 

So the statement is garbage and fails (or ought to be failing) on the syntax check – but I can find it in x$kglob – the library cache objects – in the SGA with a parent and child entry.

I have to say that when I first read, many years ago, that there was a syntax check it struck me that the fastest way of doing a syntax check on a “good” system would be to start with a search for the text in the library cache. If it were there (and on a good system it probably would be within a few minutes of startup) then you could have a flag that avoided the CPU cost of doing the syntax check and let you move straight on to the semantic (basically object and permissions) check.

 

October 2, 2017

markhot

Filed under: latches,mutex,Oracle,Troubleshooting — Jonathan Lewis @ 8:42 am GMT Oct 2,2017

How can a single piece of SQL text – checked very carefully – end up with multiple SQL_IDs ? There are probably quite a lot of people who know the answer to this question but won’t think of it until they’re reminded and, thanks to a question that came up on the forum formerly known as OTN a couple of days ago, I was reminded about it recently and rediscovered an article I had drafted on the topic a few years ago.

The specific problem on the forum was about having a huge number of child cursors for a single parent thanks to a frequently executed update statement that updated all 50 columns of a table, using bind variables to do so. The reason why a single statement could produce so many child cursors seemed to be the variation in the lengths of the bind variables supplied – which could well be the consequence of an internal library mechanism rather than an explicit design mechanism written into the client code. One of the comments to my 2007 article suggested event 10503 as a damage limitation mechanism, but there was some problem with it not working as expected at the time. A quick check on MoS now reports bug 10274265: “EVENT 10503 NOT WORKING ON THE SESSION LEVEL” as fixed in 12.1, with lots of backport patches to various versions of 11.2.

Moving on from the back-story, the case that prompted my draft note in 2014 was the simpler one of having lots of sessions constantly executing exactly the same SQL statement and child cursor, so rather than having a latch/mutex type of problem because of a large number of child cursors I was seeing a problem purely because of the level of concurrent access to the same child cursor.  The solution in the version of Oracle the client was using at the time was to tell Oracle to mark the SQL statement as “hot” by setting a hidden parameter; but the mechanism is now officially exposed in a procedure called dbms_shared_pool.markhot() that I learned about a few months ago when I was at a client who had a similar problem with highly concurrent execution of a small set of statements – with the extra twist that the table referenced in the critical statements was a partitioned table which suffered a fairly regular partition exchange.

When a statement (through its “full_hash_value”) is marked as hot, an extra value visible as the property column in v$db_object_cache is set to a value that seems to be dependent on the process id of the session attempting to execute the statement, and this value is used as an extra component in calculating a new full_hash_value (which leads to a new hash_value and sql_id). With a different full_hash_value the same text generates a new parent cursor which is (probably) going to be associated with a new library cache hash bucket and latch. The property value for the original parent cursor is set to “HOT”, and the extra copies become “HOTCOPY1”, “HOTCOPY2” and so on. Interestingly, once an object is marked as HOT and the first HOTCOPYn has appeared, the original version may disappear from v$sql while still existing in v$db_object_cache.

The number of “HOTCOPYn” versions of the statement is limited by the hidden parameter “_kgl_hot_object_copies” which (according to my notes) defaults to either cpu_count or cpu_count/2. On my most recent test on 11.2.0.4 it seemed to be the latter.

Marking a cursor hot

There are three options:

  • set a hidden parameter in the startup file
  • execute an “alter system” command to set the hidden parameter
  • from 11.2.0.3 onwards (possibly earlier in 11.2) call dbms_shared_pool.markhot()

Examples:

Startup file:

_kgl_debug="hash='cc7d5ecdcc9e7c0767456468efe922ea' namespace=0 debug=33554432"

Alter system call with multiple targets:

alter system set "_kgl_debug" =
        "hash='cc7d5ecdcc9e7c0767456468efe922ea' namespace=0 debug=33554432",
        "hash='59a1f6575a5006600792ee802558305b' namespace=0 debug=33554432"
;

markhot() procedure:

begin
        dbms_shared_pool.markhot(
                hash            =>'71fc6ccf9a3265368492ec9fc05b785b',
                namespace       =>0,
                global          =>true
        );
end;
/

The namespace identifies the object as an SQL cursor (you can mark other types of object as hot if you need to), and for those of a mathematical bent you’ll work out that the debug value is power(2,25).

The value supplied as the hash is the full_hash_value and you can find this in v$db_object_cache either by searching on some string that easily identifies your statement, or by searching v$sql on a string to get the (short) hash value of the statement and using that to search v$db_object_cache on the hash_value column.


select
        hash_value,
        full_hash_value,
        namespace,
        child_latch,
        property        hot_flag,
        executions,
        invalidations
from
        v$db_object_cache
where
        name like '{some part of your critical SQL statement}'
;

I ran into two problems using the markhot() approach. The first is not terribly serious; the second was fatal, except I’m not going to do it again and I wouldn’t have run into it if I hadn’t been impatient working around the first.

First: if you’ve already got lots of sessions executing the statement and holding cursors open in some way before you call markhot() then it may be some time before all those sessions release the hot parent and child and acquire a “cool” parent and child; and, unfortunately, you can’t call markhot() until at least one session has opened the relevant cursor – a problem that isn’t relevant if you’ve got the hidden parameter set.

Secondly: although eventually your hot cursor(s) will drop out of use, if you try to get rid of them early by a cunning call to dbms_shared_pool.purge() you may find that you don’t manage to purge them; if you decide to try again, and again (as I did) you may find that your session goes into an infinite CPU spin and no-one can get at the hot cursor.  Be patient, once you’ve marked a cursor as hot your application will (probably) end up spreading itself over the copies.

One last detail – if, for any reason, you decide that a cursor no longer needs to be marked hot there is a procedure dbms_shared_pool.unmarkhot() that takes the same three parameters to clear the property and allow the copies to disappear.
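
Using the same three parameters as the markhot() example above, the call would look like this:

begin
        dbms_shared_pool.unmarkhot(
                hash            => '71fc6ccf9a3265368492ec9fc05b785b',
                namespace       => 0,
                global          => true
        );
end;
/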

Footnote

The OTN problem that prompted me to write this note wasn’t about high concurrency levels, it was about mutex contention while searching for the right child cursor. The markhot() procedure doesn’t really look as if it’s designed to address this issue but, as a side-effect of having multiple parent cursors for the same statement text, there should be fewer sessions searching each child-cursor chain at any one moment and this may be enough to reduce the contention. Statistically, of course, every child chain is likely to end up the same length so the amount of shared pool memory used by the SQL statement will eventually grow by a factor matching the number of hot copies produced – but if the problem is contention it may be better (e.g.) to have 16 times the memory used so that 100 concurrent sessions can be spread across 16 different chains rather than having 100 sessions all trying to search the same chain at the same time.

 

August 3, 2017

Rebuilding Indexes

Filed under: Indexing,Infrastructure,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:00 pm GMT Aug 3,2017

One of the special events that can make it necessary to rebuild an index is the case of the “massive DML”, typically a bulk delete that purges old data from a table. You may even find cases where it’s a good idea to mark a couple of your indexes as unusable before doing a massive delete and then rebuild them after the delete.
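
In outline the pattern is simply the following – a sketch with hypothetical (non-unique) index and table names, remembering that skip_unusable_indexes defaults to true from 10g onwards:

alter index t1_i1 unusable;

delete from t1
where   date_col < add_months(sysdate, -36)
;
commit;

alter index t1_i1 rebuild;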

Despite the fact that a massive delete is an obvious special case it’s still not necessary in many cases to worry about a rebuild afterwards because the space made free by the delete will be smoothly reused over time with very little variation in performance. There is, however, one particular feature that increases the probability of a rebuild becoming necessary – global (or globally partitioned) indexes on partitioned tables. The problem (and the absence of problem in non-partitioned tables) is in the nature of the rowid.

For non-partitioned tables, and partitioned tables with local indexes, the rowid stored in an index is (assuming we’re thinking only of heap tables) stored as a sequence of 6 bytes consisting, in order, of: (tablespace relative file number, block number within file, row number within block). If the table is non-partitioned, or if this is an index segment from a locally partitioned index, all the index entries will be pointing to the same table segment and Oracle knows which segment that is from the data dictionary information – so Oracle can derive the data_object_id of the table segment and convert the tablespace relative file number into the absolute file number to navigate to the right row in the table.

When the index is global or globally partitioned any index entry may point to any of the table’s segments, so the rowid that is stored in the index is expanded by a further 4 bytes to hold the data_object_id of the table segment it is pointing to – and the data_object_id is the leading component: (data_object_id, tablespace relative file number, block number within file, row number within block). Think about what this means when you start to drop “old” partitions and add new partitions. Compare this with what happens when you simply delete a large volume of old data from a table and start inserting new data. There’s an important difference to the way in which indexes will evolve.
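
If you want to see the components of a rowid for yourself the dbms_rowid package will decode them – a quick sketch against a hypothetical table t1:

select
        rowid,
        dbms_rowid.rowid_object(rowid)          data_object_id,
        dbms_rowid.rowid_relative_fno(rowid)    rel_file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        dbms_rowid.rowid_row_number(rowid)      row_no
from
        t1
where
        rownum = 1
;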

Purging data

When you delete a large volume of data from a (simple, heap) table you will create a lot of empty space in a lot of existing table blocks. If that delete is supposed to get rid of “old” data (and to keep the description simple we’ll assume it’s the first time you’ve done this) it’s very likely that the delete will result in lots of empty blocks near the start of the table – rows that were inserted at the same time will tend to be physically close to each other in the table. This means that future inserts will soon start to reuse those table blocks. Think about what this means for index entries – especially for non-unique keys.

Assume you have 100 rows with value ‘XXX’ for an indexed column. Breaking the rowid into its component parts the index entries will be (‘XXX’,{file_id, block_id, row_number}).  Now assume you delete the oldest 10 rows then, over time, insert 10 more rows with the same key value. You’ve deleted the 10 index entries with the lowest values for (file_id, block_id) but the space that’s become available in the table will be in and around exactly that range of blocks – so the new index entries will probably end up looking very similar to the deleted index entries and inserted in and around the existing index entries for value ‘XXX’, so over time the index is unlikely to allocate much new space.

Now think about what happens when your table is partitioned but the index is global; your index entries are (‘XXX’,{data_object_id, file_id, block_id, row_number}). When you drop the oldest partition you will probably[1] delete all the index entries with the lowest data_object_id. When you start inserting new rows for ‘XXX’ the new table partition will have a data_object_id that is going to be higher than any previous data_object_id – which means you’re going to be inserting rows into the right-hand (high-value) edge of this section of the index. In some cases – typically those where you have a couple of leaf blocks per key value – the index may end up growing significantly because the insertion point for rows in the new partition isn’t in the leaf block with the available space, and it won’t be until you’ve done a few more bulk deletes and the leaf blocks relating to the oldest table partitions become completely empty that the space can be reused.

An example of this type of behaviour probably appeared on the OTN database forum quite recently.  Of course, there are various reasons why indexes can become inefficient, and the degree of inefficiency may only become noticeable over a fairly long period of time; moreover there are various reasons why global indexes are a little problematic, and various reasons why a bulk delete (which is what executing “alter table drop partition” does to a global index) has unpleasant side effects dependent somewhat on the number (and size) of the partitions and on how many you try to drop in one go.

There’s not a lot you can do about this quirk of global indexes, but it’s always worth taking extra care with partitioned tables and focusing even more carefully on a strategic review of indexes:

  • Does this index really need to exist at all?
  • Could this index be replaced by a selective function-based index?
  • Does this index really need to be global / globally partitioned?
  • How big is this index compared to the size it ought to be?
  • Should this index be (basic) compressed?
  • Is this index likely to be disrupted by a historic purge – or is there another reason for its undesirable behaviour?

 

[1] “probably delete all the index entries with the lowest data_object_id” – I have to say this because if you’ve executed a “move partition” at any time a new data_object_id will have been generated for the partition, so the oldest partition could, in principle, have the highest data_object_id. The issue of changing data_object_ids brings a whole new level of complexity to global indexes – but only in a few special cases, fortunately.
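
If you want to watch the data_object_id changing, a quick query against user_objects for a hypothetical partitioned table t1 – run before and after the move – will show it:

select
        subobject_name, object_id, data_object_id
from
        user_objects
where
        object_name = 'T1'
and     object_type = 'TABLE PARTITION'
order by
        data_object_id
;

alter table t1 move partition p_oldest;

rem     repeat the query: p_oldest now reports the highest data_object_id
rem     (nb: the move will mark global indexes unusable unless you
rem     include the "update indexes" clause)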

 

 

July 24, 2017

Fast Now, Fast Later

Filed under: Infrastructure,Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 1:39 pm GMT Jul 24,2017

The following is the text of an article I published in the UKOUG magazine several years ago (2010), but I came across it recently while writing up some notes for a presentation and thought it would be worth repeating here.

Fast Now, Fast Later

The title of this piece came from a presentation by Cary Millsap and captures an important point about trouble-shooting as a very memorable aphorism. Your solution to a problem may look good right now, but is it a solution that will still be appropriate when the database has grown in volume and has more users?

I was actually prompted to write this article by a question on the OTN database forum that demonstrated the need for the basic combination of problem solving and forward planning. Someone had a problem with a fairly sudden change in performance of his system from November to December, and he had some samples from trace files and Statspack of a particular query that demonstrated the problem.

The query was very simple:

select  *
from    tph
where   pol_num = :b0
order by
        pm_dt, snum

When the query was operating fast enough the trace file from a sample run showed the following (edited) tkprof output, with the optimizer taking advantage of the primary key of (pol_num, pm_dt, snum) on table TPH to avoid a sort for the order by clause. (Note that the heading on the plan is “Row Source Operation” – which means it’s the execution plan that really was used.)

call    count    cpu  elapsed  disk  query  current  rows
---------------------------------------------------------
Parse       1   0.01     0.13     0    106        0     0
Execute     1   0.03     0.03     0      0        0     0
Fetch       4   0.01     0.22    46     49        0    43
---------------------------------------------------------
total       6   0.06     0.39    46    155        0    43

Rows Row Source Operation
---- --------------------
  43 TABLE ACCESS BY INDEX ROWID TPH (cr=49 pr=46 pw=0 time=226115 us)
  43   INDEX RANGE SCAN TPH_PK (cr=6 pr=3 pw=0 time=20079 us)(object id 152978)

Elapsed times include waiting on following events:
Event waited on               Times  Max. Wait Total Waited
                             Waited 
-----------------------------------------------------------
db file sequential read          46       0.01         0.21

When the query was running less efficiently the change in the trace didn’t immediately suggest any fundamental problems:


call    count    cpu  elapsed  disk  query  current  rows
---------------------------------------------------------
Parse       1   0.00     0.00     0     51        0     0
Execute     1   0.01     0.01     0      0        0     0
Fetch       4   0.00     0.59    47     51        0    45
---------------------------------------------------------
total       6   0.01     0.61    47    102        0    45

Rows Row Source Operation
---- --------------------
  45 TABLE ACCESS BY INDEX ROWID TPH (cr=51 pr=47 pw=0 time=593441 us)
  45   INDEX RANGE SCAN TPH_PK (cr=6 pr=2 pw=0 time=33470 us)(object id 152978)

Elapsed times include waiting on following events:
Event waited on               Times  Max. Wait Total Waited
                             Waited 
-----------------------------------------------------------
db file sequential read          47       0.03         0.58

The plan is the same, the number of rows returned is roughly the same, and the number of disc reads and buffer gets has hardly changed. Clearly the overall change in performance comes from the slower average disk read times (a total of 0.21 seconds with a maximum of one hundredth of a second, compared to a total of 0.58 seconds with a maximum of three hundredths), but why has the disk I/O time changed?

The figures give us a couple of preliminary ideas. An average read time of 4.5 milliseconds ( 0.21 seconds / 46 reads) is pretty good for a “small” random read of a reasonably loaded disc subject to a degree of concurrent access [ed: bearing in mind this was 2010], so the waits for “db file sequential read” in the first tkprof output are probably getting some help from a cache somewhere – possibly a SAN cache at the end of a fibre link or maybe from a local file system buffer (we might get a better idea if we could see the complete list of individual read times).

In the second case an average of 12.3 milliseconds ( 0.58 seconds / 45 reads) looks much more like a reasonable amount of genuine disc I/O is taking place – and the maximum of 30 milliseconds suggests that the disc(s) in question are subject to an undesirable level of concurrent access: our session is spending some of its time in a disk queue. Again, it would be nice to see the wait times for all the reads, but at this point it’s not really necessary.

There are a couple more clues about what’s going on – one is the text of the query itself (and I’ll be coming back to that later) and the other is in the detail of the disk I/Os. If you check the “Row Source Operation” details you’ll see that in the first case the sample query selected 43 rows from the table and requested 43 (46 – 3) physical reads (pr) of the table to do so. In the second case it was 45 rows and 45 (47 – 2) physical reads. Is this simply a case of the same query needing a little more data and having to do a little more work as time passes?

So now we come to the Statspack data. Based on my observations (or guesses) about the nature of the query and the work going on, I asked if we could see some summary information for a couple of comparative intervals, and also to see if this particular query appeared in the “SQL ordered by reads” section of the Statspack reports. Here are the results, first for a snapshot taken in October:


Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                      3,816,939      58,549     15   79.4
CPU time                                                     7,789          10.6
db file parallel write                         371,865       2,005      5    2.7
log file parallel write                         75,554       1,552     21    2.1
log file sync                                   17,198       1,228     71    1.7

                                                     CPU      Elapsd     Old
 Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
        775,166       43,228           17.9   24.3   212.58  12449.98  1505833981
Module: javaw.exe
SELECT * FROM TPH WHERE POL_NUM = :B1 ORDER BY PM_DT ,SNUM FOR UPDATE NOWAIT

You might notice that the critical query is actually a ‘select for update’ rather than the simple select that we had originally been told about; this doesn’t affect the execution plan, but is going to have some significance as far as undo and redo are concerned.

Now look at the corresponding figures for an interval in December:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                      7,000,428      92,970     13   89.8
CPU time                                                     6,780           6.5
db file parallel write                         549,286       1,450      3    1.4
db file scattered read                          84,127         720      9     .7
log file parallel write                         41,197         439     11     .4


                                                     CPU      Elapsd     Old
 Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
      2,444,437       43,363           56.4   25.2   221.31  23376.07 1505833981
Module: javaw.exe
SELECT * FROM TPH WHERE POL_NUM = :B1 ORDER BY PM_DT ,SNUM FOR UPDATE NOWAIT

You’ll see in both cases that a huge fraction of the total database time is spent in single block reads (the “db file sequential read” time), but in December the number of reads has gone up by about 3.2 million. You can also see that about 1.7 million of the “extra” reads could be attributed to the critical query even though the number of executions of that query has hardly changed. The average number of reads per execution has gone up from 18 to 56. (I did ask if I could see the section of Statspack titled “SQL ordered by Executions” as this includes the average number of rows per execution and it would have been nice to know whether this average had gone up just a little bit, or whether it had gone up in line with the average reads per execution. Unfortunately the request was ignored, so I am going to proceed as if the change in the average result set was small.)

This, perhaps, tells us exactly what the problem is (and even if it doesn’t, the figures are symptomatic of one of the common examples of non-scalable queries).

Look at the query again – are we reporting all the rows for a “policy number”, ordered by “payment date”? If so, the number of payments recorded is bound to increase with time, and inevitably there will be lots of payments for policies belonging to other people between each pair of payments I make on my policy – and that would put each of my payments in a different table block (if I use a normal heap table).

Initially the payments table may be sufficiently small that a significant fraction of it stays in Oracle’s data cache or even in the file-system or SAN cache; but as time passes and the table grows the probability of me finding most of my blocks cached decreases – moreover, as time passes, I want increasing numbers of blocks which means that as I read my blocks I’m more likely to knock your blocks out of the cache. Given the constantly increasing numbers of competing reads, it is also no surprise that eventually the average single block read time also increases.

In scenarios like this it is inevitable that performance will degrade over time; in fact it is reasonably likely that the performance profile will degrade slowly to start with and then show an increasingly dramatic plunge. The only question really is how much damage limitation you can do.

One strategy, of course, is to increase the memory available for the critical object(s). This may mean assigning the table to a generously sized KEEP cache. (The cache need not be the same size as the table to improve things, but the bigger the better – for this query, at least). But such a strategy is only postponing the inevitable – you really need to find an approach which is less susceptible to the passage of time.

In this case, there are a few options to consider. First – note that we are selecting all the rows for a policy: do we really need to, or could we select the rows within a given date range, thus setting an upper limit on the average volume of data we need to acquire for any one policy. If we do that, we may want to think about strategies for summarizing and deleting older data, or using partitioning to isolate older data in separate segments.
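
For example – a sketch only, where the 12 month window is an invented business rule:

select  *
from    tph
where   pol_num = :b0
and     pm_dt  >= add_months(trunc(sysdate), -12)
order by
        pm_dt, snum
;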

If we can’t deal with the problem by changing the code (and, in this case, the apparent business requirement) can we avoid the need to visit so many data blocks for a single policy? There are two obvious options to consider here – we could create the table as an “index cluster” clustered on the policy number; in this way we pay a penalty each time we insert a new row for a policy because we have to find the correct block in the cluster, but when we run a query against that policy we will (probably) only need to read one or two blocks to get all the data. Alternatively we could set the table up as an index-organized table (IOT) – again we do more work inserting data into the correct leaf block in the index, but again we reap the benefit as we query the data because all the rows we want are in the same two or three leaf blocks (and stored in the order we want them).
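
A minimal sketch of the IOT option, using the primary key we saw earlier (the non-key column is invented):

create table tph (
        pol_num         number          not null,
        pm_dt           date            not null,
        snum            number          not null,
        payment_details varchar2(100),
        constraint tph_pk primary key (pol_num, pm_dt, snum)
)
organization index
;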

Of course we are still subject to the same basic problem of the result set increasing in size as time passes, but at least we have managed to reduce (dramatically) the number of blocks we have to visit and the rate of growth of the number of blocks per query, thereby improving the scalability of the queries significantly.

Introducing new structures to an existing system is difficult, of course, and we may have to work out variations on this theme (like creating an index that includes all the table columns if we can’t switch to an IOT!). The key point is this, though: if we can look at our data and the critical queries and recognize that the volume of data we have to process (even if we don’t return it, as we did in this example) is always going to increase over time, then we need to consider ways of minimizing the volume of data, or improving the packing of data, so that the work we do doesn’t change (much) over time. Don’t just think ‘fast now’, think ‘will it still be fast later’.

 

May 23, 2017

255 Again!

Filed under: 12c,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 1:10 pm GMT May 23,2017

There are so many things that can go wrong when you start using tables with more than 255 columns – here’s one I discovered partly because I was thinking about a client requirement, partly because I had a vague memory of a change in behaviour in 12c and Stefan Koehler pointed me to a blog note by Sayan Malakshinov when I asked the Oak Table if anyone remembered seeing the relevant note. Enough of the roundabout route, I’m going to start with a bit of code to create a table, stick a row in it, then update that row:

rem
rem     Script: wide_table_4.sql
rem     Author: Jonathan Lewis
rem     Dated:  May 2017
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

set pagesize 0
set feedback off

spool temp.sql

prompt create table t1(

select
        'col' || to_char(rownum,'fm0000') || '  varchar2(10),'
from
        all_objects
where   rownum <= 319
;

prompt col0320 varchar2(10)
prompt )
prompt /

spool off

@temp

set pagesize 40
set feedback on

insert into t1 (col0010, col0280) values ('0010','0280');
commit;

update t1 set col0320 ='0320';
commit;

column file_no  new_value m_file_no
column block_no new_value m_block_no

select
        dbms_rowid.rowid_relative_fno(rowid)    file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        dbms_rowid.rowid_row_number(rowid)      row_no
from
        t1
;

alter system flush buffer_cache;
alter system dump datafile &m_file_no block &m_block_no;

So I’ve written one of those horrible scripts that write a script and then run it. The script creates a table with 320 columns and inserts a row that populates columns 10 and 280. That gets me two row pieces, one consisting of the 255 columns from columns 26 to 280 that goes in as row piece 0, the other consisting of the first 25 columns that goes in as row piece 1; the remaining 40 columns are not populated so Oracle “forgets” about them (“trailing nulls take no space”). The script then updates the row by setting column 320 to a non-null value.

For convenience I’ve then generated the file and block number (and row number, just to show its head piece went in as row 1 rather than row 0) of the row and done a symbolic block dump. The question is: what am I going to see in that block dump ?

Answers (part 1)

Here’s an extract from the block dump from 11.2.0.4 (12.1.0.2 is similar) – though I’ve cut out a lot of lines reporting the NULL columns:


ntab=1
nrow=2
frre=-1
fsbo=0x16
fseo=0x1e54
avsp=0x1e3e
tosp=0x1f13
0xe:pti[0]      nrow=2  offs=0
0x12:pri[0]     offs=0x1e7a
0x14:pri[1]     offs=0x1e54
block_row_dump:
tab 0, row 0, @0x1e7a
tl: 49 fb: -------- lb: 0x2  cc: 40
nrid:  0x014000a7.0
col  0: *NULL*
col  1: *NULL*
col  2: *NULL*
...
col 37: *NULL*
col 38: *NULL*
col 39: *NULL*
tab 0, row 1, @0x1e54
tl: 38 fb: --H-F--- lb: 0x2  cc: 25
nrid:  0x014000a3.0
col  0: *NULL*
col  1: *NULL*
col  2: *NULL*
...
col 22: *NULL*
col 23: *NULL*
col 24: *NULL*
end_of_block_dump

The block holds two row pieces, and the piece stored as “row 1” is the starting row piece (the H in the flag byte (fb) tells us this). This row piece consists of 25 columns. The next rowpiece (identified by nrid:) is row zero in block 0x014000a3 – that’s block 163 of file 5 – which is the same block as the first row piece. When we look at row zero we see that it holds 40 columns, all null; it’s pointing to a third row piece at row zero in block 0x014000a7 (file 5, block 167), and as corroboration we can also see that the flag byte has no bits set and that tells us that this is just a boring “somewhere in the middle” bit. So it looks like we have to follow the pointer to find the last 255 columns of the table. So let’s take a look at the dump of file 5 block 167:


fsbo=0x14
fseo=0x1e76
avsp=0x1e62
tosp=0x1e62
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1e76
block_row_dump:
tab 0, row 0, @0x1e76
tl: 266 fb: -----L-- lb: 0x1  cc: 255
col  0: *NULL*
col  1: *NULL*
col  2: *NULL*
...
col 251: *NULL*
col 252: *NULL*
col 253: *NULL*
col 254: [ 4]  30 33 32 30
end_of_block_dump

Take note of the L in the flag byte – that tells us that we’re looking at the last row piece of a multi-piece row, and it’s the last 255 columns we were looking for. The mechanics have worked as follows:

  • On the simple insert Oracle split the used 280 columns into (25, 255)
  • On the update we grew the used column count from 280 to 320, adding 40 columns. Oracle extended the 255 column row piece to 295, then split it (40, 255), leaving the 40 in the original block and migrating the 255 to a new block.

So a row that could be two pieces in one block is now three pieces spread over two blocks; and there’s worse to come. Go back to the original block dump and check the used space. A good first approximation would be to check the “tl:” (total length) value for each row – this gives you: 49 + 38 bytes; add on a couple of hundred for the general block overhead and stuff like the transaction table and you find you’ve used less than 300 bytes in the block. But I’ve got a little procedure (I published this version of it some time ago) to check for free and used space – and this is what it said about the (ASSM) segment that holds this table:


Unformatted                   :           44 /          360,448
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            0 /                0
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :           15 /          122,880
Full                          :            1 /            8,192

Take particular note of the “Full” block at the end of the report – that’s the block where we’ve used up rather less than 300 bytes. In fact if you look again at the first block dump you’ll see the avsp (available space) and tosp (total space) figures of 0x1e3e and 0x1f13 bytes (7,742 and 7,955 bytes). There’s loads of space in the block – but the block is marked in the bitmap space management map as full. That’s really bad news.
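
For reference, a report like the one above can be built on calls to dbms_space.space_usage – here’s a cut-down sketch that prints just three of the figures:

set serveroutput on

declare
        m_unf   number; m_unf_b  number;
        m_fs1   number; m_fs1_b  number;
        m_fs2   number; m_fs2_b  number;
        m_fs3   number; m_fs3_b  number;
        m_fs4   number; m_fs4_b  number;
        m_full  number; m_full_b number;
begin
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unf,  unformatted_bytes => m_unf_b,
                fs1_blocks => m_fs1,  fs1_bytes => m_fs1_b,
                fs2_blocks => m_fs2,  fs2_bytes => m_fs2_b,
                fs3_blocks => m_fs3,  fs3_bytes => m_fs3_b,
                fs4_blocks => m_fs4,  fs4_bytes => m_fs4_b,
                full_blocks => m_full, full_bytes => m_full_b
        );
        dbms_output.put_line('Unformatted: ' || m_unf  || ' / ' || m_unf_b);
        dbms_output.put_line('Freespace 4: ' || m_fs4  || ' / ' || m_fs4_b);
        dbms_output.put_line('Full       : ' || m_full || ' / ' || m_full_b);
end;
/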

On the plus side, 12.2 behaves differently, as noted by Sayan in his blog note. We still get the third row piece, but it’s in the same block as the first two and the block isn’t marked as full in the bitmap.

And there’s still more to come – but it will have to wait a little longer.

 

May 10, 2017

Quantum Space

Filed under: humour,Oracle,Troubleshooting — Jonathan Lewis @ 1:31 pm GMT May 10,2017

Here’s a not very serious note that makes a serious point.  I’ve got a small tablespace made up of 4 files, and here’s a little report I can run against the data dictionary for that tablespace:


select 'File space' What, nvl(sum(user_bytes)/1048576,0) MB from dba_data_files where tablespace_name = 'LOB_TEST'
union all
select 'Free space',      nvl(sum(bytes/1048576),0)         from dba_free_space where tablespace_name = 'LOB_TEST'
union all
select 'Extents',         nvl(sum(bytes/1048576),0)         from dba_extents    where tablespace_name = 'LOB_TEST'
union all
select 'Segments',        nvl(sum(bytes/1048576),0)         from dba_segments   where tablespace_name = 'LOB_TEST'
;

The name of the tablespace isn’t significant – it happens to be a tablespace I created to do some tests relating to space allocation with securefile LOBs, and it’s been hanging around ever since.

The query first reports the “user” space defined by the files – remember that there will be some space for header information and other metadata, and there may be some space near the end of the file which is smaller than the minimum extent allowed for that tablespace.

After the statement of total possible usable space we get: the free space, the space taken up by used extents, and the space taken up by segments.

Here’s the result I get currently:


WHAT               MB
---------- ----------
File space        196
Free space        196
Extents             0
Segments          196

The total usable space is 196MB, of which 196MB is free with no space allocated to extents – except 196MB HAS been allocated to segments. Depending on how you look at it the tablespace is either full or empty.

If you’re in the know the puzzle resolves itself if you query for the segment names, which look like: “BIN$TrDKUCvcVQbgUwEAAH9K2Q==$0” – I have one segment that is in the recycle bin, so I can recover it from the bin if I need it, but until I do Oracle can reuse the space if it wants it. Like Schrödinger’s cat, the object is both alive and dead until someone decides to peek.
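
If you’re not sure whether the recyclebin is the culprit a quick look at the bin will tell you:

select
        object_name, original_name, type, space
from
        user_recyclebin
;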

After a call to “purge recyclebin” the result changes to:


WHAT               MB
---------- ----------
File space        196
Free space        196
Extents             0
Segments            0

On a production system you might need to issue “purge dba_recyclebin” (if you have the appropriate privilege) to resolve the apparent contradiction as the call to “purge recyclebin” applies only to objects in your own schema.

This note was prompted by a question on the OTN database forum about a contradiction between a traditional SQL statement reporting free and used space and a screen dump from Enterprise Manager. I don’t think the thread reached a firm conclusion – but apart from the potential for the recyclebin to confuse the issue, there are extra possibilities thanks to auto-extensible data files, and Enterprise Manager’s scope for querying a completely different set of views such as dba_tablespace_usage_metrics and v$filespace_usage. In fact the EM code clearly had at least one error in it (which makes any of its results suspect) because it managed to report the critical tablespace as 390% used!
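
For comparison, the view that Enterprise Manager was (probably) using can be queried directly:

select
        tablespace_name, used_space, tablespace_size, used_percent
from
        dba_tablespace_usage_metrics
where
        tablespace_name = 'LOB_TEST'
;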

 

 

May 2, 2017

Aliases

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 9:23 am GMT May 2,2017

Here’s a performance problem that came up on OTN recently. The following query (reformatted) takes “ages” to run – how do you address the problem:

SELECT
	COUNT(*) 
FROM
	smp_dbuser2.workflow_step_report
WHERE
	report_ID IN (
		SELECT	report_id
		FROM	smp_dbuser2.workflow_report
		WHERE	trunc(start_time) = '28-Apr-2017'
		AND	user_id = 'nbi_ssc'
	)
;


Various pieces of relevant information were supplied (the workflow_report table holds 1.4M rows, the workflow_step_report table holds 740M rows, and some indexes were described), but most significantly we were given the execution plan:

--------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name                 | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                      |     1 |     6 |    10T  (1)|999:59:59 |       |       |
|   1 |  SORT AGGREGATE       |                      |     1 |     6 |            |          |       |       |
|*  2 |   FILTER              |                      |       |       |            |          |       |       |
|   3 |    PARTITION HASH ALL |                      |   731M|  4187M|  5363K  (1)| 17:52:47 |     1 |   128 |
|   4 |     TABLE ACCESS FULL | WORKFLOW_STEP_REPORT |   731M|  4187M|  5363K  (1)| 17:52:47 |     1 |   128 |
|*  5 |    FILTER             |                      |       |       |            |          |       |       |
|   6 |     PARTITION HASH ALL|                      |     2 |    38 | 14161   (1)| 00:02:50 |     1 |    32 |
|*  7 |      TABLE ACCESS FULL| WORKFLOW_REPORT      |     2 |    38 | 14161   (1)| 00:02:50 |     1 |    32 |
--------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( EXISTS (SELECT 0 FROM "SMP_DBUSER2"."WORKFLOW_REPORT" "WORKFLOW_REPORT" WHERE :B1=:B2
              AND "USER_ID"='nbi_ssc' AND TRUNC(INTERNAL_FUNCTION("START_TIME"))=TO_DATE(' 2017-04-28 00:00:00',
              'syyyy-mm-dd hh24:mi:ss')))
   5 - filter(:B1=:B2)
   7 - filter("USER_ID"='nbi_ssc' AND TRUNC(INTERNAL_FUNCTION("START_TIME"))=TO_DATE(' 2017-04-28
              00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

You’ll notice that the optimizer has transformed the IN subquery into an EXISTS subquery – operation 2 is a FILTER operation, and you can see that the filter predicate at operation 2 shows the existence subquery that would be executed.

If you look carefully at the execution plan (all of it), what can you deduce from it ? What, then, should be your next step in dealing with this performance problem ?

Observations

I did emphasise the need to examine ALL of the execution plan – and the important feature appears not in the body of the plan but in the predicate section.

The body tells us that Oracle has executed the query with a FILTER subquery approach, and we can see that the IN subquery has been transformed into an EXISTS subquery. In many cases Oracle could unnest the subquery and turn it into a join (often a semi-join), but it hasn’t in this case and we might wonder why not. Look closely at the text given for the subquery in the filter predicate section:


SELECT  0 
FROM   "SMP_DBUSER2"."WORKFLOW_REPORT" "WORKFLOW_REPORT" 
WHERE  :B1=:B2
AND    "USER_ID"='nbi_ssc' 
AND    TRUNC(INTERNAL_FUNCTION("START_TIME"))=TO_DATE(' 2017-04-28 00:00:00','syyyy-mm-dd hh24:mi:ss')

When an IN subquery is transformed into an EXISTS subquery, then we usually see: “outer.column in (select inner.column …)” turning into a predicate in the existence subquery of the form “inner.column = :B1”, so why do we have “:B1 = :B2” when we expect to see “workflow_report.report_id = :B1” ?

The (obvious, if you know your optimizer) answer is that there is no column report_id in table workflow_report but “column capture” means the optimizer has assumed that report_id in the subquery refers to workflow_step_report.report_id – hence “:B1 = :B2”. The consequence of this strange predicate is that the subquery may execute once for every row in the outer table (though scalar subquery caching may reduce the number of executions), performing a tablescan as it does so.
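
It’s easy to demonstrate “column capture” with a pair of hypothetical tables:

create table t1 (id number);
create table t2 (n1 number);

insert into t1 values (1);
insert into t2 values (99);
commit;

rem     t2 has no column called id, so "id" in the subquery is captured
rem     by the outer t1.id; the predicate is effectively "t1.id = t1.id"
rem     and every t1 row with a non-null id survives the filter

select  *
from    t1
where   id in (select id from t2)
;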

The correct next step is to check whether this was a simple typing error – the query as it stands is valid but not very sensible, so what was the intention? It turned out that there was a column workflow_report.id, and that was the column that should have been selected in the subquery. (The OP also changed the trunc(start_time) predicate to use a carefully constructed range-based clause – but that wasn’t really terribly important; and several people noted that some efficiency could be gained through suitable indexes – but that was missing the important point.)

Here’s the new query, with execution plan:


SELECT  COUNT(*) 
FROM    smp_dbuser2.workflow_step_report    wsr
WHERE   wsr.report_ID IN (
                SELECT  wr.id
                FROM    smp_dbuser2.workflow_report    wr
                WHERE   wr.start_time >= to_date( '28-Apr-2017','dd-mon-yyyy') 
                and     wr.start_time <  to_date( '28-Apr-2017','dd-mon-yyyy') + 1
                AND     wr.user_id = 'nbi_ssc'
        )
;

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                                |     1 |    31 | 22760   (1)| 00:04:34 |       |       |
|   1 |  SORT AGGREGATE           |                                |     1 |    31 |            |          |       |       |
|   2 |   NESTED LOOPS            |                                | 34458 |  1043K| 22760   (1)| 00:04:34 |       |       |
|   3 |    PARTITION HASH ALL     |                                |    72 |  1800 | 22759   (1)| 00:04:34 |     1 |    32 |
|*  4 |     TABLE ACCESS FULL     | WORKFLOW_REPORT                |    72 |  1800 | 22759   (1)| 00:04:34 |     1 |    32 |
|   5 |    PARTITION HASH ITERATOR|                                |   477 |  2862 |     1   (0)| 00:00:01 |   KEY |   KEY |
|*  6 |     INDEX RANGE SCAN      | WORKFLOW_STEP_REPORT_REPORT_ID |   477 |  2862 |     1   (0)| 00:00:01 |   KEY |   KEY |
----------------------------------------------------------------------------------------------------------------------------

The modified query completed in 5 seconds – presumably because there were only a few relevant rows in the workflow_report table and the related workflow_step_report rows were well clustered and accessible through a suitable “foreign key” index. (There’s also a hint in the plan costs that the partitioning column for workflow_step_report is the report_id.)

The final point to note about the rewritten query is the use of table aliases – both tables have a short alias (wsr and wr), and every column is qualified by its table alias. If this approach had been taken in the original code then the attempt to run it would have resulted in an error like:

ERROR at line 7:
ORA-00904: "WR"."REPORT_ID": invalid identifier

Update – 11th July 2017

Here’s a worse example of the same failure – deleting everything from a table because you didn’t use aliases properly.

April 27, 2017

Quiz Night

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 5:29 pm GMT Apr 27,2017

If this is the closing section of the tkprof output from the trace file of a single end-user session that has a performance problem, what’s the most obvious deduction you can make about the cause of the problem, and what sort of action would you take next ?


OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse      296      0.06       0.10          0        651          0           0
Execute    440      1.39       2.24          7       4664          0         146
Fetch      345     29.38      48.27          0    1709081          0         346
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     1081     30.84      50.63          7    1714396          0         492

Misses in library cache during parse: 5
Misses in library cache during execute: 7

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                     497        0.00          0.00
  SQL*Net message from client                   496       27.03         50.35
  direct path read                                1        0.00          0.00


OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse   128199      6.94      11.46          0       2740          0           0
Execute 2274845    371.25     605.60         30   10031162          0       68200
Fetch   2225314     10.94      18.17          5     879297          0      577755
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   4628358    389.14     635.23         35   10913199          0      645955

Misses in library cache during parse: 9701
Misses in library cache during execute: 134

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  direct path read                                1        0.00          0.00
  latch: shared pool                              3        0.00          0.00

12666  user  SQL statements in session.
  495  internal SQL statements in session.
13161  SQL statements in session.
********************************************************************************
Trace file: {concealed file name}.trc
Trace file compatibility: 11.1.0.7
Sort options: default

       1  session in tracefile.
   12666  user  SQL statements in trace file.
     495  internal SQL statements in trace file.
   13161  SQL statements in trace file.
    5266  unique SQL statements in trace file.
 39046623  lines in trace file.
     742  elapsed seconds in trace file.


There’s no absolutely right answer to the last question, by the way – there are a couple of variations in approach that are likely to get to the same result in the same time, and some of those variations might have made looking at the tail-end data the 2nd or 3rd step in the sequence.

Update

One of the skills of using your time effectively when trouble-shooting is the way you balance the time you spend noting the little details that might be relevant, and the time you then spend following up each detail to check for relevance and corroborating evidence. The amateurs might throw themselves into in-depth research on the first detail they notice; or if they’ve started with a short list of observations simply spend too much time on the first before moving on to the second.

An exercise like looking at the tail end of this tkprof file and talking about what you spot and what you might do is just a little exercise in how to pace yourself as you tackle a problem. So, from my perspective, here are a couple of obvious starting points:

  • I said it was an end-user session complaining about performance – the last line of the file tells us that the elapsed time was 742 seconds, and the summary of recursive statements tells us there were 2.2 million executions. How can ANYTHING an end-user wants to do “quickly” require 3,000 executions per second for 12 minutes? I can ask that question because I know that “recursive” executions aren’t necessarily “sys-recursive”: SQL statements executed inside a PL/SQL block are also recursive – on top of that I can see, anyway, that there are 12,666 (probably all different) USER statements in the file – the user (or the user’s code) is doing something it almost certainly shouldn’t be doing. Those 2.2M executions are responsible for 10M buffer visits – does that suggest a lot of single row processing?
  • Associated with the 13,161 statements in the trace file there are 9,700 misses in the library cache during parse – that means “hard” parsing, which probably means the user is getting through about 800 “new” statements per minute – but there are only 5,266 unique statements, so part of the problem is that some recently used statements (or child cursors, at least) are being flushed from the library cache: again that suggests they’re doing something wrong.
  • As one of the commentators pointed out – a detail that I hadn’t noted initially – the number of “current” buffers is zero. This isn’t a mini-batch updating the database, it’s just a report (done badly).

What would I do next ? If it wasn’t already immediately obvious I’d look for the statements that were responsible for the very large numbers of executions; a call to tkprof with sort=execnt as a parameter would push the high execution counts to the top of the file. (The worst one had over 900,000 executions, the next 600,000.) Then I’d do a grep, sed, sort with uniq -c to find the statements which (I’d guess – and there were 3 of them, responsible for a few thousand variants each) were generated texts using concatenation instead of bind variables. Then I’d go and find the owner of the code and sit down with them to work out how it should be re-engineered.
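
The call is simply the following (with a made-up trace file name):

tkprof ora_12345_user.trc sorted_by_execs.prf sort=execnt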

 

April 21, 2017

Undo Understood

Filed under: Infrastructure,Oracle,Read Consistency,Troubleshooting,undo — Jonathan Lewis @ 10:45 am GMT Apr 21,2017

It’s hard to understand all the ramifications of Oracle’s undo handling, and it’s not hard to find cases where the resulting effects are very confusing. A recent post on the OTN database forum resulted in one response insisting that the OP was obviously updating a table with frequent commits from one session while querying it from another, thereby generating a large number of undo reads in the querying session.

It’s a possible cause of the symptoms that had been described – although not the only possible cause, especially since the symptoms hadn’t been described completely. It’s actually possible to see this type of activity when there are no updates and no outstanding commits taking place at all on the target table. Unfortunately it’s quite hard to demonstrate this with a quick, simple script in recent versions of Oracle unless you do some insanely stupid things to make the problem appear – but I know how to do “insanely stupid” in Oracle, so here we go; first some data creation:

rem
rem     Script:         undo_rec_apply_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2017
rem

create table t2(v1 varchar2(100));
insert into t2 values(rpad('x',100));
commit;

create table t1
nologging
pctfree 99 pctused 1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        cast(rownum as number(8,0))                     id,
        cast(lpad(rownum,10,'0') as varchar2(10))       v1,
        cast(lpad('x',100,'x') as varchar2(100))        padding
from
        generator       v1,
        generator       v2
where
        rownum <= 8e4 -- > comment to bypass WordPress formatting issue
;

alter table t1 add constraint t1_pk primary key(id)
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

The t2 table is there as a target for a large number of updates from a session other than the one demonstrating the problem. The t1 table has been defined and populated in a way that puts one row into each of 80,000 blocks (though, with ASSM and my specific tablespace definition of uniform 1MB extents, the total space is about 80,400 blocks). I’ve got a primary key declaration that allows me to pick single rows/blocks from the table if I want to.

At this point I’m going to do a lot of updates to the main table using a very inefficient strategy to emulate the type of thing that can happen on a very large table with lots of random updates and many indexes to maintain:


begin
        for i in 1..800 loop
                update t1 set v1 = upper(v1) where id = 100 * i;
                execute immediate 'alter system switch logfile';
                execute immediate 'alter system flush buffer_cache';
                commit;
                dbms_lock.sleep(0.01);
        end loop;
end;
/

set transaction read only;

I’m updating every 100th row/block in the table with single row commits, but before each commit I’m switching log files and flushing the buffer cache.

This is NOT an experiment to try on a production system, or even a development system if there are lots of busy developers or testers around – and if you’re running your dev/test in archivelog mode (which, for some of your systems, you should be) you’re going to end up with a lot of archived redo logs. I have to do this switch to ensure that the updated blocks are unpinned so that they will be written to disc and flushed from the cache by the buffer cache flush. (This extreme approach would not have been necessary in earlier versions of Oracle, but the clever developers at Oracle Corp. keep adding “damage limitation” touches to the code that I have to work around to create small tests.) Because the block has been flushed from memory before the commit, the session will increment the statistic “commit cleanout failures: block lost” on each commit. By the time this loop has run to completion there will be 800 blocks from the table on disc needing a “delayed block cleanout”.
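
You can watch the failures accumulating by checking your session statistics as the loop runs – a sketch:

select
        sn.name, ms.value
from
        v$statname      sn,
        v$mystat        ms
where
        ms.statistic# = sn.statistic#
and     sn.name = 'commit cleanout failures: block lost'
;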

Despite the extreme brute force I use in this loop, there is a further very important detail that has to be set before this test will work (at least in 11.2.0.4, which is what I’ve used in my test runs). I had to start the database with the hidden parameter _db_cache_pre_warm set to false. If I don’t have the database started with this feature disabled Oracle would notice that the buffer cache had a lot of empty space and would “pre-warm” the cache by loading a few thousand blocks from t1 as I updated one row – with the side effect that the update from the previous cycle of the loop would be cleaned out on the current cycle of the loop. If you do run this experiment, remember to reset the parameter and restart the instance when you’ve finished.

I’ve finished this chunk of code with a call to “set transaction read only” – this emulates the start of a long-running query: it captures a point in time (through the current SCN) and any queries that run in the session from now on have to be read-consistent with that point in time. After doing this I need to use a second session to do a bit of hard work – in my case the following:

execute snap_rollstats.start_snap

begin
        for i in 1..10000 loop
                update t2 set v1 = upper(v1);
                update t2 set v1 = lower(v1);
                commit;
        end loop;
end;
/

execute snap_rollstats.end_snap

The calls to the snap_rollstats package simply read v$rollstat and give me a report of the changes in the undo segment statistics over the period of the loop. I’ve executed 10,000 transactions in the interval, which was sufficient on my system to use each undo segment header at least 1,000 times and (since there are 34 transaction table slots in each undo segment header) overwrite each transaction table slot about 30 times. You can infer from these comments that I have only 10 undo segments active at the time, your system may have many more (check the number of rows in v$rollstat) so you may want to scale up that 10,000 loop count accordingly.
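
A quick check of the number of undo segments (and how busy they have been):

select
        usn, xacts, gets, waits, wraps
from
        v$rollstat
order by
        usn
;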

At this point, then, the only thing I’ve done since the start of my “long running query” is to update another table from another session. What happens when I do a simple count() from t1 that requires a full tablescan ?

alter system flush buffer_cache;

execute snap_filestat.start_snap
execute snap_my_stats.start_snap

select count(v1) from t1;

execute snap_my_stats.end_snap
execute snap_filestat.end_snap

I’ve flushed the buffer cache to get rid of any buffered undo blocks – again an unreasonable thing to do in production but a valid way of emulating the aging out of undo blocks that would take place in a production system – and surrounded my count() with a couple of packaged call to report the session stats and file I/O stats due to my query. (If you’re sharing your database then the file I/O stats will be affected by the activity of other users, of course, but in my case I had a private database.)

Here are the file stats:


--------------
Datafile Stats
--------------
file#       Reads      Blocks    Avg Size   Avg Csecs     S_Reads   Avg Csecs     M_Reads   Avg Csecs    Max      Writes      Blocks   Avg Csecs    Max
File name
-----       -----      ------    --------   ---------     -------   ---------     -------   ---------    ---      ------      ------   ---------    ---
    1          17          17       1.000        .065          17        .065           0        .000      6           0           0        .000     15
/u01/app/oracle/oradata/TEST/datafile/o1_mf_system_938s4mr3_.dbf
    3         665         665       1.000        .020         665        .020           0        .000      6           0           0        .000     15
/u01/app/oracle/oradata/TEST/datafile/o1_mf_undotbs1_938s5n46_.dbf
    5         631      80,002     126.786        .000           2        .045         629        .000      6           0           0        .000     17
/u01/app/oracle/oradata/TEST/datafile/o1_mf_test_8k__cz1w7tz1_.dbf

As expected I’ve done a number of multiblock reads of my data tablespace for a total of roughly 80,000 blocks read. What you may not have expected is that I’ve done 665 single block reads of the undo tablespace.

What have I been doing with all those undo blocks ? Check the session stats:


Session stats
-------------
Name                                                                     Value
----                                                                     -----
transaction tables consistent reads - undo records applied              10,014
transaction tables consistent read rollbacks                                10

We’ve been reading undo blocks so that we can create read-consistent copies of the 10 undo segment headers that were active in my instance. We haven’t (and you’ll have to trust me on this, I can’t show you the stats that aren’t there!) reported any “data blocks consistent reads – undo records applied”.

If you want to see a detailed explanation of what has happened you’ll need to read Oracle Core (UK source), chapter 3 (and possibly chapter 2 to warm yourself up for the topic). In outline the following type of thing happens:

  • Oracle gets to the first block updated in t1 and sees that there’s an ITL (interested transaction list) entry that hasn’t been marked as committed (we flushed the block from memory before the commit cleanout could take place so the relevant transaction is, apparently, still running and the row is still marked as locked).
  • Let’s say the ITL entry says the transaction was for undo segment 34, transaction table slot 11, sequence 999. Oracle reads the undo segment header block for undo segment 34 and checks transaction table slot 11, which is now at sequence 1032. Oracle can infer from this that the transaction that updated the table has committed – but can’t yet know whether it committed before or after the start of our “long running query”.
  • Somehow Oracle has to get slot 11 back to sequence 999 so that it can check the commit SCN recorded in the slot at that sequence number. This is where we see “undo records applied” to make the “transaction table read consistent”. It can do this because the undo segment header has a “transaction control” section in it that records some details of the most recent transaction started in that segment. When a transaction starts it updates this information, but saves the old version of the transaction control and the previous version of its transaction table slot in its first undo record, consequently Oracle can clone the undo segment header block, identify the most recent transaction, find its first undo record and apply it to unwind the transaction table information. As it does so it has also wound the transaction control section backwards one step, so it can use that (older) version to go back another step … and so on, until it takes the cloned undo segment header so far back that it takes our transaction table slot back to sequence 999 – and the job is done, we can now check the actual commit SCN.  (Or, if we’re unlucky, we might receive an ORA-01555 before we get there)

So – no changes to the t1 table during the query, but lots of undo records read because OTHER tables have been changing.

Footnote:

In my example the tablescan used direct path reads – so the blocks that went through delayed block cleanout were in private memory, which means they weren’t in the buffer cache and didn’t get written out to disc. When I flushed the buffer cache (again to emulate aging out of undo blocks etc.) and repeated the tablescan Oracle had to go through all that work of creating read consistent transaction tables all over again.

Footnote 2:

Despite the grand claim implied by the title I hope everyone realises that this is just one little detail of how undo handling can interfere with their expectations of performance.

 

 
