Oracle Scratchpad

February 15, 2013

Transactions

Filed under: Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 8:07 am UTC Feb 15,2013

It’s very easy to get a lot of information from an AWR (or Statspack) report – provided you remember what all the numbers represent. From time to time I find that someone asks me a question about some statistic and my mind goes completely blank about the exact interpretation; but fortunately it’s always possible to cross check because so many of the statistics are cross-linked. Here’s an example of a brief mental block I ran into a few days ago – I thought I knew the answer, but realised that I wasn’t 100% sure that my memory was correct:

In this Load Profile (for an AWR report of 60.25 minutes), what does that Transactions figure actually represent ?

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:             57,567.09             12,028.39
              Logical reads:             48,043.83             10,038.54
              Block changes:                314.07                 65.62
             Physical reads:                667.70                139.51
            Physical writes:                 46.25                  9.66
                 User calls:                619.33                129.41
                     Parses:                505.67                105.66
                Hard parses:                 36.94                  7.72
                      Sorts:                313.05                 65.41
                     Logons:                  0.56                  0.12
                   Executes:              1,165.42                243.51
               Transactions:                  4.79

  % Blocks changed per Read:    0.65    Recursive Call %:    95.18
 Rollback per transaction %:   24.95       Rows per Sort:    25.09

And, while we’re at it, what does the “Rollback per transaction %” actually mean – and what, if anything, can we infer from the value ?

Since we’re looking at transactions, it’s a fairly safe bet that the figure is something to do with commits – but is it counting only committed transactions, or does it included rolled back transactions [philosophical question - is a change that's never committed really a transaction, since it "never happened" as far as the rest of the world is concerned?]. Fortunately we can look at the Intance Activity Statistics to check.

First, though, lets convert the “Per Second” figures into an absolute value by multiplying by the duration in seconds (3,615) of the report. Allowing for rounding errors we’re looking for a value between ceiling(4.785 *3615) and floor(4.795 * 3615), i.e. between 17,298 and 17,338.  Let’s also keep in mind that the “Rollback per transaction %” is very close to 25%.

Here, then, are a couple of useful figures from the Instance Activity:


Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
user commits                                 12,986            3.6           0.8
user rollbacks                                4,316            1.2           0.3

Almost immediately you can see that user rollbacks are roughly one third of user commits (I promise I didn’t massage these number), which means the rollback percentage is just (user rollbacks / (user commits + user rollbacks)). As a further sanity check, 4,316 + 12,986 = 17,302 which falls nicely into our required range. Ta-da, job done: “Transactions” is the sum of user commits and user rollbacks (as a first approximation – but is it the whole answer).

Of course, I picked a fairly extreme example from my AWR library to make a point – which means you might now be asking whether all those rollbacks pose some sort of threat. How serious are they ? Luckily there are a couple more statistics that tell us:


Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
rollback changes - undo records                 222            0.1           0.0
transaction rollbacks                            59            0.0           0.0

The name of the first statistic has been trimmed a little in this text report, the full name is: “rolllback changes – undo records applied”. As you can see, we’ve only applied 222 undo records in our 4,316 rollback calls, so we’re not really rolling anything back (most of the time). Moreoever, the “transaction rollbacks” corroborates this observation – we have only attempted to rollback 59 “real” (data-changing) transactions. Most of those rollbacks are probably the default “rollback after every call” that some web application servers make.

As a closing thought – if you’re responsible for several different systems, it’s convenient to keep a couple of ”reference” AWR or Statspack reports from busy, normal, and quiet periods for each system; that way, if you use the figures from one system to work out the meaning of some derived value you can use the figures from another system to check if your hypothesis is correct.

Footnote: there are two other statistics that include the text “undo records applied” (not that you’ll see those words in the textual AWR report), but they both relate to creating read-consistent copies of blocks.

Footnote 2: Is anyone getting worried by my “first approximation” comment – and has anyone started wondering if there may be more to transaction rollbacks that user rollbacks, and if so where they fit into the arithmetic ?  To be continued.

 

January 28, 2013

Losing it

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 6:08 pm UTC Jan 28,2013

The example I gave last week showing how a SORT operation in an execution plan might include the work of resolving function calls in your SQL and might, therefore, be reporting much higher resource utilisation than expected reminded me of some problems I’ve had with gaps in execution plans in the past. So I thought I’d give a little demonstration of the way in which the completeness of execution plans can develop over time.

We’ll start with the same two tables I had in last week’s demo.

create table t1
as
select
	rownum 			id,
	lpad(rownum,200)	padding
from	all_objects
where	rownum <= 2500 ; create table t2 as select	* from t1 ; begin 	dbms_stats.gather_table_stats( 		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);

	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T2',
		method_opt	 => 'for all columns size 1'
	);

end;
/

Once we have the tables and stats, we can start running a few very simple queries – I have a sequence of three queries to demonstrate, showing the lovely progression of history:

select
	(select max(t1.id) from t1 where t1.id <= t2.id) id
from	t2
;

select	id
from	t1
minus
select
	(select max(t1.id) from t1 where t1.id <= t2.id) id
from	t2
;

select
	case mod(id,2)
		when 1 then (select max(t1.id) from t1 where t1.id <= t2.id)  		when 0 then (select max(t1.id) from t1 where t1.id >= t2.id)
	end id
from	t2
;

The first, and simplest query, demonstrates 8i failing to produce the right plan – but 9i handles it correctly:

Plan from 8.1.7.4
  Id  Par  Pos  Ins Plan
---- ---- ---- ---- ----------------------------------------------------------------------
   0         3        SELECT STATEMENT (all_rows)     Cost (3,2500,10000)
   1    0    1    2     TABLE ACCESS (analyzed)  TEST_USER T2 (full)  Cost (3,2500,10000)

Plan from 9.2.0.8
-------------------------------------------------------------------------
| Id  | Operation            |  Name       | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |             |  2500 | 10000 |    12   (9)|
|   1 |  SORT AGGREGATE      |             |     1 |     4 |            |
|*  2 |   TABLE ACCESS FULL  | T1          |   125 |   500 |    12   (9)|
|   3 |  TABLE ACCESS FULL   | T2          |  2500 | 10000 |    12   (9)|
-------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."ID"<=:B1)

The second query, including a section of SQL that 9i handled properly, shows an incomplete plan in 9i and 10g, but gets a complete plan in 11g. (The row estimate of 125 in line 5 of the second plan is the usual 5% estimate for a range-based predicate against an unknown value: 5% of 2,500 is 125).

Plan from 9.2.0.8 / 10.2.0.5
-------------------------------------------------------------------------
| Id  | Operation            |  Name       | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |             |  2500 | 20000 |    26  (58)|
|   1 |  MINUS               |             |       |       |            |
|   2 |   SORT UNIQUE        |             |  2500 | 10000 |    13  (16)|
|   3 |    TABLE ACCESS FULL | T1          |  2500 | 10000 |    12   (9)|
|   4 |   SORT UNIQUE        |             |  2500 | 10000 |    13  (16)|
|   5 |    TABLE ACCESS FULL | T2          |  2500 | 10000 |    12   (9)|
-------------------------------------------------------------------------

Plan from 11.1.0.7
----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |  2500 | 20000 |    25  (56)| 00:00:01 |
|   1 |  MINUS              |      |       |       |            |          |
|   2 |   SORT UNIQUE       |      |  2500 | 10000 |    12   (9)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T1   |  2500 | 10000 |    11   (0)| 00:00:01 |
|   4 |   SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  5 |    TABLE ACCESS FULL| T1   |   125 |   500 |    11   (0)| 00:00:01 |
|   6 |   SORT UNIQUE       |      |  2500 | 10000 |    12   (9)| 00:00:01 |
|   7 |    TABLE ACCESS FULL| T2   |  2500 | 10000 |    11   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - filter("T1"."ID"<=:B1)

But if we start to hide subqueries inside CASE operators (decodes would do the same), 11g starts to get it a little wrong, as shown by the third example. The depth column of the plan_table can be calculated incorrectly, in this case giving the impression that line 6 is a descendent of line 5. (Funnily enough, 9i gets this example right because the code to display the plan from the plan_table uses the old “connect by” query on id and parent_id rather than the calculated depth column.

Plan from 11.1.0.7 / 11.2.0.3
------------------------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |  2500 | 20000 |    25  (56)| 00:00:01 |
|   1 |  MINUS                |      |       |       |            |          |
|   2 |   SORT UNIQUE         |      |  2500 | 10000 |    12   (9)| 00:00:01 |
|   3 |    TABLE ACCESS FULL  | T1   |  2500 | 10000 |    11   (0)| 00:00:01 |
|   4 |   SORT AGGREGATE      |      |     1 |     4 |            |          |
|*  5 |    TABLE ACCESS FULL  | T1   |   125 |   500 |    11   (0)| 00:00:01 |
|   6 |     SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  7 |      TABLE ACCESS FULL| T1   |   125 |   500 |    11   (0)| 00:00:01 |
|   8 |   SORT UNIQUE         |      |  2500 | 10000 |    12   (9)| 00:00:01 |
|   9 |    TABLE ACCESS FULL  | T2   |  2500 | 10000 |    11   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - filter("T1"."ID"<=:B1)    7 - filter("T1"."ID">=:B1)

You might note, by the way, that in all these examples the estimated COST of the plans is very misleading. The optimizer has made no attempt to allow for the cost of the repeated execution of the scalar subqueries. This doesn’t really matter, of course, for very simple queries like this, but it could make a big difference if something of this sort were embedded in the middle of a more complex statement.

Next time you have to unravel the execution plan for a complex query with scalar subqueries floating around the place – there may be bits of the plan that you can’t see, or that aren’t doing quite what you think you’re being told. When interpretation gets tough make sure you track through the query and the plan to see if the plan is likely to be a complete and truthful representation of what the statement has to do.

January 25, 2013

Sorting

Filed under: Execution plans,Oracle,Troubleshooting — Jonathan Lewis @ 5:45 pm UTC Jan 25,2013

Here’s a little quirk of execution plans that came up recently on the Oak Table network. If you call a function in a query, and do some sorting with the results, where does the work of calling the function get reported in the execution plan if you trace the query or look at the in-memory rowsource execution stats. Let’s take a look at a simple example:

create table t1
as
select
	rownum 			id,
	lpad(rownum,200)	padding
from	all_objects
where	rownum <= 2500
;

create table t2
as
select	* from t1
;

-- collect stats

create or replace function f (i_target in number)
return number
as
	m_target	number;
begin
	select max(id) into m_target from t1 where id <= i_target;
	return m_target;
end;
/

So I’ve got two tables with exactly the same data and a function that will do a full tablescan of t1 (which is going to be 75 blocks) and return the original input (assuming the input was between 1 and 2,500). Here’s the query I want to run (and it will return no rows), followed by the base execution plan.


select	/*+ gather_plan_statistics */
	id
from 	t1
minus
select
	f(id)
from	t2
;

select * from table(dbms_xplan.display_cursor(null,null,'basic +rows'));

--------------------------------------------
| Id  | Operation           | Name | Rows  |
--------------------------------------------
|   0 | SELECT STATEMENT    |      |       |
|   1 |  MINUS              |      |       |
|   2 |   SORT UNIQUE       |      |  2500 |
|   3 |    TABLE ACCESS FULL| T1   |  2500 |
|   4 |   SORT UNIQUE       |      |  2500 |
|   5 |    TABLE ACCESS FULL| T2   |  2500 |
--------------------------------------------

Because of the call to f() in the select against t2, I’m going to call the function 2,500 times and incur a load of buffer reads (2,500 * 75) doing so. Where will those buffer gets and the attendant CPU appear in the plan ? This example is by no means an exhaustive analysis of all the possible options when you include functions in your select list, but in this particular case the function isn’t called until we run the SORT UNIQUE operation at line 4:


select * from table(dbms_xplan.display_cursor(null,null,'allstats last projection'));

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      0 |00:00:00.54 |     187K|       |       |          |
|   1 |  MINUS              |      |      1 |        |      0 |00:00:00.54 |     187K|       |       |          |
|   2 |   SORT UNIQUE       |      |      1 |   2500 |   2500 |00:00:00.01 |      75 | 70656 | 70656 |63488  (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |   2500 |   2500 |00:00:00.01 |      75 |       |       |          |
|   4 |   SORT UNIQUE       |      |      1 |   2500 |   2500 |00:00:00.54 |     187K| 70656 | 70656 |63488  (0)|
|   5 |    TABLE ACCESS FULL| T2   |      1 |   2500 |   2500 |00:00:00.01 |      75 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - STRDEF[22]
   2 - (#keys=1) "ID"[NUMBER,22]
   3 - "ID"[NUMBER,22]
   4 - (#keys=1) "F"("ID")[22]
   5 - "ID"[NUMBER,22]

The 187K buffer gets for the function calls (and any other resources) appear at line 4 of the plan, when you might have expected them to appear in line5 as part of the tablescan. This observation can be confirmed by checking the column projection information – the output from line 5 is the “ID”, the output from line 4 includes “F”(“ID”). I don’t often look at the projection information, but it’s nice to know that sometimes it can give you some ideas of what’s going on when the row source execution stats don’t seem to be what you were expecting.

Footnote: The same sort of effect appears with scalar subqueries in the select lists, although in any  recent versions of Oracle the plan for the scalar subquery appears in the main plan (although sometimes in a counter-intuitive position) and will give you a much better idea of where and why the work is being done.

 

January 11, 2013

Quiz Night

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 6:41 pm UTC Jan 11,2013

Warning – this is a catch question, and I haven’t given you enough information to have any idea of the right answer; though, by telling you that I haven’t given you enough information to have any idea of the right answer, you now have some information that might help you to get closer to the right answer.

I have a simple heap table with no indexes. Immediately after flushing the buffer_cache I’ve run a query that looks ike this:

select max(column_ZZZ) from table_X;

The most significant session stats for this operation are as follows:

Name                                           Value
----                                        ---------
session logical reads                          20,651
consistent gets                                20,651
consistent gets direct                         20,649
physical reads                                    655
physical reads cache                                1
physical reads direct                             654
Number of read IOs issued                           6
no work - consistent read gets                 20,649
table scan rows gotten                        329,922
table scan blocks gotten                       10,649
table fetch continued row                         645
buffer is not pinned count                     10,000

The instance is 11.2.0.3, so serial direct path reads have been used for the (necessary) tablescan. The tablespace is using 8KB blocks, 1MB uniform extent sizing, and manual (freelist) segment space management. The data in the table was created by a pl/sql loop of inserts with commits, there have been no updates, deletes, merges or rollbacks. Here’s the code (with one crtical detail hidden) that populated the table:

begin
	for i in 1..X loop
		insert into member(member_id, block_age_max_nbr) values (1,1);
		commit;
	end loop;
end;
/

execute dbms_stats.gather_table_stats(user,'member',method_opt => 'for all columns size 1')

Roughly how many rows are there in the table ?

Update 13th Jan:

The answer is 10,000; the interesting observations are in this comment and the reply.

January 7, 2013

Analysing Statspack 13

Filed under: Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 6:44 pm UTC Jan 7,2013

A recent (Jan 2013) post on the OTN database forum reported a performance problem on Oracle 9.2.0.6 (so no AWR), and posted a complete statspack report to one of the public file-sharing sites. It’s been some time since I did a quick run through the highlights of trouble-shooting with statspack, so I’ve picked out a few points from this one to comment on.

As usual, although this specific report is Statspack, the same analysis would have gone into looking at a more modern AWR report, although I will make a couple of comments at the end about the extra material that would have been available by default with the AWR report that would have helped us help the OP.

First, here are the Load Profile, and “Top 5″ parts of the report. To make sense of them, you also need to know that report covered a two-hour interval, and the machine had 12 CPUs; there’s also a little clue in the fact that this is 32-bit Oracle with a 600MB SGA (with the sga_max_size set to 1.6GB) running (alone) on a machine with 12GB of memory.

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:             54,437.02             65,579.53
              Logical reads:          2,053,336.52          2,473,626.15
              Block changes:                383.39                461.86
             Physical reads:             43,270.67             52,127.57
            Physical writes:                 46.63                 56.18
                 User calls:                 35.83                 43.16
                     Parses:                 25.53                 30.76
                Hard parses:                  0.18                  0.21
                      Sorts:                186.25                224.38
                     Logons:                  0.08                  0.10
                   Executes:                108.76                131.02
               Transactions:                  0.83

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time                                                       74,717    84.80
latch free                                        181,788       8,101     9.19
db file scattered read                         47,546,136       2,893     3.28
db file sequential read                        23,092,208       2,006     2.28
buffer busy waits                               1,812,920         193      .22
          -------------------------------------------------------------

Working from the top down, the first thing we notice is that 2M logical I/Os per second is quite busy, and that 43, 000 blocks read per second is pretty amazing – especially since the number of user calls and executes isn’t particularly large.

A cross check to the Top 5 then shows that a huge fraction of the work done by the database is CPU. At 2 hours and 12 cores we have 12 * 2 * 3600 = 86,400 CPU seconds available, and we’ve used 74,700 of them – and that’s into Cary Millsap’s “knee of the curve” for queueing problems. Since this isn’t the AWR, of course, we don’t have the OS Stats available, so we can’t see if the rest of the machines CPU is being used up outside the database – but looking at the db file read figures I’ll bet it is. If you have extreme CPU usage in an Oracle system, then latch contention problems are very likely to show up – and that’s our next highest figure – AWR would probably tells us which latch but, again based on the db file read figures, I’ll guess it’s the cache buffers chains latches.

So we look at the db file read figures and do the arithmetic that the AWR would show us to work out that we’re seeing multiblock read times of 0.06 milliseconds, and single block read times of 0.08 milliseconds. These have to be coming from the file system cache to be this fast. (Even then I think we may also seeing a timing problem (perhaps a measurement bug) that’s managing to lose wait time.) First thoughts, then – are that there is some code that is so inefficient that it’s doing far more physical I/O than it should; on the plus side the excess memory (12GB) means we’re being protected from the worst effects of disk I/O, but the caching effect means we’re able to burn up CPU at an astonishing rate to do the work.

So let’s look at the SQL to see where the CPU and disk activity is likely to be going. Here’s extract from the SQL ordered by Gets, and SQL Ordered by Reads. (We don’t have SQL ordered by Time in Statspack, of course):

                                                     CPU      Elapsd
  Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
  5,636,117,159           28  201,289,898.5   38.1 ######## #########  835260576
Module: RAXTRX
 INSERT INTO RA_INTERFACE_ERRORS  (INTERFACE_LINE_ID,   MESSAGE_
TEXT,   INVALID_VALUE) SELECT  INTERFACE_LINE_ID, :b_err_msg6, '
trx_number='||T.TRX_NUMBER||','||'customer_trx_id='||TL.CUSTOMER
_TRX_ID FROM RA_INTERFACE_LINES_GT IL, RA_CUSTOMER_TRX_LINES TL,
 RA_CUSTOMER_TRX T WHERE  IL.REQUEST_ID = :b1 AND    IL.INTERFAC

    511,626,129           20   25,581,306.5    3.5  8155.90  15045.56  205460129
Module: JDBC Thin Client
select  TRANSACTION_BRANCH , BRANCH_ADDRESS_1 , BRANCH_ADDRESS_2
 , BRANCH_PHONES , BRANCH_FAX , BRANCH_TIN , TRX_NUMBER , TRX_NU
MBER_DISP , TRX_DATE , TRX_DATE_DISP , BILL_TO_CUSTOMER_ID , BIL
L_TO_CUSTOMER_NAME , SHIP_DATE_ACTUAL_DISP , SHIP_TO_NAME , SHIP
_TO_ADDRESS1 , SHIP_TO_ADDRESS2 , SHIP_TO_ADDRESS3 , SHIP_TO_ADD

                                                     CPU      Elapsd
 Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
    281,953,754           20   14,097,687.7   90.5  8155.90  15045.56  205460129
Module: JDBC Thin Client
select  TRANSACTION_BRANCH , BRANCH_ADDRESS_1 , BRANCH_ADDRESS_2
 , BRANCH_PHONES , BRANCH_FAX , BRANCH_TIN , TRX_NUMBER , TRX_NU
MBER_DISP , TRX_DATE , TRX_DATE_DISP , BILL_TO_CUSTOMER_ID , BIL
L_TO_CUSTOMER_NAME , SHIP_DATE_ACTUAL_DISP , SHIP_TO_NAME , SHIP
_TO_ADDRESS1 , SHIP_TO_ADDRESS2 , SHIP_TO_ADDRESS3 , SHIP_TO_ADD

     94,281,089        8,550       11,027.0   30.3  2924.20   4550.22 3761220362
Module: JDBC Thin Client
SELECT DELV_ADDR.ADDRESS1 FROM APPS.OE_ORDER_HEADERS_ALL OE_HEAD
ERS , APPS.RA_SITE_USES_ALL DELV_SITE , APPS.RA_ADDRESSES_ALL DE
LV_ADDR WHERE TO_CHAR(ORDER_NUMBER) = :B1 AND DELV_SITE.SITE_USE
_ID(+) = OE_HEADERS.DELIVER_TO_ORG_ID AND DELV_ADDR.ADDRESS_ID(+
) = DELV_SITE.ADDRESS_ID

     94,011,901        8,544       11,003.3   30.2  2919.47   4527.54  803969757
Module: JDBC Thin Client
SELECT COALESCE(DELV_ADDR.CITY, DELV_ADDR.PROVINCE) FROM APPS.OE
_ORDER_HEADERS_ALL OE_HEADERS , APPS.RA_SITE_USES_ALL DELV_SITE
, APPS.RA_ADDRESSES_ALL DELV_ADDR WHERE TO_CHAR(ORDER_NUMBER) =
:B1 AND DELV_SITE.SITE_USE_ID(+) = OE_HEADERS.DELIVER_TO_ORG_ID
AND DELV_ADDR.ADDRESS_ID(+) = DELV_SITE.ADDRESS_ID

     93,981,254        8,551       10,990.7   30.2  2919.13   4533.69 4273350236
Module: JDBC Thin Client
SELECT DELV_ADDR.ADDRESS2 FROM APPS.OE_ORDER_HEADERS_ALL OE_HEAD
ERS , APPS.RA_SITE_USES_ALL DELV_SITE , APPS.RA_ADDRESSES_ALL DE
LV_ADDR WHERE TO_CHAR(ORDER_NUMBER) = :B1 AND DELV_SITE.SITE_USE
_ID(+) = OE_HEADERS.DELIVER_TO_ORG_ID AND DELV_ADDR.ADDRESS_ID(+
) = DELV_SITE.ADDRESS_ID

The first statement reported by Gets is responsible for 500 millions buffer gets in the interval (and even then that’s 40% of the total – so there may be a couple more big ones that have fallen out of the library cache in the interim, perhap an hourly report on the previous hour would catch them). Look at the table that the SQL is inserting into – RA_INTERFACT_ERRORS. We might hope that an error table wasn’t going to receive many errors – so perhaps this is just an example of a statement with an insanely bad execution plan; clearly one to look at. Trouble is, though, that unless you set statspack to run at level 6 you don’t capture the execution plans – unlike the AWR which capture them by default. The batch took a couple of days to complete, though, so it would have been possible either to change the snapshot level, or to pull the execution plans from memory.

Looking at the SQL ordered by Reads, we see another example of the top statement being much more expensive than anything following it – but there’s an odd little pattern that caught my eye as I scanned the list. The next three statements look very similar to each other, and have similar statistics, and if you add up their stats the answer is very close to the figures for the first statement. Moreover, the report says that the top statement is responsible for 90% of the work – I think the top statement is actually calling the next three, so we need to look at those statements to see how to make them efficient in order to fix the first statement.

You might note, by the way, the capitals and the bind variables (:B1) in the next three statements – I suspect that there is a PL/SQL function in the first statement that is being called too many times. (20 executions of the first statement, 8,500 of the other three). Perhaps a change in the execution plan of the main query has produced a change in table order which has resulted in some predicate – one with a pl/sql function in it – being tested much more frequently than usual.

A quick cross-reference to the instance activity is in order:

physical reads                           311,462,252       43,270.7     52,127.6
prefetched blocks                        240,396,452       33,397.7     40,233.7

table fetch by rowid                  12,348,226,523    1,715,508.0  2,066,648.8
table fetch continued row             12,047,845,308    1,673,776.8  2,016,375.8

table scan blocks gotten                 361,692,971       50,249.1     60,534.4
table scan rows gotten                 7,787,418,493    1,081,886.4  1,303,333.6
table scans (long tables)                     32,041            4.5          5.4

A huge fraction of the physical reads seem to be for prefetched blocks – that a good indication that they are for tablescans (or index fast full scans), and when we compare rowid access with tablescan access we see that there are a lot of “long” tablescans (maybe the expensive three queries are all doing tablescans of a long table, that would account for 25,500 of those reported), with with 360M blocks scanned. Apart from anything else, all those block scans will be using a big chunk of CPU. The AWR would show us which segments were being scanned, of course, as would statspack if we were running it at level 7.

I’ve included the table fetch by rowid and table fetch continued row. Normally we should be concerned by “continued fetches” – but in this case I’m prepared to overlook them temporarily: it’s possible that after the critical SQL is corrected we may still see that most of the fetches by rowid turn into continued fetches, and at that point it will be worth worrying; but at present, even though they’re probably responsible for a serious fraction of the CPU usage, those continued fetches may be for rows that we’re not going to be fetching once the SQL is doing the right thing.

The last little detail, just for confirmation, comes from the latch activity part of the report; I’ve just selected the busiest 4 latches:


                                           Pct    Avg   Wait                 Pct
                              Get          Get   Slps   Time       NoWait NoWait
Latch                       Requests      Miss  /Miss    (s)     Requests   Miss
------------------------ -------------- ------ ------ ------ ------------ ------
cache buffers chains     ##############    0.0    0.0      0  577,860,644    0.1
cache buffers lru chain         157,333    0.3    0.0      0  687,799,033    0.8
multiblock read objects     114,602,621    0.6    0.0     12            0
simulator hash latch        811,998,194    0.0    0.0      2            0

                                      Get                            Spin &
Latch Name                       Requests      Misses      Sleeps Sleeps 1->4
-------------------------- -------------- ----------- ----------- ------------
cache buffers chains       ##############  12,262,452     180,671 0/0/0/0/0
multiblock read objects       114,602,621     719,477         115 719369/101/7/0/0
simulator hash latch          811,998,194       3,889          19 3872/15/2/0/

As you can see, and as we had guessed, they’re all about the buffer cache – doing multiblock reads and loading/dumping blocks from the cache; and almost all of the sleeps are on the cache buffers chains.

Next Steps

As a quick and dirty which might help a bit – put more of the sga_max_size into the buffer cache – it doesn’t happen automatically in 9.2 – and that may reduce the interaction between the O/S and Oracle and save a bit of latching and CPU.

Address the three statements doing all the physical I/O, and check the path for the statement that cause them to be executed. Fix the statement that has done 40% of the buffer gets. In all cases the problem may simply be an execution plan change – but perhaps the machine has been hiding an inherent performance problem and on this occasion a small change in data (a few more errors, perhaps) resulted in the excess work generating a massive increase in workload and contention.

We do know, however, that there is a lot of work done on buffer gets that didn’t show up in this report – so run off the report for the previous hour and see if we can find a couple more big hitters. Once we’ve sorted out the worst SQL, we might then want to see if the volume of “continued rows” is high – and find out why they’re appearing. It’s possible that it’s due to a table with more than 254 columns, of course, or a long column, but we’ll worry about that only when we need to.

One final thought – it’s easy to fix the wrong thing when you’re looking at statspack reports. Although we can see some SQL that is “obviously” overloading the machine, maybe the business critical task that has to complete promptly is simply hidden because it does (relatively) little work. Unless we can fix the big problems extremely quickly, we ought to take a little time to check with the business and the developers that what we’re looking at really is the code that’s running the critical business task.

Longer term, of course, upgrade to 9.2.0.8 at least and then start running Statspack at level 7 for better information. Then upgrade some more.

December 7, 2012

Update Error

Filed under: Execution plans,Oracle,subqueries,Troubleshooting — Jonathan Lewis @ 5:10 pm UTC Dec 7,2012

When doing updates with statements that use correlated subqueries, it’s important to make sure that your brackets (parentheses) are in the right place. Here are two statements that look fairly similar but have remarkably different results – and sometimes people don’t realise how different the statements are:

update t1
set
        padding = (
                select
                        t2.padding
                from    t2
                where   t2.id = t1.id
        )
where   t1.small_vc &lt;= lpad(20,10,'0')
;

update t1
set
        padding = (
                select
                        t2.padding
                from    t2
                where   t2.id = t1.id
                and     t1.small_vc &lt;= lpad(20,10,'0')
        )
;

The first statement will update rows in t1 where t1.small_vc <= lpad(20,10,’0′), copying the value of padding from t2 where (if) the id columns match, but setting t1.padding to null if there is no match.

The second statement will update every single row in t1 – some of the rows will probably be updated as expected (i.e. to get the same result as the first statement) but every row where t1.small_vc is greater than lpad(20,10,’0′) will have the padding column set to null.

Originally I created this little demo to respond to a question on OTN – but when I ran it with rowsource execution statistics enabled, I discovered that it also happened to demonstrate an odd bug in that feature – even in 11.2.0.3. Here are the two plans corresponding to the above two statements. The update is supposed to update 20 rows in t1, the incorrect form of the code will update all 10,000 rows in the table:


20 rows updated.

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  73tz2pqv4z8rc, child number 0
-------------------------------------
update t1 set  padding = (   select    t2.padding   from t2   where
t2.id = t1.id  ) where t1.small_vc &lt;= lpad(20,10,'0')

Plan hash value: 417405447

------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT             |       |      1 |        |      0 |00:00:00.01 |     219 |
|   1 |  UPDATE                      | T1    |      1 |        |      0 |00:00:00.01 |     219 |
|*  2 |   TABLE ACCESS FULL          | T1    |      1 |     20 |     20 |00:00:00.01 |     173 |
|   3 |   TABLE ACCESS BY INDEX ROWID| T2    |     20 |      1 |     20 |00:00:00.01 |      45 |
|*  4 |    INDEX UNIQUE SCAN         | T2_PK |     20 |      1 |     20 |00:00:00.01 |      25 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(&quot;T1&quot;.&quot;SMALL_VC&quot;&lt;='0000000020')
   4 - access(&quot;T2&quot;.&quot;ID&quot;=:B1)

10000 rows updated.

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  ax22zpj8xdwpn, child number 0
-------------------------------------
update t1 set  padding = (   select    t2.padding   from t2   where
t2.id = t1.id   and t1.small_vc &lt;= lpad(20,10,'0')  )

Plan hash value: 2442374960

-------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT              |       |      3 |        |      0 |00:00:00.27 |   21576 |
|   1 |  UPDATE                       | T1    |      3 |        |      0 |00:00:00.27 |   21576 |
|   2 |   TABLE ACCESS FULL           | T1    |      3 |  10000 |  29692 |00:00:00.03 |     514 |
|*  3 |   FILTER                      |       |  19692 |        |     40 |00:00:00.01 |      85 |
|   4 |    TABLE ACCESS BY INDEX ROWID| T2    |     40 |      1 |     40 |00:00:00.01 |      85 |
|*  5 |     INDEX UNIQUE SCAN         | T2_PK |     40 |      1 |     40 |00:00:00.01 |      45 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(:B1&lt;='0000000020')
   5 - access(&quot;T2&quot;.&quot;ID&quot;=:B1)

Note, in both cases, how the number of rows reported in the update operation (line 1) is zero although we might expect it to match the A-rows value reported in the table access full operation (line 2) which tells us how many rows were identified for update. But something strange has happened in the line reporting the table access full in the second case – clearly there’s a bug somewhere because there were only 10,000 rows in the table, yet the tablescan has (apparently) identified 29,692 rows for update.

Not only is the tablescan line reporting silly numbers, the filter line is as well (because that should only start 10,000 times – once per row in the table) moreover the subquery lines (4 and 5) should only start 20 times each because the filter should eliminate the need to run the subquery for all rows except the 20 where “T1″.”SMALL_VC”<=’0000000020. Even stranger is the fact that when I ran this test several times in a row (recreating the tables each time), the starts and A-rows values for the filter and table access full lines changed (decreasing) every time!

It took me a little while to figure out what was going on – but the doubling of the starts for the subquery gave me a hint, confirmed when I took a snapshot of how the session stats changed when I ran the second update. It’s just another variant of the odd “write consistency” behaviour that can appear when running tablescan through a large volume of data and updating it (although I have to say that I hadn’t seen it happening for a mere 10,000 rows before).  Oracle starts the tablescan and update, and after a while decides that it has to rollback and restart the update – the 20 rows that were subject to the subquery were the first 20 rows of the table, which is why they were almost sure to trigger the subquery twice each, had they appeared half way through the table the effect might not have been so consistent).

Footnote:

If you want to repeat my test for yourselves, here’s the code to generate the data sets:

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level &lt;= 1e4
)
select
        rownum                  id,
        lpad(rownum,10,'0')     small_vc,
        rpad('x',100)           padding
from
        generator       v1,
        generator       v2
where
        rownum &lt;= 1e4
;

create table t2
as
select * from t1
;

alter table t1 add constraint t1_pk primary key(id);
alter table t2 add constraint t2_pk primary key(id);

November 30, 2012

v$sqlstats

Filed under: AWR,Bugs,Oracle,Troubleshooting — Jonathan Lewis @ 5:45 pm UTC Nov 30,2012

Sometimes you find bugs on MOS (Metalink, OCIS, whatever) that make you feel positively ill. I’ve just been on a customer site where (in passing) they mentioned that one of their historic queries avainst v$sqlstats now tool just over one second (CPU) in 11.2.0.3 when it had previously taken about 200ms on 10.2.0.4***. After a little checking it seemed likely that the change was possibly related to the fact that they had increased the size of the SGA significantly, allowing for a much larger shared pool and library cache; however there have been numerous code changes in the shared pool area on the route from 10g to 11g, so I decided to check MOS to see if anyone else had seen a similar problem. I found this:

Bug 13550185 : QUERIES ON V$SQLSTATS MAY CAUSE CRASH OR HANG

Version 11.2.0.3

Fixed in 12.1

Last updated: 29th Nov 2012

There is a documented workaround in the base bug - stop collecting SQL execution statistics, but this rather defeats the point of having AWR snapshot capturing all that “SQL ordered by … “ stuff. On the other hand, if some queries against v$sqlstats can cause sessions to crash or the database to hang then every AWR snapshot might be the one that hangs your database !

The bug in question references Solaris 10 on SPARC, but a quick search for the bug number finds 66 other entries in the bug database and the first one I looked at was on Redhat Linux 5; and the bug also describes itself as “platform generic”. On a positive note, the status is currently “80 – Development to QA/Fix Delivered Internal”, and when I raised the reported the threat to the client, they contacted their Oracle Support contact and discovered that there was a patch available already (number 15033625), and that they had already installed it as a side effect of installing a merge patch to fix a problem with corruption in the shared pool. If you search the patch lists on MOS you’ll find that there are patches for many different versions in the 11.2.0.2 and 11.2.0.3 range, for several different platforms.

*** An increase from 200 ms to 1 second may not seem something worth worrying about, but the client ran the query every 10 seconds because (as it said on the packet in 10g) v$sqlstats is latch free and very cheap to run and (b) they have SLAs which mean they basically need to know if any of their critical, high-frequency, queries are going bad within one minute of the problem starting to occur.

August 30, 2012

Surprises

Filed under: Bugs,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 5:55 pm UTC Aug 30,2012

I’ve given examples in the past of how you can be suprised by Oracle when a mechanism that has “always worked” in the past suddenly just doesn’t work because some unexpected internal coding mechanism hits a boundary condition. One example I gave of this was rebuilding indexes online – where a limitation on the key size of index organized tables made it impossible to do an online rebuild of an index on a heap table because of an ORA-01450 (maximum key length) error that was raised against the (index-organized) journal table that  Oracle creates internally to support the rebuild.

Slightly more recently the topic of the sys_op_map_nonnull() function came up in some comments about the ora_hash function – and I made a passing comment about the function appearing in materialized view refreshes and then never followed it up. Fortunately I got an email from Jared Still and Don Seiler a few days ago which contained a nice short demonstration of the problem so, with their permission, I’ve published it below.
(more…)

July 25, 2012

Questions – 1

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 5:13 pm UTC Jul 25,2012

A recent question on the OTN Database forum:

If the block size of the database is 8K and average row length is 2K and if we  select all column of the table the I/O show that it had read more blocks then compare to specific column of the same table. Why is this?

Secondly if Oracle brings a complete block is the db buffer cache, while reading block from disk, they why there is a difference of block count in two queries. This difference reveals to me when I check the EXPLAIN PLAN for two different queries against the same table but one select all columns and the other one select specific column.

Kindly help me in clearing this confusion.

I can’t see anything subtle and complex in the problem as stated, so why doesn’t the OP give a clear explanation of what’s puzzling them?
(more…)

February 27, 2012

Geek Stuff

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 5:50 pm UTC Feb 27,2012

A recent post on the OTN database forum raises a problem with v$sql_shared_memory:

query to V$SQL_SHARED_MEMORY don’t return rows

please explain why ?

A follow-up posting then describes how the OP picked the view definition from v$fixed_view_definitions and used the text of that query instead of the view itself – and still got no rows returned:

SELECT * FROM
(select /*+use_nl(h,c)*/ c.inst_id,kglnaobj,kglfnobj, kglnahsh, kglobt03, kglobhd6, rtrim(substr(ksmchcom, 1, instr(ksmchcom, ‘:’, 1, 1) – 1)), ltrim(substr(ksmchcom, -(length(ksmchcom) – (instr(ksmchcom, ‘:’, 1, 1))), (length(ksmchcom) – (instr(ksmchcom, ‘:’, 1, 1)) + 1))), ksmchcom, ksmchptr, ksmchsiz, ksmchcls, ksmchtyp, ksmchpar from x$kglcursor c, x$ksmhp h where ksmchds = kglobhd6 and kglhdadr != kglhdpar)
WHERE ROWNUM < 3;

The answer is quite simple – but in two parts.

(more…)

February 10, 2012

11.2 New views

Filed under: AWR,Oracle,Troubleshooting — Jonathan Lewis @ 7:55 am UTC Feb 10,2012

It looks like Laimutis Nedzinskas has spent some time producing  a list of new views, and changes to existing views, for trouble-shooting specialists. The lists cover the dynamic performance views and the AWR history views, and the differences between 10.2 and 11.2. The lists include, where relevant, convenient hyperlinks to the view definitions in the online copy of  the 11g Server Reference manual.

I haven’t checked the lists in detail, but I can see myself  taking advantage of his efforts from time to time:

January 30, 2012

Index Hash

Filed under: CBO,Indexing,Oracle,Troubleshooting — Jonathan Lewis @ 6:12 pm UTC Jan 30,2012

You might think from the title that this little note is going to be about the index hash join – you would be half right, it’s also about how the optimizer seems to make a complete hash of dealing with index hash joins.

Let’s set up a simple data set and a couple of indexes so that we can take a closer look:
(more…)

January 26, 2012

Ouch!

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 8:19 pm UTC Jan 26,2012

Here’s a set of Instance Activity stats I’ve never seen before, and I’d rather never see again. From an active standby running 11.1.0.7 on AIX:
(more…)

December 19, 2011

Correlation oddity

Filed under: Bugs,Indexing,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 6:26 pm UTC Dec 19,2011

This one’s so odd I nearly posted it as a “Quiz Night” – but decided that it would be friendlier simply to demonstrate it. Here’s a simple script to create a couple of identical tables. It’s using my standard environment but, apart from fiddling with optimizer settings, I doubt if there’s any reason why you need to worry too much about getting the environment exactly right.
(more…)

November 13, 2011

Irrational Ratios

Filed under: Oracle,Ratios,Statspack — Jonathan Lewis @ 11:23 am UTC Nov 13,2011

I’ve pointed out in the past how bad the Instance Efficiency ratios are in highlighting a performance problem. Here’s a recent example from OTN repeating the point. The question, paraphrased, was:

After going through AWR reports (Instance Efficiency Percentages) I observed they have low Execute to Parse % but high Soft Parse %.
Please share if you had faced such issue and any suggestions to solve this

(more…)

« Previous PageNext Page »

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 1,439 other followers