Oracle Scratchpad

July 1, 2009

Distributed Queries

Filed under: Execution plans, Hints — Jonathan Lewis @ 7:20 am UTC Jul 1,2009

Some time ago I wrote a note about distributed DML, pointing out that the driving_site() hint works with distributed queries but not with distributed DML; so insert as select, or create as select and so on will “ignore” the hint.

This is just a little follow-up to give you an  idea of what execution plans for distrtibuted queries look like so that you can tell whether your query is going to work locally or remotely.

I have created a loopback database link (d10g@loopback) with a connection qualifier - another term for the glossary eventually – so the example runs on a single database, but the optimizer believes it is running on a distributed pair; and I’ve defined a couple of small tables to join.

Here’s the query with its execution plan when I run the join at the local database. I’ve used the driving_site() hint for textual clarity to identify the “home” database as the place to run the query, but in the absence of a hint I believe the query will always run at the local database – the optimizer appears to have no mechanism for evaluating the network cost of changing the location where the query runs.


SQL  select
  2  	/*+ driving_site (dh) */
  3  	dh.small_vc,
  4  	da.large_vc
  5  from
  6  	dist_home		dh,
  7  	dist_away@d10g@loopback	da
  8  where
  9  	dh.small_vc like '12%'
 10  and	da.id = dh.id
 11  ;

Execution Plan
----------------------------------------------------------
Plan hash value: 1261259267

------------------------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |     2 |   260 |    21   (0)| 00:00:01 |        |      |
|   1 |  NESTED LOOPS      |           |     2 |   260 |    21   (0)| 00:00:01 |        |      |
|*  2 |   TABLE ACCESS FULL| DIST_HOME |     2 |    30 |    19   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | DIST_AWAY |     1 |   115 |     1   (0)| 00:00:01 |   D10G | R->S |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DH"."SMALL_VC" LIKE '12%')

Remote SQL Information (identified by operation id):
----------------------------------------------------
   3 - SELECT "ID","LARGE_VC" FROM "DIST_AWAY" "DA" WHERE "ID"=:1 (accessing
       'D10G.JLCOMP.CO.UK@LOOPBACK' )

Note, in particular, that the operation in line 3 is “REMOTE” – and the name of the remote object is dist_away. This should be enough to tell us (at least in this case) where the query is actually executing and which database is the remote one.  The “Remote SQL Information” makes things even more clear: it tells us the SQL that is being sent to “the other” database, and tells us where that database is. This query is running at the database that holds table dist_home, and is sending (for each row selected from dist_home) a query to d10g@jlcomp.demon.co.uk@loopback.

Now we repeat the test, but put into the driving_site() hint the alias of the table that is located at the remote database:


SQL  select
  2  	/*+ driving_site (da) */
  3  	dh.small_vc,
  4  	da.large_vc
  5  from
  6  	dist_home		dh,
  7  	dist_away@d10g@loopback	da
  8  where
  9  	dh.small_vc like '12%'
 10  and	da.id = dh.id
 11  ;

Execution Plan
----------------------------------------------------------
Plan hash value: 4154226149

----------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT REMOTE      |           |     2 |   450 |    15   (0)| 00:00:01 |        |      |
|   1 |  NESTED LOOPS                |           |     2 |   450 |    15   (0)| 00:00:01 |        |      |
|   2 |   REMOTE                     | DIST_HOME |     2 |    40 |    13   (0)| 00:00:01 |      ! | R->S |
|   3 |   TABLE ACCESS BY INDEX ROWID| DIST_AWAY |     1 |   205 |     1   (0)| 00:00:01 |   D10G |      |
|*  4 |    INDEX UNIQUE SCAN         | DA_PK     |     1 |       |     0   (0)| 00:00:01 |   D10G |      |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("A1"."ID"="A2"."ID")

Remote SQL Information (identified by operation id):
----------------------------------------------------
   2 - SELECT "ID","SMALL_VC" FROM "DIST_HOME" "A2" WHERE "SMALL_VC" LIKE '12%' (accessing '!' )

Note
-----
   - fully remote statement

The “Note” section is a bit of a clue here – the query is running at the remote site. We also see at line 2 that the “REMOTE” operation is against a table called dist_home – which isn’t remote as far as we’re concerned, it’s in our local database! But we’re seeing the execution plan from the perspective of “the other” database – and to the other database we are remote. This is confirmed by the instance information and the “Remote SQL Information” where we see “!” as the location of dist_home. The “!” is my database’s name for itself: the remote database is running the query, and sending a select statement to me for each row it selects from dist_away.

 

Footnote: If you’ve ever seen sysdate, for example, turning into sysdate@! in the predicate section of an execution plan, you now know why. “!” is the database’s name for itself and the database is making it very clear whose value of sysdate it is using.

June 30, 2009

Connect By

Filed under: Uncategorized — Jonathan Lewis @ 9:22 am UTC Jun 30,2009

If you have to deal with hierarchical queries (or “connect by” queries, as they’re commonly known) you might be interested in this note on the OTN Forum where someone has a problem with repeated executions of the same query (same inputs, same results, same execution plan) operating at wildly different speeds with amazing changes in workload.

Ultimately the solution was the same as for the more obvious question: “why does my sort sometimes take much longer than usual?”  – shortage of memory (in this case a very low setting for the pga_aggregate_target).

This example is an interesting variation, though, as it demonstrates how the availability of resources can make Oracle choose to execute different parts of an execution plan – giving another reason for the “conditional plan” strategy I’ve described in the past.

There’s an interesting post on hierarchical queries on Christian Antognini’s blog (which I’ve also referenced from the OTN posting)

June 25, 2009

Explain VIEW

Filed under: Uncategorized — Jonathan Lewis @ 7:32 pm UTC Jun 25,2009

A brief note on reading execution plans.


------------------------------------------------------------
| Id  | Operation                     | Name    | Time     |
------------------------------------------------------------
|   0 | SELECT STATEMENT              |         | 00:00:17 |
|*  1 |  HASH JOIN                    |         | 00:00:17 |
|   2 |   VIEW                        | VW_SQ_1 | 00:00:01 |
|   3 |    HASH GROUP BY              |         | 00:00:01 |
|   4 |     TABLE ACCESS FULL         | EMP     | 00:00:01 |
|   5 |   VIEW                        |         | 00:00:17 |
|   6 |    TABLE ACCESS BY INDEX ROWID| EMP     | 00:00:17 |
|   7 |     INDEX FULL SCAN           | E_D     | 00:00:01 |
------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("DEPT_NO"="OUTER"."DEPT_NO")
       filter("OUTER"."SAL">"VW_COL_1")

Where you see the VIEW operator this is Oracle telling you that it has had to “suspend” operations and produce an intermediate result set by running the “sub-plan” under the VIEW.

You’re likely to see this operator if your query includes a non-mergeable view (perhaps because of a no_merge hint), or if Oracle has transformed your query in some way that has produced a non-mergeable view.

In the example above, one of the view operators (the one at line 2) refers to an object named “VW_SQ_1″. This is an internal view name that gives me a clue that Oracle has unnested an “existence” subquery. Its a convenient fact that many of the names generated by Oracle start with “VW_”, e.g. vw_nso_1, vw_nsq_1, vw_wif_1, vw_gbc_5, though there are some, inevitably, that don’t follow this convention, e.g. index$_join$_001

The other view operator (line 5) has appeared because I had an inline view with a no_merge hint in the main query.

Although the view operator requires an intermediate result set to be built, it’s important to remember that it need not create the entire result set before passing rows up to its parent (which is why I put quote marks around the word “suspend” earlier on in the article – the presence of the view operator  may have no visible blocking effect on the execution).

Looking at the example – the hash group by at line 3 has to complete before any rows can be passed up to the view operator and on to build the hash table in line 1, so there is a blocking effect there. By comparison, in line 5 the view operator can retrieve rows from the emp table in line 6 one at a time by virtue of the index full scan in line 7 and pass them on up to line 1 to probe the hash table – there is no blocking effect associated with the view.

In fact, when you look closely at what’s going on, you can see that the while the view represents the need to recognise an intermediate result set, the need for a complete data set isn’t dictated by the view operator, it is a function of the first child of the view.

June 24, 2009

Advert

Filed under: Uncategorized — Jonathan Lewis @ 9:50 am UTC Jun 24,2009

1st July, London – I’ll be repeating my “How to read Statspack/AWR” presentation again for the DBMS SIG of the UKOUG. It starts with a short set of slides, and ends with me picking one or two reports (text format, not HTML) from the audience and working through them in real time to demonstrate the approach that I use. I’ve been sent one report in advance – it would be nice to have a couple more, or get an email letting me know that you’re bringing one.

Planning ahead for the USA:

I will be doing a one-day event (Troubleshooting) in Richmond, Virginia on Monday 26th Oct, followed by a keynote for the Virginia Oracle User Group on 27th Oct. Then I’ll be flying on to Dallas to do two more days on 28th/29th Oct. Contact John Goodhue of SpeakTech for further details of the tutorials.

June 23, 2009

Glossary

Filed under: Infrastructure — Jonathan Lewis @ 11:43 am UTC Jun 23,2009

It occurred to me recently that I might be making casual use of terms whose meaning isn’t necessarily known to the less experienced user, so I’ve decided to build a glossary to supply a brief definition for each term.

It’s going to take some time to build, but I’ll try to keep adding to it whenever I have a few minutes, and add a comment to it to alert people about changes.

I’ve also added a special link near the top of the options column (in the “Special Links” list) to make it easy for people to get to the page.

If you have any suggestions for terms, feel free to add them in the comments. Please stick to one term per comment so that I can delete the comment after I’ve added the term to the glossary. As a guideline, you should be able to fit the term into a sentence like: “what is an X” – I’m not going to use this page to answer “how does X work.”

a.k.a.: “also known as”

Block cleanout: See this generic note on various uses of the term “clean” in Oracle.

Clean block: See this generic note on various uses of the term “clean” in Oracle.

Commit cleanout: See this generic note on various uses of the term “clean” in Oracle.

CPU Costing: See System statistics.

Delayed block cleanout: See this generic note on various uses of the term “clean” in Oracle.

Delayed logging block cleanout: See this generic note on various uses of the term “clean” in Oracle.

Dirty block: See this generic note on various uses of the term “clean” in Oracle.

Glossary: Alphabetical list of terms peculiar to a field of knowledge with definitions or explanations.

ITL: Abbreviation for Interested Transaction List. There is a little list stored near the top of each table or index block identifying transactions that have recently modified the contents of that block. When a transaction wants to change some rows in a block it has to acquire one of the ITL entries in that block and write some identifying information into that entry. 

The transaction needs to acquire only one ITL entry in the block no matter how many rows in the block it changes. A single transaction may modify many blocks at once and it has to own an ITL entry in each of those blocks. If all the ITL entries for a block are currently in use then a transaction may add a new ITL to the list – if there is space available to do so and if (for earlier versions of Oracle) the size of the ITL has not been limited by the setting for maxtrans. The initial size of the ITL is set by initrans- with a minimum setting that is dependent on the version of Oracle and the type of the object.

Interested Transaction List: See ITL.

LIO: (a.k.a Logical I/O) – a visit to a block that is in the buffer cache, initiated by a search though a cache buffers chain while holding a latch. (See also PIO). For further reading about logical I/Os, see this note.

MBRC: multiblock read count. Ideally this term should be used to refer to the System Statistic (q.v.) that represents the average number of blocks achieved (or expected) in multiblock reads (of type “db file scattered read”). Historically this abbreviation has been used as a shorthand for the parameter db_file_multiblock_read_count but to avoid confusion it would be better to refer to the latter as the dbf_mbrc (lower case) in future.

PIO: (a.k.a. Physical I/O). A call by Oracle to the operating system to copy a block into the buffer cache. Common usage is a little variable and lacking in precision; depending on context you may notice that the term is sometimes used to describe a single read request for multiple blocks – but it is usually used to refer to the number of blocks read.

q.v.: “which see” (Latin: quod vide)

SCN: abbreviation for System Change Number or System Commit Number (neither term is absolutely perfect). The SCN acts as a type of clock mechanism for a single instance – but it does not have a regular “tick”, instead it is a simple counter that goes up by one every time a session issues a commit (or rollback). But there are other occasions when the SCN can change: instances involved in distributed transactions or queries will synchronise their SCNs (upwards) at the end of each dialogue; similarly, the instances in a RAC setup resynchronise their SCNs extremely frequently. There is also a block-related cause for the SCN to change – each change to a block increments the block’s “change number” which consists of the current SCN combined with a single byte counter: if you make more than 255 changes in a period when no commits occur then the SCN will go up by one and the counter byte on that block will set itself back to one  (the value zero is reserved for “logically corrupt” blocks – reported by Oracle in the alert log under error ORA-01578).

A session takes note of the current SCN at various critical moments (when a transaction starts, when a transaction commits, when a query starts) and the current SCN is written to datablocks in various places (control files, data file headers, a couple of places in data block headers, ITL entries). A session is constantly comparing the current SCN, or one of its remembered SCNs, with the SCNs it sees stored in the database to ensure that it is looking at a safe, correct, and appropriate version of the data.

Segment Header Block: (also just Segment Header). Historically the first block of the first extent of a segment – holding critical information about the size of the segment, number and location of extents, and control information about free space in the segment. In newer versions of Oracle there may be a few space management blocks before the segment header block – this variation is a detail of the implementation of ASSM (automatic segment space management) tablespaces. The view dba_extents will give you the file and block id of the first block of the first extent; view dba_segments will give you the position of the segment header in that extent. Undo segments introduce a special case – the segment header block of an undo segment holds some extra information that is specific to the operation of undo.

System Statistics: (also referred to as “CPU costing”) a set of numbers modelling the performance of the hardware. (Not data-related statistics for objects in the SYSTEM tablespace).  The statistics attempt to model the average disk speed, the CPU speed, and the ability of the machine to cope with parallel execution. The numbers are:

  • cpuspeed: speed of the CPU in millions of some “Oracle benchmark operation” per second
  • cpuspeednw: as above using a “no workload” method – introduced in 10g.
  • sreadtim: the average read time for a single block read (“db file sequential read”) in milliseconds. If gathered this is rounded to the nearest millisecond, if you set it using dbms_stats.set_system_stats() you are not limited to whole numbers.
  • mreadtim: the average read time for a multiblock read (“db file scattered read”) in milliseconds. If gathered this is rounded to the nearest millisecond, if you set it using dbms_stats.set_system_stats() you are not limited to whole numbers. The assumed size of the read is given by the statistics MBRC
  • MBRC: the average size of a multiblock read (“db file scattered read”) actually achieved during the measurement interval, rounded to the nearest whole number of blocks.
  • maxthr: maximum throughput in bytes per unit time achieved by a session
  • slavethr: average throughput in bytes per unit time achieved by a parallel execution slave. The CBO will limit the adjustment made in its arithmetic for parallel execution by maxthr/slavethr
  • ioseektim: introduced in 10g – the average disk seek time in milliseconds; defaults to 10m/s
  • iotfrspeed: introduced in 10g – the disk data transfer rate in bytes per millisecond – default to 4KB (4096B) per millisecond. If you do not have values for the other I/O stats (or if they are inconsistent – e.g. mreadtim < sreadtim) then 10g will synthesize figure from these transfer rates and the db_file_multiblock_read_count.

Transaction Table: Each undo segment header block (q.v.) has an area in it called the transaction table. When a transaction starts, it picks an undo segment to use then picks the oldest free entry (called a transaction table “slot”) from that segment header’s transaction table. This slot is the identifying marker for a transaction, and information about it is published in two ways – first as the transaction ID in v$transaction, and secondly in v$lock where a TX lock is reported in mode 6 (exclusive) by the session running the transaction. Both locations identify the transaction by the triple: (undo segment number, slot number, sequence number). (The number of slots in a transaction table is limited so each one has a sequence number stamped on it, and each time a slot is re-used the sequence number goes up by one). As a transaction starts, the “start SCN” is one of the items written to the transaction table slot; when the transaction commits this is changed to the “commit SCN“.

Transaction Table Slot: a single row in a Transaction Table (q.v.) representing a single transaction. Since there are only a limited number of rows in a transaction table, part of the row data includes a “usage counter”, known as the “sequence”. A transaction id is then the combination of the undo segment number that the transaction table is in, the row number in the transaction table, and the sequence number of the row.

June 19, 2009

Tablespaces

Filed under: Infrastructure, Troubleshooting — Jonathan Lewis @ 7:08 pm UTC Jun 19,2009

This could nearly be one for the “ancient history” series, because it starts with a quote from that marvellous book “Practical Oracle 8i”, where I wrote in Chapter 8:

Tablespace names are never removed from the data dictionary (cluster SYS.C_TS#), and by default each tablespace entry occupies a minimum of one block in this cluster, even if it is a locally managed tablespace rather than a dictionary-managed tablespace. 

Since SMON scans TS$ every five minutes you need to avoid letting this table get too large.Your options are (1) hack the SQL.BSQ table that creates the data dictionary to reduce the cluster size in the C_TS# cluster – but make sure you get official approval from Oracle Support first, and (2) recycle tablespace names as much as possible and avoid using ‘time-related’ tablespaces names, especially if you are taking advantage of transportable tablespaces .

A few days ago, I got an email from Tom Kyte, because someone had written to AskTom asking him to throw more light on this comment.

Tom explained the comment, and I added a little extra note (the query is one used by smon to identify free space that can be coalesced in dictionary-managed tablespaces). But the notable  thing about this exchange was that someone then wrote in to say that they had experienced in 9i and 10g a variant of the problem that I had predicted - and Oracle has now got a patch for the problem in 10.2.0.4 with a possible enhancement in 11.2.

For some interesting background reading on how the number of tablespaces you have – or have had at some time – in your database, those of you with Metalink accounts might want to look at document ID 5861536.8, and Bug 5855429.

Open World

Filed under: Uncategorized — Jonathan Lewis @ 12:25 pm UTC Jun 19,2009

It’s been two or three years since I attended OOW in San Francisco, which means it’s about time I made the trip again (Swine Flu and Financial Crisis notwithstanding). So I put my name down to do a presentation.

The acceptance email came through a couple of nights ago, so if you’re interested in hearing me talk about how to make best use of hints,  I’ll be somewhere in the Moscone centre, some time towards the middle of October.

June 16, 2009

Clean it up

Filed under: Infrastructure — Jonathan Lewis @ 7:08 pm UTC Jun 16,2009

There is some confusion about the expression “clean” in Oracle circles, so I thought I’d write a short note to explain the different ways in which the word may be applied to Oracle blocks. There are five terms to consider:

  1. clean
  2. commit cleanout
  3. block cleanout
  4. delayed block cleanout
  5. delayed logging block cleanout

Clean:

A block (or, to be strictly accurate, a buffered copy of a block) is “clean” if and only if the copy in memory is the same as the version on disc. Conversly, if the copy in memory and the version on disc differ (presumably because something has modified the in-memory copy) then the buffer copy is “dirty”.

Commit Cleanout:

When you modify some data you will make some buffered blocks “dirty”. It is quite possible that the database writer (dbwr) will copy those blocks to disc (making the buffer versions “clean”) before you issue your commit. When you issue the commit, your session will update its transaction table slot (which is in an undo segment header block), generating a little redo to describe this action, and then call the log writer (lgwr) to write.

Optionally, your session may also revisit a few of the blocks it has modified and mark the associated ITL (interested transaction list) entry in that block with the “commit SCN”. It is this activity that is called a commit cleanout. The commit cleanout does not do a complete tidy up on the block (it leaves lock bytes in place, for example), it simply ensures that the next process to see the block knows that your transaction has committed. (This feature was introduced arounve 7.3 to reduce block pinging in OPS.)

Note that any of the blocks that had previously been written by dbwr will have been made “dirty” again and dbwr will have write to write them to disc again eventually. However, although a commit cleanout changes blocks it does not create any redo to describe those changes – and it doesn’t report db block gets for the block vists that do the changes.

Block Cleanout / Delayed Block Cleanout:

It’s possible that some of the blocks changed by your transaction were written to disc by dbwr before you issued the commit and were even flushed from the buffer cache before you issued the commit. Your session will not re-read these blocks in order to do a commit cleanout on them. In fact, if your transaction was quite large and modified a lot of blocks, it’s quite possible that your session will not even do a commit cleanout on all of the blocks which are still in memory – Oracle doesn’t want to make the user wait for a commit to complete – so it’s a bit lazy about commit cleanout.

At some later point in time another session may read one of those blocks and discover that the ITL includes a transaction that has committed but not been cleaned out. (It can work this out by cross-checking the ITL entry with the relevant transaction table slot in the undo segment header block).

 This session will read the commit SCN from the transaction table slot, tidy up the ITL entry, and clear all the related lock bytes.  (And it will do this for all committed transactions it finds on the block).  This process is known as block cleanout, and because this full cleanout mechanism never takes place on the commit it is more commonly known as delayed block cleanout.

The delayed block cleanout operation changes the block so it generates redo – which is why you may see redo being generated on a select statement, especially after a very big update. (Interestingly, Oracle will report db block changes as this happens – but doesn’t record the block visits as db block gets.)

Delayed Logging Block Cleanout

Finally, let’s go back to a block that was subject to commit cleanout. If you query this block you can see the effects of the committed transaction, but since the commit cleanout wrote the commit SCN into its ITL slot you know when the transaction committed and generally don’t need to do anything to tidy the block up as you read it. However if you want to update the block that’s a good moment at which to finish off the cleanout of the block – in fact you may even want to re-use that ITL entry and update some of the rows which are still showing a lock byte. So, at this point, you complete the block cleanout, and you generate redo that describes not only the changed you are now making but also the earlier commit cleanout.  This process is known as ‘delayed logging’ block cleanout – because the redo log for the commit cleanout has been delayed until this moment.

There are lots more tiny bits of information, odd behaviour in special cases, and peculiarities of statistics reporting that I could go into – but I’ve probably confused enough people already, so I’ll stop there.

June 14, 2009

Undocumented Hints

Filed under: Uncategorized — Jonathan Lewis @ 6:26 pm UTC Jun 14,2009

The bits of Oracle which aren’t documented always seem to be the bits that are hard to resist, so I thought I’d make a brief comment on undocumented hints.

Of course, you should not take advantage of any undocumented feature without first getting approval from Oracle support, but some hints seem to me to fall into a special category where you are more likely to get that approval – and here are my thoughts on why.

There seem to be four main reasons why some hints are not documented:

  • There are hints that you’re absolutely not supposed to know about. One such example is the infamous no_trigger hint that appeared briefly in a version that I will deliberately leave unspecified (in case anyone is still using it). And if anyone posts the version number in a comment, I will delete it.
  • There are hints which are actually internal hints that operate in situations that the end-user cannot manufacture. The index_rrs() hint that used to appear in some parallel query slave code (for parallel index fast full scans) was an example of this type of hint.
  • There are hints which I assume are not yet documented because there is some beta code that is still subject to finalisation sitting in the production release. The hint may become official, it may cease to exist. The not-quite-documented selectivity() hint of 9i – which subsequently disappeared – may have been such a case.
  • There are hints which never got into the (right place in the) documentation because no-one told the editors about it in time. I like to think that most of the undocunented hints like this – but I may be wrong. The dynamic_sampling_est_cdn() hint is probably an example of this type. It’s in the 9.2 Performance Tuning Guide and Reference – but not in the table of contents, the index, or the lists of hints given in the chapter on hints. Instead it appeared as an afterthought in the section on the dynamic_sampling() hint. It’s gone from the 10gR2 manual (and the documentation for hints has moved into the SQL Reference)

So when you look at undocumented hints, you have to ask yourself – are they supposed to be official, or will they soon be official, or are they actually official if only you can find where they’ve been documented.

Footnote: If you want to see a complete list of hints, including the version where they appeared, and the version where they got into the “stored outline” code, then 11g gives you the view v$sql_hint.

June 12, 2009

Consistent Gets – 2

Filed under: Infrastructure — Jonathan Lewis @ 8:45 pm UTC Jun 12,2009

How would you describe “db  block gets” and “consistent gets” in a couple of short paragraphs ? Having asked the question, I suppose I ought to offer my version of the answer up for scrutiny and dissection. 

Before supply my descriptions, I’d like to point out that there are (at least) two other mechanisms that Oracle uses to visit data blocks in the buffer cache: re-visiting pinned buffers, and doing a fast cleanout on commit.

These buffer visits can be seen in the statistics: “buffer is pinned count” and “commit cleanouts successfully completed”. (The statistic “commit cleanouts” tells you how many times Oracle tried to do a commit cleanout, and there are various statistics to explain the failures.)

Whan I asked the question originally, I pointed out that it as possible for Oracle to change a block without recording any type of “get” and without generating any undo or redo; that’s the “commit cleanout”. Strictly speaking I wasn’t telling the exact truth, the cleanout is completed by the next person to update the block, and they generate the relevant undo and redo, so technically you could argue that the commit cleanout is the source of some undo and redo, even though its apparance is delayed.

So, back on point:

A db block get is used when Oracle has to see the most recent version of a block, including all the uncommitted changes – but “commit cleanouts” are not considered to be db block gets even though the block has to be the most recent version of the block.

The two commonest reasons for db block gets are (a) accessing a block to change the data – you should only change the latest version of the block – and (b) access to index blocks to check uniqueness and  referential integrity constraints – behind the scenes, your session  needs to see uncommitted changes from other sessions (even though you, end-user aren’t allowed to see these changes) in case you are trying to make a change that will become illegal when the other session commits. This conflict is often the cause of mode 4 waits on TX enqueues. [Paragraph updated in response to a comment below.]

A consistent get is any other “latch protected” block visit (i.e. excluding visits to pinned buffers) that does not need to see the latest version of the block.

The two commonest reasons for consistent gets are (a) accesses to undo blocks to find records that can be used to construct a “read-consistent” version of a data block and (b) access to a version of  a data block showing only the committed changes as at a given point in time (as indicated by a system change numner (SCN)).

Of course, when all you want to do is give the right sort of impression about what’s going on, the “traditional” descriptions are adequate approximations: “a db block get access the latest version of a block”, “a consistent read uses undo to produce a version of a block as at an earlier point in time”. But it’s nice to be able to add more precision when you see unexpected patterns in the numbers in a Statspack or AWR report.

Here, for example, is a thought that’s often overlooked – and it’s a point I made in the earlier posting. A “read-consistent” block may be in a state that never existed in the past; consider this scenario:

  • Session 1 updates row 1 in the block, but does not commit.
  • Session 2 updates row 2 in the block, and commits
  • Session 3 queries the block

Session 3 has to apply undo records to a copy of the block to reverse out the change to row 1. The “read-consistent” copy of the block includes the change to row 2, but not the change to row 1 – it is in a state that never existed in the past.

Once you’ve realised the significance of this simple scenario you can understand why it’s possible to see extreme numbers of consistent gets when several processes mix short and long transactions to update a small section of a database – the processes are constantly thrashing through the undo segments to create versions of a handful of blocks to hide each other’s changes. This effect can become particularly serious with code that does “select for update” followed a little later by “update …”

One final thought – if you’re about to update a block, you acquire it with a db block get to make sure you’ve got the latest version. But before you update it you have to create a read-consistent copy to check that (a) the target rows you can see in the current block existed when you started your update and (b) there aren’t any  “missing” rows that were target rows when you started the update but don’t look like targets in the current block because they have been deleted or updated by other transactions (which may or may not have committed) in the interim.

When you start to think carefully about read-consistency, you sometimes wonder how the techies at Oracle Corp. managed to get it working at all !

June 9, 2009

Quiz Night

Filed under: Infrastructure — Jonathan Lewis @ 9:41 pm UTC Jun 9,2009

I was reading a blog earlier on this evening that was talking about the definitions of “consistent gets” and “db block gets”. The point of the article was that it’s very hard to find good descriptions of exactly what these two terms mean.

I’ll add the link to the blog item (and give my definitions) in a few days time but, in the interim, does anyone want to try their hand at defining these two very common terms in a couple of short paragraphs.

[Updated: 11th June] Just a brief note to tantalize a little more, did you know that :

  •  You can visit blocks in the buffer cache without recording a consistent get or a db block get.
  • You can CHANGE the current version of a block without recording a db block get (or even a consistent get – and without generating undo or redo).
  • Consistent gets can create versions of a block that have NEVER, EVER, existed at any point in time (or, much the same thing, at any SCN).

[Updated: 12th June] I’ve published my definitions in a separate note. But this is the link to the blog that started this thought running.

June 8, 2009

Book content

Filed under: Non-technical — Jonathan Lewis @ 7:12 pm UTC Jun 8,2009

Rob Freeman raised an interesting topic on Oracle-L a couple of weeks ago with the following:

My question is, what constitutes Oracle Book Writing mal-practice (and I pray I’ve never committed it).  Certainly mistakes crop up in books all the time, I’m as guilty as any writer of this. This chapter I’m reading though, in an effort to get the reader to doing something quickly, does not lay any foundation, skips critical steps and actually prompts them to do what I consider some very dangerous things.

The posting didn’t really generate a lot of discussion – which is a shame – and my privileges to write to Oracle-L lapsed some time ago, so I’m writing my response to Rob’s observations here.

Mark Powell made the following comment, which I thought was an important one:

It is one thing to make a mistake and another to write something without giving proper thought to related facts which could impact someone following the information given.

Like Mark, I think you have to go just a little further than “without attention to detail” to be accused of “Oracle Book Writing malpractice”; but when you’re writing about topics like backup and recovery you’re immediately on the boundary and should take extreme care with investigating and justifying your claims.

Surprisingly, though, the commonest form of response was along the lines of “caveat emptor” (let the buyer beware), “always check before you do anything”, and even one comment about it being “good to learn from mistakes”.

First: “let the buyer beware” is not an acceptable response. The buyer presumably bought the book to learn – so how can they be in a position to judge that what they’re being taught is bad. Note, particularly, that Rob said the chapter “skips critical steps” – so if the available instructions appear to work how is the learner supposed to know that something critical is missing ? By the time he finds out it may be far too late.

Secondly: if people like Rob won’t stand up in public and say “this is wrong” who will ? How will anything get better if no-one is prepared to point out the errors and omissions and explain what’s wrong ? I hope that some day, when I’m browsing Amazon perhaps, I’ll see a review from Rob on some recent Oracle book that says:  “Chapter NN should be treated with extreme caution and here are a few of the reasons….”. He’s probably too nice to be that blunt, but he maybe he could start with “Chapters A, B and C were good because … but …”.

In passing, I’ve had a couple of people tell me in the past that the best way to deal with rubbish is to write good stuff instead.

Wrong!

It’s easy to write rubbish, and rubbish propagates quickly. Look at the comments from Amar Kumar Padhi:

I recollect one incident related to RAID selection for disk storage. Different people produced different published sources that had different conclusion resulting in utter confusion.

If some of those published sources had been critical appraisals of other sources then Amar Kumar’s juniors might have been in a better position to assess the quality and relevance of the information and avoid confusion.

Don’t be satisfied with rubbish – there’s too much of it about and it’s not going to go away by itself.

Finally, for those of you who do want to live by “caveat emptor”, here’s a guideline that might help you avoid being burnt too often.

June 7, 2009

PGA leaks

Filed under: Infrastructure, Troubleshooting, trace files — Jonathan Lewis @ 7:53 pm UTC Jun 7,2009

Here’s a simple script that I created a short time ago while investigating a memory problem on a client site. The purpose of writing the script was, as always, to strip the client’s code back to a bare minimum in an attempt to work out the root cause of a problem. (Warning: if you want to run this script, your Oracle shadow process will grab about 1GB of PGA RAM )


define m_string_length = 200

drop table t1 purge;
create table t1(v1 varchar2( &m_string_length ));

create or replace procedure p1 (
	i_rowcount	in number	default 1000000,
	i_bulk_pause	in number	default 0,
	i_forall_pause	in number	default 0,
	i_free_pause	in number	default 0
)
as

	type w_type is table of varchar2( &m_string_length );
	w_list 		w_type := w_type();
	w_free		w_type := w_type();

begin
	for i in 1..i_rowcount loop
		w_list.extend;
		w_list(i) := rpad('x', &m_string_length );
	end loop;

	dbms_lock.sleep(i_bulk_pause);

	forall i in 1..w_list.count
	insert into t1 values(w_list(i));

	dbms_lock.sleep(i_forall_pause);

	commit;

	w_list := w_free;
	dbms_session.free_unused_user_memory;

	dbms_lock.sleep(i_free_pause);

end;
/

The procedure simply fills a pl/sql array with data, then uses the forall insert syntax to write the whole lot into a table. The first version of the code used a select with bulk collect to load the array (emulating the job run by the client), but I wanted to make the code even simpler.

After creating the table and procedure I simply called the procedure and, in a slightly more complex version of the code,  introduced some long wait times (using dbms_lock.sleep) between steps. Then I used another session (logged on as SYS) to monitor what demands this procedure made for RAM for the PGA as it ran.

This is the code I ran from the monitoring session (supplying the SID from the first session as the input parameter):


define m_sid = &1

column name format a40

column	value		format	999,999,999,999

column	category	format	a10
column	allocated	format	999,999,999,999
column	used		format	999,999,999,999
column	max_allocated	format	999,999,999,999

column	pga_used_mem		format	999,999,999,999
column	pga_alloc_mem		format	999,999,999,999
column	pga_freeable_mem	format	999,999,999,999
column	pga_max_mem		format	999,999,999,999

select
	name, value
from
	v$sesstat ss,
	v$statname sn
where
	sn.name like '%ga memory%'
and	ss.statistic# = sn.statistic#
and	ss.sid = &m_sid
;

select
	category,
	allocated,
	used,
	max_allocated
from
	v$process_memory
where
	pid = (
		select	pid
		from	v$process
		where
			addr = (
				select	paddr
				from	V$session
				where	sid = &m_sid
			)
		)
;

select
	pga_used_mem,
	pga_alloc_mem,
	pga_freeable_mem,
	pga_max_mem
from
	v$process
where
	addr = (
		select	paddr
		from	V$session
		where	sid = &m_sid
	)
;

The figures I’m after are just the session and process memory from v$sesstat, and the two different ways of reporting process memory from v$process and v$process_memory. These are the results I got on a database running 10.2.0.3. The first set reports the the state of PGA memory just after after the creation of the array, the second set reports it just after the call to the procedure has ended.


NAME                                                VALUE
---------------------------------------- ----------------
session uga memory                                221,824
session uga memory max                            221,824
session pga memory                            313,843,284
session pga memory max                        313,843,284

CATEGORY          ALLOCATED             USED    MAX_ALLOCATED
---------- ---------------- ---------------- ----------------
SQL                   1,008              104           55,276
PL/SQL               22,412           17,848           22,412
Other           627,269,497                       627,269,497

    PGA_USED_MEM    PGA_ALLOC_MEM PGA_FREEABLE_MEM      PGA_MAX_MEM
---------------- ---------------- ---------------- ----------------
     235,870,025      627,292,917                0      627,292,917

You’ll notice several key points.

First (obviously) v$sesstat tells use that we’ve used a lot of memory – which isn’t too surprising since we generated 1,000,000 character strings of length 200, so we know we’re going to see at least a couple of hundred megabytes of RAM being used.

But there’s much more to see in these three result sets. The report from v$process_memory says we’ve allocated more than 600MB in the “Other” category – which is a little surprising since (a) we know that we’ve used the memory for a pl/sql array – so why is it in “Other” – and (b) it’s twice as much memory as reported in v$sesstat and (c) we don’t see any of it in the “Used” column.

In fact, a dump of the pga and uga heaps (oradebug dump heapdump 5) shows that we have 313mb of RAM in the pga heap, of which 77MB is marked as free (but not yet released from the heap). From this we we can probably infer that the code maintaining v$process_memory is broken. I suspect that we really ought to see that 313MB should in both the “Allocated” and “Used” columns with, quite possibly, that 77MB appearing in a separate row (not visible in this report) for the category “Freeable”.

Note that the report from v$process is closer to the actual state given by the heap dump. It shows pga_used_mem at 236M – which means that the 77M free from the heap dump is correctly subtracted from the 313MB total, but possibly that 77MB should be reported as “PGA_Freeable_Mem”, rather than disappearing completely. This view still has a problem, though: the pga_alloc_mem and pga_max_mem are both displaying the same doubling effect as v$process_memory.

Bottom line: v$sesstat seems to show you most of the truth, with v$process helping you to seperate the freeable memory out from the currently allocated memory, but both v$process and v$process_memory are not to be trusted. (Some of these anomalies are still present in 11.1.0.6 by the way).

The next bit of output shows you the report results after the procedure call had completed:


NAME                                                VALUE
---------------------------------------- ----------------
session uga memory                            498,681,064
session uga memory max                        498,681,064
session pga memory                            598,662,740
session pga memory max                        911,924,820

CATEGORY          ALLOCATED             USED    MAX_ALLOCATED
---------- ---------------- ---------------- ----------------
SQL                   3,024              516           55,276
PL/SQL          488,711,516        4,222,224      488,711,516
Other           110,135,753                       736,607,661

    PGA_USED_MEM    PGA_ALLOC_MEM PGA_FREEABLE_MEM      PGA_MAX_MEM
---------------- ---------------- ---------------- ----------------
     498,885,505      598,850,293                0    1,225,374,453

This is extraordinary. The session has allocated – and is still holding – more than 488MB in the PL/SQL category just because of that one million row insert. (My client was processing 3.8 million rows, and they “lost” 7.7GB of memory to this operation – the memory loss gives the appearance of growing geometrically with the number of rows).

In fact the penality is more than that 488MB; the memory that had been allocated for the bulk collect in category “Other” has been released and the 110MB still remaining in that categiry is also a side effect of the forall insert.

Note, by the way, that the pga_max_mem from v$process_memory is still 313M higher than it should be, as is the max_allocated in the “Other” category in v$process_memory.

When I first saw this behaviour I decided it was obviously a bug and spent a good 20 minutes searching Metalink for possible matches without success (I was looking for “forall insert”, if I’d tried “bulk insert” I would have had more luck) – so I forwarded my test case to the rest of the Oak Table to see if they had any thoughts about it and got a fairly prompt reply from Tanel Poder that this was bug 5866410. (He’d done a pga heapdump, and searched Metalink for the label of the memory type that was using most of the space – like all good strategies it was so obvious after it had been explained !)

The bug is a pl/sql memory leak in “forall insert…” It’s fixed in 11.1.0.6 and 10.2.0.5, with backports available to 10.2.0.3 and 10.2.0.4 on a couple of platforms so far (Solaris and IBM ZLinux as I write).

So if your code does very large “forall insert” calls and you see Oracle error ORA-04030 from time to time, or workarea operations dumping to disc unexpectedly, you may temporarily be losing large amounts of memory to this bug – in which case check Metalink for the bug number and see if there’s a patch that’s good for your platform. (And if you’re on 10.2.0.3 or 10.2.0.4 and there isn’t a patch for your platform yet, there’s a reasonable chance you can get one created since it’s already been done on a couple of other platforms.)

June 5, 2009

Online Rebuild

Filed under: Uncategorized — Jonathan Lewis @ 7:44 pm UTC Jun 5,2009

Here’s a little oddity that may be waiting to catch you out – but only if you like to create indexes with very long keys.


create table t1(
	v1	varchar2(4000),
	v2	varchar2(2387),
	v3	varchar2(100)
);

create index t1_i1 on t1(v1, v2);

alter index t1_i1 rebuild;
alter index t1_i1 rebuild online;

My key value is at the limit for an 8KB block size in Oracle 9i and later – which is roughly 80% of (block size - 190 bytes). In earlier versions of Oracle the limit was roughly half that (i.e. 40% rather than 80%). 

If you try to create a longer key you’ll see Oracle error ORA-01450: “maximum key length (6398) exceeded”.  (If you’ve built your indexes using a different blocksize the number in brackets will be different – and you will have to adjust the demo code for 16KB and 32KB block sizes). The difference between my declared column lengths and the error limit relates to the overheads in an index entry – but seems to “forget” the one byte extra for non-unique indexes.

But this is the output you’ll (probably) get from running the script if you’re using an 8KB block size:


Table created.

Index created.

Index altered.

alter index t1_i1 rebuild online
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-01450: maximum key length (3215) exceeded

So I can create an index with a very large key, rebuild that index safely – and then fail when I try to rebuild the index with the online option!

What’s going on ?

When you use an online index rebuild Oracle has to create a log of all changes you make to the table while the rebuild is running. Once the log is set up, Oracle creates a new index using a read-consistent copy of the data in the table,  then copies the logged changes into this index, and finishes off by tidying up the data dictionary, dropping temporary segments, and dropping the log.

This process locks the table twice, by the way,  once while getting started and creating the log, and then again when finishing off and dopping the log. If you read Richard Foote’s blog regularly you will know that these locks are only held for a short time but can be hard to acquire and may lead to a randomly long delay in the rebuild.

For my purposes, though, the critical feature is  the object that holds the logging information. This is created as an index organized table (IOT) with a name like sys_journal_NNNNN where NNNNN is the object_id of the index you are trying to rebuild. (You can check this by enabling sql_trace just before doing a rebuild.)

It’s the IOT that has the problem with the excess key length – its limit is still set to match the 40% limit used for earlier versions of Oracle, and the key to the IOT is the key defined for the index, plus the rowid for the row with that key.

It’s possible that this outdated limit is simply an oversight in the code; it’s possible that it’s a deliberate design decision that has to stay in place to allow for the different strategies for leaf-block splits adopted by standard B-tree indexes and the index structures supporting IOTs. Either way it’s a limitation that might finally catch you out months after you’ve created an unusually lengthy index key.

Just as a quick demonstration that the problem is in the IOT, we need only run a simple piece of DDL (the sample below was cut and pasted from a live session):


SQL> create table iot1(
  2     v1      varchar2(4000),
  3     v2      varchar2(2380),
  4     v3      varchar2(100),
  5     constraint iot1_pk primary key(v1,v2)
  6  )
  7  organization index
  8  overflow
  9  ;
create table iot1(
*
ERROR at line 1:
ORA-01450: maximum key length (3215) exceeded

Of course, when you see little oddities like this you might think: “I wouldn’t create an index like that – so I’m not worried”. But it’s worth spending a little time looking at the symptoms from different directions before dismissing the problem.

Try this experiment (starting with the same t1 heap table above) but see if you can guess what’s going to happen and why … before you read the explanation:


create or replace function f (i_in varchar2)
return varchar2
deterministic
as
begin
 return i_in;
end;
/

create index t1_f1 on t1(f(v3));

alter index t1_f1 rebuild;
alter index t1_f1 rebuild online;

You can rebuild the index but you can’t rebuild it online because you get the same Oracle error ORA-01450.

You didn’t build an index with a large key deliberately, but a function returning a varchar2() implicitly returns a 4,000 byte string – so your index has been built on a virtual column of 4,000 bytes.


select
 column_name, column_length
from
 user_ind_columns
where
 index_name = 'T1_F1'
/

COLUMN_NAME          COLUMN_LENGTH
-------------------- -------------
SYS_NC00004$                  4000

1 row selected.

If you want to build this type of index, and be able to rebuild it online, your index definition will have to be something like:


drop index t1_f1;
create index t1_f1 on t1(substr(f(v3),1,100));
alter index t1_f1 rebuild online;

With this definition, the function-based index can be rebuilt online. (Of course, it may no longer be the index you want, and some of your SQL has to be modified so that the predicates match the new index definition.) 

Footnote 1: The 40% limit on IOT keys is further constrained if you are using a 16KB block size by a “hard” limit of 3,800 bytes – so rebuilding the index with a different blocksize will only work around the problem in a limited range of cases.

Footnote 2: Although I implied at the start of the article that it would be unusual to have very long index keys there are quite a lot of applications that define tables with fairly long (e.g. 254, 1000) varchar2() columns: “just in case”.  Watch out when you start creating indexes on these columns.

June 4, 2009

Health Check

Filed under: Uncategorized — Jonathan Lewis @ 6:35 pm UTC Jun 4,2009

Someone recently posted a question on the OTN Database General forum with the title like “Health Check on Oracle database” but gave no really solid context to indicate the type, purpose, or frequency of activity that they had in mind.

I can think of three or four possible interpretations for what he was saying – and this made me wonder what possible interpretations other people might have.

I won’t link to the specific thread just yet – someone has already asked “What do you mean by an Oracle health check” and if the OP comes up with a response I wouldn’t want to give too much of a lead to other people before they come up with their own interpretations.

So the question is this: in the complete absence of any sensible context, if you overheard a stranger talking about an Orace health check, what sort of activity do you think they might have in mind ?

Next Page »

Blog at WordPress.com.