Oracle Scratchpad

November 25, 2008

SAS Bug

Filed under: CBO,Performance,sorting,trace files — Jonathan Lewis @ 11:29 pm GMT Nov 25,2008

I see that Tom Kyte has found a nasty little bug waiting to trap a few unlucky people as they patch to 10.2.0.4, or upgrade to 11g.

By now, you should probably be using the pga_aggregate_target rather than the sort_area_size to control the memory allocation for sorting – especially from 10g onwards – but you may still have a few special tasks in big batch jobs where you have code that switches the workarea_size_policy to manual and sets a specific (large) sort_area_size. Perhaps something like:


alter session set workarea_size_policy = manual;
alter session set sort_area_size = 104857600;
--execute big job here
alter session set workarea_size_policy = auto;

If this is the case, be careful when you patch to 10.2.0.4 or upgrade to 11.1.0.6, as the code may suddenly slow down quite dramatically.

There seems to be a bug with this particular ‘alter session’ command (and possibly with others – I haven’t yet tested) that “delays” the change. It’s quite easy to see this if you enable event 10032 to get the “sort parameters” dump. Here’s a sample of code I’ve run against a small 11g system (11.1.0.6) to demonstrate the issue.


create table t1
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects 
	where	rownum <= 3000
)
select
	rownum				id,
	lpad(rownum,10,'0')		small_vc,
	lpad(rownum,100,'0')		big_vc
from
	generator	v1,
	generator	v2
where
	rownum <= 10000
;

-- gather statistics here

I’ve created a table with 10,000 rows and a couple of character columns that I’m now going to select and sort. The small_vc column is only 10 characters, so I shouldn’t need more than a couple of hundred KB to select and sort this column; and the big_vc column is only 100 bytes, so 4MB should be plenty to select and sort that column:


alter session set events '10032 trace name context forever';
alter session set workarea_size_policy = manual;
alter session set sort_area_size = 262144;


-- first the small one

select 
	count(*)
from
	(
		select 
			/*+ 
				qb_name(ordered) 
				no_merge
				no_eliminate_oby(@ordered)
			*/
			small_vc
		from
			t1
		order by
			small_vc
	)
;

-- then the big one

alter session set sort_area_size = 4194304;

select 
	count(*)
from
	(
		select 
			/*+ 
				qb_name(ordered) 
				no_merge
				no_eliminate_oby(@ordered)
			*/
			big_vc
		from
			t1
		order by
			big_vc
	)
;

I’ve used the /*+ no_merge */ hint in the SQL, with an inline view that contains the select and “order by” so that I can force Oracle to sort the data but only output a single row. Then I’ve used the /*+ no_eliminate_oby() */ hint to make sure that the optimizer doesn’t eliminate the (apparently) redundant “order by” clause in the in-line view (a trick that it learnt to do in 10gR2 – see the notes towards the end of this article).

The trace files dumped by event 10032 show the following results.

From the first query:


---- Sort Parameters ------------------------------
sort_area_size                    65536
sort_area_retained_size           65536
sort_multiblock_read_count        1
max intermediate merge width      3
---- Sort Statistics ------------------------------
Initial runs                              1
Input records                             10000
Output records                            10000
Disk blocks 1st pass                      20
Total disk blocks used                    22
Total number of comparisons performed     10103
  Comparisons performed by in-memory sort 10099
  Comparisons while searching for key in-memory 4
Temp segments allocated                   1
Extents allocated                         1
Uses version 2 sort
Uses asynchronous IO

Despite my call to set the sort_area_size to 256KB, the operation has used 64KB (the system default), and dumped about 160KB of data to disk.

Then, for the second query:


---- Sort Parameters ------------------------------
sort_area_size                    262144
sort_area_retained_size           262144
sort_multiblock_read_count        2
max intermediate merge width      7
---- Sort Statistics ------------------------------
Initial runs                              1
Input records                             10000
Output records                            10000
Disk blocks 1st pass                      134
Total disk blocks used                    136
Total number of comparisons performed     10103
  Comparisons performed by in-memory sort 10099
  Comparisons while searching for key in-memory 4
Temp segments allocated                   1
Extents allocated                         2
Uses version 2 sort
Uses asynchronous IO

I set the sort_area_size to 4MB, and the operation has used 256KB – which is the value I set on the previous sort. We can also see that we’ve dumped about 1.1MB of data to disk.

So, if you’ve got code that tries to set the sort_area_size before running a big job, it’s possible that that job will run with the wrong value for the parameter. The effects could be dire.

Fortunately the workaround is simple (and carries no side-effects) – just edit the code to do the ‘alter session’ twice in a row. The second call will ensure that the value you want actually gets used.

7 Comments »

  1. Is it possible that it’s not that you need to run the statement twice to work around it…but just wait long enough…i.e. is the first ALTER SESSION statement taking effect after a period of time or after you issue it for the second time?

    …I know, I should test that…but no free time currently.

    Comment by Jeff Moss — November 26, 2008 @ 9:02 am GMT Nov 26,2008 | Reply

  2. This “bug” remember me a similar case with External tables.

    Sometime if you issue an “alter table” because you want to add a file, you must do it twice, or the file will not be added.

    Rather strange…no?

    (Must say, that the set up that shows this particular thing, shows others, like you do a concatenation in a string to build a big dinamic-sql and you find that the end of the query is cut away. You rewrite the concatenation, something like a||b||c become ab||c, and everything works well.)

    Comment by lascoltodelvenerdi — November 26, 2008 @ 1:36 pm GMT Nov 26,2008 | Reply

  3. Setting the sort_area_size twice only seems to work when the sort is done serially. It is still ignored when the sort is done in parallel.

    Also if I “alter session force parallel dml” and set the 10032 trace on an insert as select then the both the query and the index maintenance become serial.

    Comment by Scott — November 26, 2008 @ 5:13 pm GMT Nov 26,2008 | Reply

  4. Jeff,
    It’s always good to think of different ways of extending the test. That was one I tried, and I also tried running a couple of queries before the one I printed in case this made the optimizer “notice” that the parameter had changed.

    lascoltodelvenerdi,
    I find that I discover all sorts of extra things by accident when designing a test for a specific problem. In this case (in 11g), I found that my usual trick of using the /*+ gather_plan_statistics */ hint followed by the extended call to dbms_xplan.display_cursor() randomly failed to find the cursor. (Cause and workaround still to be determined).

    Scott,
    Thanks for the note. I’ll have to look into that when I get a bit of time.

    Comment by Jonathan Lewis — November 27, 2008 @ 8:30 pm GMT Nov 27,2008 | Reply

  5. Jonathan,

    this nasty bug actually already appeared in the 10.2.0.3 patch set and was meant as a “bug fix”, see Metalink note 452466.1 and corresponding bug 6113096. I’ve written a blog note about this issue a couple of months ago:

    http://oracle-randolf.blogspot.com/2008/02/nasty-bug-introduced-with-patch-set.html

    Repeating your test case in the 11.1.0.7 patch set (32bit-Windows) it’s looks like the bug has been fixed although I cannot find it in the “list of bug fixes”.

    Regards,
    Randolf

    Comment by Randolf Geist — November 27, 2008 @ 8:58 pm GMT Nov 27,2008 | Reply

  6. Randolf,

    Thanks for the link and the references to Metalink. I’ve just run my test case against 10.2.0.3 – and it does, indeed, display the same problem, with the same workaround.

    I guess your article and the bug notes also explain the anomaly that Scott reported.

    Comment by Jonathan Lewis — November 27, 2008 @ 9:54 pm GMT Nov 27,2008 | Reply

  7. […] “I see that Tom Kyte has found a nasty little bug waiting to trap a few unlucky people as they patch to 10.2.0.4, or upgrade to 11g.” So writes Jonathan Lewis in his item on the SAS bug. […]

    Pingback by Log Buffer #125: a Carnival of the Vanities for DBAs — November 28, 2008 @ 5:46 pm GMT Nov 28,2008 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by WordPress.com.

%d bloggers like this: