Oracle Scratchpad

November 25, 2008

SAS Bug

Filed under: CBO,Performance,sorting,trace files — Jonathan Lewis @ 11:29 pm GMT Nov 25,2008

I see that Tom Kyte has found a nasty little bug waiting to trap a few unlucky people as they patch to 10.2.0.4, or upgrade to 11g.

By now, you should probably be using the pga_aggregate_target rather than the sort_area_size to control the memory allocation for sorting – especially from 10g onwards – but you may still have a few special tasks in big batch jobs where you have code that switches the workarea_size_policy to manual and sets a specific (large) sort_area_size. Perhaps something like:


alter session set workarea_size_policy = manual;
alter session set sort_area_size = 104857600;
--execute big job here
alter session set workarea_size_policy = auto;

If this is the case, be careful when you patch to 10.2.0.4 or upgrade to 11.1.0.6, as the code may suddenly slow down quite dramatically.

There seems to be a bug with this particular ‘alter session’ command (and possibly with others – I haven’t yet tested) that “delays” the change. It’s quite easy to see this if you enable event 10032 to get the “sort parameters” dump. Here’s a sample of code I’ve run against a small 11g system (11.1.0.6) to demonstrate the issue.


create table t1
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects 
	where	rownum <= 3000
)
select
	rownum				id,
	lpad(rownum,10,'0')		small_vc,
	lpad(rownum,100,'0')		big_vc
from
	generator	v1,
	generator	v2
where
	rownum <= 10000
;

-- gather statistics here

I’ve created a table with 10,000 rows and a couple of character columns that I’m now going to select and sort. The small_vc column is only 10 characters, so I shouldn’t need more than a couple of hundred KB to select and sort this column; and the big_vc column is only 100 bytes, so 4MB should be plenty to select and sort that column:


alter session set events '10032 trace name context forever';
alter session set workarea_size_policy = manual;
alter session set sort_area_size = 262144;


-- first the small one

select 
	count(*)
from
	(
		select 
			/*+ 
				qb_name(ordered) 
				no_merge
				no_eliminate_oby(@ordered)
			*/
			small_vc
		from
			t1
		order by
			small_vc
	)
;

-- then the big one

alter session set sort_area_size = 4194304;

select 
	count(*)
from
	(
		select 
			/*+ 
				qb_name(ordered) 
				no_merge
				no_eliminate_oby(@ordered)
			*/
			big_vc
		from
			t1
		order by
			big_vc
	)
;

I’ve used the /*+ no_merge */ hint in the SQL, with an inline view that contains the select and “order by” so that I can force Oracle to sort the data but only output a single row. Then I’ve used the /*+ no_eliminate_oby() */ hint to make sure that the optimizer doesn’t eliminate the (apparently) redundant “order by” clause in the in-line view (a trick that it learnt to do in 10gR2 – see the notes towards the end of this article).

The trace files dumped by event 10032 show the following results.

From the first query:


---- Sort Parameters ------------------------------
sort_area_size                    65536
sort_area_retained_size           65536
sort_multiblock_read_count        1
max intermediate merge width      3
---- Sort Statistics ------------------------------
Initial runs                              1
Input records                             10000
Output records                            10000
Disk blocks 1st pass                      20
Total disk blocks used                    22
Total number of comparisons performed     10103
  Comparisons performed by in-memory sort 10099
  Comparisons while searching for key in-memory 4
Temp segments allocated                   1
Extents allocated                         1
Uses version 2 sort
Uses asynchronous IO

Despite my call to set the sort_area_size to 256KB, the operation has used 64KB (the system default), and dumped about 160KB of data to disk.

Then, for the second query:


---- Sort Parameters ------------------------------
sort_area_size                    262144
sort_area_retained_size           262144
sort_multiblock_read_count        2
max intermediate merge width      7
---- Sort Statistics ------------------------------
Initial runs                              1
Input records                             10000
Output records                            10000
Disk blocks 1st pass                      134
Total disk blocks used                    136
Total number of comparisons performed     10103
  Comparisons performed by in-memory sort 10099
  Comparisons while searching for key in-memory 4
Temp segments allocated                   1
Extents allocated                         2
Uses version 2 sort
Uses asynchronous IO

I set the sort_area_size to 4MB, and the operation has used 256KB – which is the value I set on the previous sort. We can also see that we’ve dumped about 1.1MB of data to disk.

So, if you’ve got code that tries to set the sort_area_size before running a big job, it’s possible that that job will run with the wrong value for the parameter. The effects could be dire.

Fortunately the workaround is simple (and carries no side-effects) – just edit the code to do the ‘alter session’ twice in a row. The second call will ensure that the value you want actually gets used.

7 Comments »

  1. Is it possible that it’s not that you need to run the statement twice to work around it…but just wait long enough…i.e. is the first ALTER SESSION statement taking effect after a period of time or after you issue it for the second time?

    …I know, I should test that…but no free time currently.

    Comment by Jeff Moss — November 26, 2008 @ 9:02 am GMT Nov 26,2008 | Reply

  2. This “bug” remember me a similar case with External tables.

    Sometime if you issue an “alter table” because you want to add a file, you must do it twice, or the file will not be added.

    Rather strange…no?

    (Must say, that the set up that shows this particular thing, shows others, like you do a concatenation in a string to build a big dinamic-sql and you find that the end of the query is cut away. You rewrite the concatenation, something like a||b||c become ab||c, and everything works well.)

    Comment by lascoltodelvenerdi — November 26, 2008 @ 1:36 pm GMT Nov 26,2008 | Reply

  3. Setting the sort_area_size twice only seems to work when the sort is done serially. It is still ignored when the sort is done in parallel.

    Also if I “alter session force parallel dml” and set the 10032 trace on an insert as select then the both the query and the index maintenance become serial.

    Comment by Scott — November 26, 2008 @ 5:13 pm GMT Nov 26,2008 | Reply

  4. Jeff,
    It’s always good to think of different ways of extending the test. That was one I tried, and I also tried running a couple of queries before the one I printed in case this made the optimizer “notice” that the parameter had changed.

    lascoltodelvenerdi,
    I find that I discover all sorts of extra things by accident when designing a test for a specific problem. In this case (in 11g), I found that my usual trick of using the /*+ gather_plan_statistics */ hint followed by the extended call to dbms_xplan.display_cursor() randomly failed to find the cursor. (Cause and workaround still to be determined).

    Scott,
    Thanks for the note. I’ll have to look into that when I get a bit of time.

    Comment by Jonathan Lewis — November 27, 2008 @ 8:30 pm GMT Nov 27,2008 | Reply

  5. Jonathan,

    this nasty bug actually already appeared in the 10.2.0.3 patch set and was meant as a “bug fix”, see Metalink note 452466.1 and corresponding bug 6113096. I’ve written a blog note about this issue a couple of months ago:

    http://oracle-randolf.blogspot.com/2008/02/nasty-bug-introduced-with-patch-set.html

    Repeating your test case in the 11.1.0.7 patch set (32bit-Windows) it’s looks like the bug has been fixed although I cannot find it in the “list of bug fixes”.

    Regards,
    Randolf

    Comment by Randolf Geist — November 27, 2008 @ 8:58 pm GMT Nov 27,2008 | Reply

  6. Randolf,

    Thanks for the link and the references to Metalink. I’ve just run my test case against 10.2.0.3 – and it does, indeed, display the same problem, with the same workaround.

    I guess your article and the bug notes also explain the anomaly that Scott reported.

    Comment by Jonathan Lewis — November 27, 2008 @ 9:54 pm GMT Nov 27,2008 | Reply

  7. […] “I see that Tom Kyte has found a nasty little bug waiting to trap a few unlucky people as they patch to 10.2.0.4, or upgrade to 11g.” So writes Jonathan Lewis in his item on the SAS bug. […]

    Pingback by Log Buffer #125: a Carnival of the Vanities for DBAs — November 28, 2008 @ 5:46 pm GMT Nov 28,2008 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by WordPress.com.