I see that Tom Kyte has found a nasty little bug waiting to trap a few unlucky people as they patch to 10.2.0.4, or upgrade to 11g.
By now, you should probably be using the pga_aggregate_target rather than the sort_area_size to control the memory allocation for sorting – especially from 10g onwards – but you may still have a few special tasks in big batch jobs where you have code that switches the workarea_size_policy to manual and sets a specific (large) sort_area_size. Perhaps something like:
alter session set workarea_size_policy = manual;
alter session set sort_area_size = 104857600;
--execute big job here
alter session set workarea_size_policy = auto;
If this is the case, be careful when you patch to 10.2.0.4 or upgrade to 11.1.0.6, as the code may suddenly slow down quite dramatically.
There seems to be a bug with this particular ‘alter session’ command (and possibly with others – I haven’t yet tested) that “delays” the change. It’s quite easy to see this if you enable event 10032 to get the “sort parameters” dump. Here’s a sample of code I’ve run against a small 11g system (11.1.0.6) to demonstrate the issue.
create table t1
as
with generator as (
select --+ materialize
rownum id
from all_objects
where rownum <= 3000
)
select
rownum id,
lpad(rownum,10,'0') small_vc,
lpad(rownum,100,'0') big_vc
from
generator v1,
generator v2
where
rownum <= 10000
;
-- gather statistics here
I’ve created a table with 10,000 rows and a couple of character columns that I’m now going to select and sort. The small_vc column is only 10 characters, so I shouldn’t need more than a couple of hundred KB to select and sort this column; and the big_vc column is only 100 bytes, so 4MB should be plenty to select and sort that column:
alter session set events '10032 trace name context forever';
alter session set workarea_size_policy = manual;
alter session set sort_area_size = 262144;
-- first the small one
select
count(*)
from
(
select
/*+
qb_name(ordered)
no_merge
no_eliminate_oby(@ordered)
*/
small_vc
from
t1
order by
small_vc
)
;
-- then the big one
alter session set sort_area_size = 4194304;
select
count(*)
from
(
select
/*+
qb_name(ordered)
no_merge
no_eliminate_oby(@ordered)
*/
big_vc
from
t1
order by
big_vc
)
;
I’ve used the /*+ no_merge */ hint in the SQL, with an inline view that contains the select and “order by” so that I can force Oracle to sort the data but only output a single row. Then I’ve used the /*+ no_eliminate_oby() */ hint to make sure that the optimizer doesn’t eliminate the (apparently) redundant “order by” clause in the in-line view (a trick that it learnt to do in 10gR2 – see the notes towards the end of this article).
The trace files dumped by event 10032 show the following results.
From the first query:
---- Sort Parameters ------------------------------
sort_area_size 65536
sort_area_retained_size 65536
sort_multiblock_read_count 1
max intermediate merge width 3
---- Sort Statistics ------------------------------
Initial runs 1
Input records 10000
Output records 10000
Disk blocks 1st pass 20
Total disk blocks used 22
Total number of comparisons performed 10103
Comparisons performed by in-memory sort 10099
Comparisons while searching for key in-memory 4
Temp segments allocated 1
Extents allocated 1
Uses version 2 sort
Uses asynchronous IO
Despite my call to set the sort_area_size to 256KB, the operation has used 64KB (the system default), and dumped about 160KB of data to disk.
Then, for the second query:
---- Sort Parameters ------------------------------
sort_area_size 262144
sort_area_retained_size 262144
sort_multiblock_read_count 2
max intermediate merge width 7
---- Sort Statistics ------------------------------
Initial runs 1
Input records 10000
Output records 10000
Disk blocks 1st pass 134
Total disk blocks used 136
Total number of comparisons performed 10103
Comparisons performed by in-memory sort 10099
Comparisons while searching for key in-memory 4
Temp segments allocated 1
Extents allocated 2
Uses version 2 sort
Uses asynchronous IO
I set the sort_area_size to 4MB, and the operation has used 256KB – which is the value I set on the previous sort. We can also see that we’ve dumped about 1.1MB of data to disk.
So, if you’ve got code that tries to set the sort_area_size before running a big job, it’s possible that that job will run with the wrong value for the parameter. The effects could be dire.
Fortunately the workaround is simple (and carries no side-effects) – just edit the code to do the ‘alter session’ twice in a row. The second call will ensure that the value you want actually gets used.
Is it possible that it’s not that you need to run the statement twice to work around it…but just wait long enough…i.e. is the first ALTER SESSION statement taking effect after a period of time or after you issue it for the second time?
…I know, I should test that…but no free time currently.
Comment by Jeff Moss — November 26, 2008 @ 9:02 am GMT Nov 26,2008 |
This “bug” remember me a similar case with External tables.
Sometime if you issue an “alter table” because you want to add a file, you must do it twice, or the file will not be added.
Rather strange…no?
(Must say, that the set up that shows this particular thing, shows others, like you do a concatenation in a string to build a big dinamic-sql and you find that the end of the query is cut away. You rewrite the concatenation, something like a||b||c become ab||c, and everything works well.)
Comment by lascoltodelvenerdi — November 26, 2008 @ 1:36 pm GMT Nov 26,2008 |
Setting the sort_area_size twice only seems to work when the sort is done serially. It is still ignored when the sort is done in parallel.
Also if I “alter session force parallel dml” and set the 10032 trace on an insert as select then the both the query and the index maintenance become serial.
Comment by Scott — November 26, 2008 @ 5:13 pm GMT Nov 26,2008 |
Jeff,
It’s always good to think of different ways of extending the test. That was one I tried, and I also tried running a couple of queries before the one I printed in case this made the optimizer “notice” that the parameter had changed.
lascoltodelvenerdi,
I find that I discover all sorts of extra things by accident when designing a test for a specific problem. In this case (in 11g), I found that my usual trick of using the /*+ gather_plan_statistics */ hint followed by the extended call to dbms_xplan.display_cursor() randomly failed to find the cursor. (Cause and workaround still to be determined).
Scott,
Thanks for the note. I’ll have to look into that when I get a bit of time.
Comment by Jonathan Lewis — November 27, 2008 @ 8:30 pm GMT Nov 27,2008 |
Jonathan,
this nasty bug actually already appeared in the 10.2.0.3 patch set and was meant as a “bug fix”, see Metalink note 452466.1 and corresponding bug 6113096. I’ve written a blog note about this issue a couple of months ago:
http://oracle-randolf.blogspot.com/2008/02/nasty-bug-introduced-with-patch-set.html
Repeating your test case in the 11.1.0.7 patch set (32bit-Windows) it’s looks like the bug has been fixed although I cannot find it in the “list of bug fixes”.
Regards,
Randolf
Comment by Randolf Geist — November 27, 2008 @ 8:58 pm GMT Nov 27,2008 |
Randolf,
Thanks for the link and the references to Metalink. I’ve just run my test case against 10.2.0.3 – and it does, indeed, display the same problem, with the same workaround.
I guess your article and the bug notes also explain the anomaly that Scott reported.
Comment by Jonathan Lewis — November 27, 2008 @ 9:54 pm GMT Nov 27,2008 |
[…] “I see that Tom Kyte has found a nasty little bug waiting to trap a few unlucky people as they patch to 10.2.0.4, or upgrade to 11g.” So writes Jonathan Lewis in his item on the SAS bug. […]
Pingback by Log Buffer #125: a Carnival of the Vanities for DBAs — November 28, 2008 @ 5:46 pm GMT Nov 28,2008 |