<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Quiz Night</title>
	<atom:link href="http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/feed/" rel="self" type="application/rss+xml" />
	<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/</link>
	<description>Just another Oracle weblog</description>
	<lastBuildDate>Tue, 18 Jun 2013 04:32:58 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: Hemant K Chitale</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-38130</link>
		<dc:creator><![CDATA[Hemant K Chitale]]></dc:creator>
		<pubDate>Mon, 06 Dec 2010 03:13:09 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-38130</guid>
		<description><![CDATA[Thank you, Jonathan.  A simple demonstration of the difference. I&#039;ll scale this to larger datasets with different distinct counts and datatypes as well.]]></description>
		<content:encoded><![CDATA[<p>Thank you, Jonathan.  A simple demonstration of the difference. I&#8217;ll scale this to larger datasets with different distinct counts and datatypes as well.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Lewis</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-38026</link>
		<dc:creator><![CDATA[Jonathan Lewis]]></dc:creator>
		<pubDate>Thu, 02 Dec 2010 08:29:22 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-38026</guid>
		<description><![CDATA[Hemant,

It was only a trivial bit of code I whipped up in about fifteen minutes to make the point that it was possible for either aggregation to be the faster depending on things like level of aggregation - it wasn&#039;t anything designed to investigate the phenomenon in any depth:

[sourcecode]

create table t1 nologging
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects 
	where	rownum &lt;= 3000
)
select
	lpad(mod(rownum,1000),6)	small_vc_K,
	lpad(rownum,6)			small_vc_M
from
	generator	v1,
	generator	v2
where
	rownum &lt;= 1000000
;


begin
	dbms_stats.gather_table_stats(
		ownname		 =&gt; user,
		tabname		 =&gt;&#039;T1&#039;,
		estimate_percent =&gt; null,
		block_sample 	 =&gt; true,
		method_opt 	 =&gt; &#039;for all columns size 1&#039;
	);
end;
/


set serveroutput off
set timing on

spool hash_agg

prompt	===========================
prompt	1000 distinct values (hash)
prompt	===========================

select
	/*+ gather_plan_statistics 1000 */
	count(*)
from
	(
	select	/*+ no_merge */
		distinct small_vc_K
	from
		t1
	)
;

select * from table(dbms_xplan.display_cursor(null,null,&#039;allstats last&#039;));

prompt	===========================
prompt	1000 distinct values (sort)
prompt	===========================

select
	/*+ gather_plan_statistics 1000 */
	count(*)
from
	(
	select	/*+ no_merge no_use_hash_aggregation */
		distinct small_vc_K
	from
		t1
	)
;

select * from table(dbms_xplan.display_cursor(null,null,&#039;allstats last&#039;));


prompt	==============================
prompt	1000000 distinct values (hash)
prompt	==============================

select
	/*+ gather_plan_statistics 1000000 */
	count(*)
from
	(
	select	/*+ no_merge */
		distinct small_vc_M
	from
		t1
	)
;

select * from table(dbms_xplan.display_cursor(null,null,&#039;allstats last&#039;));

prompt	==============================
prompt	1000000 distinct values (sort)
prompt	==============================

select
	/*+ gather_plan_statistics 1000000 */
	count(*)
from
	(
	select	/*+ no_merge no_use_hash_aggregation */
		distinct small_vc_M
	from
		t1
	)
;

select * from table(dbms_xplan.display_cursor(null,null,&#039;allstats last&#039;));

[/sourcecode]]]></description>
		<content:encoded><![CDATA[<p>Hemant,</p>
<p>It was only a trivial bit of code I whipped up in about fifteen minutes to make the point that it was possible for either aggregation to be the faster depending on things like level of aggregation &#8211; it wasn&#8217;t anything designed to investigate the phenomenon in any depth:</p>
<pre class="brush: plain; title: ; notranslate">

create table t1 nologging
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects 
	where	rownum &lt;= 3000
)
select
	lpad(mod(rownum,1000),6)	small_vc_K,
	lpad(rownum,6)			small_vc_M
from
	generator	v1,
	generator	v2
where
	rownum &lt;= 1000000
;


begin
	dbms_stats.gather_table_stats(
		ownname		 =&gt; user,
		tabname		 =&gt;'T1',
		estimate_percent =&gt; null,
		block_sample 	 =&gt; true,
		method_opt 	 =&gt; 'for all columns size 1'
	);
end;
/


set serveroutput off
set timing on

spool hash_agg

prompt	===========================
prompt	1000 distinct values (hash)
prompt	===========================

select
	/*+ gather_plan_statistics 1000 */
	count(*)
from
	(
	select	/*+ no_merge */
		distinct small_vc_K
	from
		t1
	)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

prompt	===========================
prompt	1000 distinct values (sort)
prompt	===========================

select
	/*+ gather_plan_statistics 1000 */
	count(*)
from
	(
	select	/*+ no_merge no_use_hash_aggregation */
		distinct small_vc_K
	from
		t1
	)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));


prompt	==============================
prompt	1000000 distinct values (hash)
prompt	==============================

select
	/*+ gather_plan_statistics 1000000 */
	count(*)
from
	(
	select	/*+ no_merge */
		distinct small_vc_M
	from
		t1
	)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

prompt	==============================
prompt	1000000 distinct values (sort)
prompt	==============================

select
	/*+ gather_plan_statistics 1000000 */
	count(*)
from
	(
	select	/*+ no_merge no_use_hash_aggregation */
		distinct small_vc_M
	from
		t1
	)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

</pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hemant K Chitale</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-38000</link>
		<dc:creator><![CDATA[Hemant K Chitale]]></dc:creator>
		<pubDate>Wed, 01 Dec 2010 16:35:03 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-38000</guid>
		<description><![CDATA[Jonathan,
Re your testing with two datasets where Hash Aggregation was faster in one and Sort Aggregation in the other .... can you publish how you created the two datasets ?

A few months ago, I had a similar issue.  A job that was processing data by Monthly Partitions had been tested against year 2010 partitions and performed well in Production as it began processing 2010 partitions.  However, as it went &quot;back in time&quot; to the year 2009, it suddenly took excruciatingly long from November 2009.  The Group By was overflowing to disk -- with tempspace usage more than 3x the actual data volume.  I found that data had been updated back in December 2009 and the average row length had changed.  So, I suspected that the Hash Group By was performing poorly against the older data because of the &quot;nature of the data&quot;.  However, I never did get a chance to prove my suspicion -- viz by disabling groupbyhashaggregation and testing each month individually.
(A different process was implemented so that code was scrapped).

If you can publish the way you generated the two datasets, I can run more tests.

Hemant K Chitale]]></description>
		<content:encoded><![CDATA[<p>Jonathan,<br />
Re your testing with two datasets where Hash Aggregation was faster in one and Sort Aggregation in the other &#8230;. can you publish how you created the two datasets ?</p>
<p>A few months ago, I had a similar issue.  A job that was processing data by Monthly Partitions had been tested against year 2010 partitions and performed well in Production as it began processing 2010 partitions.  However, as it went &#8220;back in time&#8221; to the year 2009, it suddenly took excruciatingly long from November 2009.  The Group By was overflowing to disk &#8212; with tempspace usage more than 3x the actual data volume.  I found that data had been updated back in December 2009 and the average row length had changed.  So, I suspected that the Hash Group By was performing poorly against the older data because of the &#8220;nature of the data&#8221;.  However, I never did get a chance to prove my suspicion &#8212; viz by disabling groupbyhashaggregation and testing each month individually.<br />
(A different process was implemented so that code was scrapped).</p>
<p>If you can publish the way you generated the two datasets, I can run more tests.</p>
<p>Hemant K Chitale</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Lewis</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-37920</link>
		<dc:creator><![CDATA[Jonathan Lewis]]></dc:creator>
		<pubDate>Sun, 28 Nov 2010 12:09:31 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-37920</guid>
		<description><![CDATA[Flado,

That&#039;s a nice way of summarising the difference.
We might also add that the sort aggregation is sensitive to the pre-existing order of the data in a way that the hash aggregation is not.  (Which reminds me of the &lt;a href=&quot;http://jonathanlewis.wordpress.com/2009/12/28/short-sorts/&quot; rel=&quot;nofollow&quot;&gt;&lt;em&gt;&lt;strong&gt;answer to a quiz about sorting&lt;/strong&gt;&lt;/em&gt;&lt;/a&gt; I set some time ago.) ]]></description>
		<content:encoded><![CDATA[<p>Flado,</p>
<p>That&#8217;s a nice way of summarising the difference.<br />
We might also add that the sort aggregation is sensitive to the pre-existing order of the data in a way that the hash aggregation is not.  (Which reminds me of the <a href="http://jonathanlewis.wordpress.com/2009/12/28/short-sorts/" rel="nofollow"><em><strong>answer to a quiz about sorting</strong></em></a> I set some time ago.) </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Flado</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-37912</link>
		<dc:creator><![CDATA[Flado]]></dc:creator>
		<pubDate>Sat, 27 Nov 2010 19:11:19 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-37912</guid>
		<description><![CDATA[Different tools for different purposes... Conceptually, hash aggregation is very sensitive to the number of groups and relatively insensitive to the total volume. Sort aggregation needs to sort the entire set and doesn&#039;t really care about the number of distinct values (groups) contained therein.
If one tool was better in all cases, we wouldn&#039;t have the other.
IMHO.
Flado]]></description>
		<content:encoded><![CDATA[<p>Different tools for different purposes&#8230; Conceptually, hash aggregation is very sensitive to the number of groups and relatively insensitive to the total volume. Sort aggregation needs to sort the entire set and doesn&#8217;t really care about the number of distinct values (groups) contained therein.<br />
If one tool was better in all cases, we wouldn&#8217;t have the other.<br />
IMHO.<br />
Flado</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Lewis</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-37904</link>
		<dc:creator><![CDATA[Jonathan Lewis]]></dc:creator>
		<pubDate>Sat, 27 Nov 2010 18:42:05 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-37904</guid>
		<description><![CDATA[
goldenorbit,

I&#039;ve just created two data sets and run a &#039;select distinct&#039; against them. The first data set was designed to aggregate down from a large volume to a small volume - the hash aggregation was 30% faster than the sort aggregation. The second data set aggregated a large data set without reducing the size very much - the sort aggregation was about 30% faster than the hash aggregation.  

The memory requirement for the hash aggregation was much larger than for the sort aggregation, and the temp space used for the hash aggregation was larger than for the sort aggregation when I made it spill to disc.

I haven&#039;t looked at the hash aggregation closely, but I think it is trying to reduce CPU usage by using more memory than the sort. However, when there is little aggregation taking place the CPU cost of larger memory allocations exceeds the CPU saving of not sorting.  Moreover, on the dump to disc the sort aggregate simply has to store streams of sorted data, while the hash aggregation probably has to store the data and a significant amount of hashing structure (possibly hash keys) - which means that hash aggregations that spill to disc would suffer an I/O disadvantage.

I think someone made a comment on one of my other postings that the variable (_smm_max_size) setting the maximum memory for a single workarea operation was a 32-bit integer: which means the maximum memory you could use for a (serial) hash aggregation would be 4GB irrespective of the available PGA.  (It might even be 2GB if it&#039;s a signed 32-bit).]]></description>
		<content:encoded><![CDATA[<p>goldenorbit,</p>
<p>I&#8217;ve just created two data sets and run a &#8216;select distinct&#8217; against them. The first data set was designed to aggregate down from a large volume to a small volume &#8211; the hash aggregation was 30% faster than the sort aggregation. The second data set aggregated a large data set without reducing the size very much &#8211; the sort aggregation was about 30% faster than the hash aggregation.  </p>
<p>The memory requirement for the hash aggregation was much larger than for the sort aggregation, and the temp space used for the hash aggregation was larger than for the sort aggregation when I made it spill to disc.</p>
<p>I haven&#8217;t looked at the hash aggregation closely, but I think it is trying to reduce CPU usage by using more memory than the sort. However, when there is little aggregation taking place the CPU cost of larger memory allocations exceeds the CPU saving of not sorting.  Moreover, on the dump to disc the sort aggregate simply has to store streams of sorted data, while the hash aggregation probably has to store the data and a significant amount of hashing structure (possibly hash keys) &#8211; which means that hash aggregations that spill to disc would suffer an I/O disadvantage.</p>
<p>I think someone made a comment on one of my other postings that the variable (_smm_max_size) setting the maximum memory for a single workarea operation was a 32-bit integer: which means the maximum memory you could use for a (serial) hash aggregation would be 4GB irrespective of the available PGA.  (It might even be 2GB if it&#8217;s a signed 32-bit).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: goldenorbit</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-37886</link>
		<dc:creator><![CDATA[goldenorbit]]></dc:creator>
		<pubDate>Thu, 25 Nov 2010 06:29:41 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-37886</guid>
		<description><![CDATA[May I ask a question about HASH GROUP BY here, since the explain plan shows this choice - why is HASH GROUP BY not running faster than SORT GROUP BY in either 11g or 10g? Why does HASH GROUP BY still require a heck of a lot of temp space even with a pretty big PGA_TARGET (let&#039;s say 16GB) setting?

Thanks.]]></description>
		<content:encoded><![CDATA[<p>May I ask a question about HASH GROUP BY here, since the explain plan shows this choice &#8211; why is HASH GROUP BY not running faster than SORT GROUP BY in either 11g or 10g? Why does HASH GROUP BY still require a heck of a lot of temp space even with a pretty big PGA_TARGET (let&#8217;s say 16GB) setting?</p>
<p>Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dominic Brooks</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-37778</link>
		<dc:creator><![CDATA[Dominic Brooks]]></dc:creator>
		<pubDate>Sun, 21 Nov 2010 21:03:05 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-37778</guid>
		<description><![CDATA[I believe that the predicates were originally omitted due to a performance bug querying v$sql_plan in historical versions.
If you look at the current AWR statement which should populate the requisite columns in wrh$_sql_plan then you&#039;ll find that they are deliberately set to NULL.

I did open an SR some time ago to enquire about plans to include this information at some point and was told that it wasn&#039;t going to happen - seems like a big hole of missing potentially valuable information to me.]]></description>
		<content:encoded><![CDATA[<p>I believe that the predicates were originally omitted due to a performance bug querying v$sql_plan in historical versions.<br />
If you look at the current AWR statement which should populate the requisite columns in wrh$_sql_plan then you&#8217;ll find that they are deliberately set to NULL.</p>
<p>I did open an SR some time ago to enquire about plans to include this information at some point and was told that it wasn&#8217;t going to happen &#8211; seems like a big hole of missing potentially valuable information to me.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Pavol Babel</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-37753</link>
		<dc:creator><![CDATA[Pavol Babel]]></dc:creator>
		<pubDate>Sun, 21 Nov 2010 12:56:20 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-37753</guid>
		<description><![CDATA[Very interesting QUIZ Jonathan, thank you for that. I&#039;ve run some tests on a 10gR2 database and FILTER always behaves as described. Is the behaviour the same in 11gR2, too? I don&#039;t have any 11g database available to play with.]]></description>
		<content:encoded><![CDATA[<p>Very interesting QUIZ Jonathan, thank you for that. I&#8217;ve run some tests on a 10gR2 database and FILTER always behaves as described. Is the behaviour the same in 11gR2, too? I don&#8217;t have any 11g database available to play with.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Lewis</title>
		<link>http://jonathanlewis.wordpress.com/2010/11/19/quiz-night-9/#comment-37750</link>
		<dc:creator><![CDATA[Jonathan Lewis]]></dc:creator>
		<pubDate>Sun, 21 Nov 2010 11:11:52 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4768#comment-37750</guid>
		<description><![CDATA[Thanks for the suggestions and comments from everyone - I&#039;ve added a few comments to the post.]]></description>
		<content:encoded><![CDATA[<p>Thanks for the suggestions and comments from everyone &#8211; I&#8217;ve added a few comments to the post.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
