Oracle Scratchpad

December 21, 2008

Group By

Filed under: CBO — Jonathan Lewis @ 9:05 pm UTC Dec 21,2008

In an earlier article on Cost Based Query Transformation, I showed an example of “complex view merging” where the optimizer turned an in-line aggregate view into a simple join with late aggregation.

In 11g, the optimizer can do the opposite transformation using a mechanism known as “Group by Placement”.

Here’s a simple SQL statement, with the associated execution plan from 10g:


select
	t1.v1, sum(t2.n2)
from	t1, t2
where
	t2.n1 between 10 and 90
and	t1.n1 = t2.n1
group by
	t1.v1
;

--------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |TempSpc| Cost  |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |  1236 |   232K|       |   176 |
|   1 |  HASH GROUP BY      |      |  1236 |   232K|  1784K|   176 |
|*  2 |   HASH JOIN         |      |  7639 |  1439K|       |    30 |
|*  3 |    TABLE ACCESS FULL| T2   |  1236 |  9888 |       |    14 |
|*  4 |    TABLE ACCESS FULL| T1   |  1236 |   223K|       |    14 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")
   3 - filter("T2"."N1">=10 AND "T2"."N1"=10 AND "T1"."N1"<=90)

As you might expect, we have joined the two tables, and then done a large scale aggregation (using the 10g “hash group by” mechanism). But here’s what happens in 11g (11.1.0.6) for this sample data set and query:


--------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes |TempSpc| Cost  |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  3000 |   591K|       |   142 |
|   1 |  HASH GROUP BY        |          |  3000 |   591K|  1272K|   142 |
|*  2 |   HASH JOIN           |          |  3000 |   591K|       |    44 |
|   3 |    VIEW               | VW_GBC_5 |   200 |  3400 |       |    29 |
|   4 |     HASH GROUP BY     |          |   200 |  2400 |       |    29 |
|*  5 |      TABLE ACCESS FULL| T2       |  1236 | 14832 |       |    14 |
|   6 |    TABLE ACCESS FULL  | T1       |  3000 |   541K|       |    14 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="ITEM_1")
   5 - filter("T2"."N1"=10)

In this case we have aggregated data from t2 before joining to t1 because the optimizer has decided the volume selected from t2 is so large that aggregating before joining would be cheaper than joining before aggregating.

This is a fantastic trick – I’ve often seen examples where it would obviously be “nice” to rewrite a query to do this sort of thing, but the performance benefit is often small, and not worth the effort involved in rewriting and testing the code. Now it can happen automatically.

Of course, as always, the optimizer arithmetic may be inaccurate – so Oracle has supplied hints to allow you force the transformation when it doesn’t happen, and stop the transformation if it does happen. For example:


select
	/*+
		qb_name(main)
		no_place_group_by(@main(t2@main))
	*/
	t1.v1, sum(t2.n2)
from	t1, t2
where
	t2.n1 between 10 and 90
and	t1.n1 = t2.n1
group by
	t1.v1
;

select
	/*+
		qb_name(main)
		place_group_by(@main(t2@main))
	*/
	t1.v1, sum(t2.n2)
from	t1, t2
where
	t2.n1 between 10 and 50
and	t1.n1 = t2.n1
group by
	t1.v1
;

You’ll notice that I’ve used the qb_name() hint in these examples – this makes it easier to control the details in more complex cases.

The place_group_by and no_place_group_by hints are a little more subtle than my example shows – and I look forward to seeing them documented – so I would be a little cautious about using them. You’ll see in my example execution plan that there is a reference to an internal view called vw_gbc_5. The name relates in some way to the order of the optimizer’s processing, and I found that the “full” use of the place_group_by hint included this number as a second parameter when checked the outline information from the execution plan.

4 Comments »

  1. This is great. There’s a stench of a business intelligence enhancement all over it, and I’ve seen quite a few cases where this would be a wonderful enhancement in the joining of a fact to multiple dimension tables. For example, in joining a Sales fact table to a Geography dimension table, how much nicer to aggregate the sales metrics and defer the join to the lengthy text items, (“Birmingham”, “San Francisco”, “Upper Volta”, “New Zealand” etc) than to include said lengthy text in the group-by.

    A common situation, actually the more common situation, in a BI environemt is to have a query in which the projected column is at a higher level in the hierarchy that the join column, however. something like:


    select
           dd.month,
           sum(fs.sales_amt)
    from d_dates,
           f_sales
    where
           dd.month = 'January'
    and     dd.day#  = fs.day#
    group by
           dd.month
    ;

    In manually optimising queries like that it is sometimes the case that it is beneficial to aggregate to the level of the join column, perform the join, then aggregate a second time to the level of the report columns. In the above example, we’d optimize to something like:


    select
           dd.month,
           sum(sum_sales_amt)
    from    d_dates,
           (
           select
                   fs.day#,
                   sum(fs.sales_amt) sum_sales_amt
           from    d_dates,
                   f_sales
           where
                   dd.month = 'January'
           and dd.day#  = fs.day#
           group by
                   fs.day#
           ) f
    where   d.day# = dd.day#
    group by dd.month
    /

    (with the required hints to prevent merging)

    I wonder if the 11g optimizer can handle that? Maybe after Christmas I’ll have the time to test. In theory, given a join from a fact table to multiple dimension tables, there would be quite an interesting choice of whether to perform an intermediate aggregation, and which dimensions to defer aggregation on until after the join. Certain aggregation types (well, count(dsistinct …) anyway) would prevent an intermediate aggregation step i suppose.

    Comment by David Aldridge — December 21, 2008 @ 11:15 pm UTC Dec 21,2008 | Reply

  2. I’m confused – easily done I know…

    You say “…optimizer has decided the volume selected from t2 is so large that aggregating before joining would be cheaper than joining before aggregating.” – but the volume selected is 1236 rows, which is tiny…am I missing something here?

    Comment by Jeff Moss — December 22, 2008 @ 9:58 am UTC Dec 22,2008 | Reply

  3. Jeff,

    No, you’re not missing anything – the estimate is only 1,236 rows at that point, but it’s still large enough (in this case, with my parameter settings) to make the optimizer choose this transformation.

    For “n1 between 10 and 50″ I got a smaller amount of driving data, which gave me the “hash then group” we saw from 10g – my intention with this comment was simply to indicate that the quantity of data had an effect, not that it had to be a very large quantity.

    Comment by Jonathan Lewis — December 22, 2008 @ 11:29 am UTC Dec 22,2008 | Reply

  4. Jeff is not missing something, but I was missing that what I suggested was what the feature already does.

    I need to go lie down.

    Comment by David Aldridge — December 22, 2008 @ 1:56 pm UTC Dec 22,2008 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a comment

Blog at WordPress.com.