Test Data

December 8, 2011

Test Data

Filed under: Oracle,Performance,Subquery Factoring,Tuning — Jonathan Lewis @ 6:31 pm GMT Dec 8,2011

The UKOUG conference is over for another year – but it has left me with plenty to do and lots of things to investigate. Here’s just one little point that I picked up during one of the 10 minute “Oak Talks” that members of the Oak Table Network were doing in the lunch breaks.

There is a fairly well-known strategy for generating a list of numbers by using a “select from dual … connect by …” query, but I hadn’t realised that there were two ways of using it. The code I’ve usually used is this:

select
        rownum id
from
        dual
connect by
        rownum <= 4000
;

But it looks as if most people use it like this:

select
        rownum id
from
        dual
connect by
        level <= 4000
;

The latter is clearly more “appropriate” than the former in that the use of level pseudo-column appears to be more consistent with the use of the connect by clause, so I suspect I just happened to do it the “wrong” way the first time I used it, and never noticed that it looked a little odd because it was returning the expected result.

The “Oak Talk” was the one I gave, by the way, and one of the comments I made was that this code was fine if you wanted a tiny amount of data, but if you wanted a larger amount of data you ought to do something like this:

with generator as (
	select
		rownum id
	from dual
	connect by
		rownum <= 4000
)
select
	/*+ gather_plan_statistics */
	count(*)
from	(
	select
		/*+ no_merge */
		rownum id
	from
		generator,
		generator
	where
		rownum <= 10000000
	)
;

The key feature is that I’ve used the original query twice, joining the result set to itself, rather than simply increasing the target value. In this example I’ve put the select statement that generates 10,000,000 rows of data into an inline non-mergeable view and counted the result set. This allowed me to run the query without having to generate a massive output before showing you the execution stats. Here’s the plan with stats:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   1 |  TEMP TABLE TRANSFORMATION      |                            |      1 |        |      1 |00:04:57.76 |      31 |      7 |      7 |       |       |          |
|   2 |   LOAD AS SELECT                |                            |      1 |        |      1 |00:00:00.03 |      10 |      0 |      7 |   261K|   261K|  261K (0)|
|   3 |    COUNT                        |                            |      1 |        |   4000 |00:00:00.02 |       0 |      0 |      0 |       |       |          |
|   4 |     CONNECT BY WITHOUT FILTERING|                            |      1 |        |   4000 |00:00:00.01 |       0 |      0 |      0 |       |       |          |
|   5 |      FAST DUAL                  |                            |      1 |      1 |      1 |00:00:00.01 |       0 |      0 |      0 |       |       |          |
|   6 |   SORT AGGREGATE                |                            |      1 |      1 |      1 |00:04:57.72 |      18 |      7 |      0 |       |       |          |
|   7 |    VIEW                         |                            |      1 |      1 |     10M|00:01:30.04 |      18 |      7 |      0 |       |       |          |
|*  8 |     COUNT STOPKEY               |                            |      1 |        |     10M|00:01:00.04 |      18 |      7 |      0 |       |       |          |
|   9 |      MERGE JOIN CARTESIAN       |                            |      1 |      1 |     10M|00:00:40.04 |      18 |      7 |      0 |       |       |          |
|  10 |       VIEW                      |                            |      1 |      1 |   2500 |00:00:00.04 |       9 |      7 |      0 |       |       |          |
|  11 |        TABLE ACCESS FULL        | SYS_TEMP_0FD9D6602_4E35F82 |      1 |      1 |   2500 |00:00:00.04 |       9 |      7 |      0 |       |       |          |
|  12 |       BUFFER SORT               |                            |   2500 |      1 |     10M|00:00:10.01 |       9 |      0 |      0 | 64512 | 64512 |57344  (0)|
|  13 |        VIEW                     |                            |      1 |      1 |   4000 |00:00:00.02 |       9 |      0 |      0 |       |       |          |
|  14 |         TABLE ACCESS FULL       | SYS_TEMP_0FD9D6602_4E35F82 |      1 |      1 |   4000 |00:00:00.01 |       9 |      0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - filter(ROWNUM<=10000000)

As you can see, there was a little I/O (direct path writes to materialize the factored subquery, then db file scattered reads to read it back), and a little bit of memory used as the query ran. The run time of the query was about 4 seconds, compared to using the simple form of the query:

PLAN_TABLE_OUTPUT
-----------------
SQL_ID  0gftkt677gyq1, child number 0
-------------------------------------
select  /*+ gather_plan_statistics */  count(*) from (  select   /*+ no_merge
*/   rownum id  from dual  connect by   rownum <= 10000000 )

Plan hash value: 2085675642
----------------------------------------------------------------------------------------
| Id  | Operation                       | Name | Starts | E-Rows | A-Rows |   A-Time   |
----------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE                 |      |      1 |      1 |      1 |00:04:56.47 |
|   2 |   VIEW                          |      |      1 |      1 |     10M|00:06:16.47 |
|   3 |    COUNT                        |      |      1 |        |     10M|00:05:46.47 |
|   4 |     CONNECT BY WITHOUT FILTERING|      |      1 |        |     10M|00:05:16.47 |
|   5 |      FAST DUAL                  |      |      1 |      1 |      1 |00:00:00.01 |
----------------------------------------------------------------------------------------

This query took about 8 seconds, so there’s not really a lot of difference in terms of speed between the two – especially when you consider that you’re probably going to do something rather more complex than select a simple rownum to create some data. But there’s an important difference which shows up when you look at the session stats, in particular the memory usage. The following figures show the changes in UGA and PGA memory from running the two queries.

Session delta (complex query)
Name                                                                     Value
----                                                                     -----
session uga memory                                                     647,164
session uga memory max                                                 647,164
session pga memory                                                   3,604,480
session pga memory max                                               3,604,480

Session delta (simple query)
Name                                                                     Value
----                                                                     -----
session uga memory                                                      65,464
session uga memory max                                             299,754,000
session pga memory max                                             300,023,808

The simple query is only 4 seconds (100%) slower – but the amount of PGA memory it requires is huge (80 times more) compared to the complex query. What’s more, if you look back at the execution plan with gather_plan_statistics, you’ll see that it’s memory that doesn’t show in the “workarea operations” part of the plan. You’ve probably seen several articles on my blog using this factored subquery approach to generate data – now you know why it’s the default code I have in my testing template – if you want to generate 100M rows, it makes an enormous difference.

[Further reading on “subquery factoring”]

Comments (15)

15 Comments »

If your goal is to generate a large amount of data, it will be better to use pipelined function like this

create function generator (n pls_integer) return sys.odcinumberlist pipelined is
begin
  for i in 1 .. n loop
    pipe row (i);
  end loop;  
end;
/

select count(*) from table(generator(1e7));

It took 2sec vs 5sec for complex query on my laptop.

Comment by Valentin Nikotin — December 8, 2011 @ 6:48 pm GMT Dec 8,2011 | Reply

Valentin,

Very nice – and it’s only 1.18 seconds on my laptop (although that goes up to 2.23 if I gather plan statistics). More significantly it’s another option that doesn’t use a lot of memory.

Comment by Jonathan Lewis — December 8, 2011 @ 7:14 pm GMT Dec 8,2011 | Reply

And this is the fastest way that I have found to count to 1e7

create or replace function generator (n pls_integer) return sys.odcinumberlist pipelined is
  m pls_integer := trunc(n / 10);
  r pls_integer := n - 10 * m;
begin
  for i in 1 .. m loop
    pipe row (null);
    pipe row (null);
    pipe row (null);
    pipe row (null);
    pipe row (null);
    pipe row (null);
    pipe row (null);
    pipe row (null);
    pipe row (null);
    pipe row (null);
  end loop;  
  for i in 1 .. r loop
    pipe row (null);
  end loop;
end;
/
alter function generator compile plsql_code_type = native;

It's 0.67sec, but if I add rownum into count - time increases to 1.6sec:

SQL> set timing on
SQL> select count(*) from table(generator(1e7));

  COUNT(*)
----------
  10000000

Elapsed: 00:00:00.67
SQL> select count(rownum) from table(generator(1e7));

COUNT(ROWNUM)
-------------
     10000000

Elapsed: 00:00:01.60

Comment by Valentin Nikotin — December 8, 2011 @ 8:10 pm GMT Dec 8,2011 | Reply

I’m a “level” guy myself, I think it looks more natural for some reason.

I did wonder during the talk if there was going to be any real difference between using the subquery generator and a cartesian product or just the simple select with a larger value for rownum (level). You said the more complex form was quicker but not really how much {you did only have 10 minutes}. I was going to test it myself, so you have saved me the task. I think I will change how I generate my test data.

BTW I think you need to tidy up the info on PGA and UGA, as it stands it shows a massive increase in usage for both and you only say PGA in the text.

Comment by mwidlake — December 8, 2011 @ 6:54 pm GMT Dec 8,2011 | Reply
- Martin,
  
  I should probably have pointed out that the increase in the PGA memory was due to the increase in the UGA – which is located inside the PGA.
  
  Comment by Jonathan Lewis — December 8, 2011 @ 7:16 pm GMT Dec 8,2011 | Reply
Hi Jonathan,

have you checked this?

http://blog.tanelpoder.com/2008/06/08/generating-lots-of-rows-using-connect-by-safely/

Greetings.

Comment by Richard — December 8, 2011 @ 7:39 pm GMT Dec 8,2011 | Reply
Richard,

Thanks for the link – the follow-on to Adrian Billington’s article on the topic is particularly useful: http://www.oracle-developer.net/display.php?id=408.

I am curious about the comment on Tanel’s article that the “with subquery” approach being faster than the inlne view – I only used “with” because I used to use “select rownum from all_objects” to generate the starting rows and either “with” or “no_merge” was a necessity, so I never checked for speed when I changed to the “connect by” approach.

Comment by Jonathan Lewis — December 9, 2011 @ 5:56 am GMT Dec 9,2011 | Reply
according to Tom Kyte (if I understood him correctly), it is a bug that
```
select
        rownum id
from
        dual
connect by
        rownum <= 4000
```
returns the same as
```
select
        level id
from
        dual
connect by
        level <= 4000
```
see

Index Operations

Comment by Sokrates — December 9, 2011 @ 10:51 am GMT Dec 9,2011 | Reply
- Sokrates,
  
  Thanks for that link. It’s worthing knowing. I’ll have to try to remember to fix up my examples from now on to use the right method; fortunately it’s not something I’ve ever put into production code (I think).
  
  Comment by Jonathan Lewis — December 11, 2011 @ 2:20 pm GMT Dec 11,2011 | Reply
What does the A-Time=00:04:57.76 or 00:04:56.47 mean? You mentioned 4 seconds and 8 seconds for both your queries respectively.

Comment by Yuri — December 9, 2011 @ 11:28 am GMT Dec 9,2011 | Reply
- Yuri,
  
  I should have edited that column out, I was running the test on a multi-CPU 32-bit virtual machine running Windows XP Pro, guesting under Windows 7 on a 64-bit multi-core machine, and the combinatin seems to do some strange things with timing occasionally.
  
  In theory the plan is claiming that the query took about 5 minutes to run, when in practice the real elapsed time was much smaller.
  
  Comment by Jonathan Lewis — December 11, 2011 @ 1:17 pm GMT Dec 11,2011 | Reply
Hi Jonathan,
I’ve often used this technique to generate rows for a calendar and such. However one thing that always bothered me slightly is when you join to the results – dynamic sampling aside, can the optimizer estimate the number of rows correctly?

Create Table xxTest as (Select Level x From Dual Connect By Level user,tabname=>’XXTEST’);

Select /*+gather_plan_statistics*/*
From (Select level x from Dual Connect By Level < 100000) v,
xxTest x
Where v.x=x.x;

SQL_ID gczu8jq05naph, child number 0
————————————-
Select /*+gather_plan_statistics*/* From (Select level x from Dual
Connect By Level < 100000) v, xxTest x Where v.x=x.x

Plan hash value: 2236841197

——————————————————————————————————————————
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
——————————————————————————————————————————
| 0 | SELECT STATEMENT | | 1 | | 99 |00:00:00.09 | 3 | | | |
|* 1 | HASH JOIN | | 1 | 1 | 99 |00:00:00.09 | 3 | 5918K| 3056K| 5573K (0)|
| 2 | VIEW | | 1 | 1 | 99999 |00:00:00.95 | 0 | | | |
| 3 | CONNECT BY WITHOUT FILTERING| | 1 | | 99999 |00:00:00.31 | 0 | 2048 | 2048 | 2048 (0)|
| 4 | FAST DUAL | | 1 | 1 | 1 |00:00:00.01 | 0 | | | |
| 5 | TABLE ACCESS FULL | XXTEST | 1 | 99 | 99 |00:00:00.01 | 3 | | | |
——————————————————————————————————————————

Predicate Information (identified by operation id):
—————————————————

1 – access("V"."X"="X"."X")

Now I would have expected that had the optimizer know it was going to get ~ 100k rows from the subquery, it would have actually read xxTest FIRST, that being the “smaller table”?

If I hint it up, I get what I would have expected lower down in the plan, but then the final cardinality is way out:

Select /*+gather_plan_statistics*/*
From (Select /*+cardinality(Dual 100000)*/ level x from Dual Connect By Level < 100000) v,
xxTest x
Where v.x=x.x;

SQL_ID axmzg5x57c3yq, child number 0
————————————-
Select /*+gather_plan_statistics*/* From (Select /*+cardinality(Dual
99999)*/ level x from Dual Connect By Level < 100000) v, xxTest
x Where v.x=x.x

Plan hash value: 4293312143

——————————————————————————————————————————
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
——————————————————————————————————————————
| 0 | SELECT STATEMENT | | 1 | | 99 |00:00:00.01 | 3 | | | |
|* 1 | HASH JOIN | | 1 | 99999 | 99 |00:00:00.01 | 3 | 2440K| 2440K| 1440K (0)|
| 2 | TABLE ACCESS FULL | XXTEST | 1 | 99 | 99 |00:00:00.01 | 3 | | | |
| 3 | VIEW | | 1 | 99999 | 99999 |00:00:01.13 | 0 | | | |
| 4 | CONNECT BY WITHOUT FILTERING| | 1 | | 99999 |00:00:00.36 | 0 | 2048 | 2048 | 2048 (0)|
| 5 | FAST DUAL | | 1 | 99999 | 1 |00:00:00.01 | 0 | | | |
——————————————————————————————————————————

Predicate Information (identified by operation id):
—————————————————

1 – access("V"."X"="X"."X")

If this technique is used as a row generator to join to other tables, and we know (roughly) that it will always give us X number of rows (for example if we are generating a year calendar), would it be advisable to hint up the row generator?

Comment by John — June 5, 2015 @ 11:00 am BST Jun 5,2015 | Reply
- Sorry about the formatting above – I did use code tags but then the page crashed and I had to copy/paste it out, and forgot to add back in!
  
  Comment by John — June 5, 2015 @ 11:02 am BST Jun 5,2015 | Reply
- John,
  
  There are various undocumented hints like column_stats() that might help – but I wouldn’t want to go into those in a production system, and even the cardinality hint is a little undesirable from a support perspective; however I think that’s the only thing you can do in cases like these. It’s worth remembering that you can give a cardinalty to a query block or to an N-table join, and that might help in some cases. It’s hard to get it right in complex cases, though.
  
  E.g.
```
select
        /*+
                qb_name(main)
--              cardinality(@main 200)
                cardinality(@main xxtest v1 200)
                cardinality(@inline 100000)
        */
        v1.padding,
        xxtest.padding
from
        xxtest,
        (
                select
                        /*+
                                qb_name(inline)
--                              cardinality (100000)
                        */
                        rownum  id,
                        rpad('y',100)   padding
                from dual
                connect by
                        level < 100000
        )       v1
where
        v1.id = xxtest.id
;


--------------------------------------------------------------------------
| Id  | Operation                       | Name   | Rows  | Bytes | Cost  |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |        |   200 | 23800 |     4 |
|*  1 |  HASH JOIN                      |        |   200 | 23800 |     4 |
|   2 |   TABLE ACCESS FULL             | XXTEST |    99 |  5346 |     1 |
|   3 |   VIEW                          |        |   100K|  6347K|     1 |
|   4 |    COUNT                        |        |       |       |       |
|*  5 |     CONNECT BY WITHOUT FILTERING|        |       |       |       |
|   6 |      FAST DUAL                  |        |     1 |       |     1 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("V1"."ID"="XXTEST"."ID")
   5 - filter(LEVEL<100000)
```
  Comment by Jonathan Lewis — June 19, 2015 @ 6:07 pm BST Jun 19,2015 | Reply
[…] Thanks to Jonathan Lewis for elaborating on how to create sample data safely. […]

Pingback by The performance story that didn’t turn out as one: a case of or-expansion and how the database mitigated the problem on its own | Martins Blog — September 24, 2021 @ 4:22 pm BST Sep 24,2021 | Reply