Oracle Scratchpad

December 8, 2011

Test Data

Filed under: Oracle,Performance,Subquery Factoring,Tuning — Jonathan Lewis @ 6:31 pm BST Dec 8,2011

The UKOUG conference is over for another year – but it has left me with plenty to do and lots of things to investigate. Here’s just one little point that I picked up during one of the 10 minute “Oak Talks” that members of the Oak Table Network were doing in the lunch breaks.

There is a fairly well-known strategy for generating a list of numbers by using a “select from dual … connect by …” query, but I hadn’t realised that there were two ways of using it. The code I’ve usually used is this:

select
        rownum id
from
        dual
connect by
        rownum <= 4000
;


But it looks as if most people use it like this:

select
        rownum id
from
        dual
connect by
        level <= 4000
;

The latter is clearly more “appropriate” than the former, in that the use of the level pseudo-column appears to be more consistent with the connect by clause, so I suspect I just happened to do it the “wrong” way the first time I used it, and never noticed that it looked a little odd because it returned the expected result.

The “Oak Talk” was the one I gave, by the way, and one of the comments I made was that this code was fine if you wanted a tiny amount of data, but if you wanted a larger amount of data you ought to do something like this:

with generator as (
	select
		rownum id
	from dual
	connect by
		rownum <= 4000
)
select
	/*+ gather_plan_statistics */
	count(*)
from	(
	select
		/*+ no_merge */
		rownum id
	from
		generator,
		generator
	where
		rownum <= 10000000
	)
;

The key feature is that I’ve used the original query twice, joining the result set to itself, rather than simply increasing the target value in the connect by clause. In this example I’ve put the select statement that generates 10,000,000 rows of data into an inline non-mergeable view and counted the result set; this allowed me to run the query and capture the execution stats without first generating a massive volume of output. Here’s the plan with stats:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   1 |  TEMP TABLE TRANSFORMATION      |                            |      1 |        |      1 |00:04:57.76 |      31 |      7 |      7 |       |       |          |
|   2 |   LOAD AS SELECT                |                            |      1 |        |      1 |00:00:00.03 |      10 |      0 |      7 |   261K|   261K|  261K (0)|
|   3 |    COUNT                        |                            |      1 |        |   4000 |00:00:00.02 |       0 |      0 |      0 |       |       |          |
|   4 |     CONNECT BY WITHOUT FILTERING|                            |      1 |        |   4000 |00:00:00.01 |       0 |      0 |      0 |       |       |          |
|   5 |      FAST DUAL                  |                            |      1 |      1 |      1 |00:00:00.01 |       0 |      0 |      0 |       |       |          |
|   6 |   SORT AGGREGATE                |                            |      1 |      1 |      1 |00:04:57.72 |      18 |      7 |      0 |       |       |          |
|   7 |    VIEW                         |                            |      1 |      1 |     10M|00:01:30.04 |      18 |      7 |      0 |       |       |          |
|*  8 |     COUNT STOPKEY               |                            |      1 |        |     10M|00:01:00.04 |      18 |      7 |      0 |       |       |          |
|   9 |      MERGE JOIN CARTESIAN       |                            |      1 |      1 |     10M|00:00:40.04 |      18 |      7 |      0 |       |       |          |
|  10 |       VIEW                      |                            |      1 |      1 |   2500 |00:00:00.04 |       9 |      7 |      0 |       |       |          |
|  11 |        TABLE ACCESS FULL        | SYS_TEMP_0FD9D6602_4E35F82 |      1 |      1 |   2500 |00:00:00.04 |       9 |      7 |      0 |       |       |          |
|  12 |       BUFFER SORT               |                            |   2500 |      1 |     10M|00:00:10.01 |       9 |      0 |      0 | 64512 | 64512 |57344  (0)|
|  13 |        VIEW                     |                            |      1 |      1 |   4000 |00:00:00.02 |       9 |      0 |      0 |       |       |          |
|  14 |         TABLE ACCESS FULL       | SYS_TEMP_0FD9D6602_4E35F82 |      1 |      1 |   4000 |00:00:00.01 |       9 |      0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - filter(ROWNUM<=10000000)

As you can see, there was a little I/O (direct path writes to materialize the factored subquery, then db file scattered reads to read it back) and a little memory used as the query ran. Note, too, that the COUNT STOPKEY at operation 8 meant the cartesian merge join needed only 2,500 rows from the first copy of the generator (each joined to all 4,000 rows of the second copy) to produce the 10,000,000 rows. The run time of the query was about 4 seconds; compare this with the simple form of the query:

PLAN_TABLE_OUTPUT
-----------------
SQL_ID  0gftkt677gyq1, child number 0
-------------------------------------
select  /*+ gather_plan_statistics */  count(*) from (  select   /*+ no_merge
*/   rownum id  from dual  connect by   rownum <= 10000000 )

Plan hash value: 2085675642
----------------------------------------------------------------------------------------
| Id  | Operation                       | Name | Starts | E-Rows | A-Rows |   A-Time   |
----------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE                 |      |      1 |      1 |      1 |00:04:56.47 |
|   2 |   VIEW                          |      |      1 |      1 |     10M|00:06:16.47 |
|   3 |    COUNT                        |      |      1 |        |     10M|00:05:46.47 |
|   4 |     CONNECT BY WITHOUT FILTERING|      |      1 |        |     10M|00:05:16.47 |
|   5 |      FAST DUAL                  |      |      1 |      1 |      1 |00:00:00.01 |
----------------------------------------------------------------------------------------

This query took about 8 seconds, so there’s not really a lot of difference in terms of speed between the two – especially when you consider that you’re probably going to do something rather more complex than select a simple rownum to create some data. But there’s an important difference which shows up when you look at the session stats, in particular the memory usage. The following figures show the changes in UGA and PGA memory from running the two queries.

Session delta (complex query)
Name                                                                     Value
----                                                                     -----
session uga memory                                                     647,164
session uga memory max                                                 647,164
session pga memory                                                   3,604,480
session pga memory max                                               3,604,480

Session delta (simple query)
Name                                                                     Value
----                                                                     -----
session uga memory                                                      65,464
session uga memory max                                             299,754,000
session pga memory max                                             300,023,808
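
For anyone who wants to reproduce these figures: a minimal sketch of how the deltas can be captured is to run a query like the following against v$mystat immediately before and after each test, then report the differences (the statistic names come from v$statname; the before/after snapshot mechanics are left as an exercise):

select
        sn.name, ms.value
from
        v$statname      sn,
        v$mystat        ms
where
        ms.statistic# = sn.statistic#
and     sn.name in (
                'session uga memory',
                'session uga memory max',
                'session pga memory',
                'session pga memory max'
        )
;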

The simple query is only 4 seconds (100%) slower – but the amount of PGA memory it requires is huge: roughly 80 times as much as the complex query. What’s more, if you look back at the execution plan with the gather_plan_statistics hint you’ll see that this is memory that doesn’t show up in the workarea columns (OMem/1Mem/Used-Mem) of the plan. You’ve probably seen several articles on my blog using this factored subquery approach to generate data – now you know why it’s the default code I have in my testing template: if you want to generate 100M rows it makes an enormous difference.
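
As a sketch of how the pattern scales (the principle, not necessarily my exact template): two copies of a 4,000-row generator can produce at most 4,000 × 4,000 = 16,000,000 rows, so to get to 100M rows you simply join in a third copy of the generator:

with generator as (
	select
		rownum id
	from dual
	connect by
		rownum <= 4000
)
select
	rownum id
from
	generator	v1,
	generator	v2,
	generator	v3
where
	rownum <= 100000000
;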

[Further reading on "subquery factoring"]

11 Comments

  1. If your goal is to generate a large amount of data, it will be better to use a pipelined function, like this:

    create function generator (n pls_integer) return sys.odcinumberlist pipelined is
    begin
      for i in 1 .. n loop
        pipe row (i);
      end loop;
      return;
    end;
    /
    
    select count(*) from table(generator(1e7));
    

    It took 2 seconds vs. 5 seconds for the complex query on my laptop.

    Comment by Valentin Nikotin — December 8, 2011 @ 6:48 pm BST Dec 8,2011 | Reply

    • Valentin,

      Very nice – and it’s only 1.18 seconds on my laptop (although that goes up to 2.23 seconds if I gather plan statistics). More significantly, it’s another option that doesn’t use a lot of memory.

      Comment by Jonathan Lewis — December 8, 2011 @ 7:14 pm BST Dec 8,2011 | Reply

      • And this is the fastest way that I have found to count to 1e7

        create or replace function generator (n pls_integer) return sys.odcinumberlist pipelined is
          m pls_integer := trunc(n / 10);  -- number of full batches of 10
          r pls_integer := n - 10 * m;     -- remainder rows
        begin
          for i in 1 .. m loop
            -- manually unrolled: 10 pipe row calls per loop iteration
            pipe row (null);
            pipe row (null);
            pipe row (null);
            pipe row (null);
            pipe row (null);
            pipe row (null);
            pipe row (null);
            pipe row (null);
            pipe row (null);
            pipe row (null);
          end loop;
          for i in 1 .. r loop
            pipe row (null);
          end loop;
          return;
        end;
        /
        alter function generator compile plsql_code_type = native;
        
        It's 0.67 seconds, but if I use count(rownum) instead of count(*) the time increases to 1.6 seconds:
        
        SQL> set timing on
        SQL> select count(*) from table(generator(1e7));
        
          COUNT(*)
        ----------
          10000000
        
        Elapsed: 00:00:00.67
        SQL> select count(rownum) from table(generator(1e7));
        
        COUNT(ROWNUM)
        -------------
             10000000
        
        Elapsed: 00:00:01.60
        

        Comment by Valentin Nikotin — December 8, 2011 @ 8:10 pm BST Dec 8,2011 | Reply

  2. I’m a “level” guy myself; I think it looks more natural for some reason.

    I did wonder during the talk whether there would be any real difference between using the subquery generator with a cartesian product and just the simple select with a larger value for rownum (level). You said the more complex form was quicker, but not really by how much {you did only have 10 minutes}. I was going to test it myself, so you have saved me the task. I think I will change how I generate my test data.

    BTW I think you need to tidy up the info on PGA and UGA: as it stands it shows a massive increase in usage for both, but you only mention PGA in the text.

    Comment by mwidlake — December 8, 2011 @ 6:54 pm BST Dec 8,2011 | Reply

    • Martin,

      I should probably have pointed out that the increase in the PGA memory was due to the increase in the UGA – which is located inside the PGA.

      Comment by Jonathan Lewis — December 8, 2011 @ 7:16 pm BST Dec 8,2011 | Reply

  3. Hi Jonathan,

    have you checked this?

    http://blog.tanelpoder.com/2008/06/08/generating-lots-of-rows-using-connect-by-safely/

    Greetings.

    Comment by Richard — December 8, 2011 @ 7:39 pm BST Dec 8,2011 | Reply

  4. Richard,

    Thanks for the link – the follow-on to Adrian Billington’s article on the topic is particularly useful: http://www.oracle-developer.net/display.php?id=408.

    I am curious about the comment on Tanel’s article that the “with subquery” approach is faster than the inline view – I only used “with” because I used to use “select rownum from all_objects” to generate the starting rows, when either “with” or “no_merge” was a necessity, so I never checked for speed when I changed to the “connect by” approach.
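
    For reference, the old starting query was something like this (a sketch from memory – the row limit varied from test to test):

    select
            rownum id
    from
            all_objects
    where
            rownum <= 4000
    ;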

    Comment by Jonathan Lewis — December 9, 2011 @ 5:56 am BST Dec 9,2011 | Reply

  5. According to Tom Kyte (if I understood him correctly), it is a bug that

    select
            rownum id
    from
            dual
    connect by
            rownum <= 4000
    

    returns the same as

    select
            level id
    from
            dual
    connect by
            level <= 4000
    

    see
    http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:12718072439781#36409968445172

    Comment by Sokrates — December 9, 2011 @ 10:51 am BST Dec 9,2011 | Reply

    • Sokrates,

      Thanks for that link. It’s worth knowing. I’ll have to try to remember to fix up my examples from now on to use the right method; fortunately it’s not something I’ve ever put into production code (I think).

      Comment by Jonathan Lewis — December 11, 2011 @ 2:20 pm BST Dec 11,2011 | Reply

  6. What does the A-Time=00:04:57.76 or 00:04:56.47 mean? You mentioned 4 seconds and 8 seconds for both your queries respectively.

    Comment by Yuri — December 9, 2011 @ 11:28 am BST Dec 9,2011 | Reply

    • Yuri,

      I should have edited that column out. I was running the test on a multi-CPU 32-bit virtual machine running Windows XP Pro, guesting under Windows 7 on a 64-bit multi-core machine, and the combination seems to do some strange things with timing occasionally.

      In theory the plan is claiming that the query took about 5 minutes to run, when in practice the real elapsed time was much smaller.

      Comment by Jonathan Lewis — December 11, 2011 @ 1:17 pm BST Dec 11,2011 | Reply

