Oracle Scratchpad

June 11, 2018

dbms_random

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 8:31 am BST Jun 11,2018

In a recent ODC thread someone had a piece of SQL that was calling dbms_random.string(‘U’,20) to generate random values for a table of 100,000,000 rows. The thread was about how to handle the ORA-30009 error (not enough memory for operation) that is almost inevitable when you use the “select from dual connect by level <= n” strategy for generating very large numbers of rows, but this example of calling dbms_random.string() so frequently prompted me to point out an important CPU saving , and then publicise through this blog a little known fact (or deduction) about the dbms_random.string() function.

If you generate a random string of length 6 using only upper-case letters there are 308,915,766 different combinations (266); so if you’re after “nearly unique” values for 100 million rows then a six character string is probably good enough – it might give you a small percentage of values which appear in a handful rows but most of the values are likely to be unique or have two rows. If you want to get closer to uniqueness then 7 characters will do it, and 8 will make it almost certain that you will get a unique value in every row.

So if you want “nearly unique” and “random 20 character strings” it’s probably sufficient to generate random strings of 6 to 8 characters and then rpad() them up to 20 characters with spaced – the saving in CPU will be significant; roughly a factor of 3 (which is going to matter when you’re trying to generate 100 million rows. As a little demo I supplied the OP with a script to create a table of just one million random strings – first of 20 random characters, then of 6 random characters with 14 spaces appended. The run time (mostly CPU) dropped from 1 minute 55 seconds to 41 seconds.

Why is there such a difference ? Because to generate a random string of 6 characters Oracle generates a random string of one character six times in a row and concatenates them. The difference between 6 calls and 20 calls per row gives you that factor of around 3. For a quick demo, try running the following anonymous PL/SQL block:

rem
rem     Script:         random_speed.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2010
rem

begin
        dbms_random.seed(0);
        dbms_output.put_line(dbms_random.string('U',6));
        dbms_output.new_line;

        dbms_random.seed(0);
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
end;
/

Jere are the results I got from instances of 12.1.0.2, 12.2.0.1, and 18.1.0.0 (from LiveSQL):


BVGFJB
B
V
G
F
J
B

I haven’t shown the tests for all the possible dbms_random.string() options but, unsurprisingly, changing the test to use the ‘L’ (lower case alpha) option produces the same effect (and the same 6 letters changed to lower case). The same effect, with different characters, also appeared using the ‘A’ (mixed case alpha), ‘X’ (uppercase alphanumeric) and ‘P’ (all printable characters) options.

I haven’t considered the effect of using a multi-byte character set – maybe Oracle calls its random number generator once per byte rather than once per character. The investigation is left as an exercise to the interested reader.

tl;dr

When generating a very large number of random strings – keep the “operational” part of the string as short as you can and leave the rest to be rpad()‘ed.

7 Comments »

  1. I wondered if using Linux utilities to generate the random strings might be quicker.

    Based on the first attempt, no it is not quicker

    PL/SQL

    
    
    JKSTILL@p1 > l
      1  declare
      2
      3     s varchar2(12);
      4
      5  begin
      6
      7     for i in 1..1e6
      8     loop
      9
     10             s := dbms_random.string('U',12);
     11
     12     end loop;
     13
     14
     15* end;
    JKSTILL@p1 > /
    
    PL/SQL procedure successfully completed.
    
    Elapsed: 00:00:09.83
    
    

    I have seen that run in 7 seconds as well when the server is less busy.

    From the OS generating similar data:

    
     time dd if=/dev/urandom bs=2048  | tr -cd '[:upper:]' | dd iflag=fullblock bs=12 count=1000000 | perl -e 'my $FH=*STDIN; while ( my $l = read $FH, $d, 12 ){ print "$d\n" }' > x
    1000000+0 records in
    1000000+0 records out
    12000000 bytes (12 MB) copied, 8.80371 s, 1.4 MB/s
    
    real    0m8.808s
    user    0m0.845s
    sys     0m9.979s
    
    

    There may be better ways to do this. I used Perl because the sed regex was becoming incomprehensible.

    The amount of time added by Perl is negligible; this is all /dev/urandom.

    So Oracle developers seem to have done quite a good job at optimizing the dbms_random.string() function, even if it is somewhat slow.

    Of course by using /dev/urandom as significant portion of the output is filtered by the tr command.

    If there were a method to get only a certain range of data from something like /dev/urandom, this might be much faster.

    Oracle may be using /dev/urandom as well, though I did not try to discover if that is the case.

    Comment by jkstill — June 11, 2018 @ 5:30 pm BST Jun 11,2018 | Reply

    • I’m just wondering what might happen if this was done as an external table with a pre-processor script that output /dev/urandom directly? I don’t have a system that I can test it on at the moment unfortunately.

      Comment by John — June 14, 2018 @ 2:27 pm BST Jun 14,2018 | Reply

  2. Hi, Jonathan

    Thanks for your this blog for my thread on ODC database forum.

    Apart from I modified my code according to your nice suggestion – using rpad() to fill the left 14 characters for “rpad(dbms_random.string(‘x’, 6), 20)”. Pleas take a look at the code as follows,

    
    SQL> drop table t1 purge;
       
    Table dropped.
      
    SQL> 
    SQL> set timing on
    SQL> 
    SQL> create table t1
      2  segment creation immediate
      3  nologging
      4  as
      5  with x as ((SELECT level r FROM dual CONNECT BY level <= 1e4))  -->comment to fix WordPress format issue
      6  select r as id
      7            , to_char(sysdate + r/24/3600, 'yyyy-mm-dd hh24:mi:ss') as inc_datetime
      8            , trunc(dbms_random.value(0, 100)) as random_id
      9            , rpad(dbms_random.string('x', 6), 20) random_string
     10  from  (SELECT /*+ no_merge */
     11             rownum r
     12             FROM  x d1,
     13                          x d2
     14             WHERE  rownum <= 1e8   --> comment to fix WordPress format issue
     15  );
     
    Table T1 created.
     
    Elapsed: 01:09:30.462
    
    SQL> 
    SQL> set timing off
    SQL> 
    SQL> select sum(bytes)/1024/1024 size_mb from user_segments
      2  where segment_name = 'T1';
     
       SIZE_MB
    ----------------
               5990
    
    

    I also feel this speed is very slow. Do you have better ideas for improving my code? I am not sure that whether it is more efficient and fast about using parallel hint.

    Best Regards
    Quanwen Zhao

    Comment by Quanwen Zhao — June 12, 2018 @ 8:27 am BST Jun 12,2018 | Reply

    • Quanwen Zhao,

      The first thing to do is check the execution plan then investigate WHERE the time is spent.

      Running your code for 1e5 rows takes my machine 6 seconds which appears to be similar to your hardware (6 seconds * 1e3 = 1 hour 40 minutes) – so the rates are in the same ballpark. Before running 1e8 rows I would experiment with 1e5 or 1e6 a few times.

      Creating one column at a time:

      id: 0.15 seconds
      inc_datetime: 0.24 seconds
      random_id: 1.57 seconds
      random_string: 4.47 seconds

      random_string without padding: 4.21 seconds — I used (cast(dbms_random.string(‘x’, 6) as varchar2(6))

      random_string reduced to 4 random characters with no padding: 3.07 seconds.

      Almost all the time goes on calls to dbms_random() – so there’s no way to reduce the time and still get the same data – unless you run the data creation in parallel.

      There is a problem with parallelism though: including rownum in the query serializes the data generation. (See: https://jonathanlewis.wordpress.com/2015/02/12/parallel-rownum/ and https://jonathanlewis.wordpress.com/2013/07/25/parallel-to-serial/ ).

      In this case you can avoid rownum (in the main query – it’s safe in the “with subquery”) by replacing your reference to rownum an expression like “(d1.r-1) * 1e4 + d2.r.” (and get rid of the rownum <= 1e8). If you do this you should find the query goes parallel. To make effective use of parallelism, though, you will have to add a padding column to the driving "with subquery" so that the materialized view takes a larger number of blocks.

      e.g.

      create table t1
      nologging
      as
      with x as (
              select  /*+ materialize cardinality(1000) */
                      level rn0, rpad('x', 300) padding
              from
                      dual
              connect by
                      level <= 1e3  -- &gt; comment to fix WordPress format issue
      ),
      y as (
              select
                      /*+ parallel(3) cardinality(1e6) */
                      (d1.rn0 - 1) * 1000 + d2.rn0    rn1
              from    x d1, x d2
      )
      select
              rn1 as id,
              to_char(
                      sysdate + rn1 / 86400,
                      'yyyy-mm-dd hh24:mi:ss'
              )                                       as inc_datetime,
              trunc(dbms_random.value(0, 100))        as random_id,
              rpad(dbms_random.string('x', 6), 20)    as random_string
      from
              y
      ;
      
      
      

      Running serially this completed in 61 seconds, while running parallel 3 it completed in 26 seconds. (The poor scaling is probably because it's running as VM with a deliberate choke on CPU consumption).

      Note that I've created 1e6 rows by using the same CTE with 1e3 rows twice.

      Comment by Jonathan Lewis — June 14, 2018 @ 1:04 pm BST Jun 14,2018 | Reply

      • Jonathan,

        Thank you very much. -:)

        I’m going to read two posts you provided to me and continuously do some testing according to your good suggestion!

        Comment by Quanwen Zhao — June 16, 2018 @ 9:37 am BST Jun 16,2018 | Reply

  3. […] 14, 2018 : according to Jonathan gave me a comment on his blog post “dbms_random”, I erased all this prompt “SQL> “ in […]

    Pingback by SET SQLFORMAT on SQLcl 18.1.1 | Focusing and Concentrating Oracle Recipes, Tips and Techniques — June 14, 2018 @ 4:14 am BST Jun 14,2018 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Powered by WordPress.com.