Oracle Scratchpad

June 11, 2018

dbms_random

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 8:31 am BST Jun 11,2018

In a recent ODC thread someone had a piece of SQL that was calling dbms_random.string(‘U’,20) to generate random values for a table of 100,000,000 rows. The thread was about how to handle the ORA-30009 error (not enough memory for operation) that is almost inevitable when you use the “select from dual connect by level <= n” strategy for generating very large numbers of rows, but this example of calling dbms_random.string() so frequently prompted me to point out an important CPU saving , and then publicise through this blog a little known fact about the dbms_random.string() function.

If you generate a random string of length 6 using only upper-case letters there are 308,915,766 different combinations (266); so if you’re after “nearly unique” values for 100 million rows then a six character string is probably good enough – it might give you a small percentage of values which appear in a handful rows but most of the values are likely to be unique or have two rows. If you want to get closer to uniqueness then 7 characters will do it, and 8 will make it almost certain that you will get a unique value in every row.

So if you want “nearly unique” and “random 20 character strings” it’s probably sufficient to generate random strings of 6 to 8 characters and then rpad() them up to 20 characters with spaced – the saving in CPU will be significant; roughly a factor of 3 (which is going to matter when you’re trying to generate 100 million rows. As a little demo I supplied the OP with a script to create a table of just one million random strings – first of 20 random characters, then of 6 random characters with 14 spaces appended. The run time (mostly CPU) dropped from 1 minute 55 seconds to 41 seconds.

Why is there such a difference ? Because to generate a random string of 6 characters Oracle generates a random string of one character six times in a row and concatenates them. The difference between 6 calls and 20 calls per row gives you that factor of around 3. For a quick demo, try running the following anonymous PL/SQL block:

rem
rem     Script:         random_speed.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2010
rem

begin
        dbms_random.seed(0);
        dbms_output.put_line(dbms_random.string('U',6));
        dbms_output.new_line;

        dbms_random.seed(0);
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
        dbms_output.put_line(dbms_random.string('U',1));
end;
/

Here are the results I got from instances of 12.1.0.2, 12.2.0.1, and 18.1.0.0 (from LiveSQL):


BVGFJB
B
V
G
F
J
B

I haven’t shown the tests for all the possible dbms_random.string() options but, unsurprisingly, changing the test to use the ‘L’ (lower case alpha) option produces the same effect (and the same 6 letters changed to lower case). The same effect, with different characters, also appeared using the ‘A’ (mixed case alpha), ‘X’ (uppercase alphanumeric) and ‘P’ (all printable characters) options.

I haven’t considered the effect of using a multi-byte character set – maybe Oracle calls its random number generator once per byte rather than once per character. The investigation is left as an exercise to the interested reader.

tl;dr

When generating a very large number of random strings – keep the “operational” part of the string as short as you can and leave the rest to be rpad()‘ed.

Update 24th Feb 2020

Until yesterday I hadn’t considered the possibility that I could just examine the dbms_random package – it’s not wrapped, and it’s generated by the script $ORACLE_HOME/rdbms/admin/dbmsrand.sql; so you just have to read the (very short) source to see how it generates strings (and to answer the question about multibyte character sets).

 

 

7 Comments »

  1. I wondered if using Linux utilities to generate the random strings might be quicker.

    Based on the first attempt, no it is not quicker

    PL/SQL

    
    
    JKSTILL@p1 > l
      1  declare
      2
      3     s varchar2(12);
      4
      5  begin
      6
      7     for i in 1..1e6
      8     loop
      9
     10             s := dbms_random.string('U',12);
     11
     12     end loop;
     13
     14
     15* end;
    JKSTILL@p1 > /
    
    PL/SQL procedure successfully completed.
    
    Elapsed: 00:00:09.83
    
    

    I have seen that run in 7 seconds as well when the server is less busy.

    From the OS generating similar data:

    
     time dd if=/dev/urandom bs=2048  | tr -cd '[:upper:]' | dd iflag=fullblock bs=12 count=1000000 | perl -e 'my $FH=*STDIN; while ( my $l = read $FH, $d, 12 ){ print "$d\n" }' > x
    1000000+0 records in
    1000000+0 records out
    12000000 bytes (12 MB) copied, 8.80371 s, 1.4 MB/s
    
    real    0m8.808s
    user    0m0.845s
    sys     0m9.979s
    
    

    There may be better ways to do this. I used Perl because the sed regex was becoming incomprehensible.

    The amount of time added by Perl is negligible; this is all /dev/urandom.

    So Oracle developers seem to have done quite a good job at optimizing the dbms_random.string() function, even if it is somewhat slow.

    Of course by using /dev/urandom as significant portion of the output is filtered by the tr command.

    If there were a method to get only a certain range of data from something like /dev/urandom, this might be much faster.

    Oracle may be using /dev/urandom as well, though I did not try to discover if that is the case.

    Comment by jkstill — June 11, 2018 @ 5:30 pm BST Jun 11,2018 | Reply

    • I’m just wondering what might happen if this was done as an external table with a pre-processor script that output /dev/urandom directly? I don’t have a system that I can test it on at the moment unfortunately.

      Comment by John — June 14, 2018 @ 2:27 pm BST Jun 14,2018 | Reply

  2. Hi, Jonathan

    Thanks for your this blog for my thread on ODC database forum.

    Apart from I modified my code according to your nice suggestion – using rpad() to fill the left 14 characters for “rpad(dbms_random.string(‘x’, 6), 20)”. Pleas take a look at the code as follows,

    
    SQL> drop table t1 purge;
       
    Table dropped.
      
    SQL> 
    SQL> set timing on
    SQL> 
    SQL> create table t1
      2  segment creation immediate
      3  nologging
      4  as
      5  with x as ((SELECT level r FROM dual CONNECT BY level <= 1e4))  -->comment to fix WordPress format issue
      6  select r as id
      7            , to_char(sysdate + r/24/3600, 'yyyy-mm-dd hh24:mi:ss') as inc_datetime
      8            , trunc(dbms_random.value(0, 100)) as random_id
      9            , rpad(dbms_random.string('x', 6), 20) random_string
     10  from  (SELECT /*+ no_merge */
     11             rownum r
     12             FROM  x d1,
     13                          x d2
     14             WHERE  rownum <= 1e8   --> comment to fix WordPress format issue
     15  );
     
    Table T1 created.
     
    Elapsed: 01:09:30.462
    
    SQL> 
    SQL> set timing off
    SQL> 
    SQL> select sum(bytes)/1024/1024 size_mb from user_segments
      2  where segment_name = 'T1';
     
       SIZE_MB
    ----------------
               5990
    
    

    I also feel this speed is very slow. Do you have better ideas for improving my code? I am not sure that whether it is more efficient and fast about using parallel hint.

    Best Regards
    Quanwen Zhao

    Comment by Quanwen Zhao — June 12, 2018 @ 8:27 am BST Jun 12,2018 | Reply

    • Quanwen Zhao,

      The first thing to do is check the execution plan then investigate WHERE the time is spent.

      Running your code for 1e5 rows takes my machine 6 seconds which appears to be similar to your hardware (6 seconds * 1e3 = 1 hour 40 minutes) – so the rates are in the same ballpark. Before running 1e8 rows I would experiment with 1e5 or 1e6 a few times.

      Creating one column at a time:

      id: 0.15 seconds
      inc_datetime: 0.24 seconds
      random_id: 1.57 seconds
      random_string: 4.47 seconds

      random_string without padding: 4.21 seconds — I used (cast(dbms_random.string(‘x’, 6) as varchar2(6))

      random_string reduced to 4 random characters with no padding: 3.07 seconds.

      Almost all the time goes on calls to dbms_random() – so there’s no way to reduce the time and still get the same data – unless you run the data creation in parallel.

      There is a problem with parallelism though: including rownum in the query serializes the data generation. (See: https://jonathanlewis.wordpress.com/2015/02/12/parallel-rownum/ and https://jonathanlewis.wordpress.com/2013/07/25/parallel-to-serial/ ).

      In this case you can avoid rownum (in the main query – it’s safe in the “with subquery”) by replacing your reference to rownum an expression like “(d1.r-1) * 1e4 + d2.r.” (and get rid of the rownum <= 1e8). If you do this you should find the query goes parallel. To make effective use of parallelism, though, you will have to add a padding column to the driving "with subquery" so that the materialized view takes a larger number of blocks.

      e.g.

      create table t1
      nologging
      as
      with x as (
              select  /*+ materialize cardinality(1000) */
                      level rn0, rpad('x', 300) padding
              from
                      dual
              connect by
                      level <= 1e3  -- &gt; comment to fix WordPress format issue
      ),
      y as (
              select
                      /*+ parallel(3) cardinality(1e6) */
                      (d1.rn0 - 1) * 1000 + d2.rn0    rn1
              from    x d1, x d2
      )
      select
              rn1 as id,
              to_char(
                      sysdate + rn1 / 86400,
                      'yyyy-mm-dd hh24:mi:ss'
              )                                       as inc_datetime,
              trunc(dbms_random.value(0, 100))        as random_id,
              rpad(dbms_random.string('x', 6), 20)    as random_string
      from
              y
      ;
      
      
      

      Running serially this completed in 61 seconds, while running parallel 3 it completed in 26 seconds. (The poor scaling is probably because it's running as VM with a deliberate choke on CPU consumption).

      Note that I've created 1e6 rows by using the same CTE with 1e3 rows twice.

      Comment by Jonathan Lewis — June 14, 2018 @ 1:04 pm BST Jun 14,2018 | Reply

      • Jonathan,

        Thank you very much. -:)

        I’m going to read two posts you provided to me and continuously do some testing according to your good suggestion!

        Comment by Quanwen Zhao — June 16, 2018 @ 9:37 am BST Jun 16,2018 | Reply

  3. […] 14, 2018 : according to Jonathan gave me a comment on his blog post “dbms_random”, I erased all this prompt “SQL> “ in […]

    Pingback by SET SQLFORMAT on SQLcl 18.1.1 | Focusing and Concentrating Oracle Recipes, Tips and Techniques — June 14, 2018 @ 4:14 am BST Jun 14,2018 | Reply

  4. No matter, Jonathan :-)

    You can fix up them on your spare time. I’ve also found like this statement “connect by level <= 1e6", in particularly, “<" symbol often have an issue. It’s best to replace with “<“. If it’s “>” and replace with “>”. All of these is what I find from your “Comments” – https://jonathanlewis.wordpress.com/comments-on-comment/. Many thanks.

    BTW afterwards I add 2 hints – PARALLEL and NO_MERGE on my SQL code snippet as well, nevertheless its spending time is still 1 hours and 10 minutes. Where is about consuming time? “1e8” rows of data is too much? How to trace these time cost?

    drop table t1 purge;
    
    Table T1 dropped.
    
    Elapsed: 00:00:00.337
    
    create table t1
    segment creation immediate
    nologging
    as
    with x as ((SELECT level r FROM dual CONNECT BY level <= 1e4))
    select /*+ parallel */ /*+ no_merge */
           r as id
           , to_char(sysdate + r/24/3600, 'yyyy-mm-dd hh24:mi:ss') as inc_datetime
           , trunc(dbms_random.value(0, 100)) as random_id
           , rpad(dbms_random.string('x', 6), 20) random_string
    --     , dbms_random.string('x', 20) random_string
    from  (SELECT rownum r
           FROM  x d1,
                 x d2
           WHERE  rownum &lt;= 1e8
    );
    
    Table T1 created.
    
    Elapsed: 01:10:23.118
    

    Very appreciate you if any help.

    Best Regards
    Quanwen Zhao

    Comment by Quanwen Zhao — June 14, 2018 @ 8:00 am BST Jun 14,2018 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by WordPress.com.