Oracle Scratchpad

March 12, 2010

Sample Clause

Filed under: Infrastructure,Troubleshooting — Jonathan Lewis @ 5:22 pm GMT Mar 12,2010

Updated July 2020 -with comments inline. The “buggy” behaviour is consistent across all recent versions of Oracle (from 10gR2 onwards), so is presumably and update to the (internal) specification.

In some of my previous posts (particularly the ones about analysing the data by running aggregate queries) I’ve mentioned the “sample” clause from time to time, so I thought I’d better write a short note about it for those not familiar with it.

To demonstrate the feature, my first step was to run a test script I wrote a few years ago – and discovered that the feature seems to be broken (or, maybe changed) in 10.2.0.3 and 11.1.0.7) – so at the end of this post I’ve supplied a little test that you can try at home on your own versions.

The only things I use the sample clause for are tablescans and index fast full scans on a single table or index. In fact this was an explicit limitation in the earlier versions of Oracle, but this restriction has been relaxed in more recent versions. Typically, I’d have a query with the shape:

select ...
from    tableX {sample clause}
where   ...
group by ...
order by ...

The intention was to select a (fairly small) representative sample of data from a very large table.

In 10g the syntax for the sample clause is:

sample [block] ( N [,M] )  [seed(S)]
  • You can sample rows or blocks (optional keyword [block])
  • You must specify the percentage sample size (N) – recognizing that the result will only approximate that size
  • You can specify that the sample should consist of groups of ([,M]) consecutive rows or blocks
    • Updated: It looks as if [,M] does not apply to row sampling, only to block sampling
  • You can supply a seed value ([seed (S)]) to the random number generator that drives the sample so that the sample is repeatable.
  • The ‘seed’ option appeared in 10g, and is documented there.
  • The ‘consecutive rows/blocks’ option doesn’t seem to be documented but appears in 9i.
    • Updated: It looks as if for single block samples you HAVE to use [,1] from 10g onwards

So here are the problems in 10g:

The option “sample (N, M)” doesn’t seem to work properly – instead of N% of the data using groups of M consecutive rows in the table, it seems to sample N*M percent of the rows completely randomly across the table. 9i and 10g behave very differently. Fair enough, it’s not documented, so I shouldn’t expect the “,M” feature to work anyway.

Updated: the behaviour above may have been a bug in 10.1; from 10.2 the correct N% is selected and the [,M] is simply ignored.

The option “sample block N” doesn’t seem to work properly. I have a table with 1,640 blocks, so a sample of 5% ought to show me something like 82 blocks.  In 9i I keep seeing sample sizes between 69 and 97 blocks, which is not unreasonable.  In 10g I keep getting sample sizes of zero, 66, 132 and 198 blocks – with a couple of other numbers occasionally. This is not reasonable. Luckily there is a workaround: if I use the undocumented syntax “sample (N,1)” the 10g results are then consistent with the 9i results.

Updated: it looks as if the [,1] is a design feature rather than a workaround, so for blocks you have to use (n,m).

Here’s the stripped-down version of my test script. It generates 200,000 rows, and the critical item in the data is the id_col which is a zero-padded counter that lets us track the data selected fairly easily.

rem
rem     Script:         c_sample.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2002
rem     Purpose:        Investigation of 9.2 SAMPLE clause
rem

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum  id
        from    all_objects
        where   rownum <= 5000 -- > comment to avoid wordpress format issue
)
select
        lpad(rownum,10,'0')     id_col,
        rpad('x',42)            padding
from
        generator       v1,
        generator       v2
where
        rownum <= 200000 -- > comment to avoid wordpress format issue
;

select
        substr(id_col,1,9),
        count(*)
from
        t1 sample (1,10) -- seed (0)
group by
        substr(id_col,1,9)
order by
        substr(id_col,1,9)
;

-- collect and check the table stats here

rem
rem     We get about 122 rows per block and 1,640 blocks at 8K blocks
rem     for the 200,000 row test (first sample size)
rem
rem     A sample of 1 percent should get us 2,000 rows, but if we sample
rem     in sets of 10 then drop off the last digit of the id_col, we
rem     should get about 200 pairs of values - where each pair sums to
rem     10 rows. Expect about 400 rows of output in 9i.
rem
rem     The first pattern I got contained lines like:
rem             000000108          4
rem             000000109          6            10 consecutive rows
rem             000000206          8
rem             000000207          2            10 consecutive rows
rem             000000336          7
rem             000000337          3            10 consecutive rows
rem
rem     But this pattern doesn't appear from 10g onwards. (You'll
rem     get about 2,000 rows of output).
rem

break on report
compute sum of count(*) on report

select
        substr(id_col,1,9),
        count(*)
from
        t1 sample (1,10) -- seed (0)
group by
        substr(id_col,1,9)
order by
        substr(id_col,1,9)
;

rem
rem     Similarly a sample of 10 percent is about 160 blocks.
rem     But 8 blocks in a row is about 976 rows. So if we cut off
rem     the last three digits of the id_col we should see another
rem     pattern. Expect about 42 rows of output.
rem
rem     My second run gave
rem             0000003        217
rem             0000004        759              976 consecutive rows
rem             0000007        557
rem             0000008        419              976 consecutive rows
rem             0000017        919
rem             0000018         57              976 consecutive rows
rem

select
        substr(id_col,1,7),
        count(*)
from
        t1 sample block (10,8) -- seed (0)
group by
        substr(id_col,1,7)
order by
        substr(id_col,1,7)
;

rem
rem     A simple sample of 5% of the blocks, reported by
rem     stripping out the block detail from the rowid
rem     should give me about 82 blocks - so 82 rows of
rem     output.
rem

select
        substr(rowid,1,15),
        count(*)
from
        t1 sample block (5,1)
group by
        substr(rowid,1,15)
order by
        substr(rowid,1,15)
;



4 Comments »

  1. Nice clear analysis (as always) thanks.
    I’ve wanted to use this type functionality test out query plans on subsets of the real tables to limit the running result set sizes of a execution plan. Of course I can use CTAS to create smaller versions of the tables but that’s resource intensive. It’s so tempting to think there might be a way to accomplish the same thing with some limiting clause like the sample clause. Of course the sample clause is problematic in its own ways but still wondering if there is any way to do it. I started to play with creating views that had the sample clause in them. As I recall the queries worked but I didn’t get far enough to determine whether the timing and resource results were worth anything. If I go back and try some more I will at least know of some of the other gotchas.

    Comment by Kyle Hailey — March 16, 2010 @ 10:18 pm GMT Mar 16,2010 | Reply

    • Kyle,

      Thanks for the comment. From memory there was a time when the sample clause could only work with a single table query, and then Oracle did something to extend it. I never got around to testing it, though. I would be quite surprised if it worked well given the conflicting requirements of relational integrity and random selection.

      Comment by Jonathan Lewis — March 20, 2010 @ 1:08 pm GMT Mar 20,2010 | Reply

  2. […] 21-How does sample clause work? Jonathan Lewis-Sample Clause […]

    Pingback by Blogroll Report 05/02/2010 – 12/03/2010 « Coskan’s Approach to Oracle — April 23, 2010 @ 1:08 am BST Apr 23,2010 | Reply

  3. […] Jonathan Lewis mentioned that this second parameter is the number of consecutive groups/rows during row or block sampling. It was introduced during changes for the SAMPLE clause in 10g. […]

    Pingback by 10 “secret” Oracle DB features you might not know – svenweller — July 2, 2020 @ 9:50 pm BST Jul 2,2020 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by WordPress.com.