Updated July 2020, with comments inline. The “buggy” behaviour is consistent across all recent versions of Oracle (from 10gR2 onwards), so is presumably an update to the (internal) specification.
In some of my previous posts (particularly the ones about analysing the data by running aggregate queries) I’ve mentioned the “sample” clause from time to time, so I thought I’d better write a short note about it for those not familiar with it.
To demonstrate the feature, my first step was to run a test script I wrote a few years ago – and discovered that the feature seems to be broken (or, maybe, changed) in 10.2.0.3 and 11.1.0.7 – so at the end of this post I’ve supplied a little test that you can try at home on your own versions.
The only things I use the sample clause for are tablescans and index fast full scans on a single table or index. In fact this was an explicit limitation in the earlier versions of Oracle, but this restriction has been relaxed in more recent versions. Typically, I’d have a query with the shape:
select ... from tableX {sample clause} where ... group by ... order by ...
The intention was to select a (fairly small) representative sample of data from a very large table.
In 10g the syntax for the sample clause is:
sample [block] ( N [,M] ) [seed(S)]
- You can sample rows or blocks (optional keyword [block])
- You must specify the percentage sample size (N) – recognizing that the result will only approximate that size
- You can specify that the sample should consist of groups of ([,M]) consecutive rows or blocks
- Updated: It looks as if [,M] does not apply to row sampling, only to block sampling
- You can supply a seed value ([seed (S)]) to the random number generator that drives the sample so that the sample is repeatable.
- The ‘seed’ option appeared in 10g, and is documented there.
- The ‘consecutive rows/blocks’ option doesn’t seem to be documented but appears in 9i.
- Updated: It looks as if for single block samples you HAVE to use [,1] from 10g onwards
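Oracle doesn’t document the internal mechanism, but row sampling behaves as if each row is kept or discarded by an independent coin toss with probability N/100 – which is why the result only approximates N%, and why supplying a seed makes the sample repeatable. Here’s a rough model in Python (purely my own illustration, not Oracle code; the function name and details are invented for the sketch):

```python
import random

def row_sample(n_rows, pct, seed=None):
    """Model row sampling as an independent Bernoulli trial per row:
    each row is kept with probability pct/100, so the sample size only
    approximates pct% of the table.  A fixed seed makes it repeatable,
    which is the point of the SEED clause."""
    rng = random.Random(seed)
    return [r for r in range(n_rows) if rng.random() < pct / 100]

# 1% of 200,000 rows: expect roughly 2,000 rows, not exactly 2,000
sample = row_sample(200_000, 1, seed=0)
print(len(sample))

# the same seed reproduces exactly the same sample
assert row_sample(200_000, 1, seed=0) == sample
```

The model also explains why you can’t ask for an exact row count: the sample size itself is a random variable with mean N% of the rows.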
So here are the problems in 10g:
The option “sample (N, M)” doesn’t seem to work properly – instead of N% of the data using groups of M consecutive rows in the table, it seems to sample N*M percent of the rows completely randomly across the table. 9i and 10g behave very differently. Fair enough, it’s not documented, so I shouldn’t expect the “,M” feature to work anyway.
Updated: the behaviour above may have been a bug in 10.1; from 10.2 the correct N% is selected and the [,M] is simply ignored.
The option “sample block N” doesn’t seem to work properly. I have a table with 1,640 blocks, so a sample of 5% ought to show me something like 82 blocks. In 9i I keep seeing sample sizes between 69 and 97 blocks, which is not unreasonable. In 10g I keep getting sample sizes of zero, 66, 132 and 198 blocks – with a couple of other numbers occasionally. This is not reasonable. Luckily there is a workaround: if I use the undocumented syntax “sample block (N,1)” the 10g results are then consistent with the 9i results.
Updated: it looks as if the [,1] is a design feature rather than a workaround, so for single-block samples you have to use (N,1).
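For what it’s worth, the 9i figures are exactly what simple binomial arithmetic predicts if each block is kept independently with probability 5/100: an expected 82 blocks with a standard deviation of about 8.8, so a range of 69 to 97 blocks is comfortably within two standard deviations. A quick check (Python, my own arithmetic, not Oracle code):

```python
import math

blocks, pct = 1640, 5
p = pct / 100

mean = blocks * p                      # expected number of sampled blocks
sd = math.sqrt(blocks * p * (1 - p))   # binomial standard deviation

print(mean)          # 82.0
print(round(sd, 1))  # 8.8

# the 9i extremes (69 and 97 blocks) are under 2 standard deviations away
assert abs(69 - mean) / sd < 2
assert abs(97 - mean) / sd < 2

# the odd 10g figures are exact multiples of 66 - nothing like a
# plausible binomial spread around 82
assert [0, 66, 132, 198] == [66 * i for i in range(4)]
```

The 10g figures, by contrast, cluster on multiples of 66, which is why they look like a different (and broken) mechanism rather than ordinary sampling noise.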
Here’s the stripped-down version of my test script. It generates 200,000 rows, and the critical item in the data is the id_col which is a zero-padded counter that lets us track the data selected fairly easily.
rem
rem     Script:   c_sample.sql
rem     Author:   Jonathan Lewis
rem     Dated:    June 2002
rem     Purpose:  Investigation of 9.2 SAMPLE clause
rem

create table t1
nologging
as
with generator as (
    select  --+ materialize
            rownum id
    from    all_objects
    where   rownum <= 5000 -- > comment to avoid wordpress format issue
)
select
    lpad(rownum,10,'0')     id_col,
    rpad('x',42)            padding
from
    generator   v1,
    generator   v2
where
    rownum <= 200000 -- > comment to avoid wordpress format issue
;

select
    substr(id_col,1,9), count(*)
from
    t1 sample (1,10) -- seed (0)
group by
    substr(id_col,1,9)
order by
    substr(id_col,1,9)
;

-- collect and check the table stats here

rem
rem     We get about 122 rows per block and 1,640 blocks at 8K blocks
rem     for the 200,000 row test (first sample size)
rem
rem     A sample of 1 percent should get us 2,000 rows, but if we sample
rem     in sets of 10 then drop off the last digit of the id_col, we
rem     should get about 200 pairs of values - where each pair sums to
rem     10 rows. Expect about 400 rows of output in 9i.
rem
rem     The first pattern I got contained lines like:
rem         000000108    4
rem         000000109    6      10 consecutive rows
rem         000000206    8
rem         000000207    2      10 consecutive rows
rem         000000336    7
rem         000000337    3      10 consecutive rows
rem
rem     But this pattern doesn't appear from 10g onwards. (You'll
rem     get about 2,000 rows of output).
rem

break on report
compute sum of count(*) on report

select
    substr(id_col,1,9), count(*)
from
    t1 sample (1,10) -- seed (0)
group by
    substr(id_col,1,9)
order by
    substr(id_col,1,9)
;

rem
rem     Similarly a sample of 10 percent is about 160 blocks.
rem     But 8 blocks in a row is about 976 rows. So if we cut off
rem     the last three digits of the id_col we should see another
rem     pattern. Expect about 42 rows of output.
rem
rem     My second run gave
rem         0000003    217
rem         0000004    759     976 consecutive rows
rem         0000007    557
rem         0000008    419     976 consecutive rows
rem         0000017    919
rem         0000018     57     976 consecutive rows
rem

select
    substr(id_col,1,7), count(*)
from
    t1 sample block (10,8) -- seed (0)
group by
    substr(id_col,1,7)
order by
    substr(id_col,1,7)
;

rem
rem     A simple sample of 5% of the blocks, reported by
rem     stripping out the block detail from the rowid
rem     should give me about 82 blocks - so 82 rows of
rem     output.
rem

select
    substr(rowid,1,15), count(*)
from
    t1 sample block (5,1)
group by
    substr(rowid,1,15)
order by
    substr(rowid,1,15)
;
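The expectations quoted in the script’s comments are just arithmetic on the table geometry. This little Python sketch (my own cross-check, not part of the original script) reproduces the numbers:

```python
rows, blocks = 200_000, 1640

rows_per_block = rows / blocks
print(round(rows_per_block))   # 122 rows per block, as the comments say

# sample (1,10): 1% of 200,000 rows = 2,000 rows; taken in runs of 10
# consecutive rows, dropping the last digit of id_col gives ~200 runs,
# and each run typically straddles two 9-digit prefixes: ~400 output rows
sampled_rows = rows * 1 // 100
runs = sampled_rows // 10
print(sampled_rows, runs, runs * 2)   # 2000 200 400

# sample block (10,8): 10% of 1,640 blocks = 164 blocks (about 160),
# and 8 consecutive blocks hold about 8 * 122 = 976 rows
print(blocks * 10 // 100, 8 * round(rows_per_block))   # 164 976

# sample block (5,1): 5% of 1,640 blocks = 82 blocks = ~82 output rows
print(blocks * 5 // 100)   # 82
```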
Nice clear analysis (as always) thanks.
I’ve wanted to use this type of functionality to test out query plans on subsets of the real tables, to limit the result set sizes flowing through an execution plan. Of course I can use CTAS to create smaller versions of the tables, but that’s resource intensive. It’s tempting to think there might be a way to accomplish the same thing with some limiting clause like the sample clause. Of course the sample clause is problematic in its own ways, but I’m still wondering if there is any way to do it. I started to play with creating views that had the sample clause in them. As I recall the queries worked, but I didn’t get far enough to determine whether the timing and resource results were worth anything. If I go back and try some more I will at least know of some of the other gotchas.
Comment by Kyle Hailey — March 16, 2010 @ 10:18 pm GMT Mar 16,2010 |
Kyle,
Thanks for the comment. From memory there was a time when the sample clause could only work with a single table query, and then Oracle did something to extend it. I never got around to testing it, though. I would be quite surprised if it worked well given the conflicting requirements of relational integrity and random selection.
Comment by Jonathan Lewis — March 20, 2010 @ 1:08 pm GMT Mar 20,2010 |
[…] 21-How does sample clause work? Jonathan Lewis-Sample Clause […]
Pingback by Blogroll Report 05/02/2010 – 12/03/2010 « Coskan’s Approach to Oracle — April 23, 2010 @ 1:08 am BST Apr 23,2010 |
[…] Jonathan Lewis mentioned that this second parameter is the number of consecutive groups/rows during row or block sampling. It was introduced during changes for the SAMPLE clause in 10g. […]
Pingback by 10 “secret” Oracle DB features you might not know – svenweller — July 2, 2020 @ 9:50 pm BST Jul 2,2020 |