Oracle Scratchpad

March 12, 2010

Sample Clause

Filed under: Infrastructure,Troubleshooting — Jonathan Lewis @ 5:22 pm BST Mar 12,2010

In some of my previous posts (particularly the ones about analysing the data by running aggregate queries) I’ve mentioned the “sample” clause from time to time, so I thought I’d better write a short note about it for those not familiar with it.

To demonstrate the feature, my first step was to run a test script I wrote a few years ago – and discovered that the feature is broken (at least, in 10.2.0.3 and 11.1.0.7) – so at the end of this post I’ve supplied a little test that you can try at home on your own versions.

The only things I use the sample clause for are tablescans and index fast full scans on a single table or index. In fact this was an explicit limitation in the earlier versions of Oracle, but this restriction has been relaxed in more recent versions. Typically, I’d have a query with the shape:

select ...
from    tableX {sample clause}
where   ...
group by ...
order by ...

The intention was to select a (fairly small) representative sample of data from a very large table.

In 10g the syntax for the sample clause is:

sample [block] ( N [,M] )  [seed(S)]
  • You can sample rows or blocks (optional keyword [block])
  • You must specify the percentage sample size (N) – recognizing that the result will only approximate that size
  • You can specify that the sample should consist of groups of ([,M]) consecutive rows or blocks
  • You can supply a seed value ([seed (S)]) to the random number generator that drives the sample so that the sample is repeatable.
  • The ‘seed’ option appeared in 10g, and is documented there.
  • The ‘consecutive rows/blocks’ option doesn’t seem to be documented but appears in 9i.

So here are the problems in 10g:

The option “sample (N, M)” doesn’t seem to work properly – instead of N% of the data using groups of M consecutive rows in the table, it seems to sample N*M percent of the rows completely randomly across the table. 9i and 10g behave very differently. Fair enough, it’s not documented, so I shouldn’t expect the “,M” feature to work anyway.

The option “sample block N” doesn’t seem to work properly. I have a table with 1,640 blocks, so a sample of 5% ought to show me something like 82 blocks.  In 9i I keep seeing sample sizes between 69 and 97 blocks, which is not unreasonable.  In 10g I keep getting sample sizes of zero, 66, 132 and 198 blocks – with a couple of other numbers occasionally. This is not reasonable. Luckily there is a workaround: if I use the undocumented syntax “sample (N,1)” the 10g results are then consistent with the 9i results.

Here’s the stripped-down version of my test script. It generates 200,000 rows, and the critical item in the data is the id_col which is a zero-padded counter that lets us track the data selected fairly easily.

create table t1
nologging
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 5000
)
select
	lpad(rownum,10,'0')	id_col,
	rpad('x',42)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 200000
;

select
	substr(id_col,1,9),
	count(*)
from
	t1 sample (1,10) -- seed (0)
group by
	substr(id_col,1,9)
order by
	substr(id_col,1,9)
;

-- collect and check the table stats here

rem
rem	We get about 122 rows per block and 1,640 blocks at 8K blocks
rem	for the 200,000 row test (first sample size)
rem
rem	A sample of 1 percent should get us 2,000 rows, but if we sample
rem	in sets of 10 then drop off the last digit of the id_col, we
rem	should get about 200 pairs of values - where each pair sums to
rem	10 rows. Expect about 400 rows of output in 9i.
rem
rem	The first pattern I got contained lines like:
rem		000000108          4
rem		000000109          6		10 consecutive rows
rem		000000206          8
rem		000000207          2		10 consecutive rows
rem		000000336          7
rem		000000337          3		10 consecutive rows
rem
rem	But this pattern doesn't appear from 10g onwards. (You'll
rem	get about 2,000 rows of output).
rem

break on report
compute sum of count(*) on report

select
	substr(id_col,1,9),
	count(*)
from
	t1 sample (1,10) -- seed (0)
group by
	substr(id_col,1,9)
order by
	substr(id_col,1,9)
;

rem
rem	Similarly a sample of 10 percent is about 160 blocks.
rem	But 8 blocks in a row is about 976 rows. So if we cut off
rem	the last three digits of the id_col we should see another
rem	pattern. Expect about 42 rows of output.
rem
rem	My second run gave
rem		0000003        217
rem		0000004        759		976 consecutive rows
rem		0000007        557
rem		0000008        419		976 consecutive rows
rem		0000017        919
rem		0000018         57		976 consecutive rows
rem

select
	substr(id_col,1,7),
	count(*)
from
	t1 sample block (10,8) -- seed (0)
group by
	substr(id_col,1,7)
order by
	substr(id_col,1,7)
;

rem
rem	A simple sample of 5% of the blocks, reported by
rem	stripping out the block detail from the rowid
rem	should give me about 82 blocks - so 82 rows of
rem	output.
rem

select
	substr(rowid,1,15),
	count(*)
from
	t1 sample block (5)
group by
	substr(rowid,1,15)
order by
	substr(rowid,1,15)
;

3 Comments »

  1. Nice clear analysis (as always) thanks.
    I’ve wanted to use this type functionality test out query plans on subsets of the real tables to limit the running result set sizes of a execution plan. Of course I can use CTAS to create smaller versions of the tables but that’s resource intensive. It’s so tempting to think there might be a way to accomplish the same thing with some limiting clause like the sample clause. Of course the sample clause is problematic in its own ways but still wondering if there is any way to do it. I started to play with creating views that had the sample clause in them. As I recall the queries worked but I didn’t get far enough to determine whether the timing and resource results were worth anything. If I go back and try some more I will at least know of some of the other gotchas.

    Comment by Kyle Hailey — March 16, 2010 @ 10:18 pm BST Mar 16,2010 | Reply

    • Kyle,

      Thanks for the comment. From memory there was a time when the sample clause could only work with a single table query, and then Oracle did something to extend it. I never got around to testing it, though. I would be quite surprised if it worked well given the conflicting requirements of relational integrity and random selection.

      Comment by Jonathan Lewis — March 20, 2010 @ 1:08 pm BST Mar 20,2010 | Reply

  2. [...] 21-How does sample clause work? Jonathan Lewis-Sample Clause [...]

    Pingback by Blogroll Report 05/02/2010 – 12/03/2010 « Coskan’s Approach to Oracle — April 23, 2010 @ 1:08 am BST Apr 23,2010 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 3,877 other followers