Here’s a note I’ve been meaning to research and write up for more than 18 months – ever since Dion Cho pinged a note I’d written about the effects of partitioning because of a comment it made about the “2% small table threshold”.
It has long been an item of common knowledge that Oracle has a “small table threshold” that allows for special treatment of data segments that are smaller than two percent of the size of the buffer cache, viz:
-
If a table is longer than the threshold then a full tablescan of the table will only use db_file_multiblock_read_count buffers at the end of the LRU (least recently used) chain to read the table and (allowing a little inaccuracy for multi-user systems, pinning and so on) keeps recycling the same few buffers to read the table thus protecting the bulk of the buffer cache from being wiped out by a single large tablescan. Such a tablescan would be recorded under the statistic “table scans (long tables)”.
If a table is shorter than the threshold then it is read to the midpoint of the cache (just like any other block read) but – whether by accident or design – the touch count (x$bh.tch) is not set and the table will fall off the LRU end of the buffer cache fairly promptly as other objects are read into the buffer. Such a tablescan would be recorded under the statistic “table scans (short tables)”.
Then, in July 2009, Dion Cho decided to check this description before repeating it, and set about testing it on Oracle 10gR2 – producing some surprising results and adding another item to my to-do list. Since then I have wanted to check his conclusions, check whether the original description had ever been true and when (or if) it had changed.
As a simple starting point, of course, it was easy to check the description of the relevant (hidden) parameter to see when it changed:
8.1.7.4 _small_table_threshold threshold level of table size for forget-bit enabled during scan 9.2.0.4 _small_table_threshold threshold level of table size for direct reads 11.2.0.1 _small_table_threshold lower threshold level of table size for direct reads
This suggests that the behaviour might have changed some time in 9i (9.2.0.4 happened to be the earliest 9i listing of x$ksppi I had on file) – so I clearly had at least three major versions to check.
The behaviour of the cache isn’t an easy thing to test, though, because there are a number of special cases to consider – in particular the results could be affected by the positioning of the “mid-point” marker (x$kcbwds.cold_hd) that separates the “cold” buffers from the “hot” buffers. By default the hot portion of the default buffer is 50% of the total cache (set by hidden parameter _db_percent_hot_default) but on instance startup or after a “flush buffer cache” there are no used buffers so the behaviour can show some anomalies.
So here’s the basic strategy:
-
Start the instance
Create a number of relatively small tables with no indexes
Create a table large enough to come close to filling the cache, with an index to allow indexed access
Collect stats on the table – largest last.
Query the large table through an index range scan to fill the cache
Repeat a couple of times with at least 3 second pauses to allow for incrementing the touch count
Check x$bh for buffer usage
Run repeated tablescans for the smaller tables to see how many blocks end up in the cache at different sizes.
Here’s some sample code:
create table t_15400
pctfree 99
pctused 1
as
with generator as (
select --+ materialize
rownum id
from dual
connect by
rownum <= 10000
)
select
rownum id,
lpad(rownum,10,'0') small_vc,
rpad('x',100) padding
from
generator v1,
generator v2
where
rownum <= 15400
;
create index t_15400_id on t_15400(id);
begin
dbms_stats.gather_table_stats(
ownname => user,
tabname =>'T_15400',
estimate_percent => 100,
method_opt => 'for all columns size 1'
);
end;
/
select
object_name, object_id, data_object_id
from
user_objects
where
object_name in (
'T_300',
'T_770',
'T_1540',
'T_3750',
'T_7700',
'T_15400',
'T_15400_ID'
)
order by
object_id
;
select
/*+ index(t) */
max(small_vc)
from
t_15400 t
where
id > 0
;
The extract shows the creation of just the last and largest table I created and collected statistics for – and it was the only one with an index. I chose the number of blocks (I’ve rigged one row per block) because I had set up a db_cache_size of 128MB on my 10.2.0.3 Oracle instance and this had given me 15,460 buffers.
As you can see from the query against user_objects my test case included tables with 7,700 rows (50%), 3,750 rows (25%), 1,540 rows (10%), 770 rows (5%) and 300 rows (2%). (The number in brackets are the approximate sizes of the tables – all slightly undersized – relative to the number of buffers in the default cache).
Here’s the query that I then ran against x$bh (connected as sys from another session) to see what was in the cache (the range of values needs to be adjusted to cover the range of object_id reported from user_objects):
select obj, tch, count(*) from x$bh where obj between 77710 and 77720 group by obj, tch order by count(*) ;
Typical results from 10.2.0.3
After executing the first index range scan of t_15400 to fill the cache three times:
OBJ TCH COUNT(*)
---------- ---------- ----------
75855 0 1
75854 0 1
75853 0 1
75851 0 1
75850 0 1
75849 0 1
75852 0 1
75855 2 9 -- Index blocks, touch count incremented
75855 1 18 -- Index blocks, touch count incremented
75854 1 11521 -- Table blocks, touch count incremented
Then after three tablescans, at 4 second intervals, of the 7,700 block table:
OBJ TCH COUNT(*)
---------- ---------- ----------
75853 3 1 -- segment header of 7700 table, touch count incremented each time
75855 0 1
75854 0 1
75852 0 1
75849 0 1
75850 0 1
75851 0 1
75855 2 9
75855 1 10
75853 0 3991 -- lots of blocks from 7700 table, no touch count increment
75854 1 7538
Then repeating the tablescan of the 3,750 block table three times:
OBJ TCH COUNT(*)
---------- ---------- ----------
75853 3 1
75855 0 1
75854 0 1
75851 0 1
75852 3 1 -- segment header block, touch count incremented each time
75849 0 1
75850 0 1
75855 2 9
75855 1 10
75853 0 240
75852 0 3750 -- table completely cached - touch count not incremented
75854 1 7538
Then repeating the tablescan of the 1,540 block table three times:
OBJ TCH COUNT(*)
---------- ---------- ----------
75853 3 1
75855 0 1
75854 0 1
75851 3 1 -- segment header block, touch count incremented each time
75849 0 1
75850 0 1
75852 3 1
75855 2 9
75855 1 10
75853 0 149
75851 2 1540 -- Table fully cached, touch count incremented but only to 2
75852 0 2430
75854 1 7538
Then executing the tablescan of the 770 block table three times:
OBJ TCH COUNT(*)
---------- ---------- ----------
75853 3 1
75855 0 1
75850 3 1 -- segment header block, touch count incremented each time
75849 0 1
75851 3 1
75852 3 1
75854 0 1
75855 2 9
75855 1 10
75851 0 69
75853 0 149
75850 2 770 -- Table fully cached, touch count incremented but only to 2
75851 2 1471
75852 0 1642
75854 1 7538
Finally executing the tablescan of the 300 block table three times:
OBJ TCH COUNT(*)
---------- ---------- ----------
75853 3 1
75855 0 1
75854 0 1
75850 3 1
75852 3 1
75851 3 1
75855 2 9
75855 1 10
75851 0 69
75850 0 131
75853 0 149
75849 3 301 -- Table, and segment header, cached and touch count incremented 3 times
75850 2 639
75852 0 1342
75851 2 1471
75854 1 7538
This set of results on its own isn’t conclusive, of course, but the indications for 10.2.0.3 are:
-
a) “Large” tablescans don’t increment the touch count – so avoiding promotion to the hot portion of the buffer
b) There is a 25% boundary (ca. 3750 in this case) above which a tablescan will start to recycle the buffer it has used
c) There is a 10% boundary (ca. 1540 in this case) below which repeating a scan WILL increment the touch count
d) There is a 2% boundary (ca. 300 in this case) below which tablescans will always increment the touch count.
I can’t state with any certainty where the used and recycled buffers might be, but since blocks from the 3750 tablescan removed the blocks from the 7700 tablescan, it’s possible that “large” tablescans do somehow go “to the bottom quarter” of the LRU.
There also some benefit in checking the statistics “table scans (short)” and “table scans (long)” as the tests run. For the 2% (300 block) table I recorded 3 short tablescans; for the tables in the 2% to 10% range (770 and 1540) I recorded one long and two short (which is consistent with the touch count increment of 2 – the first scan was expected to be long, but the 2nd and 3rd were deemed to be short based on some internal algorithm about the tables being fully cached); finally for the tables above 10% we always got 3 long tablescans.
But as it says in the original note on small partitions – there are plenty of questions still to answer:
-
I’ve cited 2%, 10%, and 25% and only one of these is set by a parameter (_small_table_threshold is derived as 2% of the db_cache_size – in simple cases) are the other figures derived, settable, or hard-coded.
I’ve quoted the 2% as the fraction of the db_cache_size – but we have automatic SGA management in 10g, automatic memory management in 11g, and up to eight different cache sizing parameters in every version from 9i onwards. What figure is used as the basis for the 2%, and is that 2% of the blocks or 2% of the bytes, and if you have multiple block sizes does each cache perhaps allow 2% of its own size.
And then, in 11g we have to worry about automatic direct path serial tablescans – and it would be easy to think that the “_small_table_threshold” may have been describing that feature since (at least) 9.2.0.4 if its description hadn’t changed slightly for 11.2 !
So much to do, so little time — but at least you know that there’s something that needs careful investigation if you’re planning to do lots of tablescans.
Footnote: Having written some tests, it’s easy to change versions. Running on 8.1.7.4 and 9.2.0.8, with similar sized caches, I could see that the “traditional” description of the “small_table_threshold” was true – a short tablescan was anything less 2% of the buffer cache, long tablescans were (in effect) done using just a window of “db_file_multiblock_read_count” buffers, and in both cases the touch count was never set (except for the segment header block).


Nice analysis as always.
Currently I’m a bit perplexed by the caching results on my 10.2.0.4 (compatibility 10.2.0.3) on Linux:
drop table cacher;
create table cacher(c1 char(2000), c2 char(2000), c3 char(2000)) nologging;
insert /*+ append */ into cacher
select 'x', 'x', 'x'
from dual
connect by level <= 220000 ;
EXEC DBMS_STATS.gather_table_stats(NULL, 'CACHER');
commit;
alter system flush buffer_cache;
set autot on stat
select count(*) from cacher;
select count(*) from cacher;
select count(*) from cacher;
set autot off;
with parameters
_db_block_buffers 273102
_small_table_threshold 5462
80% of buffer cache Statistics with stats gathered
----------------------------------------------------------
220033 consistent gets
220005 physical reads
----------------------------------------------------------
220023 consistent gets
0 physical reads
----------------------------------------------------------
220023 consistent gets
0 physical reads
On my windows box 11.2.0.1 the caching seems to start out at 10% of the buffer cache, then if I change small_table_threshold, that value seems to be what is used for the caching limit:
with parameters
_db_block_buffers 51220 25% = 12805,10% = 5122
_small_table_threshold 1024
drop table cacher;
create table cacher(c1 char(2000), c2 char(2000), c3 char(2000)) nologging;
insert /*+ append */ into cacher
select 'x', 'x', 'x'
from dual
-- connect by level <= 5120 -- caches
connect by level <= 5122 -- doesn't cache ;
commit;
alter system flush buffer_cache;
select count(*) from cacher;
select count(*) from cacher;
set autot on stat
select count(*) from cacher;
set autot off;
I get these results
3rd count(*) Statistics 10% of buffer cache size
----------------------------------------------------------
5126 consistent gets
5122 physical reads
now if I increase small table threshold and bounce the database:
alter system set "_small_table_threshold"=10240 scope=spfile;
startup force
I can now cache more on full table scan:
Statistics with size just below small table threshold
----------------------------------------------------------
10091 consistent gets
10004 physical reads
----------------------------------------------------------
10011 consistent gets
0 physical reads
----------------------------------------------------------
10011 consistent gets
0 physical reads
Statistics for FTS with size just above small table threshold
----------------------------------------------------------
11101 consistent gets
11288 physical reads
----------------------------------------------------------
11004 consistent gets
11000 physical reads
----------------------------------------------------------
11004 consistent gets
11000 physical reads
Also if I set small table threshold greater than the size of the buffer cache I get no caching with a table bigger than the buffer cache:
alter system set "_small_table_threshold"=1100000 scope=spfile;
startup force;
Statistics
-----------------------------------------------------
102545 consistent gets
102522 physical reads
-----------------------------------------------------
102460 consistent gets
102425 physical reads
-----------------------------------------------------
102460 consistent gets
102425 physical reads
where as on LINUX I get surpringly helpful caching with a table 5x the buffer cache:
_db_block_buffers 273102
_small_table_threshold 5462
500% bigger buffer cache
----------------------------------------------------------
1350131 consistent gets
1350002 physical reads
----------------------------------------------------------
1350037 consistent gets
1177881 physical reads
----------------------------------------------------------
1350037 consistent gets
1177856 physical reads
Comment by Kyle Hailey — March 25, 2011 @ 8:07 pm UTC Mar 25,2011 |
Kyle,
One of the boundary conditions on the buffer cache behaviour appears when the entire cache is “free” – that’s why I started my experiment with an index range scan to load a very large fraction of the cache before I started doing any tablescans.
In your 10.2.0.4 case the whole cache was free, which means there are no useful data blocks to protect, so the code seems to have been allowed to use the whole cache on the tablescan.
Your first 11.2.0.1 experiment (table above/below 10%) suggests that Oracle Corp. has changed the boundary case so that it always uses the 10% that would apply to a “full” cache, even if the whole cache is free.
The second 11.2.0.1 experiment (_small_table_threshold set to 10% of cache_size) suggests that the algorithm I suggested relating to the 10% limit still holds at 10% – and NOT at 5 * _small_table_threshold.
With the very large _small_table_threshold, it would be useful to check how many blocks from the table were in the cache after the tablescan – it might be 10% of the cache. (You didn’t say what the cache size was in this test, was it about 110,000, or was it back to the original 220,000).
A possibly explanation for the 5x buffer cache anomaly might be that with a starting empty cache you fill the cache as the tablescan starts, and then keep recycling the bottom 25% of the cache from that point onwards.
Comment by Jonathan Lewis — March 26, 2011 @ 8:39 pm UTC Mar 26,2011 |
>> A possibly explanation for the 5x buffer cache anomaly might be that with a starting
>> empty cache you fill the cache as the tablescan starts, and then keep recycling
>> the bottom 25% of the cache from that point onwards.
That would make sense! I was wondering what kind of algorithm Oracle could use to efficiently cache a table that was larger than the buffer cache, but using this (accidental?) algorithm would be an explanation.
What I’m most interested in is the parameters that control the caching mechanisms.
My main concern is a customers 10.2.0.4 is not caching even with an empty buffer cache when on my in house system it is caching with tables up to 80% of the buffer cache size. (I’m on LINUX and customer is on Solaris)
In both cases _small_table_threshold is the default.
The use case I’m looking to solve is a customer going from depending on the UNIX file system cache to depending on the SGA alone as they are switching to Direct File I/O.
On their system (Solaris 10.2.0.4 empty cache) the caching only happens well below 80%, more like 10-25% (I didn’t have the liberty to run extensive tests and nail down the value). I’m trying to figure out the options for making caching happen with bigger tables. Customer won’t use the cache table option because they want the new configuration to run just like the old without any extra administration. They are willing to change init params and maybe _small_table_threshold is the only way, but on my LINUX 10.2.0.4 I have the default _small_table_threshold and I’m seeing caching up with tables up to 80% of buffer cache size (start with an empty buffer cache) and I’m wondering “is there some other mechanism controlling caching besides _small_table_threshold?”
Customer wants the same “feel” with direct I/O as without. Without Direct I/O they could read a table once via FTS and then upon second read, the response time would be as if it was cached even though Oracle reports a high level of physical reads because the reads are coming from the UNIX file system cache.
Now, with Direct IO and even if I have them increase the size of their buffer even beyond the size of this table, my concern is they won’t be caching the table. I can have them change _small_table_threshold but I’m wondering if there is another mechanism which seems to be the case on my 10.2.0.4 on LINUX.
The other alternative is something on the customers database is preventing the caching of larger tables even after an “alter system flush buffer_cache”
PS my test script came from:
http://dioncho.wordpress.com/2009/04/22/strong-doubt-on-small-table_small_table_threshold/
Charles Hooper pointed me to your entry here, from his blog at:
http://hoopercharles.wordpress.com/2010/06/17/_small_table_threshold-parameter-and-buffer-cache-what-is-wrong-with-this-quote/
Glenn Fawcett has a good blog on the overhead of using the UNIX file system cache verses Oracle buffer cache
http://blogs.sun.com/glennf/entry/where_do_you_cache_oracle
Comment by Kyle Hailey — March 28, 2011 @ 6:28 pm UTC Mar 28,2011 |
[...] http://jonathanlewis.wordpress.com/2011/03/24/small-tables/ http://dioncho.wordpress.com/2009/04/22/strong-doubt-on-small-table_small_table_threshold/ http://www.dbacomp.com.br/blog/?p=73 [...]
Pingback by אתגר החודש – אתגר מאי2011 - Israel Database Portal — June 1, 2011 @ 12:24 pm UTC Jun 1,2011 |
Hi Jonathan,
Very interesting reads. I follow up the thread that I opened in OTN under title “serial table scan with direct path read compared to db file scattered read”, http://forums.oracle.com/forums/thread.jspa?messageID=10081807
I have a table 1.7 Million rows with 1,943,824 blocks (no index) and a default buffer cache of 8320MB. That is 8320*1024*1024/8192 = 1,064,960 8K available blocks. At 25% max this table can have up to 1,064,960 * 0.25 = 266,240 blocks in memory or around 14% of the table. So roughly in 7 cycles the whole table can be scanned in.
When the buffer is empty after reboot, it prefers to use “direct path read” (DPR) as opposed to full serial table scan (FTS). So there must be some costing estimates that says DPR is cheaper than FTS. I suspect this may be the size of the underlying table larger than 2% of small_table_threshold as shown below).
The full stats are as follows:
Parameter buffer cache size/MB ------------------------------ -------------------- buffer_cache 8,320 Small table threshold at 2% of buffer cache size/MB Small table block limit --------------------------------------------------- ----------------------- 166 21,299 My table details TABLE_NAME rows block size/KB blocks avg free space/KB Table size/MB -------------------------------- ------------ ------------- ------------ ----------------- ------------- TDASH 1,729,204 8 1,943,824 805 13,714 Table block size/threhold limit ------------------------------- 91When I run a test load on this table as
ALTER SESSION SET TRACEFILE_IDENTIFIER = 'test_with_tdash_ssdtester_noindex'; DECLARE type array is table of tdash%ROWTYPE index by binary_integer; l_data array; l_rec tdash%rowtype; BEGIN SELECT a.* ,RPAD('*',4000,'*') AS PADDING1 ,RPAD('*',4000,'*') AS PADDING2 BULK COLLECT INTO l_data FROM ALL_OBJECTS a; DBMS_MONITOR.SESSION_TRACE_ENABLE ( waits=>true ); FOR rs IN 1 .. 100 LOOP BEGIN SELECT * INTO l_rec FROM tdash WHERE object_id = l_data(rs).object_id; EXCEPTION WHEN NO_DATA_FOUND THEN NULL; END; END LOOP; END;It takes 6,520 seconds to finish.
When I force it not to use DPR, it chooses db file scattered read and it finishes in 4,299 seconds. Details are posted in the link
So I am a bit confused.
cheers,
Mich
Comment by Mich Talebzadeh — January 13, 2012 @ 4:25 pm UTC Jan 13,2012 |
Mich,
I’ve replied to this on the OTN thread, but I didn’t say anything about Oracle choosing to do the direct path read just after startup. The public information about this is still under investigation but I recall seeing some notes somewhere about Oracle’s decision being based in part on the number (or percentage) of blocks from the object that are already in the cache. When you start up there’s nothing from the object in the cache, so it maximises the chance of Oracle picking the direct path option.
I made some comments, and showed some arithmetic, about the effect of an empty cache on the speed of repeating the tablescan using db file scattered, so I won’t repeat that bit here.
Comment by Jonathan Lewis — January 26, 2012 @ 9:37 pm UTC Jan 26,2012 |