Oracle Scratchpad

October 31, 2006

Virtual columns revisited

Filed under: Indexing,Statistics,Tuning — Jonathan Lewis @ 11:57 pm GMT Oct 31,2006

Here’s an idea I had some time ago, but have never implemented on a production system. I’d be interested to know if anyone has tried it.

Virtual columns are useful because they can give Oracle statistics about values derived from stored columns. So, for example, you might like to have a virtual column defined as salary + nvl(commission,0) so that the optimizer can estimate reasonably accurately how many employees have salary + nvl(commission,0) > 100000.

But (until 11g, perhaps) you can’t create virtual columns unless you index them using the feature that Oracle describes as ‘function-based indexes’, and perhaps you have far too many virtual columns to make index creation viable. Here’s a workaround (maybe).
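For a single expression the conventional route is a function-based index plus a stats collection – a minimal sketch, assuming an emp table with salary and commission columns:

```sql
-- One expression = one function-based index: the approach that
-- becomes impractical when you want many virtual columns
create index emp_comp_fbi on emp(salary + nvl(commission,0));

-- Gather stats so the hidden SYS_NC...$ column behind the
-- index expression acquires statistics
begin
	dbms_stats.gather_table_stats(
		ownname    => user,
		tabname    => 'EMP',
		method_opt => 'for all hidden columns size 1'
	);
end;
/
```

The workaround that follows avoids paying the build and maintenance cost of one usable index per expression.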

Create a single, huge, function-based index in an unusable state, containing all the virtual columns you would like to have – or at least the first 32 of them, as that’s the largest number of columns you can have in a single index. Then gather stats and see what happens. For example:


create table t1
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	/*+ ordered use_merge(v2) */
	rownum				id,
	to_char(
		sysdate + rownum/20,
		'YYYYMMDD'
	)				date_char,
	dbms_random.string('U',6)	vc1,
	rpad('x',100)			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 10000
;    

create index t1_i1 on t1(
	to_date(date_char,'YYYYMMDD'),
	substr(vc1,1,4),
	to_char(
		to_date(date_char,'YYYYMMDD'),
		'Day'
	)
)
unusable
;    

alter session set skip_unusable_indexes = true;    

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		estimate_percent => null,
		block_sample 	 => true,
		method_opt 	 => 'for all columns size 1',
		degree		 => null,
		granularity      => 'default',
		cascade		 => true
	);
end;
/          

I was running this test on a 9.2 database, so I’ve set skip_unusable_indexes to true (which is the default state in 10g). Note, in particular, that I’ve created the index as unusable, and that I’ve got two different ways of manipulating the date column that I’ve stored as a character on the table.

When you run the gather_table_stats procedure you will get an error because the index is unusable, but column-level statistics will still be created on the hidden columns that represent the virtual column definitions in the index:


select
	column_name, num_distinct, low_value, high_value
from
	user_tab_cols
where
	table_name = 'T1'
and	column_name != 'PADDING'
;        

COLUMN_NAME  NUM_DISTINCT LOW_VALUE          HIGH_VALUE
------------ ------------ ------------------ ------------------
ID                  10000 C102               C302
DATE_CHAR             500 3230303631313031   3230303830333134
VC1                 10000 414142524F51       5A5A594F4743
SYS_NC00005$          500 786A0B01010101     786C030E010101
SYS_NC00006$         9888 41414252           5A5A594F
SYS_NC00007$            7 467269646179202020 5765646E6573646179        
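(As an aside, the raw LOW_VALUE/HIGH_VALUE bytes can be decoded with dbms_stats.convert_raw_value – a minimal sketch for the character-based hidden column holding the day names:)

```sql
declare
	v_low	varchar2(64);
	v_high	varchar2(64);
begin
	for r in (
		select	low_value, high_value
		from	user_tab_cols
		where	table_name  = 'T1'
		and	column_name = 'SYS_NC00007$'
	) loop
		-- varchar2 overload of convert_raw_value turns the
		-- internal raw representation back into a string
		dbms_stats.convert_raw_value(r.low_value,  v_low);
		dbms_stats.convert_raw_value(r.high_value, v_high);
		dbms_output.put_line('Low: ' || v_low || '  High: ' || v_high);
	end loop;
end;
/
```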

Note particularly that column SYS_NC00007$ (the column representing the conversion of the ‘pseudo-date’ column into the day of the week) has 7 distinct values, a low value of ‘Friday’ (space-padded, interestingly, to nine characters) and a high value of ‘Wednesday’ – the day names sort alphabetically, not chronologically. Look what these statistics do for queries run through autotrace.

      

set autotrace traceonly explain      

select
	count(*)
from	t1
where	date_char between '20061115' and '20061215'
;      

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=20 Card=1 Bytes=9)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'T1' (Cost=20 Card=20 Bytes=180)      

select
	count(*)
from	t1
where	to_date(date_char,'YYYYMMDD')
		between to_date('15-Nov-2006','dd-mon-yyyy')
		    and to_date('15-Dec-2006','dd-mon-yyyy')
;      

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=20 Card=1 Bytes=9)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'T1' (Cost=20 Card=641 Bytes=5769)      

Suddenly we get correct cardinalities for queries. (Note that the ‘pseudo-date’ column was based on sysdate + rownum/20, so in a one-month range we expect about 620 rows – the optimizer’s estimate of 641 is out by only about one day’s worth, compared to being out by a factor of 31 when we use the original character-based column.)

Any problems?

None that I can think of. Of course it will be a little embarrassing (and resource-intensive) if someone, or something, tries to rebuild that index. And if you truncate the table the index will automatically be marked as usable and will be maintained as new data arrives unless you remember to set it back to unusable immediately after the truncate.
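Guarding against the truncate case is a one-liner – re-disable the index immediately afterwards:

```sql
truncate table t1;

-- the truncate silently marks t1_i1 usable again,
-- so knock it straight back out before any inserts arrive
alter index t1_i1 unusable;
```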

It’s also a bit of a problem that any attempt to gather table stats (or schema or database stats) with the cascade option enabled will cause the procedure to fail when it hits the unusable indexes. So you’re going to have to write some code to do a special job of collecting statistics on any tables with these funny indexes so that you collect their statistics without the cascade option, and then collect the statistics on the ‘real’ indexes.
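The special-purpose stats collection might look like this – a sketch, where T1_REAL_IDX is a hypothetical stand-in for whatever genuine (usable) indexes exist on the table:

```sql
begin
	-- table and column stats, but skip the indexes so the
	-- unusable 'virtual column' index can't break the call
	dbms_stats.gather_table_stats(
		ownname	=> user,
		tabname	=> 'T1',
		cascade	=> false
	);

	-- then gather each real index explicitly
	dbms_stats.gather_index_stats(
		ownname	=> user,
		indname	=> 'T1_REAL_IDX'	-- hypothetical usable index
	);
end;
/
```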

Finally, you’ll need to handle that skip_unusable_indexes parameter somehow – possibly through a logon trigger in versions of Oracle prior to 10g.
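A database-level logon trigger would do it – a sketch (the owner needs the administer database trigger privilege):

```sql
create or replace trigger set_skip_unusable
after logon on database
begin
	-- make every session ignore unusable indexes,
	-- matching the 10g default behaviour
	execute immediate
		'alter session set skip_unusable_indexes = true';
end;
/
```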

So maybe it’s worth a little testing – the threat seems small, the benefits could be huge. But I’ve only played with it on a small scale; possibly a near-real test (on a UAT, or full-scale Development system) would highlight some flaw I hadn’t thought of.  If you try it, let me know.

7 Comments »

  1. I’m playing with it…but our scenario is a bit different…no time to explain in depth, but our functional column involves an NVL on a NULLable column to the max date of 31-DEC-9999 (“Bad Defaults Affect Cardinality” is the slide you present, I believe) – our problem is that we can get the virtual column index but can’t get a histogram on it, so can’t see the skew we have on the column in order to assist the CBO to get the cardinality right.

    One thing I have just seen though, is that you obtain the column low/high values from USER_TAB_COLS. I tried doing the same select, not realising I’d mistakenly typed USER_TAB_COLUMNS and I couldn’t see the SYS virtual column…on realising my typo and switching back to USER_TAB_COLS it appears…which I thought was an interesting little gotcha.

    Comment by Jeff Moss — November 2, 2006 @ 3:02 pm GMT Nov 2,2006 | Reply

  2. Jeff, There is a method_opt for gather_table_stats which does the job. Try ‘for all hidden columns size NNN’. This is available from 9.2 if not earlier.

    Comment by Jonathan Lewis — November 2, 2006 @ 4:11 pm GMT Nov 2,2006 | Reply

  3. …I knew about method_opt for getting the histograms – I’ve worked out now that when I was looking for the histogram I was using a little script of mine to give me lots of optimiser stats for the table in question…unfortunately, as I alluded to in the last post, I found that my script happened to be using the DBA_TAB_COLUMNS view rather than the DBA_TAB_COLS view…which meant I couldn’t actually see the virtual column at all let alone whether there was a histogram on it…after fixing my script and retesting all this I find it’s exactly as you say, the histogram can be created and it does help with getting the correct cardinality for a query with a predicate on a column with skewed data in it.

    Now our problem is that the value we are testing for BETWEEN the from_date AND NVL(to_date,TO_DATE('31-DEC-9999','DD-MON-YYYY')) comes from a scalar subquery select…so the optimiser doesn’t know what value it is and can’t get the right cardinality – with a literal date value it works fine…more of an architectural approach issue now I guess.

    Comment by Jeff Moss — November 3, 2006 @ 7:53 am GMT Nov 3,2006 | Reply

  4. Can’t remember which version it came into, but you can also directly reference hidden column names, e.g. “for column SYS_… size 10” etc.

    Comment by Connor — November 4, 2006 @ 2:35 am GMT Nov 4,2006 | Reply

  5. Excellent!!!
    (Few months too late… but just what I need right now :-)

    Comment by Mirjana — May 4, 2007 @ 7:57 am GMT May 4,2007 | Reply

  6. Indeed, virtual columns are in 11g now!

    Comment by Shakespeare — August 30, 2007 @ 3:32 pm GMT Aug 30,2007 | Reply

  7. [...] **1: In fact, you can collect statistics on “virtual columns” pre-11g by creating function-based indexes in an unusable state. The method works, but it’s not really a good [...]

    Pingback by Virtual Columns « Oracle Scratchpad — April 16, 2009 @ 7:26 pm GMT Apr 16,2009 | Reply

