Virtual columns revisited
Here’s an idea I had some time ago, but have never implemented on a production system. I’d be interested to know if anyone has tried it.
Virtual columns are useful, because they can give Oracle statistics about values derived from stored columns. So, for example, you might like to have a virtual column defined as salary + nvl(commission,0) so that the optimizer can estimate resonably accurately how many employess have salary + nvl(commission,0) > 100000.
But (until 11g, perhaps) you can’t create virtual columns unless you index them using the feature that Oracle describes as ‘function-based indexes’, and perhaps you have far too many virtual columns to make index creation viable. Here’s a workaround (maybe).
Create a single, huge, function-based index in an unusable state, containing all the virtual columns you would like to have - or at least the first 32 of them, as that’s the largest number of columns you can have in a single index. Then gather stats and see what happens. For example:
create table t1
as
with generator as (
select --+ materialize
rownum id
from all_objects
where rownum <= 3000
)
select
/*+ ordered use_merge(v2) */
rownum id,
to_char(
sysdate + rownum/20,
'YYYYMMDD'
) date_char,
dbms_random.string('U',6) vc1,
rpad('x',100) padding
from
generator v1,
generator v2
where
rownum <= 10000
;
create index t1_i1 on t1(
to_date(date_char,'YYYYMMDD'),
substr(vc1,1,4),
to_char(
to_date(date_char,'YYYYMMDD'),
'Day'
)
)
unusable
;
alter session set skip_unusable_indexes = true;
begin
dbms_stats.gather_table_stats(
ownname => user,
tabname =>'T1',
estimate_percent => null,
block_sample => true,
method_opt => 'for all columns size 1',
degree => null,
granularity => 'default',
cascade => true
);
end;
/
I was running this test on a 9.2 database, so I’ve set skip_unusable_indexes to true (which is the default state in 10g). Note, in particular, that I’ve created the index as unusable, and that I’ve got two different ways of manipulating the date column that I’ve stored as a character on the table.
When you run the gather_table_stats procedure, you will get an error because the index is unusable, but you will get column-level statistics created on the hidden columns that represent the virtual column definitions in the index:
select
column_name, num_distinct, low_value, high_value
from
user_tab_cols
where
table_name = 'T1'
and column_name != 'PADDING'
;
COLUMN_NAME NUM_DISTINCT LOW_VALUE HIGH_VALUE
------------ ------------ ------------------ ------------------
ID 10000 C102 C302
DATE_CHAR 500 3230303631313031 3230303830333134
VC1 10000 414142524F51 5A5A594F4743
SYS_NC00005$ 500 786A0B01010101 786C030E010101
SYS_NC00006$ 9888 41414252 5A5A594F
SYS_NC00007$ 7 467269646179202020 5765646E6573646179
Note particularly that column SYS_NC00007$ (the column representing the converson of the ‘pseudo-date’ column into the day of a week) has 7 distinct values, a low value of ‘Monday’ and a high value of ‘Wednesday’). Look what these statistics do for queries run through autotrace.
set autotrace traceonly explain
select
count(*)
from t1
where date_char between '20061115' and '20061215'
;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=20 Card=1 Bytes=9)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'T1' (Cost=20 Card=20 Bytes=180)
select
count(*)
from t1
where to_date(date_char,'YYYYMMDD')
between to_date('15-Nov-2006','dd-mon-yyyy')
and to_date('15-Dec-2006','dd-mon-yyyy')
;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=20 Card=1 Bytes=9)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'T1' (Cost=20 Card=641 Bytes=5769)
Suddenly we get correct cardinalities for queries. (Note that the ‘pseudo-date’ column was based on sysdate + rownum / 20, so in a one-month range we expect about 620 rows - and the optimizer’s estimate is only out by one day’s worth at 641; compared to being out by a factor of 31 when we use the original character-based column.
Any problems ?
None that I can think of. Of course it will be a little embarassing (and resource intensive) if someone, or something, tries to rebuild that index. And if you truncate the table the index will automatically be marked as usable and be maintained as new data arrives unless you remember to set it back to unusable immediately after the truncate.
It’s also a bit of a problem that any attempt to gather table stats (or schema or database stats) with the cascade option enabled will cause the procedure to fail when it hits the unusable indexes. So you’re going to have to write some code to do a special job of collecting statistics on any tables with these funny indexes so that you collect their statistics without the cascade option, and then collect the statistics on the ‘real’ indexes.
Finally, you’ll need to handle that skip_unusable_indexes parameter somehow - possibly through a logon trigger in versions of Oracle prior to 10g.
So maybe it’s worth a little testing - the threat seems small, the benefits could be huge. But I’ve only played with it on a small scale; possibly a near-real test (on a UAT, or full-scale Development system) would highlight some flaw I hadn’t thought of. If you try it, let me know.
[...] I noticed a rebirth in the technical blog - Jonathan Lewis is interesting (as always) and writing on some interesting indexing (or not really indexing) ideas to help the CBO on its way. I might give one a try but my great fear is that a DBA tidying up actually builds one of those indexes. I know this sort of thing happens - I once had my account removed by one such efficient DBA once; one who did not know that his boss had a perverted sense of humour. The lead DBA always wanted to create an account for user SCOTT (but identified by something other than the expected!) but the junior DBA removed all accounts for user SCOTT for reasons of security - some of you might think denying me access to a database is a good idea. [...]
Pingback by Mid-week, ETL and i-nodes « Pete-s random notes — November 1, 2006 @ 10:19 pm UTC Nov 1,2006
I’m playing with it…but our scenario is a bit different…no time to explain in depth, but our functional column involves an NVL on a NULLable column to the max date of 31-DEC-9999(”Bad Defaults Affect Cardinality” is the slide you present I believe) - our problem is that we can get the virtual column index but can’t get a histogram on it so can’t see the skew we have on the column in order to assist the CBO to get the cardinalty right.
One thing I have just seen though, is that you obtain the column low/high values from USER_TAB_COLS. I tried doing the same select, not realising I’d mistakenly typed USER_TAB_COLUMNS and I couldn’t see the SYS virtual column…on realising my typo and switching back to USER_TAB_COLS it appears…which I thought was an interesting little gotcha.
Comment by Jeff Moss — November 2, 2006 @ 3:02 pm UTC Nov 2,2006
Jeff, There is a method_opt for gather_table_stats which does the job. Try ‘for all hidden columns size NNN’. This is available from 9.2 if not earlier.
Comment by Jonathan Lewis — November 2, 2006 @ 4:11 pm UTC Nov 2,2006
…I knew about method_opt for getting the histograms - I’ve worked out now that when I was looking for the histogram I was using a little script of mine to give me lots of optimiser stats for the table in question…unfortunately, as I alluded to in the last post, I found that my script happened to be using the DBA_TAB_COLUMNS view rather than the DBA_TAB_COLS view…which meant I couldn’t actually see the virtual column at all let alone whether there was a histogram on it…after fixing my script and retesting all this I find it’s exactly as you say, the histogram can be created and it does help with getting the correct cardinality for a query with a predicate on a column with skewed data in it.
Now our problem is that the value we are testing for BETWEEN the from_date AND NVL(to_date,TO_DATE(’31-DEC-9999′,’DD-MON-YYYY’)) comes from a scalar subquery select…so the optimiser doesn’t know what value it is and can’t get the right cardinality - with a literal date value it works fine…more of an architectural approach issue now I guess.
Comment by Jeff Moss — November 3, 2006 @ 7:53 am UTC Nov 3,2006
Can’t remember which version it came into, but you can also directly reference hidden column names, eg “for column SYS_… size 10″ etc
Comment by Connor — November 4, 2006 @ 2:35 am UTC Nov 4,2006
Excellent!!!
(Few months too late… but just what I need right now
Comment by Mirjana — May 4, 2007 @ 7:57 am UTC May 4,2007
Indeed, virtual columns are in 11g now!
Comment by Shakespeare — August 30, 2007 @ 3:32 pm UTC Aug 30,2007
[...] a rebirth in the technical blog - Jonathan Lewis is interesting (as always) and writing on some interesting indexing (or not really indexing) ideas to help the CBO on its way. I might give one a try but my great fear [...]
Pingback by Rittman Mead Consulting » Blog Archive » Mid-week, ETL and i-nodes — March 17, 2008 @ 2:56 pm UTC Mar 17,2008