Oracle Scratchpad

September 27, 2013

Virtual Stats

Filed under: CBO,Execution plans,Oracle,Statistics — Jonathan Lewis @ 6:49 am BST Sep 27,2013

Or – to be more accurate – real statistics on a virtual column.

This is one of the “10 top tips” that I came up with for my joint session with Maria Colgan at Oracle Open World 2013. How to give the optimizer more information that might help it find a better execution plans when you can’t change the code. I’ll start with a small data set including a virtual column (running 11.1.0.7), and a couple of problem queries:


rem
rem     Script:         virtual_stats.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2013
rem

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4  --> comment to avoid wordpress format issue
)
select
        rownum          n1,
        rownum          n2
from
        generator       v2
; 

alter table t1 add (
        m2      generated always as (mod(n2,3)) virtual
) ; 

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

end;
/


explain plan for
select * from t1 where mod(n1,3) = 0
;

explain plan for
select * from t1 where mod(n1,3) != 0
;

In many cases the wrong cardinality will make a big difference to the optimizer’s choice of execution plan, and anything we can do “legally” to help the optimizer estimate a better cardinality may give us a better plan. I’m only intending to make a point about cardinality estimates with this example, so I haven’t bothered to create a subtle data set, or any indexes – so when I look at the execution plans I’ll only be looking at the “rows” column of the output.

Given the mod(n1,3) predicates and the 10,000 rows in the table we can estimate that the two cardinalities ought to be 3,333 and 6,667 respectively – but what do the plans show us:

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 |  1000 |     5  (20)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 |  1000 |     5  (20)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(MOD("N1",3)=0)

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   500 |  5000 |     5  (20)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   500 |  5000 |     5  (20)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(MOD("N1",3)<>0)

The estimate are 100 and 500 respectively – examples of the optimizer’s basic guessing strategy of 1% for “function of column = constant” and 5% for “function of column != constant”. (There are a couple of variations on this theme, but 1% and 5% commonly appear.)

You’ll notice, of course, that n1 and n2 are defined in the same way, so what do the estimates look like when I change the predicates to references n2. There’s an enormous difference:

explain plan for
select * from t1 where mod(n2,3)= 0
;

select * from table(dbms_xplan.display);

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  3333 | 33330 |     5  (20)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |  3333 | 33330 |     5  (20)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("T1"."M2"=0)

explain plan for
select * from t1 where mod(n2,3) != 0
;

select * from table(dbms_xplan.display);
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  6667 | 66670 |     5  (20)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |  6667 | 66670 |     5  (20)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("T1"."M2"<>0)

The optimizer has got the estimates right! And the reason becomes clear when you look at the predicate section of the execution plan. The optimizer has transformed the queries by recognising that the supplied predicates match the definition of the virtual column, so it has rewritten the queries to use the virtual columns, which means getting rid of the references to the mod() function, and it has accurate statistics about the virtual column.

Think carefully about the significance of what we’ve done. Without changing the code we’ve made it possible for the optimizer to do a better job because it’s no longer guessing.

As I said on the day – this tip (like all tips) is just the tip of the iceberg – take it, think about it, and apply it sensibly and carefully.

Update 4th Oct 2013

It didn’t take long for an iceberg to appear: I was running a training session for a client in Prague yesterday and presented an example of using virtual columns to change execution plans without changing the code and one member of the audience (I forget whether he was a developer of a DBA) promptly came up with this very important question: “What if your code includes statements like insert into t1 values (1,1);”

Of course, your code should always list the column names in the insert statement, but some code doesn’t. In 11g the code will fail with Oracle error “ORA-00947: not enough values”. There is a workaround: rather than creating a virtual column create extended stats to represent the expression. The internal, virtual column representing the expression will then be invisible and the ORA-00947 will not occur.

There is a slight downside to this strategy, though – you’re only allowed max(20, number of column/10) sets of extended stats on a single table. That’s quite a lot, but for some people it may not be enough; the limit on virtual columns is dictated by the standard 1,000 column limit for a table.

In 12c there is a nicer alternative – simply declare the virtual column to be invisible. Conveniently Oracle won’t see it for the insert – but you can still use it explicitly by name, and if you use the defining expression in a where clause the optimizer will still recognise that the expression exists as a virtual column with real statistics.

Footnote: a follow-on thought I had about visibility was that some screen-painters automatically generate a single update statement intended to update every column that has been selected by the screen: so if you display a virtual column on-screen (rather than displaying the result of the expression) then the update statement will fail with Oracle error “ORA-54017: UPDATE operation disallowed on virtual columns.”

Update 16th Jan 2014


Another iceberg has just appeared – in a question on the Oracle developer forum.

What’s the side effect of a virtual column based on a function that takes a (relatively) long time to run; and by “relatively” long time I mean that it won’t take long enough for a user inserting or updating a couple of rows to notice the difference – say 10 micro-seconds if it’s pure CPU, say 1 millisecond if it does something like reading a database block (as the function from the example on the forum probably does) ?

When you gather stats for the table Oracle has to run that function for every row in the sample – which is the whole table if you’re using 11g with the auto_sample_size (hence approximate NDV) enabled. In the case of the poster on the forum they’d added a function-based index – which means a hidden virtual column on the table – and seen their stats collection time change from 3 minutes to 2 hours.

Oracle doesn’t store the value of a virtual column on the table – it calculates it at run time when you query the table for that column. It does store the value if the column is part of an index, of course, but only in the index itself, and the stats collection routines don’t (yet) have a mechanism for scanning an index to calculate the column stats for such a column.

A little footnote to this: if you add a virtual column to a table and then gather stats on the table you will see that the average row length increases by the average length of the virtual column. This suggests at first sight that Oracle is actually storing the values for the column; but a block dump will show that this is not the case – it’s just an example of the way in which the full code set for a new feature tends to have little bits missing round the edges: the code to calculate the average column blindly calls the definition of the virtual column to calculate the length of the result, and the code to calculate the average row length simply sums the average column lengths – neither piece of code (yet) allows for the fact that the column is virtual.  (If you use the analyze command I think it takes the row length from the block by comparing address offsets within the block – which means it gets a more consistent answer.)

Update August 2018

I’ve been drawn back to this note following a comment I made on the database forum about creating a virtual column based on extracting a value from an XML (LOB) column. If you want to avoid the workload of executing the function for every row in the table you could create a table preference that sets the method_opt to gather stats on all the columns except the virtual one, something like the following for my sample table:


begin
        dbms_stats.set_table_prefs(
                ownname => user,
                tabname => 'T1',
                pname   => 'METHOD_OPT',
                pvalue  => 'for columns size 1 n1, n2'
        );
end;
/

You might hope that a simple call to gather table stats would then gather stats only for the listed columns with any stats on any other columns remaining unchanged. Unfortunately I’ve just got around to testing this (Sept 2018) and discovered that the code Oracle uses to gather the stats will still count() the number of non-null values in every single column – even though it doesn’t actually record the counts afterwards.

5 Comments »

  1. When I read your article about histograms http://allthingsoracle.com/histograms-part-1-why/ the following text:

    “There is a limitation to the virtual column / function-based index approach (whichever method you use) – you have to change the application code to take advantage of it – the “proper” virtual columns in 11g make the code look neater than the FBI code, but there still has to be a change, e.g (for the FBI approach):”

    I was going to signal in the histogram article that, to take advantage of virtual column, it might not be necessary to change the code, but finally renounced to publish the comment. I am happy to see you proving that here in this new article on virtual stats.

    And, for this particular case of virtual column definition that contains a literal value, mod (n2,3), I was wondering what will happen if the cursor sharing value has been set to FORCE

    SQL> alter session set cursor_sharing=FORCE;
    
    Session altered.
    
    SQL> select * from t1 where mod(n2,3)= 0;
    3333 rows selected.
    
    SQL>  select * from table(dbms_xplan.display_cursor);
    
    --------------------------------------------------------------------------
    SQL_ID  bv18rk2yf2ck9, child number 0
    -------------------------------------
    select * from t1 where mod(n2,:"SYS_B_0")= :"SYS_B_1"
    
    Plan hash value: 3617692013
    
    --------------------------------------------------------------------------
    | Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------
    |   0 | SELECT STATEMENT  |      |       |       |     8 (100)|          |
    |*  1 |  TABLE ACCESS FULL| T1   |   100 |  1100 |     8   (0)| 00:00:01 |
    --------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       1 - filter(MOD("N2",:SYS_B_0)=:SYS_B_1)
     

    The CBO got it wrong in this case.

    Best regards

    Comment by Mohamed Houri — September 29, 2013 @ 3:55 pm BST Sep 29,2013 | Reply

  2. Mohamed,

    Thanks for pointing out that the side effects that cursor_sharing can have on this strategy. It’s another of those “what if” types of questions that has to be raised as early on as possible before committing to a new feature.

    Comment by Jonathan Lewis — October 4, 2013 @ 9:39 am BST Oct 4,2013 | Reply

  3. […] or boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table […]

    Pingback by Auto Sample Size | Oracle Scratchpad — March 2, 2014 @ 6:39 pm GMT Mar 2,2014 | Reply

  4. Josef,

    [ed: This is a reply to a comment that seems to have been deleted.]

    It’s very easy to read a little more into that sentence in the manual than is really there.
    I made a similar point, but perhaps with less risk of misinterpretation in the first of a three-parter on 12c histogram:

    12c histograms

    12c Histograms pt.2

    12c Histograms pt.3

    Comment by Jonathan Lewis — July 8, 2015 @ 7:09 pm BST Jul 8,2015 | Reply

  5. […] little while ago I added a post-script to a note I’d written five years ago about gathering stats on a virtual column and had updated with a reference to a problem on the […]

    Pingback by Column Stats | Oracle Scratchpad — September 12, 2018 @ 1:46 pm BST Sep 12,2018 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by WordPress.com.