Oracle Scratchpad

September 4, 2015

Histogram Tip

Filed under: CBO,Histograms,Oracle,Statistics — Jonathan Lewis @ 8:32 am GMT Sep 4,2015

I’ve just responded to the call for items for the “IOUG Quick Tips” booklet for 2015 – so it’s probably about time to post the quick tip that I put into the 2014 issue. It’s probably nothing new to most readers of the blog, but sometimes an old thing presented in a new way offers fresh insights or better comprehension.

Histogram Tips

A histogram, created in the right way, at the right time, and supported by the correct client-side code, can be a huge benefit to the optimizer; but if you don’t create and use them wisely they can easily become a source of inconsistent performance, and the automatic statistics gathering can introduce an undesirable overhead during the overnight batch. This note explains how you can create histograms very cheaply on the few columns where they are most likely to have a beneficial effect.

set_column_stats

The dbms_stats package has many procedures and functions built into it that allow us to see (get) and manipulate (set) the stored statistics; in particular it holds the two procedures get_column_stats() and set_column_stats(), and we can use these procedures to create a histogram very cheaply whenever we want. Here’s an example that could be modified to suit a character column in a table where you’ve previously collected some stats. It uses a copy of all_objects, limited to exactly 10,000 rows.


create table t1 as
select
	*
from
	all_objects
where
	rownum <= 10000
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt 	 => 'for all columns size 1'
	);
end;
/


declare

	m_distcnt		number;		-- num_distinct
	m_density		number;		-- density
	m_nullcnt		number;		-- num_nulls
	m_avgclen		number;		-- avg_col_len

	srec			dbms_stats.statrec;
	c_array		dbms_stats.chararray;

begin

	dbms_stats.get_column_stats(
		ownname		=> user,
		tabname		=> 't1',
		colname		=> 'object_type', 
		distcnt		=> m_distcnt,
		density		=> m_density,
		nullcnt		=> m_nullcnt,
		srec			=> srec,
		avgclen		=> m_avgclen
	); 

	c_array		:= dbms_stats.chararray('A', 'B', 'C', 'X', 'Y');
	srec.bkvals	:= dbms_stats.numarray (  2,   2,   2, 500, 494);
--	srec.rpcnts	:= dbms_stats.numarray (  0,   0,   0,   0,   0);
	srec.epc := 5;

	dbms_stats.prepare_column_values(srec, c_array);

	m_distcnt	:= 5;
	m_density	:= 1/(5000);

	dbms_stats.set_column_stats(
		ownname		=> user,
		tabname		=> 't1',
		colname		=> 'object_type', 
		distcnt		=> m_distcnt,
		density		=> m_density,
		nullcnt		=> m_nullcnt,
		srec			=> srec,
		avgclen		=> m_avgclen
	); 

end;
/

Key features of the code: as you can see, the two calls have identical parameters which identify the table and column name (there is an optional parameter for a (sub) partition name), and most of the basic statistics about the column. The histogram (or low and high values) are accessed through a special record type, and we can populate that record type by supplying an ordered list of values, a matching list of frequencies, and a count of how many values we have supplied.

Since my code is fixing stats on a varchar2() column I’ve declared an array of type dbms_stats.chararray to hold the list of values I want to see in a frequency histogram – there are other array types for dates, raw, number, etc. I’ve then used the structure of the stats record I had declared to hold the list of frequencies (srec.bkvals – possibly a short name for “bucket values”) and the count (srec.epc – “end-point count”).

The call to dbms_stats.prepare_column_values() takes my two arrays and massages them into the correct format for storage as a histogram that I can then write into the data dictionary with the closing call to dbms_stats.set_column_stats(). Before making that call, though, I’ve also set the “num_distinct” variable to tell the optimizer that there are 5 distinct values for the column (it makes sense, but isn’t absolutely necessary, for the num_distinct to match the number of values in the array), and set the “density” to a value that I would like the optimizer to use in its calculations if someone asks for a value that is not in my list.

I’ve included (but commented out) a line that’s relevant to the new histogram mechanisms in 12c –the srec.rpcnts (“repeat counts”) array is used in “hybrid histograms”. It’s not relevant to my example where I’m trying to create a pure frequency histogram, but if I don’t set the array I get an Oracle error: “ORA-06532: Subscript outside of limit”.
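
If you want to check what has actually been stored – this query is my addition, not part of the original tip – the frequency histogram written by the set_column_stats() call should be visible in the dictionary with cumulative endpoint_number values of 2, 4, 6, 506 and 1,000:

-- my own sanity check, not part of the original example
select
	endpoint_number, endpoint_value
from	user_tab_histograms
where	table_name  = 'T1'
and	column_name = 'OBJECT_TYPE'
order by
	endpoint_number
;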

Results

There’s one important point to the method that isn’t instantly visible in the code – I created my table with 10,000 rows and there will be no nulls for object_type; but if you sum the array of frequencies it comes to exactly 1,000. This apparent contradiction is not a problem – the optimizer will compare the histogram figures to the total number of non-null entries it has recorded (in other words user_tables.num_rows – user_tab_columns.num_nulls), and scale up the histogram accordingly. This means that a query for ‘A’ should return an estimated row count of 20 (rather than 2), ‘X’ should return 5,000 (rather than 500) and ‘D’ should return 2 (10,000 rows * 1/5000, the selectivity I had set for non-existent values).

With a little editing to save space, here’s a cut-n-paste from an SQL*Plus session running against 12c:


SQL> set autotrace traceonly explain
SQL> select * from t1 where object_type = 'A';

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    20 |  2100 |    22   (5)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |    20 |  2100 |    22   (5)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("OBJECT_TYPE"='A')

SQL> select * from t1 where object_type = 'X';

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  5000 |   512K|    22   (5)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |  5000 |   512K|    22   (5)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("OBJECT_TYPE"='X')

SQL> select * from t1 where object_type = 'D';

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |   105 |    22   (5)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     1 |   105 |    22   (5)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("OBJECT_TYPE"='D')

Conclusion

It is very easy to create truly representative histograms (if you know your data) and the resources required to do so are minimal. If you see instability due to bad luck, or bad timing, when gathering a histogram, then you could benefit enormously from writing code to construct the histograms yourself.

Footnote

In 12c the introduction of the “approximate NDV” strategy for collecting frequency histograms, and the introduction of the “Top-frequency” histogram, have made automatic gathering of histograms on columns with a relatively small number of distinct values much faster and safer – but timing may still be an issue, and the resources needed to gather a safe hybrid histogram may still justify a hand-coded approach.

 

August 17, 2015

Index Usage

Filed under: extended stats,Indexing,Oracle,Tuning — Jonathan Lewis @ 4:25 pm GMT Aug 17,2015

The question of how to identify indexes that could be dropped re-appeared (yet again) on the OTN database forum last week. It’s not really surprising that it recurs so regularly – the problem isn’t an easy one to solve but new (and even less new) users keep hoping that there’s a quick and easy solution.

There are, however, strategies and pointers that can help you to optimise the trade-off between effort, risk, and reward. Broadly the idea is to spend a small amount of effort finding a relatively small number of “expensive” indexes that might be safe to drop, so that when you do the detailed analysis you have a good chance that the time spent will be rewarded by a positive result.

Before we get to some results posted on OTN, it’s worth thinking about the global impact and what we’re trying to achieve, and the threats that go with our attempt to achieve it.

The key detail, of course, is that index maintenance is an expensive process. We could insert 1,000 rows into a table at a cost of writing about 25 table blocks plus a few undo blocks plus something like half a megabyte of redo (assuming, for the purposes of illustration, that each row is about 200 bytes on insert). Add one index to the table and we might have to locate and modify 1,000 separate index leaf blocks. The increment on the redo might be about a quarter of a megabyte and we may have to access 1,000 different undo blocks for read consistency reasons, but the simple fact that we may need 1,000 buffers to be able to maintain that index is likely to be a significant extra cost on the insert. Make that 10 indexes, or 70 (as one unhappy DBA once told me) and the probability of being able to do high-speed inserts becomes rather low.

Of course we hope that our indexes will allow our queries to operate efficiently with great precision, but inevitably we get to a point where the benefit of precision is outweighed by the cost of maintenance. Our target, then, is to design the set of indexes that makes it possible for the optimizer to find good paths for all the important queries and “good enough” paths for the rest. By the time the system is live, though, it’s too late for “proper design”, and the only option is for damage limitation, a bit of guesswork, and some live testing with fingers crossed (thank goodness for invisible indexes).

The starting point is usually an attempt to identify “the indexes we are not using”, which is typically translated into “the indexes that do not appear in execution plans” – but that’s not actually a good target, for various reasons:

  • Problem 1: If we are using an index it’s possible that we shouldn’t be and that there’s an alternative index available that ought to be more efficient. A corollary to this is that if you do identify and drop such an index you may find that the optimizer doesn’t use the alternative index you were expecting it to use until you take some action to help the optimizer recognise that the alternative is a good choice.
  • Problem 2: if we aren’t using a particular index then perhaps we should be using it and would use it if we dropped one of the other indexes on the table. (And there’s always the possibility that we didn’t happen to use it during the interval we were checking but do use it at some other times)
  • Problem 3: the optimizer is capable of using information about the number of distinct keys in a multi-column index to select an execution plan even though it may not use that index in the plan it finally chooses. We may be able to work around this problem in current versions of Oracle by creating a column group (extended statistics) that matches the definition of each index we drop – but there’s a limit of 20 column groups per table (and we may have to find the “opposite end” of each join where we use the index stats and create a matching column group there).
  • Problem 4: There are some indexes we might not be using but which must exist to avoid the “foreign key locking” problem. It should be easy enough to check, before dropping an index, whether it has to exist to match a foreign key; and even then it may be possible to show that nothing in the application would cause the locking problem to appear – and as a safety measure you could disable locks on the (child) table to ensure that the application doesn’t grind to a halt because of foreign key locking problems.
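
As a starting point for the foreign key check described in problem 4, something like the following sketch (my own addition, with a placeholder child table name) lists the foreign key constraint columns and the existing index columns side by side, so you can see whether a candidate index is the only one covering a constraint:

select	'FK: ' || cc.constraint_name	source_name, cc.position	col_posn, cc.column_name
from	user_constraints	c,
	user_cons_columns	cc
where	c.table_name		= 'CHILD_TABLE'		-- placeholder table name
and	c.constraint_type	= 'R'
and	cc.constraint_name	= c.constraint_name
union all
select	'IX: ' || ic.index_name, ic.column_position, ic.column_name
from	user_ind_columns	ic
where	ic.table_name		= 'CHILD_TABLE'		-- placeholder table name
order by
	1, 2
;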

Provided you remember that problems like these exist, and think carefully about the indexes that your strategy suggests, there are various ways you could approach the problem of identifying indexes that don’t get into execution plans.

v$object_usage

The ink had barely dried on the manual pages for this view before several people (including me) had written notes explaining why this view wasn’t particularly helpful. (I think I even said something about this in Practical Oracle 8i). I won’t repeat the discussion here but it revolves around the fact that an index is flagged as “used” even if it has only been used once in a single execution of a single statement – so you don’t get any idea of the real importance of the index.

v$sql_plan et. al.

If you review the set of in-memory execution plans (and the AWR or Statspack equivalents) you can identify indexes which definitely have been used – but (a) it’s expensive to scan v$sql_plan frequently and (b) the AWR/Statspack repositories only capture a subset of the more expensive plans, so it’s easy to miss indexes which have been used and are relatively important but aren’t in the repository and don’t happen to be in memory at the moments you look.
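
As a rough sketch of the kind of check you might run (remembering the cost warning above – and the schema name here is a placeholder of my own, not something from the original post):

select
	p.object_owner, p.object_name,
	count(distinct p.sql_id)	cursors
from
	v$sql_plan	p
where
	p.object_owner	= 'MY_SCHEMA'		-- placeholder schema name
and	p.operation	like '%INDEX%'		-- catches INDEX and BITMAP INDEX operations
group by
	p.object_owner, p.object_name
order by
	cursors desc
;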

Review the definitions

If you examine the index definitions you may spot indexes which look very similar. If one index starts with the same columns, in the same order, as another index, there is a good chance that you could reduce two indexes to one – especially if the whole of one of the indexes is the “leading edge” of the other – for example:

  • (dp_datetime_date)
  • (dp_datetime_date, dp_compid)

Even if the leading edges match and the trailing edges differ we might be able to collapse two indexes into one – depending on how selective the leading columns are and how the indexes are used – for example:

  • (dp_compid, ddzt, cirmhcx, ct_nxr_mhcx, dp_datetime_date)
  • (dp_compid, ddzt, cirmhcx, ct_nxr_mhcx, pnr_cfrqsj_date)

which could perhaps be replaced by one of :

  • (dp_compid, ddzt, cirmhcx, ct_nxr_mhcx, dp_datetime_date, pnr_cfrqsj_date)

or

  • (dp_compid, ddzt, cirmhcx, ct_nxr_mhcx, pnr_cfrqsj_date, dp_datetime_date)

Guessing about the use of a typical date column, though, it’s possible that in this example the current trailing date columns are used with a range-based predicate, in which case this strategy probably won’t be effective for this pair of indexes.

Even if the order of later columns in the index doesn’t match you may still find that a pair of indexes could be reduced to a single index – for example the pair:

  • (dp_datetime_date, dp_compid)
  • (dp_datetime_date, ddzdt, dp_compid, ct_nxrdh, ct_smsmobilno)

which could perhaps be replaced by just:

  • (dp_datetime_date, dp_compid, ddzdt, ct_nxrdh, ct_smsmobilno)

As a safety measure, of course, you would probably create a new index, then make the subject indexes invisible, and wait for at least a week to see whether any performance problems appear (remembering that one automatic performance threat would be the increase in workload as yet another index – temporarily – has to be maintained).

The difficulty of eliminating indexes by examination is that it takes a lot of effort to investigate all the possibilities, so you really need some way of choosing a relatively small subset of indexes that might be worth the effort. This brings me to the principal topic of this posting – using segment statistics to help you pick which indexes might be worth the effort.

v$segstat / v$segment_statistics

Oracle records a number of workload statistics for each object in memory. The view v$segstat is an efficient version of these statistics, and v$segment_statistics is a friendlier version that joins v$segstat to tables user$, obj$ and ts$, with a filter against ind$ to turn meaningless numbers into names.

SQL> desc V$segstat
 Name                    Null?    Type
 ----------------------- -------- ----------------
 TS#                              NUMBER
 OBJ#                             NUMBER
 DATAOBJ#                         NUMBER
 STATISTIC_NAME                   VARCHAR2(64)
 STATISTIC#                       NUMBER
 VALUE                            NUMBER

SQL> desc V$segment_statistics
 Name                    Null?    Type
 ----------------------- -------- ----------------
 OWNER                            VARCHAR2(30)
 OBJECT_NAME                      VARCHAR2(30)
 SUBOBJECT_NAME                   VARCHAR2(30)
 TABLESPACE_NAME                  VARCHAR2(30)
 TS#                              NUMBER
 OBJ#                             NUMBER
 DATAOBJ#                         NUMBER
 OBJECT_TYPE                      VARCHAR2(18)
 STATISTIC_NAME                   VARCHAR2(64)
 STATISTIC#                       NUMBER
 VALUE                            NUMBER

For each segment Oracle records the following statistics (according to v$segstat_name – but there are a couple more hidden statistics reported in the underlying x$ksolsstat object):

NAME                             SAMPLED
-------------------------------- -------
logical reads                    YES
buffer busy waits                NO
gc buffer busy                   NO
db block changes                 YES
physical reads                   NO
physical writes                  NO
physical read requests           NO
physical write requests          NO
physical reads direct            NO
physical writes direct           NO
optimized physical reads         NO
optimized physical writes        NO
gc cr blocks received            NO
gc current blocks received       NO
ITL waits                        NO
row lock waits                   NO
space used                       NO
space allocated                  NO
segment scans                    NO

Both Statspack (at level 7) and the AWR report have several “Top N” sections for segment statistics. If we examine these stats for all the indexes on a given table we can get some clues about which indexes are likely to be worth further investigation to see if they could be dropped.

One very simple measure is the number of “physical reads” (which, for indexes, will generally be very similar to “physical read requests”). Since a (real) physical read is generally going to take a significant amount of time, segments with very large numbers of physical reads could be contributing a lot of time to the total database time – so it’s worth knowing why it’s responsible for so many physical reads and worth cross-checking with v$sql_plan (and its historic equivalents) which statements seem to be using or modifying this index.
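
A sketch of the sort of query you might use as a first filter (my own illustration – the owner and table name are placeholders):

select
	ss.object_name, ss.statistic_name, ss.value
from
	v$segment_statistics	ss
where
	ss.owner	= 'MY_SCHEMA'		-- placeholder
and	ss.object_type	= 'INDEX'
and	ss.statistic_name in ('physical reads', 'logical reads', 'db block changes')
and	ss.object_name in (
		select	index_name
		from	dba_indexes
		where	owner	   = 'MY_SCHEMA'	-- placeholder
		and	table_name = 'MY_TABLE'		-- placeholder
	)
order by
	ss.object_name, ss.statistic_name
;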

Even if it turns out that the index is absolutely necessary, you might still be able to spot opportunities to improve efficiency. If it is subject to a significant number of physical reads it may be that the index is just very large – could you make it smaller by rebuilding it with compression on some of the leading columns? Is it an index which (for some reason you can identify) tends to degenerate over time and waste a lot of space, and should you therefore rebuild it occasionally? It might be possible (depending on the predicates used) to re-arrange the column order in such a way that the activity is focused onto a particular section of the index rather than being spread across the entire index – or you could even find that by careful choice of global partitioning (which is legal on even a non-partitioned table) you might be able to isolate the activity to a small section of the index.

A more interesting measure, though, comes from comparing the “logical reads” with the number of “db block changes”; and that’s the point of this posting – except that I’ve spent so much time on it already that I’m going to have to write part 2 some time next week.

 

July 17, 2015

Descending Indexes

Filed under: CBO,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 8:42 am GMT Jul 17,2015

I’ve written about optimizer defects with descending indexes before now but a problem came up on the OTN database forum a few days ago that made me decide to look very closely at an example where the arithmetic was clearly defective. The problem revolves around a table with two indexes, one on a date column (TH_UPDATE_TIMESTAMP) and the other a compound index which starts with the same column in descending order (TH_UPDATE_TIMESTAMP DESC, TH_TXN_CODE). The optimizer was picking the “descending” index in cases where it was clearly the wrong index (even after the statistics had been refreshed and all the usual errors relating to date-based indexes had been discounted). Here’s an execution plan from the system which shows that there’s something wrong with the optimizer:


SELECT COUNT(*) FROM TXN_HEADER WHERE TH_UPDATE_TIMESTAMP BETWEEN SYSDATE-30 AND SYSDATE;

---------------------------------------------------------------------------------------
| Id  | Operation          | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                  |       |       |     4 (100)|          |
|   1 |  SORT AGGREGATE    |                  |     1 |     8 |            |          |
|*  2 |   FILTER           |                  |       |       |            |          |
|*  3 |    INDEX RANGE SCAN| TXN_HEADER_IDX17 |  1083K|  8462K|     4   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

There are two clues here: first, Oracle has used the (larger) two-column index when the single column is (almost guaranteed to be) the better choice simply because it will be smaller; secondly, we have a cost of 4 to acquire 1M rowids through an (implicitly b-tree) index range scan, which would require at least 250,000 index entries per leaf block to keep the cost that low (or 2,500 if you set the optimizer_index_cost_adj to 1; so it might just be possible if you had a 32KB block size).

The OP worked hard to supply information we asked for, and to follow up any suggestions we made; in the course of this we got the information that the table had about 90M rows and this “timestamp” column had about 45M distinct values ranging from 6th Sept 2012 to 2nd July 2015 with no nulls.

Based on this information I modelled the problem in an instance of 11.2.0.4 (the OP was using 11.2.0.3).


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                                          id,
        to_date('06-SEP-2012 15:13:00','dd-mon-yyyy hh24:mi:ss') +
                trunc((rownum - 1 )/4) / (24 * 60)                      upd_timestamp,
        mod(rownum - 1,10)                                              txn_code,
        rpad('x',50)                                                    padding
from
        generator       v1,
        generator       v2
where
        rownum <= 4 * 1030  *  24 * 60
;

create index t1_asc on t1(upd_timestamp) nologging;
create index t1_desc on t1(upd_timestamp desc) nologging;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1',
                cascade          => false
        );
end;
/

My data set has 4 rows per minute from 6th Sept 2012 to 3rd July 2015, with both an ascending and descending index on the upd_timestamp column. For reference, here are the statistics about the two indexes:


INDEX_NAME               SAMPLE     BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- ---------- ---------- ----------- -----------------
T1_DESC                 5932800          2       17379            154559
T1_ASC                  5932800          2       15737             59825

The ascending index is smaller with a significantly better clustering_factor, so for queries where either index would be a viable option the ascending index is the one (we think) that the optimizer should choose. In passing, the 5.9M index entries is exactly the number of rows in the table – these stats were computed automatically as the indexes were created.

Here’s a simple query with execution plan (with rowsource execution stats):

select max(id) from t1 where upd_timestamp between
to_date('01-jun-2015','dd-mon-yyyy')                and
to_date('30-jun-2015','dd-mon-yyyy')

---------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |      1 |        |    29 (100)|      1 |00:00:01.29 |    4710 |
|   1 |  SORT AGGREGATE              |         |      1 |      1 |            |      1 |00:00:01.29 |    4710 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1      |      1 |    167K|    29   (0)|    167K|00:00:00.98 |    4710 |
|*  3 |    INDEX RANGE SCAN          | T1_DESC |      1 |    885 |     5   (0)|    167K|00:00:00.35 |     492 |
---------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."SYS_NC00005$">=HEXTORAW('878CF9E1FEF8FEFAFF')  AND
              "T1"."SYS_NC00005$"<=HEXTORAW('878CF9FEF8FEF8FF') )
       filter((SYS_OP_UNDESCEND("T1"."SYS_NC00005$")>=TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND SYS_OP_UNDESCEND("T1"."SYS_NC00005$")<=TO_DATE(' 2015-06-30 00:00:00',
              'syyyy-mm-dd hh24:mi:ss')))

The optimizer’s index estimate of 885 rowids is a long way off the actual rowcount of 167,000 – even though I have perfect stats describing uniform data and the query is simple enough that the optimizer should be able to get a good estimate. Mind you, the predicate section looks a lot messier than you might expect, and the table estimate is correct (though not consistent with the index estimate, of course).

Here’s the plan I get when I make the descending index invisible:


--------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |      1 |        |  2146 (100)|      1 |00:00:01.25 |    2134 |
|   1 |  SORT AGGREGATE              |        |      1 |      1 |            |      1 |00:00:01.25 |    2134 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1     |      1 |    167K|  2146   (1)|    167K|00:00:00.94 |    2134 |
|*  3 |    INDEX RANGE SCAN          | T1_ASC |      1 |    167K|   453   (2)|    167K|00:00:00.31 |     446 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("UPD_TIMESTAMP">=TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "UPD_TIMESTAMP"<=TO_DATE(' 2015-06-30 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

With the descending index invisible we can see that the cardinality estimate for the ascending index is correct, and notice how much higher this makes the cost. It’s not surprising that the optimizer picked the wrong index with such a difference in cost. The question now is: why did the optimizer get the index cardinality (i.e. selectivity, hence cost) so badly wrong?

The answer is that the optimizer has made the same mistake that applications make by storing dates as character strings. It’s using the normal range-based calculation for the sys_op_descend() values recorded against the virtual column and has lost all understanding of the fact that these values represent dates. I can demonstrate this most easily by creating one more table, inserting a couple of rows, gathering stats, and showing you what the internal storage of a couple of “descending” dates looks like.


drop table t2;

create table t2 (d1 date);
create index t2_i1 on t2(d1 desc);

insert into t2 values('01-Jun-2015');
insert into t2 values('30-Jun-2015');
commit;

select d1, sys_nc00002$ from t2;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T2',
                method_opt       => 'for all columns size 1',
                cascade          => false
        );
end;
/

column endpoint_value format 999,999,999,999,999,999,999,999,999,999,999,999
break on table_name skip 1

select
        table_name, column_name, endpoint_number, endpoint_value, to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') end_hex
from
        user_tab_histograms
where
        table_name in ('T1','T2')
and     column_name like 'SYS%'
order by
        table_name, column_name, endpoint_number
;

I’ve put the dates 1st June 2015 and 30th June 2015 into the table because those were the values I used in the where clause of my query. Here are the results showing the internal representation of the (descending) index column and the stored low and high values for the virtual columns in both t1 and t2.


D1        SYS_NC00002$
--------- ------------------------
01-JUN-15 878CF9FEF8FEF8FF
30-JUN-15 878CF9E1FEF8FEFAFF

TABLE_NAME           COLUMN_NAME          ENDPOINT_NUMBER                                   ENDPOINT_VALUE END_HEX
-------------------- -------------------- --------------- ------------------------------------------------ ---------------------------------
T1                   SYS_NC00005$                       0  703,819,340,111,320,000,000,000,000,000,000,000    878CF8FCEFF30BCEBC8823F7000000
                     SYS_NC00005$                       1  703,880,027,955,346,000,000,000,000,000,000,000    878FF6F9EFF20256F3B404F7400000

T2                   SYS_NC00002$                       0  703,819,411,001,549,000,000,000,000,000,000,000    878CF9E1FEF8F24C1C7CED46200000
                     SYS_NC00002$                       1  703,819,419,969,389,000,000,000,000,000,000,000    878CF9FEF8FEEED28AC5B22A200000

If you check the t2 data (sys_nc00002$) values against the end_hex values in the histogram data you’ll see they match up to the first 6 bytes (12 digits). Oracle has done its standard processing – take the first 6 bytes of the column, convert to decimal, round to the most significant 15 digits (technically round(value, -21)), convert and store the result as hex.

So let’s do some arithmetic. The selectivity of a range scan that is in-bounds is (informally): “range we want” / “total range”. I’ve set up t2 to show us the values we will need to calculate the range we want, and I’ve reported the t1 values to allow us to calculate the total range: we just have to subtract the lower value (endpoint number 0) from the higher value for the two sys_nc0000N$ columns. So (ignoring the 21 zeros everywhere) our selectivity is:

  • (703,819,419,969,389 – 703,819,411,001,549) / ( 703,880,027,955,346 – 703,819,340,111,320) = 0.00014777
  • We have 5.9M rows in the table, so the cardinality estimate should be about: 5,932,800 * 0.00014777 = 876.69

The actual cardinality estimate was 885 – but we haven’t allowed for the exact form of the range-based predicate: we should add 1/num_distinct at both ends because the range is a closed range (greater than or EQUAL to, less than or EQUAL to) – which takes the cardinality estimate up to 884.69 which rounds to the 885 that the optimizer produced.
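
Putting the whole calculation into one statement (this is my own verification, not part of the original post – the 1,483,200 is the number of distinct timestamps implied by the data setup of 4 rows per minute, used here to supply the 1/num_distinct at each end of the range):

select
	round(
		5932800 * (
			  (703819419969389 - 703819411001549)
			/ (703880027955346 - 703819340111320)
			+ 2 / 1483200		-- assumed num_distinct of the virtual column
		),
		2
	)	estimated_rows		-- should come out at (about) 884.69
from
	dual
;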

Conclusion

This note shows that Oracle is (at least for dates) losing information about the underlying data when dealing with the virtual columns associated with descending columns in indexes. As demonstrated that can lead to extremely bad selectivity estimates for predicates based on the original columns, and these selectivity estimates make it possible for Oracle to choose the wrong index and introduce a spuriously low cost into the calculation of the full execution plan.

Footnote

This note seems to match bug note 11072246 “Wrong Cardinality estimations for columns with DESC indexes”, but that bug is reported as fixed in 12.1, while this test case reproduces in 12.1.0.2

 

January 16, 2015

Spatial space

Filed under: Infrastructure,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 1:00 pm GMT Jan 16,2015

One thing you (ought to) learn very early on in an Oracle career is that there are always cases you haven’t previously considered. It’s a feature that is frequently the downfall of “I found it on the internet” SQL.  Here’s one (heavily paraphrased) example that appeared on the OTN database forum a few days ago:

select table_name,round((blocks*8),2)||'kb' "size" from user_tables where table_name = 'MYTABLE';

select table_name,round((num_rows*avg_row_len/1024),2)||'kb' "size" from user_tables where table_name = 'MYTABLE';

The result from the first query is 704 kb,  the result from the second is 25.4 kb … fragmentation, rebuild, CTAS etc. etc.

The two queries are perfectly reasonable approximations (for an 8KB block size, with pctfree of zero) for the allocated space and actual data size for a basic heap table – and since the two values here don’t come close to matching it’s perfectly reasonable to consider doing something like a rebuild or shrink space to reclaim space and (perhaps) to improve performance.

In this case it doesn’t look as if the space reclaimed is likely to be huge (less than 1MB), on the other hand it’s probably not going to take much time to rebuild such a tiny table; it doesn’t seem likely that the rebuild could make a significant difference to performance (though apparently it did), but the act of rebuilding might cause execution plans to change for the better because new statistics might appear as the rebuild took place. The figures came from a test system, though, so maybe the table on the production system was much larger and the impact would be greater.

Being cautious about wasting time and introducing risk, I made a few comments about the question – and learned that one of the columns was of type SDO_GEOMETRY. This makes a big difference to what to do next, because dbms_stats.gather_table_stats() doesn’t process such columns correctly, which results in a massive under-estimate for the avg_row_len (which is basically the sum of avg_col_len for the table). Here’s an example (run on 12c, based on some code taken from the 10gR2 manuals):


drop table cola_markets purge;

CREATE TABLE cola_markets (
  mkt_id NUMBER,
  name VARCHAR2(32),
  shape SDO_GEOMETRY);

INSERT INTO cola_markets VALUES(
  1,
  'cola_a',
  SDO_GEOMETRY(
    2003,  -- two-dimensional polygon
    NULL,
    NULL,
    SDO_ELEM_INFO_ARRAY(1,1003,3), -- one rectangle (1003 = exterior)
    SDO_ORDINATE_ARRAY(1,1, 5,7) -- only 2 points needed to
          -- define rectangle (lower left and upper right) with
          -- Cartesian-coordinate data
  )
);

insert into cola_markets select * from cola_markets;
/
/
/
/
/
/
/
/
/

execute dbms_stats.gather_table_stats(user,'cola_markets')
select
	avg_row_len, num_rows, blocks,
	round(avg_row_len * num_rows / 7200,0) expected_blocks
from user_tables where table_name = 'COLA_MARKETS';

analyze table cola_markets compute statistics;
select
	avg_row_len, num_rows, blocks,
	round(avg_row_len * num_rows / 7200,0) expected_blocks
from user_tables where table_name = 'COLA_MARKETS';

If you care to count the number of times I execute the “insert as select” it’s 10, so the table ends up with 2^10 = 1024 rows. The 7,200 in the calculated column converts bytes to approximate blocks on the assumption of 8KB blocks and pctfree = 10. Here are the results following the two different methods for generating object statistics:


PL/SQL procedure successfully completed.

AVG_ROW_LEN   NUM_ROWS     BLOCKS EXPECTED_BLOCKS
----------- ---------- ---------- ---------------
         14       1024        124               2

Table analyzed.

AVG_ROW_LEN   NUM_ROWS     BLOCKS EXPECTED_BLOCKS
----------- ---------- ---------- ---------------
        109       1024        124              16

Where does the difference in Expected_blocks come from? (The Blocks figure is 124 because I’ve used 1MB uniform extents – 128 blocks – under ASSM, which means 4 space management blocks at the start of the first extent.)

Here are the column lengths after the call to dbms_stats: as you can see the avg_row_len is the sum of avg_col_len.


select column_name, data_type, avg_col_len
from   user_tab_cols
where  table_name = 'COLA_MARKETS'
order by
        column_id
;

COLUMN_NAME          DATA_TYPE                AVG_COL_LEN
-------------------- ------------------------ -----------
MKT_ID               NUMBER                             3
NAME                 VARCHAR2                           7
SYS_NC00010$         SDO_ORDINATE_ARRAY
SHAPE                SDO_GEOMETRY
SYS_NC00008$         NUMBER                             0
SYS_NC00004$         NUMBER                             4
SYS_NC00005$         NUMBER                             0
SYS_NC00006$         NUMBER                             0
SYS_NC00007$         NUMBER                             0
SYS_NC00009$         SDO_ELEM_INFO_ARRAY

The figures from the analyze command are only slightly different, but fortunately the analyze command uses the row directory pointers to calculate the actual row allocation, so picks up information about the impact of inline varrays, LOBs, etc. that the dbms_stats call might not be able to handle.


COLUMN_NAME          DATA_TYPE                AVG_COL_LEN
-------------------- ------------------------ -----------
MKT_ID               NUMBER                             2
NAME                 VARCHAR2                           6
SYS_NC00010$         SDO_ORDINATE_ARRAY
SHAPE                SDO_GEOMETRY
SYS_NC00008$         NUMBER                             1
SYS_NC00004$         NUMBER                             3
SYS_NC00005$         NUMBER                             1
SYS_NC00006$         NUMBER                             1
SYS_NC00007$         NUMBER                             1
SYS_NC00009$         SDO_ELEM_INFO_ARRAY

As a basic reminder – whenever you do anything slightly non-trivial (e.g. something you couldn’t have done in v5, say) then remember that all those dinky little script things you find on the Internet might not actually cover your particular case.

November 20, 2014

Quantum Data

Filed under: CBO,Oracle,Statistics — Jonathan Lewis @ 11:30 am GMT Nov 20,2014

That’s data that isn’t there until you look for it, sort of, from the optimizer’s perspective.

Here’s some code to create a sample data set:


create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	rownum					id,
	mod(rownum-1,200)			mod_200,
	mod(rownum-1,10000)			mod_10000,
	lpad(rownum,50)				padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e6
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt 	 => 'for all columns size 1'
	);
end;
/

Now derive the execution plans for a couple of queries noting, particularly, that we are using queries that are NOT CONSISTENT with the current state of the data (or more importantly the statistics about the data) – we’re querying outside the known range.


select * from t1 where mod_200  = 300;
select * from t1 where mod_200 >= 300;

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  2462 |   151K|  1246   (5)| 00:00:07 |
|*  1 |  TABLE ACCESS FULL| T1   |  2462 |   151K|  1246   (5)| 00:00:07 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("MOD_200"=300)

SQL> select * from t1 where mod_200 >=300;

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  2462 |   151K|  1246   (5)| 00:00:07 |
|*  1 |  TABLE ACCESS FULL| T1   |  2462 |   151K|  1246   (5)| 00:00:07 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("MOD_200">=300)

The predicted cardinality for mod_200 = 300 is the same as that for mod_200 >= 300. So, to be self-consistent, the optimizer really ought to report no rows (or a token 1 row) for any value of mod_200 greater than 300 – but it doesn’t.


SQL> select * from t1 where mod_200 = 350;

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1206 | 75978 |  1246   (5)| 00:00:07 |
|*  1 |  TABLE ACCESS FULL| T1   |  1206 | 75978 |  1246   (5)| 00:00:07 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("MOD_200"=350)

SQL> select * from t1 where mod_200 =360;

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   955 | 60165 |  1246   (5)| 00:00:07 |
|*  1 |  TABLE ACCESS FULL| T1   |   955 | 60165 |  1246   (5)| 00:00:07 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("MOD_200"=360)

SQL> select * from t1 where mod_200 =370;

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   704 | 44352 |  1246   (5)| 00:00:07 |
|*  1 |  TABLE ACCESS FULL| T1   |   704 | 44352 |  1246   (5)| 00:00:07 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("MOD_200"=370)

SQL> select * from t1 where mod_200 =380;

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   452 | 28476 |  1246   (5)| 00:00:07 |
|*  1 |  TABLE ACCESS FULL| T1   |   452 | 28476 |  1246   (5)| 00:00:07 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("MOD_200"=380)

SQL> select * from t1 where mod_200 in (350, 360, 370, 380);

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  3317 |   204K|  1275   (7)| 00:00:07 |
|*  1 |  TABLE ACCESS FULL| T1   |  3317 |   204K|  1275   (7)| 00:00:07 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("MOD_200"=350 OR "MOD_200"=360 OR "MOD_200"=370 OR "MOD_200"=380)

The IN-list results are consistent with the results for the individual values – but the result for the IN-list is NOT consistent with the result for the original mod_200 >= 300. The optimizer uses a “linear decay” strategy for handling predicates that go out of range, but not in a consistent way. It seems that, as far as out-of-range, range-based predicates are concerned, the data doesn’t exist until the wave front collapses.
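
For anyone who wants to see the “linear decay” in numbers, here’s a quick sketch of the arithmetic as I read it (my own interpretation, not something taken from the original post): the in-range estimate of num_rows/num_distinct = 1,000,000/200 = 5,000 is scaled down by how far the requested value lies beyond the known high value (199), measured against the full low-to-high range (199 – 0):

select
	column_value							requested_value,
	round((1e6/200) * (1 - (column_value - 199)/(199 - 0)))	estimated_rows
from
	table(sys.odcinumberlist(300, 350, 360, 370, 380))
order by
	column_value
;

-- the results reproduce the 2462, 1206, 955, 704 and 452 reported in the plans above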

Footnote:

This type of anomaly COULD affect some execution plans if your statistics haven’t been engineered to avoid the problems of “out of range” traps.

November 19, 2014

Comparisons

Filed under: Histograms,humour,Oracle,Statistics — Jonathan Lewis @ 12:47 pm GMT Nov 19,2014

“You can’t compare apples with oranges.”

Oh, yes you can! The answer is 72,731,533,037,581,000,000,000,000,000,000,000.


SQL> 
SQL> create table fruit(v1 varchar2(30));
SQL> 
SQL> insert into fruit values('apples');
SQL> insert into fruit values('oranges');
SQL> commit;
SQL> 
SQL> 
SQL> begin
  2  	     dbms_stats.gather_table_stats(
  3  		     ownname	      => user,
  4  		     tabname	      =>'FRUIT',
  5  		     method_opt       => 'for all columns size 2'
  6  	     );
  7  end;
  8  /
SQL> 
SQL> select
  2  	     endpoint_number,
  3  	     endpoint_value,
  4  	     to_char(endpoint_value,'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') hex_value
  5  from
  6  	     user_tab_histograms
  7  where
  8  	     table_name = 'FRUIT'
  9  order by
 10  	     endpoint_number
 11  ;

ENDPOINT_NUMBER                                   ENDPOINT_VALUE HEX_VALUE
--------------- ------------------------------------------------ -------------------------------
              1  505,933,332,254,715,000,000,000,000,000,000,000  6170706c65731ad171a7dca6e00000
              2  578,664,865,292,296,000,000,000,000,000,000,000  6f72616e67658acc6c9dcaf5000000
SQL> 
SQL> 
SQL> 
SQL> select
  2  	     max(endpoint_value) - min(endpoint_value) diff
  3  from
  4  	     user_tab_histograms
  5  where
  6  	     table_name = 'FRUIT'
  7  ;

                                            DIFF
------------------------------------------------
  72,731,533,037,581,000,000,000,000,000,000,000
SQL> 
SQL> spool off


October 17, 2014

sreadtim

Filed under: Oracle,Statistics,System Stats,Troubleshooting — Jonathan Lewis @ 1:22 pm GMT Oct 17,2014

Here’s a question that appeared in my email a few days ago:

Based on the formula: “sreadtim = ioseektim + db_block_size/iotfrspeed”, sreadtim should always be bigger than ioseektim.

But I just did a query on my system, found it otherwise, and got confused:

SQL> SELECT * FROM SYS.AUX_STATS$;

SNAME                          PNAME                               PVAL1 PVAL2
------------------------------ ------------------------------ ---------- --------------------
SYSSTATS_INFO                  STATUS                                    COMPLETED
SYSSTATS_INFO                  DSTART                                    10-08-2014 10:45
SYSSTATS_INFO                  DSTOP                                     10-10-2014 10:42
SYSSTATS_INFO                  FLAGS                                   1
SYSSTATS_MAIN                  CPUSPEEDNW                     680.062427
SYSSTATS_MAIN                  IOSEEKTIM                              10
SYSSTATS_MAIN                  IOTFRSPEED                           4096
SYSSTATS_MAIN                  SREADTIM                            4.716
SYSSTATS_MAIN                  MREADTIM                            2.055
SYSSTATS_MAIN                  CPUSPEED                             1077
SYSSTATS_MAIN                  MBRC                                    4
SYSSTATS_MAIN                  MAXTHR                          956634112
SYSSTATS_MAIN                  SLAVETHR                           252928

How do we explain this?

This question highlights two points – one important, the other only slightly less so.

The really important point is one of interpretation.  Broadly speaking we could reasonably say that the (typical) time required to perform a single block read is made up of the (typical) seek time plus the transfer time which, using the names of the statistics above, would indeed give us the relationship: sreadtim = ioseektim + db_block_size/iotfrspeed; but we have to remember that we are thinking of a simplified model of the world. The values that we capture for sreadtim include the time it takes for a request to get from Oracle to the O/S, through the various network software and hardware layers and back again; the formula ignores those components completely and, moreover, doesn’t allow for the fact that some “reads” could actually come from one of several caches between Oracle and the disc without any physical disc access actually taking place. Similarly we should be aware that the time for an actual I/O seek would vary dramatically with the current position  of the read head, the radial position of the target block, the speed and current direction of movement of the read head, and the rotational distance to the target block. The formula is not attempting to express a physical law, it is simply expressing an approximation that we might use in a first line estimate of performance.

In fact we can see in the figures above that multi-block reads (typically of 4 blocks)  were faster than single block reads on this hardware for the duration of the sampling period – and that clearly doesn’t fit with the simple view embedded in our formula of how disc drives work.  (It’s a fairly typical effect of SANs, of course, that large read requests make the SAN software start doing predictive read-ahead, so that the next read request from Oracle may find that the SAN has already loaded the required data into its cache.)

There is, however, the second point that these figures highlight – but you have to be in the know to spot the detail: whatever the complexities introduced by SAN caching, we’re not comparing the right numbers. The ioseektim and iotfrspeed shown here are the default values used by Oracle. It looks as if the user has called dbms_stats.gather_system_stats() with a 48 hour workload (dstart = 8th Oct, dstop = 10th Oct) but hasn’t yet executed the procedure using the ‘noworkload’ option. Perhaps the ioseektim and iotfrspeed figures from a noworkload call would look a little more reasonable when compared with the 4.716 milliseconds of the gathered sreadtim.
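
To put a number on that last point (my own arithmetic, assuming an 8KB block size): plugging the default noworkload figures shown above into the formula gives a predicted sreadtim that is, as expected, larger than ioseektim – it’s just not the figure that should be compared with the gathered 4.716 ms.

select 10 + 8192/4096	predicted_sreadtim_ms	from dual;
-- ioseektim = 10 ms, iotfrspeed = 4096 bytes/ms, 8KB block => 12 ms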

There may still be a large gap between the model and the reality, but until the two sets of figures we’re using come from the same place we shouldn’t be comparing them.

 

July 4, 2014

Adjusting Histograms

Filed under: Histograms,Oracle,Statistics — Jonathan Lewis @ 8:32 pm GMT Jul 4,2014

This is a quick response to a question on an old blog post asking how you can adjust the high value if you’ve already got a height-balanced histogram in place. It’s possible that someone will come up with a tidier method, but this was just a quick sample I created and tested on 11.2.0.4 in a few minutes.  (Note – this is specifically for height-balanced histograms,  and it’s not appropriate for 12c which has introduced hybrid histograms that will require me to modify my “histogram faking” code a little).

rem
rem	Script:		adjust_histogram.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Jun 2014
rem	Purpose:
rem
rem	Last tested
rem		11.2.0.4
rem	Not tested
rem		12.1.0.1
rem		11.1.0.7
rem		10.2.0.5
rem	Outdated
rem		 9.2.0.8
rem		 8.1.7.4	no WITH subquery
rem
rem	Notes:
rem	Follow-on from a query on my blog about setting the high value
rem	when you have a histogram.  We could do this by hacking, or by
rem	reading the user_tab_histogram values and doing a proper prepare
rem

start setenv
set timing off

execute dbms_random.seed(0)

drop table t1;

begin
	begin		execute immediate 'purge recyclebin';
	exception	when others then null;
	end;

	begin
		dbms_stats.set_system_stats('MBRC',16);
		dbms_stats.set_system_stats('MREADTIM',10);
		dbms_stats.set_system_stats('SREADTIM',5);
		dbms_stats.set_system_stats('CPUSPEED',1000);
	exception
		when others then null;
	end;
/*
	begin		execute immediate 'begin dbms_stats.delete_system_stats; end;';
	exception	when others then null;
	end;

	begin		execute immediate 'alter session set "_optimizer_cost_model"=io';
	exception	when others then null;
	end;

	begin		execute immediate 'alter session set "_optimizer_gather_stats_on_load" = false';
	exception	when others then null;
	end;
*/

	begin		execute immediate  'begin dbms_space_admin.materialize_deferred_segments(''TEST_USER''); end;';
	exception	when others then null;
	end;

end;
/

create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	trunc(sysdate,'YYYY') + trunc(dbms_random.normal * 100,1)	d1
from
	generator	v1,
	generator	v2
where
	rownum <= 1e4
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt 	 => 'for all columns size 32'
	);

end;
/

spool adjust_histogram.lst

prompt	==================
prompt	Current High Value
prompt	==================

select to_char(max(d1),'dd-Mon-yyyy hh24:mi:ss') from t1;

prompt	==============================
prompt	Initial Histogram distribution
prompt	==============================

select
	endpoint_number,
	to_date(to_char(trunc(endpoint_value)),'J') + mod(endpoint_value,1) d_val,
	endpoint_value,
	lag(endpoint_value,1) over(order by endpoint_number) lagged_epv,
	endpoint_value -
		lag(endpoint_value,1) over(order by endpoint_number)  delta
from	user_tab_histograms
where
	table_name = 'T1'
and	column_name = 'D1'
;

rem
rem	Note - we can't simply overwrite the last srec.novals
rem	because that doesn't adjust the stored high_value.
rem	We have to make a call to prepare_column_values,
rem	which means we have to turn the stored histogram
rem	endpoint values into their equivalent date types.
rem

prompt	==================
prompt	Hacking the values
prompt	==================

declare

	m_distcnt		number;
	m_density		number;
	m_nullcnt		number;
	srec			dbms_stats.statrec;
	m_avgclen		number;

	d_array			dbms_stats.datearray := dbms_stats.datearray();
	ct			number;

begin

	dbms_stats.get_column_stats(
		ownname		=> user,
		tabname		=> 't1',
		colname		=> 'd1',
		distcnt		=> m_distcnt,
		density		=> m_density,
		nullcnt		=> m_nullcnt,
		srec		=> srec,
		avgclen		=> m_avgclen
	); 

	ct := 0;
	for r in (
		select	to_date(to_char(trunc(endpoint_value)),'J') + mod(endpoint_value,1) d_val
		from	user_tab_histograms
		where	table_name = 'T1'
		and	column_name = 'D1'
		order by endpoint_number
	) loop

		ct := ct + 1;
		d_array.extend;
		d_array(ct) := r.d_val;
		if ct = 1 then
			srec.bkvals(ct) := 0;
		else
			srec.bkvals(ct) := 1;
		end if;

	end loop;

	d_array(ct) := to_date('30-Jun-2015','dd-mon-yyyy');

	dbms_stats.prepare_column_values(srec, d_array);

	dbms_stats.set_column_stats(
		ownname		=> user,
		tabname		=> 't1',
		colname		=> 'd1',
		distcnt		=> m_distcnt,
		density		=> m_density,
		nullcnt		=> m_nullcnt,
		srec		=> srec,
		avgclen		=> m_avgclen
	);
end;
/

prompt	============================
prompt	Final Histogram distribution
prompt	============================

select
	endpoint_number,
	to_date(to_char(trunc(endpoint_value)),'J') + mod(endpoint_value,1) d_val,
	endpoint_value,
	lag(endpoint_value,1) over(order by endpoint_number) lagged_epv,
	endpoint_value -
		lag(endpoint_value,1) over(order by endpoint_number)  delta
from	user_tab_histograms
where
	table_name = 'T1'
and	column_name = 'D1'
;

spool off

doc

#

May 5, 2014

endpoint_value

Filed under: Histograms,Oracle,Statistics — Jonathan Lewis @ 2:58 pm GMT May 5,2014

I wrote a note a few years ago about translating the endpoint_value for histograms on character columns, and in that note I left the casual comment that you would “fiddle with this bit of code to handle multibyte character sets”. For anyone who has never found time to do that fiddling, here’s a solution.  In fact, it’s also a better solution for the original character problem than the one I originally published – I have no idea why I didn’t use the utl_raw package to simplify the code years ago.

First, a brief explanation of the problem. To store a character value in the numeric endpoint_value column of a histogram, Oracle:

  • Takes the first 15 bytes of the string (after padding the string with zeros (for varchar2) or spaces (for char))
  • Treats the result as a 15 byte (30 digit) hexadecimal number, and converts to decimal
  • Rounds to 15 significant digits (why?!) and stores the result as the endpoint_value
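
Putting those three steps into a query gives a rough forward check. This is only my sketch of the mechanism, not Oracle’s internal code – in particular the “round to 15 significant digits” step is my approximation of whatever rounding Oracle really applies – but for a sample string like ‘CLUSTER’ it should produce something very close to the endpoint_value you’ll see reported later in this note:

select
	round(
		hex_as_number,
		15 - (trunc(log(10, hex_as_number)) + 1)	-- keep 15 significant digits
	)	approximate_endpoint_value
from	(
	select
		to_number(
			-- pad the raw bytes with zeros to 15 bytes (30 hex digits) - varchar2 case
			rpad(rawtohex(utl_raw.cast_to_raw('CLUSTER')), 30, '0'),
			'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
		)	hex_as_number
	from	dual
	)
;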

There are two problems with my old code when trying to convert back to a (multi-byte) character format. The first is that after converting back to a hexadecimal format the code doesn’t know how many consecutive bytes should be used to make up each character of the string – this problem can be circumvented by using the utl_raw.cast_to_[n]varchar2() function. (Note: the dbms_stats package has a number of convert_raw_value() procedures which could also be used, but to take advantage of them I’d have to create a wrapper function for each procedure.)

The second problem is one that’s easy to overlook (and possibly survive with) for some time – some multi-byte characters start with a zero byte, or perhaps more than a single zero byte.  If the first character of a string that’s going into the histogram data starts with such a character then conversion to a number loses the leading zeros, which means you have to find a way of recovering the right number of zeros before doing the conversion from raw to character. The solution to this one is also fairly straightforward: we know that the endpoint_value when converted to a hexadecimal number should be 15 bytes, which means in its hex string presentation it will be 30 characters: if it’s not 30 characters then pad it on the left with zeros to 30 characters, then call utl_raw.cast_to_[n]varchar2() on the result. Here’s a dataset to demonstrate the point (the results shown came from 11.2.0.4 with the varchar2() using WE8MSWIN1252 and the nvarchar2() using AL16UTF16):

create table t1
as
select
	object_type				v_type,
	cast(object_type as nvarchar2(19))	n_type
from
	all_objects
where
	rownum <= 10000
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 254'
	);
end;
/

select 
	column_name,
	endpoint_number,
	lpad(to_char(endpoint_value,'fmxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'),30,'0')
from 
	user_tab_histograms
where 
	table_name = 'T1' 
order by
	column_name desc,
	endpoint_number
;

Column                 EP no LPAD(TO_CHAR(ENDPOINT_VALUE,'F
-------------------- ------- ------------------------------
V_TYPE                    10 434c5553544566083af75039400000
                          12 434f4e5445585b0be13e9077800000
                          13 45444954494f54acac5b71e6c00000
                          65 46554e4354494bfd6e91eac9000000
                       1,218 494e444557ffec1b8cf7b386200000
                       1,284 494e44455820524115d2f593c00000
                       1,287 4f5045524154487b7a4542a0400000
                       1,459 5041434b41476016d731ed1a200000
                       1,473 50524f43454439c0c7fd90d5c00000
                       1,475 53455155454e581a4a938544400000
                       4,576 53594e4f4e59319bbd81e9d6a00000
                       5,613 5441424c45001de17dde9b57c00000
                       5,685 5441424c45204dd13d0c1786c00000
                       5,717 5441424c45204dd13d0c1786c00000
                       6,425 54595044ffffeda95d6a0f02200000
                      10,000 56494556fffffb4cec1ab3e3600000

N_TYPE                    10 0043004c00550091d5132d65900000
                          12 0043004f004e00699afbb463980000
                          13 0045004400490039662489da280000
                          65 00460055004e0032efa0cbcb200000
                       1,218 0049004e0044000ee521e524f00000
                       1,284 0049004e0044000ee521e524f00000
                       1,287 004f00500045007d7a2958e3400000
                       1,459 0050004100430082c9d92def100000
                       1,473 00500052004f003864f4c314680000
                       1,475 005300450051001bb6610d60c00000
                       4,576 00530059004e00250bbf32dc380000
                       5,613 0054004100420046e721e46a700000
                       5,685 0054004100420046e721e46a700000
                       5,717 0054004100420046e721e46a700000
                       6,425 0054005900500056f579391d680000
                      10,000 0056004900450054a090433f680000


32 rows selected.

As you can see, we’ve got a frequency histogram on both columns. If you’re good with ASCII codes you’ll be able to translate the first row for the v_type data: 43 = C, 4c = L, 55 = U, 53 = S, 54 = T, 45 = E, 66 = f, and then we run into funny characters; the ‘f’ and everything after it is part of the rounding error due to the algorithm Oracle is using. You can also check the first row for the n_type: it’s coming from a fixed width character set, 2 bytes per character: 0043 = C, 004c = L, 0055 = U, 0091 = {rounding error} – the rounding means we get much less information from the nvarchar2() histogram.

Rather than depending on knowing our code pages, though, we can get Oracle to do the translation work:


select
	column_name,
	endpoint_number,
	substr(lpad(to_char(endpoint_value,'fmxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'),30,'0'),1,30)  hex_value,
	decode(
		column_name,
			'V_TYPE',
				utl_raw.cast_to_varchar2(
					substr(lpad(to_char(endpoint_value,'fmxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'),30,'0'),1,14) 
				),
			'N_TYPE',
				utl_raw.cast_to_nvarchar2(
					substr(lpad(to_char(endpoint_value,'fmxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'),30,'0'),1,16) 
				)
	)	char_value,
	endpoint_value,
	endpoint_actual_value
from
	user_tab_histograms
where
	table_name = 'T1'
order by
	column_name desc,
	endpoint_number
;

--
--	Clear graphics mode
--

execute dbms_output.put_line(chr(15))

Column       EP no HEX_VALUE                      CHAR_VALUE                                   ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE
---------- ------- ------------------------------ ---------- ------------------------------------------------ --------------------------------
V_TYPE          10 434c5553544566083af75039400000 CLUSTEf     349,432,112,834,658,000,000,000,000,000,000,000 CLUSTER
                12 434f4e5445585b0be13e9077800000 CONTEX[     349,492,405,757,772,000,000,000,000,000,000,000 CONTEXT
                13 45444954494f54acac5b71e6c00000 EDITIOT     359,653,496,833,182,000,000,000,000,000,000,000 EDITION
                65 46554e4354494bfd6e91eac9000000 FUNCTIK     365,190,985,547,816,000,000,000,000,000,000,000 FUNCTION
             1,218 494e444557ffec1b8cf7b386200000 INDEWyi     380,625,107,598,029,000,000,000,000,000,000,000 INDEX
             1,284 494e44455820524115d2f593c00000 INDEX R     380,625,107,598,182,000,000,000,000,000,000,000 INDEX PARTITION
             1,287 4f5045524154487b7a4542a0400000 OPERATH     411,819,536,792,506,000,000,000,000,000,000,000 OPERATOR
             1,459 5041434b41476016d731ed1a200000 PACKAG`     416,707,436,884,205,000,000,000,000,000,000,000 PACKAGE
             1,473 50524f43454439c0c7fd90d5c00000 PROCED9     417,053,186,114,358,000,000,000,000,000,000,000 PROCEDURE
             1,475 53455155454e581a4a938544400000 SEQUENX     432,366,569,392,218,000,000,000,000,000,000,000 SEQUENCE
             4,576 53594e4f4e59319bbd81e9d6a00000 SYNONY1     432,771,978,053,825,000,000,000,000,000,000,000 SYNONYM
             5,613 5441424c45001de17dde9b57c00000 TABLE      437,476,545,404,166,000,000,000,000,000,000,000 TABLE
             5,685 5441424c45204dd13d0c1786c00000 TABLE M     437,476,545,404,318,000,000,000,000,000,000,000 TABLE PARTITION
             5,717 5441424c45204dd13d0c1786c00000 TABLE M     437,476,545,404,318,000,000,000,000,000,000,000 TABLE SUBPARTITION
             6,425 54595044ffffeda95d6a0f02200000 TYPDyyi     437,964,430,179,117,000,000,000,000,000,000,000 TYPE
            10,000 56494556fffffb4cec1ab3e3600000 VIEVyyu     448,023,639,403,471,000,000,000,000,000,000,000 VIEW

N_TYPE          10 0043004c00550091d5132d65900000 CLU?          1,358,944,964,706,820,000,000,000,000,000,000  C L U S T E R
                12 0043004f004e00699afbb463980000 CONi          1,358,945,893,128,790,000,000,000,000,000,000  C O N T E X T
                13 0045004400490039662489da280000 EDI9          1,399,507,307,977,370,000,000,000,000,000,000  E D I T I O N
                65 00460055004e0032efa0cbcb200000 FUN2          1,419,794,978,849,800,000,000,000,000,000,000  F U N C T I O N
             1,218 0049004e0044000ee521e524f00000 IND          1,480,640,041,218,460,000,000,000,000,000,000  I N D E X
             1,284 0049004e0044000ee521e524f00000 IND          1,480,640,041,218,460,000,000,000,000,000,000  I N D E X   P A R T I T I O N
             1,287 004f00500045007d7a2958e3400000 OPE}          1,602,335,117,815,120,000,000,000,000,000,000  O P E R A T O R
             1,459 0050004100430082c9d92def100000 PAC?          1,622,612,885,134,180,000,000,000,000,000,000  P A C K A G E
             1,473 00500052004f003864f4c314680000 PRO8          1,622,618,146,436,010,000,000,000,000,000,000  P R O C E D U R E
             1,475 005300450051001bb6610d60c00000 SEQ          1,683,461,351,951,280,000,000,000,000,000,000  S E Q U E N C E
             4,576 00530059004e00250bbf32dc380000 SYN%          1,683,467,541,637,310,000,000,000,000,000,000  S Y N O N Y M
             5,613 0054004100420046e721e46a700000 TABF          1,703,742,523,544,060,000,000,000,000,000,000  T A B L E
             5,685 0054004100420046e721e46a700000 TABF          1,703,742,523,544,060,000,000,000,000,000,000  T A B L E   P A R T I T I O N
             5,717 0054004100420046e721e46a700000 TABF          1,703,742,523,544,060,000,000,000,000,000,000  T A B L E   S U B P A R T I T I
             6,425 0054005900500056f579391d680000 TYPV          1,703,749,951,250,410,000,000,000,000,000,000  T Y P E
            10,000 0056004900450054a090433f680000 VIET          1,744,309,818,645,610,000,000,000,000,000,000  V I E W

32 rows selected.

I’ve limited the raw conversion to substr(,1,14) for the v_type and substr(,1,16) for the n_type because these translate into 7 and 4 characters respectively – and in practice you’re lucky if even the first 6 / 3 characters always come out as the right values. I can’t show you exactly what my output was like because there were various hidden characters (line feed, backspaces, “shift” and so on) that made a bit of a mess of the results; that’s why the dbms_output.put_line(chr(15)) is there at the end of the script – I needed it to switch my terminal out of graphics mode.

I’ve included endpoint_actual_value in my output. Because “TABLE PARTITION” and “TABLE SUBPARTITION” came to the same endpoint_value Oracle captured the first 32 bytes of the actual values it had found. (Note: (a) it is bytes not characters, (b) this limit goes up to 64 in 12c, and (c) the value is stored in a varchar2() column, which is why the values for column n_type appear to have alternating spaces – that’s the impact of all the zero bytes.)

 

May 4, 2014

Extended stats

Filed under: 12c,extended stats,Histograms,Oracle,Statistics — Jonathan Lewis @ 1:24 pm GMT May 4,2014

Like the recent article on deleting histograms this is another draft that I rediscovered while searching for some notes I had written on a different topic – so I’ve finally finished it off and published it.

Here’s a quirky little detail of extended stats that came up in an OTN thread earlier this week [ed: actually 8th Jan 2014]. When you create column group stats, Oracle uses an undocumented function sys_op_combined_hash() to create a hash value, and if you gather simple stats on the column (i.e. no histogram) you can get some idea of the range of values that Oracle generates through the hash function. For example:


create table t1 as
select  1 n1, 2 n2
from dual
connect by level<=5000
union all
select  2, 1
from dual
connect by level<=5000
;
 
select dbms_stats.create_extended_stats(user,'t1','(n1, n2)') name from dual; 

begin 
	dbms_stats.gather_table_stats( 
		ownname		=> user,
		tabname		=>'T1',
		method_opt	=> 'for all columns size 1'
	);
end;
/

break on column_name skip 1

column column_name format a32
column endpoint_number heading "EP No."
column endpoint_value  heading "Value" format 999,999,999,999,999,999,999

select
	column_name, endpoint_number, endpoint_value
from
	user_tab_histograms
where
	table_name = 'T1'
order by
	column_name, endpoint_value
;

COLUMN_NAME                          EP No.                        Value
-------------------------------- ---------- ----------------------------
N1                                        0                            1
                                          1                            2

N2                                        0                            1
                                          1                            2

SYS_STUBZH0IHA7K$KEBJVXO5LOHAS            0      298,332,787,864,733,000
                                          1    8,095,781,421,167,520,000

I could have selected low_value and high_value from user_tab_cols, using utl_raw.cast_to_number() to display them in numeric format, but the view user_tab_histograms displays the low and high as a two-bucket histogram if there is no actual histogram data for the column in the histogram (histgrm$) table.
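
For completeness, that alternative would look something like the following – the column group name is the system-generated one from my test above, so yours will be different:

select
	column_name,
	utl_raw.cast_to_number(low_value)	low_value,
	utl_raw.cast_to_number(high_value)	high_value
from
	user_tab_cols
where
	table_name  = 'T1'
and	column_name = 'SYS_STUBZH0IHA7K$KEBJVXO5LOHAS'
;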

We probably don’t need to worry about what the low and high values might be because taking hash values destroys any meaning that a range might have (the optimizer can’t use column group stats in range-based predicates, only in equality predicates). However, we might collect a frequency histogram (or Top-N histogram in 12c) on the column group because there might be some data skew in the sets of values that we need to tell the optimizer about – so let’s gather a histogram with 2 buckets on our sample data set and see what we get:


begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt 	 => 'for all columns size 2'
	);
end;
/

select
	column_name, endpoint_number, endpoint_value
from
	user_tab_histograms
where
	table_name = 'T1'
order by
	column_name, endpoint_value
;

COLUMN_NAME                          EP No.                        Value
-------------------------------- ---------- ----------------------------
N1                                     5000                            1
                                      10000                            2

N2                                     5000                            1
                                      10000                            2

SYS_STUBZH0IHA7K$KEBJVXO5LOHAS         5000                1,977,102,303
                                      10000                7,894,566,276

The histogram values have changed! As one of the posters on the OTN thread points out, what Oracle has actually stored in this case is mod(sys_op_combined_hash(), 9999999999).
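
If you want to check that claim for yourself, sys_op_combined_hash() is undocumented but can usually be called directly from SQL. This is just an illustrative sketch – if the suggestion is right, the two mod() results should match the endpoint values 1,977,102,303 and 7,894,566,276 shown above (in one order or the other):

select
	mod(sys_op_combined_hash(1, 2), 9999999999)	stored_value_1_2,
	mod(sys_op_combined_hash(2, 1), 9999999999)	stored_value_2_1
from
	dual
;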

So far I haven’t done any exhaustive testing to work out whether the change in the strategy for storing numbers makes any difference to the typical optimizer arithmetic – but I do have at least one case (relating to “missing values” behaviour) where the presence or absence of a column group histogram makes a difference to the estimated cardinality in a way that seems inconsistent with other patterns of behaviour. I doubt if it’s actually due to the change in what’s stored, and one day I may come across a client where I actually need to work out what’s going on and how best to work with the anomaly.

Bonus thought:

In 12c Oracle collects column stats automatically as it loads data into an empty table; but not if the table has extended stats defined on it.  (This is bug number 18425876, labelled as fixed in 12.2 [ed: actually fixed in 12.1.0.2].) Here’s some code modelling a client scenario where we truncate and reload a table every day. In the first part of the demonstration I’ve loaded the table twice to show that after truncating and reloading I get new stats on the table – the first load is 10,000 rows, the second is 20,000 rows, and the stats reflect this automatically. In the second part of the code, after adding a set of column group stats and then truncating and loading 5,000 rows, the stats from the previous cycle are still in place. (The code is only relevant to 12c, of course.)

create table t1 (n1 number, n2 number);

insert	/*+ append */
into	t1
select	object_id, data_object_id
from	all_objects
where
	rownum <= 10000
;
commit;

-- stats have appeared without a call to dbms_stats to gather them.

select blocks, num_rows from user_tables where table_name = 'T1';
select column_name, num_distinct, num_nulls, density, low_value, high_value from user_tab_cols where table_name = 'T1';

truncate table t1;

insert	/*+ append */
into	t1
select	object_id, data_object_id
from	all_objects
where
	rownum <= 20000
;
commit;

-- Stats now show the latest data 

select blocks, num_rows from user_tables where table_name = 'T1';
select column_name, num_distinct, num_nulls, density, low_value, high_value from user_tab_cols where table_name = 'T1';

-- Add a column group to the stats 

select dbms_stats.create_extended_stats(user,'t1','(n1, n2)') name from dual;

truncate table t1;

insert	/*+ append */
into	t1
select	object_id, data_object_id
from	all_objects
where
	rownum <= 5000
;
commit;

-- The stats have not been updated to reflect the new data, and the column group stats are empty

select blocks, num_rows from user_tables where table_name = 'T1';
select column_name, num_distinct, num_nulls, density, low_value, high_value from user_tab_cols where table_name = 'T1';

--------------------------------------------------------
-- Here are the three consecutive sets of results
--------------------------------------------------------

    BLOCKS   NUM_ROWS
---------- ----------
        16      10000

COLUMN_NAME                      NUM_DISTINCT  NUM_NULLS    DENSITY LOW_VALUE                  HIGH_VALUE
-------------------------------- ------------ ---------- ---------- -------------------------- --------------------------
N1                                      10000          0      .0001 C103                       C3020C60
N2                                       2534       7429 .000394633 C103                       C30B2929

    BLOCKS   NUM_ROWS
---------- ----------
        32      20000

COLUMN_NAME                      NUM_DISTINCT  NUM_NULLS    DENSITY LOW_VALUE                  HIGH_VALUE
-------------------------------- ------------ ---------- ---------- -------------------------- --------------------------
N1                                      20000          0     .00005 C103                       C30A4553
N2                                       3115      16848 .000321027 C103                       C30B2929

    BLOCKS   NUM_ROWS
---------- ----------
        32      20000

COLUMN_NAME                      NUM_DISTINCT  NUM_NULLS    DENSITY LOW_VALUE                  HIGH_VALUE
-------------------------------- ------------ ---------- ---------- -------------------------- --------------------------
N1                                      20000          0     .00005 C103                       C30A4553
N2                                       3115      16848 .000321027 C103                       C30B2929
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS

The workaround given in the bug is “add the extended stats after loading the table” – but if you’re constantly truncating and reloading, that means you have to drop and re-add the extended stats, and do a tablescan to gather the column group stats, every time you reload.
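
In outline, the daily cycle that the bug forces on you would look something like this. It’s just a sketch following the workaround, not a tested production script – and the final method_opt could be restricted to gather only the column group stats if you want to keep the automatically gathered base-column stats:

begin
	dbms_stats.drop_extended_stats(user, 't1', '(n1, n2)');
end;
/

truncate table t1;

insert	/*+ append */
into	t1
select	object_id, data_object_id
from	all_objects
where	rownum <= 5000
;
commit;

-- re-create the column group after the load (the workaround), then pay for a tablescan

select dbms_stats.create_extended_stats(user,'t1','(n1, n2)') name from dual;

begin
	dbms_stats.gather_table_stats(
		ownname		=> user,
		tabname		=> 'T1',
		method_opt	=> 'for all columns size 1'
	);
end;
/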

Note: the limitation applies whether you create a column group, “ordinary” extended stats, a virtual column, or an implicit virtual column underlying a function-based index.

Just as a little aside – when I first wrote the demo script I forgot to put in the commit after the insert /*+ append */, which meant I was trying to create column group stats on a table in a state that should have given me the Oracle error “ORA-12838: cannot read/modify an object after modifying it in parallel”; instead the error was trapped by the dbms_stats package and showed up as the slightly confusing:


select dbms_stats.create_extended_stats(user,'t1','(n1, n2)') name from dual
       *
ERROR at line 1:
ORA-20001: Error when processing extension -  resource busy and acquire with NOWAIT specified or timeout expired
ORA-06512: at "SYS.DBMS_STATS", line 12977
ORA-06512: at "SYS.DBMS_STATS", line 44967
ORA-06512: at "SYS.DBMS_STATS", line 44986

May 1, 2014

Delete Histogram

Filed under: Histograms,Oracle,Statistics — Jonathan Lewis @ 1:27 pm GMT May 1,2014

Here’s a note that I drafted in November 2010, then didn’t publish. I found it earlier this morning while looking for another note I’d written about histograms so, even though it may not be something that people need so much these days, I thought: better late than never.

I’ve pointed out in the past that I’m not keen on seeing lots of histograms on a system and tend to delete them if I think they are not needed. Here’s an example of the type of code I use to delete a histogram.

declare

	srec			dbms_stats.statrec;
	m_distcnt		number;
	m_density		number;
	m_nullcnt		number;
	m_avgclen		number;

	n_array                 dbms_stats.numarray;

begin

	dbms_stats.get_column_stats(
		ownname		=> user,
		tabname		=> 't1',
		colname		=> 'n1',
		distcnt		=> m_distcnt,
		density		=> m_density,
		nullcnt		=> m_nullcnt,
		srec		=> srec,
		avgclen		=> m_avgclen
	); 

	srec.bkvals := null;
	srec.novals :=	dbms_stats.numarray(
				utl_raw.cast_to_number(srec.minval),
				utl_raw.cast_to_number(srec.maxval)
			);
	srec.epc := 2;
	dbms_stats.prepare_column_values(srec, srec.novals);

	m_density := 1/m_distcnt;

	dbms_stats.set_column_stats(
		ownname		=> user,
		tabname		=> 't1',
		colname		=> 'n1',
		distcnt		=> m_distcnt,
		density		=> m_density,
		nullcnt		=> m_nullcnt,
		srec		=> srec,
		avgclen		=> m_avgclen
	); 

exception
	when others then
		raise;		-- should handle div/0

end;
/

The code basically reads the column stats, resets the histogram figures to just the low and high values for the column, setting the endpoint-count to two, then adjusts the density to the standard for a column with no histogram. This specific example is for a numeric column.
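
If you need to do the same thing to a character column, the only changes are the array type and the conversion of the raw low/high values. Here’s a lightly adapted sketch – the column name v1 is hypothetical, and it assumes a simple single-byte varchar2() column:

declare

	srec			dbms_stats.statrec;
	m_distcnt		number;
	m_density		number;
	m_nullcnt		number;
	m_avgclen		number;

	c_array			dbms_stats.chararray;

begin

	dbms_stats.get_column_stats(
		ownname		=> user,
		tabname		=> 't1',
		colname		=> 'v1',
		distcnt		=> m_distcnt,
		density		=> m_density,
		nullcnt		=> m_nullcnt,
		srec		=> srec,
		avgclen		=> m_avgclen
	);

	srec.bkvals	:= null;
	c_array		:= dbms_stats.chararray(
				utl_raw.cast_to_varchar2(srec.minval),
				utl_raw.cast_to_varchar2(srec.maxval)
			);
	srec.epc	:= 2;
	dbms_stats.prepare_column_values(srec, c_array);

	m_density	:= 1/m_distcnt;

	dbms_stats.set_column_stats(
		ownname		=> user,
		tabname		=> 't1',
		colname		=> 'v1',
		distcnt		=> m_distcnt,
		density		=> m_density,
		nullcnt		=> m_nullcnt,
		srec		=> srec,
		avgclen		=> m_avgclen
	);

end;
/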

Footnote: my preferred method of collecting statistics is to use method_opt => ‘for all columns size 1’ (i.e. no histograms) and then run scripts to create the histograms I want. This means that after any stats collection I need to run code that checks to see which tables have new stats, and then re-run any histogram code that I’ve written for that table.

To move from Oracle’s default histogram collection to this strategy, you could start by switching to method_opt => ‘for all columns size repeat’ (i.e. recreate existing histograms, don’t create new ones), then simply delete histograms as you find that you don’t need them, and introduce scripts to recreate the histograms that you do need. When you’ve finally got to the point where every histogram is scripted you can then switch to method_opt => ‘for all columns size 1’.
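
As a reminder of what that intermediate step looks like, it’s the same gather call as usual but with the ‘repeat’ option, so existing histograms are re-created and no new ones appear:

begin
	dbms_stats.gather_table_stats(
		ownname		=> user,
		tabname		=> 'T1',
		method_opt	=> 'for all columns size repeat'
	);
end;
/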

Footnote 2: Since 2010, when I drafted this note, Oracle 12c has been launched, and the changes it has introduced for frequency and Top-N histograms mean that I’m far less insistent that if a histogram is worth having it’s worth writing the code to create it. There’s a series of three articles about 12c histograms in particular at this link.

March 2, 2014

Auto Sample Size

Filed under: Function based indexes,Indexing,Infrastructure,IOT,LOBs,Oracle,Statistics — Jonathan Lewis @ 6:38 pm GMT Mar 2,2014

In the past I have enthused mightily about the benefits of the approximate NDV mechanism and the benefit of using auto_sample_size to collect statistics in 11g; however, as so often happens with Oracle features, there’s a down-side, boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table increased dramatically after the addition of an index – where the index had this definition:


create bitmap index i_s_rmp_eval_csc_msg_actions on
    s_rmp_evaluation_csc_message (
        decode(instr(xml_message_text,' '),0,0,1)
    )
;

As you might guess from the column name, this is an index based on an XML column, which is stored as a CLOB.

In a similar vein, I showed you a few days ago an old example I had of indexing a CLOB column with a call to dbms_lob.getlength(). Both index examples suffer from the same problem – to support the index Oracle creates a hidden (virtual) column on the table that can be used to hold statistics about the values of the function; the actual calculated values for the function call are stored in the index but not in the table itself – but it’s still important for the optimizer to have statistics about those “non-existent” column values.
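
If you want to see the supporting column for yourself, a query like this against the table from the example above should show it flagged as both hidden and virtual. This is just a sketch – the column will have a system-generated name (something like SYS_NC00nnn$):

select
	column_name,
	hidden_column,
	virtual_column,
	num_distinct,
	last_analyzed
from
	user_tab_cols
where
	table_name = 'S_RMP_EVALUATION_CSC_MESSAGE'
order by
	column_id
;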

(more…)

December 9, 2013

Bitmap join indexes

Filed under: Indexing,Oracle,Statistics — Jonathan Lewis @ 6:01 pm GMT Dec 9,2013

Here’s another of my “draft” notes that needs some expansion and, most importantly, proof.

I have a fact table with a status id column that shows a massive skew. But I also have a dimension table that holds the “status code” so (in theory, at least) I have to do a join from the statuses table to the facts table to find rows of a given status. Unfortunately the join hides the skew:

(more…)

October 9, 2013

12c Histograms pt.3

Filed under: 12c,Histograms,Oracle,Statistics — Jonathan Lewis @ 8:13 pm GMT Oct 9,2013

It has taken much longer than I anticipated to get around to writing part 3 of this mini-series on what Oracle has done about histograms in 12c.
In part 1 I gave a thumbnail sketch of the three types of histogram available in 12c
In part 2 I described in some detail the improvements in performance and accuracy for the frequency and top-frequency histograms

In part 3 of this mini-series I’ll be describing the implementation of the “hybrid” histogram that Oracle produces if the “approximate NDV” mechanism has been enabled and you’ve left the estimate_percent at its default of auto_sample_size. There is little difference between the work needed to create a hybrid histogram and the work needed to generate the old “height-balanced” histogram, but the degree of information captured by the hybrid is much greater than that of the height-balanced.

(more…)

September 27, 2013

Virtual Stats

Filed under: CBO,Execution plans,Oracle,Statistics — Jonathan Lewis @ 6:49 am GMT Sep 27,2013

Or – to be more accurate – real statistics on a virtual column.

This is one of the “10 top tips” that I came up with for my session with Maria Colgan at OOW13. A method of giving more information that might improve execution plans when you can’t change the code. I’ll start with a small data set including a virtual column (running 11.1.0.7), and a couple of problem queries:
(more…)
