Here’s the output I get from querying *dba_tab_histograms* for a column that has been given a frequency histogram by a call to **dbms_stats.gather_table_stats()**:

```
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              2             -1
           3190              0
           3813              1
           4310              4
           4480              5
           7507            999
          17210           2000
          17212           2002
          17213           2004
          17215           2006
          17729           2008
          17750           2009
          17752           2010
          17904           2011
          17906           2012
          17909           2017
          17994           5000
          18006           5001
          18009           5002
          18023           5003
          18062           6001
          39885          11000
```

In an *earlier blog* I wrote about a query that turned the figures from a frequency histogram into a list of column values and column frequencies, and if I had used that query against the histogram data I would have found that the value 11,000 appears 21,823 times – according to the histogram.
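A sketch of the type of query I mean – the table and column names (t1, n1) are hypothetical, and this assumes you are the owner of the table. The frequency of each value is simply the current *endpoint_number* minus the previous one:

```sql
rem
rem  Sketch only -- table name T1 and column name N1 are hypothetical.
rem  For a frequency histogram each endpoint_number is a running total,
rem  so the count for a value is the step from the previous endpoint.
rem

select
        endpoint_value          column_value,
        endpoint_number
      - lag(endpoint_number, 1, 0) over (
                order by endpoint_number
        )                       frequency
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'N1'
order by
        endpoint_number
;
```

Run against the figures above, the last row of the output would show the value 11,000 with a frequency of 39,885 – 18,062 = 21,823.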

But there’s a problem with this histogram: the value 11,000 actually appeared roughly 2 million times in the real data set – and when I enabled autotrace on the query I had used to count the occurrences of 11,000, the optimizer’s prediction (the cardinality in the execution plan) matched the actual value fairly closely. So how did the optimizer get from the histogram to the correct cardinality?

You have to remember that Oracle may have sampled the data to generate the stats, and it may have sampled different columns with different sample sizes. So there has to be a way for the optimizer to allow for this variation in sample sizes, and it works as follows:

Checking *num_rows* in *dba_tables* I got 15,819,184.

Checking *num_nulls* in *dba_tab_columns* for this column I got 11,976,735.

For simplicity let’s call these figures 16 million and 12 million. With these approximations we can see that roughly 4 million rows must have a value for this column.

Checking back to the frequency histogram we see the total row count sampled (which we get from the final (i.e. maximum) endpoint_number) is 39,885 – which we’ll call 40,000.

We have a histogram that sampled 40,000 rows out of 4,000,000 (viz: 1%) – so the optimizer scales the histogram up by a factor of 100 when it uses it, and the histogram figure of 21,823 is scaled up to (roughly) 2,000,000.
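The scaling arithmetic can be sketched as a single dictionary query – again t1 and n1 are hypothetical names, and this uses the user_* views rather than the dba_* views for simplicity:

```sql
rem
rem  Sketch of the scaling arithmetic -- T1 and N1 are hypothetical.
rem  Scale factor = (num_rows - num_nulls) / (rows captured by the
rem  histogram, i.e. the maximum endpoint_number).
rem

select
        round(
                (tab.num_rows - col.num_nulls) /
                (select max(endpoint_number)
                 from   user_tab_histograms
                 where  table_name  = 'T1'
                 and    column_name = 'N1')
        )       scaling_factor
from
        user_tables      tab,
        user_tab_columns col
where
        tab.table_name  = 'T1'
and     col.table_name  = 'T1'
and     col.column_name = 'N1'
;
```

With the exact figures above this gives (15,819,184 – 11,976,735) / 39,885, a factor of about 96, and 21,823 × 96 is roughly 2.1 million – close to the rounded 1% / factor-of-100 approximation.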

I am looking for a nice way to gather statistics for one column (e.g. after it has been added to a table and populated) without updating any of the other table statistics. My search turned up this blog entry with its reference to DBMS_STATS.GATHER_COLUMN_STATS.

Is this a misprint or are the numerous references to this procedure on the web referring to something real?

Comment by Tony Hasler — September 26, 2010 @ 3:00 pm BST

Tony,

Sorry, it’s a typo (now corrected).

The mechanism you want is probably “gather_table_stats()” with *method_opt* set to ‘for columns {column name} size {whatever}’. Better run a quick test, though, in case this also does something you’d rather not see to the table-level stats such as num_rows or blocks.
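A minimal sketch of that call – the table name t1 and column name n1 are hypothetical, and (as noted) you should check the table-level stats before and after:

```sql
rem
rem  Sketch only -- T1 and N1 are hypothetical names.
rem  Gathers stats for just the named column; cascade is disabled
rem  so index stats are not touched.
rem

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'T1',
                method_opt => 'for columns n1 size 254',
                cascade    => false
        );
end;
/
```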

Comment by Jonathan Lewis — September 26, 2010 @ 3:04 pm BST

Jonathan, sorry for taking so long to acknowledge your reply. Yes, your suggested approach does update the table statistics. The only way I can see to do what I want is to specify a statistics table on the gather_table_stats call, then delete the row for the column in that statistics table (if any), and then re-import.
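That sequence might look something like the following – all names are hypothetical, and note that the layout of the user statistics table is undocumented, so the delete predicate should be checked on your own version before you rely on it:

```sql
rem
rem  Sketch only -- T1, N1 and MY_STATS are hypothetical names.
rem  The stattab parameter makes gather_table_stats save the PREVIOUS
rem  stats into MY_STATS before overwriting them; deleting the saved
rem  row for column N1 and re-importing then restores everything
rem  except that column's (new) stats.
rem

exec dbms_stats.create_stat_table(user, 'MY_STATS')

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'T1',
                method_opt => 'for columns n1 size 254',
                stattab    => 'MY_STATS'
        );
end;
/

delete from my_stats
where   type = 'C'              -- column-statistics rows (undocumented layout)
and     c4   = 'N1'             -- assumed to hold the column name
;

exec dbms_stats.import_table_stats(user, 'T1', stattab => 'MY_STATS')
```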

Hardly elegant. Such is life.

Comment by tonyhasler — October 1, 2010 @ 5:25 pm BST