Here’s the output I get from querying dba_tab_histograms for a column that has been given a frequency histogram by a call to dbms_stats.gather_table_stats().
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
2 -1
3190 0
3813 1
4310 4
4480 5
7507 999
17210 2000
17212 2002
17213 2004
17215 2006
17729 2008
17750 2009
17752 2010
17904 2011
17906 2012
17909 2017
17994 5000
18006 5001
18009 5002
18023 5003
18062 6001
39885 11000
In an earlier blog I wrote about a query that turned the figures from a frequency histogram into a list of column values and column frequencies, and if I had used that query against the histogram data I would have found that the value 11,000 appears 21,827 times – according to the histogram.
But there’s a problem with this histogram: the value 11,000 actually appeared roughly 2 million times in the real data set – and when I enabled autotrace on the query that I had used to count the number of times that 11,000 appeared the optimizer’s prediction (the cardinality in the execution plan) matched the actual value fairly closely. So how did the optimizer manage to get from the histogram to the correct cardinality ?
You have to remember that Oracle may have sampled the data to generate the stats, and it may have sampled different columns with different sample sizes. So there has to be a way for the optimizer to allow for this variation in sample sizes, and it works as follows:
Checking dba_tables for num_rows I got 15,819,184.
Checking dba_tab_columns for num_nulls for this column I got 11,976,735
For simplicity let’s call these figures 16 million and 12 million. With these approximations we can see that roughly 4 million rows must have a value for this column.
Checking back to the frequency histogram we see the total row count sampled (which we get from the final (i.e. maximum) endpoint_number) is 39,885 – which we’ll call 40,000.
We have a histogram that sampled 40,000 rows out of 4,000,000 (viz: 1%) – so the optimizer scales the histogram up by a factor of 100 when it uses it, and the histogram figure of 21,827 is scaled up to (roughly) 2,000,000.

I am looking for a nice way to gather statistics for one column (e.g. after it has been added to a table and populated) without updating any of the other table statistics. My search turned up this blog entry with its reference to DBMS_STATS.GATHER_COLUMN_STATS.
Is this a misprint or are the numerous references to this procedure on the web referring to something real?
Comment by Tony Hasler — September 26, 2010 @ 3:00 pm UTC Sep 26,2010 |
Tony,
Sorry, it’s typo (now corrected).
The mechanism you want is probably: “gather_table_stats()” with the method_opt set to ‘for columns {column name} size {whatever}’. Better do a quick check test, though, in case this also does something you’d rather not see to the table-level stats like num_rows or blocks.
Comment by Jonathan Lewis — September 26, 2010 @ 3:04 pm UTC Sep 26,2010 |
Jonathan, sorry for taking so long to acknowledge your reply. Yes, your suggested approach does update the table statistics. The only way I can see to do what I want is to specify a statistics table on the gather_table_stats call, then delete the row for the column in that statistics table (if any), and then re-import.
Hardly elegant. Such is life.
Comment by tonyhasler — October 1, 2010 @ 5:25 pm UTC Oct 1,2010 |