<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Frequency Histograms 2</title>
	<atom:link href="http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/feed/" rel="self" type="application/rss+xml" />
	<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/</link>
	<description>Just another Oracle weblog</description>
	<lastBuildDate>Sat, 18 May 2013 11:04:10 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: Sigrid Keydana</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37673</link>
		<dc:creator><![CDATA[Sigrid Keydana]]></dc:creator>
		<pubDate>Tue, 09 Nov 2010 13:53:31 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37673</guid>
		<description><![CDATA[Thanks a lot Jonathan for the answer. 

Regarding your explanation,

&quot; ... and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values ... &quot;
&quot;... the stats collection would report that there were only two distinct values in the table ...&quot; -

what puzzles me here is that the sampling mechanism seems to be accurate enough to get the (possibly) correct number of distinct values, but then constructs the histogram in another way (so it even looked &quot;on purpose&quot; to me...).

I&#039;ve now had the chance to get the &quot;real counts&quot; at the same time as performing the statistics gathering on a clone of the production database. Picking a column with a substantial difference between num_buckets and num_distinct, I get e.g.:

sqlplus&gt; select mycol, count(1) from myschema.mytable group by mycol order by mycol;

mycol   COUNT(1)
--------------------- ----------
                    1       1072
                    2          4
                    3      15334
                    6       7536
                    7       3315
                   10        473
                   11         61
                   12        124
                   20        900
                   42          1
                   50        979
                   55          1
                   62          2
                   71      25619
                   82       5141
                   83       8708
                   84       1224
                   85       1429
                   99       3518
                  116       1807
                  118          8
                  119         28
                  126       2142
                  128          1
                  129       3324
                  139        431
                      69223


sqlplus&gt; select endpoint_value, endpoint_number from dba_tab_histograms where owner=&#039;myschema&#039; and column_name=&#039;mycol&#039; order by endpoint_value;

ENDPOINT_VALUE ENDPOINT_NUMBER
-------------- ---------------
             1              38
             3             626
             6             900
             7             996
            10            1006
            11            1007
            12            1009
            20            1032
            50            1068
            71            1979
            82            2172
            83            2503
            84            2549
            85            2603
            99            2729
           116            2779
           126            2865
           129            2983
           139            3005

sqlplus&gt; select column_name, histogram, num_nulls, num_buckets, num_distinct, abs(num_distinct - num_buckets) diff, sample_size, last_analyzed  from dba_tab_col_statistics  where histogram in (&#039;FREQUENCY&#039;,
&#039;HEIGHT BALANCED&#039;) and owner=&#039;myschema&#039; and table_name=&#039;mytable&#039; order by histogram, diff desc, column_name;

COLUMN_NAME                    HISTOGRAM        NUM_NULLS NUM_BUCKETS NUM_DISTINCT       DIFF SAMPLE_SIZE LAST_ANALYZED
------------------------------ --------------- ---------- ----------- ------------ ---------- ----------- --------------------
...
mycol         FREQUENCY            69223          19           26          7        3005 09-NOV-2010 10:39:48


sqlplus&gt; select count(*) from myschema.mytable where mycol is not null;

  COUNT(*)
----------
     83182



I unfortunately don&#039;t have the time to pursue this further with the other columns now, but I wonder whether there might be some algorithm like &quot;if the count of a value is less than e.g. total_count / sample_size (or some proportion of this), don&#039;t build a bucket for it&quot; - here for example, the &quot;most frequent&quot; value that got no bucket has count 28, and the ratio total_count / sample_size is 27.68... (of course this is not really much data for a guess yet :-;)

Thanks again,
Sigrid]]></description>
		<content:encoded><![CDATA[<p>Thanks a lot Jonathan for the answer. </p>
<p>Regarding your explanation,</p>
<p>&#8220; &#8230; and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values &#8230; &#8221;<br />
&#8220;&#8230; the stats collection would report that there were only two distinct values in the table &#8230;&#8221; -</p>
<p>what puzzles me here is that the sampling mechanism seems to be accurate enough to get the (possibly) correct number of distinct values, but then constructs the histogram in another way (so it even looked &#8220;on purpose&#8221; to me&#8230;).</p>
<p>I&#8217;ve now had the chance to get the &#8220;real counts&#8221; at the same time as performing the statistics gathering on a clone of the production database. Picking a column with a substantial difference between num_buckets and num_distinct, I get e.g.:</p>
<p>sqlplus&gt; select mycol, count(1) from myschema.mytable group by mycol order by mycol;</p>
<p>mycol   COUNT(1)<br />
--------------------- ----------<br />
                    1       1072<br />
                    2          4<br />
                    3      15334<br />
                    6       7536<br />
                    7       3315<br />
                   10        473<br />
                   11         61<br />
                   12        124<br />
                   20        900<br />
                   42          1<br />
                   50        979<br />
                   55          1<br />
                   62          2<br />
                   71      25619<br />
                   82       5141<br />
                   83       8708<br />
                   84       1224<br />
                   85       1429<br />
                   99       3518<br />
                  116       1807<br />
                  118          8<br />
                  119         28<br />
                  126       2142<br />
                  128          1<br />
                  129       3324<br />
                  139        431<br />
                      69223</p>
<p>sqlplus&gt; select endpoint_value, endpoint_number from dba_tab_histograms where owner='myschema' and column_name='mycol' order by endpoint_value;</p>
<p>ENDPOINT_VALUE ENDPOINT_NUMBER<br />
-------------- ---------------<br />
             1              38<br />
             3             626<br />
             6             900<br />
             7             996<br />
            10            1006<br />
            11            1007<br />
            12            1009<br />
            20            1032<br />
            50            1068<br />
            71            1979<br />
            82            2172<br />
            83            2503<br />
            84            2549<br />
            85            2603<br />
            99            2729<br />
           116            2779<br />
           126            2865<br />
           129            2983<br />
           139            3005</p>
<p>sqlplus&gt; select column_name, histogram, num_nulls, num_buckets, num_distinct, abs(num_distinct - num_buckets) diff, sample_size, last_analyzed  from dba_tab_col_statistics  where histogram in ('FREQUENCY',<br />
'HEIGHT BALANCED') and owner='myschema' and table_name='mytable' order by histogram, diff desc, column_name;</p>
<p>COLUMN_NAME                    HISTOGRAM        NUM_NULLS NUM_BUCKETS NUM_DISTINCT       DIFF SAMPLE_SIZE LAST_ANALYZED<br />
------------------------------ --------------- ---------- ----------- ------------ ---------- ----------- --------------------<br />
&#8230;<br />
mycol         FREQUENCY            69223          19           26          7        3005 09-NOV-2010 10:39:48</p>
<p>sqlplus&gt; select count(*) from myschema.mytable where mycol is not null;</p>
<p>  COUNT(*)<br />
----------<br />
     83182</p>
<p>I unfortunately don&#8217;t have the time to pursue this further with the other columns now, but I wonder whether there might be some algorithm like &#8220;if the count of a value is less than e.g. total_count / sample_size (or some proportion of this), don&#8217;t build a bucket for it&#8221; &#8211; here for example, the &#8220;most frequent&#8221; value that got no bucket has count 28, and the ratio total_count / sample_size is 27.68&#8230; (of course this is not really much data for a guess yet :-;)</p>
<p>Thanks again,<br />
Sigrid</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Lewis</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37669</link>
		<dc:creator><![CDATA[Jonathan Lewis]]></dc:creator>
		<pubDate>Mon, 08 Nov 2010 18:44:00 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37669</guid>
		<description><![CDATA[Sigrid,

It&#039;s a good question, as it&#039;s in the right technical area for this blog, suitably general in content, and one that&#039;s commonly asked and addressed on the web.

You didn&#039;t give me the details of a specific example - but I&#039;d guess that the problem lies in sampling. If you have a relatively small number of distinct values that cover MOST of the data and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values.

A recent client had a column where two values covered about 10 million rows in a table, leaving a couple of hundred rows for the remaining five or six values.  From time to time the stats collection would report that there were only two distinct values in the table - and it rarely managed to report every single value. It&#039;s in cases like this that you might want to write a program to create and fix some representative stats.
]]></description>
		<content:encoded><![CDATA[<p>Sigrid,</p>
<p>It&#8217;s a good question, as it&#8217;s in the right technical area for this blog, suitably general in content, and one that&#8217;s commonly asked and addressed on the web.</p>
<p>You didn&#8217;t give me the details of a specific example &#8211; but I&#8217;d guess that the problem lies in sampling. If you have a relatively small number of distinct values that cover MOST of the data and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values.</p>
<p>A recent client had a column where two values covered about 10 million rows in a table, leaving a couple of hundred rows for the remaining five or six values.  From time to time the stats collection would report that there were only two distinct values in the table &#8211; and it rarely managed to report every single value. It&#8217;s in cases like this that you might want to write a program to create and fix some representative stats.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sigrid Keydana</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37666</link>
		<dc:creator><![CDATA[Sigrid Keydana]]></dc:creator>
		<pubDate>Mon, 08 Nov 2010 16:22:35 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37666</guid>
		<description><![CDATA[Hi Jonathan,

I don&#039;t know if it&#039;s okay to ask a question that&#039;s rather &quot;peripherally related&quot; here, but this being one of the more &quot;basic&quot; ones among your posts related to frequency histograms (and as I don&#039;t find any related documentation on the net) I&#039;ll just try :-;
Looking at dba_tab_col_statistics, I find lots of frequency histograms where num_buckets is (even substantially) lower than num_distinct (in 11.1.0). However, in the literature (also in Oracle&#039;s Performance Tuning Guide) it always seems to say that the distinctive feature of a frequency histogram is that there&#039;s one bucket per distinct value...
Am I getting this totally wrong here? Or might there be a (seldom mentioned?) algorithm that lets Oracle skip some values (perhaps being too infrequent?)

Thanks a lot in advance
Sigrid]]></description>
		<content:encoded><![CDATA[<p>Hi Jonathan,</p>
<p>I don&#8217;t know if it&#8217;s okay to ask a question that&#8217;s rather &#8220;peripherally related&#8221; here, but this being one of the more &#8220;basic&#8221; ones among your posts related to frequency histograms (and as I don&#8217;t find any related documentation on the net) I&#8217;ll just try :-;<br />
Looking at dba_tab_col_statistics, I find lots of frequency histograms where num_buckets is (even substantially) lower than num_distinct (in 11.1.0). However, in the literature (also in Oracle&#8217;s Performance Tuning Guide) it always seems to say that the distinctive feature of a frequency histogram is that there&#8217;s one bucket per distinct value&#8230;<br />
Am I getting this totally wrong here? Or might there be a (seldom mentioned?) algorithm that lets Oracle skip some values (perhaps being too infrequent?)</p>
<p>Thanks a lot in advance<br />
Sigrid</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Frequency Histogram 4 &#171; Oracle Scratchpad</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37479</link>
		<dc:creator><![CDATA[Frequency Histogram 4 &#171; Oracle Scratchpad]]></dc:creator>
		<pubDate>Tue, 05 Oct 2010 18:27:20 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37479</guid>
		<description><![CDATA[[...] Filed under: Statistics,Troubleshooting &#8212; Jonathan Lewis @ 6:25 pm UTC Oct 5,2010   In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra [...]]]></description>
		<content:encoded><![CDATA[<p>[...] Filed under: Statistics,Troubleshooting &#8212; Jonathan Lewis @ 6:25 pm UTC Oct 5,2010   In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Lewis</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37462</link>
		<dc:creator><![CDATA[Jonathan Lewis]]></dc:creator>
		<pubDate>Sat, 02 Oct 2010 07:12:54 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37462</guid>
		<description><![CDATA[Eric,

Ignore my previous comment - it was just too early in the morning - you&#039;re right, we need to floor() or trunc() the value.]]></description>
		<content:encoded><![CDATA[<p>Eric,</p>
<p>Ignore my previous comment &#8211; it was just too early in the morning &#8211; you&#8217;re right, we need to floor() or trunc() the value.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: fsm</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37460</link>
		<dc:creator><![CDATA[fsm]]></dc:creator>
		<pubDate>Sat, 02 Oct 2010 05:11:36 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37460</guid>
		<description><![CDATA[Just add floor()

select to_char(floor(2454894.89011574), &#039;FM999999999&#039;) val from dual;

VAL
----------
2454894]]></description>
		<content:encoded><![CDATA[<p>Just add floor()</p>
<p>select to_char(floor(2454894.89011574), 'FM999999999') val from dual;</p>
<p>VAL<br />
----------<br />
2454894</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric Evans</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37453</link>
		<dc:creator><![CDATA[Eric Evans]]></dc:creator>
		<pubDate>Fri, 01 Oct 2010 16:31:24 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37453</guid>
		<description><![CDATA[Yep. I am on 10.2.0.4 Linux and when I do

select to_char(2454894.89011574,&#039;FM99999999&#039;) char_val from dual;

CHAR_VAL
---------
2454895

According to the documentation:
&quot;All number format models cause the number to be rounded to the specified number of significant digits.&quot;

I wonder what settings affect this rounding, or whether it is an outright bug. I checked Metalink and could find nothing relevant. For what it&#039;s worth, when I do a &quot;show&quot; of my NLS parameters, the only ones set are:
nls_language		AMERICAN
nls_length_semantics	BYTE
nls_nchar_conv_excp	FALSE
nls_territory		AMERICA]]></description>
		<content:encoded><![CDATA[<p>Yep. I am on 10.2.0.4 Linux and when I do</p>
<p>select to_char(2454894.89011574,'FM99999999') char_val from dual;</p>
<p>CHAR_VAL<br />
---------<br />
2454895</p>
<p>According to the documentation:<br />
&#8220;All number format models cause the number to be rounded to the specified number of significant digits.&#8221;</p>
<p>I wonder what settings affect this rounding, or whether it is an outright bug. I checked Metalink and could find nothing relevant. For what it&#8217;s worth, when I do a &#8220;show&#8221; of my NLS parameters, the only ones set are:<br />
nls_language		AMERICAN<br />
nls_length_semantics	BYTE<br />
nls_nchar_conv_excp	FALSE<br />
nls_territory		AMERICA</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Lewis</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37437</link>
		<dc:creator><![CDATA[Jonathan Lewis]]></dc:creator>
		<pubDate>Thu, 30 Sep 2010 19:01:55 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37437</guid>
		<description><![CDATA[Eric,

Strange - are you sure you&#039;re not seeing the effects of the default format for dates on your system ?   Putting the value from your example into my formula (selecting from dual) gives me the date and time that I expect to see.]]></description>
		<content:encoded><![CDATA[<p>Eric,</p>
<p>Strange &#8211; are you sure you&#8217;re not seeing the effects of the default format for dates on your system ?   Putting the value from your example into my formula (selecting from dual) gives me the date and time that I expect to see.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric Evans</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37432</link>
		<dc:creator><![CDATA[Eric Evans]]></dc:creator>
		<pubDate>Thu, 30 Sep 2010 05:28:48 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37432</guid>
		<description><![CDATA[Jonathan, 
 
 Had a problem with the expression above. For example, if you had the number 2454894.89011574 there appears to be some rounding to 2454895. I used the following and got the right values:

to_date(floor(endpoint_value) &#124;&#124; &#039;.&#039; &#124;&#124;
               to_char(86400 * MOD(endpoint_value, 1), &#039;FM999999999&#039;)
              ,&#039;J.sssss&#039;)

Thanks for your work on this subject.]]></description>
		<content:encoded><![CDATA[<p>Jonathan, </p>
<p> Had a problem with the expression above. For example, if you had the number 2454894.89011574 there appears to be some rounding to 2454895. I used the following and got the right values:</p>
<p>to_date(floor(endpoint_value) || '.' ||<br />
               to_char(86400 * MOD(endpoint_value, 1), 'FM999999999')<br />
              ,'J.sssss')</p>
<p>Thanks for your work on this subject.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jonathan Lewis</title>
		<link>http://jonathanlewis.wordpress.com/2010/09/20/frequency-histograms-2/#comment-37417</link>
		<dc:creator><![CDATA[Jonathan Lewis]]></dc:creator>
		<pubDate>Wed, 29 Sep 2010 14:05:48 +0000</pubDate>
		<guid isPermaLink="false">http://jonathanlewis.wordpress.com/?p=4555#comment-37417</guid>
		<description><![CDATA[CJ,

This was my error (and a common error in the Oracle world) - I forgot to generalise the principle. I ran up a quick data set to test the code and used something that generated &#039;date-only&#039; values, but the column (and histogram) could hold &#039;date and time&#039; values. 

In the case of &#039;date and time&#039;, the time component is stored as the fraction of a day, for example, 6:00 pm on 29th Sept 2010 would be stored as: 2455469.75

This is why your format model ran into error 1481 - it didn&#039;t match the input value. The following expression should work:
[sourcecode gutter=&quot;false&quot;]
	to_date(
                to_char(endpoint_value,&#039;FM99999999&#039;) &#124;&#124; &#039;.&#039; &#124;&#124;
                to_char(86400 * mod(endpoint_value,1),&#039;FM99999&#039;),
        	&#039;J.sssss&#039;
	) ep_value
[/sourcecode]

This splits the value into a day part and fraction of day, multiplies the fraction by the number of seconds in a day, converts both bits to character (with no spaces), concatenates them with a &#039;.&#039; in the middle, and then uses the &#039;Julian&#039; and &#039;seconds&#039; conversion format.]]></description>
		<content:encoded><![CDATA[<p>CJ,</p>
<p>This was my error (and a common error in the Oracle world) &#8211; I forgot to generalise the principle. I ran up a quick data set to test the code and used something that generated &#8216;date-only&#8217; values, but the column (and histogram) could hold &#8216;date and time&#8217; values. </p>
<p>In the case of &#8216;date and time&#8217;, the time component is stored as the fraction of a day, for example, 6:00 pm on 29th Sept 2010 would be stored as: 2455469.75</p>
<p>This is why your format model ran into error 1481 &#8211; it didn&#8217;t match the input value. The following expression should work:</p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
	to_date(
                to_char(endpoint_value,'FM99999999') || '.' ||
                to_char(86400 * mod(endpoint_value,1),'FM99999'),
        	'J.sssss'
	) ep_value
</pre>
<p>This splits the value into a day part and fraction of day, multiplies the fraction by the number of seconds in a day, converts both bits to character (with no spaces), concatenates them with a &#8216;.&#8217; in the middle, and then uses the &#8216;Julian&#8217; and &#8216;seconds&#8217; conversion format.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
