Oracle Scratchpad

January 22, 2013

Statistics

Filed under: SQL Server — Jonathan Lewis @ 9:55 am BST Jan 22,2013

Updated 22nd Jan
I’ve previously advertised the fact that the latest online discussion that Red Gate has arranged for me to have with Grant Fritchey will be on 23rd Jan and we will be talking about statistics. If you’ve listened in to any of these talks in the past you’ll realise that they are completely unscripted; what you get is a couple of guys in a (virtual) pub comparing and contrasting their favourite database engines and trying to learn a little bit about how the other technology works.

It’s not completely unprepared, though; we’ve usually exchanged a couple of notes with bullet points about the ideas that might come up, and some sort of progression. That doesn’t mean that we’ll hit all the bullet points, or follow the order, or not end up talking about something completely unexpected, but it does mean that James Murtagh (our chairman for the event) has a few topics up his sleeve that he can prod us with if he thinks we are going too far off topic.

If you want some idea of how the conversation might go this time around, here’s a copy of an email I’ve just sent to James and Grant:

A couple of ideas for focal points for Wednesday

    •  Why do we need statistics    
    • What statistics are (or could be ) created            
      • How automatically (if at all)            
      • What does it cost    
    • How are statistics used    
    • What statistics do we need that aren’t collected    
    • What do we do if the statistics are missing

A few briefing comments for Oracle

a) The key reason for collecting stats is to allow the optimizer to work out how much data it will acquire as it steps through an execution plan, and how scattered that data is.  The “clustering_factor” of an index is one of the most significant things we can learn about an index and the “num_distinct” for a column combined with the “num_rows” for a table.

b) Oracle doesn’t collect statistics automatically as the data changes (there is a small change coming in 12c) – but there is a default overnight job that checks for objects with stale statistics and refreshes them automatically – this can actually cause problems. There is also a package which allows us to be very flexible about collecting statistics, or even inventing statistics.

c) To handle skewed data distribution you can create a histogram of up to 254 buckets to describe the contents of a column.

d) Recent innovations (11g) allow Oracle to be very efficient at getting accurate num_distinct for a column – but histograms are very expensive to gather if you use a large sample, and potentially very inaccurate if you use a small sample (12c has introduced some great enhancements here in terms of performance of collection and quality of histogram.)

e) Recently we were allowed to create virtual columns and statistics on virtual columns. We can also create statistics on groups of columns in the same table. We can’t create statistics across tables – even though we can (for example) create bitmap join indexes. This would be very helpful – if the stats could be kept reasonably accurate fairly cheaply.

f) For indexes the only column-related stats we collect are the number of distinct keys – if we want to get stats on partial keys we have to manual define “column groups”.

g) If stats on a table are missing, Oracle can sample some blocks in the table to see what the data looks like – the sample tends to be small. We can change the sampling rules to deal with various problems that arise if we have functions applied to columns in where clauses, or need statistics about correlated columns.

h) We have tools and mechanisms that allow us to associate “statistical corrections” to queries. Technically we can create these by hand – although the method is undocumented and unsupported – officially we need to license the Diagnostic Pack and Performance Pack to let Oracle analyze the statement thoroughly and produce the corrections.

i) There are cases where the optimizer simply guesses because there is no way to produce a justifiable estimate of statistics behaviour across tables – e.g. select where exists (subquery) assumes a 1% “survival rate”. Select where table1_column > table2_column assumes a 5% survival rate on the Cartesian join.

Addendum

Grant Fritchey has published an article on Simple Talk (the SQL Server site managed by Red Gate) covering some features of Statistics in SQL Server, so we’ll probably spend some time browsing through some of the topics there in our comparison between Oracle and SQL Server.

4 Comments »

  1. Hi Jonathan – What’s your take on hadoop and Hbase. I’m hearing a lot lately about big data, hadoop, HDFS, MapReduce.

    Appreciare your insight

    Comment by Gaurang — December 26, 2012 @ 2:02 pm BST Dec 26,2012 | Reply

    • You can either pick the most cost-effective technology for the problem you really need to solve, or you can pick the technology that can solve all your problems even though it may make it harder to solve some of those problems.

      Comment by Jonathan Lewis — December 28, 2012 @ 11:04 am BST Dec 28,2012 | Reply

      • There’s rarely just one problem to solve. It’s not a zero-sum game if you have people willing to throw extra money at the hyped product. This is a strategic question, so it requires some thinking ahead of the current round of problems, as well as not falling into the trap of thinking one solution fits all.

        Comment by jgarry — January 11, 2013 @ 11:35 pm BST Jan 11,2013 | Reply

  2. too bad it is a virtual pub. I’d buy you both a couple of rounds for doing these webinars. I work with both Oracle and Sql Server and I find your observations very helpful when I start to confuse one system’s behaviour with the others.

    Comment by Jim — January 23, 2013 @ 10:11 pm BST Jan 23,2013 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Theme: Rubric. Get a free blog at WordPress.com

Follow

Get every new post delivered to your Inbox.

Join 4,257 other followers