Updated 22nd Jan
I’ve previously advertised the fact that the latest online discussion that Red Gate has arranged for me to have with Grant Fritchey will be on 23rd Jan and we will be talking about statistics. If you’ve listened in to any of these talks in the past you’ll realise that they are completely unscripted; what you get is a couple of guys in a (virtual) pub comparing and contrasting their favourite database engines and trying to learn a little bit about how the other technology works.
It’s not completely unprepared, though; we’ve usually exchanged a couple of notes with bullet points about the ideas that might come up, and some sort of progression. That doesn’t mean that we’ll hit all the bullet points, or follow the order, or not end up talking about something completely unexpected, but it does mean that James Murtagh (our chairman for the event) has a few topics up his sleeve that he can prod us with if he thinks we are going too far off topic.
If you want some idea of how the conversation might go this time around, here’s a copy of an email I’ve just sent to James and Grant:
A couple of ideas for focal points for Wednesday
- Why do we need statistics
- What statistics are (or could be ) created
- How automatically (if at all)
- What does it cost
- How are statistics used
- What statistics do we need that aren’t collected
- What do we do if the statistics are missing
A few briefing comments for Oracle
a) The key reason for collecting stats is to allow the optimizer to work out how much data it will acquire as it steps through an execution plan, and how scattered that data is. The “clustering_factor” of an index is one of the most significant things we can learn about an index and the “num_distinct” for a column combined with the “num_rows” for a table.
b) Oracle doesn’t collect statistics automatically as the data changes (there is a small change coming in 12c) – but there is a default overnight job that checks for objects with stale statistics and refreshes them automatically – this can actually cause problems. There is also a package which allows us to be very flexible about collecting statistics, or even inventing statistics.
c) To handle skewed data distribution you can create a histogram of up to 254 buckets to describe the contents of a column.
d) Recent innovations (11g) allow Oracle to be very efficient at getting accurate num_distinct for a column – but histograms are very expensive to gather if you use a large sample, and potentially very inaccurate if you use a small sample (12c has introduced some great enhancements here in terms of performance of collection and quality of histogram.)
e) Recently we were allowed to create virtual columns and statistics on virtual columns. We can also create statistics on groups of columns in the same table. We can’t create statistics across tables – even though we can (for example) create bitmap join indexes. This would be very helpful – if the stats could be kept reasonably accurate fairly cheaply.
f) For indexes the only column-related stats we collect are the number of distinct keys – if we want to get stats on partial keys we have to manual define “column groups”.
g) If stats on a table are missing, Oracle can sample some blocks in the table to see what the data looks like – the sample tends to be small. We can change the sampling rules to deal with various problems that arise if we have functions applied to columns in where clauses, or need statistics about correlated columns.
h) We have tools and mechanisms that allow us to associate “statistical corrections” to queries. Technically we can create these by hand – although the method is undocumented and unsupported – officially we need to license the Diagnostic Pack and Performance Pack to let Oracle analyze the statement thoroughly and produce the corrections.
i) There are cases where the optimizer simply guesses because there is no way to produce a justifiable estimate of statistics behaviour across tables – e.g. select where exists (subquery) assumes a 1% “survival rate”. Select where table1_column > table2_column assumes a 5% survival rate on the Cartesian join.
Grant Fritchey has published an article on Simple Talk (the SQL Server site managed by Red Gate) covering some features of Statistics in SQL Server, so we’ll probably spend some time browsing through some of the topics there in our comparison between Oracle and SQL Server.