The video of the online chat that I had with Grant Fritchey about statistics in SQL Server and Oracle is now online. It went pretty well – according to the stats 537 people attended, although the peak concurrency I noticed was only 467 – of which there were still over 400 after an hour and a quarter.
At the start of the event, James Murtagh put up a quick poll to see how many attendees used SQL Server, how many used Oracle, and how many used both. There’s a degree of bias in the results, no doubt due to the self-selecting nature of the event, but I thought the results were interesting:
There were several questions outstanding by the end of the webinar, so James has sent a copy of them to both of us, and I’ve tried to give short answers below – but only for the questions that appear to be related to Oracle.
I’ve left the questions in their original form, but please note that the questions had to be typed by the audience in real-time into a tiny box while they were trying to listen to what we were saying – so some of the grammar and spelling is likely to be a little garbled.
Different Does Copy Stats Affect The Optimizer from the Collection of Stats in Oracle? Does Partition, Subpartition Level Copy Has Different Affects in Oracle?
In principle, the stats are just a set of numbers that the optimizer reads and processes; it doesn’t care how they got into the data dictionary. There are a couple of details about partition stats, though, that give Oracle a clue about what to do when accumulating stats the next time you run dbms_stats against the object. For some useful information about partitions and manipulating stats the best reference I have is still a set of articles by Doug Burns, which I’ve collated here with a few from other authors.
If i say 10% sample for a table, what data is considered for sampling, the existing data or the new one? I mean if the table is highly dml intensive, which data will be considered for sampling?
I guess this could be aimed at either Oracle or SQL Server. From the Oracle perspective – the sample is randomly selected across the full size of the table. You can choose to pick a sample of random rows, or a sample of random blocks.
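As a sketch (the table name T1 is hypothetical), the choice between row and block sampling is just a parameter to dbms_stats.gather_table_stats:

```sql
-- A 10% sample of random rows (the default behaviour):
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                estimate_percent => 10
        );
end;
/

-- The same 10%, but sampling random blocks rather than random rows:
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                estimate_percent => 10,
                block_sample     => true
        );
end;
/
```

Block sampling is cheaper because Oracle reads fewer blocks, but if your data arrived in a clustered fashion the sample may be less representative than a row sample of the same size.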
So who are the Brent Ozar’s, Michelle Ufford’s, and Ola Hallengren’s of the Oracle world (and Grant Fritchey’s)? I’m aware of Ask Tom, but there seems to be far fewer people in the Oracle world who share their knowledge like the people I mentioned above. It looks like Red Gate is trying to help is this area, and it is greatly appreciated!
There seem to be quite a lot – being sure you’ve found one that actually produces correct and appropriate information is the problem, of course. You could do worse than start with the people on my blog roll, and bloggers who are members of the Oak Table network (whose aggregator is also listed on my blog roll.)
I think Red Gate is doing a very good job too – particularly in the way it’s trying to pull developers and dbas closer together, as well as pulling Oracle and SQL Server specialists together.
How does Oracle maintain stats on multiple columns as in correlated statistics? In SQL Server we have to create these manually depending on certain cases to give the optimizer better information with cardinality.
Oracle 11g introduced a feature known as “extended stats” which allows you to create stats on expressions and on column groups. Once you’ve created such an object it’s associated with a table as if it were a virtual column of that table, and any time you refresh the table stats the extended stats are refreshed at the same time. One of the suggestions I’ve made for this feature is to use it to create stats on the leading column combinations of indexes – which SQL Server does automatically.
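A minimal sketch of the 11g mechanism (table and column names are hypothetical): dbms_stats.create_extended_stats() creates the extension, and any subsequent gather refreshes it along with the real columns:

```sql
-- A column group on two correlated columns:
select dbms_stats.create_extended_stats(user, 'T1', '(state, country)')
from   dual;

-- An expression-based extension:
select dbms_stats.create_extended_stats(user, 'T1', '(upper(surname))')
from   dual;

-- A normal gather now maintains the extensions as virtual columns:
begin
        dbms_stats.gather_table_stats(user, 'T1');
end;
/
```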
How about collecting statistics for temporary tables and for collections?
Oracle 11g can use dynamic sampling to get some statistical information for a collection – with the usual limitation that the sampling is only done the first time a statement is optimised. Gathering stats for global temporary tables (GTTs) is a bit of a problem because different people may use the same GTT for remarkably different quantities of data. Again, dynamic sampling could be used (and will be used by default from 10g onwards), but you could write some stats into the data dictionary for the GTT by creating a permanent table, populating it, collecting stats, then using the dbms_stats export and import function to copy the stats from the permanent table to the GTT. (If the GTT is declared as “on commit preserve rows” you can just load it with representative data then collect stats – but “on commit preserve” is less commonly used.)
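A sketch of the export/import trick (all names – GTT1, GTT1_MODEL, MY_STATS – are hypothetical, and the update relies on the undocumented layout of the stat table, where column C1 holds the object name):

```sql
-- 1. Gather stats on a permanent "model" table holding representative data:
begin
        dbms_stats.gather_table_stats(user, 'GTT1_MODEL');
end;
/

-- 2. Export the stats to a user stat table and relabel them for the GTT:
begin
        dbms_stats.create_stat_table(user, 'MY_STATS');
        dbms_stats.export_table_stats(user, 'GTT1_MODEL', stattab => 'MY_STATS');
end;
/

update my_stats set c1 = 'GTT1' where c1 = 'GTT1_MODEL';
commit;

-- 3. Import the relabelled stats against the global temporary table:
begin
        dbms_stats.import_table_stats(user, 'GTT1', stattab => 'MY_STATS');
end;
/
```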
Extended stats are great, but is there a way to determine which columns would benefit from extended stats? Most developers, or even users, don’t know, so would be nice to have (if you’ll forgive the pun) a statistical way of determining this.
Oracle recently came out with an advisor for column groups. There’s an article about it on the Optimizer Group blog.
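If I’ve understood the advisor correctly, the workflow is a monitoring call followed by a report (T1 and the 300-second window are hypothetical):

```sql
-- Ask Oracle to monitor column usage for the next 300 seconds:
begin
        dbms_stats.seed_col_usage(null, null, 300);
end;
/

-- ... run a representative workload, then report candidate column groups:
select dbms_stats.report_col_usage(user, 'T1') from dual;
```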
Jonathan, what is the scale of clustering index in Oracle? Upper and lower limits?
Roughly speaking (and it is possible to create boundary cases), the clustering_factor for a B-tree index will be a value between the number of blocks in the table and the number of rows in the table. If the value is close to the number of blocks the optimizer thinks that the data is very well clustered and will favour using the index for relatively large numbers of rows; if the value is close to the number of rows in the table the optimizer thinks the data is very widely scattered and could ignore the index for even a relatively small number of rows.
The value is not intuitively meaningful, so Oracle also presents a transformed version of the clustering_factor in the form of the avg_data_blocks_per_key – which is the typical number of data block visits Oracle thinks it will have to make to collect all the rows for a typical key value.
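You can eyeball where an index sits on that scale with a simple dictionary query (the table name T1 is hypothetical):

```sql
-- Compare each index's clustering_factor with the table's blocks/rows:
select
        t.blocks, t.num_rows,
        i.index_name, i.clustering_factor, i.avg_data_blocks_per_key
from
        user_tables   t,
        user_indexes  i
where
        t.table_name = 'T1'
and     i.table_name = t.table_name
;
```

A clustering_factor near t.blocks suggests well-clustered data; a value near t.num_rows suggests scattered data.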
I often see the bind problem together with histograms, which I see to often in OLTP database. I recommend not using histogram until otherwise proved. Do you agree?
Yes. Histograms can be expensive to generate accurately, and can introduce instability even when they are perfect representations of the data that was present when the histogram was created. It’s best to recognise the few cases where they can be very helpful, then make sure that the application and the server (developer and DBA) co-operate to get the best effects from them.
On Oracle, when talking about partitioning, you have something called global statistics. This global statistics you can generate with DBMS_STAT, or you could have Oracle calculate it by using the partition statistics. What do you recommend Jonathan?
Oracle 11g has introduced the “approximate NDV (number of distinct values)” mechanism, which allows global stats (specifically the global NDV) to be calculated incrementally for partitioned tables. This can be very helpful for systems that aren’t too complex and don’t have too many partitions. As a general rule, though, I still think that any site that does something complex should probably work out a programmatic way of constructing stats, rather than letting Oracle gather them on partitioned tables.
Jonathan: Do you have an article that describes how you define your “perfect” histogram?
Yes – here’s one example.
How much improvement we would have to the queries when the system stats in Oracle is updated?
This is one of those questions which is impossible to answer. Essentially the introduction of system stats (when combined, if necessary, with removing the parameter db_file_multiblock_read_count from the parameter file) should give Oracle a better balance between indexed access paths with nested loops and tablescan paths with hash joins. However, as with any global change to the optimizer, some people are bound to find that a few things change for the worse.
I have a tongue-in-cheek guideline: whatever you do, the optimizer will be okay 99% of the time: when you fiddle with any parameters, or things like system stats, you’re just changing the 1% that you have to fix.
Oracle apex tracks session state in tables within its schema (eg wwv_flow_collection_members$) the data in these tables is extremely volatile – how frequently would you recommend collecting stats on tables like these where the contents can change completely over the course of a day eg from zero to hundreds of thousands of rows ? clearly an overnight gather wont help !
This is an excellent question. Every DBA and developer should be aware of tables in their systems that behave this way, because these are the tables where ANY automatic strategy for collecting stats is likely to cause problems. People often forget that your statistics are often wrong within a few seconds of collecting them – and in special cases the effect can be catastrophic.
The generic answer to this type of problem is to create statistics that tell the optimizer what you want it to think the data looks like. Alternatively, you have to include complete execution paths (through hints, or SQL Plan Baselines) in the SQL for the queries that are likely to be unstable. The target is a plan that is always “good enough” irrespective of the state of the data.
In SQL Server, if there is query with where clause for many columns,it uses all possible non clustered indexes.Is it true that oracle uses only one index per table unless there is bit map index
No. Oracle used to have an “AND_EQUAL” access path into a table to combine indexes, but since 9i that has been deprecated in favour of the “INDEX_COMBINE” access path, which combines index range scans on b-trees by converting the rowids produced from each index range scan into a string of bits, combining the bits, then converting back to rowids. There is no limit on the number of indexes that can be combined in this way.
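A sketch of how you might force the combination path (table and index names are hypothetical; the execution plan should show BITMAP CONVERSION FROM ROWIDS, BITMAP AND, and BITMAP CONVERSION TO ROWIDS steps):

```sql
-- Combine two b-tree indexes on t1 through bitmap conversion:
select  /*+ index_combine(t1 t1_i1 t1_i2) */
        *
from    t1
where   col1 = 'A'
and     col2 = 'B'
;
```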
(Oracle also has an “INDEX_JOIN” path, which allows it to combine results from as many indexes as necessary to create a result set without visiting the table at all).
Can we collect statistics on Temporary tables in Oracle?
See above – only if they are defined as ‘on commit preserve rows’, because the first step of stats collection is a commit – which leaves the table empty if it’s declared as ‘on commit delete rows’. (The word “delete” is a bad choice in this case; Oracle doesn’t delete the rows, it simply forgets the data segment.)
Hello, How much consider the state of the network for the query optimization in distribuited databases? And what stadistics consider?
The Oracle optimizer does have a small component of cost allocated to the time it takes to move data across a network in a distributed query. I believe this feature appeared in 9i, and only if you enabled “CPU Costing” (aka System Stats). I haven’t looked into it very closely, but I think Oracle estimates the number of round trips it will take to move the data that has to be moved, and allows a small unit of time per round trip.
Question for Jonathan: From your experience what have parameter option for METHOD_OPT do feel has given you the best stats for performance?
I’m always in favour of doing something that is simple and “good enough” for most of the time – and then introducing a small amount of complexity where necessary. This, in part, is why I avoid histograms until I can show that they are needed, and then write code to create the right histogram. In terms of method_opt, this means “for all columns size 1”. (In 11g I would also let estimate_percent default to “auto_sample_size” because that allows the new, fast, accurate, “approximate NDV” to do its job.)
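As a sketch (table name hypothetical), that strategy is a single call:

```sql
begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'T1',
                method_opt => 'for all columns size 1'  -- basic column stats, no histograms
                -- estimate_percent deliberately left to default: in 11g that is
                -- auto_sample_size, which enables the "approximate NDV" mechanism
        );
end;
/
```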
After collecting stats with this method_opt, I would run the code that rebuilt any histogram that had been overwritten by the most recent collection.
What sample % does the oracle daily job uses?
By default 10%, but in 11g you can change this at the database, schema, or object level using one of the “set_xxx_prefs” calls.
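For illustration (the schema, table, and 25% figure are hypothetical), the three levels look like this:

```sql
-- Database-wide default:
begin
        dbms_stats.set_global_prefs('ESTIMATE_PERCENT', '25');
end;
/

-- One schema:
begin
        dbms_stats.set_schema_prefs('SCOTT', 'ESTIMATE_PERCENT', '25');
end;
/

-- One object:
begin
        dbms_stats.set_table_prefs('SCOTT', 'EMP', 'ESTIMATE_PERCENT', '25');
end;
/
```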
Jonathan, regarding the clustering factor: besides the obvious impact it has on physical IO, can’t it also have an influence on how much the query can benefit from pinned buffers. I remember that Oracle may keep buffers pinned between single buffer visits, but the number of buffers it may keep pinned is limited
The clustering_factor itself has no effect – but you’re correct in thinking that the order in which Oracle visits table blocks can make a difference to the amount of work done, even if all the blocks are buffered. This is, perhaps, one of the last steps in physically optimising the placement of your data that you might take to squeeze the maximum possible performance out of a system.
Taking a simplified example, if you had 10 rows scattered across 2 blocks A and B, and the index (for some reason) made you visit the blocks in the order ABABABABAB this would require more work than visiting them in the order AAAAABBBBB; in the first order Oracle would have to get and release each block 5 times, in the second case it would only have to get and release each block once, keeping it pinned for the other 4 rows.
Of course, the clustering_factor of the index that allows Oracle to do AAAAABBBBB will (almost certainly) be smaller than the clustering_factor of the index that allows Oracle to do ABABABABAB – but I would view the clustering_factor is an effect, not the cause.
Data compression is very popular nowadays in Oracle, compress data to save space in large databases, x3, x10, even x30 compression ratio. Does Oracle keep statistics if a row is compressed or not? does it matter to the optimizer.
Data compression doesn’t change the “logic” of statistics. From Oracle’s perspective you still have tables with rows and columns and indexes. The only difference is that you seem to have more rows per block. (There’s a five-part series on compression coming out on AllThingsOracle if you want to read up about the effects it can have.)
How to determine optimal block size in Oracle
There’s a simple answer to that question – if you need to ask, stick with the default (8KB on most platforms). There are a few special cases where a different block size may give you a little performance edge, or where it’s worth mixing block sizes in a single database – but you really have to know what your application looks like to be sure it’s worth changing – and you could be in a position where you’re running in an environment that has had less field-testing than the default.
How do you ‘tell oracle to treat this col in this table as if it has this data volume, density etc’?
How do you override Oracle’s ‘incorrect’ scatter assumptions
The package dbms_stats has calls to set_column_stats(), set_table_stats() and set_index_stats(). (See above for the reference to faking a histogram.)
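A sketch of the idea (the table name, column name, and every figure here are hypothetical – in practice you would pick numbers that model the data you expect):

```sql
begin
        -- Tell Oracle the table "has" 10M rows in 50,000 blocks:
        dbms_stats.set_table_stats(
                ownname => user,
                tabname => 'T1',
                numrows => 10000000,
                numblks => 50000,
                avgrlen => 40
        );
        -- ... and that a column has 4 distinct, evenly spread, non-null values:
        dbms_stats.set_column_stats(
                ownname => user,
                tabname => 'T1',
                colname => 'STATUS',
                distcnt => 4,
                density => 0.25,
                nullcnt => 0
        );
end;
/
```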
Is the process of collecting statistics (Oracle) always the same on each plattform (Unix or Windows) or exists differences?
It’s the same on both platforms.
Would scan density in SQLServer be analagous to cluster factor in Oracle?
I don’t think so. I believe scan density tells you how much data you have to pick up, while clustering_factor tries to tell you something about how much work it’s going to be to pick up all the different bits.
What are histograms and how they are add on to the statistics in oracle, to get very good plan
A histogram is, in effect, a picture of the data distribution, and can be particularly useful for columns with a small number of values with an extremely uneven data distribution – for example a flag column on a trades table that shows that almost all trades are closed, with a tiny number that are in a different state. By default Oracle will consider any columns that have been used in a where clause as a possible candidate for a histogram if you allow it to do its default collection – and this is often overkill.
Oracle Auto Stats runs and gather stats when data changes 10%. I have a large table that gets a small amount of rows added each day. We had two instances 10 months apart where the difference between actual row count and row count reported by statistics from a few weeks ago was 5-7 % and stats were not gathered. The execution plan went from using indexes to full table scan. Needed to manually gather stats to have optimizer use index again. How can I predict when to gather stats again?
There is no generic answer to this question. You have to understand the data and what it means to the application. In cases like this where the table is very large you may find that all you really need to do is use the set_column_stats() procedure to tell Oracle that the high value for a date-based, or sequence-based, column has increased.
If you want to investigate in detail you could check the execution plan from day to day for a few weeks and watch the cardinality prediction changing as the query is reoptimised over time. This may give you a clue about why the plan can change, and how much the data has to change before you hit a break point.
I’ve seen a situation where parameter sniffing can cause problems with SQL Server mis-identifying the appropriate stats for any given query. We’ve solved this in the past by using techniques like establishing a local variable in a stored procedure and setting that value to the value of the parameter passed into the stored proc. Does Oracle have this same type of problem (parameter sniffing causing issues with assessing the proper stats)?
It’s called “bind variable peeking” in Oracle, and it does cause problems of stability. The type of workaround you’ve described wouldn’t quite work because Oracle would still see a bind variable. However, if you created a deterministic function to return the value then the optimizer wouldn’t be able to see the value and would have to use one of its fixed value estimates for selectivity.
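A sketch of the workaround (the function name hide_bind, the table, and the bind variable are all hypothetical):

```sql
-- A deterministic wrapper that passes the value through unchanged:
create or replace function hide_bind(p_in in varchar2)
return varchar2
deterministic
as
begin
        return p_in;
end;
/

-- The optimizer can no longer peek at :b1, so it falls back to a
-- fixed selectivity guess rather than a value-specific estimate:
select * from t1 where status = hide_bind(:b1);
```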
The feature of SQL Server I’d like to see in Oracle for (at least some of the) cases like this is the hint that allows you to say: “optimize this query as if I’d supplied the value X”.
In context of oracle, do you perceive some danger in the common practice of “recalc every night”. Related to that can you comment on the changs in optimizer behaviour when histograms have been created.
Recalc (everything) every night can introduce instability – especially since Oracle will take automatic sample sizes. With 11g and the “approximate NDV” the instability and workload is likely to be reduced – provided you’re not collecting histograms. See further up the page for my comments on histograms.
Does each column have stats? or if a column is only used in a SELECT, not part of any index or WHERE, GROUP, etc., will it not have stats maintained?
In Oracle every column gets stats by default. Even if a column is not used in the where clause the number of non-null values and the length of the column content can make a difference to the work needed if Oracle wants to do a hash join or sort merge join – and column stats allow it to work out the total volume of data that needs to be sorted or hashed.
How do you mange/collect statistics for a new database application that is just launched and most of the transaction tables don’t have any data? Especially say for a trade or auction application… and there is a storm of activities on day one when the application is launched.
If there’s not much data then it might be quite quick and cheap the first few days to collect stats. If necessary (in 11g) you could collect “pending” stats a couple of times during the day, then switch them all from pending to active to minimize the time when you had a mixture of old and new stats.
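A sketch of the 11g pending-stats mechanism (the table name TRADES is hypothetical):

```sql
-- Route gathered stats to the "pending" area instead of publishing them:
begin
        dbms_stats.set_table_prefs(user, 'TRADES', 'PUBLISH', 'FALSE');
        dbms_stats.gather_table_stats(user, 'TRADES');  -- goes to pending
end;
/

-- When ready, make the pending stats live in a single step:
begin
        dbms_stats.publish_pending_stats(user, 'TRADES');
end;
/
```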
Sounds like this should be a weekly conversation for awhile
I agree – if only we could find the time – there’s lots we could talk about.
Can you address dynamic sampling in Oracle?
Big topic. Basically, when Oracle decides that certain rules apply it will sample some blocks from some of the tables in a query to see what the data looks like – and may then use the results of the sample to work out the execution plan. On the positive side this means that where there are multiple filter or join predicates on a table, or where Oracle has had to guess (e.g. because you have “function(col) = constant”), the optimizer has better information. On the minus side – the sampling only takes place once when the query is optimised – so an unlucky set of inputs may produce a plan that is bad for everyone else that wants to run that query.
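At the statement level you can request it with a hint (the table alias and sampling level here are hypothetical – levels run from 0 to 10, with higher levels sampling more blocks):

```sql
-- Request level-4 dynamic sampling of t1 for this query only:
select  /*+ dynamic_sampling(t1 4) */
        count(*)
from    t1
where   upper(col1) = 'SMITH'
and     col2 = col3
;
```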
Does it make sense to gather stats on oracle table after rebuilding one of the indexes of the same table
Jonathan what are your thoughts on Direct Path Reads in 11g
Outside the scope of the discussion. Any change will help some people and cause other people problems. So the most important thing is to be aware that it’s an option that the run-time engine (not the optimizer) might take.