Oracle Scratchpad

March 13, 2019

Hash Partitions

Filed under: Oracle,Partitioning,Statistics — Jonathan Lewis @ 1:13 pm GMT Mar 13,2019

Here’s an important thought if you’ve got any large tables which are purely hash partitioned. As a general guideline you should not need partition level stats on those tables. The principle of hash partitioned tables is that the rows are distributed uniformly and randomly based on the hash key so, with the assumption that the number of different hash keys is “large” compared to the number of partitions, any one partition should look the same as any other partition.

Consider, as a thought experiment (and as a warning), a table of product_deliveries which is hash partitioned by product_id with ca. 65,000 distinct products that have been hashed across 64 partitions. (Hash partitioning should always use a power of 2 for the partition count if you want the number of rows per partition to be roughly the same across all partitions – if you don’t pick a power of two then some of the partitions will be roughly twice the size of others.)

Consider a query for “deliveries to Basingstoke” – in the absence of a histogram on the delivery location the optimizer will produce a cardinality estimate that is:

  • total rows in table / number of distinct delivery locations in table

Now consider a query for: “deliveries of product X to Basingstoke” – again in the absence of histograms. The optimizer could have two ways of calculating this cardinality:

  • total rows in table / (number of distinct products in table * number of distinct delivery locations in table)
  • total rows in relevant partition / (number of distinct products in relevant partition * number of distinct delivery locations in relevant partition)

But given the intent of hash partitioning to distribute data evenly we can make three further observations:

  1. the number of rows in any one partition should be very similar to the number of rows in the table divided by the number of partitions
  2. the number of distinct products in any one partition should be very similar to the number of products in the table divided by the number of partitions
  3. the number of distinct locations in any one partition should be very similar to the number of distinct locations in the whole table.

The second condition holds because product is the partition key, the third holds because location is not the partition key.

So we can rewrite the second, partition-oriented, formula as:

  • (total rows in table / number of partitions) / ((number of distinct products in table / number of partitions) * number of distinct locations in table)

which, re-arranging parentheses and cancelling common factors, reduces to:

  • total rows in table / (number of distinct products in table * number of distinct locations in table)

which matches the first formula. (Q.E.D.) In the absence of any statistics on hash partitions the optimizer can (ought to be able to) produce reasonable cardinality estimates based purely on table-level stats.

In fact if you look back into the history of partitioning this observation is implicit in the early days of composite partitioning when the only option was for range/hash composite partitions – the optimizer never used sub-partition stats to calculate costs or cardinality it used only partition-level statistics. (And it was several years before the optimizer caught up to the fact that (e.g.) range/list composite partitioning might actually need to do arithmetic based on subpartition stats.)

I did say that the example was also a warning. Hash partitioning is “expected” to have a large number of distinct key values compared to the number of partitions. (If you don’t meet this requirement then possibly you should be using list partitioning). There’s also a “uniformity” assumption built into the arithmetic (both the basic arithmetic and the hand-waving discussion I produced above). Just imagine that your company supplies a handful of products that for some strange reason are incredibly popular  in Basingstoke. If this is the case then the assumption that “all partitions look alike” is weakened and you would have to consider the possibility that the variation would require you to produce a workaround to address problems of poor cardinality estimates that the variation might produce.

A pattern of this type has two generic effects on the optimizer, of course. First is the simple skew in the data – to have a significant impact the number of rows for the problem products would have to be much larger than average, which suggests the need for a suitably crafted histogram; secondly there’s an implied correlation between a few products and Basingstoke, so you might even end up creating a column group and manually coding a histogram on it to capture the correlation.

 

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Powered by WordPress.com.