Oracle Scratchpad

September 21, 2009

Hash Partitions

Filed under: Infrastructure,Partitioning,Performance — Jonathan Lewis @ 5:57 pm BST Sep 21,2009

I made a throwaway comment in a recent posting about using powers of two for the number of partitions when using hash partitioning. The article in question was talking about globally partitioned indexes, but the “power of 2” principle was first associated with tables.

Here’s a simple demonstration of hash partitioning in action, showing why Oracle adopted this “power of 2” rule. We start by creating a table that doesn’t obey the rule – with six partitions – and collect stats on it to see how many rows go into each partition:

create table t1 (
	id,
	v1
)
partition by hash(id)
partitions 6
as
select
	rownum,
	object_name
from
	all_objects
where
	rownum <= 32000
;

column partition_name new_value m_pt

execute dbms_stats.gather_table_stats(user,'t1')

select	partition_position, partition_name, num_rows
from	user_tab_partitions
where	table_name = 'T1'
;

PARTITION_POSITION PARTITION_NAME         NUM_ROWS
------------------ -------------------- ----------
                 1 SYS_P2161                  3981
                 2 SYS_P2162                  3988
                 3 SYS_P2163                  8246
                 4 SYS_P2164                  7945
                 5 SYS_P2165                  3912
                 6 SYS_P2166                  3928

Rather than seeing evenly scattered data, we see four small partitions and two big ones – and the two big partitions are numbers three and four, and they’re twice the size of the little ones (plus or minus a bit). Let’s see what happens if we tell Oracle to increase the number of partitions to seven:

alter table t1 add partition;

execute dbms_stats.gather_table_stats(user,'t1')

select	partition_position, partition_name, num_rows
from	user_tab_partitions
where	table_name = 'T1'
;

PARTITION_POSITION PARTITION_NAME         NUM_ROWS
------------------ -------------------- ----------
                 1 SYS_P2161                  3981
                 2 SYS_P2162                  3988
                 3 SYS_P2163                  4112
                 4 SYS_P2164                  7945
                 5 SYS_P2165                  3912
                 6 SYS_P2166                  3928
                 7 SYS_P2167                  4134

Partition three (the lower of the two big partitions) has been split into two, and half its data copied into the new partition (partition seven). Oracle has improved the data distribution with minimum overheads. To get a uniform spread of data across partitions we need to do another “add partition”, which will split partition four into partitions four and eight, spreading the data evenly between them.
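For completeness, that final step is just a repeat of the previous one – with 32,000 rows you’d expect to end up with roughly 4,000 rows in each of the eight partitions, though the exact counts will depend on your data:

alter table t1 add partition;

execute dbms_stats.gather_table_stats(user,'t1')

select	partition_position, partition_name, num_rows
from	user_tab_partitions
where	table_name = 'T1'
order by
	partition_position
;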

If you create a table with 2^N hash partitions then (assuming a random distribution of lots of distinct values for the partitioning key) each partition will hold a similar amount of data. If you need to increase the number of partitions you “add partition” 2^N times and Oracle, working from the bottom up, will split each partition X in turn into two partitions, X and X + 2^N – so the data stays as evenly spread as possible throughout the process.
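If you want to see this happening to an individual row, here’s one way to check which partition a given id is currently stored in (just a sketch, joining the dbms_rowid data object number to the data dictionary – the value 42 is an arbitrary sample id). Run it before and after the relevant “add partition” and the row should either stay in partition X or reappear in partition X + 2^N:

select
	p.partition_position, p.partition_name
from
	t1			t,
	user_objects		o,
	user_tab_partitions	p
where
	t.id			= 42
and	o.object_name		= 'T1'
and	o.data_object_id	= dbms_rowid.rowid_object(t.rowid)
and	p.table_name		= 'T1'
and	p.partition_name	= o.subobject_name
;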

The same “power of 2” rule applies to hash subpartitions when you use composite partitioning.
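For example, a composite range/hash table might be declared like this (the table, column, and partition names are just made up for the illustration) – the subpartitions clause uses a power of two so that each range partition ends up with evenly sized hash subpartitions:

create table t2 (
	id	number,
	dt	date,
	v1	varchar2(40)
)
partition by range (dt)
subpartition by hash (id) subpartitions 8
(
	partition p_2009_q3 values less than (to_date('01-10-2009','dd-mm-yyyy')),
	partition p_2009_q4 values less than (to_date('01-01-2010','dd-mm-yyyy'))
)
;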

22 Comments »

  1. “Partition three (the lower of the two big partitions) has been split into two, and half its data copied into the new partition (partition seven).”

    It’s the bigger partition that got split into two new partitions.

    Anand

    Comment by Anand — September 21, 2009 @ 6:35 pm BST Sep 21,2009 | Reply

    • Anand,

      That happens to be true in this case – so three questions to ask yourself:

      How would you modify the test in order to convince yourself that that wasn’t the reason why partition three was the one split?

      If that were the reason for choosing number three to split, what sort of algorithm would then tell Oracle which partition future data should go into?

      If you were to combine two partitions, what logically consistent strategy could Oracle pick that would make splitting and combining partitions “opposites” of each other?

      (Alternatively, you can take it from me that it really is all about position, not size.)

      Comment by Jonathan Lewis — September 21, 2009 @ 6:56 pm BST Sep 21,2009 | Reply

  2. I once told a friend of mine that 2^N partitions were a better option, but I couldn’t articulate a reason at the time… thanks for the explanation.

    [Edited by JPL – you had n^2 when you intended 2^n]

    Comment by Daniel Stolf — September 21, 2009 @ 6:42 pm BST Sep 21,2009 | Reply

  3. Interesting. I understand now. When the number of partitions is not a power of 2, Oracle doesn’t use a “clean” hashing function. I wonder how the algorithm works.

    From your example, the following query demonstrates ora_hash vs. actual placement.

    select part,part2,count(*) from (
    select id, ora_hash(id,5) part, dbms_mview.pmarker(rowid) part2 from t1
    )
    group by part,part2 order by part2,part;

    Only when you reach 8 partitions (or 16) does ora_hash match.

    Comment by Christo Kutrovsky — September 21, 2009 @ 6:53 pm BST Sep 21,2009 | Reply

    • Christo,

      I was saving the ora_hash() function for the next blog :(

      Interestingly the hashing is probably similar to lots of other hashing approaches that Oracle uses – including the mechanism used in the new “approximate NDV” calculation in 11g.

      You could imagine that the hashing function produces a number between zero and (say) 2^63, and then chops off the top bits until the result is the largest value that matches an existing partition.

      Comment by Jonathan Lewis — September 21, 2009 @ 7:13 pm BST Sep 21,2009 | Reply

      • Cool ! No wonder I couldn’t find my hashing function. Wrong arguments, wrong calculations …

        Looking forward to your ora_hash blog.

        Comment by Christo Kutrovsky — September 21, 2009 @ 7:29 pm BST Sep 21,2009 | Reply

        • Christo,

          I’ve just pointed someone from OTN to your excellent blog and presentation on memory management in unix. It’s amazing how many systems I still see wasting huge amounts of memory on mapping when they could be using large pages.

          Comment by Jonathan Lewis — September 21, 2009 @ 7:42 pm BST Sep 21,2009

  4. Thank you for your explanation.
    That helps me understand “power of 2” and gives me a good idea to help my testing.

    Comment by Surachart Opun — September 21, 2009 @ 7:13 pm BST Sep 21,2009 | Reply

  5. I’m *very* interested in your ora_hash discussion. I’ve heard about the power of two rule here and frankly, it has never made any sense to me. It seems to me like it would be easy to keep the rows evenly distributed with a simple algorithm, but maybe I’m missing something.

    Comment by Ben — September 22, 2009 @ 11:25 am BST Sep 22,2009 | Reply

  6. Sorry,

    I’m missing something. 7 is not a power of 2, 8 is.

    Or is the example a “running” example, i.e. 7 is better than 6, and then if you add another partition 8 is better than 7, but 9 is “worse” than 8?

    Thanks,
    Antonio

    Comment by lascoltodelvenerdi — September 22, 2009 @ 1:49 pm BST Sep 22,2009 | Reply

  7. Well,
    the power of 2 comes from the simple algorithm used:
    hash the column (or columns);
    if the number of partitions is n, take the last ceil(log2(n)) bits (i.e. for 5, 6, 7 or 8 partitions it will be 3 bits, between 9 and 16 partitions it will be 4), convert them to an integer m, and mod(m,n) will give you the number of the partition. So if you have 5 partitions, when m = 6, 7 or 8 the row will go into partition 1, 2 or 3, hence these will have more rows than the others. To have a more uniform distribution, you’d better have the number of partitions be a power of 2.

    Comment by salem — September 22, 2009 @ 2:17 pm BST Sep 22,2009 | Reply

  8. Antonio, Ben,

    The way I would summarize it is: you only get an even distribution when you use a power of 2 number of partitions. Every time you add a hash partition, you split ONLY one existing partition into two new ones.

    Regardless of whether you add them one by one, or create them from scratch, the result will be the same. (Except that the high water mark may be higher?)

    With 1 partition , add a new one (split), you get 2 (power of 2) with 50%/50%.
    With 2 partitions, add a new one (split), you get 3 (not power of 2) with 50%/25%/25% split
    With 3 partitions, add a new one (split), you get 4 (power of 2) with 25%/25%/25%/25% split
    With 4 partitions, add a new one (split), you get 5 (not power of 2) with 25%/25%/25%/13%/12% split
    etc…

    The idea behind this algorithm is that when you add a new partition, you won’t copy data from ALL partitions, you just split one.

    With this (oracle) algorithm, if you have a 1000 GiB table with 30 partitions you would have:
    1000 / 32 = 31.25
    28 partitions of 31.25 GiB
    2 partitions of 62.50 GiB (twice the “common” size)

    An add partition command would result in:
    reading one of the big partitions (62.50 GiB) and writing out a new 31.25 GiB partition. (I guess the existing partition will have its rows deleted – but let’s skip that part for now.)

    So total IO (without deletion fact) 62.50 + 31.25 = 100 GiB.

    With a “pure” hashing algorithm, the same table (1000 GiB) with 30 partitions would have:
    30 partitions, each 33.3 GiB in size.

    An add partition will have to:
    Read ALL 30 partitions (1000 GiB) and write out a new 32.2 GiB partition.

    Total IO (without deletion) would be: 1000 GiB + 32.2 GiB = 1032.2 GiB. About 10-fold more.

    For this reason, ora_hash cannot be used “as is” to predict which partition data would go to. I am sure Tanel or Christian could write some bit-offset edited version that will give the exact location.

    Caveat for ora_hash: to split data into 8 buckets (a power of 2) you have to use ora_hash([id],7) because the buckets are numbered from 0 (not from 1).

    Comment by Christo Kutrovsky — September 22, 2009 @ 2:18 pm BST Sep 22,2009 | Reply

    • The algorithm is clear.

      I was pointing out that 7 is not a power of 2.

      So I think Jonathan made a “running” example, in the sense that you have to track the changes if you want to really understand it.

      Anyway, thank you for your long reply – and always pay attention when the count starts from 0 and not from 1! ;-)

      Bye,
      Antonio

      Comment by lascoltodelvenerdi — September 22, 2009 @ 8:05 pm BST Sep 22,2009 | Reply

    • Thanks very much for that. The question of why only powers of two resulted in even distribution had been bugging me for a while. Now I see the reason for the algorithm that seemed weird to me at first.

      Comment by Ben — September 23, 2009 @ 5:28 pm BST Sep 23,2009 | Reply

  9. Right – with this being a design choice by Oracle, to optimize partition addition.

    Comment by Christo Kutrovsky — September 22, 2009 @ 2:37 pm BST Sep 22,2009 | Reply

  10. Hi Jonathan,

    There are always nice things you teach us every day. A question regarding this: let’s say we have a table with 5 hash partitions, which is inadequate as I learned today. But if we

    select * from t1 partition(SYS_P2161);

    Is this going to touch some other partitions to get the data? If yes, then how can we find out which partitions it’s touching?

    Comment by Taral Desai — September 22, 2009 @ 10:49 pm BST Sep 22,2009 | Reply

  11. […] Tom’s Oak Table colleague, Jonathan Lewis, played no head games on us, but he has been at the hash partitions. He writes, “I made a throwaway comment in a recent posting about using powers of two for the […]

    Pingback by Log Buffer #163: a Carnival of the Vanities for DBAs | Pythian Group Blog — September 25, 2009 @ 5:05 pm BST Sep 25,2009 | Reply

  12. Hi all, I want to ask: how does the true hash function actually work?
    I am still confused.

    Please explain more clearly.
    Thanks

    Comment by pur — November 16, 2009 @ 5:52 am GMT Nov 16,2009 | Reply

  13. […] Oracle’s strategy for hash partitioning is engineered to give an even data distribution when the number of partitions is a power of two. In one of the follow-up comments, Christo Kutrovsky pre-empted my planned follow-up by mentioning […]

    Pingback by ora_hash function « Oracle Scratchpad — November 21, 2009 @ 10:59 am GMT Nov 21,2009 | Reply

  14. Hi All,

    Some time back I was experimenting to understand the hash partitioning algorithm and why we should use “power of 2” partitions.

    Please follow the link below:

    Index Operations

    Regards
    Saurabh

    Comment by svdixit — November 23, 2009 @ 5:24 am GMT Nov 23,2009 | Reply

  15. […] This often looks like an easy winner. Create the index as a globally hash partitioned index, partitioned on the first column of the index. The number of partitions should take account of the degree of concurrency you have to deal with, and could easily be somewhere between 16 and 128 in extreme cases. (The number should always be a power of two because of the strategy that Oracle uses for hash partitioning). […]

    Pingback by Index Explosion – 4 « Oracle Scratchpad — January 22, 2010 @ 10:49 am GMT Jan 22,2010 | Reply

  16. […] (sub)partitioning should be based on powers of 2, otherwise some partitions will be twice size of […]

    Pingback by Negative Offload | Oracle Scratchpad — September 28, 2019 @ 5:39 pm BST Sep 28,2019 | Reply

