Home > mailing lists

Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers

From	amul sul
Subject	Re: [HACKERS] [POC] hash partitioning
Date	March 2, 2017 19:03:42
Msg-id	CAAJ_b94a=sVPBRDSQ6tKjFJEoRQpBdZZFBacZU0Q2+Z+xQuA_A@mail.gmail.com Whole thread Raw
In response to	Re: [HACKERS] [POC] hash partitioning (Yugo Nagata <nagata@sraoss.co.jp>)
Responses	Re: [HACKERS] [POC] hash partitioning (Yugo Nagata <nagata@sraoss.co.jp>) Re: [HACKERS] [POC] hash partitioning (Greg Stark <stark@mit.edu>)
List	pgsql-hackers

Tree view

On Wed, Mar 1, 2017 at 3:50 PM, Yugo Nagata <nagata@sraoss.co.jp> wrote:

[....]

> I Agree that it is unavoidable partitions number in modulo hashing,
> but we can do in other hashing technique. Have you had thought about
> Linear hashing[1] or Consistent hashing[2]? This will allow us to
> add/drop
> partition with minimal row moment.

Thank you for your information of hash technique. I'll see them
and try to allowing the number of partitions to be changed.

Thanks for showing interest, I was also talking about this with Robert Haas and

hacking on this, here is what we came up with this.

If we want to introduce hash partitioning without syntax contort and minimal

movement while changing hash partitions (ADD-DROP/ATTACH-DETACH operation),

at start I thought we could pick up linear hashing, because of in both the

hashing we might need to move approx tot_num_of_tuple/tot_num_of_partitions

tuples at adding new partition and no row moment required at dropping

partitioning.

With further thinking and talking through the idea of using linear hashing

with my team, we realized that has some problems specially during pg_dump

and pg_upgrade. Both a regular pg_dump and the binary-upgrade version of

pg_dump which is used by pg_restore need to maintain the identity of the

partitions. We can't rely on things like OID order which may be unstable, or

name order which might not match the order in which partitions were added. So

somehow the partition position would need to be specified explicitly.

So later we came up with some syntax like this (just fyi, this doesn't add

any new keywords):

create table foo (a integer, b text) partition by hash (a);

create table foo1 partition of foo with (modulus 4, remainder 0);

create table foo2 partition of foo with (modulus 8, remainder 1); -- legal, modulus doesn't need to match

create table foo3 partition of foo with (modulus 8, remainder 4); -- illegal, overlaps foo1

Here we need to enforce a rule that every modulus must be a factor of the next

larger modulus. So, for example, if you have a bunch of partitions that all have

modulus 5, you can add a new partition with modulus 10 or a new partition with

modulus 15, but you cannot add both a partition with modulus 10 and a partition

with modulus 15, because 10 is not a factor of 15. However, you could

simultaneously use modulus 4, modulus 8, modulus 16, and modulus 32 if you

wished, because each modulus is a factor of the next larger one. You could

also use modulus 10, modulus 20, and modulus 60. But you could not use modulus

10, modulus 15, and modulus 60, because while both of the smaller module are

factors of 60, it is not true that each is a factor of the next.

Other advantages with this rule are:

1. Dropping (or detaching) and adding (or attaching) a partition can never

cause the rule to be violated.

2. We can easily build a tuple-routing data structure based on the largest

modulus.

For example: If the user has

partition 1 with (modulus 2, remainder 1),

partition 2 with (modulus 4, remainder 2),

partition 3 with (modulus 8, remainder 0) and

partition 4 with (modulus 8, remainder 4),

then we can build the following tuple routing array in the relcache:

== lookup table for hashvalue % 8 ==

0 => p3

1 => p1

2 => p2

3 => p1

4 => p4

5 => p1

6 => p2

7 => p1

3. It's also quite easy to test with a proposed new partition overlaps with any

existing partition. Just build the mapping array and see if you ever end up

trying to assign a partition to a slot that's already been assigned to some

other partition.

We can still work on the proposed syntax - and I am open for suggestions. One

more thought is to use FOR VALUES HAVING like:

CREATE TABLE foo1 PARTITION OF foo FOR VALUES HAVING (modulus 2, remainder 1);

But still more thoughts/inputs welcome here.

Attached patch implements former syntax, here is quick demonstration:

1.CREATE :

create table foo (a integer, b text) partition by hash (a);

create table foo1 partition of foo with (modulus 2, remainder 1);

create table foo2 partition of foo with (modulus 4, remainder 2);

create table foo3 partition of foo with (modulus 8, remainder 0);

create table foo4 partition of foo with (modulus 8, remainder 4);

2. Display parent table info:

postgres=# \d+ foo

Table "public.foo"

--------+---------+-----------+----------+---------+----------+--------------+-------------

Partition key: HASH (a)

Partitions: foo1 WITH (modulus 2, remainder 1),

foo2 WITH (modulus 4, remainder 2),

foo3 WITH (modulus 8, remainder 0),

foo4 WITH (modulus 8, remainder 4)

3. Display child table info:

postgres=# \d+ foo1

Table "public.foo1"

--------+---------+-----------+----------+---------+----------+--------------+-------------

Partition of: foo WITH (modulus 2, remainder 1)

4. INSERT:

postgres=# insert into foo select i, 'abc' from generate_series(1,10) i;

INSERT 0 10

postgres=# select tableoid::regclass as part, * from foo;

part | a | b

------+----+-----

foo1 | 3 | abc

foo1 | 4 | abc

foo1 | 7 | abc

foo1 | 10 | abc

foo2 | 1 | abc

foo2 | 2 | abc

foo2 | 9 | abc

foo3 | 6 | abc

foo4 | 5 | abc

foo4 | 8 | abc

(10 rows)

TODOs.

1. Maybe need some work in the CREATE TABLE .. PARTITION OF .. syntax.

2. Trim regression tests (if require).

3. Documentation

Thoughts/Comments?

Attachment

hash-partitioning_another_design-v1.patch

pgsql-hackers by date:

From: Robert Haas
Date: 02 March 2017, 18:48:20
Subject: Re: [HACKERS] Partitioned tables and relfilenode

From: Ashutosh Bapat
Date: 02 March 2017, 19:18:30
Subject: Re: [HACKERS] Print correct startup cost for the group aggregate.

Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers

Attachment

Previous

Next