Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers

From amul sul
Subject Re: [HACKERS] [POC] hash partitioning
Date
Msg-id CAAJ_b94a=sVPBRDSQ6tKjFJEoRQpBdZZFBacZU0Q2+Z+xQuA_A@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] [POC] hash partitioning  (Yugo Nagata <nagata@sraoss.co.jp>)
Responses Re: [HACKERS] [POC] hash partitioning  (Yugo Nagata <nagata@sraoss.co.jp>)
Re: [HACKERS] [POC] hash partitioning  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On Wed, Mar 1, 2017 at 3:50 PM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
​[....]​  
> I Agree that it is unavoidable partitions number in modulo hashing,
> but we can do in other hashing technique.  Have you had thought about
> Linear hashing[1] or Consistent hashing​[2]?​  This will allow us to
> add/drop
> partition with minimal row moment. ​

Thank you for your information of hash technique. I'll see them
and try to allowing the number of partitions to be changed.

Thanks for showing interest, I was also talking about this with Robert Haas and
hacking on this, here is what we came up with this.

If we want to introduce hash partitioning without syntax contort and minimal
movement while changing hash partitions (ADD-DROP/ATTACH-DETACH operation),
at start I thought we could pick up linear hashing, because of in both the
hashing we might need to move approx tot_num_of_tuple/tot_num_of_partitions
tuples at adding new partition and no row moment required at dropping
partitioning. 

With further thinking and talking through the idea of using linear hashing
with my team, we realized that has some problems specially during pg_dump
and pg_upgrade. Both a regular pg_dump and the binary-upgrade version of
pg_dump which is used by pg_restore need to maintain the identity of the
partitions. We can't rely on things like OID order which may be unstable, or
name order which might not match the order in which partitions were added. So
somehow the partition position would need to be specified explicitly.

So later we came up with some syntax like this (just fyi, this doesn't add
any new keywords):

create table foo (a integer, b text) partition by hash (a);
create table foo1 partition of foo with (modulus 4, remainder 0);
create table foo2 partition of foo with (modulus 8, remainder 1);  -- legal, modulus doesn't need to match
create table foo3 partition of foo with (modulus 8, remainder 4);  -- illegal, overlaps foo1

Here we​ need to​ enforce a rule that every modulus must be a factor of the next
larger modulus. So, for example, if you have a bunch of partitions that all have
modulus 5, you can add a new​ ​partition with modulus 10 or a new partition with
modulus 15, but you cannot add both a partition with modulus 10 and a partition
with modulus 15, because 10 is not a factor of 15. However, you could
simultaneously use modulus 4, modulus 8, modulus 16, and modulus 32 if you
wished, because each modulus is a factor of the next larger one. You could
also use modulus 10, modulus 20, and modulus 60. But you could not use modulus
10, modulus 15, and modulus 60, because while both of the smaller module are
factors of 60, it is not true that each is a factor of the next.

Other advantages with this rule are:

1. Dropping ​(or detaching) and adding (or attaching) ​a partition can never
cause the rule to be violated.

2. We can easily build a tuple-routing data structure based on the largest
modulus. 

For example: If the user has 
partition 1 with (modulus 2, remainder 1),
partition 2 with (modulus 4, remainder 2),
partition 3 with (modulus 8, remainder 0) and
partition 4 with (modulus 8, remainder 4),

then we can build the following tuple routing array in the relcache:

== lookup table for hashvalue % 8 ==
0 => p3
1 => p1
2 => p2
3 => p1
4 => p4
5 => p1
6 => p2
7 => p1

3. It's also quite easy to test with a proposed new partition overlaps with any
existing partition. Just build the mapping array and see if you ever end up
trying to assign a partition to a slot that's already been assigned to some
other partition.

We can still work on the proposed syntax - and I am open for suggestions. One
more thought is to use FOR VALUES HAVING like:
CREATE TABLE foo1 PARTITION OF foo FOR VALUES HAVING (modulus 2, remainder 1);

But still more thoughts/inputs welcome here.

Attached patch implements former syntax, here is quick demonstration:

1.CREATE :
create table foo (a integer, b text) partition by hash (a);
create table foo1 partition of foo with (modulus 2, remainder 1);
create table foo2 partition of foo with (modulus 4, remainder 2);
create table foo3 partition of foo with (modulus 8, remainder 0);
create table foo4 partition of foo with (modulus 8, remainder 4);

2. Display parent table info:
postgres=# \d+ foo
                                    Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
 a      | integer |           |          |         | plain    |              |
 b      | text    |           |          |         | extended |              |
Partition key: HASH (a)
Partitions: foo1 WITH (modulus 2, remainder 1),
            foo2 WITH (modulus 4, remainder 2),
            foo3 WITH (modulus 8, remainder 0),
            foo4 WITH (modulus 8, remainder 4)

3. Display child table info:
postgres=# \d+ foo1
                                    Table "public.foo1"
 Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
 a      | integer |           |          |         | plain    |              |
 b      | text    |           |          |         | extended |              |
Partition of: foo WITH (modulus 2, remainder 1)

4. INSERT:
postgres=# insert into foo select i, 'abc' from generate_series(1,10) i;
INSERT 0 10

postgres=# select tableoid::regclass as part, * from foo;
 part | a  |  b
------+----+-----
 foo1 |  3 | abc
 foo1 |  4 | abc
 foo1 |  7 | abc
 foo1 | 10 | abc
 foo2 |  1 | abc
 foo2 |  2 | abc
 foo2 |  9 | abc
 foo3 |  6 | abc
 foo4 |  5 | abc
 foo4 |  8 | abc
(10 rows)

TODOs.
1. Maybe need some work in the CREATE TABLE .. PARTITION OF .. syntax.
2. Trim regression tests (if require).
3. Documentation

Thoughts/Comments?

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [HACKERS] Partitioned tables and relfilenode
Next
From: Ashutosh Bapat
Date:
Subject: Re: [HACKERS] Print correct startup cost for the group aggregate.