Re: On partitioning - Mailing list pgsql-hackers

From Robert Haas
Subject Re: On partitioning
Date
Msg-id CA+Tgmoaw+VmiJcfNCkp6iFyhA_xDAp-UShhnRh4sTHqgtSDKDQ@mail.gmail.com
Whole thread Raw
In response to Re: On partitioning  ("Amit Langote" <Langote_Amit_f8@lab.ntt.co.jp>)
Responses Re: On partitioning  (Amit Kapila <amit.kapila16@gmail.com>)
Re: On partitioning  ("Amit Langote" <Langote_Amit_f8@lab.ntt.co.jp>)
List pgsql-hackers
On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> So, we're going to support exactly two levels of partitioning?
>> partitions with partissub=false and subpartitions with partissub=true?
>>  Why not support only one level of partitioning here but then let the
>> children have their own pg_partitioned_rel entries if they are
>> subpartitioned?  That seems like a cleaner design and lets us support
>> an arbitrary number of partitioning levels if we ever need them.
>
> Yeah, that's what I thought at some point in favour of dropping partissub altogether. However, not that this design
solvesit, there is one question - if we would want to support defining for a table both partition key and sub-partition
keyin advance? That is, without having defined a first level partition yet; in that case, what level do we associate
sub-(sub-)partitioning key with or more to the point where do we keep it? 

Do we really need to allow that?  I think you let people partition a
toplevel table, and then partition its partitions once they've been
created.  I'm not sure there's a good reason to associate the
subpartitioning scheme with the toplevel table.  For one thing, that
forces all subpartitions to be partitioned the same way - do we want
to insist on that?  If we do, then I agree that we need to think a
little harder here.

> That would be a default partition. That is, where the tuples that don't belong elsewhere (other defined partitions)
go.VALUES clause of the definition for such a partition would look like: 
>
> (a range partition) ... VALUES LESS THAN MAXVALUE
> (a list partition) ... VALUES DEFAULT
>
> There has been discussion about whether there shouldn't be such a place for tuples to go. That is, it should generate
anerror if a tuple can't go anywhere (or support auto-creating a new one like in interval partitioning?) 

I think Alvaro's response further down the thread is right on target.
But to go into a bit more detail, let's consider the three possible
cases:

- Hash partitioning.  Every key value gets hashed to some partition.
The concept of an overflow or default partition doesn't even make
sense.

- List partitioning.  Each key for which the user has defined a
mapping gets sent to the corresponding partition.  The keys that
aren't mapped anywhere can either (a) cause an error or (b) get mapped
to some default partition.  It's probably useful to offer both
behaviors.  But I don't think it requires a partitionisoverflow
column, because you can represent it some other way, such as by making
partitionvalues NULL, which is otherwise meaningless.

- Range partitioning.  In this case, what you've basically got is a
list of partition bounds and a list of target partitions.   Suppose
there are N partition bounds; then there will be N+1 targets.  Some of
those targets can be undefined, meaning an attempt to insert a key
with that value will error out.  For example, suppose the user defines
a partition for values 1-3 and 10-13.  Then your list of partition
bounds looks like this:

1,3,10,13

And your list of destinations looks like this:

undefined,firstpartition,undefined,secondpartition,undefined

More commonly, the ranges will be contiguous, so that there are no
gaps.  If you have everything <10 in the first partition, everything
10-20 in the second partition, and everything else in a third
partition, then you have bounds 10,20 and destinations
firstpartition,secondpartition,thirdpartition.  If you want values
greater than 20 to error out, then you have bounds 10,20 and
destinations firstpartition,secondpartition,undefined.

In none of this do you really have "an overflow partition".  Rather,
the first and last destinations, if defined, catch everything that has
a key lower than the lowest key or higher than the highest key.  If
not defined, you error out.

> I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And
Iam thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key) 

I guess you could list or hash partition on multiple columns, too.
And yes, this is why I though of pg_node_tree.

>> > * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):
>> >
>> > -- create partitioned table and child partitions at once.
>> > CREATE TABLE parent (...)
>> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
>> > [ (
>> >      PARTITION child
>> >        {
>> >            VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
>> >          | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
>> >        }
>> >        [ WITH ( ... ) ] [ TABLESPACE tbs ]
>> >      [, ...]
>> >   ) ] ;
>>
>> How are you going to dump and restore this, bearing in mind that you
>> have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
>> wants to do pg_dump --table name_of_a_partition?
>>
> Assuming everything's (including partitioned relation and partitions at all levels) got a pg_class entry of its own,
wouldOIDs be a problem? Or what is the nature of this problem if it's possible that it may be. 

For pg_dump --binary-upgrade, you need a statement like SELECT
binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for
each pg_class entry.  So you can't easily have a single SQL statement
creating multiple such entries.

> Oh, do you mean to do away without any syntax for defining partitions with CREATE TABLE parent?

That's what I was thinking.  Or at least just make that a shorthand
for something that can also be done with a series of SQL statements.

> By the way, do you mean the following:
>
> CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000)
>
> Instead of,
>
> CREATE PARTITION child ON parent VALUES LESS THAN 10000?

To me, it seems more logical to make it a variant of CREATE TABLE,
similar to what we do already with CREATE TABLE tab OF typename.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Add shutdown_at_recovery_target option to recovery.conf
Next
From: Robert Haas
Date:
Subject: Re: On partitioning