Re: On partitioning - Mailing list pgsql-hackers

From Amit Langote
Subject Re: On partitioning
Msg-id 035401d012a5$af8817e0$0e9847a0$@lab.ntt.co.jp
In response to Re: On partitioning  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: On partitioning  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi Robert,

> From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> So, we're going to support exactly two levels of partitioning?
> >> partitions with partissub=false and subpartitions with partissub=true?
> >>  Why not support only one level of partitioning here but then let the
> >> children have their own pg_partitioned_rel entries if they are
> >> subpartitioned?  That seems like a cleaner design and lets us support
> >> an arbitrary number of partitioning levels if we ever need them.
> >
> > Yeah, that's what I thought at some point in favour of dropping partissub
> altogether. However, not that this design solves it, there is one question - if
> we would want to support defining for a table both partition key and sub-
> partition key in advance? That is, without having defined a first level partition
> yet; in that case, what level do we associate sub-(sub-) partitioning key with
> or more to the point where do we keep it?
>
> Do we really need to allow that?  I think you let people partition a
> toplevel table, and then partition its partitions once they've been
> created.  I'm not sure there's a good reason to associate the
> subpartitioning scheme with the toplevel table.  For one thing, that
> forces all subpartitions to be partitioned the same way - do we want
> to insist on that?  If we do, then I agree that we need to think a
> little harder here.
>

To me, it sounds better if we insist on a uniform subpartitioning scheme across all partitions. It seems that's how
it's done elsewhere. It would be interesting to hear what others think though.

> > That would be a default partition. That is, where the tuples that don't
> belong elsewhere (other defined partitions) go. VALUES clause of the
> definition for such a partition would look like:
> >
> > (a range partition) ... VALUES LESS THAN MAXVALUE
> > (a list partition) ... VALUES DEFAULT
> >
> > There has been discussion about whether there shouldn't be such a place
> for tuples to go. That is, it should generate an error if a tuple can't go
> anywhere (or support auto-creating a new one like in interval partitioning?)
>
> I think Alvaro's response further down the thread is right on target.
> But to go into a bit more detail, let's consider the three possible
> cases:
>
> - Hash partitioning.  Every key value gets hashed to some partition.
> The concept of an overflow or default partition doesn't even make
> sense.
>
> - List partitioning.  Each key for which the user has defined a
> mapping gets sent to the corresponding partition.  The keys that
> aren't mapped anywhere can either (a) cause an error or (b) get mapped
> to some default partition.  It's probably useful to offer both
> behaviors.  But I don't think it requires a partitionisoverflow
> column, because you can represent it some other way, such as by making
> partitionvalues NULL, which is otherwise meaningless.
>
> - Range partitioning.  In this case, what you've basically got is a
> list of partition bounds and a list of target partitions.   Suppose
> there are N partition bounds; then there will be N+1 targets.  Some of
> those targets can be undefined, meaning an attempt to insert a key
> with that value will error out.  For example, suppose the user defines
> a partition for values 1-3 and 10-13.  Then your list of partition
> bounds looks like this:
>
> 1,3,10,13
>
> And your list of destinations looks like this:
>
> undefined,firstpartition,undefined,secondpartition,undefined
>
> More commonly, the ranges will be contiguous, so that there are no
> gaps.  If you have everything <10 in the first partition, everything
> 10-20 in the second partition, and everything else in a third
> partition, then you have bounds 10,20 and destinations
> firstpartition,secondpartition,thirdpartition.  If you want values
> greater than 20 to error out, then you have bounds 10,20 and
> destinations firstpartition,secondpartition,undefined.
>
> In none of this do you really have "an overflow partition".  Rather,
> the first and last destinations, if defined, catch everything that has
> a key lower than the lowest key or higher than the highest key.  If
> not defined, you error out.

So just to clarify, the first and last destinations are considered "defined" if you have something like:

...
PARTITION p1 VALUES LESS THAN 10
PARTITION p2 VALUES BETWEEN 10 AND 20
PARTITION p3 VALUES GREATER THAN 20
...

And "not defined" if:

...
PARTITION p1 VALUES BETWEEN 10 AND 20
...

In the second case, because no explicit definitions for values less than 10 and greater than 20 are in place, rows with
those values error out? If so, that makes sense.
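If I understand the scheme correctly, the lookup could be sketched roughly like this (Python, purely illustrative; the name `route_range` and the data layout are made up here, not anything in the patch):

```python
import bisect

def route_range(bounds, destinations, key):
    """Route a key to a range partition.

    bounds is a sorted list of N partition bounds; destinations has
    N+1 entries, where None marks an undefined slot that errors out.
    Lower bounds are treated as inclusive, upper bounds as exclusive.
    """
    assert len(destinations) == len(bounds) + 1
    # bisect_right: keys below bounds[0] land in slot 0, keys in
    # [bounds[i], bounds[i+1]) land in slot i+1, and so on.
    slot = bisect.bisect_right(bounds, key)
    dest = destinations[slot]
    if dest is None:
        raise ValueError("no partition accepts key %r" % (key,))
    return dest

# Robert's example: partitions for values 1-3 and 10-13.
bounds = [1, 3, 10, 13]
destinations = [None, "firstpartition", None, "secondpartition", None]
```

With the contiguous case (bounds 10,20 and destinations firstpartition,secondpartition,thirdpartition), every key finds a slot; replacing the last destination with None makes values greater than 20 error out, as described.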

>
> > I wonder if your suggestion of pg_node_tree plays well here. This then
> could be a list of CONSTs or some such... And I am thinking it's a concern only
> for range partitions, no? (that is, a multicolumn partition key)
>
> I guess you could list or hash partition on multiple columns, too.
> And yes, this is why I though of pg_node_tree.
>
> >> > * DDL syntax (no multi-column partitioning, sub-partitioning support as
> yet):
> >> >
> >> > -- create partitioned table and child partitions at once.
> >> > CREATE TABLE parent (...)
> >> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> >> > [ (
> >> >      PARTITION child
> >> >        {
> >> >            VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
> >> >          | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
> >> >        }
> >> >        [ WITH ( ... ) ] [ TABLESPACE tbs ]
> >> >      [, ...]
> >> >   ) ] ;
> >>
> >> How are you going to dump and restore this, bearing in mind that you
> >> have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
> >> wants to do pg_dump --table name_of_a_partition?
> >>
> > Assuming everything's (including partitioned relation and partitions at all
> levels) got a pg_class entry of its own, would OIDs be a problem? Or what is
> the nature of this problem if it's possible that it may be.
>
> For pg_dump --binary-upgrade, you need a statement like SELECT
> binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for
> each pg_class entry.  So you can't easily have a single SQL statement
> creating multiple such entries.
>

Hmm, do you mean "pg_dump cannot emit" such an SQL statement, or that there shouldn't be one in the first place?

> > Oh, do you mean to do away without any syntax for defining partitions with
> CREATE TABLE parent?
>
> That's what I was thinking.  Or at least just make that a shorthand
> for something that can also be done with a series of SQL statements.
>

Perhaps this is related to the point just above. So, a single SQL statement that defines the partitioning key and a few
partitions/subpartitions based on the key could be supported, provided the resulting set of objects can also be created
using an alternative series of steps, each of which creates at most one object. Do we want a key definition to have an
OID? Perhaps not.
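For instance, the shorthand and its decomposed form might look something like this (syntax entirely hypothetical at this point):

```sql
-- shorthand: one statement creating parent and partitions
CREATE TABLE parent (...) PARTITION BY RANGE (key_column)
( PARTITION child1 VALUES LESS THAN 10,
  PARTITION child2 VALUES LESS THAN MAXVALUE );

-- equivalent series, one object per statement, which is what
-- pg_dump --binary-upgrade would presumably emit
CREATE TABLE parent (...) PARTITION BY RANGE (key_column);
CREATE TABLE child1 PARTITION OF parent VALUES LESS THAN 10;
CREATE TABLE child2 PARTITION OF parent VALUES LESS THAN MAXVALUE;
```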

> > By the way, do you mean the following:
> >
> > CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000)
> >
> > Instead of,
> >
> > CREATE PARTITION child ON parent VALUES LESS THAN 10000?
>
> To me, it seems more logical to make it a variant of CREATE TABLE,
> similar to what we do already with CREATE TABLE tab OF typename.
>

Makes sense. This would double as a way to create subpartitions too? And that would have to play well with any choice
we end up making about how we treat the subpartitioning key (one of the points discussed above).
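That is, under that variant, subpartitioning could presumably be expressed by partitioning a partition in turn, e.g. (again, hypothetical syntax):

```sql
CREATE TABLE parent (...) PARTITION BY RANGE (a);
CREATE TABLE child PARTITION OF parent VALUES LESS THAN 10
    PARTITION BY LIST (b);          -- child is itself partitioned
CREATE TABLE grandchild PARTITION OF child VALUES IN ('x', 'y');
```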

Regards,
Amit





