Re: On partitioning - Mailing list pgsql-hackers

From Robert Haas
Subject Re: On partitioning
Date
Msg-id CA+TgmoZ-ceTBn79Rb1kWpVO7DcQKtsChDenOfwXzUE=89_kvBQ@mail.gmail.com
Whole thread Raw
In response to On partitioning  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Fri, Aug 29, 2014 at 11:56 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> In this design, partitions are first-class objects, not normal tables in
> inheritance hierarchies.  There are no pg_inherits entries involved at all.

Whoa.  I always assumed that table inheritance was a stepping-stone to
real partitioning, and that real partitioning would be built on top of
table inheritance.  In particular, I assume that (as Itagaki
Takahiro's patch did those many years ago) we'd add some metadata
somewhere to allow fast tuple routing (for both pruning and
inserts/updates).  What's the benefit of inventing something new
instead?

I'm skeptical about your claim that there will be no pg_inherits
entries involved at all.  You need some way to know which partitions
go with which parent table.  You can store that many-to-one mapping
someplace other than pg_inherits, but it seems to me that that doesn't
buy you anything; they're just pg_inherits entries under some other
name.  Why reinvent that?

> Each partition is assigned an Expression that receives a tuple and
> returns boolean.  This expression returns true if a given tuple belongs
> into it, false otherwise.  If a tuple for a partitioned relation is run
> through expressions of all partitions, exactly one should return true.
> If none returns true, it might be because the partition has not been
> created yet.  A user-facing error is raised in this case (Rationale: if
> user creates a partitioned rel and there is no partition that accepts
> some given tuple, it's the user's fault.)
>
> Additionally, each partitioned relation may have a master expression.
> This receives a tuple and returns an integer, which corresponds to the
> number of the partition it belongs into.

I agree with Tom: this is a bad design.  In particular, if we want to
scale to large numbers of partitions (a principal weakness of the
present system) we need the operation of routing a tuple to a
partition to be as efficient as possible.  Range partitioning can be
O(lg n) where n is the number of partitions: store a list of the
boundaries and binary-search it.  List partitioning can be O(lg k)
where k is the number of values (which may be more than the number of
partitions) via a similar technique.  Hash partitioning can be O(1).
I'm not sure what other kind of partitioning anybody would want to do,
but it's likely that they *won't* want it to be O(1) in the number of
partitions.  So I'd say have *only* the master expression.

But, really, I don't think an expression is the right way to store
this; evaluating that repeatedly will, I think, still be too slow.
Think about what happens in PL/pgsql: minimizing the number of times
that you enter and exit the executor helps performance enormously,
even if the expressions are simple enough not to need planning.  I
think the representation should be more like an array of partition
boundaries and the pg_proc OID of a comparator.

> Per-partition expressions are formed as each partition is created, and
> are based on the user-supplied partitioning criterion.  Master
> expressions are formed at relation creation time.  (XXX Can we change
> the master expression later, as a result of some ALTER command?
> Presumably this would mean that all partitions might need to be
> rewritten.)

This is another really important point.  If you store an opaque
expression mapping partitioning keys to partition numbers, you can't
do things like this efficiently.  With a more transparent
representation, like a sorted array of partition boundaries for range
partitioning, or a sorted array of hash values for consistent hashing,
you can do things like split and merge partitions efficiently,
minimizing rewriting.

> Planner -------
>
> A partitioned relation behaves just like a regular relation for purposes
> of planner.  XXX do we need special considerations regarding relation
> size estimation?
>
> For scan plans, we need to prepare Append lists which are used to scan
> for tuples in a partitioned relation.  We can setup fake constraint
> expressions based on the partitioning expressions, which let the planner
> discard unnecessary partitions by way of constraint exclusion.

So if we're going to do all this, why bother making the partitions
anything other than inheritance children?  There might be some benefit
in having the partitions be some kind of stripped-down object if we
could avoid some of these planner gymnastics and get, e.g. efficient
run-time partition pruning.  But if you're going to generate Append
plans and switch ResultRelInfos and stuff just as you would for an
inheritance hierarchy, why not just make it an inheritance hierarchy?

It seems pretty clear to me that we need partitioned tables to have
the same tuple descriptor throughout the relation, for efficient tuple
routing and so on.  But the other restrictions you're proposing to
impose on partitions have no obvious value that I can see.  We could
have a rule that when you inherit from a partition root, you can only
inherit from that one table (no multiple inheritance) and your tuple
descriptor must match precisely (down to dropped columns and column
ordering) and that would give you everything I think you really need
here.  There's no gain to be had in forbidding partitions from having
different owners, or being selected from directly, or having
user-visible names.  The first of those is arguably useless, but it's
not really causing us any problems, and the latter two are extremely
useful features.  Unless you are going to implement partition pruning
is so good that it will never fail to realize a situation where only
one partition needs to be scanned, letting users target the partition
directly is a very important escape hatch.

> (In the future we might be interested in creating specialized plan and
> execution nodes that know more about partitioned relations, to avoid
> creating useless Append trees only to prune them later.)

Good idea.

> pg_dump is able to dump a partitioned relation as a CREATE
> TABLE/PARTITION command and a series of ALTER TABLE/CREATE PARTITION
> commands.  The data of all partitions is considered a single COPY
> operation.
>
> XXX this limits the ability to restore in parallel.  To fix we might consider
> using one COPY for each partition.  It's not clear what relation should be
> mentioned in such a COPY command, though -- my instinct is that it
> should reference the parent table only, not the individual partition.

Targeting the individual partitions seems considerably better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: On partitioning
Next
From: Peter Eisentraut
Date:
Subject: Re: alter user set local_preload_libraries.