Re: Partitioning: issues/ideas (Was: Re: On partitioning) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Partitioning: issues/ideas (Was: Re: On partitioning)
Date
Msg-id CA+TgmoaMNReZqNbdta9G4B4bAXwQbgB+MALhZmAxOVoG4YkztQ@mail.gmail.com
Whole thread Raw
In response to Partitioning: issues/ideas (Was: Re: On partitioning)  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Responses Re: Partitioning: issues/ideas (Was: Re: On partitioning)
Re: Partitioning: issues/ideas (Was: Re: On partitioning)
List pgsql-hackers
On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> * It has been repeatedly pointed out that we may want to decouple
> partitioning from inheritance because implementing partitioning as an
> extension of inheritance mechanism means that we have to keep all the
> existing semantics which might limit what we want to do with the special
> case of it which is partitioning; in other words, we would find
> ourselves in difficult position where we have to inject a special case
> code into a very generalized mechanism that is inheritance.
> Specifically, do we regard a partitions as pg_inherits children of its
> partitioning parent?

I don't think this is totally an all-or-nothing decision.  I think
everyone is agreed that we need to not break things that work today --
e.g. Merge Append.  What that implies for pg_inherits isn't altogether
clear.

> * Syntax: do we want to make it similar to one of the many other
> databases out there? Or we could invent our own?

Well, what I think we don't want is something that is *almost* like
some other database but not quite.  I lean toward inventing our own
since I'm not aware of something that we'd want to copy exactly.

> I wonder if we could add a clause like DISTRIBUTED BY to complement
> PARTITION ON that represents a hash distributed/partitioned table (that
> could be a syntax to support sharded tables maybe; we would definitely
> want to move ahead in that direction I guess)

Maybe eventually, but let's not complicate things by worrying too much
about that now.

> * Catalog: We would like to have a catalog structure suitable to
> implement capabilities like multi-column partitioning, sub-partitioning
> (with arbitrary number of levels in the hierarchy). I had suggested
> that we create two new catalogs viz. pg_partitioned_rel,
> pg_partition_def to store metadata about a partition key of a
> partitioned relation and partition bound info of a partition,
> respectively. Also, see the point about on-disk representation of
> partition bounds

I'm not convinced that there is any benefit in spreading this
information across two tables neither of which exist today.  If the
representation of the partitioning scheme is going to be a node tree,
then there's no point in taking what would otherwise have been a List
and storing each element of it in a separate tuple. The overarching
point here is that the system catalog structure should be whatever is
most convenient for the system internals; I'm not sure we understand
what that is yet.

> * It is desirable to treat partitions as pg_class relations with perhaps
> a new relkind(s). We may want to choose an implementation where only the
> lowest level relations in a partitioning hierarchy have storage; those
> at the upper layers are mere placeholder relations though of course with
> associated constraints determined by partitioning criteria (with
> appropriate metadata entered into the additional catalogs).

I think the storage-less parents need a new relkind precisely to
denote that they have no storage; I am not convinced that there's any
reason to change the relkind for the leaf nodes.  But that's been
proposed, so evidently someone thinks there's a reason to do it.

> I am not
> quite sure if each kind of the relations involved in the partitioning
> scheme have separate namespaces and, if they are, how we implement that

I am in favor of having all of the nodes in the hierarchy have names
just as relations do today -- pg_class.relname.  Anything else seems
to me to be complex to implement and of very marginal benefit.  But
again, it's been proposed.

> * In the initial implementation, we could just live with partitioning on
> a set of columns (and not arbitrary expressions of them)

Seems quite fair.

> * We perhaps do not need multi-column LIST partitions as they are not
> very widely used and may complicate the implementation

I agree that the use case is marginal; but I'm not sure it needs to
complicate the implementation much.  Depending on how the
implementation shakes out, prohibiting it might come to seem like more
of a wart than allowing it.

> * There are a number of suggestions about how we represent partition
> bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
> associated with the relation itself), etc. Important point to consider
> here may be that partition key may contain more than one column

Yep.

> * How we represent partition definition in memory (for a given
> partitioned relation) - important point to remember is that such a
> representation should be efficient to iterate through or
> binary-searchable. Also see the points about tuple-routing and
> partition-pruning

Yep.

> * Overflow/catchall partition: it seems we do not want/need them. It
> might seem desirable for example in cases where a big transaction enters
> a large number of tuples all but one of which find a defined partition;
> we may not want to error out in such case but instead enter that erring
> tuple into the overflow partition instead. If we choose to implement
> that, we would like to also implement the capability to move the tuples
> into the appropriate partition once it's defined. Related is the notion
> of automatically creating partitions if one is not already defined for a
> just entered tuple; but there may be locking troubles if many concurrent
> sessions try to do that

I think that dynamically creating new partitions is way beyond the
scope of what this patch should be trying to do.  If we ever do it at
all, it should not be now.  The value of a default partition (aka
overflow partition) seems to me to be debatable.  For range
partitioning, it doesn't seem entirely necessary provided that you can
define a range with only one endpoint (e.g. partition A has values 1
to 10, B has 11 and up, and C has 0 and down).  For list partitioning,
though, you might well want something like that.  But is it a
must-have?  Dunno.

> * Tuple-routing: based on the internal representation of partition
> bounds for the partitions of a given partitioned table, there should be
> a way to map a just entered tuple to partition id it belongs to. Below
> mentioned BRIN-like machinery could be made to work
>
> * Partition-pruning: again, based on the internal representation of
> partition bounds for the partitions of a given partitioned table, there
> should be a way to prune partitions deemed unnecessary per scan quals.
> One notable suggestion is to consider BRIN (-like) machinery. For
> example, it is able to tell from the scan quals whether a particular
> block range of a given heap needs to be scanned or not based on summary
> info index tuple for the block range. Though, the interface is currently
> suitable to cover a single heap with blocks in range 0 to N-1 of that
> heap. What we are looking for here is a hypothetical PartitionMemTuple
> (or PartitionBound) that is a summary of a whole relation (in this case,
> the partition) NOT a block range. But I guess the infrastructure is
> generalized enough that we could make that work. Related then would be
> an equivalent of ScanKey for the partitioning case. Just as ScanKeyData
> has correspondence with the index being used, the hypothetical
> PartitionScanKeyData (which may be an entirely bad/half-baked idea!)
> would represent the application of comparison operator between table
> column (partitioning key column) and a constant (as per quals).

I'm not going to say this couldn't be done, but how is any of it
better than having a list of the partition bounds and binary-searching
it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [PATCH] HINT: pg_hba.conf changed since last config reload
Next
From: Tom Lane
Date:
Subject: Re: speedup tidbitmap patch: cache page