Re: On partitioning - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: On partitioning
Date
Msg-id 20141114021201.GN28859@tamriel.snowman.net
Whole thread Raw
In response to Re: On partitioning  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: On partitioning  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Nov 13, 2014 at 1:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > but
> > with declarative partitioning, I expect us to eventually be able to
> > eliminate complete partitions from consideration on both sides of a
> > partition-table join and optimize cases where we have two partitioned
> > tables being joined with a compatible join key and only actually do
> > joins between the partitions which overlap each other.  I don't see
> > those happening if we're allowing a node tree (only).  If having a node
> > tree is just one option among other partitioning options, then we can
> > provide users with the ability to choose what suits their particular
> > needs.
>
> This seems completely muddled to me.  What we're talking about is how
> to represent the partition definition in the system catalogs.  I'm not
> proposing that the user would "partition by pg_node_tree"; what the
> heck would that even mean?

They'd provide an expression which would be able to identify the
partition to be used.  In a way, this is exactly how many folks do
partitioning today with inheritence- consider the if/else trees in
triggers for handling new data coming into the parent table.  That's
also why it wouldn't be easy to optimize for.

> I'm proposing one way of serializing the
> partition definitions that the user specifies into something that can
> be stored into a system catalog, which happens to reuse the existing
> infrastructure that we use for that same purpose in various other
> places.

Ok, I didn't immediately see how a node tree would be used for this- but
I admit that I've not gone back through the entirety of this iteration
of the partitioning discussion.

> I don't have a problem with somebody coming up with another
> way of representing the data in the catalogs; I'm just brainstorming.

Ditto.

> But saying that we'll be able to optimize joins better if we store the
> same data as anyarray rather than pg_node_tree or visca versa doesn't
> make any sense at all.

Ok, if the node tree is constrained in what can be stored in it then I
understand how we could still use optimize based on what we've stored in
it.  I'm not entirely sure a node tree makes sense but at least I
understand better.

> > I'm not a fan of using pg_class- there are a number of columns in there
> > which I would *not* wish to be allowed to be different per partition
> > (starting with relowner and relacl...).  Making those NULL would be just
> > as bad (probably worse, really, since we'd also need to add new columns
> > to pg_class to indicate the partitioning...) as having a sparsely
> > populated new catalog table.
>
> I think you are, again, confused as to what we're discussing.  Nobody,
> including Alvaro, has proposed a design where the individual
> partitions don't have pg_class entries of some kind.  What we're
> talking about is where to store the metadata for partition exclusion
> and tuple routing.

This discussion has gone a few rounds before and, yes, I was just
jumping into the middle of this particular round, but I'm pretty sure
I'm not the first to point out that storing the individual partition
information into pg_class isn't ideal since there are pieces that we
don't actually want to be different per partition, as I outlined
previously.  Perhaps what that means is we should actually go the other
way and move *those* columns into a new catalog instead.

Consider this (totally off-the-cuff):

pg_relation (pg_tables? pg_heaps?) oid relname relnamespace reltype reloftype relowner relam (?) relhas* relisshared
relpersistencerelkind (?) relnatts relchecks relacl reloptions relhowpartitioned (?) 

pg_class pg_relation.oid relfilenode reltablespace relpages reltuples reltoastrelid reltoastidxid relfrozenxid
relhasindexes(?) relpartitioninfo (whatever this ends up being) 

The general idea being to seperate the user-facing notion of a "table"
from the underlying implementation, with the implementation allowing
multiple sets of files to be used for each table.  It's certainly not
for the faint of heart, but we saw what happened with our inheiritance
structure allowing different permissions on the child tables- we ended
up creating a pretty grotty hack to deal with it (going through the
parent bypasses the permissions).  That's the best solution for that
situation, but it's far from ideal and it'd be nice to avoid that same
risk with partitioning.  Additionally, if each partition is in pg_class,
how are we handling name conflicts?  Why do individual partitions even
need to have a name?  Do we allow queries against them directly?  etc..

These are just my thoughts on it and I really don't intend to derail
progress on having a partitioning system and I hope that my comments
don't lead to that happening.
Thanks,
    Stephen

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: what does this mean: "running xacts with xcnt == 0"
Next
From: Stephen Frost
Date:
Subject: Re: On the warpath again about ill-considered inclusion nests