Thread: On partitioning

On partitioning

From
Alvaro Herrera
Date:
Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
reference Tom's post
http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us
which mentions the possibility of a different partitioning
implementation than what we have so far.  As it turns out, I've been
thinking about partitioning recently, so I thought I would share what
I'm thinking so that others can poke holes.  My intention is to try to
implement this as soon as possible.


Declarative partitioning
========================

In this design, partitions are first-class objects, not normal tables in
inheritance hierarchies.  There are no pg_inherits entries involved at all.

Partitions are a physical implementation detail.  Therefore we do not allow
the owner to be changed, or permissions to be granted directly to partitions;
all these operations happen to the parent relation instead.

System Catalogs
---------------

In pg_class we have two additional relkind values:

* relkind RELKIND_PARTITIONED_REL 'P' indicates a partitioned relation.
  It is used to indicate a parent table, i.e. one the user can directly
  address in DML queries.  Such relations DO NOT have their own storage.
  These use the same rules as regular tables for access privileges,
  ownership and so on.

* relkind RELKIND_PARTITION 'p' indicates a partition within a
  partitioned relation (its parent).  These cannot be addressed directly
  in DML queries and only limited DDL support is provided.  They don't
  have their own pg_attribute entries either and therefore they are
  always identical in column definitions to the parent relation.  Since
  they are not accessible directly, there is no need for ACL
  considerations; the parent relation's owner is the owner, and grants
  are applied to the parent relation only.  XXX --- is there a need for
  a partition having different column default values than its parent
  relation?

Partitions are numbered sequentially, normally from 1 onwards; but it is
valid to have negative partition numbers and 0.  Partitions don't have
names (except automatically generated ones for pg_class.relname, but
they are unusable in DDL).

Each partition is assigned an Expression that receives a tuple and
returns boolean.  This expression returns true if a given tuple belongs
into it, false otherwise.  If a tuple for a partitioned relation is run
through expressions of all partitions, exactly one should return true.
If none returns true, it might be because the partition has not been
created yet.  A user-facing error is raised in this case (Rationale: if
user creates a partitioned rel and there is no partition that accepts
some given tuple, it's the user's fault.)

Additionally, each partitioned relation may have a master expression.
This receives a tuple and returns an integer, which corresponds to the
number of the partition it belongs into.

There are two new system catalogs:

pg_partitioned_rel --> (prelrelid, prelexpr)
pg_partition       --> (partrelid, partseq, partexpr, partoverflow)

For partitioned rels that have prelexpr, we run that expression and
obtain the partition number; as a crosscheck we run partexpr and ensure
it returns true.  For partitioned rels that don't have prelexpr, we run
partexpr for each partition in turn until one returns true.  This means
that for a properly set up partitioned table, we need to run a single
expression on a tuple to find out what partition the tuple belongs into.
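The two-tier scheme above can be sketched in Python.  This is purely
illustrative: `route_tuple`, the predicate dictionary, and the lambdas
are invented names for this sketch, not actual catalog or executor
structures; the fast path runs the master expression (prelexpr) and uses
the per-partition expression (partexpr) only as a crosscheck, while the
fallback scans every partition's expression.

```python
# Hypothetical sketch of tuple routing (all names invented for illustration).
def route_tuple(tup, master_expr, partitions):
    """partitions: dict mapping partition number -> boolean predicate."""
    if master_expr is not None:
        n = master_expr(tup)                  # one expression run: fast path
        if n not in partitions:
            raise LookupError("no partition %d exists for this tuple" % n)
        # Crosscheck: the partition's own expression must agree.
        assert partitions[n](tup), "partexpr disagrees with prelexpr"
        return n
    # No master expression: run each partition's predicate in turn.
    matches = [n for n, pred in partitions.items() if pred(tup)]
    if len(matches) != 1:
        raise LookupError("tuple matched %d partitions, expected exactly 1"
                          % len(matches))
    return matches[0]

# Range partitioning on a single integer key "a", three partitions:
parts = {
    1: lambda t: t["a"] < 10,
    2: lambda t: 10 <= t["a"] < 20,
    3: lambda t: t["a"] >= 20,
}
master = lambda t: 1 if t["a"] < 10 else (2 if t["a"] < 20 else 3)
```

With a master expression the cost per tuple is a single expression
evaluation; without one, it degrades to a linear scan of partition
predicates.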

Per-partition expressions are formed as each partition is created, and
are based on the user-supplied partitioning criterion.  Master
expressions are formed at relation creation time.  (XXX Can we change
the master expression later, as a result of some ALTER command?
Presumably this would mean that all partitions might need to be
rewritten.)

Triggers
--------

(These are user-defined triggers, not partitioning triggers.  In fact
there are no partitioning triggers at all.)

Triggers are attached to the parent relation, not to the specific
partition.  When a trigger function runs on a tuple inserted, updated or
deleted in a partition, the data received by the trigger function makes
it appear that the tuple belongs to the parent relation.  There is no
need to let the trigger know which partition the tuple went in or came
from.  XXX is there a need to give it the number of the partition the
tuple went into?


Syntax
------

CREATE TABLE xyz ( ... ) PARTITION BY RANGE ( a_expr )

This creates the main table only: no partitions are created
automatically.

We do not support other types of partitioning at this stage.  We will
implement these later.

We do not currently support ALTER TABLE/PARTITION BY (i.e. partition a
table after the fact).  We leave this as a future improvement.

Allowed actions on RELKIND_PARTITIONED_REL:

* ALTER TABLE <xyz> CREATE PARTITION <n>
  This creates a new partition.
* ALTER TABLE <xyz> CREATE PARTITION FOR <value>
  Same as above; the partition number is determined automatically.

Allowed actions on a RELKIND_PARTITION:

* ALTER PARTITION <n> ON TABLE <xyz> SET TABLESPACE
* ALTER PARTITION <n> ON TABLE <xyz> DROP
* CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz>
* VACUUM parent PARTITION <n>


As a future extension we will allow partitions to become detached from
the parent relation, thus becoming an independent table.  This might be
a relatively expensive operation: pg_attribute entries need to be
created, for example.

Overflow Partitions
-------------------

There is no explicit concept of overflow partitions.

Vacuum, aging
-------------

PARTITIONED_RELs, not containing tuples directly, do not have
relfrozenxid or relminmxid.  Each partition has individual values for
these variables.

Autovacuum knows to ignore PARTITIONED_RELs, and considers each
RELKIND_PARTITION separately.

Each partition is vacuumed as a normal relation.

Planner
-------

A partitioned relation behaves just like a regular relation for purposes
of planner.  XXX do we need special considerations regarding relation
size estimation?

For scan plans, we need to prepare Append lists which are used to scan
for tuples in a partitioned relation.  We can set up fake constraint
expressions based on the partitioning expressions, which let the planner
discard unnecessary partitions by way of constraint exclusion.
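The intended effect of constraint exclusion on a range-partitioned scan
can be sketched as follows (invented names; real constraint exclusion
works on constraint expressions, not explicit bound tuples):

```python
# Hypothetical sketch: each partition carries an implied constraint
# [lo, hi); any partition whose range cannot overlap the interval implied
# by the query's WHERE clause is dropped from the Append list.
def prune_partitions(partitions, q_lo, q_hi):
    """partitions: list of (partition_number, lo, hi), half-open ranges.
    Returns numbers of partitions that may hold rows with q_lo <= key < q_hi."""
    return [n for n, lo, hi in partitions if lo < q_hi and q_lo < hi]

parts = [(1, 0, 10), (2, 10, 20), (3, 20, 30)]
# WHERE key >= 12 AND key < 15 -> only partition 2 can hold matching rows
survivors = prune_partitions(parts, 12, 15)
```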

(In the future we might be interested in creating specialized plan and
execution nodes that know more about partitioned relations, to avoid
creating useless Append trees only to prune them later.)

Executor
--------

When doing an INSERT or UPDATE, ResultRelInfo needs to be expanded for
partitioned relations: the target relation of an insertion is the parent
relation, but the actual partition needs to be resolved at ModifyTable
execution time.  This means RelOptInfo needs to know about partitions;
either we deal with them as "other rels" terms, or we create a new
RelOptKind.  At any rate, running the partitioning expression on the new
tuple would give a partition index.  This needs to be done once for
each new tuple.

I think during ExecInsert, after running triggers and before executing
constraints, we need to switch resultRelationDesc from the parent
relation into the partition-specific relation.

ExecInsertIndexTuples only knows about partitions.  It's an error to
call it using a partitioned rel.

Heap Access Method
------------------

For the purposes of low-level routines in heapam.c, only partitions
exist; trying to insert or modify tuples in a RELKIND_PARTITIONED_REL is
an error.  heap_insert and
heap_multi_insert only accept inserting tuples into an individual
partition.  These routines do not check that the tuples belong into the
specific partition; that's responsibility of higher-level code.  Because
of this, code like COPY will need to make its own checks.  Maybe we
should offer another API (in between high-level things such as
ModifyTable/COPY and heapam.c) that receives tuples into a
PARTITIONED_REL and routes them into specific partitions.  Note: need to
ensure we do not slow down COPY for the regular case of
RELKIND_RELATION.
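The intermediate routing layer proposed above might look something like
this sketch (all names are invented for illustration; this is not a real
PostgreSQL API): it accepts tuples addressed to the partitioned rel,
groups them by target partition, and hands each group to a
partition-level multi-insert, so the heapam-level routine itself never
checks partition membership.

```python
# Hypothetical sketch of a routing layer between ModifyTable/COPY and heapam.
def route_multi_insert(tuples, route, heap_multi_insert):
    """route: tuple -> partition number (the partitioning expression).
    heap_multi_insert(partition, batch): the low-level per-partition insert."""
    batches = {}
    for t in tuples:
        batches.setdefault(route(t), []).append(t)
    for part, batch in batches.items():
        heap_multi_insert(part, batch)   # membership already decided here
    return batches
```

Batching per partition like this is one way COPY could keep its
multi-insert behavior without slowing down the RELKIND_RELATION case,
which would bypass the router entirely.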


Taking backups --------------

pg_dump is able to dump a partitioned relation as a CREATE
TABLE/PARTITION command and a series of ALTER TABLE/CREATE PARTITION
commands.  The data of all partitions is considered a single COPY
operation.

XXX this limits the ability to restore in parallel.  To fix we might consider
using one COPY for each partition.  It's not clear what relation should be
mentioned in such a COPY command, though -- my instinct is that it
should reference the parent table only, not the individual partition.

Previous Discussion
-------------------
http://www.postgresql.org/message-id/d3c4af540703292358s8ed731el7771ab14083aa610@mail.gmail.com 
Auto Partitioning Patch - WIP version 1
(Nikhil Sontakke, March 2007)

http://www.postgresql.org/message-id/20080111231945.GY6934@europa.idg.com.au
Declarative partitioning grammar
(Gavin Sherry, January 2008)

http://www.postgresql.org/message-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d5d7@mail.gmail.com
Patch for automating partitions in PostgreSQL 8.4 Beta 2
(Kedar Potdar, Jun 2009)

http://www.postgresql.org/message-id/20091029111531.96CD.52131E4D@oss.ntt.co.jp
Syntax for partitioning
(Itagaki Takahiro, Oct 2009)

http://www.postgresql.org/message-id/AANLkTikP-1_8B04eyIK0sDf8uA5KMo64o8sorFBZE_CT@mail.gmail.com
Partitioning syntax
(Itagaki Takahiro, Jan 2010)


Not really related:
http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
Dynamic Partitioning using Segment Visibility Maps
(Simon Riggs, January 2008)


Still To Be Designed
--------------------
* Dependency issues
* Are indexes/constraints inherited from the parent rel?
* Multiple keys?  Subpartitioning?  Hash partitioning?


Open Questions
--------------

*  What's the syntax to refer to specific partitions within a partitioned
   table?
   We could do "TABLE <xyz> PARTITION <n>", but for example if in
   the future we add hash partitioning, we might need some non-integer
   addressing (OTOH assigning sequential numbers to hash partitions doesn't
   seem so bad).  Discussing with users of other DBMSs partitioning feature,
   one useful phrase is "TABLE <xyz> PARTITION FOR <value>".

* Do we want to provide partitioned materialized views?


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> [ partition sketch ]

> In this design, partitions are first-class objects, not normal tables in
> inheritance hierarchies.  There are no pg_inherits entries involved at all.

Hm, actually I'd say they are *not* first class objects; the problem with
the existing design is exactly that child tables *are* first class
objects.  This is merely a terminology quibble though.

> * relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned
>   relation (its parent).  These cannot be addressed directly in DML
>   queries and only limited DDL support is provided.  They don't have
>   their own pg_attribute entries either and therefore they are always
>   identical in column definitions to the parent relation.

Not sure that not storing the pg_attribute rows is a good thing; but
that's something that won't be clear till you try to code it.

> Each partition is assigned an Expression that receives a tuple and
> returns boolean.  This expression returns true if a given tuple belongs
> into it, false otherwise.

-1, in fact minus a lot.  One of the core problems of the current approach
is that the system, particularly the planner, hasn't got a lot of insight
into exactly what the partitioning scheme is in a partitioned table built
on inheritance.  If you allow the partitioning rule to be a black box then
that doesn't get any better.  I want to see a design wherein the system
understands *exactly* what the partitioning behavior is.  I'd start with
supporting range-based partitioning explicitly, and maybe we could add
other behaviors such as hashing later.

In particular, there should never be any question at all that there is
exactly one partition that a given row belongs to, not more, not less.
You can't achieve that with a set of independent filter expressions;
a meta-rule that says "exactly one of them should return true" is an
untrustworthy band-aid.

(This does not preclude us from mapping the tuple through the partitioning
rule and finding that the corresponding partition doesn't currently exist.
I think we could view the partitioning rule as a function from tuples to
partition numbers, and then we look in pg_class to see if such a partition
exists.)

> Additionally, each partitioned relation may have a master expression.
> This receives a tuple and returns an integer, which corresponds to the
> number of the partition it belongs into.

I guess this might be the same thing I'm arguing for, except that I say
it is not optional but is *the* way you define the partitioning.  And
I don't really want black-box expressions even in this formulation.
If you're looking for arbitrary partitioning rules, you can keep on
using inheritance.  The point of inventing partitioning, IMHO, is for
the system to have a lot more understanding of the behavior than is
possible now.

As an example of the point I'm trying to make, the planner should be able
to discard range-based partitions that are eliminated by a WHERE clause
with something a great deal cheaper than the theorem prover it currently
has to use for the purpose.  Black-box partitioning rules not only don't
improve that situation, they actually make it worse.

Other than that, this sketch seems reasonable ...
        regards, tom lane



Re: On partitioning

From
Greg Stark
Date:
On Fri, Aug 29, 2014 at 4:56 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> For scan plans, we need to prepare Append lists which are used to scan
> for tuples in a partitioned relation.  We can setup fake constraint
> expressions based on the partitioning expressions, which let the planner
> discard unnecessary partitions by way of constraint exclusion.
>
> (In the future we might be interested in creating specialized plan and
> execution nodes that know more about partitioned relations, to avoid
> creating useless Append trees only to prune them later.)

This seems like a big part of the point of doing first class
partitions. If we have an equivalence class that specifies a constant
for all the variables in the master expression then we should be able
to look up the corresponding partition as an O(1) operation (or
O(log n) if it involves searching a list) rather than iterating over
all the partitions and trying to prove lots of exclusions. We might
even need a btree index to store the partitions so that we can handle
scaling up and still find the corresponding partitions quickly.
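Greg's point can be made concrete with a small sketch (hypothetical
structure, not PostgreSQL code): once the planner has a constant for the
partitioning key, finding the target range partition is a binary search
over sorted lower bounds rather than a proof attempt per partition.

```python
import bisect

# Hypothetical structure: range partitions sorted by inclusive lower bound.
class RangePartitionMap:
    def __init__(self, bounds, part_numbers):
        # bounds[i] is the lower bound of partition part_numbers[i].
        self.bounds = bounds
        self.part_numbers = part_numbers

    def lookup(self, key):
        # O(log n): find the rightmost bound <= key.
        i = bisect.bisect_right(self.bounds, key) - 1
        if i < 0:
            raise LookupError("key below the lowest partition bound")
        return self.part_numbers[i]

pmap = RangePartitionMap([0, 10, 20], [1, 2, 3])
```

With many partitions this is exactly the kind of lookup a btree over
partition bounds would provide, versus running constraint exclusion's
prover once per partition.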

And I think there are still unanswered questions about indexes. You
seem to be implying that users would be free to create any index they
want on any partition. It's probably going to be necessary to support
creating an index on the partitioned table which would create an index
on each of the partitions and, crucially, automatically create
corresponding indexes whenever new partitions are added.

That said, everything that's here sounds pretty spot-on to me.

-- 
greg



Re: On partitioning

From
Pavel Stehule
Date:



2014-08-29 18:35 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> [ partition sketch ]

> In this design, partitions are first-class objects, not normal tables in
> inheritance hierarchies.  There are no pg_inherits entries involved at all.

Hm, actually I'd say they are *not* first class objects; the problem with
the existing design is exactly that child tables *are* first class
objects.  This is merely a terminology quibble though.

+1 .. even a few partitions slow down planning significantly, from 1ms
to 20ms, which is an issue with very simple queries over a PK


Re: On partitioning

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> And I think there are still unanswered questions about indexes.

One other interesting thought that occurs to me: are we going to support
UPDATEs that cause a row to belong to a different partition?  If so, how
are we going to handle the update chain links?
        regards, tom lane



Re: On partitioning

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
> > And I think there are still unanswered questions about indexes.
> 
> One other interesting thought that occurs to me: are we going to support
> UPDATEs that cause a row to belong to a different partition?  If so, how
> are we going to handle the update chain links?

Bah, I didn't mention it?  My current thinking is that it would be
disallowed; if you have chosen your partitioning key well enough it
shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
Maybe we can allow it later, but for a first cut this seems more than
good enough.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> One other interesting thought that occurs to me: are we going to support
>> UPDATEs that cause a row to belong to a different partition?  If so, how
>> are we going to handle the update chain links?

> Bah, I didn't mention it?  My current thinking is that it would be
> disallowed; if you have chosen your partitioning key well enough it
> shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
> Maybe we can allow it later, but for a first cut this seems more than
> good enough.

Hm.  I certainly agree that it's a case that could be disallowed for a
first cut, but it'd be nice to have some clue about how we might allow it
eventually.
        regards, tom lane



Re: On partitioning

From
Andres Freund
Date:
On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> One other interesting thought that occurs to me: are we going to support
> >> UPDATEs that cause a row to belong to a different partition?  If so, how
> >> are we going to handle the update chain links?
> 
> > Bah, I didn't mention it?  My current thinking is that it would be
> > disallowed; if you have chosen your partitioning key well enough it
> > shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
> > Maybe we can allow it later, but for a first cut this seems more than
> > good enough.
> 
> Hm.  I certainly agree that it's a case that could be disallowed for a
> first cut, but it'd be nice to have some clue about how we might allow it
> eventually.

Not pretty, but we could set t_ctid to some 'magic' value when switching
partitions.  Everything chasing ctid chains could then error out when
hitting an invisible row with such a t_ctid.  The use cases for doing
such updates really are more maintenance style commands, so it's
possibly not too bad from a usability POV :(

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
>> Hm.  I certainly agree that it's a case that could be disallowed for a
>> first cut, but it'd be nice to have some clue about how we might allow it
>> eventually.

> Not pretty, but we could set t_ctid to some 'magic' value when switching
> partitions. Everything chasing ctid chains could then error out when
> hitting a invisible row with such a t_ctid.

An actual fix would presumably involve adding a partition number to the
ctid chain field in tuples in partitioned tables.  The reason I bring it
up now is that we'd have to commit to doing that (or at least leaving room
for it) in the first implementation, if we don't want to have an on-disk
compatibility break.

There is certainly room to argue that the value of this capability isn't
worth the disk space this solution would eat.  But we should have that
argument while the option is still feasible ...

> The usecases for doing such
> updates really are more maintenance style commands, so it's possibly not
> too bad from a usability POV :(

I'm afraid that might just be wishful thinking.
        regards, tom lane



Re: On partitioning

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> One other interesting thought that occurs to me: are we going to support
> >> UPDATEs that cause a row to belong to a different partition?  If so, how
> >> are we going to handle the update chain links?
> 
> > Bah, I didn't mention it?  My current thinking is that it would be
> > disallowed; if you have chosen your partitioning key well enough it
> > shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
> > Maybe we can allow it later, but for a first cut this seems more than
> > good enough.
> 
> Hm.  I certainly agree that it's a case that could be disallowed for a
> first cut, but it'd be nice to have some clue about how we might allow it
> eventually.

I hesitate to suggest this, but we have free flag bits in
MultiXactStatus.  We could use a specially marked multixact member to
indicate the OID of the target relation; perhaps set an infomask bit to
indicate that this has happened.  Of course, no HOT updates are possible
so I think it's okay from a heap_prune_chain perspective.  This abuses
the knowledge that OIDs and XIDs are both 32 bits long.  

Since nowhere else do we have the space necessary to store the longer
data that a cross-partition update would require, I don't see anything
else ATM.  (For a moment I thought about abusing combo CIDs, but that
doesn't work because this value needs to be persistent and visible from
other backends, neither of which is a quality of combo CIDs.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Andres Freund
Date:
On 2014-08-29 13:29:19 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
> >> Hm.  I certainly agree that it's a case that could be disallowed for a
> >> first cut, but it'd be nice to have some clue about how we might allow it
> >> eventually.
> 
> > Not pretty, but we could set t_ctid to some 'magic' value when switching
> > partitions. Everything chasing ctid chains could then error out when
> > hitting a invisible row with such a t_ctid.
> 
> An actual fix would presumably involve adding a partition number to the
> ctid chain field in tuples in partitioned tables.  The reason I bring it
> up now is that we'd have to commit to doing that (or at least leaving room
> for it) in the first implementation, if we don't want to have an on-disk
> compatibility break.

Right. Just adding it unconditionally doesn't sound feasible to me. Our
per-row overhead is already too large. And it doesn't sound fun to have
the first-class partitions use a different heap tuple format than plain
relations.

What we could do is to add some sort of 'jump' tuple when moving a tuple
from one relation to another. So, when updating a tuple between
partitions we add another in the old partition with xmin_jump =
xmax_jump = xmax_old and have the jump tuple's content point to the new
relation.
Far from pretty, but it'd only matter overhead wise when used.

> > The usecases for doing such
> > updates really are more maintenance style commands, so it's possibly not
> > too bad from a usability POV :(
> 
> I'm afraid that might just be wishful thinking.

I admit that you might very well be right there :(

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-08-29 13:29:19 -0400, Tom Lane wrote:
>> An actual fix would presumably involve adding a partition number to the
>> ctid chain field in tuples in partitioned tables.  The reason I bring it
>> up now is that we'd have to commit to doing that (or at least leaving room
>> for it) in the first implementation, if we don't want to have an on-disk
>> compatibility break.

> What we could do is to add some sort of 'jump' tuple when moving a tuple
> from one relation to another. So, when updating a tuple between
> partitions we add another in the old partition with xmin_jump =
> xmax_jump = xmax_old and have the jump tuple's content point to the new
> relation.

Hm, that might work.  It sounds more feasible than Alvaro's suggestion
of abusing cmax --- I don't think that field is free for use in this
context.
        regards, tom lane



Re: On partitioning

From
Hannu Krosing
Date:
On 08/29/2014 07:15 PM, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> One other interesting thought that occurs to me: are we going to support
>>> UPDATEs that cause a row to belong to a different partition?  If so, how
>>> are we going to handle the update chain links?
>> Bah, I didn't mention it?  My current thinking is that it would be
>> disallowed; if you have chosen your partitioning key well enough it
>> shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
>> Maybe we can allow it later, but for a first cut this seems more than
>> good enough.
> Hm.  I certainly agree that it's a case that could be disallowed for a
> first cut, but it'd be nice to have some clue about how we might allow it
> eventually.
There needs to be some structure that is specific to partitions and not
multiple plain tables which would then be used for both update chains and
cross-partition indexes (as you seem to imply by jumping from indexes
to update chains a few posts back).

It would need to replace plain tid (pagenr, tupnr) with triple of (partid,
pagenr, tupnr).
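Hannu's widened tuple identifier can be sketched as a simple record
(invented names, not an actual on-disk format): the plain (page, tuple)
ctid gains a partition id, so update chains and cross-partition indexes
can point across partition boundaries.

```python
from typing import NamedTuple

# Hypothetical cross-partition tuple identifier.
class PartitionedTid(NamedTuple):
    partid: int   # which partition the tuple lives in
    pagenr: int   # block number within that partition
    tupnr: int    # line pointer offset within the block

old = PartitionedTid(partid=1, pagenr=42, tupnr=7)
# A cross-partition UPDATE would re-point the chain at a tid in another
# partition; a plain (pagenr, tupnr) ctid cannot express this.
new = old._replace(partid=2, pagenr=0, tupnr=1)
```

The cost Tom raises is visible here: every tid in a partitioned table
grows by one field, whether or not cross-partition updates ever happen.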

Cross-partition indexes are especially needed if we want to allow putting
UNIQUE constraints on non-partition-key columns.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: On partitioning

From
Hannu Krosing
Date:
On 08/29/2014 05:56 PM, Alvaro Herrera wrote:
> Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
> reference Tom's post
> http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us
> which mentions the possibility of a different partitioning
> implementation than what we have so far.  As it turns out, I've been
> thinking about partitioning recently, so I thought I would share what
> I'm thinking so that others can poke holes.  My intention is to try to
> implement this as soon as possible.
>
>
> Declarative partitioning
> ========================
> ...
> Still To Be Designed
> --------------------
> * Dependency issues
> * Are indexes/constraints inherited from the parent rel?
I'd say mostly yes.

There could be some extra "constraint exclusion type" magic for
conditional indexes, but the rest probably should come from the "main
table".

And there should be some kind of cross-partition indexes; this can
probably wait for version 2.

> * Multiple keys?  
Why not. But probably just for hash partitioning.
> Subpartitioning? 
Probably not.  If you need speed for huge numbers of partitions, use
Greg's idea of keeping the partitions in a tree (or just having a
partition index).
>  Hash partitioning?
At some point definitely.


Also one thing you left unmentioned is dropping (and perhaps also
truncating) a partition.  We still may want to do historic data
management the same way we do it now, by just getting rid of the whole
partition or its data.

At some point we may also want to redistribute data between partitions,
maybe for the case where we end up with 90% of the data in one partition
due to a bad partitioning key or partitioning function choice.  This is
again something that is hard now and can therefore be left to a later
version.

> Open Questions
> --------------
>
> *  What's the syntax to refer to specific partitions within a partitioned
>    table?
>    We could do "TABLE <xyz> PARTITION <n>", but for example if in
>    the future we add hash partitioning, we might need some non-integer
>    addressing (OTOH assigning sequential numbers to hash partitions doesn't
>    seem so bad).  Discussing with users of other DBMSs partitioning feature,
>    one useful phrase is "TABLE <xyz> PARTITION FOR <value>".
Or more generally

TABLE <xyz> PARTITION FOR/WHERE col1=val1, col2=val2, ...;



Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: On partitioning

From
Alvaro Herrera
Date:
Hannu Krosing wrote:

> Cross-partition indexes are especially needed if we want to allow putting
> UNIQUE constraints on non-partition-key columns.

I'm not going to implement cross-partition indexes in the first patch.
They are a huge can of worms.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Robert Haas
Date:
On Fri, Aug 29, 2014 at 11:56 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> In this design, partitions are first-class objects, not normal tables in
> inheritance hierarchies.  There are no pg_inherits entries involved at all.

Whoa.  I always assumed that table inheritance was a stepping-stone to
real partitioning, and that real partitioning would be built on top of
table inheritance.  In particular, I assume that (as Itagaki
Takahiro's patch did those many years ago) we'd add some metadata
somewhere to allow fast tuple routing (for both pruning and
inserts/updates).  What's the benefit of inventing something new
instead?

I'm skeptical about your claim that there will be no pg_inherits
entries involved at all.  You need some way to know which partitions
go with which parent table.  You can store that many-to-one mapping
someplace other than pg_inherits, but it seems to me that that doesn't
buy you anything; they're just pg_inherits entries under some other
name.  Why reinvent that?

> Each partition is assigned an Expression that receives a tuple and
> returns boolean.  This expression returns true if a given tuple belongs
> into it, false otherwise.  If a tuple for a partitioned relation is run
> through expressions of all partitions, exactly one should return true.
> If none returns true, it might be because the partition has not been
> created yet.  A user-facing error is raised in this case (Rationale: if
> user creates a partitioned rel and there is no partition that accepts
> some given tuple, it's the user's fault.)
>
> Additionally, each partitioned relation may have a master expression.
> This receives a tuple and returns an integer, which corresponds to the
> number of the partition it belongs into.

I agree with Tom: this is a bad design.  In particular, if we want to
scale to large numbers of partitions (a principal weakness of the
present system) we need the operation of routing a tuple to a
partition to be as efficient as possible.  Range partitioning can be
O(lg n) where n is the number of partitions: store a list of the
boundaries and binary-search it.  List partitioning can be O(lg k)
where k is the number of values (which may be more than the number of
partitions) via a similar technique.  Hash partitioning can be O(1).
I'm not sure what other kind of partitioning anybody would want to do,
but it's likely that they *won't* want it to be O(1) in the number of
partitions.  So I'd say have *only* the master expression.

But, really, I don't think an expression is the right way to store
this; evaluating that repeatedly will, I think, still be too slow.
Think about what happens in PL/pgsql: minimizing the number of times
that you enter and exit the executor helps performance enormously,
even if the expressions are simple enough not to need planning.  I
think the representation should be more like an array of partition
boundaries and the pg_proc OID of a comparator.

> Per-partition expressions are formed as each partition is created, and
> are based on the user-supplied partitioning criterion.  Master
> expressions are formed at relation creation time.  (XXX Can we change
> the master expression later, as a result of some ALTER command?
> Presumably this would mean that all partitions might need to be
> rewritten.)

This is another really important point.  If you store an opaque
expression mapping partitioning keys to partition numbers, you can't
do things like this efficiently.  With a more transparent
representation, like a sorted array of partition boundaries for range
partitioning, or a sorted array of hash values for consistent hashing,
you can do things like split and merge partitions efficiently,
minimizing rewriting.
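Under such a transparent representation, splitting a partition is just inserting one boundary into the array; only rows in the affected range could ever need moving. A toy illustration (invented boundaries):

```python
import bisect

# Range partitioning as a sorted boundary array: splitting a partition
# means inserting one new boundary; every other partition's definition
# is untouched, so only rows in the split range can need rewriting.
bounds = [0, 10, 20]

def split(at):
    """Split the partition containing `at` by adding a new boundary."""
    bisect.insort(bounds, at)

split(15)        # split [10, 20) into [10, 15) and [15, 20)
print(bounds)    # [0, 10, 15, 20]
```

An opaque expression mapping keys to partition numbers offers no comparable way to see that the other partitions are unaffected.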

> Planner
> -------
>
> A partitioned relation behaves just like a regular relation for purposes
> of planner.  XXX do we need special considerations regarding relation
> size estimation?
>
> For scan plans, we need to prepare Append lists which are used to scan
> for tuples in a partitioned relation.  We can setup fake constraint
> expressions based on the partitioning expressions, which let the planner
> discard unnecessary partitions by way of constraint exclusion.

So if we're going to do all this, why bother making the partitions
anything other than inheritance children?  There might be some benefit
in having the partitions be some kind of stripped-down object if we
could avoid some of these planner gymnastics and get, e.g. efficient
run-time partition pruning.  But if you're going to generate Append
plans and switch ResultRelInfos and stuff just as you would for an
inheritance hierarchy, why not just make it an inheritance hierarchy?

It seems pretty clear to me that we need partitioned tables to have
the same tuple descriptor throughout the relation, for efficient tuple
routing and so on.  But the other restrictions you're proposing to
impose on partitions have no obvious value that I can see.  We could
have a rule that when you inherit from a partition root, you can only
inherit from that one table (no multiple inheritance) and your tuple
descriptor must match precisely (down to dropped columns and column
ordering) and that would give you everything I think you really need
here.  There's no gain to be had in forbidding partitions from having
different owners, or being selected from directly, or having
user-visible names.  The first of those is arguably useless, but it's
not really causing us any problems, and the latter two are extremely
useful features.  Unless you are going to implement partition pruning
that is so good that it will never fail to realize a situation where only
one partition needs to be scanned, letting users target the partition
directly is a very important escape hatch.

> (In the future we might be interested in creating specialized plan and
> execution nodes that know more about partitioned relations, to avoid
> creating useless Append trees only to prune them later.)

Good idea.

> pg_dump is able to dump a partitioned relation as a CREATE
> TABLE/PARTITION command and a series of ALTER TABLE/CREATE PARTITION
> commands.  The data of all partitions is considered a single COPY
> operation.
>
> XXX this limits the ability to restore in parallel.  To fix we might consider
> using one COPY for each partition.  It's not clear what relation should be
> mentioned in such a COPY command, though -- my instinct is that it
> should reference the parent table only, not the individual partition.

Targeting the individual partitions seems considerably better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Amit Langote
Date:
On Sat, Aug 30, 2014 at 12:56 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
> reference Tom's post
> http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us
> which mentions the possibility of a different partitioning
> implementation than what we have so far.  As it turns out, I've been
> thinking about partitioning recently, so I thought I would share what
> I'm thinking so that others can poke holes.  My intention is to try to
> implement this as soon as possible.
>

+1.



Re: On partitioning

From
Tom Lane
Date:
Another thought about this general topic:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> ...
> Allowed actions on a RELKIND_PARTITION:
> * CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz>
> ...
> Still To Be Designed
> --------------------
> * Are indexes/constraints inherited from the parent rel?

I think one of the key design decisions we have to make is whether
partitions are all constrained to have exactly the same set of indexes.
If we don't insist on that it will greatly complicate planning compared
to what we'll get if we do insist on it, because then the planner will
need to generate a separate customized plan subtree for each partition.
Aside from costing planning time, most likely that would forever prevent
us from pushing some types of intelligence about partitioning into the
executor.

Now, in the current model, it's up to the user what indexes to create
on each partition, and sometimes one might feel that maintaining a
particular index is unnecessary in some partitions.  But the flip side
of that is it's awfully easy to screw yourself by forgetting to add
some index when you add a new partition.  So I'm not real sure which
approach is superior from a purely user-oriented perspective.

I'm not trying to push one or the other answer right now, just noting
that this is a critical decision.
        regards, tom lane



Re: On partitioning

From
Hannu Krosing
Date:
On 08/31/2014 10:03 PM, Tom Lane wrote:
> Another thought about this general topic:
>
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> ...
>> Allowed actions on a RELKIND_PARTITION:
>> * CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz>
>> ...
>> Still To Be Designed
>> --------------------
>> * Are indexes/constraints inherited from the parent rel?
> I think one of the key design decisions we have to make is whether
> partitions are all constrained to have exactly the same set of indexes.
> If we don't insist on that it will greatly complicate planning compared
> to what we'll get if we do insist on it, because then the planner will
> need to generate a separate customized plan subtree for each partition.
> Aside from costing planning time, most likely that would forever prevent
> us from pushing some types of intelligence about partitioning into the
> executor.
>
> Now, in the current model, it's up to the user what indexes to create
> on each partition, and sometimes one might feel that maintaining a
> particular index is unnecessary in some partitions.  But the flip side
> of that is it's awfully easy to screw yourself by forgetting to add
> some index when you add a new partition.  
The "forgetting" part is easy to solve by inheriting all indexes from the
parent (or template) partition unless explicitly told not to.

One other thing that has been bothering me about this proposal
is the ability to take partitions offline for maintenance, or to load
them offline and then switch them in.

In current scheme we do this using ALTER TABLE ... [NO] INHERIT ...

If we also want to have this with the not-directly-accessible partitions
then perhaps it could be done by having a possibility to move
a partition between two tables with exactly the same structure?

> So I'm not real sure which
> approach is superior from a purely user-oriented perspective.
What we currently have is a very flexible scheme which has a few
drawbacks

1) unnecessarily complex for simple case
2) easy to shoot yourself in the foot by forgetting something
3) can be hard on planner, especially with huge number of partitions

An alternative way of solving these problems is adding some
(meta-)constraints to the current way of doing things, plus some more
automation:

CREATE TABLE FOR PARTITIONMASTER
    WITH (ALL_INDEXES_SAME=ON,
          SAME_STRUCTURE_ALWAYS=ON,
          SINGLE_INHERITANCE_ONLY=ON,
          NESTED_INHERITS=OFF,
          PARTITION_FUNCTION=default_range_partitioning(int)
    );

and then force these when adding inherited tables (in this case,
partition tables) via either CREATE TABLE or ALTER TABLE.

Best Regards

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: On partitioning

From
Martijn van Oosterhout
Date:
On Fri, Aug 29, 2014 at 12:35:50PM -0400, Tom Lane wrote:
> > Each partition is assigned an Expression that receives a tuple and
> > returns boolean.  This expression returns true if a given tuple belongs
> > into it, false otherwise.
>
> -1, in fact minus a lot.  One of the core problems of the current approach
> is that the system, particularly the planner, hasn't got a lot of insight
> into exactly what the partitioning scheme is in a partitioned table built
> on inheritance.  If you allow the partitioning rule to be a black box then
> that doesn't get any better.  I want to see a design wherein the system
> understands *exactly* what the partitioning behavior is.  I'd start with
> supporting range-based partitioning explicitly, and maybe we could add
> other behaviors such as hashing later.
>
> In particular, there should never be any question at all that there is
> exactly one partition that a given row belongs to, not more, not less.
> You can't achieve that with a set of independent filter expressions;
> a meta-rule that says "exactly one of them should return true" is an
> untrustworthy band-aid.
>
> (This does not preclude us from mapping the tuple through the partitioning
> rule and finding that the corresponding partition doesn't currently exist.
> I think we could view the partitioning rule as a function from tuples to
> partition numbers, and then we look in pg_class to see if such a partition
> exists.)

There is one situation where you need to be more flexible, and that is
if you ever want to support online repartitioning. To do that you have
to distinguish between "I want to insert tuple X, which partition
should it go into" and "I want to know which partitions I need to look
for partition_key=Y".

For the latter you really need an expression per partition, or
something equivalent.  If performance is an issue I suppose you could
live with having an "old" and a "new" partition scheme, so you
couldn't have two "live repartitionings" happening simultaneously.

Now, if you want to close the door on online repartitioning forever,
then that's fine. But being in the position of having to say "yes, our
partitioning scheme sucks, but we would have to take the database down
for a week to fix it" is no fun.

Unless logical replication provides a way out maybe??

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: On partitioning

From
Greg Stark
Date:
On Sun, Aug 31, 2014 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Aside from costing planning time, most likely that would forever prevent
> us from pushing some types of intelligence about partitioning into the
> executor.

How would it affect this calculus if there were partitioned indexes
which were created on the overall table and guaranteed to exist on
each partition that the planner could use -- and then possibly also
per-partition indexes that might exist in addition to those? So the
planner could make deductions and leave some intelligence about
partitions to the executor as long as they only depend on partitioned
indexes but might be able to take advantage of a per-partition index
if it's an unusual situation. I'm imagining for example a partitioned
table where only the current partition is read-write and OLTP queries
restrict themselves to working only with the current partition. Having
excluded the other partitions the planner is free to use any of the
indexes liberally.

That said, I think the typical approach to this is to only allow
indexes that are defined for the whole table. If the user wants to
have different indexes for the current time period they would have a
separate table with all the indexes on it that is only moved into the
partitioned table once it's finished being used for the atypical
queries. Oracle supports "local partitioned indexes" (which are
partitioned like the table) and "global indexes" (which span
partitions) but afaik it doesn't support indexes on only some
partitions.

Furthermore, we have partial indexes. Partial indexes mean you can
always create a partial index on just one partition's range of keys.
The index will exist for all partitions but just be empty for all but
the partitions that matter. The planner can plan based on the partial
index's where clause which would accomplish the same thing, I think.


-- 
greg



Re: On partitioning

From
Andres Freund
Date:
On 2014-08-29 20:12:16 +0200, Hannu Krosing wrote:
> It would need to replace plain tid (pagenr, tupnr) with triple of (partid,
> pagenr, tupnr).
> 
> Cross-partition indexes are especially needed if we want to allow putting
> UNIQUE constraints on non-partition-key columns.

I actually don't think this is necessary. I'm pretty sure that you can
build an efficient and correct version of unique constraints with
several underlying indexes in different partitions each. The way
exclusion constraints are implemented imo is a good guide.
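A toy illustration of that idea (pure Python, invented names; a real implementation would also need the concurrency handling that exclusion constraints provide): a global UNIQUE check can probe each partition's local index instead of consulting one cross-partition structure:

```python
# Toy sketch: enforcing a UNIQUE constraint on a non-partition-key
# column using one local "unique index" per partition, with no global
# cross-partition index.  Concurrency control is deliberately omitted;
# that is where the exclusion-constraint machinery would come in.
partition_indexes = [set(), set(), set()]

def insert(part, value):
    # Probe every partition's local index before inserting locally.
    for idx in partition_indexes:
        if value in idx:
            raise ValueError("duplicate key value violates unique constraint")
    partition_indexes[part].add(value)

insert(0, "alice")
insert(2, "bob")
try:
    insert(1, "alice")   # duplicate in another partition is rejected
except ValueError as e:
    print(e)
```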

I personally think that implementing cross partition indexes has a low
enough cost/benefit ratio that I doubt it's wise to tackle it anytime
soon.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Sun, Aug 31, 2014 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Aside from costing planning time, most likely that would forever prevent
>> us from pushing some types of intelligence about partitioning into the
>> executor.

> How would it affect this calculus if there were partitioned indexes
> which were created on the overall table and guaranteed to exist on
> each partition that the planner could use -- and then possibly also
> per-partition indexes that might exist in addition to those?

That doesn't actually fix the planning-time issue at all.  Either the
planner considers each partition individually to create a custom plan
for it, or it doesn't.

The "push into executor" idea I was alluding to is that we might invent
plan constructs like a ModifyTable node that applies to a whole
inheritance^H^H^Hpartitioning tree and leaves the tuple routing to be
done at runtime.  You're not going to get a plan structure like that
if the planner is building a separate plan subtree for each partition.
        regards, tom lane



Re: On partitioning

From
Andres Freund
Date:
On 2014-08-31 16:03:30 -0400, Tom Lane wrote:
> Another thought about this general topic:
> 
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > ...
> > Allowed actions on a RELKIND_PARTITION:
> > * CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz>
> > ...
> > Still To Be Designed
> > --------------------
> > * Are indexes/constraints inherited from the parent rel?
> 
> I think one of the key design decisions we have to make is whether
> partitions are all constrained to have exactly the same set of indexes.
> If we don't insist on that it will greatly complicate planning compared
> to what we'll get if we do insist on it, because then the planner will
> need to generate a separate customized plan subtree for each partition.
> Aside from costing planning time, most likely that would forever prevent
> us from pushing some types of intelligence about partitioning into the
> executor.

> Now, in the current model, it's up to the user what indexes to create
> on each partition, and sometimes one might feel that maintaining a
> particular index is unnecessary in some partitions.  But the flip side
> of that is it's awfully easy to screw yourself by forgetting to add
> some index when you add a new partition.  So I'm not real sure which
> approach is superior from a purely user-oriented perspective.

I think we're likely to end up with both. In many cases it'll be far
superior from a usability and planning perspective to have indices on
the 'toplevel table' (do we have a good name for that?).

But on the flip side, one of the significant use cases for partitioning
is dealing with historical data. In many cases old data has to be saved
for years but is barely ever queried. It'd be a shame to inflict all
indexes on all partitions for that kind of data. It'd surely be a useful
step to add sane partitioning without that capability, but we shouldn't
base the design on that decision.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Greg Stark
Date:
On Mon, Sep 1, 2014 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The "push into executor" idea I was alluding to is that we might invent
> plan constructs like a ModifyTable node that applies to a whole
> inheritance^H^H^Hpartitioning tree and leaves the tuple routing to be
> done at runtime.  You're not going to get a plan structure like that
> if the planner is building a separate plan subtree for each partition.

Well my message was assuming that in that case it would only consider
the partitioned indexes. It would only consider the isolated indexes
if the planner was able to identify a specific partition. That's
probably the only type of query where such indexes are likely to be
useful.


-- 
greg



Re: On partitioning

From
Andres Freund
Date:
On 2014-09-01 11:59:37 -0400, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
> > On Sun, Aug 31, 2014 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Aside from costing planning time, most likely that would forever prevent
> >> us from pushing some types of intelligence about partitioning into the
> >> executor.
> 
> > How would it affect this calculus if there were partitioned indexes
> > which were created on the overall table and guaranteed to exist on
> > each partition that the planner could use -- and then possibly also
> > per-partition indexes that might exist in addition to those?
> 
> That doesn't actually fix the planning-time issue at all.  Either the
> planner considers each partition individually to create a custom plan
> for it, or it doesn't.

We could have information about the indexing situation in child
partitions on the toplevel table, i.e. note whether child partitions
have individual indexes, and possibly constraints.

> The "push into executor" idea I was alluding to is that we might invent
> plan constructs like a ModifyTable node that applies to a whole
> inheritance^H^H^Hpartitioning tree and leaves the tuple routing to be
> done at runtime.  You're not going to get a plan structure like that
> if the planner is building a separate plan subtree for each partition.

It doesn't sound impossible to evaluate at plan time whether to use
nodes covering several partitions or use a separate subplan for
individual partitions. We're going to need information which partitions
to scan in those nodes anyway.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Heikki Linnakangas
Date:
On 09/01/2014 06:59 PM, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> On Sun, Aug 31, 2014 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Aside from costing planning time, most likely that would forever prevent
>>> us from pushing some types of intelligence about partitioning into the
>>> executor.
>
>> How would it affect this calculus if there were partitioned indexes
>> which were created on the overall table and guaranteed to exist on
>> each partition that the planner could use -- and then possibly also
>> per-partition indexes that might exist in addition to those?
>
> That doesn't actually fix the planning-time issue at all.  Either the
> planner considers each partition individually to create a custom plan
> for it, or it doesn't.

Hmm. Couldn't you plan together all partitions that do have the same 
indexes? In other words, create a custom plan for each group of 
partitions, rather than each partition?

- Heikki




Re: On partitioning

From
Hannu Krosing
Date:
On 09/01/2014 05:52 PM, Andres Freund wrote:
> On 2014-08-29 20:12:16 +0200, Hannu Krosing wrote:
>> It would need to replace plain tid (pagenr, tupnr) with triple of (partid,
>> pagenr, tupnr).
>>
>> Cross-partition indexes are especially needed if we want to allow putting
>> UNIQUE constraints on non-partition-key columns.
> I actually don't think this is necessary. I'm pretty sure that you can
> build an efficient and correct version of unique constraints with
> several underlying indexes in different partitions each. The way
> exclusion constraints are implemented imo is a good guide.
>
> I personally think that implementing cross partition indexes has a low
> enough cost/benefit ratio that I doubt it's wise to tackle it anytime
> soon.
Also it has the downside of (possibly) making DROP PARTITION either
slow or wasting space until next VACUUM.

So if building composite unique indexes over multiple per-partition
indexes is doable, I would much prefer this.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: On partitioning

From
Craig Ringer
Date:
On 09/01/2014 11:52 PM, Andres Freund wrote:
> I personally think that implementing cross partition indexes has a low
> enough cost/benefit ratio that I doubt it's wise to tackle it anytime
> soon.

UNIQUE constraints on partitioned tables (and thus foreign key
constraints pointing to partitioned tables) are a pretty big limitation
at the moment.

That said, the planner may well be able to use the greater knowledge of
the partitioned table structure to do this implicitly, as it knows that a
unique index on the partition is also implicitly unique across
partitions on the partitioning key.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Craig Ringer
Date:
On 09/01/2014 04:03 AM, Tom Lane wrote:

> I think one of the key design decisions we have to make is whether
> partitions are all constrained to have exactly the same set of indexes.

... and a lot of that comes down to what use cases the partitioning is
meant to handle, and what people are expected to continue to DIY with
inheritance.

Simple range and hash partitioning are the main things being discussed.

Other moderately common partitioning uses seem to be hot/cold
partitioning, usually on unequal ranges, and closely related live/dead
partitioning for apps that soft-delete data.

In both those you may well want to suppress indexes on the cold/dead
portion, much like we currently have partial indexes.

In fact, how different is an index that's present on only a subset of
partitions to a partial index, in planning terms? We know the partitions
it is/isn't on, after all, and can form an expression that finds just
those partitions.

(I guess the answer there is that partial index planning is probably not
smart enough to be useful for this).

> If we don't insist on that it will greatly complicate planning compared
> to what we'll get if we do insist on it, because then the planner will
> need to generate a separate customized plan subtree for each partition.

Seems to me like a "make room to support it in the future, but don't do
it now" thing.

Partitioning schemes like:

[prior years]
[last year]
[this year]
[this month]
[this week]

could benefit from it, but they also need things like online
repartitioning, updates to move tuples across partitions, etc.

So it's all in the "let's not lock it out for the future, but lets not
tackle it now either" box.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Bruce Momjian
Date:
On Sun, Aug 31, 2014 at 10:45:29PM +0200, Martijn van Oosterhout wrote:
> There is one situation where you need to be more flexible, and that is
> if you ever want to support online repartitioning. To do that you have
> to distinguish between "I want to insert tuple X, which partition
> should it go into" and "I want to know which partitions I need to look
> for partition_key=Y".
> 
> For the latter you really need an expression per partition, or
> something equivalent.  If performance is an issue I suppose you could
> live with having an "old" and an "new" partition scheme, so you
> couldn't have two "live repartitionings" happening simultaneously.
> 
> Now, if you want to close the door on online repartitioning forever
> then that fine. But being in the position of having to say "yes our
> partitioning scheme sucks, but we would have to take the database down
> for a week to fix it" is no fun.
> 
> Unless logical replication provides a way out maybe??

I am unclear why having information per-partition rather than on the
parent table helps with online repartitioning.

Robert's idea of using normal table inheritance means we can access/move
the data independently of the partitioning system.  My guess is that we
will need to do repartitioning with some tool, rather than as part of
normal database operation.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: On partitioning

From
Martijn van Oosterhout
Date:
On Tue, Sep 02, 2014 at 09:44:17AM -0400, Bruce Momjian wrote:
> On Sun, Aug 31, 2014 at 10:45:29PM +0200, Martijn van Oosterhout wrote:
> > There is one situation where you need to be more flexible, and that is
> > if you ever want to support online repartitioning. To do that you have
> > to distinguish between "I want to insert tuple X, which partition
> > should it go into" and "I want to know which partitions I need to look
> > for partition_key=Y".
>
> I am unclear why having information per-partition rather than on the
> parent table helps with online repartitioning.

An example:

We have three partitions, one for X<0 (A), one for 0<=X<5 (B) and one
for X>=5 (C).  These are in three different tables.

Now we give the command to merge the last two partitions B&C. You now
have the choice to lock the table while you move all the tuples from C
to B.

Or you can make some adjustments such that new tuples that would have gone
to C now go to B. And if there is a query for X=10 that you look in
*both* B & C. Then the existing tuples can be moved from C to B at any
time without blocking any other operations.

Is this clearer? If you decide up front that the partition to query
will be determined by a function that can only return one table, then
the above becomes impossible.

> Robert's idea of using normal table inheritance means we can access/move
> the data independently of the partitioning system.  My guess is that we
> will need to do repartitioning with some tool, rather than as part of
> normal database operation.

Doing it as some tool seems like a hack to me. And since the idea was (I
thought) that partitions would not be directly accessible from SQL, it
has to be in the database itself.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: On partitioning

From
Robert Haas
Date:
On Tue, Sep 2, 2014 at 4:18 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
> On Tue, Sep 02, 2014 at 09:44:17AM -0400, Bruce Momjian wrote:
>> On Sun, Aug 31, 2014 at 10:45:29PM +0200, Martijn van Oosterhout wrote:
>> > There is one situation where you need to be more flexible, and that is
>> > if you ever want to support online repartitioning. To do that you have
>> > to distinguish between "I want to insert tuple X, which partition
>> > should it go into" and "I want to know which partitions I need to look
>> > for partition_key=Y".
>>
>> I am unclear why having information per-partition rather than on the
>> parent table helps with online repartitioning.
>
> An example:
>
> We have three partitions, one for X<0 (A), one for 0<=X<5 (B) and one
> for X>=5 (C).  These are in three different tables.
>
> Now we give the command to merge the last two partitions B&C. You now
> have the choice to lock the table while you move all the tuples from C
> to B.
>
> Or you can make some adjustments such that new tuples that would have gone
> to C now go to B. And if there is a query for X=10 that you look in
> *both* B & C. Then the existing tuples can be moved from C to B at any
> time without blocking any other operations.
>
> Is this clearer? If you decide up front that the partition to query
> will be determined by a function that can only return one table, then
> the above becomes impossible.
>
>> Robert's idea of using normal table inheritance means we can access/move
>> the data independently of the partitioning system.  My guess is that we
>> will need to do repartitioning with some tool, rather than as part of
>> normal database operation.
>
> Doing it as some tool seems like a hack to me. And since the idea was (I
> thought) that partitions would not be directly accessible from SQL, it
> has to be in the database itself.

I agree.  My main point about reusing the inheritance stuff we've done
over the years is that we shouldn't reinvent the wheel, but rather
build on what we've already got.

If the proposed design somehow involved treating all of the partitions
as belonging to the same TID space (which doesn't really seem
possible, but let's suspend disbelief) so that you could have a single
index that covers all the partitions, and the system would somehow
work out which TIDs live in which physical files, then it would be
reasonable to view the storage layer as an accident that higher levels
of the system don't need to know anything about.

But the actual proposal involves having multiple relations that have
to get planned just like real tables, and that means all the
optimizations that we've done on gathering statistics for inheritance
hierarchies, and MergeAppend, and every other bit of planner smarts
that we have will be applicable to this new method, too.  Let's not do
anything that forces us to reinvent all of those things.

Now, to be fair, one could certainly argue (and I would agree) that
the existing optimizations are insufficient.  In particular, the fact
that SELECT * FROM partitioned_table WHERE not_the_partitioning_key =
1 has to be planned separately for every partition is horrible, and
the fact that SELECT * FROM partitioned_table WHERE partitioning_key =
1 has to use an algorithm that is both O(n) in the partition count and
has a relatively high constant factor to exclude all of the
non-matching partitions also sucks.  But I think we're better off
trying to view those as further optimizations that we can apply to
certain special cases of partitioning - e.g. when the partitioning
syntax is used, constrain all the tables to have identical tuple
descriptors and matching indexes (and maybe constraints) so that when
you plan, you can do it once and then use the transposed plan for all
partitions.  Figuring out how to do run-time partition pruning would
be awesome, too.

But I don't see that any of this stuff gets easier by ignoring what's
already been built; then you're likely to spend all your time
reinventing the crap we've already done, and any cases where the new
system misses an optimization that's been achieved in the current
system become unpleasant dilemmas for our users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
"Amit Langote"
Date:
Hi,

I tend to agree with Robert that partitioning should continue using an
inheritance-based implementation. In addition to his point about reinventing
things, it could be pointed out that there are discussions/proposals elsewhere
about building foreign table inheritance capability; having partitioning use
the same general infrastructure would pave the way for including sharding
features more easily in the future (perhaps sooner).

Maybe I am missing something, but isn't it the case that making partitions a
physical implementation detail would make it difficult to support individual
partitions being on different servers (sharding, basically)? Moreover, recent FDW
development seems to be headed in the direction of substantial core support for
foreign objects/tables; it seems worthwhile for the partitioning design to assume
a course so that future sharding feature developers can leverage both. Perhaps
I am just speculating here, but I thought of adding this one point to the
discussion.

Having said that, it can also be seen that the subset of inheritance
infrastructure that constitutes partitioning support machinery would have to
be changed considerably if we are now onto partitioning 2.0 here.

--
Amit





Re: On partitioning

From
"Amit Langote"
Date:

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Amit Langote
> Sent: Friday, September 19, 2014 2:13 PM
> To: robertmhaas@gmail.com
> Cc: pgsql-hackers@postgresql.org; bruce@momjian.us; tgl@sss.pgh.pa.us;
> alvherre@2ndquadrant.com
> Subject: Re: [HACKERS] On partitioning
> 
> Hi,
> 

Apologies for having broken the original thread. :(

This was supposed to be in reply to -
http://www.postgresql.org/message-id/CA+Tgmob5DEtO4SbD15q0OQJjyc05cTk8043Utwu_
=XDtvyGNSw@mail.gmail.com

--
Amit





Re: On partitioning

From
Bruce Momjian
Date:
On Fri, Aug 29, 2014 at 11:56:07AM -0400, Alvaro Herrera wrote:
> Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
> reference Tom's post
> http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us
> which mentions the possibility of a different partitioning
> implementation than what we have so far.  As it turns out, I've been
> thinking about partitioning recently, so I thought I would share what
> I'm thinking so that others can poke holes.  My intention is to try to
> implement this as soon as possible.

I realize there hasn't been much progress on this thread, but I wanted
to chime in to say I think our current partitioning implementation is
too heavy administratively, error-prone, and performance-heavy.  

I support a redesign of this feature.  I think the current mixture of
inheritance, triggers/rules, and check constraints can be properly
characterized as a Frankenstein solution, where we paste together parts
until we get something that works --- our partitioning badly needs a
redesign.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: On partitioning

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> On Fri, Aug 29, 2014 at 11:56:07AM -0400, Alvaro Herrera wrote:
> > Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
> > reference Tom's post
> > http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us
> > which mentions the possibility of a different partitioning
> > implementation than what we have so far.  As it turns out, I've been
> > thinking about partitioning recently, so I thought I would share what
> > I'm thinking so that others can poke holes.  My intention is to try to
> > implement this as soon as possible.
> 
> I realize there hasn't been much progress on this thread, but I wanted
> to chime in to say I think our current partitioning implementation is
> too heavy administratively, error-prone, and performance-heavy.  

On the contrary, I think there was lots of progress; there's lots of
useful feedback from the initial design proposal I posted.  I am a bit
sad to admit that I'm not working on it at the moment as I had
originally planned, though, because other priorities slipped in and I am
not able to work on this for a while.  Therefore if someone else wants
to work on this topic, be my guest -- otherwise I hope to get on it in a
few months.

> I support a redesign of this feature.  I think the current mixture of
> inheritance, triggers/rules, and check constraints can be properly
> characterized as a Frankenstein solution, where we paste together parts
> until we get something that works --- our partitioning badly needs a
> redesign.

Agreed, and I don't think just hiding the stitches is good enough.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Bruce Momjian
Date:
On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Fri, Aug 29, 2014 at 11:56:07AM -0400, Alvaro Herrera wrote:
> > > Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
> > > reference Tom's post
> > > http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us
> > > which mentions the possibility of a different partitioning
> > > implementation than what we have so far.  As it turns out, I've been
> > > thinking about partitioning recently, so I thought I would share what
> > > I'm thinking so that others can poke holes.  My intention is to try to
> > > implement this as soon as possible.
> > 
> > I realize there hasn't been much progress on this thread, but I wanted
> > to chime in to say I think our current partitioning implementation is
> > too heavy administratively, error-prone, and performance-heavy.  
> 
> On the contrary, I think there was lots of progress; there's lots of
> useful feedback from the initial design proposal I posted.  I am a bit
> sad to admit that I'm not working on it at the moment as I had
> originally planned, though, because other priorities slipped in and I am
> not able to work on this for a while.  Therefore if someone else wants
> to work on this topic, be my guest -- otherwise I hope to get on it in a
> few months.

Oh, I just meant code progress --- I agree the discussion was fruitful.

> > I support a redesign of this feature.  I think the current mixture of
> > inheritance, triggers/rules, and check constraints can be properly
> > characterized as a Frankenstein solution, where we paste together parts
> > until we get something that works --- our partitioning badly needs a
> > redesign.
> 
> Agreed, and I don't think just hiding the stitches is good enough.

LOL, yeah.  I do training on partitioning occasionally and the potential
for mistakes is huge.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: On partitioning

From
"Amit Langote"
Date:
Hi,

> On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> > > I realize there hasn't been much progress on this thread, but I wanted
> > > to chime in to say I think our current partitioning implementation is
> > > too heavy administratively, error-prone, and performance-heavy.
> >
> > On the contrary, I think there was lots of progress; there's lots of
> > useful feedback from the initial design proposal I posted.  I am a bit
> > sad to admit that I'm not working on it at the moment as I had
> > originally planned, though, because other priorities slipped in and I am
> > not able to work on this for a while.  Therefore if someone else wants
> > to work on this topic, be my guest -- otherwise I hope to get on it in a
> > few months.
> 
> Oh, I just meant code progress --- I agree the discussion was fruitful.
> 

FWIW, I think Robert's criticism regarding not basing this on the inheritance
scheme was not responded to. He mentions a patch by Itagaki-san (four years
ago, abandoned unfortunately); details here:

https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress

This patch could be resurrected fixing some parts of it as was suggested at
the time. But, the most important decisions regarding the patch like storage
structure, syntax etc. would require building some consensus whether this is a
worthwhile direction. At least some consideration must be given to the idea
that we might want to have remote partitions backed by FDW infrastructure in
near future, although that may not be the primary goal of partitioning effort.
What do others think?

--
Amit





Re: On partitioning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> > On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote:
> > > Bruce Momjian wrote:
> > > > I realize there hasn't been much progress on this thread, but I wanted
> > > > to chime in to say I think our current partitioning implementation is
> > > > too heavy administratively, error-prone, and performance-heavy.
> > >
> > > On the contrary, I think there was lots of progress; there's lots of
> > > useful feedback from the initial design proposal I posted.  I am a bit
> > > sad to admit that I'm not working on it at the moment as I had
> > > originally planned, though, because other priorities slipped in and I am
> > > not able to work on this for a while.  Therefore if someone else wants
> > > to work on this topic, be my guest -- otherwise I hope to get on it in a
> > > few months.
> > 
> > Oh, I just meant code progress --- I agree the discussion was fruitful.
> 
> FWIW, I think Robert's criticism regarding not basing this on the inheritance
> scheme was not responded to.

It was responded to by ignoring it.  I didn't see anybody else
supporting the idea that inheritance is in any way a sane thing to base
partitioning on.  Sure, we have accumulated lots of kludges over the
years to cope with the fact that, really, it doesn't work very well.  So
what.  We can keep them, I don't care.

Anyway as I said above, I'm not particularly interested in any more
discussion on this topic for the time being, since I don't have time to
work on this patch.  If anybody wants to continue discussing to improve
the design some more, and even implement it or parts of it, that's fine
with me -- but please expect me not to answer.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Andres Freund
Date:
On 2014-10-27 06:29:33 -0300, Alvaro Herrera wrote:
> Amit Langote wrote:
> 
> > > On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote:
> > > > Bruce Momjian wrote:
> > > > > I realize there hasn't been much progress on this thread, but I wanted
> > > > > to chime in to say I think our current partitioning implementation is
> > > > > too heavy administratively, error-prone, and performance-heavy.
> > > >
> > > > On the contrary, I think there was lots of progress; there's lots of
> > > > useful feedback from the initial design proposal I posted.  I am a bit
> > > > sad to admit that I'm not working on it at the moment as I had
> > > > originally planned, though, because other priorities slipped in and I am
> > > > not able to work on this for a while.  Therefore if someone else wants
> > > > to work on this topic, be my guest -- otherwise I hope to get on it in a
> > > > few months.
> > > 
> > > Oh, I just meant code progress --- I agree the discussion was fruitful.
> > 
> > FWIW, I think Robert's criticism regarding not basing this on the inheritance
> > scheme was not responded to.
> 
> It was responded to by ignoring it.  I didn't see anybody else
> supporting the idea that inheritance is in any way a sane thing to base
> partitioning on.  Sure, we have accumulated lots of kludges over the
> years to cope with the fact that, really, it doesn't work very well.  So
> what.  We can keep them, I don't care.

As far as I understood Robert's criticism, it was more about the
internals than about the userland representation. To me it's absolutely
clear that 'real partitioning' userland shouldn't be based on the
current hacks to allow it. But I do think that a first step very well
might reuse the planner/executor smarts about it. Even a good chunk of
the tablecmd.c logic might be reusable for individual partitions without
much change.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
"Amit Langote"
Date:
Hi,

> From: Andres Freund [mailto:andres@2ndquadrant.com]
> On 2014-10-27 06:29:33 -0300, Alvaro Herrera wrote:
> > Amit Langote wrote:
> > > FWIW, I think Robert's criticism regarding not basing this on the inheritance
> > > scheme was not responded to.
> >
> > It was responded to by ignoring it.  I didn't see anybody else
> > supporting the idea that inheritance is in any way a sane thing to base
> > partitioning on.  Sure, we have accumulated lots of kludges over the
> > years to cope with the fact that, really, it doesn't work very well.  So
> > what.  We can keep them, I don't care.
> 
> As far as I understood Robert's criticism, it was more about the
> internals than about the userland representation. To me it's absolutely
> clear that 'real partitioning' userland shouldn't be based on the
> current hacks to allow it. 

For my understanding: 

By partitioning 'userland' representation, do you mean an implementation
choice where a partition is literally an inheritance child of the partitioned
table as registered in pg_inherits? Or something else?

Thanks,
Amit





Re: On partitioning

From
'Andres Freund'
Date:
On 2014-10-28 14:34:22 +0900, Amit Langote wrote:
> 
> Hi,
> 
> > From: Andres Freund [mailto:andres@2ndquadrant.com]
> > On 2014-10-27 06:29:33 -0300, Alvaro Herrera wrote:
> > > Amit Langote wrote:
> > > > FWIW, I think Robert's criticism regarding not basing this on the inheritance
> > > > scheme was not responded to.
> > >
> > > It was responded to by ignoring it.  I didn't see anybody else
> > > supporting the idea that inheritance is in any way a sane thing to base
> > > partitioning on.  Sure, we have accumulated lots of kludges over the
> > > years to cope with the fact that, really, it doesn't work very well.  So
> > > what.  We can keep them, I don't care.
> > 
> > As far as I understood Robert's criticism, it was more about the
> > internals than about the userland representation. To me it's absolutely
> > clear that 'real partitioning' userland shouldn't be based on the
> > current hacks to allow it. 
> 
> For my understanding: 
> 
> By partitioning 'userland' representation, do you mean an implementation
> choice where a partition is literally an inheritance child of the partitioned
> table as registered in pg_inherits? Or something else?

Yes, I mean explicit usage of INHERITS.

In my opinion we can reuse (some of) the existing logic for INHERITS to
implement "proper" partitioning, but that should be an implementation
detail.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Robert Haas
Date:
On Tue, Oct 28, 2014 at 6:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> In my opinion we can reuse (some of) the existing logic for INHERITS to
> implement "proper" partitioning, but that should be an implementation
> detail.

Sure, that would be a sensible way to do it.  I mostly care about not
throwing out all the work that's been done on the planner and
executor.  Maybe you're thinking we'll eventually replace that with
something better, which is fine, but I wouldn't underestimate the
effort to make that happen.  For example, I think it'd be sensible for
the first patch to just add some new user-visible syntax with some
additional catalog representation that doesn't actually do all that
much yet.  Then subsequent patches could use that additional metadata
to optimize partition pruning, implement tuple routing, etc.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Andres Freund
Date:
On 2014-10-28 08:19:36 -0400, Robert Haas wrote:
> On Tue, Oct 28, 2014 at 6:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > In my opinion we can reuse (some of) the existing logic for INHERITS to
> > implement "proper" partitioning, but that should be an implementation
> > detail.
> 
> Sure, that would be a sensible way to do it.  I mostly care about not
> throwing out all the work that's been done on the planner and
> executor.

In that case I'm not sure if there's actual disagreement here.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
"Amit Langote"
Date:
Hi,

> owner@postgresql.org] On Behalf Of Robert Haas
> Sent: Tuesday, October 28, 2014 9:20 PM
>
> On Tue, Oct 28, 2014 at 6:06 AM, Andres Freund <andres@2ndquadrant.com>
> wrote:
> > In my opinion we can reuse (some of) the existing logic for INHERITS to
> > implement "proper" partitioning, but that should be an implementation
> > detail.
>
> Sure, that would be a sensible way to do it.  I mostly care about not
> throwing out all the work that's been done on the planner and
> executor.  Maybe you're thinking we'll eventually replace that with
> something better, which is fine, but I wouldn't underestimate the
> effort to make that happen.  For example, I think it'd be sensible for
> the first patch to just add some new user-visible syntax with some
> additional catalog representation that doesn't actually do all that
> much yet.  Then subsequent patches could use that additional metadata
> to optimize partition pruning, implement tuple routing, etc.
>

I mentioned upthread the possibility of resurrecting Itagaki-san's patch [1] to try to make things work in this
direction. I would be willing to spend time on this.  I see useful reviews of the patch by Robert [2] and Simon [3] at
the time, but it wasn't pursued further. I think those reviews were valuable design input that IMHO would still be
relevant. It seems the reviews suggested some significant changes to the design proposed. Of course, there are many
other considerations discussed upthread that need to be addressed. Incorporating those changes and others, I think such
an approach could be worthwhile.

Thoughts?

[1] https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress
[2] http://www.postgresql.org/message-id/AANLkTikP-1_8B04eyIK0sDf8uA5KMo64o8sorFBZE_CT@mail.gmail.com
[3] http://www.postgresql.org/message-id/1279196337.1735.9598.camel@ebony

Thanks,
Amit





Re: On partitioning

From
Robert Haas
Date:
On Thu, Nov 6, 2014 at 9:17 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I mentioned upthread the possibility of resurrecting Itagaki-san's patch [1] to try to make things work in this
> direction. I would be willing to spend time on this.  I see useful reviews of the patch by Robert [2] and Simon [3]
> at the time, but it wasn't pursued further. I think those reviews were valuable design input that IMHO would still be
> relevant. It seems the reviews suggested some significant changes to the design proposed. Of course, there are many
> other considerations discussed upthread that need to be addressed. Incorporating those changes and others, I think
> such an approach could be worthwhile.

I'd be in favor of that.  I am not sure whether the code is close
enough to what we need to be really useful, but that's for you to
decide.  In my view, the main problem we should be trying to solve
here is "avoid relying on constraint exclusion".  In other words, the
syntax for adding a partition should put some metadata into the system
catalogs that lets us do partitioning pruning very very quickly,
without theorem-proving.  For example, for list or range partitioning,
a list of partition bounds would be just right: you could
binary-search it.  The same metadata should also be suitable for
routing inserts to the proper partition, and handling partition motion
when a tuple is updated.
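The binary-searchable metadata Robert describes can be sketched as follows. This is a toy model, not PostgreSQL code; the bound values and child OIDs are invented for illustration.

```python
import bisect

# Hypothetical range-partitioning metadata: a sorted list of partition
# bounds plus one child table OID per range.  Partition i holds keys x
# with bounds[i-1] <= x < bounds[i] (the first and last are open-ended).
bounds = [0, 5]                      # two bounds ...
child_oids = [16385, 16386, 16387]   # ... delimit three children

def route_tuple(x):
    # bisect_right returns the index of the first bound strictly greater
    # than x, which is exactly the child whose range contains x.
    # O(log n) in the partition count, with no theorem-proving involved.
    return child_oids[bisect.bisect_right(bounds, x)]
```

The same lookup serves both partition pruning (which child do I scan for `key = x`?) and tuple routing (which child receives this insert?), which is the point of storing explicit bounds instead of relying on constraint exclusion.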

Now there's other stuff we might very well want to do, but I think
making partition pruning and tuple routing fast would be a pretty big
win by itself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
"Amit Langote"
Date:
Hi,

> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Saturday, November 08, 2014 5:41 AM
>
> I'd be in favor of that.

Thanks!

>  I am not sure whether the code is close
> enough to what we need to be really useful, but that's for you to
> decide.

Hmm, I'm not entirely convinced about the patch as it stands either, but I will
try to restate below what the patch in its current state does anyway (just to
refresh):

The patch provides syntax to:
* Specify a partitioning key and optional partition definitions within CREATE TABLE,
* A few ALTER TABLE commands that let you define a partitioning key
  (partitioning a table after the fact), and attach/detach an existing table as a
  partition of a partitioned table,
* CREATE PARTITION to create a new partition on a partitioned table.

The above commands are merely transformed into ALTER TABLE subcommands that arrange
the partitioned table and partitions into an inheritance hierarchy, but with extra
information, that is, allowed values for the partition in a new anyarray column
called 'pg_inherits.values'. A special case of ATExecAddInherit(), namely
ATExecAttachPartitionI(), as part of its processing also adds partition
constraints in the form of appropriate CHECK constraints. So, a few of the
manual steps are automated and additional (IMHO non-opaque) metadata (namely
partition boundaries/list values) is added.

Additionally, defining a partitioning key (PARTITION BY) creates a pg_partition 
entry that specifies for a partitioned table the following - partition kind 
(range/list),  an opclass for the key value comparison  and a key 'expression' 
(say, "colname % 10").

A few key things I can think of as needing improvement would be (perhaps just
reiterating a review of the patch):
* partition pruning would still depend on constraint exclusion using the CHECK
  constraints (same old)
* there is no tuple-routing at all (same can be said of partition pruning above)
* partition pruning or tuple-routing would require a scan over pg_inherits
  (perhaps inefficient)
* the partitioning key is an expression, which might not be a good idea in early
  stages of the implementation (might be better off with just the attnum of the
  column to partition on?)
* there is no DROP PARTITION (in fact, it is suggested not to go the CREATE/DROP
  PARTITION route at all) -> ALTER TABLE ... ADD/DROP PARTITION?

Some other important ones:
* dependency handling related oversights
* constraint propagation related oversights

And then there are some oddities of behaviour that I am seeing while trying out
things that the patch does. Please feel free to point out those that I am not
seeing. I am sure these improvements need more than just tablecmds.c hacking,
which is what the current patch mostly does.

The first two points could use separate follow-on patches as I feel they need 
extensive changes unless I am missing something. I will try to post possible 
solutions to these issues provided metadata in current form is OK to proceed.

> In my view, the main problem we should be trying to solve
> here is "avoid relying on constraint exclusion".  In other words, the
> syntax for adding a partition should put some metadata into the system
> catalogs that lets us do partitioning pruning very very quickly,
> without theorem-proving.  For example, for list or range partitioning,
> a list of partition bounds would be just right: you could
> binary-search it.  The same metadata should also be suitable for
> routing inserts to the proper partition, and handling partition motion
> when a tuple is updated.
>
> Now there's other stuff we might very well want to do, but I think
> making partition pruning and tuple routing fast would be a pretty big
> win by itself.
>

Those are definitely the goals worth striving for.

Thanks for your time.

Regards,
Amit






Re: On partitioning

From
Robert Haas
Date:
On Mon, Nov 10, 2014 at 8:53 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Above commands are merely transformed into ALTER TABLE subcommands that arrange
> partitioned table and partitions into inheritance hierarchy, but with extra
> information, that is, allowed values for the partition in a new anyarray column
> called 'pg_inherits.values'. A special case of ATExecAddInherit() namely
> ATExecAttachPartitionI(), as part of its processing, also adds partition
> constraints in the form of appropriate CHECK constraints. So, a few of the
> manual steps are automated and additional (IMHO non-opaque) metadata (namely
> partition boundaries/list values) is added.

I thought putting the partition boundaries into pg_inherits was a
strange choice.  I'd put it in pg_class, or in pg_partition if we
decide to create that.  Maybe as anyarray, but I think pg_node_tree
might even be better.  That can also represent data of some arbitrary
type, but it doesn't enforce that everything is uniform.  So you could
have a list of objects of the form {RANGEPARTITION :lessthan {CONST
...} :partition 16982} or similar.  The relcache could load that up
and convert the list to a C array, which would then be easy to
binary-search.

As you say, you also need to store the relevant operator somewhere,
and the fact that it's a range partition rather than list or hash,
say.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I thought putting the partition boundaries into pg_inherits was a
> strange choice.  I'd put it in pg_class, or in pg_partition if we
> decide to create that.

Yeah.  I rather doubt that we want this mechanism to be very closely
tied to the existing inheritance features.  If we do that, we are
going to need a boatload of error checks to prevent people from breaking
partitioned tables by applying the sort of twiddling that inheritance
allows.

> Maybe as anyarray, but I think pg_node_tree
> might even be better.  That can also represent data of some arbitrary
> type, but it doesn't enforce that everything is uniform.

Of course, the more general you make it, the more likely that it'll be
impossible to optimize well.
        regards, tom lane



Re: On partitioning

From
Robert Haas
Date:
On Wed, Nov 12, 2014 at 5:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I thought putting the partition boundaries into pg_inherits was a
>> strange choice.  I'd put it in pg_class, or in pg_partition if we
>> decide to create that.
>
> Yeah.  I rather doubt that we want this mechanism to be very closely
> tied to the existing inheritance features.  If we do that, we are
> going to need a boatload of error checks to prevent people from breaking
> partitioned tables by applying the sort of twiddling that inheritance
> allows.

Well, as I said upthread, I think it would be a pretty poor idea to
imagine that the first version of this feature is going to obsolete
everything we've done with inheritance.  Are we going to reinvent the
machinery to make inheritance children get scanned when the parent
does?  Reinvent Merge Append?

>> Maybe as anyarray, but I think pg_node_tree
>> might even be better.  That can also represent data of some arbitrary
>> type, but it doesn't enforce that everything is uniform.
>
> Of course, the more general you make it, the more likely that it'll be
> impossible to optimize well.

The point for me is just that range and list partitioning probably
need different structure, and hash partitioning, if we want to support
that, needs something else again.  Range partitioning needs an array
of partition boundaries and an array of child OIDs.  List partitioning
needs an array of specific values and a child table OID for each.
Hash partitioning needs something probably quite different.  We might
be able to do it as a pair of arrays - one of type anyarray and one of
type OID - and meet all needs that way.
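
The pair-of-arrays idea can be sketched outside the catalogs. This is a
hypothetical illustration only (the bounds, OIDs, and function name are
invented, not anything in PostgreSQL): an array of ascending upper bounds
paired with an array of child OIDs, routed with a binary search.

```python
import bisect

# Hypothetical pair-of-arrays representation of a range-partitioned table:
# ascending "VALUES LESS THAN" bounds, plus a parallel array of child OIDs
# (one extra entry for keys at or above the last bound).
bounds = [100, 200, 300]
child_oids = [16385, 16386, 16387, 16388]

def route(key):
    # bisect_right finds the first bound strictly greater than key, which
    # is exactly the index of the child partition that should hold it.
    return child_oids[bisect.bisect_right(bounds, key)]
```

A relcache entry holding such arrays could route a tuple or prune a scan
with a single binary search, which is the kind of optimization a fully
general representation would make harder.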

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Jim Nasby
Date:
On 11/12/14, 5:27 PM, Robert Haas wrote:
>>> Maybe as anyarray, but I think pg_node_tree
>>> might even be better.  That can also represent data of some arbitrary
>>> type, but it doesn't enforce that everything is uniform.
>>
>> Of course, the more general you make it, the more likely that it'll be
>> impossible to optimize well.
> The point for me is just that range and list partitioning probably
> need different structure, and hash partitioning, if we want to support
> that, needs something else again.  Range partitioning needs an array
> of partition boundaries and an array of child OIDs.  List partitioning
> needs an array of specific values and a child table OID for each.
> Hash partitioning needs something probably quite different.  We might
> be able to do it as a pair of arrays - one of type anyarray and one of
> type OID - and meet all needs that way.

Another issue is I don't know that we could support multi-key partitions with
something like an anyarray. Perhaps that's OK as a first pass, but I expect
it'll be one of the next things folks ask for.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Nov 12, 2014 at 5:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> Maybe as anyarray, but I think pg_node_tree
> >> might even be better.  That can also represent data of some arbitrary
> >> type, but it doesn't enforce that everything is uniform.
> >
> > Of course, the more general you make it, the more likely that it'll be
> > impossible to optimize well.

Agreed- a node tree seems a bit too far to make this really work well..
But, I'm curious what you were thinking specifically?  A node tree which
accepts an "argument" of the constant used in the original query and
then spits back a table might work reasonably well for that case- but
with declarative partitioning, I expect us to eventually be able to
eliminate complete partitions from consideration on both sides of a
partition-table join and optimize cases where we have two partitioned
tables being joined with a compatible join key and only actually do
joins between the partitions which overlap each other.  I don't see
those happening if we're allowing a node tree (only).  If having a node
tree is just one option among other partitioning options, then we can
provide users with the ability to choose what suits their particular
needs.
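
The partition-elimination idea for joins can be sketched abstractly. This
is a hypothetical illustration (the bound lists and function name are
invented): with both tables range-partitioned on a compatible join key,
only the partition pairs whose ranges overlap ever need to be joined.

```python
# Hypothetical sketch: each table is a list of half-open [lo, hi) ranges,
# one per partition, over the shared join key.
def overlapping_pairs(parts_a, parts_b):
    pairs = []
    for i, (a_lo, a_hi) in enumerate(parts_a):
        for j, (b_lo, b_hi) in enumerate(parts_b):
            # Two half-open ranges overlap iff each starts before the
            # other ends; only these pairs need a partition-level join.
            if a_lo < b_hi and b_lo < a_hi:
                pairs.append((i, j))
    return pairs
```

With partitions [0, 100) and [100, 200) on one side and [50, 150) on the
other, only the two overlapping pairs would be joined; every other
combination is eliminated outright.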

> The point for me is just that range and list partitioning probably
> need different structure, and hash partitioning, if we want to support
> that, needs something else again.  Range partitioning needs an array
> of partition boundaries and an array of child OIDs.  List partitioning
> needs an array of specific values and a child table OID for each.
> Hash partitioning needs something probably quite different.  We might
> be able to do it as a pair of arrays - one of type anyarray and one of
> type OID - and meet all needs that way.

I agree that these will require different structures in the catalog..
While reviewing the superuser checks, I expected to have a similar need
and discussed various options- having multiple catalog tables, having a
single table with multiple columns, having a single table with a 'type'
column and then a bytea blob.  In the end, it wasn't really necessary as
the only thing which I expected to need more than 'yes/no' were the
directory permissions (which it looks like might end up killed anyway,
much to my sadness..), but while considering the options, I continued to
feel like anything but independent tables was hacking around to try and
reduce the number of inodes used for folks who don't actually use these
features, and that's a terrible reason to complicate the catalog and
code, in my view.

It occurs to me that we might be able to come up with a better way to
address the inode concern and therefore ignore it.  There are other
considerations to having more catalog tables, but declarative
partitioning is an important enough feature, in my view, that I wouldn't
care if it required 10 catalog tables to implement.  Misrepresenting it
with a catalog that's got a bunch of columns, all but one of which are
NULL, or by essentially removing the knowledge of the data type from the
system with a type column and some binary blob, isn't doing ourselves or
our users any favors.  That's not to say that I'm
against a solution which only needs one catalog table, but let's not
completely throw away proper structure because of inode or other
resource consideration issues.  We have quite a few other catalog tables
which are rarely used and it'd be good to address the issue with those
consuming resources independently.

I'm not a fan of using pg_class- there are a number of columns in there
which I would *not* wish to be allowed to be different per partition
(starting with relowner and relacl...).  Making those NULL would be just
as bad (probably worse, really, since we'd also need to add new columns
to pg_class to indicate the partitioning...) as having a sparsely
populated new catalog table.
Thanks!
    Stephen

Re: On partitioning

From
"Amit Langote"
Date:
> From: Stephen Frost [mailto:sfrost@snowman.net]
> Sent: Thursday, November 13, 2014 3:40 PM
> 
> > The point for me is just that range and list partitioning probably
> > need different structure, and hash partitioning, if we want to support
> > that, needs something else again.  Range partitioning needs an array
> > of partition boundaries and an array of child OIDs.  List partitioning
> > needs an array of specific values and a child table OID for each.
> > Hash partitioning needs something probably quite different.  We might
> > be able to do it as a pair of arrays - one of type anyarray and one of
> > type OID - and meet all needs that way.
> 
> I agree that these will require different structures in the catalog..
> While reviewing the superuser checks, I expected to have a similar need
> and discussed various options- having multiple catalog tables, having a
> single table with multiple columns, having a single table with a 'type'
> column and then a bytea blob.  In the end, it wasn't really necessary as
> the only thing which I expected to need more than 'yes/no' were the
> directory permissions (which it looks like might end up killed anyway,
> much to my sadness..), but while considering the options, I continued to
> feel like anything but independent tables was hacking around to try and
> reduce the number of inodes used for folks who don't actually use these
> features, and that's a terrible reason to complicate the catalog and
> code, in my view.
> 

Greenplum uses a single table for this purpose with separate columns for range
and list cases, for example. They store allowed values per partition though.
They have 6 partitioning-related catalog/system views, by the way. Perhaps
it's interesting as a reference.

http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partitions.html

Thanks,
Amit





Re: On partitioning

From
"Amit Langote"
Date:
> owner@postgresql.org] On Behalf Of Amit Langote
> Sent: Thursday, November 13, 2014 3:50 PM
> 
> Greenplum uses a single table for this purpose with separate columns for range
> and list cases, for example. They store allowed values per partition though.
> They have 6 partitioning related catalog/system views., by the way. Perhaps,
> interesting as a reference.
> 
> http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partitions.html
> 

Oops, wrong link. Use this one instead.
http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partition_rule.html

> Thanks,
> Amit





Re: On partitioning

From
Robert Haas
Date:
On Thu, Nov 13, 2014 at 1:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Agreed- a node tree seems a bit too far to make this really work well..
> But, I'm curious what you were thinking specifically?

I gave a pretty specific example in my email.

> A node tree which
> accepts an "argument" of the constant used in the original query and
> then spits back a table might work reasonably well for that case-

A node tree is not a function.  It's a data structure.  So it doesn't
have arguments.

> but
> with declarative partitioning, I expect us to eventually be able to
> eliminate complete partitions from consideration on both sides of a
> partition-table join and optimize cases where we have two partitioned
> tables being joined with a compatible join key and only actually do
> joins between the partitions which overlap each other.  I don't see
> those happening if we're allowing a node tree (only).  If having a node
> tree is just one option among other partitioning options, then we can
> provide users with the ability to choose what suits their particular
> needs.

This seems completely muddled to me.  What we're talking about is how
to represent the partition definition in the system catalogs.  I'm not
proposing that the user would "partition by pg_node_tree"; what the
heck would that even mean?  I'm proposing one way of serializing the
partition definitions that the user specifies into something that can
be stored into a system catalog, which happens to reuse the existing
infrastructure that we use for that same purpose in various other
places.  I don't have a problem with somebody coming up with another
way of representing the data in the catalogs; I'm just brainstorming.
But saying that we'll be able to optimize joins better if we store the
same data as anyarray rather than pg_node_tree or vice versa doesn't
make any sense at all.

> I'm not a fan of using pg_class- there are a number of columns in there
> which I would *not* wish to be allowed to be different per partition
> (starting with relowner and relacl...).  Making those NULL would be just
> as bad (probably worse, really, since we'd also need to add new columns
> to pg_class to indicate the partitioning...) as having a sparsely
> populated new catalog table.

I think you are, again, confused as to what we're discussing.  Nobody,
including Alvaro, has proposed a design where the individual
partitions don't have pg_class entries of some kind.  What we're
talking about is where to store the metadata for partition exclusion
and tuple routing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Stephen Frost
Date:
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Nov 13, 2014 at 1:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > but
> > with declarative partitioning, I expect us to eventually be able to
> > eliminate complete partitions from consideration on both sides of a
> > partition-table join and optimize cases where we have two partitioned
> > tables being joined with a compatible join key and only actually do
> > joins between the partitions which overlap each other.  I don't see
> > those happening if we're allowing a node tree (only).  If having a node
> > tree is just one option among other partitioning options, then we can
> > provide users with the ability to choose what suits their particular
> > needs.
>
> This seems completely muddled to me.  What we're talking about is how
> to represent the partition definition in the system catalogs.  I'm not
> proposing that the user would "partition by pg_node_tree"; what the
> heck would that even mean?

They'd provide an expression which would be able to identify the
partition to be used.  In a way, this is exactly how many folks do
partitioning today with inheritance- consider the if/else trees in
triggers for handling new data coming into the parent table.  That's
also why it wouldn't be easy to optimize for.

> I'm proposing one way of serializing the
> partition definitions that the user specifies into something that can
> be stored into a system catalog, which happens to reuse the existing
> infrastructure that we use for that same purpose in various other
> places.

Ok, I didn't immediately see how a node tree would be used for this- but
I admit that I've not gone back through the entirety of this iteration
of the partitioning discussion.

> I don't have a problem with somebody coming up with another
> way of representing the data in the catalogs; I'm just brainstorming.

Ditto.

> But saying that we'll be able to optimize joins better if we store the
> same data as anyarray rather than pg_node_tree or visca versa doesn't
> make any sense at all.

Ok, if the node tree is constrained in what can be stored in it then I
understand how we could still optimize based on what we've stored in
it.  I'm not entirely sure a node tree makes sense but at least I
understand better.

> > I'm not a fan of using pg_class- there are a number of columns in there
> > which I would *not* wish to be allowed to be different per partition
> > (starting with relowner and relacl...).  Making those NULL would be just
> > as bad (probably worse, really, since we'd also need to add new columns
> > to pg_class to indicate the partitioning...) as having a sparsely
> > populated new catalog table.
>
> I think you are, again, confused as to what we're discussing.  Nobody,
> including Alvaro, has proposed a design where the individual
> partitions don't have pg_class entries of some kind.  What we're
> talking about is where to store the metadata for partition exclusion
> and tuple routing.

This discussion has gone a few rounds before and, yes, I was just
jumping into the middle of this particular round, but I'm pretty sure
I'm not the first to point out that storing the individual partition
information into pg_class isn't ideal since there are pieces that we
don't actually want to be different per partition, as I outlined
previously.  Perhaps what that means is we should actually go the other
way and move *those* columns into a new catalog instead.

Consider this (totally off-the-cuff):

pg_relation (pg_tables? pg_heaps?)
  oid
  relname
  relnamespace
  reltype
  reloftype
  relowner
  relam (?)
  relhas*
  relisshared
  relpersistence
  relkind (?)
  relnatts
  relchecks
  relacl
  reloptions
  relhowpartitioned (?)

pg_class
  pg_relation.oid
  relfilenode
  reltablespace
  relpages
  reltuples
  reltoastrelid
  reltoastidxid
  relfrozenxid
  relhasindexes (?)
  relpartitioninfo (whatever this ends up being)

The general idea being to separate the user-facing notion of a "table"
from the underlying implementation, with the implementation allowing
multiple sets of files to be used for each table.  It's certainly not
for the faint of heart, but we saw what happened with our inheritance
structure allowing different permissions on the child tables- we ended
up creating a pretty grotty hack to deal with it (going through the
parent bypasses the permissions).  That's the best solution for that
situation, but it's far from ideal and it'd be nice to avoid that same
risk with partitioning.  Additionally, if each partition is in pg_class,
how are we handling name conflicts?  Why do individual partitions even
need to have a name?  Do we allow queries against them directly?  etc..

These are just my thoughts on it and I really don't intend to derail
progress on having a partitioning system and I hope that my comments
don't lead to that happening.
Thanks,
    Stephen

Re: On partitioning

From
Robert Haas
Date:
On Thu, Nov 13, 2014 at 9:12 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> > I'm not a fan of using pg_class- there are a number of columns in there
>> > which I would *not* wish to be allowed to be different per partition
>> > (starting with relowner and relacl...).  Making those NULL would be just
>> > as bad (probably worse, really, since we'd also need to add new columns
>> > to pg_class to indicate the partitioning...) as having a sparsely
>> > populated new catalog table.
>>
>> I think you are, again, confused as to what we're discussing.  Nobody,
>> including Alvaro, has proposed a design where the individual
>> partitions don't have pg_class entries of some kind.  What we're
>> talking about is where to store the metadata for partition exclusion
>> and tuple routing.
>
> This discussion has gone a few rounds before and, yes, I was just
> jumping into the middle of this particular round, but I'm pretty sure
> I'm not the first to point out that storing the individual partition
> information into pg_class isn't ideal since there are pieces that we
> don't actually want to be different per partition, as I outlined
> previously.  Perhaps what that means is we should actually go the other
> way and move *those* columns into a new catalog instead.
>
> Consider this (totally off-the-cuff):
>
> pg_relation (pg_tables? pg_heaps?)
>   oid
>   relname
>   relnamespace
>   reltype
>   reloftype
>   relowner
>   relam (?)
>   relhas*
>   relisshared
>   relpersistence
>   relkind (?)
>   relnatts
>   relchecks
>   relacl
>   reloptions
>   relhowpartitioned (?)
>
> pg_class
>   pg_relation.oid
>   relfilenode
>   reltablespace
>   relpages
>   reltuples
>   reltoastrelid
>   reltoastidxid
>   relfrozenxid
>   relhasindexes (?)
>   relpartitioninfo (whatever this ends up being)
>
> The general idea being to seperate the user-facing notion of a "table"
> from the underlying implementation, with the implementation allowing
> multiple sets of files to be used for each table.  It's certainly not
> for the faint of heart, but we saw what happened with our inheiritance
> structure allowing different permissions on the child tables- we ended
> up creating a pretty grotty hack to deal with it (going through the
> parent bypasses the permissions).  That's the best solution for that
> situation, but it's far from ideal and it'd be nice to avoid that same
> risk with partitioning.  Additionally, if each partition is in pg_class,
> how are we handling name conflicts?  Why do individual partitions even
> need to have a name?  Do we allow queries against them directly?  etc..

There's certainly something to this, but "not for the faint of heart"
sounds like an understatement.

One of the good things about inheritance is that, if the system
doesn't automatically do the right thing, there's usually an escape
hatch.  If the INSERT trigger you use for tuple routing is too slow,
you can insert directly into the target partition.  If your query
doesn't realize that it can prune away all the partitions but one, or
takes too long to do it, you can query directly against that
partition.  These aren't beautiful things and I'm sure we're all
united in wanting a mechanism that will reduce the need to do them,
but we need to make sure that we are removing the need for the escape
hatch, and not just cementing it shut.

In other words, I don't think there is a problem with people querying
child tables directly; the problem is that people are forced to do so
in order to get good performance.  We shouldn't remove the ability for
people to do that unless we're extremely certain we've fixed the
problem that leads them to wish to do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
"Amit Langote"
Date:
Robert,

>
> I thought putting the partition boundaries into pg_inherits was a
> strange choice.  I'd put it in pg_class, or in pg_partition if we
> decide to create that.

Hmm, yeah I guess we are better off using pg_inherits for just saying that a
partition is an inheritance child. Other details should go elsewhere for sure.

>  Maybe as anyarray, but I think pg_node_tree
> might even be better.  That can also represent data of some arbitrary
> type, but it doesn't enforce that everything is uniform.  So you could
> have a list of objects of the form {RANGEPARTITION :lessthan {CONST
> ...} :partition 16982} or similar.  The relcache could load that up
> and convert the list to a C array, which would then be easy to
> binary-search.
>
> As you say, you also need to store the relevant operator somewhere,
> and the fact that it's a range partition rather than list or hash,
> say.
>

I'm wondering here if it's better to keep partition values per partition
wherein we have two catalogs, say, pg_partitioned_rel and pg_partition_def.

pg_partitioned_rel stores information like partition kind, key (attribute
number(s)?), key opclass(es). Optionally, we could also say here if a given
record (in pg_partitioned_rel) represents an actual top-level partitioned
table or a partition that is sub-partitioned (wherein this record is just a
dummy for keys of sub-partitioning and such); something like partisdummy...

pg_partition_def stores information of individual partitions
(/sub-partitions, too?) such as its parent (either an actual top-level
partitioned table or a sub-partitioning template), whether this is an
overflow/default partition, and partition values.

Such a scheme would be similar to what Greenplum [1] has.

Perhaps this duplicates inheritance and can be argued in that sense, though.

Do you think keeping partition-defining values with the top-level partitioned
table would make some partitioning schemes (multikey, sub-, etc.) a bit
complicated to implement? I cannot offhand imagine the actual implementation
difficulties that might be involved myself, but perhaps you have a better
idea of such details and would have a say...

Thanks,
Amit

[1] http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partition_rule.html

http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partition.html





Re: On partitioning

From
Robert Haas
Date:
On Wed, Nov 19, 2014 at 10:27 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>  Maybe as anyarray, but I think pg_node_tree
>> might even be better.  That can also represent data of some arbitrary
>> type, but it doesn't enforce that everything is uniform.  So you could
>> have a list of objects of the form {RANGEPARTITION :lessthan {CONST
>> ...} :partition 16982} or similar.  The relcache could load that up
>> and convert the list to a C array, which would then be easy to
>> binary-search.
>>
>> As you say, you also need to store the relevant operator somewhere,
>> and the fact that it's a range partition rather than list or hash,
>> say.
>
> I'm wondering here if it's better to keep partition values per partition
> wherein we have two catalogs, say, pg_partitioned_rel and pg_partition_def.
>
> pg_partitioned_rel stores information like partition kind, key (attribute
> number(s)?), key opclass(es). Optionally, we could also say here if a given
> record (in pg_partitioned_rel) represents an actual top-level partitioned
> table or a partition that is sub-partitioned (wherein this record is just a
> dummy for keys of sub-partitioning and such); something like partisdummy...
>
> pg_partition_def stores information of individual partitions
> (/sub-partitions, too?) such as its parent (either an actual top level
> partitioned table or a sub-partitioning template), whether this is an
> overflow/default partition, and partition values.

Yeah, you could do something like this.  There's a certain overhead to
adding additional system catalogs, though.  It means more inodes on
disk, probably more syscaches, and more runtime spent probing those
additional syscache entries to assemble a relcache entry.  On the
other hand, it's got a certain conceptual cleanliness to it.

I do think at a very minimum it's important to have a Boolean flag in
pg_class so that we need not probe what you're calling
pg_partitioned_rel if no partitioning information is present there.  I
might be tempted to go further and add the information you are
proposing to put in pg_partitioned_rel in pg_class instead, and just
add one new catalog.  But it depends on how many columns we end up
with.

Before going too much further with this I'd mock up schemas for your
proposed catalogs and a list of DDL operations to be supported, with
the corresponding syntax, and float that here for comment.

> Such a scheme would be similar to what Greenplum [1] has.

Interesting.

> Perhaps this duplicates inheritance and can be argued in that sense, though.
>
> Do you think keeping partition defining values with the top-level partitioned table would make some partitioning
schemes(multikey, sub- , etc.) a bit complicated to implement? I cannot offhand imagine the actual implementation
difficultiesthat might be involved myself but perhaps you have a better idea of such details and would have a say... 

I don't think this is a big deal one way or the other.  We're all
database folks here, so deciding to normalize for performance or
denormalize for conceptual cleanliness shouldn't tax our powers
unduly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
"Amit Langote"
Date:
Hi,

> > I'm wondering here if it's better to keep partition values per partition
> > wherein we have two catalogs, say, pg_partitioned_rel and pg_partition_def.
> >
> > pg_partitioned_rel stores information like partition kind, key (attribute
> > number(s)?), key opclass(es). Optionally, we could also say here if a given
> > record (in pg_partitioned_rel) represents an actual top-level partitioned table
> > or a partition that is sub-partitioned (wherein this record is just a dummy for
> > keys of sub-partitioning and such); something like partisdummy...
> >
> > pg_partition_def stores information of individual partitions (/sub-partitions,
> > too?) such as its parent (either an actual top level partitioned table or a sub-
> > partitioning template), whether this is an overflow/default partition, and
> > partition values.
>
> Yeah, you could do something like this.  There's a certain overhead to
> adding additional system catalogs, though.  It means more inodes on
> disk, probably more syscaches, and more runtime spent probing those
> additional syscache entries to assemble a relcache entry.  On the
> other hand, it's got a certain conceptual cleanliness to it.
>

Hmm, this could be a concern.

> I do think at a very minimum it's important to have a Boolean flag in
> pg_class so that we need not probe what you're calling
> pg_partitioned_rel if no partitioning information is present there.  I
> might be tempted to go further and add the information you are
> proposing to put in pg_partitioned_rel in pg_class instead, and just
> add one new catalog.  But it depends on how many columns we end up
> with.
>

I think something like pg_class.relispartitioned would be good as a minimum like you said.

> Before going too much further with this I'd mock up schemas for your
> proposed catalogs and a list of DDL operations to be supported, with
> the corresponding syntax, and float that here for comment.
>

I came up with something like the following:

* Catalog schema:

CREATE TABLE pg_catalog.pg_partitioned_rel
(
   partrelid       oid        NOT NULL,
   partkind        oid        NOT NULL,
   partissub       bool       NOT NULL,
   partkey         int2vector NOT NULL, -- partitioning attributes
   partopclass     oidvector,

   PRIMARY KEY (partrelid, partissub),
   FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
   FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
)
WITHOUT OIDS ;

CREATE TABLE pg_catalog.pg_partition_def
(
   partitionid          oid  NOT NULL,
   partitionparentrel   oid  NOT NULL,
   partitionisoverflow  bool NOT NULL,
   partitionvalues      anyarray,

   PRIMARY KEY (partitionid),
   FOREIGN KEY (partitionid) REFERENCES pg_class(oid)
)
WITHOUT OIDS;

ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned bool;

pg_partitioned_rel stores the partitioning information for a partitioned
relation. A pg_class relation has a pg_partitioned_rel entry if
pg_class.relispartitioned is 'true'. Though this can be challenged by saying
we will want to store the sub-partitioning key here too. Do we want a
partition relation to be called partitioned itself for the purpose of
underlying subpartitions? 'partissub' would be true in that case.

pg_partition_def has a row for each relation that has defined restrictions
on the data that the partkey column can take, aka a partition. The data is
known to be within the bounds defined by partitionvalues. Perhaps we could
divide this into two, viz. rangeupperbound and listvalues, for the two
partition types. When we get to multi-level partitioning (sub-partitioning),
the partitions described here would actually be either data-containing
relations (lowest level) or placeholder relations (upper level). The
parentrel is supposed to make it easier to scan for all partitions of a
given partitioned relation. The partitioning hierarchy also stays in the
form of inheritance stored elsewhere (pg_inherits).

The main reasoning behind two separate catalogs (or at least keeping
partition definitions separate) is to make life easier during future
enhancements like sub-partitioning.
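
Under the proposed split, assembling everything needed to route tuples for
one table would mean a single probe of pg_partitioned_rel plus a scan of
pg_partition_def for its children. A hypothetical sketch with in-memory
stand-ins for the two catalogs (all OIDs and names are invented sample
data, not real catalog contents):

```python
# In-memory stand-ins for the two proposed catalogs.
pg_partitioned_rel = {
    16384: {"partkind": "range", "partkey": (1,)},  # parent relid -> key info
}
pg_partition_def = [
    # (partitionid, parentrel, isoverflow, partitionvalues)
    (16385, 16384, False, [100]),
    (16386, 16384, False, [200]),
    (16387, 16384, True,  None),  # overflow/default partition
]

def build_partition_desc(relid):
    """Gather the key info and child partitions for one partitioned table."""
    key_info = pg_partitioned_rel[relid]
    children = [row for row in pg_partition_def if row[1] == relid]
    return {"key": key_info, "partitions": children}
```

This is the lookup pattern whose syscache cost Robert weighs downthread
against the conceptual cleanliness of separate catalogs.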

* DDL syntax (no multi-column partitioning, sub-partitioning support as yet):

-- create partitioned table and child partitions at once.
CREATE TABLE parent (...)
PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
[ (
     PARTITION child
       {
           VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
         | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
       }
       [ WITH ( ... ) ] [ TABLESPACE tbs ]
     [, ...]
  ) ] ;

-- define partitioning key on a table
ALTER TABLE parent PARTITION BY [ RANGE | LIST ] ( key_column ) [ opclass ] [ (...) ] ;

-- create a new partition on a partitioned table with specified values
CREATE PARTITION child ON parent VALUES ...;

-- drop a partition of a partitioned table with specified values
DROP PARTITION child ON parent VALUES ...;

-- attach table as a partition to a partitioned table
ALTER TABLE parent ATTACH PARTITION child VALUES ... ;

-- detach a partition (child continues to exist as a regular table)
ALTER TABLE parent DETACH PARTITION child ;

Thanks,
Amit





Re: On partitioning

From
"Amit Langote"
Date:
Sorry, a correction:

> CREATE TABLE pg_catalog.pg_partitioned_rel
> (
>    partrelid                oid    NOT NULL,
>    partkind                oid    NOT NULL,
>    partissub              bool  NOT NULL,
>    partkey                 int2vector NOT NULL, -- partitioning attributes
>    partopclass         oidvector,
> 
>    PRIMARY KEY (partrelid, partissub),

Rather, PRIMARY KEY (partrelid)

Thanks,
Amit





Re: On partitioning

From
Robert Haas
Date:
On Tue, Nov 25, 2014 at 8:20 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Before going too much further with this I'd mock up schemas for your
>> proposed catalogs and a list of DDL operations to be supported, with
>> the corresponding syntax, and float that here for comment.

More people should really comment on this.  This is a pretty big deal
if it goes forward, so it shouldn't be based on what one or two people
think.

> * Catalog schema:
>
> CREATE TABLE pg_catalog.pg_partitioned_rel
> (
>    partrelid                oid    NOT NULL,
>    partkind                oid    NOT NULL,
>    partissub              bool  NOT NULL,
>    partkey                 int2vector NOT NULL, -- partitioning attributes
>    partopclass         oidvector,
>
>    PRIMARY KEY (partrelid, partissub),
>    FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
>    FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
> )
> WITHOUT OIDS ;

So, we're going to support exactly two levels of partitioning?
partitions with partissub=false and subpartitions with partissub=true?
Why not support only one level of partitioning here but then let the
children have their own pg_partitioned_rel entries if they are
subpartitioned?  That seems like a cleaner design and lets us support
an arbitrary number of partitioning levels if we ever need them.

> CREATE TABLE pg_catalog.pg_partition_def
> (
>    partitionid                      oid     NOT NULL,
>    partitionparentrel       oid    NOT NULL,
>    partitionisoverflow     bool  NOT NULL,
>    partitionvalues             anyarray,
>
>    PRIMARY KEY (partitionid),
>    FOREIGN KEY (partitionid) REFERENCES pg_class(oid)
> )
> WITHOUT OIDS;
>
> ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;

What is an overflow partition and why do we want that?

What are you going to do if the partitioning key has two columns of
different data types?

> * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):
>
> -- create partitioned table and child partitions at once.
> CREATE TABLE parent (...)
> PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> [ (
>      PARTITION child
>        {
>            VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
>          | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
>        }
>        [ WITH ( ... ) ] [ TABLESPACE tbs ]
>      [, ...]
>   ) ] ;

How are you going to dump and restore this, bearing in mind that you
have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
wants to do pg_dump --table name_of_a_partition?

I actually think it will be much cleaner to declare the parent first
and then have separate CREATE TABLE statements that glue the children
in, like CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1,
10000).
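
For concreteness, a sketch of that declare-first flow (table and column names are made up for illustration, and the exact syntax is of course up for debate):

```sql
-- Declare the parent first, carrying only the partitioning key;
-- it has no storage of its own under this design.
CREATE TABLE measurement (
    logdate  date NOT NULL,
    peaktemp int
) PARTITION BY RANGE (logdate);

-- Then glue each child in with its own statement, which also gives
-- pg_dump a natural one-statement-per-partition representation.
CREATE TABLE measurement_y2014 PARTITION OF measurement
    VALUES LESS THAN ('2015-01-01');
```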

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
"Amit Langote"
Date:
Hi Robert,

From: Robert Haas [mailto:robertmhaas@gmail.com]
> > * Catalog schema:
> >
> > CREATE TABLE pg_catalog.pg_partitioned_rel
> > (
> >    partrelid                oid    NOT NULL,
> >    partkind                oid    NOT NULL,
> >    partissub              bool  NOT NULL,
> >    partkey                 int2vector NOT NULL, -- partitioning attributes
> >    partopclass         oidvector,
> >
> >    PRIMARY KEY (partrelid, partissub),
> >    FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
> >    FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
> > )
> > WITHOUT OIDS ;
>
> So, we're going to support exactly two levels of partitioning?
> partitions with partissub=false and subpartitions with partissub=true?
>  Why not support only one level of partitioning here but then let the
> children have their own pg_partitioned_rel entries if they are
> subpartitioned?  That seems like a cleaner design and lets us support
> an arbitrary number of partitioning levels if we ever need them.
>

Yeah, that's what I thought at some point, in favour of dropping partissub altogether.

However, this design doesn't answer one question either: what if we want to support defining both a partition key and a
sub-partition key for a table in advance? That is, without having defined any first-level partition yet; in that case,
what level do we associate the sub-(sub-)partitioning key with, or more to the point, where do we keep it? One way is
to replace partissub with partkeylevel, with level 0 being the topmost-level partitioning key and so on, while keeping
partrelid equal to the pg_class.oid of the parent. That brings us to the next question of managing hierarchies in
pg_partition_def corresponding to partkeylevel in the definition of the topmost partitioned relation. But I guess those
are implementation details rather than representational ones, unless I am being too naïve.

> > CREATE TABLE pg_catalog.pg_partition_def
> > (
> >    partitionid                      oid     NOT NULL,
> >    partitionparentrel       oid    NOT NULL,
> >    partitionisoverflow     bool  NOT NULL,
> >    partitionvalues             anyarray,
> >
> >    PRIMARY KEY (partitionid),
> >    FOREIGN KEY (partitionid) REFERENCES pg_class(oid)
> > )
> > WITHOUT OIDS;
> >
> > ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;
>
> What is an overflow partition and why do we want that?
>

That would be a default partition. That is, where the tuples that don't belong elsewhere (other defined partitions) go.
The VALUES clause of the definition for such a partition would look like:

(a range partition) ... VALUES LESS THAN MAXVALUE
(a list partition) ... VALUES DEFAULT

There has been discussion about whether there shouldn't be such a place for tuples to go. That is, it should generate
an error if a tuple can't go anywhere (or support auto-creating a new one like in interval partitioning?)
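
In the proposed CREATE TABLE syntax, the two overflow forms might be written like this (illustrative names, not final syntax):

```sql
-- RANGE: the last partition absorbs everything at or above
-- the highest explicit bound.
CREATE TABLE measurement (logdate date, peaktemp int)
PARTITION BY RANGE (logdate)
( PARTITION m2014 VALUES LESS THAN ('2015-01-01'),
  PARTITION mmax  VALUES LESS THAN MAXVALUE );

-- LIST: a catch-all partition takes any value not listed elsewhere.
CREATE TABLE orders (region text, amount numeric)
PARTITION BY LIST (region)
( PARTITION asia  VALUES IN ('IN', 'JP'),
  PARTITION other VALUES IN (DEFAULT) );
```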

> What are you going to do if the partitioning key has two columns of
> different data types?
>

Sorry, this totally eluded me. Perhaps, the 'values' needs some more thought. They are one of the most crucial
elements of the scheme.

I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And
I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key)

I think partkind switches the interpretation of the field as appropriate. Am I missing something? By the way, I had
mentioned we could have two values fields, one each for the range and list partition kinds.

> > * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):
> >
> > -- create partitioned table and child partitions at once.
> > CREATE TABLE parent (...)
> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> > [ (
> >      PARTITION child
> >        {
> >            VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
> >          | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
> >        }
> >        [ WITH ( ... ) ] [ TABLESPACE tbs ]
> >      [, ...]
> >   ) ] ;
>
> How are you going to dump and restore this, bearing in mind that you
> have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
> wants to do pg_dump --table name_of_a_partition?
>

Assuming everything (including the partitioned relation and partitions at all levels) has got a pg_class entry of its
own, would OIDs be a problem? Or, if it's possible that it may be, what is the nature of this problem?

If someone pg_dump's an individual partition as a table, we could let it be dumped as just a plain table. I am thinking
we should be able to do that or should be doing just that (?)

> I actually think it will be much cleaner to declare the parent first
> and then have separate CREATE TABLE statements that glue the children
> in, like CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1,
> 10000).
>

Oh, do you mean to do away without any syntax for defining partitions with CREATE TABLE parent?

By the way, do you mean the following:

CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000)

Instead of,

CREATE PARTITION child ON parent VALUES LESS THAN 10000?

And as for the dump of a partitioned table, it does sound cleaner to do it piece by piece, starting with the parent and
its partitioning key (as an ALTER on it?), followed by individual partitions using either of the syntaxes above.
Moreover, we would dump a sub-partition as a partition on its parent partition.

Thanks for your time and valuable input.

Regards,
Amit





Re: On partitioning

From
Jim Nasby
Date:
On 12/2/14, 9:43 PM, Amit Langote wrote:
>> >What is an overflow partition and why do we want that?
>> >
> That would be a default partition. That is, where the tuples that don't belong elsewhere (other defined partitions) go. VALUES clause of the definition for such a partition would look like:
>
> (a range partition) ... VALUES LESS THAN MAXVALUE
> (a list partition) ... VALUES DEFAULT
>
> There has been discussion about whether there shouldn't be such a place for tuples to go. That is, it should generate an error if a tuple can't go anywhere (or support auto-creating a new one like in interval partitioning?)

If we are going to do this, should the data just go into the parent? That's what would happen today.

FWIW, I think an overflow would be useful, but there should be a way to (dis|en)able it.

>> >What are you going to do if the partitioning key has two columns of
>> >different data types?
>> >
> Sorry, this totally eluded me. Perhaps, the 'values' needs some more thought. They are one of the most crucial elements of the scheme.
>
> I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key)
>
> I think partkind switches the interpretation of the field as appropriate. Am I missing something? By the way, I had mentioned we could have two values fields each for range and list partition kind.

The more SQL way would be records (composite types). That would make catalog inspection a LOT easier and presumably
make it easier to change the partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records are stored
internally as tuples; not sure if that would be faster than a List of Consts or a pg_node_tree. Nodes would
theoretically allow using things other than Consts, but I suspect that would be a bad idea.

Something else to consider... our user-space support for ranges is now rangetypes, so perhaps that's what we should use
for range partitioning. The up-side (which would be a double-edged sword) is that you could leave holes in your
partitioning map. Note that in the multi-key case we could still have a record of rangetypes.
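
A sketch of the record-of-rangetypes idea for a two-column range key (the type and names here are hypothetical):

```sql
-- One value of this composite would describe a single partition's
-- bounds over a (date, int) partitioning key; since each range is
-- independent, holes between partitions are representable.
CREATE TYPE partition_bounds AS (
    logdate_bounds daterange,
    id_bounds      int4range
);

SELECT ROW(daterange('2014-01-01', '2015-01-01'),
           int4range(0, 1000))::partition_bounds;
```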
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> From: Robert Haas [mailto:robertmhaas@gmail.com]

> > What is an overflow partition and why do we want that?
> 
> That would be a default partition. That is, where the tuples that
> don't belong elsewhere (other defined partitions) go. VALUES clause of
> the definition for such a partition would look like:
> 
> (a range partition) ... VALUES LESS THAN MAXVALUE 
> (a list partition) ... VALUES DEFAULT
> 
> There has been discussion about whether there shouldn't be such a
> place for tuples to go. That is, it should generate an error if a
> tuple can't go anywhere (or support auto-creating a new one like in
> interval partitioning?)

In my design I initially had overflow partitions too, because I
inherited the idea from Itagaki Takahiro's patch.  Eventually I realized
that it's a useless concept, because you can always have leftmost and
rightmost partitions, which are just regular partitions (except they
don't have a "low key", resp. "high key").  If you don't define
unbounded partitions at either side, it's fine, you just raise an error
whenever the user tries to insert a value for which there is no
partition.

Not real clear to me how this applies to list partitioning, but I have
the hunch that it'd be better to deal with that without overflow
partitions as well.

BTW I think auto-creating partitions is a bad idea in general, because
you get into lock escalation mess and furthermore you have to waste time
checking for existence beforehand, which lowers performance.  Just have
a very easy command that users can run ahead of time (something like
"CREATE PARTITION FOR VALUE now() + '30 days'", whatever), and
preferably one that doesn't fail if the partition already exists; that
way, users can have (for instance) a daily create-30-partitions-ahead
procedure which most days would only create one partition (the one for
30 days in the future) but whenever the odd case happens that the server
is turned off just at that time someday, it creates two -- one belt, 29
suspenders.
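
Such a create-ahead command might look like this (syntax entirely hypothetical; the point is only the idempotent behaviour):

```sql
-- Safe for a daily cron job to re-run: creating a partition that
-- already exists would be a no-op rather than an error.
CREATE PARTITION IF NOT EXISTS ON measurement
    FOR VALUE now() + interval '30 days';
```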

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
"ktm@rice.edu"
Date:
On Wed, Dec 03, 2014 at 10:00:26AM -0300, Alvaro Herrera wrote:
> Amit Langote wrote:
> 
> > From: Robert Haas [mailto:robertmhaas@gmail.com]
> 
> > > What is an overflow partition and why do we want that?
> > 
> > That would be a default partition. That is, where the tuples that
> > don't belong elsewhere (other defined partitions) go. VALUES clause of
> > the definition for such a partition would look like:
> > 
> > (a range partition) ... VALUES LESS THAN MAXVALUE 
> > (a list partition) ... VALUES DEFAULT
> > 
> > There has been discussion about whether there shouldn't be such a
> > place for tuples to go. That is, it should generate an error if a
> > tuple can't go anywhere (or support auto-creating a new one like in
> > interval partitioning?)
> 
> In my design I initially had overflow partitions too, because I
> inherited the idea from Itagaki Takahiro's patch.  Eventually I realized
> that it's a useless concept, because you can always have leftmost and
> rightmost partitions, which are just regular partitions (except they
> don't have a "low key", resp. "high key").  If you don't define
> unbounded partitions at either side, it's fine, you just raise an error
> whenever the user tries to insert a value for which there is no
> partition.
> 
Hi,

Maybe I am not clear on the concept of an overflow partition, but I
thought that it functioned to catch any record that did not fit the
partitioning scheme. Your end-of-range partitions without a "low key" or
"high key" would only catch problems in those areas. If you partitioned
on work days of the week, you should not have anything on
Saturday/Sunday. How would that work? You would want to catch anything
that was not a weekday in the overflow.

Regards,
Ken



Re: On partitioning

From
Stephen Frost
Date:
* ktm@rice.edu (ktm@rice.edu) wrote:
> On Wed, Dec 03, 2014 at 10:00:26AM -0300, Alvaro Herrera wrote:
> > In my design I initially had overflow partitions too, because I
> > inherited the idea from Itagaki Takahiro's patch.  Eventually I realized
> > that it's a useless concept, because you can always have leftmost and
> > rightmost partitions, which are just regular partitions (except they
> > don't have a "low key", resp. "high key").  If you don't define
> > unbounded partitions at either side, it's fine, you just raise an error
> > whenever the user tries to insert a value for which there is no
> > partition.
>
> Maybe I am not clear on the concept of an overflow partition, but I
> thought that it functioned to catch any record that did not fit the
> partitioning scheme. You end of range with out a "low key" or "high
> key" would only catch problems in those areas. If you partitioned on
> work days of the week, you should not have anything on Saturday/Sunday.
> How would that work? You would want to catch anything that was not a
> weekday in the overflow.

Yeah, I'm not a big fan of just dropping data on the floor either.
That's the purview of CHECK constraints and shouldn't be a factor of the
partitioning system, imv.

There is a flip side to this though, which is that users who have those
CHECK constraints probably don't want to be bothered by having to have
an overflow partition, which leads into the question of, if we have them
as a supported capability, what would the default be?  My gut feeling is
that the default should be 'no overflow', in which case I'm not sure
it's useful as it won't be there for these cases where strange data
shows up unexpectedly and the system wants to put it somewhere.

Supporting overflow partitions would also mean supporting the ability to
move data out of those partitions and into 'real' partitions which the
user creates to deal with the odd/new data.  That doesn't strike me as
being too much fun for us to have to figure out, though if we do, we
might be able to do a better job (with less blocking happening, etc)
than the user could.

Lastly, my inclination is that it's a capability which could be added
later if there is demand for it, so perhaps the best answer is to not
include it now (feature creep and all that).

Thanks,
    Stephen

Re: On partitioning

From
"Amit Langote"
Date:
Hi,

> From: Jim Nasby [mailto:Jim.Nasby@BlueTreble.com]
> On 12/2/14, 9:43 PM, Amit Langote wrote:
>
> >> >What are you going to do if the partitioning key has two columns of
> >> >different data types?
> >> >
> > Sorry, this totally eluded me. Perhaps, the 'values' needs some more thought.
> They are one of the most crucial elements of the scheme.
> >
> > I wonder if your suggestion of pg_node_tree plays well here. This then could
> be a list of CONSTs or some such... And I am thinking it's a concern only for
> range partitions, no? (that is, a multicolumn partition key)
> >
> > I think partkind switches the interpretation of the field as appropriate. Am I
> missing something? By the way, I had mentioned we could have two values
> fields each for range and list partition kind.
>
> The more SQL way would be records (composite types). That would make
> catalog inspection a LOT easier and presumably make it easier to change the
> partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records
> are stored internally as tuples; not sure if that would be faster than a List of
> Consts or a pg_node_tree. Nodes would theoretically allow using things other
> than Consts, but I suspect that would be a bad idea.
>

While I couldn’t find an example in the system catalogs where a record/composite type is used, there are instances of
pg_node_tree at a number of places, like in pg_attrdef and others. Could you please point me to such a usage for
reference?

> Something else to consider... our user-space support for ranges is now
> rangetypes, so perhaps that's what we should use for range partitioning. The
> up-side (which would be a double-edged sword) is that you could leave holes
> in your partitioning map. Note that in the multi-key case we could still have a
> record of rangetypes.

That is something I had in mind at least at some point. My general doubt remains about the usage of user-space SQL
types for catalog fields, though I may be completely uninitiated about such usage.

Thanks,
Amit





Re: On partitioning

From
"Amit Langote"
Date:
Hi,

> From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
> Amit Langote wrote:
> 
> > From: Robert Haas [mailto:robertmhaas@gmail.com]
> 
> > > What is an overflow partition and why do we want that?
> >
> > That would be a default partition. That is, where the tuples that
> > don't belong elsewhere (other defined partitions) go. VALUES clause of
> > the definition for such a partition would look like:
> >
> > (a range partition) ... VALUES LESS THAN MAXVALUE
> > (a list partition) ... VALUES DEFAULT
> >
> > There has been discussion about whether there shouldn't be such a
> > place for tuples to go. That is, it should generate an error if a
> > tuple can't go anywhere (or support auto-creating a new one like in
> > interval partitioning?)
> 
> In my design I initially had overflow partitions too, because I
> inherited the idea from Itagaki Takahiro's patch.  Eventually I realized
> that it's a useless concept, because you can always have leftmost and
> rightmost partitions, which are just regular partitions (except they
> don't have a "low key", resp. "high key").  If you don't define
> unbounded partitions at either side, it's fine, you just raise an error
> whenever the user tries to insert a value for which there is no
> partition.
> 

I think your mention of "low key" and "high key" of a partition has forced
me into rethinking how I was going about this. For example, in Itagaki-san's
patch, only upper bound for a range partition would go into the catalog
while the CHECK expression for that partition would use upper bound for
previous partition as lower bound for the partition (an expression of form
lower <= key AND key < upper). I'd think that's presumptuous to a certain
degree in that the arrangement does not allow holes in the range. That also
means range partitions on either end are unbounded on one side. In fact,
what I called overflow partition would get (last_partitions_upper <= key) as
its CHECK expression and vice versa.

You suggest such unbounded partitions be disallowed? That would mean we do not allow either of the partition bounds to
be null in the case of a range partition, and require the list of values to be non-empty in the case of a LIST
partition.

> Not real clear to me how this applies to list partitioning, but I have
> the hunch that it'd be better to deal with that without overflow
> partitions as well.
> 

Likewise, CHECK expression for a LIST overflow partition would look
something like NOT (key = ANY ( ARRAY[<values-of-all-other-partitions>])).

By the way, I am not saying the primary metadata of partitions is CHECK
expression anymore. I hope we can do away without them for partitioning
sooner than later.  I am looking to have bounds/values stored in the
partition definition catalog not as an expression but as something readily
amenable to use at places where it's useful. Suggestions are welcome!

> BTW I think auto-creating partitions is a bad idea in general, because
> you get into lock escalation mess and furthermore you have to waste time
> checking for existance beforehand, which lowers performance.  Just have
> a very easy command that users can run ahead of time (something like
> "CREATE PARTITION FOR VALUE now() + '30 days'", whatever), and
> preferrably one that doesn't fail if the partition already exist; that
> way, users can have (for instance) a daily create-30-partitions-ahead
> procedure which most days would only create one partition (the one for
> 30 days in the future) but whenever the odd case happens that the server
> is turned off just at that time someday, it creates two -- one belt, 29
> suspenders.
> 

Yeah, I mentioned auto-partitioning just to know if that's how people
usually prefer to have overflow cases dealt with. I'd much rather focus on
straightforward cases at this point. Having said that, I agree that users of
partitioning should have the mechanism you mention, though I'm not sure about
the details.

Thanks,
Amit





Re: On partitioning

From
Amit Kapila
Date:
On Thu, Dec 4, 2014 at 10:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>
> Hi,
>
> > From: Jim Nasby [mailto:Jim.Nasby@BlueTreble.com]
> > On 12/2/14, 9:43 PM, Amit Langote wrote:
> >
> > >> >What are you going to do if the partitioning key has two columns of
> > >> >different data types?
> > >> >
> > > Sorry, this totally eluded me. Perhaps, the 'values' needs some more thought.
> > They are one of the most crucial elements of the scheme.
> > >
> > > I wonder if your suggestion of pg_node_tree plays well here. This then could
> > be a list of CONSTs or some such... And I am thinking it's a concern only for
> > range partitions, no? (that is, a multicolumn partition key)
> > >
> > > I think partkind switches the interpretation of the field as appropriate. Am I
> > missing something? By the way, I had mentioned we could have two values
> > fields each for range and list partition kind.
> >
> > The more SQL way would be records (composite types). That would make
> > catalog inspection a LOT easier and presumably make it easier to change the
> > partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records
> > are stored internally as tuples; not sure if that would be faster than a List of
> > Consts or a pg_node_tree. Nodes would theoretically allow using things other
> > than Consts, but I suspect that would be a bad idea.
> >
>
> While I couldn’t find an example in system catalogs where a record/composite type is used, there are instances of pg_node_tree at a number of places like in pg_attrdef and others. Could you please point me to such a usage for reference?
>

I think you can check the same by manually creating table
with a user-defined type.

Create type typ as (f1 int, f2 text);
Create table part_tab(c1 int, c2 typ);

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
Amit Kapila
Date:
On Wed, Dec 3, 2014 at 6:30 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Amit Langote wrote:
> > From: Robert Haas [mailto:robertmhaas@gmail.com]
>
> > > What is an overflow partition and why do we want that?
> >
> > That would be a default partition. That is, where the tuples that
> > don't belong elsewhere (other defined partitions) go. VALUES clause of
> > the definition for such a partition would look like:
> >
> > (a range partition) ... VALUES LESS THAN MAXVALUE
> > (a list partition) ... VALUES DEFAULT
> >
> > There has been discussion about whether there shouldn't be such a
> > place for tuples to go. That is, it should generate an error if a
> > tuple can't go anywhere (or support auto-creating a new one like in
> > interval partitioning?)
>
> In my design I initially had overflow partitions too, because I
> inherited the idea from Itagaki Takahiro's patch.  Eventually I realized
> that it's a useless concept, because you can always have leftmost and
> rightmost partitions, which are just regular partitions (except they
> don't have a "low key", resp. "high key").  If you don't define
> unbounded partitions at either side, it's fine, you just raise an error
> whenever the user tries to insert a value for which there is no
> partition.
>
> Not real clear to me how this applies to list partitioning, but I have
> the hunch that it'd be better to deal with that without overflow
> partitions as well.
>

Well, overflow partitions might not sound like a nice idea and we
might not want to do it, or at least not in the first version. However,
I think it could be useful in certain cases: for example, if in a long
running transaction the user is able to insert many rows into the
appropriate partitions and one row falls outside the defined
partitions' range, an error in such a case can annoy the user. I also
think a similar situation could occur for bulk insert (COPY).


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
"Amit Langote"
Date:
From: Amit Kapila [mailto:amit.kapila16@gmail.com]
On Thu, Dec 4, 2014 at 10:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> > The more SQL way would be records (composite types). That would make
> > catalog inspection a LOT easier and presumably make it easier to change the
> > partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records
> > are stored internally as tuples; not sure if that would be faster than a List of
> > Consts or a pg_node_tree. Nodes would theoretically allow using things other
> > than Consts, but I suspect that would be a bad idea.
> >
>
> While I couldn’t find an example in system catalogs where a record/composite type is used, there are instances of pg_node_tree at a number of places like in pg_attrdef and others. Could you please point me to such a usage for reference?
>

> I think you can check the same by manually creating table
> with a user-defined type.

> Create type typ as (f1 int, f2 text);
> Create table part_tab(c1 int, c2 typ);

Is there such a custom-defined type used in some system catalog? Just not sure how one would put together a custom type
to use in a system catalog, given the way a system catalog is created. That's my concern, but it may not be valid.

Thanks,
Amit





Re: On partitioning

From
Amit Kapila
Date:
On Tue, Dec 2, 2014 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Nov 25, 2014 at 8:20 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> Before going too much further with this I'd mock up schemas for your
> >> proposed catalogs and a list of DDL operations to be supported, with
> >> the corresponding syntax, and float that here for comment.
>
> More people should really comment on this.  This is a pretty big deal
> if it goes forward, so it shouldn't be based on what one or two people
> think.
>
> > * Catalog schema:
> >
> > CREATE TABLE pg_catalog.pg_partitioned_rel
> > (
> >    partrelid                oid    NOT NULL,
> >    partkind                oid    NOT NULL,
> >    partissub              bool  NOT NULL,
> >    partkey                 int2vector NOT NULL, -- partitioning attributes
> >    partopclass         oidvector,
> >
> >    PRIMARY KEY (partrelid, partissub),
> >    FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
> >    FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
> > )
> > WITHOUT OIDS ;
>
> So, we're going to support exactly two levels of partitioning?
> partitions with partissub=false and subpartitions with partissub=true?
>  Why not support only one level of partitioning here but then let the
> children have their own pg_partitioned_rel entries if they are
> subpartitioned?  That seems like a cleaner design and lets us support
> an arbitrary number of partitioning levels if we ever need them.
>
> > CREATE TABLE pg_catalog.pg_partition_def
> > (
> >    partitionid                      oid     NOT NULL,
> >    partitionparentrel       oid    NOT NULL,
> >    partitionisoverflow     bool  NOT NULL,
> >    partitionvalues             anyarray,
> >
> >    PRIMARY KEY (partitionid),
> >    FOREIGN KEY (partitionid) REFERENCES pg_class(oid)
> > )
> > WITHOUT OIDS;
> >
> > ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;
>
> What is an overflow partition and why do we want that?
>
> What are you going to do if the partitioning key has two columns of
> different data types?
>
> > * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):
> >
> > -- create partitioned table and child partitions at once.
> > CREATE TABLE parent (...)
> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> > [ (
> >      PARTITION child
> >        {
> >            VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
> >          | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
> >        }
> >        [ WITH ( ... ) ] [ TABLESPACE tbs ]
> >      [, ...]
> >   ) ] ;
>
> How are you going to dump and restore this, bearing in mind that you
> have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
> wants to do pg_dump --table name_of_a_partition?
>

Do we really need to support DML or pg_dump for individual partitions?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
Amit Kapila
Date:
On Fri, Dec 5, 2014 at 12:27 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> From: Amit Kapila [mailto:amit.kapila16@gmail.com]
> On Thu, Dec 4, 2014 at 10:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >
> > > The more SQL way would be records (composite types). That would make
> > > catalog inspection a LOT easier and presumably make it easier to change the
> > > partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records
> > > are stored internally as tuples; not sure if that would be faster than a List of
> > > Consts or a pg_node_tree. Nodes would theoretically allow using things other
> > > than Consts, but I suspect that would be a bad idea.
> > >
> >
> > While I couldn’t find an example in system catalogs where a record/composite type is used, there are instances of pg_node_tree at a number of places like in pg_attrdef and others. Could you please point me to such a usage for reference?
> >
>
> > I think you can check the same by manually creating table
> > with a user-defined type.
>
> > Create type typ as (f1 int, f2 text);
> > Create table part_tab(c1 int, c2 typ);
>
> Is there such a custom-defined type used in some system catalog? Just not sure how one would put together a custom type to use in a system catalog given the way a system catalog is created. That's my concern but it may not be valid.
>

I think you are right.  I think in this case we need something similar
to column pg_index.indexprs, which is of type pg_node_tree (which
seems to be already suggested by Robert). So maybe we can proceed
with this type and see if anyone else has a better idea.
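
For reference, the existing pg_node_tree columns can be inspected directly, for example:

```sql
-- Expression trees already stored as pg_node_tree in the catalogs:
SELECT indexrelid::regclass, indexprs
FROM pg_index WHERE indexprs IS NOT NULL;

SELECT adrelid::regclass, adbin
FROM pg_attrdef;
```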


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
"Amit Langote"
Date:
From: Amit Kapila [mailto:amit.kapila16@gmail.com]
On Fri, Dec 5, 2014 at 12:27 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> From: Amit Kapila [mailto:amit.kapila16@gmail.com]
> On Thu, Dec 4, 2014 at 10:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >
> > > The more SQL way would be records (composite types). That would make
> > > catalog inspection a LOT easier and presumably make it easier to change the
> > > partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records
> > > are stored internally as tuples; not sure if that would be faster than a List of
> > > Consts or a pg_node_tree. Nodes would theoretically allow using things other
> > > than Consts, but I suspect that would be a bad idea.
> > >
> >
> > While I couldn’t find an example in system catalogs where a record/composite type is used, there are instances of pg_node_tree at a number of places like in pg_attrdef and others. Could you please point me to such a usage for reference?
> >
>
> > I think you can check the same by manually creating table
> > with a user-defined type.
>
> > Create type typ as (f1 int, f2 text);
> > Create table part_tab(c1 int, c2 typ);
>
> Is there such a custom-defined type used in some system catalog? Just not sure how one would put together a custom type to use in a system catalog given the way a system catalog is created. That's my concern but it may not be valid.
>

> I think you are right.  I think in this case we need something similar
> to column pg_index.indexprs which is of type pg_node_tree(which
> seems to be already suggested by Robert). So may be we can proceed
> with this type and see if any one else has better idea.

Yeah, with that, I was thinking we may be able to do something like dump a Node that describes the range partition bounds or the list of allowed values (say, RangePartitionValues, ListPartitionValues).

Thanks,
Amit





Re: On partitioning

From
"Amit Langote"
Date:

From: Amit Kapila [mailto:amit.kapila16@gmail.com]
Sent: Friday, December 05, 2014 5:10 PM
To: Amit Langote
Cc: Jim Nasby; Robert Haas; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
Subject: Re: [HACKERS] On partitioning

On Fri, Dec 5, 2014 at 12:27 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> From: Amit Kapila [mailto:amit.kapila16@gmail.com]
> On Thu, Dec 4, 2014 at 10:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >
> > > The more SQL way would be records (composite types). That would make
> > > catalog inspection a LOT easier and presumably make it easier to change the
> > > partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records
> > > are stored internally as tuples; not sure if that would be faster than a List of
> > > Consts or a pg_node_tree. Nodes would theoretically allow using things other
> > > than Consts, but I suspect that would be a bad idea.
> > >
> >
> > While I couldn’t find an example in system catalogs where a record/composite type is used, there are instances of pg_node_tree at a number of places like in pg_attrdef and others. Could you please point me to such a usage for reference?
> >
>
> > I think you can check the same by manually creating table
> > with a user-defined type.
>
> > Create type typ as (f1 int, f2 text);
> > Create table part_tab(c1 int, c2 typ);
>
> Is there such a custom-defined type used in some system catalog? Just not sure how one would put together a custom type to use in a system catalog given the way a system catalog is created. That's my concern but it may not be valid.
>
>
>  I think you are right.  I think in this case we need something similar
> to column pg_index.indexprs which is of type pg_node_tree(which
> seems to be already suggested by Robert). So may be we can proceed
> with this type and see if any one else has better idea.

One point raised about/against pg_node_tree was that the values represented therein would turn out to be too generalized to be used with advantage during planning. But it seems we could deserialize it in advance back to the internal form (like an array of a struct) as part of the cached relation data. This overhead would only be incurred in the case of partitioned tables. Perhaps this is what Robert suggested elsewhere.

Thanks,
Amit





Re: On partitioning

From
Robert Haas
Date:
On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> So, we're going to support exactly two levels of partitioning?
>> partitions with partissub=false and subpartitions with partissub=true?
>>  Why not support only one level of partitioning here but then let the
>> children have their own pg_partitioned_rel entries if they are
>> subpartitioned?  That seems like a cleaner design and lets us support
>> an arbitrary number of partitioning levels if we ever need them.
>
> Yeah, that's what I thought at some point in favour of dropping partissub altogether. However, not that this design solves it, there is one question - would we want to support defining both the partition key and the sub-partition key for a table in advance? That is, without having defined a first-level partition yet; in that case, what level do we associate the sub-(sub-)partitioning key with, or, more to the point, where do we keep it?

Do we really need to allow that?  I think you let people partition a
toplevel table, and then partition its partitions once they've been
created.  I'm not sure there's a good reason to associate the
subpartitioning scheme with the toplevel table.  For one thing, that
forces all subpartitions to be partitioned the same way - do we want
to insist on that?  If we do, then I agree that we need to think a
little harder here.

> That would be a default partition. That is, where the tuples that don't belong elsewhere (in other defined partitions) go. The VALUES clause of the definition for such a partition would look like:
>
> (a range partition) ... VALUES LESS THAN MAXVALUE
> (a list partition) ... VALUES DEFAULT
>
> There has been discussion about whether there shouldn't be such a place for tuples to go. That is, should it generate an error if a tuple can't go anywhere (or support auto-creating a new one, like in interval partitioning)?

I think Alvaro's response further down the thread is right on target.
But to go into a bit more detail, let's consider the three possible
cases:

- Hash partitioning.  Every key value gets hashed to some partition.
The concept of an overflow or default partition doesn't even make
sense.

- List partitioning.  Each key for which the user has defined a
mapping gets sent to the corresponding partition.  The keys that
aren't mapped anywhere can either (a) cause an error or (b) get mapped
to some default partition.  It's probably useful to offer both
behaviors.  But I don't think it requires a partitionisoverflow
column, because you can represent it some other way, such as by making
partitionvalues NULL, which is otherwise meaningless.
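The list-partitioning behavior described above (mapped keys go to their partition; unmapped keys either error out or fall to a default partition marked by a NULL partitionvalues) can be sketched roughly as follows. This is an illustrative sketch only, with hypothetical names, not the proposed catalog representation:

```python
# Sketch of list-partition tuple routing: keys the user mapped go to their
# partition; unmapped keys go to a default partition if one was declared
# (represented here by the None key), otherwise routing raises an error.
def route_list(key, mapping, has_default):
    if key in mapping:
        return mapping[key]
    if has_default:
        return mapping[None]  # None stands in for the NULL partitionvalues
    raise LookupError("no partition found for key %r" % (key,))

parts = {"us": "p_us", "eu": "p_eu", None: "p_other"}
route_list("us", parts, True)   # -> "p_us"
route_list("jp", parts, True)   # -> "p_other"
```

With has_default=False, the second lookup would instead raise, which corresponds to behavior (a) above.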

- Range partitioning.  In this case, what you've basically got is a
list of partition bounds and a list of target partitions.   Suppose
there are N partition bounds; then there will be N+1 targets.  Some of
those targets can be undefined, meaning an attempt to insert a key
with that value will error out.  For example, suppose the user defines
a partition for values 1-3 and 10-13.  Then your list of partition
bounds looks like this:

1,3,10,13

And your list of destinations looks like this:

undefined,firstpartition,undefined,secondpartition,undefined

More commonly, the ranges will be contiguous, so that there are no
gaps.  If you have everything <10 in the first partition, everything
10-20 in the second partition, and everything else in a third
partition, then you have bounds 10,20 and destinations
firstpartition,secondpartition,thirdpartition.  If you want values
greater than 20 to error out, then you have bounds 10,20 and
destinations firstpartition,secondpartition,undefined.

In none of this do you really have "an overflow partition".  Rather,
the first and last destinations, if defined, catch everything that has
a key lower than the lowest key or higher than the highest key.  If
not defined, you error out.
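The bounds/destinations scheme above amounts to a binary search over the sorted bounds. A minimal sketch, assuming half-open ranges ([lower, upper)), which the VALUES LESS THAN style implies; names here are illustrative only:

```python
import bisect

# N sorted bounds split the key space into N+1 ranges; destinations has
# N+1 entries, one per range.  None means "undefined", so routing errors out.
def route_range(key, bounds, destinations):
    assert len(destinations) == len(bounds) + 1
    dest = destinations[bisect.bisect_right(bounds, key)]
    if dest is None:
        raise LookupError("no partition defined for key %r" % (key,))
    return dest

# Partitions covering [1, 3) and [10, 13), as in the example above.
bounds = [1, 3, 10, 13]
dests = [None, "firstpartition", None, "secondpartition", None]
route_range(2, bounds, dests)    # -> "firstpartition"
route_range(11, bounds, dests)   # -> "secondpartition"
```

A key of 5 falls in the undefined gap between 3 and 10 and errors out, matching the "undefined" destinations in the example.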

> I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key)

I guess you could list or hash partition on multiple columns, too.
And yes, this is why I though of pg_node_tree.

>> > * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):
>> >
>> > -- create partitioned table and child partitions at once.
>> > CREATE TABLE parent (...)
>> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
>> > [ (
>> >      PARTITION child
>> >        {
>> >            VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
>> >          | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
>> >        }
>> >        [ WITH ( ... ) ] [ TABLESPACE tbs ]
>> >      [, ...]
>> >   ) ] ;
>>
>> How are you going to dump and restore this, bearing in mind that you
>> have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
>> wants to do pg_dump --table name_of_a_partition?
>>
> Assuming everything (including the partitioned relation and partitions at all levels) has got a pg_class entry of its own, would OIDs be a problem? Or what would be the nature of this problem, if indeed there is one?

For pg_dump --binary-upgrade, you need a statement like SELECT
binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for
each pg_class entry.  So you can't easily have a single SQL statement
creating multiple such entries.

> Oh, do you mean to do away without any syntax for defining partitions with CREATE TABLE parent?

That's what I was thinking.  Or at least just make that a shorthand
for something that can also be done with a series of SQL statements.

> By the way, do you mean the following:
>
> CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000)
>
> Instead of,
>
> CREATE PARTITION child ON parent VALUES LESS THAN 10000?

To me, it seems more logical to make it a variant of CREATE TABLE,
similar to what we do already with CREATE TABLE tab OF typename.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Robert Haas
Date:
On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do we really need to support dml or pg_dump for individual partitions?

I think we do.  It's quite reasonable for a DBA (or developer or
whatever) to want to dump all the data that's in a single partition;
for example, maybe they have the table partitioned, but also spread
across several servers.  When the data on one machine grows too big,
they want to dump that partition, move it to a new machine, and drop
the partition from the old machine.  That needs to be easy and
efficient.

More generally, with inheritance, I've seen the ability to reference
individual inheritance children be a real life-saver on any number of
occasions.  Now, a new partitioning system that is not as clunky as
constraint exclusion will hopefully be fast enough that people don't
need to do it very often any more.  But I would be really cautious
about removing the option.  That is the equivalent of installing a new
fire suppression system and then boarding up the emergency exit.
Yeah, you *hope* the new fire suppression system is good enough that
nobody will ever need to go out that way any more.  But if you're
wrong, people will die, so getting rid of it isn't prudent.  The
stakes are not quite so high here, but the principle is the same.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Robert Haas
Date:
On Fri, Dec 5, 2014 at 3:11 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I think you are right.  I think in this case we need something similar
>> to column pg_index.indexprs which is of type pg_node_tree(which
>> seems to be already suggested by Robert). So may be we can proceed
>> with this type and see if any one else has better idea.
>
> Yeah, with that, I was thinking we may be able to do something like dump a Node that describes the range partition bounds or the list of allowed values (say, RangePartitionValues, ListPartitionValues).
 

That's exactly the kind of thing I was thinking about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Jim Nasby
Date:
On 12/5/14, 3:42 AM, Amit Langote wrote:
>> >  I think you are right.  I think in this case we need something similar
>> >to column pg_index.indexprs which is of type pg_node_tree(which
>> >seems to be already suggested by Robert). So may be we can proceed
>> >with this type and see if any one else has better idea.
> One point raised about/against pg_node_tree was that the values represented therein would turn out to be too generalized to be used with advantage during planning. But it seems we could deserialize it in advance back to the internal form (like an array of a struct) as part of the cached relation data. This overhead would only be incurred in the case of partitioned tables. Perhaps this is what Robert suggested elsewhere.
 

In order to store a composite type in a catalog, we would need to have one field that has the typid of the composite, and the field that stores the actual composite data would need to be a "dumb" varlena that stores the composite HeapTupleHeader.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
Jim Nasby
Date:
On 12/5/14, 1:22 PM, Jim Nasby wrote:
> On 12/5/14, 3:42 AM, Amit Langote wrote:
>>> >  I think you are right.  I think in this case we need something similar
>>> >to column pg_index.indexprs which is of type pg_node_tree(which
>>> >seems to be already suggested by Robert). So may be we can proceed
>>> >with this type and see if any one else has better idea.
>> One point raised about/against pg_node_tree was that the values represented therein would turn out to be too generalized to be used with advantage during planning. But it seems we could deserialize it in advance back to the internal form (like an array of a struct) as part of the cached relation data. This overhead would only be incurred in the case of partitioned tables. Perhaps this is what Robert suggested elsewhere.
 
>
> In order to store a composite type in a catalog, we would need to have one field that has the typid of the composite, and the field that stores the actual composite data would need to be a "dumb" varlena that stores the composite HeapTupleHeader.

On further thought: if we disallow NULL as a partition boundary, we don't need a separate rowtype; we could just use the one associated with the relation itself. Presumably that would make comparing tuples to the relation list a lot easier.

I was hung up on how that would work in the case of ALTER TABLE, but we'd have the same problem with using pg_node_tree: if you alter a table in such a way that *might* affect your partitioning, you have to do some kind of revalidation anyway.
 

The other option would be to use some custom rowtype to store boundary values and have a method that can form a boundary tuple from a real one. Either way, I suspect this is better than frequently evaluating pg_node_trees.
 

There may be one other option. If range partitions are defined in terms of an expression that is different for every partition (ie: (substr(product_key, 1, 4), date_trunc('month', sales_date))) then we could use a hash of that expression to identify a partition. In other words, range partitioning becomes a special case of hash partitioning. I do think we need a programmatic means to identify the range of an individual partition and hash won't solve that, but the performance of that case isn't critical so we could use pretty much whatever we wanted to there.
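The hash-of-an-expression idea above could be sketched as below. This is a toy illustration with hypothetical names, using Python's built-in hash() as a stand-in for whatever hash function would actually be used:

```python
# Toy sketch: each partition is identified by a coarse "bucket" expression
# computed from the row; hashing that bucket picks the partition, so rows
# sharing a bucket always land together.
def bucket(row):
    # hypothetical per-row expression, standing in for
    # (substr(product_key, 1, 4), date_trunc('month', sales_date))
    return (row["product_key"][:4], row["sales_date"][:7])

def partition_id(row, nparts):
    return hash(bucket(row)) % nparts

row = {"product_key": "ABCD-123", "sales_date": "2014-12-05"}
same = {"product_key": "ABCD-999", "sales_date": "2014-12-31"}
# Same bucket ("ABCD", "2014-12"), hence the same partition.
partition_id(row, 16) == partition_id(same, 16)  # -> True
```

As noted above, this identifies a partition but cannot enumerate its range, so a separate programmatic description of each partition's bounds would still be needed.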
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
Robert Haas
Date:
On Fri, Dec 5, 2014 at 2:52 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> The other option would be to use some custom rowtype to store boundary
> values and have a method that can form a boundary tuple from a real one.
> Either way, I suspect this is better than frequently evaluating
> pg_node_trees.

On what basis do you expect that?  Every time you use a view, you're
using a pg_node_tree.  Nobody's ever complained that having to reload
the pg_node_tree column was too slow, and I see no reason to suppose
that things would be any different here.

I mean, we can certainly invent something new if there is a reason to
do so.  But you (and a few other people) seem to be trying pretty hard
to avoid using the massive amount of infrastructure that we already
have to do almost this exact thing, which puzzles the heck out of me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Jim Nasby
Date:
On 12/5/14, 2:02 PM, Robert Haas wrote:
> On Fri, Dec 5, 2014 at 2:52 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> The other option would be to use some custom rowtype to store boundary
>> values and have a method that can form a boundary tuple from a real one.
>> Either way, I suspect this is better than frequently evaluating
>> pg_node_trees.
>
> On what basis do you expect that?  Every time you use a view, you're
> using a pg_node_tree.  Nobody's ever complained that having to reload
> the pg_node_tree column was too slow, and I see no reason to suppose
> that things would be any different here.
>
> I mean, we can certainly invent something new if there is a reason to
> do so.  But you (and a few other people) seem to be trying pretty hard
> to avoid using the massive amount of infrastructure that we already
> have to do almost this exact thing, which puzzles the heck out of me.

My concern is how to do the routing of incoming tuples. I'm assuming it'd be significantly faster to compare two tuples than to run each tuple through a bunch of node trees.
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
Robert Haas
Date:
On Fri, Dec 5, 2014 at 3:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On what basis do you expect that?  Every time you use a view, you're
>> using a pg_node_tree.  Nobody's ever complained that having to reload
>> the pg_node_tree column was too slow, and I see no reason to suppose
>> that things would be any different here.
>>
>> I mean, we can certainly invent something new if there is a reason to
>> do so.  But you (and a few other people) seem to be trying pretty hard
>> to avoid using the massive amount of infrastructure that we already
>> have to do almost this exact thing, which puzzles the heck out of me.
>
> My concern is how to do the routing of incoming tuples. I'm assuming it'd be
> significantly faster to compare two tuples than to run each tuple through a
> bunch of nodetrees.

As I said before, that's a completely unrelated problem.

To quickly route tuples for range or list partitioning, you're going
to want to have an array of Datums in memory and bsearch it.  That says
nothing about how they should be stored on disk.  Whatever the on-disk
representation looks like, the relcache is going to need to reassemble
it into an array that can be binary-searched.  As long as that's not
hard to do - and none of the proposals here would make it hard to do -
there's no reason to care about this from that point of view.

At least, not that I can see.
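The separation Robert describes can be sketched like this: deserialize the stored representation once (e.g. at relcache load) into a sorted array, then binary-search it per incoming tuple. The textual format below is made up for illustration and is not an actual pg_node_tree:

```python
import bisect

# Hypothetical serialized bounds as stored in the catalog (made-up format,
# standing in for a pg_node_tree).  Parsed once into a sorted array; every
# subsequent insert only binary-searches the cached array.
def load_bounds(serialized):
    return sorted(int(tok) for tok in serialized.split(","))

def route(key, bounds):
    # index of the range the key falls into; 0 means below the lowest bound
    return bisect.bisect_right(bounds, key)

cached = load_bounds("10,20")   # parse cost paid once, not per tuple
route(5, cached)    # -> 0 (the firstpartition slot in the earlier example)
route(15, cached)   # -> 1
route(25, cached)   # -> 2
```

Whatever the on-disk format, only load_bounds would change; the per-tuple routing path stays a binary search over the cached array.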

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Amit Kapila
Date:
On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> > I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key)
>
> I guess you could list or hash partition on multiple columns, too.

How would you distinguish values in a list partition for multiple
columns?  I mean, for a range partition we are sure there will be
exactly one value for each column, but for a list partition there
could be multiple values, not fixed for each partition, so I think it
will not be easy to support a multicolumn partition key for list
partitions.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
Amit Kapila
Date:
On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Do we really need to support dml or pg_dump for individual partitions?
>
> I think we do.  It's quite reasonable for a DBA (or developer or
> whatever) to want to dump all the data that's in a single partition;
> for example, maybe they have the table partitioned, but also spread
> across several servers.  When the data on one machine grows too big,
> they want to dump that partition, move it to a new machine, and drop
> the partition from the old machine.  That needs to be easy and
> efficient.
>
> More generally, with inheritance, I've seen the ability to reference
> individual inheritance children be a real life-saver on any number of
> occasions.  Now, a new partitioning system that is not as clunky as
> constraint exclusion will hopefully be fast enough that people don't
> need to do it very often any more.  But I would be really cautious
> about removing the option.  That is the equivalent of installing a new
> fire suppression system and then boarding up the emergency exit.
> Yeah, you *hope* the new fire suppression system is good enough that
> nobody will ever need to go out that way any more.  But if you're
> wrong, people will die, so getting rid of it isn't prudent.  The
> stakes are not quite so high here, but the principle is the same.
>

Sure, I'm not saying we shouldn't provide any way to take a dump of an
individual partition, just not at the level of an independent table.
Maybe something like --table <table_name>
--partition <partition_name>.

In general, I think we should try to avoid exposing that partitions are
individual tables, as that might hinder any future enhancement in that
area (for example, if someone finds a different and better way to
arrange the partition data, then the currently exposed syntax might
leave us feeling blocked).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
"Amit Langote"
Date:
Hi Robert,

> From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> So, we're going to support exactly two levels of partitioning?
> >> partitions with partissub=false and subpartitions with partissub=true?
> >>  Why not support only one level of partitioning here but then let the
> >> children have their own pg_partitioned_rel entries if they are
> >> subpartitioned?  That seems like a cleaner design and lets us support
> >> an arbitrary number of partitioning levels if we ever need them.
> >
> > Yeah, that's what I thought at some point in favour of dropping partissub
> altogether. However, not that this design solves it, there is one question - if
> we would want to support defining for a table both partition key and sub-
> partition key in advance? That is, without having defined a first level partition
> yet; in that case, what level do we associate sub-(sub-) partitioning key with
> or more to the point where do we keep it?
>
> Do we really need to allow that?  I think you let people partition a
> toplevel table, and then partition its partitions once they've been
> created.  I'm not sure there's a good reason to associate the
> subpartitioning scheme with the toplevel table.  For one thing, that
> forces all subpartitions to be partitioned the same way - do we want
> to insist on that?  If we do, then I agree that we need to think a
> little harder here.
>

To me, it sounds better if we insist on a uniform subpartitioning scheme across all partitions. It seems that's how it's done elsewhere. It would be interesting to hear what others think though.

> > That would be a default partition. That is, where the tuples that don't
> belong elsewhere (other defined partitions) go. VALUES clause of the
> definition for such a partition would look like:
> >
> > (a range partition) ... VALUES LESS THAN MAXVALUE
> > (a list partition) ... VALUES DEFAULT
> >
> > There has been discussion about whether there shouldn't be such a place
> for tuples to go. That is, it should generate an error if a tuple can't go
> anywhere (or support auto-creating a new one like in interval partitioning?)
>
> I think Alvaro's response further down the thread is right on target.
> But to go into a bit more detail, let's consider the three possible
> cases:
>
> - Hash partitioning.  Every key value gets hashed to some partition.
> The concept of an overflow or default partition doesn't even make
> sense.
>
> - List partitioning.  Each key for which the user has defined a
> mapping gets sent to the corresponding partition.  The keys that
> aren't mapped anywhere can either (a) cause an error or (b) get mapped
> to some default partition.  It's probably useful to offer both
> behaviors.  But I don't think it requires a partitionisoverflow
> column, because you can represent it some other way, such as by making
> partitionvalues NULL, which is otherwise meaningless.
>
> - Range partitioning.  In this case, what you've basically got is a
> list of partition bounds and a list of target partitions.   Suppose
> there are N partition bounds; then there will be N+1 targets.  Some of
> those targets can be undefined, meaning an attempt to insert a key
> with that value will error out.  For example, suppose the user defines
> a partition for values 1-3 and 10-13.  Then your list of partition
> bounds looks like this:
>
> 1,3,10,13
>
> And your list of destinations looks like this:
>
> undefined,firstpartition,undefined,secondpartition,undefined
>
> More commonly, the ranges will be contiguous, so that there are no
> gaps.  If you have everything <10 in the first partition, everything
> 10-20 in the second partition, and everything else in a third
> partition, then you have bounds 10,20 and destinations
> firstpartition,secondpartition,thirdpartition.  If you want values
> greater than 20 to error out, then you have bounds 10,20 and
> destinations firstpartition,secondpartition,undefined.
>
> In none of this do you really have "an overflow partition".  Rather,
> the first and last destinations, if defined, catch everything that has
> a key lower than the lowest key or higher than the highest key.  If
> not defined, you error out.

So just to clarify, first and last destinations are considered "defined" if you have something like:

...
PARTITION p1 VALUES LESS THAN 10
PARTITION p2 VALUES BETWEEN 10 AND 20
PARTITION p3 VALUES GREATER THAN 20
...

And "not defined" if:

...
PARTITION p1 VALUES BETWEEN 10 AND 20
...

In the second case, because no explicit definitions for values less than 10 and greater than 20 are in place, rows with those values error out? If so, that makes sense.

>
> > I wonder if your suggestion of pg_node_tree plays well here. This then
> could be a list of CONSTs or some such... And I am thinking it's a concern only
> for range partitions, no? (that is, a multicolumn partition key)
>
> I guess you could list or hash partition on multiple columns, too.
> And yes, this is why I though of pg_node_tree.
>
> >> > * DDL syntax (no multi-column partitioning, sub-partitioning support as
> yet):
> >> >
> >> > -- create partitioned table and child partitions at once.
> >> > CREATE TABLE parent (...)
> >> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> >> > [ (
> >> >      PARTITION child
> >> >        {
> >> >            VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
> >> >          | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
> >> >        }
> >> >        [ WITH ( ... ) ] [ TABLESPACE tbs ]
> >> >      [, ...]
> >> >   ) ] ;
> >>
> >> How are you going to dump and restore this, bearing in mind that you
> >> have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
> >> wants to do pg_dump --table name_of_a_partition?
> >>
> > Assuming everything's (including partitioned relation and partitions at all
> levels) got a pg_class entry of its own, would OIDs be a problem? Or what is
> the nature of this problem if it's possible that it may be.
>
> For pg_dump --binary-upgrade, you need a statement like SELECT
> binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for
> each pg_class entry.  So you can't easily have a single SQL statement
> creating multiple such entries.
>

Hmm, do you mean "pg_dump cannot emit" such an SQL statement, or that there shouldn't be one in the first place?

> > Oh, do you mean to do away without any syntax for defining partitions with
> CREATE TABLE parent?
>
> That's what I was thinking.  Or at least just make that a shorthand
> for something that can also be done with a series of SQL statements.
>

Perhaps this is related to the point just above. So, a single SQL statement that defines the partitioning key and a few partitions/subpartitions based on the key could be supported, provided the resulting set of objects can also be created using an alternative series of steps, each of which creates at most one object. Do we want a key definition to have an oid? Perhaps not.

> > By the way, do you mean the following:
> >
> > CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000)
> >
> > Instead of,
> >
> > CREATE PARTITION child ON parent VALUES LESS THAN 10000?
>
> To me, it seems more logical to make it a variant of CREATE TABLE,
> similar to what we do already with CREATE TABLE tab OF typename.
>

Makes sense. This would double as a way to create subpartitions too? And that would have to play well with any choice we end up making about how we treat the subpartitioning key (one of the points discussed above).

Regards,
Amit






Re: On partitioning

From
"Amit Langote"
Date:

From: Amit Kapila [mailto:amit.kapila16@gmail.com]
Sent: Saturday, December 06, 2014 5:00 PM
To: Robert Haas
Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
Subject: Re: [HACKERS] On partitioning

On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> > I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key)
>
> I guess you could list or hash partition on multiple columns, too.
>
> How would you distinguish values in list partition for multiple
> columns? I mean for range partition, we are sure there will
> be either one value for each column, but for list it could
> be multiple and not fixed for each partition, so I think it will not
> be easy to support the multicolumn partition key for list
> partitions.

Irrespective of the difficulties of representing it using pg_node_tree, it seems to me that multicolumn list partitioning is not widely used. It is used in combination with range or hash partitioning as composite partitioning. So perhaps we need not worry about that.

Regards,
Amit





Re: On partitioning

From
"Amit Langote"
Date:

From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Amit Kapila
Sent: Saturday, December 06, 2014 5:06 PM
To: Robert Haas
Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
Subject: Re: [HACKERS] On partitioning

On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Do we really need to support dml or pg_dump for individual partitions?
>
> I think we do.  It's quite reasonable for a DBA (or developer or
> whatever) to want to dump all the data that's in a single partition;
> for example, maybe they have the table partitioned, but also spread
> across several servers.  When the data on one machine grows too big,
> they want to dump that partition, move it to a new machine, and drop
> the partition from the old machine.  That needs to be easy and
> efficient.
>
> More generally, with inheritance, I've seen the ability to reference
> individual inheritance children be a real life-saver on any number of
> occasions.  Now, a new partitioning system that is not as clunky as
> constraint exclusion will hopefully be fast enough that people don't
> need to do it very often any more.  But I would be really cautious
> about removing the option.  That is the equivalent of installing a new
> fire suppression system and then boarding up the emergency exit.
> Yeah, you *hope* the new fire suppression system is good enough that
> nobody will ever need to go out that way any more.  But if you're
> wrong, people will die, so getting rid of it isn't prudent.  The
> stakes are not quite so high here, but the principle is the same.
>
>
> Sure, I don't feel we should not provide anyway to take dump
> for individual partition but not at level of independent table.
> May be something like --table <table_name>
> --partition <partition_name>.
>

This does sound cleaner.

> In general, I think we should try to avoid exposing that partitions are
> individual tables as that might hinder any future enhancement in that
> area (example if we someone finds a different and better way to
> arrange the partition data, then due to the currently exposed syntax,
> we might feel blocked).

Sounds like a concern. I guess you are referring to whether we allow a partition relation to be included in the range
table, and then some other cases. In the former case we could allow referring to individual partitions by some additional
syntax if it doesn't end up looking too ugly or invite a bunch of other issues.

This seems to have been discussed a little bit upthread (for example, see "Open Questions" in Alvaro's original
proposal and Hannu Krosing's reply).

Regards,
Amit





Re: On partitioning

From
Amit Kapila
Date:
On Mon, Dec 8, 2014 at 11:01 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> From: Amit Kapila [mailto:amit.kapila16@gmail.com]
> Sent: Saturday, December 06, 2014 5:00 PM
> To: Robert Haas
> Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
> Subject: Re: [HACKERS] On partitioning
>
> On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >
> > > I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key)
> >
> > I guess you could list or hash partition on multiple columns, too.
> >
> > How would you distinguish values in list partition for multiple
> > columns? I mean for range partition, we are sure there will
> > be either one value for each column, but for list it could
> > be multiple and not fixed for each partition, so I think it will not
> > be easy to support the multicolumn partition key for list
> > partitions.
>
> Irrespective of difficulties of representing it using pg_node_tree, it seems to me that multicolumn list partitioning is not widely used.

So I think it is better to be clear about why we are not planning to
support it: is it because it is not required by users, because the
code seems tricky, or both?  Knowing that might help if anyone raises
this during the development of this patch, or if someone requests
such a feature in general.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
"Amit Langote"
Date:
From: Amit Kapila [mailto:amit.kapila16@gmail.com]
> > > How would you distinguish values in list partition for multiple
> > > columns? I mean for range partition, we are sure there will
> > > be either one value for each column, but for list it could
> > > be multiple and not fixed for each partition, so I think it will not
> > > be easy to support the multicolumn partition key for list
> > > partitions.
>
> > Irrespective of difficulties of representing it using pg_node_tree, it seems to me that multicolumn list
> > partitioning is not widely used.
>
> So I think it is better to be clear why we are not planning to
> support it, is it that because it is not required by users or
> is it due to the reason that code seems to be tricky or is it due
> to both of the reasons.  It might help us if anyone raises this
> during the development of this patch or in general if someone
> requests such a feature.

Coming back to how the pg_node_tree representation for list partitions might work -

For each column in a multicolumn list partition key, a value would look like a dumped Node for a List of Consts (all
allowed values in a given list partition). And the whole key would then be a List of such Nodes (a dump thereof). That's
perhaps pretty verbose, but I guess that's supposed to be only a catalog representation. During relcache building, we
turn this back into a collection of structs to efficiently locate the partition of interest, whatever the method of doing
that ends up being (based on partition type). The relcache step ensures that we have decoupled the concern of quickly
locating an interesting partition from its catalog representation.
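
The decoupling above could be sketched roughly as follows. This is only an illustration of the idea (hypothetical struct and function names, integer key values for simplicity, none of it actual PostgreSQL code): the verbose per-partition value lists decoded from pg_node_tree are flattened once, at relcache-build time, into one sorted array, so that locating the accepting partition becomes a binary search rather than a scan of every list.

```c
#include <stdlib.h>

/* One accepted value and the partition that accepts it. */
typedef struct {
    int value;     /* an accepted list value */
    int partidx;   /* index of the partition that accepts it */
} ListEntry;

typedef struct {
    ListEntry *entries;   /* kept sorted by value */
    int        nentries;
} ListPartitionMap;

static int entry_cmp(const void *a, const void *b)
{
    const ListEntry *ea = a, *eb = b;
    return (ea->value > eb->value) - (ea->value < eb->value);
}

/* Flatten npart per-partition value lists (the decoded catalog
 * representation) into one sorted array. */
ListPartitionMap *build_map(int npart, const int *nvals, const int *const *vals)
{
    ListPartitionMap *map = malloc(sizeof *map);
    int total = 0, k = 0;

    for (int p = 0; p < npart; p++)
        total += nvals[p];
    map->entries = malloc(total * sizeof *map->entries);
    map->nentries = total;

    for (int p = 0; p < npart; p++)
        for (int i = 0; i < nvals[p]; i++) {
            map->entries[k].value = vals[p][i];
            map->entries[k].partidx = p;
            k++;
        }
    qsort(map->entries, total, sizeof *map->entries, entry_cmp);
    return map;
}

/* Binary search; returns the accepting partition's index, or -1. */
int find_list_partition(const ListPartitionMap *map, int value)
{
    int lo = 0, hi = map->nentries - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (map->entries[mid].value == value)
            return map->entries[mid].partidx;
        if (map->entries[mid].value < value)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```

However the catalog form ends up looking, the point is the same: whatever is cheap to store need not be what is cheap to search.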

Of course, there may be flaws in this picture that would only reveal themselves when actually trying to implement it, or
they can be pointed out in advance.

Thanks,
Amit





Re: On partitioning

From
Josh Berkus
Date:
All,

Pardon me for jumping into this late.  In general, I like Alvaro's
approach.  However, I wanted to list the major shortcomings of the
existing partitioning system (based on complaints by PGX's users and on
IRC) and compare them to Alvaro's proposed implementation to make sure
that enough of them are addressed, and that the ones which aren't
addressed are not being addressed as a clear decision.  We can't address
*all* of the limitations of the current system, but let's make sure that
we're addressing enough of them to make implementing a 2nd partitioning
system worthwhile.

Where I have ? is because I'm not clear from Alvaro's proposal whether
they're addressed or not.

1. The Trigger Problem
the need to write triggers for INSERT/UPDATE/DELETE.
Addressed.

2. The Clutter Problem
cluttering up system views and dumps with hundreds of partitioned tables
Addressed.

3. Creation Problem
the need to write triggers and/or cron jobs to create new partitions
Addressed.

4. Creation Locking Problem
high probability of lock pile-ups whenever a new partition is created on
demand due to multiple backends trying to create the partition at the
same time.
Not Addressed?

5. Constant Problem
Since current partitioned query planning happens before the rewrite
phase, SELECTs do not use partition logic to evaluate even simple
expressions, let alone IMMUTABLE or STABLE functions.
Addressed??

6. Unique Index Problem
Cannot create a unique index across multiple partitions, which prevents
the partitioned table from being FK'd.
Not Addressed
(but could be addressed in the future)

7. JOIN Problem
Two partitioned tables being JOINed need to append and materialize
before the join, causing a very slow join under some circumstances, even
if both tables are partitioned on the same ranges.
Not Addressed?
(but could be addressed in the future)

8. COPY Problem
Cannot bulk-load into the Master, just into individual partitions.
Addressed.

9. Hibernate Problem
When using the trigger method, inserts into the master partition return
0, which Hibernate and some other ORMs regard as an insert failure.
Addressed.

10. Scaling Problem
Inheritance partitioning becomes prohibitively slow for the planner at
somewhere between 100 and 500 partitions depending on various factors.
No idea?

11. Hash Partitioning
Some users would prefer to partition into a fixed number of
hash-allocated partitions.
Not Addressed.

12. Extra Constraint Evaluation
Inheritance partitioning evaluates *all* constraints on the partitions,
whether they are part of the partitioning scheme or not.  This is way
expensive if those are, say, polygon comparisons.
Addressed.


Additionally, I believe that Alvaro's proposal will make the following
activities which are supported by partition-by-inheritance more
difficult or impossible.  Again, these are probably acceptable because
inheritance partitioning isn't going away.  However, we should
consciously decide that:

A. COPY/ETL then attach
In inheritance partitioning, you can easily build a partition outside
the master and then "attach" it, allowing for minimal disturbance of
concurrent users.  Could be addressed in the future.

B. Catchall Partition
Many partitioning schemes currently contain a "catchall" partition which
accepts rows outside of the range of the partitioning scheme, due to bad
input data.  Probably not handled on purpose; Alvaro is proposing that
we reject these instead, or create the partitions on demand, which is a
legitimate approach.

C. Asymmetric Partitioning / NULLs in partition column
This is the classic Active/Inactive By Month setup for partitions.
Could be addressed via special handling for NULL/infinity in the
partitioned column.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
Robert Haas
Date:
On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I guess you could list or hash partition on multiple columns, too.
>
> How would you distinguish values in list partition for multiple
> columns? I mean for range partition, we are sure there will
> be either one value for each column, but for list it could
> be multiple and not fixed for each partition, so I think it will not
> be easy to support the multicolumn partition key for list
> partitions.

I don't understand.  If you want to range partition on columns (a, b),
you say that, say, tuples with (a, b) values less than (100, 200) go
here and the rest go elsewhere.  For list partitioning, you say that,
say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
rest go elsewhere.  I'm not sure how useful that is but it's not
illogical.
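
The comparison Robert describes for range partitioning is plain lexicographic ordering over the key columns. A minimal sketch (hypothetical function with integer keys for simplicity, not actual Postgres code):

```c
#include <stdbool.h>

/* Lexicographic "tuple < bound" test for a multicolumn range
 * partition key: earlier columns dominate later ones, so
 * (99, 999) < (100, 200) while (100, 200) is not < (100, 200).
 * List partitioning on (a, b) would instead test exact equality
 * of the whole tuple against each listed pair. */
bool tuple_before_bound(const int *tuple, const int *bound, int ncols)
{
    for (int i = 0; i < ncols; i++) {
        if (tuple[i] < bound[i])
            return true;
        if (tuple[i] > bound[i])
            return false;
    }
    return false;   /* equal to the bound: not strictly less */
}
```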

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 12:13 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> So just to clarify, first and last destinations are considered "defined" if you have something like:
>
> ...
> PARTITION p1 VALUES LESS THAN 10
> PARTITION p2 VALUES BETWEEN 10 AND 20
> PARTITION p3 VALUES GREATER THAN 20
> ...
>
> And "not defined" if:
>
> ...
> PARTITION p1 VALUES BETWEEN 10 AND 20
> ...

Yes.
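
The distinction can be illustrated with a toy router for the three-partition example quoted above (illustration only, not the proposed implementation): with open-ended first and last destinations, routing is total; with only p2 defined, values outside [10, 20] would have nowhere to go.

```c
/* Router for the example: p1 VALUES LESS THAN 10,
 * p2 BETWEEN 10 AND 20, p3 GREATER THAN 20.  The open-ended
 * first and last partitions guarantee every value lands
 * somewhere.  Returns the partition number. */
int route_range(int value)
{
    if (value < 10)
        return 1;          /* p1: open-ended low end */
    if (value <= 20)
        return 2;          /* p2: the closed range [10, 20] */
    return 3;              /* p3: open-ended high end */
}
```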

>> For pg_dump --binary-upgrade, you need a statement like SELECT
>> binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for
>> each pg_class entry.  So you can't easily have a single SQL statement
>> creating multiple such entries.
>
> Hmm, do you mean "pg_dump cannot emit" such a SQL or there shouldn't be one in the first place?

I mean that the binary upgrade script needs to set the OID for every
pg_class object being restored, and it does that by stashing away up
to one (1) pg_class OID before each CREATE statement.  If a single
CREATE statement generates multiple pg_class entries, this method
doesn't work.
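
The constraint Robert describes can be reduced to a toy sketch (hypothetical, not the actual binary_upgrade code): there is room to stash exactly one OID, consumed by the next pg_class entry created, so a single CREATE producing two entries would leave the second with a fresh OID that no longer matches the old cluster.

```c
/* Toy model of the one-slot OID stash used during binary upgrade. */
static unsigned next_pg_class_oid = 0;   /* 0 = nothing stashed */
static unsigned oid_counter = 1000;      /* fresh-OID allocator */

void set_next_pg_class_oid(unsigned oid) { next_pg_class_oid = oid; }

/* Returns the OID assigned to a newly created pg_class entry. */
unsigned create_pg_class_entry(void)
{
    unsigned oid;
    if (next_pg_class_oid != 0) {
        oid = next_pg_class_oid;   /* use the stashed OID... */
        next_pg_class_oid = 0;     /* ...good for exactly one use */
    } else {
        oid = ++oid_counter;       /* otherwise allocate a fresh one */
    }
    return oid;
}
```

A CREATE generating two pg_class entries would call `create_pg_class_entry()` twice, and only the first would get the preserved OID.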

> Makes sense. This would double as a way to create subpartitions too? And that would have to play well with any choice
> we end up making about how we treat the subpartitioning key (one of the points discussed above)

Yes, I think so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Robert Haas
Date:
On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Sure, I don't feel we should not provide anyway to take dump
> for individual partition but not at level of independent table.
> May be something like --table <table_name>
> --partition <partition_name>.
>
> In general, I think we should try to avoid exposing that partitions are
> individual tables as that might hinder any future enhancement in that
> area (example if we someone finds a different and better way to
> arrange the partition data, then due to the currently exposed syntax,
> we might feel blocked).

I guess I'm in disagreement with you - and, perhaps - the majority on
this point.  I think that ship has already sailed: partitions ARE
tables.  We can try to make it less necessary for users to ever look
at those tables as separate objects, and I think that's a good idea.
But trying to go from a system where partitions are tables, which is
what we have today, to a system where they are not seems like a bad
idea to me.  If we make a major break from how things work today,
we're going to end up having to reimplement stuff that already works.

Besides, I haven't really seen anyone propose something that sounds
like a credible alternative.  If we could make partition objects
things that the storage layer needs to know about but the query
planner doesn't need to understand, that'd be maybe worth considering.
But I don't see any way that that's remotely feasible.  There are lots
of places that we assume that a heap consists of blocks number 0 up
through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
and pieces of the way index vacuuming is handled, which in turn bleeds
into Hot Standby.  You can't just decide that now block numbers are
going to be replaced by some more complex structure, or even that
they're now going to be nonlinear, without breaking a huge amount of
stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Josh Berkus
Date:
On 12/08/2014 11:05 AM, Robert Haas wrote:
> I guess I'm in disagreement with you - and, perhaps - the majority on
> this point.  I think that ship has already sailed: partitions ARE
> tables.  We can try to make it less necessary for users to ever look
> at those tables as separate objects, and I think that's a good idea.
> But trying to go from a system where partitions are tables, which is
> what we have today, to a system where they are not seems like a bad
> idea to me.  If we make a major break from how things work today,
> we're going to end up having to reimplement stuff that already works.

I don't think it's feasible to drop inheritance partitioning at this
point; too many users exploit a lot of peculiarities of that system which
wouldn't be supported by any other system.  So any new partitioning
system we're talking about would be *in addition* to the existing
system.  Hence my prior email trying to make sure that a new proposed
system is sufficiently different from the existing one to be worthwhile.

> Besides, I haven't really seen anyone propose something that sounds
> like a credible alternative.  If we could make partition objects
> things that the storage layer needs to know about but the query
> planner doesn't need to understand, that'd be maybe worth considering.
> But I don't see any way that that's remotely feasible. 

On the other hand, as long as partitions exist exclusively at the
planner layer, we can't fix the existing major shortcomings of
inheritance partitioning, such as its inability to handle expressions.
Again, see previous.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
Andres Freund
Date:
On 2014-12-08 14:05:52 -0500, Robert Haas wrote:
> On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Sure, I don't feel we should not provide anyway to take dump
> > for individual partition but not at level of independent table.
> > May be something like --table <table_name>
> > --partition <partition_name>.
> >
> > In general, I think we should try to avoid exposing that partitions are
> > individual tables as that might hinder any future enhancement in that
> > area (example if we someone finds a different and better way to
> > arrange the partition data, then due to the currently exposed syntax,
> > we might feel blocked).
> 
> I guess I'm in disagreement with you - and, perhaps - the majority on
> this point.  I think that ship has already sailed: partitions ARE
> tables.  We can try to make it less necessary for users to ever look
> at those tables as separate objects, and I think that's a good idea.
> But trying to go from a system where partitions are tables, which is
> what we have today, to a system where they are not seems like a bad
> idea to me.  If we make a major break from how things work today,
> we're going to end up having to reimplement stuff that already works.

I don't think this makes much sense. That'd severely restrict our
ability to do stuff for a long time. Unless we can absolutely rely on
the fact that partitions have the same schema and such we'll rob
ourselves of significant optimization opportunities.

> Besides, I haven't really seen anyone propose something that sounds
> like a credible alternative.  If we could make partition objects
> things that the storage layer needs to know about but the query
> planner doesn't need to understand, that'd be maybe worth considering.
> But I don't see any way that that's remotely feasible.  There are lots
> of places that we assume that a heap consists of blocks number 0 up
> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
> and pieces of the way index vacuuming is handled, which in turn bleeds
> into Hot Standby.  You can't just decide that now block numbers are
> going to be replaced by some more complex structure, or even that
> they're now going to be nonlinear, without breaking a huge amount of
> stuff.

I think you're making a wrong fundamental assumption here. Just because
we define partitions to not be full relations doesn't mean we have to
treat them entirely separate. I don't see why a pg_class.relkind = 'p'
entry would be something actually problematic. That'd easily allow to
treat them differently in all the relevant places (all of ALTER TABLE,
DML et al) and still allow all of the current planner/executor
infrastructure. We can even allow direct SELECTs from individual
partitions if we want to - that's trivial to achieve.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 2:30 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/08/2014 11:05 AM, Robert Haas wrote:
>> I guess I'm in disagreement with you - and, perhaps - the majority on
>> this point.  I think that ship has already sailed: partitions ARE
>> tables.  We can try to make it less necessary for users to ever look
>> at those tables as separate objects, and I think that's a good idea.
>> But trying to go from a system where partitions are tables, which is
>> what we have today, to a system where they are not seems like a bad
>> idea to me.  If we make a major break from how things work today,
>> we're going to end up having to reimplement stuff that already works.
>
> I don't thing its feasible to drop inheritance partitioning at this
> point; too many user exploit a lot of peculiarities of that system which
> wouldn't be supported by any other system.  So any new partitioning
> system we're talking about would be *in addition* to the existing
> system.  Hence my prior email trying to make sure that a new proposed
> system is sufficiently different from the existing one to be worthwhile.

I think any new partitioning system should keep the good things about
the existing system, of which there are some, and not try to reinvent
the wheel.  The yard stick for a new system shouldn't be "is this
different enough?" but "does this solve the problems without creating
new ones?".

>> Besides, I haven't really seen anyone propose something that sounds
>> like a credible alternative.  If we could make partition objects
>> things that the storage layer needs to know about but the query
>> planner doesn't need to understand, that'd be maybe worth considering.
>> But I don't see any way that that's remotely feasible.
>
> On the other hand, as long as partitions exist exclusively at the
> planner layer, we can't fix the existing major shortcomings of
> inheritance partitioning, such as its inability to handle expressions.
> Again, see previous.

Huh?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I guess I'm in disagreement with you - and, perhaps - the majority on
>> this point.  I think that ship has already sailed: partitions ARE
>> tables.  We can try to make it less necessary for users to ever look
>> at those tables as separate objects, and I think that's a good idea.
>> But trying to go from a system where partitions are tables, which is
>> what we have today, to a system where they are not seems like a bad
>> idea to me.  If we make a major break from how things work today,
>> we're going to end up having to reimplement stuff that already works.
>
> I don't think this makes much sense. That'd severely restrict our
> ability to do stuff for a long time. Unless we can absolutely rely on
> the fact that partitions have the same schema and such we'll rob
> ourselves of significant optimization opportunities.

I don't think that's mutually exclusive with the idea of
partitions-as-tables.  I mean, you can add code to the ALTER TABLE
path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
wherever you want.

>> Besides, I haven't really seen anyone propose something that sounds
>> like a credible alternative.  If we could make partition objects
>> things that the storage layer needs to know about but the query
>> planner doesn't need to understand, that'd be maybe worth considering.
>> But I don't see any way that that's remotely feasible.  There are lots
>> of places that we assume that a heap consists of blocks number 0 up
>> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
>> and pieces of the way index vacuuming is handled, which in turn bleeds
>> into Hot Standby.  You can't just decide that now block numbers are
>> going to be replaced by some more complex structure, or even that
>> they're now going to be nonlinear, without breaking a huge amount of
>> stuff.
>
> I think you're making a wrong fundamental assumption here. Just because
> we define partitions to not be full relations doesn't mean we have to
> treat them entirely separate. I don't see why a pg_class.relkind = 'p'
> entry would be something actually problematic. That'd easily allow to
> treat them differently in all the relevant places (all of ALTER TABLE,
> DML et al) and still allow all of the current planner/executor
> infrastructure. We can even allow direct SELECTs from individual
> partitions if we want to - that's trivial to achieve.

We may just be using different words to talk about more-or-less the
same thing, then.  What I'm saying is that I want these things to keep
working:

- Indexes.
- Merge append and any other inheritance-aware query planning techniques.
- Direct access to individual partitions to bypass
tuple-routing/query-planning overhead.

I am not necessarily saying that I have a problem with putting other
restrictions on partitions, like requiring them to have the same tuple
descriptor or the same ACLs as their parents.  Those kinds of details
bear discussion, but I'm not intrinsically opposed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Andres Freund
Date:
On 2014-12-08 14:48:50 -0500, Robert Haas wrote:
> On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I guess I'm in disagreement with you - and, perhaps - the majority on
> >> this point.  I think that ship has already sailed: partitions ARE
> >> tables.  We can try to make it less necessary for users to ever look
> >> at those tables as separate objects, and I think that's a good idea.
> >> But trying to go from a system where partitions are tables, which is
> >> what we have today, to a system where they are not seems like a bad
> >> idea to me.  If we make a major break from how things work today,
> >> we're going to end up having to reimplement stuff that already works.
> >
> > I don't think this makes much sense. That'd severely restrict our
> > ability to do stuff for a long time. Unless we can absolutely rely on
> > the fact that partitions have the same schema and such we'll rob
> > ourselves of significant optimization opportunities.
> 
> I don't think that's mutually exclusive with the idea of
> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> wherever you want.

That'll be a lot of places you'll need to touch. More fundamentally: Why
should we name something a table that's not one?

> >> Besides, I haven't really seen anyone propose something that sounds
> >> like a credible alternative.  If we could make partition objects
> >> things that the storage layer needs to know about but the query
> >> planner doesn't need to understand, that'd be maybe worth considering.
> >> But I don't see any way that that's remotely feasible.  There are lots
> >> of places that we assume that a heap consists of blocks number 0 up
> >> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
> >> and pieces of the way index vacuuming is handled, which in turn bleeds
> >> into Hot Standby.  You can't just decide that now block numbers are
> >> going to be replaced by some more complex structure, or even that
> >> they're now going to be nonlinear, without breaking a huge amount of
> >> stuff.
> >
> > I think you're making a wrong fundamental assumption here. Just because
> > we define partitions to not be full relations doesn't mean we have to
> > treat them entirely separate. I don't see why a pg_class.relkind = 'p'
> > entry would be something actually problematic. That'd easily allow to
> > treat them differently in all the relevant places (all of ALTER TABLE,
> > DML et al) and still allow all of the current planner/executor
> > infrastructure. We can even allow direct SELECTs from individual
> > partitions if we want to - that's trivial to achieve.
> 
> We may just be using different words to talk about more-or-less the
> same thing, then.

That might be.

> What I'm saying is that I want these things to keep working:

> - Indexes.

Nobody argued against that I think.

> - Merge append and any other inheritance-aware query planning
> techniques.

Same here.

> - Direct access to individual partitions to bypass
> tuple-routing/query-planning overhead.

I think that might be ok in some cases, but in general I'd be very wary
to allow that. I think it might be ok to allow direct read access, but
everything else I'd be opposed to. I'd much rather go the route of allowing
too few things and then gradually opening up if required than the other
way round (as that pretty much will never happen because it'll break
deployed systems).

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: On partitioning

From
Josh Berkus
Date:
On 12/08/2014 11:40 AM, Robert Haas wrote:
>> I don't thing its feasible to drop inheritance partitioning at this
>> point; too many user exploit a lot of peculiarities of that system which
>> wouldn't be supported by any other system.  So any new partitioning
>> system we're talking about would be *in addition* to the existing
>> system.  Hence my prior email trying to make sure that a new proposed
>> system is sufficiently different from the existing one to be worthwhile.
> 
> I think any new partitioning system should keep the good things about
> the existing system, of which there are some, and not try to reinvent
> the wheel.  The yard stick for a new system shouldn't be "is this
> different enough?" but "does this solve the problems without creating
> new ones?".

It's unrealistic to assume that a new system would support all of the
features of the existing inheritance partitioning without restriction.
In fact, I'd say that such a requirement amounts to saying "don't
bother trying".

For example, inheritance allows us to have different indexes,
constraints, and even columns on partitions.  We can have overlapping
partitions, and heterogenous multilevel partitioning (partition this
customer by month but partition that customer by week).  We can even add
triggers on individual partitions to reroute data away from a specific
partition.   A requirement to support all of these peculiar uses of
inheritance partitioning would doom any new partitioning project.

>>> Besides, I haven't really seen anyone propose something that sounds
>>> like a credible alternative.  If we could make partition objects
>>> things that the storage layer needs to know about but the query
>>> planner doesn't need to understand, that'd be maybe worth considering.
>>> But I don't see any way that that's remotely feasible.
>>
>> On the other hand, as long as partitions exist exclusively at the
>> planner layer, we can't fix the existing major shortcomings of
>> inheritance partitioning, such as its inability to handle expressions.
>> Again, see previous.
> 
> Huh?

Explained in the other email I posted on this thread.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I don't think that's mutually exclusive with the idea of
>> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
>> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
>> wherever you want.
>
> That'll be a lot of places you'll need to touch. More fundamentally: Why
> should we name something a table that's not one?

Well, I'm not convinced that it isn't one.  And adding a new relkind
will involve a bunch of code churn, too.  But I don't much care to
pre-litigate this: when someone has got a patch, we can either agree
that the approach is OK or argue that it is problematic because X.  I
think we need to hammer down the design in broad strokes first, and
I'm not sure we're totally there yet.

>> - Direct access to individual partitions to bypass
>> tuple-routing/query-planning overhead.
>
> I think that might be ok in some cases, but in general I'd be very wary
> to allow that. I think it might be ok to allow direct read access, but
> everything else I'd be opposed to. I'd much rather go the route of allowing
> too few things and then gradually opening up if required than the other
> way round (as that pretty much will never happen because it'll break
> deployed systems).

Why?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 2:58 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> I think any new partitioning system should keep the good things about
>> the existing system, of which there are some, and not try to reinvent
>> the wheel.  The yard stick for a new system shouldn't be "is this
>> different enough?" but "does this solve the problems without creating
>> new ones?".
>
> It's unrealistic to assume that a new system would support all of the
> features of the existing inheritance partitioning without restriction.
>  In fact, I'd say that such a requirement amounts to saying "don't
> bother trying".
>
> For example, inheritance allows us to have different indexes,
> constraints, and even columns on partitions.  We can have overlapping
> partitions, and heterogenous multilevel partitioning (partition this
> customer by month but partition that customer by week).  We can even add
> triggers on individual partitions to reroute data away from a specific
> partition.   A requirement to support all of these peculiar uses of
> inheritance partitioning would doom any new partitioning project.

I don't think it has to be possible to support every use case that we
can support today; clearly, a part of the goal here is to be LESS
general so that we can be more performant.  But I think the urge to
change too many things at once had better be tempered by a clear-eyed
vision of what can reasonably be accomplished in one patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Jim Nasby
Date:
On 12/8/14, 1:05 PM, Robert Haas wrote:
> Besides, I haven't really seen anyone propose something that sounds
> like a credible alternative.  If we could make partition objects
> things that the storage layer needs to know about but the query
> planner doesn't need to understand, that'd be maybe worth considering.
> But I don't see any way that that's remotely feasible.  There are lots
> of places that we assume that a heap consists of blocks number 0 up
> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
> and pieces of the way index vacuuming is handled, which in turn bleeds
> into Hot Standby.  You can't just decide that now block numbers are
> going to be replaced by some more complex structure, or even that
> they're now going to be nonlinear, without breaking a huge amount of
> stuff.

Agreed, but it's possible to keep a block/CTID interface while doing something different on the disk.

If you think about it, partitioning is really a hack anyway. It clutters up your logical set implementation with a
bunch of physical details. What most people really want when they implement partitioning is simply data locality.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
Jim Nasby
Date:
On 12/8/14, 12:26 PM, Josh Berkus wrote:
> 4. Creation Locking Problem
> high probability of lock pile-ups whenever a new partition is created on
> demand due to multiple backends trying to create the partition at the
> same time.
> Not Addressed?

Do users actually try and create new partitions during DML? That sounds doomed to failure in pretty much any system...

> 6. Unique Index Problem
> Cannot create a unique index across multiple partitions, which prevents
> the partitioned table from being FK'd.
> Not Addressed
> (but could be addressed in the future)

And would be extremely useful even with simple inheritance, let alone partitioning...

> 9. Hibernate Problem
> When using the trigger method, inserts into the master partition return
> 0, which Hibernate and some other ORMs regard as an insert failure.
> Addressed.

It would be really nice to address this with regular inheritance too...

> 11. Hash Partitioning
> Some users would prefer to partition into a fixed number of
> hash-allocated partitions.
> Not Addressed.

Though, you should be able to do that in either system if you bother to define your own hash in a BEFORE trigger...

> A. COPY/ETL then attach
> In inheritance partitioning, you can easily build a partition outside
> the master and then "attach" it, allowing for minimal disturbance of
> concurrent users.  Could be addressed in the future.

How much of the desire for this is because our current "row routing" solutions are very slow? I suspect that's the
biggest reason, and hopefully Alvaro's proposal mostly eliminates it.

> B. Catchall Partition
> Many partitioning schemes currently contain a "catchall" partition which
> accepts rows outside of the range of the partitioning scheme, due to bad
> input data.  Probably not handled on purpose; Alvaro is proposing that
> we reject these instead, or create the partitions on demand, which is a
> legitimate approach.
>
> C. Asymmetric Partitioning / NULLs in partition column
> This is the classic Active/Inactive By Month setup for partitions.
> Could be addressed via special handling for NULL/infinity in the
> partitioned column.

If we allowed for a "catchall partition" and supported normal inheritance/triggers on that partition then users could
continue to do whatever they needed with data that didn't fit the "normal" partitioning pattern.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
Josh Berkus
Date:
On 12/08/2014 02:12 PM, Jim Nasby wrote:
> On 12/8/14, 12:26 PM, Josh Berkus wrote:
>> 4. Creation Locking Problem
>> high probability of lock pile-ups whenever a new partition is created on
>> demand due to multiple backends trying to create the partition at the
>> same time.
>> Not Addressed?
> 
> Do users actually try and create new partitions during DML? That sounds
> doomed to failure in pretty much any system...

There is no question that it would be easier for users to create
partitions on demand automatically.  Particularly if you're partitioning
by something other than time.  For a particular case, consider users on
RDS, which has no cron jobs for creating new partitions; it's on demand
or manually.

It's quite possible that there is no good way to work out the locking
for on-demand partitions though, but *if* we're going to have a 2nd
partition system, I think it's important to at least discuss the
problems with on-demand creation.

>> 11. Hash Partitioning
>> Some users would prefer to partition into a fixed number of
>> hash-allocated partitions.
>> Not Addressed.
> 
> Though, you should be able to do that in either system if you bother to
> define your own hash in a BEFORE trigger...

That doesn't do you any good with the SELECT query, unless you change
your middleware to add a hash(column) to every query.  Which would be
really hard to do for joins.
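The distinction Josh draws can be sketched in a few lines. This is a toy model with illustrative names, not anything PostgreSQL provides: when the system itself knows the hash function, an equality predicate on the partition key prunes to exactly one partition, whereas a hash hidden inside a BEFORE trigger is invisible to the planner.

```python
# Hypothetical sketch, not PostgreSQL internals: built-in hash partitioning
# lets the planner prune an equality predicate to a single partition.
N_PARTITIONS = 8

def route_partition(key):
    """Map a partition-key value to one of N fixed hash partitions."""
    return hash(key) % N_PARTITIONS

def prune_for_equality(key):
    """Planner-side pruning: an equality predicate touches one partition."""
    return [route_partition(key)]

# The partition scanned for WHERE col = 'cust_42' is the same one the
# row was routed to at insert time:
assert prune_for_equality('cust_42') == [route_partition('cust_42')]
```

With a trigger-based hash, by contrast, the planner sees no predicate on the hidden hash column, which is why middleware would have to inject `hash(column) = N` into every query.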

>> A. COPY/ETL then attach
>> In inheritance partitioning, you can easily build a partition outside
>> the master and then "attach" it, allowing for minimal disturbance of
>> concurrent users.  Could be addressed in the future.
> 
> How much of the desire for this is because our current "row routing"
> solutions are very slow? I suspect that's the biggest reason, and
> hopefully Alvaro's proposal mostly eliminates it.

That doesn't always work, though.  In some cases the partition is being
built using some fairly complex logic (think of partitions which are
based on matviews) and there's no fast way to create the new data.
Again, this is an acceptable casualty of an improved design, but if it
will be so, we should consciously decide that.

>> B. Catchall Partition
>> Many partitioning schemes currently contain a "catchall" partition which
>> accepts rows outside of the range of the partitioning scheme, due to bad
>> input data.  Probably not handled on purpose; Alvaro is proposing that
>> we reject these instead, or create the partitions on demand, which is a
>> legitimate approach.
>>
>> C. Asymmetric Partitioning / NULLs in partition column
>> This is the classic Active/Inactive By Month setup for partitions.
>> Could be addressed via special handling for NULL/infinity in the
>> partitioned column.
> 
> If we allowed for a "catchall partition" and supported normal
> inheritance/triggers on that partition then users could continue to do
> whatever they needed with data that didn't fit the "normal" partitioning
> pattern.

That sounds to me like it would fall under the heading of "impossible
levels of backwards-compatibility".


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
"Amit Langote"
Date:
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >> I guess you could list or hash partition on multiple columns, too.
> >
> > How would you distinguish values in list partition for multiple
> > columns? I mean for range partition, we are sure there will
> > be either one value for each column, but for list it could
> > be multiple and not fixed for each partition, so I think it will not
> > be easy to support the multicolumn partition key for list
> > partitions.
>
> I don't understand.  If you want to range partition on columns (a, b),
> you say that, say, tuples with (a, b) values less than (100, 200) go
> here and the rest go elsewhere.  For list partitioning, you say that,
> say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
> rest go elsewhere.  I'm not sure how useful that is but it's not
> illogical.
>

In case of list partitioning, 100 and 200 would respectively be one of the values in lists of allowed values for a and
b. I thought his concern is whether this "list of values for each column in partkey" is as convenient to store and
manipulate as range partvalues.

Thanks,
Amit





Re: On partitioning

From
Amit Kapila
Date:
On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > From: Robert Haas [mailto:robertmhaas@gmail.com]
> > On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >> I guess you could list or hash partition on multiple columns, too.
> > >
> > > How would you distinguish values in list partition for multiple
> > > columns? I mean for range partition, we are sure there will
> > > be either one value for each column, but for list it could
> > > be multiple and not fixed for each partition, so I think it will not
> > > be easy to support the multicolumn partition key for list
> > > partitions.
> >
> > I don't understand.  If you want to range partition on columns (a, b),
> > you say that, say, tuples with (a, b) values less than (100, 200) go
> > here and the rest go elsewhere.  For list partitioning, you say that,
> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
> > rest go elsewhere.  I'm not sure how useful that is but it's not
> > illogical.
> >
>
> In case of list partitioning, 100 and 200 would respectively be one of the values in lists of allowed values for a and b. I thought his concern is whether this "list of values for each column in partkey" is as convenient to store and manipulate as range partvalues.
>

Yeah, and also how would the user specify the values?  As an example,
assume that a table is partitioned on monthly_salary, so the partition
definition would look like:

PARTITION BY LIST(monthly_salary)
(
PARTITION salary_less_than_thousand VALUES(300, 900),
PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
...
)

Now if the user wants to define a multi-column partition based on
monthly_salary and annual_salary, how do we want him to
specify the values?  Basically, how do we distinguish which values
belong to the first column key and which ones belong to the second
column key?
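One way to picture the ambiguity (illustrative Python, not a syntax proposal): with a flat VALUES list over two key columns, a run of numbers cannot say which column each value constrains; if each partition's value list instead holds complete pairs, the ambiguity disappears.

```python
# Hypothetical representation: each list partition holds a set of complete
# (monthly_salary, annual_salary) pairs, so no value is ambiguous about
# which key column it belongs to.
partitions = {
    'salary_band_1': {(300, 4000), (900, 11000)},
    'salary_band_2': {(1500, 18000)},
}

def route(monthly_salary, annual_salary):
    """Route a row to the partition whose value list contains the exact pair."""
    for name, values in partitions.items():
        if (monthly_salary, annual_salary) in values:
            return name
    raise ValueError("no partition accepts this row")

assert route(900, 11000) == 'salary_band_1'
assert route(1500, 18000) == 'salary_band_2'
```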

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
Amit Kapila
Date:
On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I don't think that's mutually exclusive with the idea of
> >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> >> wherever you want.
> >
> > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > should we name something a table that's not one?
>
> Well, I'm not convinced that it isn't one.  And adding a new relkind
> will involve a bunch of code churn, too.  But I don't much care to
> pre-litigate this: when someone has got a patch, we can either agree
> that the approach is OK or argue that it is problematic because X.  I
> think we need to hammer down the design in broad strokes first, and
> I'm not sure we're totally there yet.

That's right; I think at this point defining the top-level behaviour/design
is very important to proceed, and we can decide about the better
implementation approach afterwards (maybe once an initial patch is ready,
because it might not be major work to do it either way).  So here is where
we are on this point as per my understanding: I think that direct 
operations should be prohibited on partitions, you think that they should be
allowed, and Andres thinks that it might be better to allow direct operations
on partitions for reads. 

>
> >> - Direct access to individual partitions to bypass
> >> tuple-routing/query-planning overhead.
> >
> > I think that might be ok in some cases, but in general I'd be very wary
> > to allow that. I think it might be ok to allow direct read access, but
> > everything else I'd be opposed. I'd much rather go the route of allowing
> > too few things and then gradually opening up if required than the other
> > way round (as that pretty much will never happen because it'll break
> > deployed systems).
>
> Why?
>

Because I think it will be difficult for users to write/maintain more of such
code, which is one of the complaints with the previous system, where the user
needs to write triggers to route the tuple to the appropriate partition.
I think in a first step we should try to improve the tuple routing algorithm
so that it is not a pain for users, or at least so that it is on par with some
of the other competitive database systems; if we are not able
to come up with such an implementation, then maybe we can think of
providing direct access as a special way for users to improve performance.

Another reason is that, fundamentally, partitions are managed internally
to divide the user's data in a way that makes access cheaper; we
take the specifications for defining the partitions from users, and allowing
operations on internally managed objects could lead to users writing quite
some code to do what the database actually does internally.  For example,
TOAST tables are used internally to manage large tuples, yet we
don't want users to directly perform DML on those tables.
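The kind of routing improvement alluded to here can be sketched roughly (a toy model, nothing to do with the actual executor code): built-in range routing can locate the target partition with a binary search over sorted bounds, rather than the linear IF/ELSIF chain a hand-written routing trigger typically uses.

```python
import bisect

# Toy range-partitioned table: partition i accepts keys below upper_bounds[i]
# (and at or above the previous bound); bounds are kept sorted.
upper_bounds = [100, 200, 300]
partition_names = ['p0', 'p1', 'p2']   # p0: key < 100, p1: 100 <= key < 200, ...

def route(key):
    """Find the target partition in O(log n) via binary search."""
    i = bisect.bisect_right(upper_bounds, key)
    if i == len(upper_bounds):
        raise ValueError("no partition for key %r" % (key,))
    return partition_names[i]

assert route(50) == 'p0'
assert route(100) == 'p1'   # bounds are exclusive on the upper end
assert route(299) == 'p2'
```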


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
"Amit Langote"
Date:
On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
>> > From: Robert Haas [mailto:robertmhaas@gmail.com]
>> > I don't understand.  If you want to range partition on columns (a, b),
>> > you say that, say, tuples with (a, b) values less than (100, 200) go
>> > here and the rest go elsewhere.  For list partitioning, you say that,
>> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
>> > rest go elsewhere.  I'm not sure how useful that is but it's not
>> > illogical.
>> >
>>
>> In case of list partitioning, 100 and 200 would respectively be one of the
>> values in lists of allowed values for a and b. I thought his concern is
>> whether this "list of values for each column in partkey" is as convenient to
>> store and manipulate as range partvalues.
>>
>
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values.  Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.
>

Amit, in one of my earlier replies to your question of why we may not want to implement multi-column list partitioning
(lack of user interest in the feature or possible complexity of the code), I tried to explain how that may work if we do
choose to go that way. Basically, something we may call PartitionColumnValue should be such that the above issue can be
suitably sorted out.

For example, a partition defining/bounding value would be a pg_node_tree representation of a List of one of the (say)
following parse nodes, as appropriate:

typedef struct PartitionColumnValue
{
    NodeTag     type;
    Oid        *partitionid;
    char       *partcolname;
    char        partkind;
    Node       *partrangelower;
    Node       *partrangeupper;
    List       *partlistvalues;
};

OR separately,

typedef struct RangePartitionColumnValue
{
    NodeTag     type;
    Oid        *partitionid;
    char       *partcolname;
    Node       *partrangelower;
    Node       *partrangeupper;
};

&

typedef struct ListPartitionColumnValue
{
    NodeTag     type;
    Oid        *partitionid;
    char       *partcolname;
    List       *partlistvalues;
};

Where a partition definition would look like

typedef struct PartitionDef
{
    NodeTag     type;
    RangeVar    partition;
    RangeVar    parentrel;
    char       *kind;
    Node       *values;
    List       *options;
    char       *tablespacename;
};

PartitionDef.values is an (ordered) List of PartitionColumnValue, each of which corresponds to one column in the
partition key, in that order.

We should be able to devise a way to load the pg_node_tree representation of PartitionDef.values (on-disk
pg_partition_def.partvalues) into the relcache using a "suitable data structure" so that it becomes readily usable in
the variety of contexts in which we are interested in using this information.
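To make the intended shape concrete, here is a hedged Python mirror of the proposed nodes. Only the field names come from the proposal above; the types and values are illustrative.

```python
from dataclasses import dataclass

# Illustrative stand-ins for the proposed parse nodes; field names follow
# the proposal, everything else is an assumption for demonstration.

@dataclass
class RangePartitionColumnValue:
    partcolname: str
    partrangelower: object
    partrangeupper: object

@dataclass
class PartitionDef:
    partition: str      # partition name
    parentrel: str      # parent table name
    kind: str           # 'range' or 'list'
    values: list        # one entry per partition-key column, in key order

# A partition of a table range-partitioned on (a, b):
pdef = PartitionDef(
    partition='p1', parentrel='t', kind='range',
    values=[RangePartitionColumnValue('a', 0, 100),
            RangePartitionColumnValue('b', 0, 1000)])

# values[i] lines up with the i-th partition key column, as described:
assert [v.partcolname for v in pdef.values] == ['a', 'b']
```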

Regards,
Amit





Re: On partitioning

From
"Amit Langote"
Date:

On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
>> > From: Robert Haas [mailto:robertmhaas@gmail.com]
>> > On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> > >> I guess you could list or hash partition on multiple columns, too.
>> > >
>> > > How would you distinguish values in list partition for multiple
>> > > columns? I mean for range partition, we are sure there will
>> > > be either one value for each column, but for list it could
>> > > be multiple and not fixed for each partition, so I think it will not
>> > > be easy to support the multicolumn partition key for list
>> > > partitions.
>> >
>> > I don't understand.  If you want to range partition on columns (a, b),
>> > you say that, say, tuples with (a, b) values less than (100, 200) go
>> > here and the rest go elsewhere.  For list partitioning, you say that,
>> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
>> > rest go elsewhere.  I'm not sure how useful that is but it's not
>> > illogical.
>> >
>>
>> In case of list partitioning, 100 and 200 would respectively be one of the
>> values in lists of allowed values for a and b. I thought his concern is
>> whether this "list of values for each column in partkey" is as convenient to
>> store and manipulate as range partvalues.
>>
>
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values.  Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.
>

Perhaps you are talking about "syntactic" difficulties that I totally missed in my other reply to this mail?

Can we represent the same data by rather using a subpartitioning scheme? ISTM, semantics would remain the same.

... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?

Thanks,
Amit





Re: On partitioning

From
Alvaro Herrera
Date:
Josh Berkus wrote:

Hi,

> Pardon me for jumping into this late.  In general, I like Alvaro's
> approach.

Please don't call this "Alvaro's approach" as I'm not involved in this
anymore.  Amit Langote has taken ownership of it now.  While some
resemblance to what I originally proposed might remain, I haven't kept
track of how this has evolved and this might be a totally different
thing now.  Or not.

Anyway I just wanted to comment on a single point:

> 6. Unique Index Problem
> Cannot create a unique index across multiple partitions, which prevents
> the partitioned table from being FK'd.
> Not Addressed
> (but could be addressed in the future)

I think it's unlikely that we will ever create a unique index that spans
all the partitions, actually.  Even if there are some wild ideas on how
to implement such a thing, the number of difficult issues that no one
knows how to attack seems too large.  I would perhaps be thinking of
allowing foreign keys to be defined on column sets that are prefixed by
partition keys; unique indexes must exist on all partitions on the same
columns including the partition keys.  (Perhaps make an extra exception
that if a partition allows a single value for the partition column, that
column need not be part of the unique index.)
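Alvaro's reasoning can be checked with a small model (toy code, unrelated to the real index machinery): if every partition enforces uniqueness on a column set that includes the partition key, global uniqueness follows, because rows with equal keys always route to the same partition, where the local index rejects the duplicate.

```python
from collections import defaultdict

def route(key):
    return key % 4                    # toy scheme: 4 hash partitions on `key`

local_indexes = defaultdict(set)      # simulated per-partition unique indexes

def insert(key, other):
    """Per-partition uniqueness on (key, other) implies global uniqueness,
    since rows with equal keys always route to the same partition."""
    part = route(key)
    if (key, other) in local_indexes[part]:
        raise ValueError("duplicate key value")
    local_indexes[part].add((key, other))

insert(10, 'a')
insert(10, 'b')                       # fine: differs in the non-key column
try:
    insert(10, 'a')                   # exact duplicate: same partition catches it
    raise AssertionError("duplicate was not rejected")
except ValueError:
    pass
```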

> 10. Scaling Problem
> Inheritance partitioning becomes prohibitively slow for the planner at
> somewhere between 100 and 500 partitions depending on various factors.
> No idea?

At least it was my intention to make the system scale to a huge number of
partitions, but this requires some forward thinking (such as avoiding
loading the index list of all of them, or even opening all of them at
the planner stage), and I think it would be defeated if we want to keep
all the generality of the inheritance-based approach.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: On partitioning

From
Alvaro Herrera
Date:
Amit Kapila wrote:
> On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
> > >> I don't think that's mutually exclusive with the idea of
> > >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> > >> wherever you want.
> > >
> > > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > > should we name something a table that's not one?
> >
> > Well, I'm not convinced that it isn't one.  And adding a new relkind
> > will involve a bunch of code churn, too.  But I don't much care to
> > pre-litigate this: when someone has got a patch, we can either agree
> > that the approach is OK or argue that it is problematic because X.  I
> > think we need to hammer down the design in broad strokes first, and
> > I'm not sure we're totally there yet.
> 
> That's right, I think at this point defining the top level behaviour/design
> is very important to proceed, we can decide about the better
> implementation approach afterwards (may be once initial patch is ready,
> because it might not be a major work to do it either way).  So here's where
> we are on this point till now as per my understanding, I think that direct
> operations should be prohibited on partitions, you think that they should be
> allowed and Andres think that it might be better to allow direct operations
> on partitions for Read.

FWIW in my original proposal I was rejecting some things that after
further consideration turn out to be possible to allow; for instance
directly referencing individual partitions in COPY.  We could allow
something like

COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
or maybe
COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT

and this would emit the whole partition for year 2000 of table
lineitems, and only that (the value is just computed on the fly to fit
the partitioning constraints for that individual partition).  Then
pg_dump would be able to dump each and every partition separately.

In a similar way we could have COPY FROM allow input into individual
partitions so that such a dump can be restored in parallel for each
partition.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: On partitioning

From
Josh Berkus
Date:
On 12/09/2014 12:17 AM, Amit Langote wrote:
>> Now if user wants to define multi-column Partition based on
>> > monthly_salary and annual_salary, how do we want him to
>> > specify the values.  Basically how to distinguish which values
>> > belong to first column key and which one's belong to second
>> > column key.
>> >
> Perhaps you are talking about "syntactic" difficulties that I totally missed in my other reply to this mail?
> 
> Can we represent the same data by rather using a subpartitioning scheme? ISTM, semantics would remain the same.
> 
> ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?

... or just use arrays.

PARTITION BY LIST ( monthly_salary, annual_salary )
        PARTITION salary_small VALUES ({[300,400],[5000,6000]})
) ....

... but that begs the question of how partition by list over two columns
(or more) would even work?  You'd need an a*b number of partitions, and
the user would be pretty much certain to miss a few value combinations.
Maybe we should just restrict list partitioning to a single column for
a first release, and wait and see if people ask for more?
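The a*b point is just combinatorics; a tiny illustration with hypothetical values:

```python
from itertools import product

# With list partitioning over two columns, every (monthly, annual) pair the
# data can contain must be enumerated in some partition's value list; the
# number of pairs to cover grows multiplicatively with the per-column lists.
monthly_values = [300, 500, 900, 1000, 1500]
annual_values = [4000, 6000, 12000, 18000]

required_pairs = set(product(monthly_values, annual_values))
assert len(required_pairs) == len(monthly_values) * len(annual_values)  # 20
```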

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
Jim Nasby
Date:
On 12/8/14, 5:19 PM, Josh Berkus wrote:
> On 12/08/2014 02:12 PM, Jim Nasby wrote:
>> On 12/8/14, 12:26 PM, Josh Berkus wrote:
>>> 4. Creation Locking Problem
>>> high probability of lock pile-ups whenever a new partition is created on
>>> demand due to multiple backends trying to create the partition at the
>>> same time.
>>> Not Addressed?
>>
>> Do users actually try and create new partitions during DML? That sounds
>> doomed to failure in pretty much any system...
>
> There is no question that it would be easier for users to create
> partitions on demand automatically.  Particularly if you're partitioning
> by something other than time.  For a particular case, consider users on
> RDS, which has no cron jobs for creating new partitions; it's on demand
> or manually.
>
> It's quite possible that there is no good way to work out the locking
> for on-demand partitions though, but *if* we're going to have a 2nd
> partition system, I think it's important to at least discuss the
> problems with on-demand creation.

Yeah, we should discuss it. Perhaps the right answer here may be our own job scheduler, something a lot of folks want
anyway.

>>> 11. Hash Partitioning
>>> Some users would prefer to partition into a fixed number of
>>> hash-allocated partitions.
>>> Not Addressed.
>>
>> Though, you should be able to do that in either system if you bother to
>> define your own hash in a BEFORE trigger...
>
> That doesn't do you any good with the SELECT query, unless you change
> your middleware to add a hash(column) to every query.  Which would be
> really hard to do for joins.
>
>>> A. COPY/ETL then attach
>>> In inheritance partitioning, you can easily build a partition outside
>>> the master and then "attach" it, allowing for minimal disturbance of
>>> concurrent users.  Could be addressed in the future.
>>
>> How much of the desire for this is because our current "row routing"
>> solutions are very slow? I suspect that's the biggest reason, and
>> hopefully Alvaro's proposal mostly eliminates it.
>
> That doesn't always work, though.  In some cases the partition is being
> built using some fairly complex logic (think of partitions which are
> based on matviews) and there's no fast way to create the new data.
> Again, this is an acceptable casualty of an improved design, but if it
> will be so, we should consciously decide that.

Is there an example you can give here? If the scheme is that complicated, I'm failing to see how you're supposed to do
things like partition elimination.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
Amit Kapila
Date:
On Tue, Dec 9, 2014 at 11:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/09/2014 12:17 AM, Amit Langote wrote:
> >> Now if user wants to define multi-column Partition based on
> >> > monthly_salary and annual_salary, how do we want him to
> >> > specify the values.  Basically how to distinguish which values
> >> > belong to first column key and which one's belong to second
> >> > column key.
> >> >
> > Perhaps you are talking about "syntactic" difficulties that I totally missed in my other reply to this mail?
> >
> > Can we represent the same data by rather using a subpartitioning scheme? ISTM, semantics would remain the same.
> >
> > ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?
>

Using SUBPARTITION is not the answer for multi-column partitioning.
I think if we have to support it for list partitioning then something
along the lines of what Josh has mentioned below could work out, but I don't
think it is important to support multi-column partitioning for LIST at this
stage.

> ... or just use arrays.
>
> PARTITION BY LIST ( monthly_salary, annual_salary )
>         PARTITION salary_small VALUES ({[300,400],[5000,6000]})
> ) ....
>
> ... but that begs the question of how partition by list over two columns
> (or more) would even work?  You'd need an a*b number of partitions, and
> the user would be pretty much certain to miss a few value combinations.
>  Maybe we should just restrict list partitioning to a single column for
> a first release, and wait and see if people ask for more?
>

I also think we should not support multi-column list partitioning in the
first release.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
Amit Kapila
Date:
On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Amit Kapila wrote:
> > On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com>
> > wrote:
> > > >> I don't think that's mutually exclusive with the idea of
> > > >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> > > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> > > >> wherever you want.
> > > >
> > > > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > > > should we name something a table that's not one?
> > >
> > > Well, I'm not convinced that it isn't one.  And adding a new relkind
> > > will involve a bunch of code churn, too.  But I don't much care to
> > > pre-litigate this: when someone has got a patch, we can either agree
> > > that the approach is OK or argue that it is problematic because X.  I
> > > think we need to hammer down the design in broad strokes first, and
> > > I'm not sure we're totally there yet.
> >
> > That's right, I think at this point defining the top level behaviour/design
> > is very important to proceed, we can decide about the better
> > implementation approach afterwards (may be once initial patch is ready,
> > because it might not be a major work to do it either way).  So here's where
> > we are on this point till now as per my understanding, I think that direct
> > operations should be prohibited on partitions, you think that they should be
> > allowed and Andres think that it might be better to allow direct operations
> > on partitions for Read.
>
> FWIW in my original proposal I was rejecting some things that after
> further consideration turn out to be possible to allow; for instance
> directly referencing individual partitions in COPY.  We could allow
> something like
>
> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> or maybe
> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
>
or
COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
COPY [TABLE] lineitems PARTITION <part_1,part_2,>  TO STDOUT

I think we should try to support operations on partitions via the main
table wherever it is required.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
"Amit Langote"
Date:

On Wed, Dec 10, 2014 at 12:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 11:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 12/09/2014 12:17 AM, Amit Langote wrote:
>> >> Now if user wants to define multi-column Partition based on
>> >> > monthly_salary and annual_salary, how do we want him to
>> >> > specify the values.  Basically how to distinguish which values
>> >> > belong to first column key and which one's belong to second
>> >> > column key.
>> >> >
>> > Perhaps you are talking about "syntactic" difficulties that I totally
>> > missed in my other reply to this mail?
>> >
>> > Can we represent the same data by rather using a subpartitioning scheme?
>> > ISTM, semantics would remain the same.
>> >
>> > ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?
>>
>
> Using SUBPARTITION is not the answer for multi-column partition,
> I think if we have to support it for List partitioning then something
> on lines what Josh has mentioned below could workout, but I don't
> think it is important to support multi-column partition for List at this
> stage.
>

Yeah, I realize multicolumn list partitioning and list-list composite partitioning are different things in many
respects. And given how awkward multicolumn list partitioning is looking to implement, I also think we should only
allow a single column in a list partition key.

>> ... or just use arrays.
>>
>> PARTITION BY LIST ( monthly_salary, annual_salary )
>>         PARTITION salary_small VALUES ({[300,400],[5000,6000]})
>> ) ....
>>
>> ... but that begs the question of how partition by list over two columns
>> (or more) would even work?  You'd need an a*b number of partitions, and
>> the user would be pretty much certain to miss a few value combinations.
>>  Maybe we should just restrict list partitioning to a single column for
>> a first release, and wait and see if people ask for more?
>>
>
> I also think we should not support multi-column list partition in first
> release.
>

Yes.

Thanks,
Amit





Re: On partitioning

From
"Amit Langote"
Date:

On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>>
>> Amit Kapila wrote:
>> > On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas <robertmhaas@gmail.com>
>> > wrote:
>> > > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com>
>> > wrote:
>> > > >> I don't think that's mutually exclusive with the idea of
>> > > >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
>> > > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR,
>> > > >> ...)
>> > > >> wherever you want.
>> > > >
>> > > > That'll be a lot of places you'll need to touch. More fundamentally:
>> > > > Why
>> > > > should we name something a table that's not one?
>> > >
>> > > Well, I'm not convinced that it isn't one.  And adding a new relkind
>> > > will involve a bunch of code churn, too.  But I don't much care to
>> > > pre-litigate this: when someone has got a patch, we can either agree
>> > > that the approach is OK or argue that it is problematic because X.  I
>> > > think we need to hammer down the design in broad strokes first, and
>> > > I'm not sure we're totally there yet.
>> >
>> > That's right, I think at this point defining the top level
>> > behaviour/design
>> > is very important to proceed, we can decide about the better
>> > implementation approach afterwards (may be once initial patch is ready,
>> > because it might not be a major work to do it either way).  So here's
>> > where
>> > we are on this point till now as per my understanding, I think that
>> > direct
>> > operations should be prohibited on partitions, you think that they
>> > should be
>> > allowed and Andres think that it might be better to allow direct
>> > operations
>> > on partitions for Read.
>>
>> FWIW in my original proposal I was rejecting some things that after
>> further consideration turn out to be possible to allow; for instance
>> directly referencing individual partitions in COPY.  We could allow
>> something like
>>
>> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
>> or maybe
>> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
>>
> or
> COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
> COPY [TABLE] lineitems PARTITION <part_1,part_2,>  TO STDOUT
>
> I think we should try to support operations on partitions via main
> table whereever it is required.
>

We could also allow explicitly naming a partition:

COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT;

Thanks,
Amit





Re: On partitioning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> > wrote:

> >> FWIW in my original proposal I was rejecting some things that after
> >> further consideration turn out to be possible to allow; for instance
> >> directly referencing individual partitions in COPY.  We could allow
> >> something like
> >>
> >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> >> or maybe
> >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
> >>
> > or
> > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
> > COPY [TABLE] lineitems PARTITION <part_1,part_2,>  TO STDOUT
> >
> > I think we should try to support operations on partitions via main
> > table whereever it is required.

Um, I think the only difference is that you added the noise word TABLE
which we currently don't allow in COPY, and that you added the
possibility of using named partitions, about which see below.

> We can also allow to explicitly name a partition
> 
> COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT;

The problem with naming partitions is that the user has to pick names
for every partition, which is tedious and doesn't provide any
significant benefit.  The input I had from users of other partitioning
systems was that they very much preferred not to name the partitions at
all, which is why I chose the PARTITION FOR VALUE syntax (not sure if
this syntax is exactly what other systems use; it just seemed the
natural choice.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: On partitioning

From
Robert Haas
Date:
On Wed, Dec 10, 2014 at 9:22 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit.  The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all, which is why I chose the PARTITION FOR VALUE syntax (not sure if
> this syntax is exactly what other systems use; it just seemed the
> natural choice.)

FWIW, Oracle does name partitions.  It generates the names
automatically if you don't care to specify them, and the partition
names for a given table live in their own namespace that is separate
from the toplevel object namespace.  For example:

CREATE TABLE sales
  ( invoice_no NUMBER,
    sale_year  INT NOT NULL,
    sale_month INT NOT NULL,
    sale_day   INT NOT NULL )
  STORAGE (INITIAL 100K NEXT 50K) LOGGING
  PARTITION BY RANGE ( sale_year, sale_month, sale_day )
    ( PARTITION sales_q1 VALUES LESS THAN ( 1999, 04, 01 )
        TABLESPACE tsa STORAGE (INITIAL 20K, NEXT 10K),
      PARTITION sales_q2 VALUES LESS THAN ( 1999, 07, 01 )
        TABLESPACE tsb,
      PARTITION sales_q3 VALUES LESS THAN ( 1999, 10, 01 )
        TABLESPACE tsc,
      PARTITION sales_q4 VALUES LESS THAN ( 2000, 01, 01 )
        TABLESPACE tsd )
  ENABLE ROW MOVEMENT;
 

I don't think this practice has much to recommend it.  We're going to
need a way to refer to individual partitions by name, and I don't see
much benefit in making that name something other than what is stored
in pg_class.relname.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 5:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Agreed, but it's possible to keep a block/CTID interface while doing
> something different on the disk.

Objection: hand-waving.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values.  Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.

I assume you just add some parentheses.

PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1),
(valA2, valB2), (valA3, valB3)))

Multi-column list partitioning may or may not be worth implementing,
but the syntax is not a real problem.
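For illustration, the tuple-to-partition lookup such a syntax implies could be sketched in standalone Python (all names here are hypothetical; this is not PostgreSQL code, just the routing idea):

```python
# Sketch: multi-column LIST partition routing. Each partition is defined
# by a set of (colA, colB) value tuples, mirroring the syntax
#   PARTITION BY LIST (colA, colB) (PARTITION p1 VALUES ((1, 'x'), (2, 'y')), ...)

def build_list_routing(partitions):
    """Flatten partition definitions into one tuple -> partition-name map,
    rejecting a value tuple that appears in two partitions."""
    routing = {}
    for name, value_tuples in partitions.items():
        for vt in value_tuples:
            if vt in routing:
                raise ValueError(f"duplicate value tuple {vt}")
            routing[vt] = name
    return routing

def route_row(routing, key):
    """Return the partition for a row's key tuple, or None (no match)."""
    return routing.get(key)

parts = {
    "p1": [(1, "x"), (2, "y")],
    "p2": [(3, "z")],
}
routing = build_list_routing(parts)
print(route_row(routing, (2, "y")))  # -> p1
print(route_row(routing, (9, "q")))  # -> None
```

The dict lookup also shows why the syntax itself is the easy part: once the value tuples are unambiguous, routing is trivial.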

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
"Amit Langote"
Date:

> From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund
> <andres@2ndquadrant.com> wrote:
> >> I don't think that's mutually exclusive with the idea of
> >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> >> wherever you want.
> >
> > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > should we name something a table that's not one?
>
> Well, I'm not convinced that it isn't one.  And adding a new relkind
> will involve a bunch of code churn, too.  But I don't much care to
> pre-litigate this: when someone has got a patch, we can either agree
> that the approach is OK or argue that it is problematic because X.  I
> think we need to hammer down the design in broad strokes first, and
> I'm not sure we're totally there yet.
>

In heap_create(), do we create storage for a top-level partitioned table (say, RELKIND_PARTITIONED_TABLE)? How about a
partition that is further sub-partitioned? We might allocate storage for a partition at some point and then later choose
to sub-partition it. In such a case, perhaps, we would have to move existing data to the storage of subpartitions and
deallocate the partition's storage. In other words, only leaf relations in a partition hierarchy would have storage. Is
there such a notion within the code for some other purpose, or would we have to invent it for the partitioning scheme?

Thanks,
Amit





Re: On partitioning

From
Robert Haas
Date:
On Wed, Dec 10, 2014 at 7:25 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> In heap_create(), do we create storage for a top-level partitioned table (say, RELKIND_PARTITIONED_TABLE)? How about
> a partition that is further sub-partitioned? We might allocate storage for a partition at some point and then later
> choose to sub-partition it. In such a case, perhaps, we would have to move existing data to the storage of
> subpartitions and deallocate the partition's storage. In other words, only leaf relations in a partition hierarchy
> would have storage. Is there such a notion within the code for some other purpose, or would we have to invent it for
> the partitioning scheme?

I think it would be advantageous to have storage only for the leaf
partitions, because then you don't need to waste time doing a
zero-block sequential scan of the root as part of the append-plan, an
annoyance of the current system.

We have no concept for this right now; in fact, right now, the relkind
fully determines whether a given relation has storage.  One idea is to
make the leaves relkind = 'r' and the interior nodes some new relkind.
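The leaf-only-storage idea can be sketched in a few lines of standalone Python (purely illustrative, not PostgreSQL code): sub-partitioning a leaf re-routes its rows into new children and drops its own storage, leaving interior nodes empty.

```python
# Sketch of "only leaves have storage": sub-partitioning a node moves its
# rows into new leaf children and drops the node's own storage.

class PartNode:
    def __init__(self, name):
        self.name = name
        self.children = []          # empty => leaf
        self.storage = []           # rows; only leaves keep any

    def is_leaf(self):
        return not self.children

    def subpartition(self, child_names, route):
        """Turn this leaf into an interior node: create children, re-route
        existing rows with route(row) -> child index, drop own storage."""
        assert self.is_leaf()
        self.children = [PartNode(n) for n in child_names]
        for row in self.storage:
            self.children[route(row)].storage.append(row)
        self.storage = []           # interior nodes hold no data

root = PartNode("sales_2000")
root.storage = [3, 8, 11]
root.subpartition(["lo", "hi"], lambda row: 0 if row < 10 else 1)
print([c.storage for c in root.children])  # -> [[3, 8], [11]]
print(root.storage)                        # -> []
```

An append-plan over such a tree would then scan only the leaves, avoiding the zero-block scan of the root mentioned above.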

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Amit Kapila
Date:
On Wed, Dec 10, 2014 at 7:52 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Amit Langote wrote:
>
> > On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> > > wrote:
>
> > >> FWIW in my original proposal I was rejecting some things that after
> > >> further consideration turn out to be possible to allow; for instance
> > >> directly referencing individual partitions in COPY.  We could allow
> > >> something like
> > >>
> > >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> > >> or maybe
> > >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
> > >>
> > > or
> > > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
> > > COPY [TABLE] lineitems PARTITION <part_1,part_2,>  TO STDOUT
> > >
> > > I think we should try to support operations on partitions via main
> > > table whereever it is required.
>
> Um, I think the only difference is that you added the noise word TABLE
> which we currently don't allow in COPY,

Yeah, we could eliminate the TABLE keyword from this syntax; the reason
I kept it was for easier understanding of the syntax. Currently we don't
have a concept of PARTITION in the COPY syntax, but if we want to
introduce such a concept, then it might be better to have the TABLE
keyword for the sake of syntax clarity.

> and that you added the
> possibility of using named partitions, about which see below.
>
> > We can also allow to explicitly name a partition
> >
> > COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT;
>
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit.  The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all,

It seems to me both Oracle and DB2 support named partitions, so even
though there are users who don't prefer named partitions, I suspect an
equal or greater number of users will prefer them, both for the sake
of migration and because they are already used to such a syntax.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
Amit Kapila
Date:
On Wed, Dec 10, 2014 at 11:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah and also how would user specify the values, as an example
> > assume that table is partitioned on monthly_salary, so partition
> > definition would look:
> >
> > PARTITION BY LIST(monthly_salary)
> > (
> > PARTITION salary_less_than_thousand VALUES(300, 900),
> > PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> > ...
> > )
> >
> > Now if user wants to define multi-column Partition based on
> > monthly_salary and annual_salary, how do we want him to
> > specify the values.  Basically how to distinguish which values
> > belong to first column key and which one's belong to second
> > column key.
>
> I assume you just add some parentheses.
>
> PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1),
> (valA2, valB2), (valA3, valB3))
>
> Multi-column list partitioning may or may not be worth implementing,
> but the syntax is not a real problem.
>

Yeah, either this way or what Josh suggested upthread; the main
point was that if we want to support multi-column list partitioning
at all, then we need slightly different syntax. However, I feel that we
can leave multi-column list partitioning out of the first version.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
Robert Haas
Date:
On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah either this way or what Josh has suggested upthread, the main
> point was that if at all we want to support multi-column list partitioning
> then we need to have slightly different syntax, however I feel that we
> can leave multi-column list partitioning for first version.

Yeah, possibly.

I think we could stand to have a lot more discussion about the syntax
here.  So far the idea seems to be to copy what Oracle has, but it's
not clear if we're going to have exactly what Oracle has or something
subtly different.  I personally don't find the Oracle syntax very
PostgreSQL-ish.  Stuff like "VALUES LESS THAN 500" doesn't sit
especially well with me - less than according to which opclass?  Are
we going to insist that partitioning must use the default btree
opclass so that we can use that syntax?  That seems kind of lame.

There are lots of interesting things we could do here, e.g.:

CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);
CREATE TABLE child_name PARTITION OF parent_name
   FOR { (value, ...) [ TO (value, ...) ] } [, ...];

So instead of making a hard distinction between range and list
partitioning, you can say:

CREATE TABLE child_name PARTITION OF parent_name FOR (3), (5), (7);
CREATE TABLE child2_name PARTITION OF parent_name FOR (8) TO (12);
CREATE TABLE child3_name PARTITION OF parent_name FOR (20) TO (30),
(120) TO (130);

Now that might be a crappy idea for various reasons, but the point is
there are a lot of details to be hammered out with the syntax, and
there are several ways we can go wrong.  If we choose an
overly-limiting syntax, we're needlessly restricting what can be done.
If we choose an overly-permissive syntax, we'll restrict the
optimization opportunities.
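The unified FOR (v), ..., (lo) TO (hi) idea above amounts to each partition carrying a list of point and range specs; the membership test it implies could be sketched like this (hypothetical standalone Python, not a proposed implementation):

```python
# Sketch of the unified "FOR (v), ..., (lo) TO (hi), ..." idea: a partition
# bound is just a list of specs, each either a single value or a
# half-open range.

def spec_matches(spec, value):
    if len(spec) == 1:                 # point spec: FOR (v)
        return value == spec[0]
    lo, hi = spec                      # range spec: FOR (lo) TO (hi)
    return lo <= value < hi

def partition_accepts(specs, value):
    return any(spec_matches(s, value) for s in specs)

child1 = [(3,), (5,), (7,)]            # FOR (3), (5), (7)
child2 = [(8, 12)]                     # FOR (8) TO (12)
child3 = [(20, 30), (120, 130)]        # FOR (20) TO (30), (120) TO (130)

print(partition_accepts(child1, 5))    # -> True
print(partition_accepts(child2, 12))   # -> False (upper bound exclusive)
print(partition_accepts(child3, 125))  # -> True
```

Whether the upper bound is inclusive or exclusive is itself one of the details to hammer out; the half-open convention is only an assumption here.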

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
"Amit Langote"
Date:
> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > Yeah either this way or what Josh has suggested upthread, the main
> > point was that if at all we want to support multi-column list partitioning
> > then we need to have slightly different syntax, however I feel that we
> > can leave multi-column list partitioning for first version.
>
> Yeah, possibly.
>
> I think we could stand to have a lot more discussion about the syntax
> here.  So far the idea seems to be to copy what Oracle has, but it's
> not clear if we're going to have exactly what Oracle has or something
> subtly different.  I personally don't find the Oracle syntax very
> PostgreSQL-ish.  Stuff like "VALUES LESS THAN 500" doesn't sit
> especially well with me - less than according to which opclass?  Are
> we going to insist that partitioning must use the default btree
> opclass so that we can use that syntax?  That seems kind of lame.
>

Syntax like VALUES LESS THAN 500 also means we then have to go figure out that partition's lower bound based on the
upper bound of the previous one, to say nothing of holes in the range, if they matter. I expressed that concern
elsewhere in favour of having both a range's lower and upper bounds available.

> There are lots of interesting things we could do here, e.g.:
>
> CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);

So, no PARTITION BY [RANGE | LIST] clause huh?

What we are calling pg_partitioned_rel would obtain the following bits of information from such a definition of a
partitioned relation:

* column(s) to partition on and respective opclass(es)
* the level this partitioned relation lies at in the partitioning hierarchy (determining its relkind and storage qualification)

By the way, I am not sure how we define a partitioning key on a partition (in other words, a subpartitioning key on the
corresponding partitioned relation). Perhaps (only) via ALTER TABLE on a partition relation?

> CREATE TABLE child_name PARTITION OF parent_name
>    FOR { (value, ...) [ TO (value, ...) ] } [, ...];
>

So it's still a CREATE "TABLE", but the 'PARTITION OF' part turns this "table" into something having the
characteristics of a partition relation, getting all kinds of new treatment at various places. It appears there is a
redistribution of table characteristics between a partitioned relation and its partitions: we take away storage from
the former and instead give it to the latter. On the other hand, the latter's data is only accessible through the
former, perhaps with escape routes for direct access via some special syntax attached to various access commands. We
also stand to lose certain abilities with a partitioned relation, such as not being able to define a unique constraint
(other than what the partition key could potentially help ensure) or use it as the target of a foreign key constraint
(just reiterating).

What we call pg_partition_def obtains the following bits of information from such a definition of a partition relation:

* parent relation (the partitioned relation this is a partition of)
* partition kind (do we even want to keep carrying this around as a separate field in the catalog?)
* values this partition holds

The last part being the most important.

In case of what we would have called a 'LIST' partition, this could look like

... FOR VALUES (val1, val2, val3, ...)

Assuming we only support a partition key containing a single column in such a case.

In case of what we would have called a 'RANGE' partition, this could look like

... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)

How about BETWEEN ... AND ... ?

Here we allow a partition key to contain more than one column.

> So instead of making a hard distinction between range and list
> partitioning, you can say:
>
> CREATE TABLE child_name PARTITION OF parent_name FOR (3), (5), (7);
> CREATE TABLE child2_name PARTITION OF parent_name FOR (8) TO (12);
> CREATE TABLE child2_name PARTITION OF parent_name FOR (20) TO (30),
> (120) TO (130);
>

I would include the noise keyword VALUES just for readability if anything.

> Now that might be a crappy idea for various reasons, but the point is
> there are a lot of details to be hammered out with the syntax, and
> there are several ways we can go wrong.  If we choose an
> overly-limiting syntax, we're needlessly restricting what can be done.
> If we choose an overly-permissive syntax, we'll restrict the
> optimization opportunities.
>

I am not sure, but perhaps RANGE and LIST as partitioning kinds may as well just be noise keywords. We can parse those
values into a parse node such that we don't have to care about whether they describe a partition as being one kind or
the other. Say a List of something like,

typedef struct PartitionColumnValue
{
    NodeTag     type;
    Oid         partitionid;
    char       *partcolname;
    Node       *partrangelower;
    Node       *partrangeupper;
    List       *partlistvalues;
} PartitionColumnValue;

Or we could still add a (char) partkind just to say which of the fields matter.

We don't need any defining values here for hash partitions if and when we add support for the same. We would either be
using a system-wide common hash function, or we could add something with the partitioning key definition.
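The system-wide-hash-function variant needs no per-partition defining values at all, since the target partition is fully determined by the key and the partition count. A standalone sketch (illustrative only; crc32 merely stands in for whatever stable hash a real implementation would pick):

```python
# Sketch of hash partition routing with a system-wide hash function:
# partition = hash(key) mod nparts, so nothing partition-specific is
# stored in the catalog beyond the partition count.

import zlib

def hash_partition(key: str, nparts: int) -> int:
    # crc32 stands in for a "system-wide common hash function"
    return zlib.crc32(key.encode()) % nparts

keys = ["alice", "bob", "carol", "dave"]
for k in keys:
    print(k, "->", hash_partition(k, 4))
```

The routing is deterministic, which is exactly what makes it catalog-free; the trade-off is that changing the partition count remaps everything.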

Thanks,
Amit





Re: On partitioning

From
Amit Kapila
Date:
On Thu, Dec 11, 2014 at 8:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah either this way or what Josh has suggested upthread, the main
> > point was that if at all we want to support multi-column list partitioning
> > then we need to have slightly different syntax, however I feel that we
> > can leave multi-column list partitioning for first version.
>
> Yeah, possibly.
>
> I think we could stand to have a lot more discussion about the syntax
> here.  So far the idea seems to be to copy what Oracle has, but it's
> not clear if we're going to have exactly what Oracle has or something
> subtly different.  I personally don't find the Oracle syntax very
> PostgreSQL-ish.

I share your concern w.r.t. the difficulties it can create if we don't
do it carefully (you have mentioned one such issue upthread about
pg_dump; other such things could cause problems if not thought
of carefully from the beginning).  One more thing: on a quick check
it seems to me even DB2 uses something similar to Oracle for
defining partitions:

CREATE TABLE orders(id INT, shipdate DATE, ...)
PARTITION BY RANGE(shipdate)
( PARTITION q4_05 STARTING MINVALUE,
  PARTITION q1_06 STARTING '1/1/2006',
  PARTITION q2_06 STARTING '4/1/2006',
  PARTITION q3_06 STARTING '7/1/2006',
  PARTITION q4_06 STARTING '10/1/2006'
                  ENDING '12/31/2006' )

I don't think there is any pressing need for PostgreSQL to use
syntax similar to what some of the other databases use; however,
it has an advantage for ease of migration and ease of use (as
people are already familiar with such syntax).

> Stuff like "VALUES LESS THAN 500" doesn't sit
> especially well with me - less than according to which opclass?  Are
> we going to insist that partitioning must use the default btree
> opclass so that we can use that syntax?  That seems kind of lame.
>

Can't we simply specify the opclass along with the column name in the
partition clause, similar to what we already do in the CREATE INDEX
syntax?

CREATE TABLE sales
     ( invoice_no NUMBER,
       sale_year  INT NOT NULL,
       sale_month INT NOT NULL,
       sale_day   INT NOT NULL )
   PARTITION BY RANGE ( sale_year <opclass>)
     ( PARTITION sales_q1 VALUES LESS THAN (1999)

Wouldn't the default operator class for a partition column fit the
bill for this particular case, as the operators required by this syntax
will be quite simple?

> There are lots of interesting things we could do here, e.g.:
>
> CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);
> CREATE TABLE child_name PARTITION OF parent_name
>    FOR { (value, ...) [ TO (value, ...) ] } [, ...];
>

The only thing which slightly bothers me about this syntax is that
it makes apparent that partitions are separate tables, and it would
be inconvenient if we choose to disallow some operations on
partitions.  I think it might be better if we treat partitions as a way
to divide a large amount of data, where users are only given the
option to specify boundaries to divide this data, and the storage
mechanism of partitions is an internal detail (something like we do in
the TOAST table case).  I am not sure which syntax users will be more
comfortable using; I have been seeing and using Oracle-type syntax for
a long time, so my opinion could be biased here.  It would be really
helpful if others who need or use a partitioning scheme could share
their input.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: On partitioning

From
Robert Haas
Date:
On Thu, Dec 11, 2014 at 11:43 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> In case of what we would have called a 'LIST' partition, this could look like
>
> ... FOR VALUES (val1, val2, val3, ...)
>
> Assuming we only support partition key to contain only one column in such a case.
>
> In case of what we would have called a 'RANGE' partition, this could look like
>
> ... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)
>
> How about BETWEEN ... AND ... ?

Sure.  Mind you, I'm not proposing that the syntax I just mooted is
actually for the best.  What I'm saying is that we need to talk about
it.

> I am not sure but perhaps RANGE and LIST as partitioning kinds may as well just be noise keywords. We can parse those
> values into a parse node such that we don't have to care about whether they describe a partition as being one kind or
> the other. Say a List of something like,
>
> typedef struct PartitionColumnValue
> {
>     NodeTag    type,
>     Oid        *partitionid,
>     char       *partcolname,
>     Node       *partrangelower,
>     Node       *partrangeupper,
>     List       *partlistvalues
> };
>
> Or we could still add a (char) partkind just to say which of the fields matter.
>
> We don't need any defining values here for hash partitions if and when we add support for the same. We would either
> be using a system-wide common hash function or we could add something with partitioning key definition.

Yeah, range and list partition definitions are very similar, but hash
partition definitions are a different kettle of fish.  I don't think
we really need hash partitioning for anything right away - it's pretty
useless unless you've got, say, a way for the partitions to be foreign
tables living on remote servers - but we shouldn't pick a design that
will make it really hard to add later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Jim Nasby
Date:
On 12/12/14, 8:03 AM, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 11:43 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp>  wrote:
>> >In case of what we would have called a 'LIST' partition, this could look like
>> >
>> >... FOR VALUES (val1, val2, val3, ...)
>> >
>> >Assuming we only support partition key to contain only one column in such a case.
>> >
>> >In case of what we would have called a 'RANGE' partition, this could look like
>> >
>> >... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)
>> >
>> >How about BETWEEN ... AND ... ?
> Sure.  Mind you, I'm not proposing that the syntax I just mooted is
> actually for the best.  What I'm saying is that we need to talk about
> it.

Frankly, if we're going to require users to explicitly define each partition, then I think the most appropriate API
would be a function. Users will be writing code to create new partitions as needed, and it's generally easier to write
code that calls a function as opposed to glomming a text string together and passing that to EXECUTE.
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
Robert Haas
Date:
On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> Sure.  Mind you, I'm not proposing that the syntax I just mooted is
>> actually for the best.  What I'm saying is that we need to talk about
>> it.
>
> Frankly, if we're going to require users to explicitly define each partition
> then I think the most appropriate API would be a function. Users will be
> writing code to create new partitions as needed, and it's generally easier
> to write code that calls a function as opposed to glomming a text string
> together and passing that to EXECUTE.

I have very little idea what the API you're imagining would actually
look like from this description, but it sounds like a terrible idea.
We don't want to make this infinitely general.  We need a *fast* way
to go from a value (or list of values, one per partitioning column) to
a partition OID, and the way to get there is not to call arbitrary
user code.
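For range partitioning, that fast value-to-OID mapping is essentially a binary search over sorted bounds, which also shows why it must not involve user code. A hypothetical standalone sketch (names made up; not PostgreSQL internals):

```python
# Sketch of a fast value -> partition-OID lookup for range partitioning:
# keep partitions sorted by exclusive upper bound and binary-search,
# giving O(log n) routing with no user code in the path.

import bisect

class RangePartitionMap:
    def __init__(self, bounds, oids):
        # bounds[i] is the exclusive upper bound of partition oids[i];
        # bounds must be sorted ascending.
        self.bounds = bounds
        self.oids = oids

    def lookup(self, value):
        i = bisect.bisect_right(self.bounds, value)
        if i == len(self.bounds):
            return None                # value beyond the last bound
        return self.oids[i]

pmap = RangePartitionMap([10, 20, 30], [101, 102, 103])
print(pmap.lookup(5))    # -> 101
print(pmap.lookup(10))   # -> 102 (bounds are exclusive)
print(pmap.lookup(35))   # -> None
```

A list-partitioned key would use a hash lookup instead; either way the mapping is a pure data-structure operation the planner and executor can both rely on.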

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Claudio Freire
Date:
On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>> Sure.  Mind you, I'm not proposing that the syntax I just mooted is
>>> actually for the best.  What I'm saying is that we need to talk about
>>> it.
>>
>> Frankly, if we're going to require users to explicitly define each partition
>> then I think the most appropriate API would be a function. Users will be
>> writing code to create new partitions as needed, and it's generally easier
>> to write code that calls a function as opposed to glomming a text string
>> together and passing that to EXECUTE.
>
> I have very little idea what the API you're imagining would actually
> look like from this description, but it sounds like a terrible idea.
> We don't want to make this infinitely general.  We need a *fast* way
> to go from a value (or list of values, one per partitioning column) to
> a partition OID, and the way to get there is not to call arbitrary
> user code.

I think this was mentioned upthread, but I'll repeat it anyway since
it seems to need repeating.

More than fast, you want it analyzable (by the planner). Ie: it has to
be easy to prove partition exclusion against a where clause.



Re: On partitioning

From
Tom Lane
Date:
Claudio Freire <klaussfreire@gmail.com> writes:
> On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I have very little idea what the API you're imagining would actually
>> look like from this description, but it sounds like a terrible idea.
>> We don't want to make this infinitely general.  We need a *fast* way
>> to go from a value (or list of values, one per partitioning column) to
>> a partition OID, and the way to get there is not to call arbitrary
>> user code.

> I think this was mentioned upthread, but I'll repeat it anyway since
> it seems to need repeating.

> More than fast, you want it analyzable (by the planner). Ie: it has to
> be easy to prove partition exclusion against a where clause.

Actually, I'm not sure that's what we want.  I thought what we really
wanted here was to postpone partition-routing decisions to runtime,
so that the behavior would be efficient whether or not the decision
could be predetermined at plan time.

This still leads to the same point Robert is making: the routing
decisions have to be cheap and fast.  But it's wrong to think of it
in terms of planner proofs.
        regards, tom lane



Re: On partitioning

From
Claudio Freire
Date:
On Fri, Dec 12, 2014 at 7:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I have very little idea what the API you're imagining would actually
>>> look like from this description, but it sounds like a terrible idea.
>>> We don't want to make this infinitely general.  We need a *fast* way
>>> to go from a value (or list of values, one per partitioning column) to
>>> a partition OID, and the way to get there is not to call arbitrary
>>> user code.
>
>> I think this was mentioned upthread, but I'll repeat it anyway since
>> it seems to need repeating.
>
>> More than fast, you want it analyzable (by the planner). Ie: it has to
>> be easy to prove partition exclusion against a where clause.
>
> Actually, I'm not sure that's what we want.  I thought what we really
> wanted here was to postpone partition-routing decisions to runtime,
> so that the behavior would be efficient whether or not the decision
> could be predetermined at plan time.
>
> This still leads to the same point Robert is making: the routing
> decisions have to be cheap and fast.  But it's wrong to think of it
> in terms of planner proofs.

You'll need proofs whether at the planner or at the execution engine.

A sequential scan over a partition with a query like

select * from foo where date between X and Y

Would be ripe for that but at some point you need to prove that the
where clause excludes whole partitions. Be it at runtime (while
executing the sequential scan node) or planning time.



Re: On partitioning

From
Josh Berkus
Date:
On 12/12/2014 02:10 PM, Tom Lane wrote:
> Actually, I'm not sure that's what we want.  I thought what we really
> wanted here was to postpone partition-routing decisions to runtime,
> so that the behavior would be efficient whether or not the decision
> could be predetermined at plan time.
> 
> This still leads to the same point Robert is making: the routing
> decisions have to be cheap and fast.  But it's wrong to think of it
> in terms of planner proofs.

The other reason I'd really like to have the new partitioning taken out
of the planner: expressions.

Currently, if you have partitions with constraints on, say,
"event_date", the following WHERE clause will NOT use CE and will scan
all partitions:

WHERE event_date BETWEEN ( '2014-12-11' - interval '1 month' ) and
'2014-12-11'.

This is despite the fact that the expression above gets rewritten to a
constant by the time the query is executed; by then it's too late.  To
say nothing of functions like to_timestamp(), now(), etc.

As long as partitions need to be chosen at plan time, I don't see a good
way to fix the expression problem.
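Josh's point is that the bound expression only folds to a constant at execution time, so pruning must happen then too. A toy sketch of executor-time pruning (partition names and layout are hypothetical): the month partitions are kept or discarded by a cheap overlap test against bounds that were not known at plan time.

```python
from datetime import date, timedelta

# Hypothetical monthly partitions keyed by (year, month).
PARTITIONS = {(2014, m): "events_2014_%02d" % m for m in (10, 11, 12)}

def prune(lo, hi):
    """Executor-time pruning: keep partitions whose month overlaps [lo, hi]."""
    keep = []
    for (y, m), name in PARTITIONS.items():
        first = date(y, m, 1)
        # Last day of the month: jump past day 28, snap to the 1st, back one day.
        last = (date(y, m, 28) + timedelta(days=4)).replace(day=1) - timedelta(days=1)
        if last >= lo and first <= hi:
            keep.append(name)
    return keep

# The qual's bound expression folds to constants only at execution time:
hi = date(2014, 12, 11)
lo = hi - timedelta(days=30)   # stand-in for "- interval '1 month'"
```

With `lo` and `hi` computed at executor startup, `prune(lo, hi)` keeps only the November and December partitions; a plan-time decision would have had to scan all three.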

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
Claudio Freire
Date:
On Fri, Dec 12, 2014 at 7:40 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/12/2014 02:10 PM, Tom Lane wrote:
>> Actually, I'm not sure that's what we want.  I thought what we really
>> wanted here was to postpone partition-routing decisions to runtime,
>> so that the behavior would be efficient whether or not the decision
>> could be predetermined at plan time.
>>
>> This still leads to the same point Robert is making: the routing
>> decisions have to be cheap and fast.  But it's wrong to think of it
>> in terms of planner proofs.
>
> The other reason I'd really like to have the new partitioning taken out
> of the planner: expressions.
>
> Currently, if you have partitions with constraints on, say,
> "event_date", the following WHERE clause will NOT use CE and will scan
> all partitions:
>
> WHERE event_date BETWEEN ( '2014-12-11' - interval '1 month' ) and
> '2014-12-11'.
>
> This is despite the fact that the expression above gets rewritten to a
> constant by the time the query is executed; by then it's too late.  To
> say nothing of functions like to_timestamp(), now(), etc.
>
> As long as partitions need to be chosen at plan time, I don't see a good
> way to fix the expression problem.

Fair enough, but that's not the same as not requiring easy proofs. The
planner might not be the one doing the proofs, but you still need proofs.

Even if the proving method is hardcoded into the partitioning method,
as in the case of list or range partitioning, it's still a proof. With
arbitrary functions (which is what prompted me to mention proofs) you
can't do that. A function works very well for inserting, but not for
selecting.

I could be wrong though. Maybe there's a way to turn SQL functions
into analyzable things? But it would still be very easy to shoot
yourself in the foot by writing one that is too complex.



Re: On partitioning

From
Alvaro Herrera
Date:
Claudio Freire wrote:

> Fair enough, but that's not the same as not requiring easy proofs. The
> planner might not the one doing the proofs, but you still need proofs.
> 
> Even if the proving method is hardcoded into the partitioning method,
> as in the case of list or range partitioning, it's still a proof. With
> arbitrary functions (which is what prompted me to mention proofs) you
> can't do that. A function works very well for inserting, but not for
> selecting.
> 
> I could be wrong though. Maybe there's a way to turn SQL functions
> into analyzable things? But it would still be very easy to shoot
> yourself in the foot by writing one that is too complex.

Arbitrary SQL expressions (including functions) are not the thing to use
for partitioning -- at least that's how I understand this whole
discussion.  I don't think you want to do "proofs" as such -- they are
expensive.

To make this discussion a bit clearer, there are two things to
distinguish: one is routing tuples, when an INSERT or COPY command
references the partitioned table, into the individual partitions
(ingress); the other is deciding which partitions to read when a SELECT
query wants to read tuples from the partitioned table (egress).

On ingress, what you want is something like being able to do something
on the tuple that tells you which partition it belongs into.  Ideally
this is something much lighter than running an expression; if you can
just apply an operator to the partitioning column values, that should be
plenty fast.  This requires no proof.

On egress you need some direct way to compare the scan quals with the
partitioning values.  I would imagine this to be similar to how scan
quals are compared to the values stored in a BRIN index: each scan qual
has a corresponding operator strategy and a scan key, and you can say
"aye" or "nay" based on a small set of operations that can be run
cheaply, again without any proof or running arbitrary expressions.
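Alvaro's egress check can be sketched concretely. This is a rough illustration only (the operator-strategy set, summaries, and names are hypothetical, loosely modeled on BRIN's min-max consistency checks, not actual PostgreSQL code): each scan qual is a (strategy, key) pair, and a partition survives only if every qual could hold somewhere inside its [min, max] summary.

```python
from operator import lt, le, gt, ge, eq

# Hypothetical per-partition summary (min, max of the partitioning column),
# analogous to one BRIN tuple per partition.
PARTITIONS = {
    16401: (0, 99),
    16402: (100, 199),
    16403: (200, 299),
}

def may_match(strategy, key, mn, mx):
    """Cheap aye/nay: could 'col <strategy> key' hold for some value in [mn, mx]?"""
    if strategy is lt:
        return mn < key
    if strategy is le:
        return mn <= key
    if strategy is gt:
        return mx > key
    if strategy is ge:
        return mx >= key
    if strategy is eq:
        return mn <= key <= mx
    return True  # unknown operator: cannot exclude the partition

def surviving_partitions(quals):
    return [oid for oid, (mn, mx) in PARTITIONS.items()
            if all(may_match(s, k, mn, mx) for s, k in quals)]
```

For quals equivalent to `col >= 150 AND col <= 250`, only the second and third partitions survive; the first is excluded by a pair of comparisons, with no expression evaluation or theorem proving involved.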

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: On partitioning

From
Claudio Freire
Date:
On 12/12/2014 23:09, "Alvaro Herrera" <alvherre@2ndquadrant.com> wrote:
>
> Claudio Freire wrote:
>
> > Fair enough, but that's not the same as not requiring easy proofs. The
> > planner might not the one doing the proofs, but you still need proofs.
> >
> > Even if the proving method is hardcoded into the partitioning method,
> > as in the case of list or range partitioning, it's still a proof. With
> > arbitrary functions (which is what prompted me to mention proofs) you
> > can't do that. A function works very well for inserting, but not for
> > selecting.
> >
> > I could be wrong though. Maybe there's a way to turn SQL functions
> > into analyzable things? But it would still be very easy to shoot
> > yourself in the foot by writing one that is too complex.
>
> Arbitrary SQL expressions (including functions) are not the thing to use
> for partitioning -- at least that's how I understand this whole
> discussion.  I don't think you want to do "proofs" as such -- they are
> expensive.
>
> To make this discussion a bit clearer, there are two things to
> distinguish: one is routing tuples, when an INSERT or COPY command
> references the partitioned table, into the individual partitions
> (ingress); the other is deciding which partitions to read when a SELECT
> query wants to read tuples from the partitioned table (egress).
>
> On ingress, what you want is something like being able to do something
> on the tuple that tells you which partition it belongs into.  Ideally
> this is something much lighter than running an expression; if you can
> just apply an operator to the partitioning column values, that should be
> plenty fast.  This requires no proof.
>
> On egress you need some direct way to compare the scan quals with the
> partitioning values.  I would imagine this to be similar to how scan
> quals are compared to the values stored in a BRIN index: each scan qual
> has a corresponding operator strategy and a scan key, and you can say
> "aye" or "nay" based on a small set of operations that can be run
> cheaply, again without any proof or running arbitrary expressions.

Interesting that you mention BRIN. It does seem that it could be made to
work with BRIN's operator classes.

In fact, a partition-wide BRIN tuple could be stored per partition, and
that in itself could be the definition for the partition.

Either preinitialized or dynamically updated. It would work even for
arbitrary routing functions, especially if the operator class to use is
customizable.

I stand corrected.

Re: On partitioning

From
José Luis Tallón
Date:
On 12/12/2014 05:43 AM, Amit Langote wrote:
> [snip]
> In case of what we would have called a 'LIST' partition, this could look like
>
> ... FOR VALUES (val1, val2, val3, ...)
>
> Assuming we only support partition key to contain only one column in such a case.

Hmmm….

[...] PARTITION BY LIST (col1 [, col2, ...])
    just like we do for indexes, would do.

and

CREATE PARTITION child_name OF parent_name
    FOR [VALUES] (val1a, val2a), (val1b, val2b), (val1c, val2c)
    [IN tblspc_name]
    just like we do for multi-valued inserts.

> In case of what we would have called a 'RANGE' partition, this could look like
>
> ... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)
>
> How about BETWEEN ... AND ... ?

Unless I'm missing something obvious, we already have range types for 
this, don't we?

...   PARTITION BY RANGE (col)

CREATE PARTITION child_name OF parent_name
    FOR [VALUES] '[val1min,val1max)', '[val2min,val2max)', '[val3min,val3max)'
    [IN tblspc_name]

and I guess this should simplify a fully flexible implementation (if you
can construct a RangeType for it, you can use that for partitioning).
This would substitute the ugly (IMHO) "VALUES LESS THAN" syntax with a
more flexible one (even though it might end up being converted into
"less than" boundaries internally for implementation/optimization purposes).

In both cases we would need to allow for overflows / default partition 
different from the parent table.


Plus some ALTER PARTITION part_name TABLESPACE=tblspc_name


The main problem being that we are assuming named partitions here, which 
might not be that practical at all.

> [snip]
>> I would include the noise keyword VALUES just for readability if 
>> anything. 

+1


FWIW, deviating from already "standard" syntax (Oracle-like --as 
implemented by PPAS for example-- or DB2-like) is quite 
counter-productive unless we have very good reasons for it... which 
doesn't mean that we have to do it exactly like they do (especially if we
would like to go the incremental implementation route).

Amit: mind if I add the DB2 syntax for partitioning to the wiki, too?
    This might as well help with deciding the final form of 
partitioning (and define the first implementation boundaries, too)


Thanks,
    / J.L.





Re: On partitioning

From
José Luis Tallón
Date:
On 12/13/2014 03:09 AM, Alvaro Herrera wrote:
> [snip]
> Arbitrary SQL expressions (including functions) are not the thing to use
> for partitioning -- at least that's how I understand this whole
> discussion.  I don't think you want to do "proofs" as such -- they are
> expensive.

Yup. Plus, it looks like (from reading Oracle's documentation) they end 
up converting the LESS THAN clauses into range lists internally.
Anyone that can attest to this? (or just disprove it, if I'm wrong)

I just suggested using the existing RangeType infrastructure for this
(<<, >> and && operators, specifically, might do the trick) before
reading your mail citing BRIN.
    ... which might as well allow some interesting runtime
optimizations when range partitioning is used and *a huge* number of
partitions get defined --- I'm specifically thinking about massive OLTP
with very deep (say, 5 years' worth) archival partitioning where it
would be inconvenient to have the tuple routing information always in
memory.
I'm specifically suggesting some (range_value -> partitionOID) mapping
using a BRIN index for this --- it could be auto-created just like we do
for primary keys.

Just my 2c


Thanks,
    / J.L.




Re: On partitioning

From
Jim Nasby
Date:
On 12/12/14, 3:48 PM, Robert Haas wrote:
> On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>> Sure.  Mind you, I'm not proposing that the syntax I just mooted is
>>> actually for the best.  What I'm saying is that we need to talk about
>>> it.
>>
>> Frankly, if we're going to require users to explicitly define each partition
>> then I think the most appropriate API would be a function. Users will be
>> writing code to create new partitions as needed, and it's generally easier
>> to write code that calls a function as opposed to glomming a text string
>> together and passing that to EXECUTE.
>
> I have very little idea what the API you're imagining would actually
> look like from this description, but it sounds like a terrible idea.
> We don't want to make this infinitely general.  We need a *fast* way
> to go from a value (or list of values, one per partitioning column) to
> a partition OID, and the way to get there is not to call arbitrary
> user code.

You were talking about the syntax for partition creation/definition; that's the API I was referring to.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: On partitioning

From
David Fetter
Date:
On Fri, Dec 12, 2014 at 09:03:12AM -0500, Robert Haas wrote:

> Yeah, range and list partition definitions are very similar, but
> hash partition definitions are a different kettle of fish.  I don't
> think we really need hash partitioning for anything right away -
> it's pretty useless unless you've got, say, a way for the partitions
> to be foreign tables living on remote servers -

There's a patch enabling exactly this feature in the queue for 9.5.

https://commitfest.postgresql.org/action/patch_view?id=1386

> but we shouldn't pick a design that will make it really hard to add
> later.

Indeed not :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: On partitioning

From
José Luis Tallón
Date:
On 12/13/2014 05:57 PM, José Luis Tallón wrote:
> On 12/13/2014 03:09 AM, Alvaro Herrera wrote:
>> [snip]
>> Arbitrary SQL expressions (including functions) are not the thing to use
>> for partitioning -- at least that's how I understand this whole
>> discussion.  I don't think you want to do "proofs" as such -- they are
>> expensive.
>
> Yup. Plus, it looks like (from reading Oracle's documentation) they 
> end up converting the LESS THAN clauses into range lists internally.
> Anyone that can attest to this? (or just disprove it, if I'm wrong)
>
> I just suggested using the existing RangeType infrastructure for this 
> ( <<, >> and && operators, specifically, might do the trick) before 
> reading your mail citing BRIN.
>     ... which might as well allow some interesting runtime 
> optimizations when range partitioning is used and *a huge* number of 
> partitions get defined --- I'm specifically thinking about massive 
> OLTP with very deep (say, 5 years' worth) archival partitioning where 
> it would be inconvenient to have the tuple routing information always 
> in memory.
> I'm specifically suggesting some ( range_value -> partitionOID) 
> mapping using a BRIN index for this --- it could be auto-created just 
> like we do for primary keys.

Reviewing the existing documentation on this topic I have stumbled on an 
e-mail by Simon Riggs from almost seven years ago
http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site

.... where he suggested a way of physically partitioning tables by using 
segments in a way that sounds to be quite close to what we are proposing 
here.

ISTM that the partitioning meta-data might very well be augmented a bit 
in the direction Simon pointed to, adding support for "effectively 
read-only" and/or "explicitly marked read-only" PARTITIONS (not segments 
in this case) for an additional optimization. We would need some syntax 
additions (ALTER PARTITION <name> SET READONLY) in this case.
This feature can be added later on, of course.


I'd like to explicitly point out the potentially performance-enhancing
effect of fillfactor=100 (cfr.
http://www.postgresql.org/docs/9.3/static/sql-createtable.html) on
partitions marked "effectively read-only" (cfr. Simon's proposal) when
coupled with "fullscan analyze" vs. the regular sample-based analyze
that autovacuum performs.
When a partition consists of multiple *segments*, a generalization of 
the proposed BRIN index (to cover segments in addition to partitions) 
will further speed up scans.




Just for the record, allowing some partitions to be moved to foreign
tables (i.e. foreign servers, via postgres_fdw) will multiply the
usefulness of this "partitioned table wide" BRIN index .... now
becoming a real "global index".

> Just my 2c
>
>
> Thanks,
>
>     / J.L.
>
>
>




Re: On partitioning

From
Amit Langote
Date:
On Sun, Dec 14, 2014 at 1:57 AM, José Luis Tallón
<jltallon@adv-solutions.net> wrote:
> On 12/13/2014 03:09 AM, Alvaro Herrera wrote:
>>
>> [snip]
>> Arbitrary SQL expressions (including functions) are not the thing to use
>> for partitioning -- at least that's how I understand this whole
>> discussion.  I don't think you want to do "proofs" as such -- they are
>> expensive.
>
>
> Yup. Plus, it looks like (from reading Oracle's documentation) they end up
> converting the LESS THAN clauses into range lists internally.
> Anyone that can attest to this? (or just disprove it, if I'm wrong)
>
> I just suggested using the existing RangeType infrastructure for this ( <<,
>>> and && operators, specifically, might do the trick) before reading your
> mail citing BRIN.
>     ... which might as well allow some interesting runtime optimizations
> when range partitioning is used and *a huge* number of partitions get
> defined --- I'm specifically thinking about massive OLTP with very deep
> (say, 5 years' worth) archival partitioning where it would be inconvenient
> to have the tuple routing information always in memory.
> I'm specifically suggesting some ( range_value -> partitionOID) mapping
> using a BRIN index for this --- it could be auto-created just like we do for
> primary keys.
>
> Just my 2c

Since we are keen on being able to reuse existing infrastructure, I
think this and the RangeType/ArrayType stuff are worth thinking about,
though I am afraid we may lose a certain level of generality of
expression -- a loss we might very well be able to afford. That's
difficult to say definitively without studying it in a little more
detail, which I haven't quite done yet. We may be able to go somewhere
with it, perhaps. And of course the original designers of the
infrastructure in question would be better able to vouch for it, I
think.

Thanks,
Amit



Re: On partitioning

From
Amit Langote
Date:
On Sun, Dec 14, 2014 at 1:40 AM, José Luis Tallón
<jltallon@adv-solutions.net> wrote:
> On 12/12/2014 05:43 AM, Amit Langote wrote:
>
> Amit: mind if I add the DB2 syntax for partitioning to the wiki, too?
>
>     This might as well help with deciding the final form of partitioning
> (and define the first implementation boundaries, too)
>

Please go ahead.

Thanks,
Amit



Re: On partitioning

From
"Amit Langote"
Date:
Alvaro wrote:
> Claudio Freire wrote:
>
> > Fair enough, but that's not the same as not requiring easy proofs. The
> > planner might not the one doing the proofs, but you still need proofs.
> >
> > Even if the proving method is hardcoded into the partitioning method,
> > as in the case of list or range partitioning, it's still a proof. With
> > arbitrary functions (which is what prompted me to mention proofs) you
> > can't do that. A function works very well for inserting, but not for
> > selecting.
> >
> > I could be wrong though. Maybe there's a way to turn SQL functions
> > into analyzable things? But it would still be very easy to shoot
> > yourself in the foot by writing one that is too complex.
>
> Arbitrary SQL expressions (including functions) are not the thing to use
> for partitioning -- at least that's how I understand this whole
> discussion.  I don't think you want to do "proofs" as such -- they are
> expensive.
>

This means if a user puts arbitrary expressions in a partition definition, say,

... FOR VALUES  extract(month from current_date) TO extract(month from current_date + interval '3 months'),

we make sure that those expressions are pre-computed to literal values. The exact time when that happens is open for
discussion, I guess. It could be either DDL time or, if feasible, during relation cache building, when we compute the
value from the pg_node_tree of this expression, which we may choose to store in the partition definition catalog. The
former entails an obvious challenge of figuring out how we store the computed value into the catalog (pg_node_tree of
a Const?).

> To make this discussion a bit clearer, there are two things to
> distinguish: one is routing tuples, when an INSERT or COPY command
> references the partitioned table, into the individual partitions
> (ingress); the other is deciding which partitions to read when a SELECT
> query wants to read tuples from the partitioned table (egress).
>
> On ingress, what you want is something like being able to do something
> on the tuple that tells you which partition it belongs into.  Ideally
> this is something much lighter than running an expression; if you can
> just apply an operator to the partitioning column values, that should be
> plenty fast.  This requires no proof.
>

And I am thinking this is all executor stuff.

> On egress you need some direct way to compare the scan quals with the
> partitioning values.  I would imagine this to be similar to how scan
> quals are compared to the values stored in a BRIN index: each scan qual
> has a corresponding operator strategy and a scan key, and you can say
> "aye" or "nay" based on a small set of operations that can be run
> cheaply, again without any proof or running arbitrary expressions.
>

My knowledge of this is far from perfect, though to clear up any confusion --

As far as planning is concerned, I could not imagine how the index access method way of pruning partitions could be
made to work. Of course, I may be missing something.

When you say "scan qual has a corresponding operator strategy", I'd think that is part of a scan key in the executor, no?

Thanks,
Amit





Re: On partitioning

From
Claudio Freire
Date:
On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On egress you need some direct way to compare the scan quals with the
>> partitioning values.  I would imagine this to be similar to how scan
>> quals are compared to the values stored in a BRIN index: each scan qual
>> has a corresponding operator strategy and a scan key, and you can say
>> "aye" or "nay" based on a small set of operations that can be run
>> cheaply, again without any proof or running arbitrary expressions.
>>
>
> My knowledge of this is far from being perfect, though to clear any confusions -
>
> As far as planning is concerned, I could not imagine how index access
> method way of pruning partitions could be made to work. Of course, I may
> be missing something.

Let me be overly verbose, don't take it as patronizing, just answering
in lots of detail why this could be a good idea to try.

Normal indexes store a pointer for each key value of sorts. So B-Tree
gets you a set of tids for each key, and so does GIN and hash.

But BRIN is different in that it does the mapping differently. BRIN
stores a compact, approximate representation of the set of keys within
a page range. It can tell with some degree of (in)accuracy whether a
key or key range could be part of that page range or not. The way it
does this is abstracted out, but at its core, it stores a "compressed"
representation of the key set that takes a constant amount of bits to
store, and no more, no matter how many elements. What changes as the
element it represents grows, is its accuracy.

Currently, BRIN only supports min-max representations. It will store,
for each page range, the minimum and maximum of some columns, and when
you query it, you can compare range vs range, and discard whole page
ranges.

Now, what are partitions, if not page ranges?

A BRIN tuple is a min-max pair. But BRIN is more general: it could use
other data structures to hold that "compressed representation", if
someone implemented them. Like bloom filters [0].

A BRIN index is a complex data structure because it has to account for
physically growing tables, but all the complexities vanish when you
fix a "block range" (the BR in BRIN) to a partition. Then, a mere
array of BRIN tuples would suffice.

BRIN already contains the machinery to turn quals into something that
filters out entire partitions, if you provide the BRIN tuples.

And you could even effectively maintain a BRIN index for the partitions
(just a BRIN tuple per partition, dynamically updated with every
insertion).

If you do that, you start with empty partitions, and each insert
updates the BRIN tuple. Avoiding concurrency loss in this case would
be tricky, but in theory this could allow very general partition
exclusion. In fact it could even work with constraint exclusion right
now: you'd have a single-tuple BRIN index for each partition and
benefit from it.

But you don't need to pay the price of updating BRIN indexes, as
min-max tuples for each partition can be produced while creating the
partitions if the syntax already provides the information. Then, it's
just a matter of querying this meta-data which just happens to have
the form of a BRIN tuple for each partition.

[0] http://en.wikipedia.org/wiki/Bloom_filter
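The "one dynamically updated BRIN tuple per partition" idea above can be sketched in miniature (hypothetical names; concurrency deliberately ignored, which Claudio notes is the tricky part): each insert widens the partition's min-max summary, and a later range query can consult the summary to exclude the partition outright.

```python
class PartitionSummary:
    """One min-max summary per partition, in the spirit of a single
    dynamically maintained BRIN tuple."""

    def __init__(self):
        self.min = None
        self.max = None

    def absorb(self, value):
        # Widen the summary on every insertion into the partition.
        if self.min is None or value < self.min:
            self.min = value
        if self.max is None or value > self.max:
            self.max = value

    def may_contain(self, lo, hi):
        # Cheap exclusion test: could any stored value fall in [lo, hi]?
        # An empty partition is always excludable.
        if self.min is None:
            return False
        return self.max >= lo and self.min <= hi
```

As noted in the mail, if the partition bounds are declared in the syntax the summaries can simply be precomputed at partition creation, avoiding the update cost entirely.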



Re: On partitioning

From
"Amit Langote"
Date:
Claudio Freire wrote:
> On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> On egress you need some direct way to compare the scan quals with the
> >> partitioning values.  I would imagine this to be similar to how scan
> >> quals are compared to the values stored in a BRIN index: each scan qual
> >> has a corresponding operator strategy and a scan key, and you can say
> >> "aye" or "nay" based on a small set of operations that can be run
> >> cheaply, again without any proof or running arbitrary expressions.
> >>
> >
> > My knowledge of this is far from being perfect, though to clear any
> confusions -
> >
> > As far as planning is concerned, I could not imagine how index access
> method way of pruning partitions could be made to work. Of course, I may
> be missing something.
>
> Let me be overly verbose, don't take it as patronizing, just answering
> in lots of detail why this could be a good idea to try.
>

Thanks for explaining. It helps.

> Normal indexes store a pointer for each key value of sorts. So B-Tree
> gets you a set of tids for each key, and so does GIN and hash.
>
> But BRIN is different in that it does the mapping differently. BRIN
> stores a compact, approximate representation of the set of keys within
> a page range. It can tell with some degree of (in)accuracy whether a
> key or key range could be part of that page range or not. The way it
> does this is abstracted out, but at its core, it stores a "compressed"
> representation of the key set that takes a constant amount of bits to
> store, and no more, no matter how many elements. What changes as the
> element it represents grows, is its accuracy.
>
> Currently, BRIN only supports min-max representations. It will store,
> for each page range, the minimum and maximum of some columns, and
> when
> you query it, you can compare range vs range, and discard whole page
> ranges.
>
> Now, what are partitions, if not page ranges?

Yes, I can see a partition as a page range. The fixed summary info in BRIN's terms would be range bounds in case this
is a range partition, a list of values in case this is a list partition, and a hash value in case this is a hash
partition.

There is debate on the topic, but each of these partitions also happens to be a separate relation. IIUC, BRIN is an
access method for a relation (say, the top-level partitioned relation) that comes into play in the executor if that
access method survives as the preferred access method chosen by the planner. I cannot see a way to generalize it
further and make it support each block range as a separate relation and then use it for partition pruning in the
planner. This is assuming a partitioned relation is planned as an Append node which contains a list of plans for the
surviving partition relations based on, say, restrict quals.

I may be thinking of BRIN as a whole as not being generalized enough, but I may be wrong. Could you point it out if so?

> A BRIN tuple is a min-max pair. But BRIN is more general; it could use
> other data structures to hold that "compressed representation", if
> someone implemented them. Like bloom filters [0].
>
> A BRIN index is a complex data structure because it has to account for
> physically growing tables, but all the complexities vanish when you
> fix a "block range" (the BR in BRIN) to a partition. Then, a mere
> array of BRIN tuples would suffice.
>
> BRIN already contains the machinery to turn quals into something that
> filters out entire partitions, if you provide the BRIN tuples.
>

IIUC, that machinery comes into play when, say, a Bitmap Heap scan starts, right?

> And you could even effectively maintain a BRIN index for the partitions
> (just a BRIN tuple per partition, dynamically updated with every
> insertion).
>
> If you do that, you start with empty partitions, and each insert
> updates the BRIN tuple. Avoiding concurrency loss in this case would
> be tricky, but in theory this could allow very general partition
> exclusion. In fact it could even work with constraint exclusion right
> now: you'd have a single-tuple BRIN index for each partition and
> benefit from it.
>
> But you don't need to pay the price of updating BRIN indexes, as
> min-max tuples for each partition can be produced while creating the
> partitions if the syntax already provides the information. Then, it's
> just a matter of querying this meta-data which just happens to have
> the form of a BRIN tuple for each partition.
>

Thanks,
Amit





Re: On partitioning

From
José Luis Tallón
Date:
On 12/15/2014 07:42 AM, Claudio Freire wrote:
> [snip]

> If you do that, you start with empty partitions, and each insert 
> updates the BRIN tuple. Avoiding concurrency loss in this case would 
> be tricky, but in theory this could allow very general partition 
> exclusion. In fact it could even work with constraint exclusion right 
> now: you'd have a single-tuple BRIN index for each partition and 
> benefit from it. But you don't need to pay the price of updating BRIN 
> indexes, as min-max tuples for each partition can be produced while 
> creating the partitions if the syntax already provides the 
> information. Then, it's just a matter of querying this meta-data which 
> just happens to have the form of a BRIN tuple for each partition.

Yup. Indeed this is the way I outlined in my previous e-mail.

The only point being: Why bother with BRIN when we already have the 
range machinery, and it's trivial to add pointers to partitions from 
each range?

I suggested that BRIN would solve a situation when the amount of 
partitions is huge (say, thousands) and we might need to be able to 
efficiently locate the appropriate partition. In this situation, a 
linear search might become prohibitive, or the data structure (a simple 
B-Tree, maybe) become too big to be worth keeping in memory. This is 
where being able to store the "partition index" on disk would be 
interesting.

Moreover, I guess that ---by using this approach 
(B-Tree[range]->partition_id and/or BRIN)--- we could efficiently answer 
the question "do we have any tuple with this key in some partition?" 
which AFAICS is pretty close to us having "global indexes".
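The B-Tree[range]->partition_id idea above can be sketched roughly like this (illustrative Python with invented structures, not actual PostgreSQL code): one binary search over the lower bounds picks the candidate partition, and a single probe of that partition's local index answers the membership question.

```python
import bisect

# Hypothetical partition map: sorted lower bounds of non-overlapping
# range partitions; partition i covers keys from LOWER_BOUNDS[i] up to
# the next bound (the last partition is open-ended in this sketch).
LOWER_BOUNDS = [0, 1000, 2000]
# Per-partition key sets stand in for each partition's local index.
LOCAL_KEYS = [{5, 42}, {1000, 1500}, {2999}]

def key_exists(key):
    """Answer 'is this key in some partition?' with one range lookup
    plus one local-index probe, instead of probing every partition."""
    i = bisect.bisect_right(LOWER_BOUNDS, key) - 1
    return i >= 0 and key in LOCAL_KEYS[i]

print(key_exists(1500))  # -> True
print(key_exists(7))     # -> False
```

This is the sense in which the range map gets "pretty close" to a global index: it narrows the question to exactly one partition's local index.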



Regards,
    / J.L.




Re: On partitioning

From
Claudio Freire
Date:
On Mon, Dec 15, 2014 at 8:09 AM, José Luis Tallón
<jltallon@adv-solutions.net> wrote:
> On 12/15/2014 07:42 AM, Claudio Freire wrote:
>>
>> [snip]
>
>
>> If you do that, you start with empty partitions, and each insert updates
>> the BRIN tuple. Avoiding concurrency loss in this case would be tricky, but
>> in theory this could allow very general partition exclusion. In fact it
>> could even work with constraint exclusion right now: you'd have a
>> single-tuple BRIN index for each partition and benefit from it. But you
>> don't need to pay the price of updating BRIN indexes, as min-max tuples for
>> each partition can be produced while creating the partitions if the syntax
>> already provides the information. Then, it's just a matter of querying this
>> meta-data which just happens to have the form of a BRIN tuple for each
>> partition.
>
>
> Yup. Indeed this is the way I outlined in my previous e-mail.
>
> The only point being: Why bother with BRIN when we already have the range
> machinery, and it's trivial to add pointers to partitions from each range?

The part of BRIN I find useful is not its on-disk structure, but all
the execution machinery that checks quals against BRIN tuples. It's
not a trivial part of code, and is especially useful since it's
generalizable. New BRIN operator classes can be created and that's an
interesting power to have in partitioning as well.

Casting ranges into min-max BRIN tuples seems quite doable, so both
range and list notation should work fine. But BRIN also works for
the generic "routing expression" some people seem to really want, and
dynamically updated BRIN meta-indexes seem to be the only efficient
solution for that.

BRIN lacks some features, as you noted, so it does need some love
before it's usable for this. But they're features BRIN itself would
find useful, so you kill two birds with one stone.

> I suggested that BRIN would solve a situation when the amount of partitions
> is huge (say, thousands) and we might need to be able to efficiently locate
> the appropriate partition. In this situation, a linear search might become
> prohibitive, or the data structure (a simple B-Tree, maybe) become too big
> to be worth keeping in memory. This is where being able to store the
> "partition index" on disk would be interesting.

BRIN also does a linear search, so it doesn't solve that. BRIN's only
power is that it can answer very fast whether some quals rule out a
partition.
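
A minimal sketch of that "rule out a partition" check, assuming one min-max pair per partition and a simple range qual (all names here are invented, not BRIN internals):

```python
# One min-max "BRIN tuple" per partition; a linear pass that keeps
# only the partitions whose summary overlaps the qual's range.
PARTITION_MINMAX = {
    "child1": (1, 999),
    "child2": (1000, 1999),
    "child3": (2000, 2999),
}

def surviving_partitions(qual_lo, qual_hi):
    """Partitions whose [min, max] summary overlaps [qual_lo, qual_hi]."""
    return [name for name, (pmin, pmax) in PARTITION_MINMAX.items()
            if pmax >= qual_lo and pmin <= qual_hi]

print(surviving_partitions(500, 1200))  # -> ['child1', 'child2']
```

Linear, yes, but each per-partition check is a couple of comparisons, which is the cheapness being described.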



Re: On partitioning

From
Robert Haas
Date:
On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> This means if a user puts arbitrary expressions in a partition definition, say,
>
> ... FOR VALUES  extract(month from current_date) TO extract(month from current_date + interval '3 months'),
>
> we make sure that those expressions are pre-computed to literal values.

I would expect that to fail, just as it would fail if you tried to
build an index using a volatile expression.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
"Amit Langote"
Date:
Robert wrote:
> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > This means if a user puts arbitrary expressions in a partition definition, say,
> >
> > ... FOR VALUES  extract(month from current_date) TO extract(month from
> current_date + interval '3 months'),
> >
> > we make sure that those expressions are pre-computed to literal values.
>
> I would expect that to fail, just as it would fail if you tried to
> build an index using a volatile expression.

Oops, wrong example, sorry. In case of an otherwise good expression?

Thanks,
Amit





Re: On partitioning

From
Robert Haas
Date:
On Mon, Dec 15, 2014 at 6:55 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Robert wrote:
>> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> > This means if a user puts arbitrary expressions in a partition definition, say,
>> >
>> > ... FOR VALUES  extract(month from current_date) TO extract(month from
>> current_date + interval '3 months'),
>> >
>> > we make sure that those expressions are pre-computed to literal values.
>>
>> I would expect that to fail, just as it would fail if you tried to
>> build an index using a volatile expression.
>
> Oops, wrong example, sorry. In case of an otherwise good expression?

I'm not really sure what you are getting at here.  An "otherwise-good
expression" basically means a constant.  Index expressions have to be
things that always produce the same result given the same input,
because otherwise you might get a different result when searching the
index than you did when building it, and then you would fail to find
keys that are actually present.  In the same way, partition boundaries
also need to be constants.  Maybe you could allow expressions that can
be constant-folded, but that's about it.  If you allow anything else,
then the partition boundary might "move" once it's been established
and then some of the data will be in the wrong partition.

What possible use case is there for defining partitions with
non-constant boundaries?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Claudio Freire
Date:
On Tue, Dec 16, 2014 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Robert wrote:
>>> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> > This means if a user puts arbitrary expressions in a partition definition, say,
>>> >
>>> > ... FOR VALUES  extract(month from current_date) TO extract(month from
>>> current_date + interval '3 months'),
>>> >
>>> > we make sure that those expressions are pre-computed to literal values.
>>>
>>> I would expect that to fail, just as it would fail if you tried to
>>> build an index using a volatile expression.
>>
>> Oops, wrong example, sorry. In case of an otherwise good expression?
>
> I'm not really sure what you are getting at here.  An "otherwise-good
> expression" basically means a constant.  Index expressions have to be
> things that always produce the same result given the same input,
> because otherwise you might get a different result when searching the
> index than you did when building it, and then you would fail to find
> keys that are actually present.

I think the point is partitioning based on the result of an expression
over row columns. Or if it's not, it should be made anyway:

PARTITION BY LIST (extract(month from date_created)) VALUES (1, 3, 6, 9, 12);

Or something like that.



Re: On partitioning

From
Josh Berkus
Date:
On 12/15/2014 10:55 AM, Robert Haas wrote:
>> This means if a user puts arbitrary expressions in a partition definition, say,
>> >
>> > ... FOR VALUES  extract(month from current_date) TO extract(month from current_date + interval '3 months'),
>> >
>> > we make sure that those expressions are pre-computed to literal values.
> I would expect that to fail, just as it would fail if you tried to
> build an index using a volatile expression.

Yes, I wasn't saying that expressions should be used when *creating* the
partitions, which strikes me as a bad idea for several reasons.
Expressions should be usable when SELECTing data from the partitions.
Right now, they aren't, because the planner picks partitions well before
the rewrite phase which would reduce "extract (month from current_date)"
to a constant.

Right now, if you partition by an integer ID even, and do:

SELECT * FROM partitioned_table WHERE ID = ( 3 + 4 )

... postgres will scan all partitions because ( 3 + 4 ) is an expression
and isn't evaluated until after CE is done.

I don't think there's an easy way to do the expression rewrite while
we're still in planning, is there?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
Amit Langote
Date:
On 17-12-2014 AM 12:15, Robert Haas wrote:
> On Mon, Dec 15, 2014 at 6:55 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Robert wrote:
>>> I would expect that to fail, just as it would fail if you tried to
>>> build an index using a volatile expression.
>>
>> Oops, wrong example, sorry. In case of an otherwise good expression?
> 
> I'm not really sure what you are getting at here.  An "otherwise-good
> expression" basically means a constant.  Index expressions have to be
> things that always produce the same result given the same input,
> because otherwise you might get a different result when searching the
> index than you did when building it, and then you would fail to find
> keys that are actually present.  In the same way, partition boundaries
> also need to be constants.  Maybe you could allow expressions that can
> be constant-folded, but that's about it.  

Yeah, this is what I meant: expressions that can be constant-folded.
Sorry, the example I chose was pretty lame. I was just thinking about
the kind of stuff for which something like pg_node_tree would be a good
choice as the on-disk representation of partition values. Though it
definitely wouldn't be for storing arbitrary expressions that evaluate
to different values at different times.

Thanks,
Amit




Re: On partitioning

From
Amit Langote
Date:
On 17-12-2014 AM 12:28, Claudio Freire wrote:
> On Tue, Dec 16, 2014 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not really sure what you are getting at here.  An "otherwise-good
>> expression" basically means a constant.  Index expressions have to be
>> things that always produce the same result given the same input,
>> because otherwise you might get a different result when searching the
>> index than you did when building it, and then you would fail to find
>> keys that are actually present.
> 
> I think the point is partitioning based on the result of an expression
> over row columns. 

Actually, in this case, I was thinking about a partition definition,
not a partition key definition. That is, using an expression as a
partition value, which has the problems that I see.

> Or if it's not, it should be made anyway:
> 
> PARTITION BY LIST (extract(month from date_created)) VALUES (1, 3, 6, 9, 12);
> 
> Or something like that.
> 

Such a thing seems very desirable, though there are some tradeoffs
compared to having the partitioning key be just attnums. Or at least we
could start with that.

An arbitrary expression as partitioning key means that we have to
recompute such an expression for each input row. Think how inefficient
that may be when bulk-loading into a partitioned table during, say, a
COPY. Though there may be ways to fix that.

Thanks,
Amit




Re: On partitioning

From
Robert Haas
Date:
On Tue, Dec 16, 2014 at 1:45 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Yes, I wasn't saying that expressions should be used when *creating* the
> partitions, which strikes me as a bad idea for several reasons.
> Expressions should be usable when SELECTing data from the partitions.
> Right now, they aren't, because the planner picks partitions well before
> the rewrite phase which would reduce "extract (month from current_date)"
> to a constant.
>
> Right now, if you partition by an integer ID even, and do:
>
> SELECT * FROM partitioned_table WHERE ID = ( 3 + 4 )
>
> ... postgres will scan all partitions because ( 3 + 4 ) is an expression
> and isn't evaluated until after CE is done.

Well, actually, that case works fine:

rhaas=# create table partitioned_table (id integer, data text);
CREATE TABLE
rhaas=# create table child1 (check (id < 1000)) inherits (partitioned_table);
CREATE TABLE
rhaas=# create table child2 (check (id >= 1000)) inherits (partitioned_table);
CREATE TABLE
rhaas=# explain select * from partitioned_table where id = (3 + 4);
                              QUERY PLAN
------------------------------------------------------------------------
 Append  (cost=0.00..25.38 rows=7 width=36)
   ->  Seq Scan on partitioned_table  (cost=0.00..0.00 rows=1 width=36)
         Filter: (id = 7)
   ->  Seq Scan on child1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (id = 7)
(5 rows)

The reason is that 3 + 4 gets constant-folded pretty early on in the process.

But in a more complicated case where the value there isn't known until
runtime, yeah, it scans everything.  I'm not sure what the best way to
fix that is.  If the partition bounds were stored in a structured way,
as we've been discussing, then the Append or Merge Append node could,
when initialized, check which partition the id = X qual routes to and
ignore the rest.  But that's more iffy with the current
representation, I think.
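
With structured bounds, the Append-time routing described above might look roughly like this (a Python sketch with invented structures, not actual executor code): at node initialization, a binary search over the bounds keeps only the child the id = X qual routes to.

```python
import bisect

# Hypothetical structured bounds for an Append node's children:
# child i covers ids in [BOUNDS[i], BOUNDS[i + 1]).
BOUNDS = [0, 1000, 2000]
CHILD_PLANS = ["scan child1", "scan child2"]

def init_append(runtime_id):
    """At executor startup, keep only the child that the runtime
    id = X value routes to, ignoring the rest."""
    i = bisect.bisect_right(BOUNDS, runtime_id) - 1
    if 0 <= i < len(CHILD_PLANS) and runtime_id < BOUNDS[i + 1]:
        return [CHILD_PLANS[i]]
    return []  # the qual excludes every partition

print(init_append(1500))  # -> ['scan child2']
```

The point of the structured representation is exactly that this check is a binary search, not a re-run of constraint exclusion over every child.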

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Josh Berkus
Date:
On 12/16/2014 05:52 PM, Robert Haas wrote:
> But in a more complicated case where the value there isn't known until
> runtime, yeah, it scans everything.  I'm not sure what the best way to
> fix that is.  If the partition bounds were stored in a structured way,
> as we've been discussing, then the Append or Merge Append node could,
> when initialized, check which partition the id = X qual routes to and
> ignore the rest.  But that's more iffy with the current
> representation, I think.

Huh.  I was just testing:

WHERE event_time BETWEEN timestamptz '2014-12-01' and ( timestamptz
'2014-12-01' + interval '1 month')

In that case, the expression above got folded to constants by the time
Postgres did the index scans, but it scanned all partitions.  So somehow
(timestamptz + interval) doesn't get constant-folded until after
planning, at least not on 9.3.

And of course this leaves out common patterns like "now() - interval '30
days'" or "to_timestamp('20141201','YYYYMMDD')"

Anyway, what I'm saying is that I personally regard the inability to
handle even moderately complex expressions a major failing of our
existing partitioning scheme (possibly its worst single failing), and I
would regard any new partitioning feature which didn't address that
issue as suspect.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
Robert Haas
Date:
On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/16/2014 05:52 PM, Robert Haas wrote:
>> But in a more complicated case where the value there isn't known until
>> runtime, yeah, it scans everything.  I'm not sure what the best way to
>> fix that is.  If the partition bounds were stored in a structured way,
>> as we've been discussing, then the Append or Merge Append node could,
>> when initialized, check which partition the id = X qual routes to and
>> ignore the rest.  But that's more iffy with the current
>> representation, I think.
>
> Huh.  I was just testing:
>
> WHERE event_time BETWEEN timestamptz '2014-12-01' and ( timestamptz
> '2014-12-01' + interval '1 month')
>
> In that case, the expression above got folded to constants by the time
> Postgres did the index scans, but it scanned all partitions.  So somehow
> (timestamptz + interval) doesn't get constant-folded until after
> planning, at least not on 9.3.
>
> And of course this leaves out common patterns like "now() - interval '30
> days'" or "to_timestamp('20141201','YYYYMMDD')"
>
> Anyway, what I'm saying is that I personally regard the inability to
> handle even moderately complex expressions a major failing of our
> existing partitioning scheme (possibly its worst single failing), and I
> would regard any new partitioning feature which didn't address that
> issue as suspect.

I understand, but I think you need to be careful not to stonewall all
progress in the name of getting what you want.  Getting the
partitioning metadata into the system catalogs in a suitable format
will be a huge step forward regardless of whether it solves this
particular problem right away or not, because it will make it possible
to solve this problem in a highly-efficient way, which is quite hard
to do right now.

For example, we could (right now) write code that would do run-time
partition pruning by taking the final filter clause, with all values
substituted in, and re-checking for partitions that can be pruned via
constraint exclusion.  But that would be expensive and would often
fail to find anything useful.  Even in the best case where it works
out it's O(n) in the number of partitions, and will therefore perform
badly for large numbers of partitions (even, say, 1000).  But once the
partitioning metadata is stored in the catalog, we can implement this
as a binary search -- O(lg n) time -- and the constant factor should
be lower -- and it will be pretty easy to skip it in cases where it's
useless so that we don't waste cycles spinning our wheels.  Whether
the initial patch covers all the cases you care about or not, and it
probably won't, it will be a really big step towards making it
POSSIBLE to handle those cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Josh Berkus
Date:
On 12/16/2014 07:35 PM, Robert Haas wrote:
> On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Anyway, what I'm saying is that I personally regard the inability to
>> handle even moderately complex expressions a major failing of our
>> existing partitioning scheme (possibly its worst single failing), and I
>> would regard any new partitioning feature which didn't address that
>> issue as suspect.
> 
> I understand, but I think you need to be careful not to stonewall all
> progress in the name of getting what you want.  Getting the
> partitioning metadata into the system catalogs in a suitable format
> will be a huge step forward regardless of whether it solves this
> particular problem right away or not, because it will make it possible
> to solve this problem in a highly-efficient way, which is quite hard
> to do right now.

Sure.  But there's a big difference between "we're going to take these
steps and that problem will be fixable eventually" and "we're going to
retain features of the current partitioning system which make that
problem impossible to fix."  The drift of discussion on this thread
*sounded* like the latter, and I've been calling attention to the issue
in an effort to make sure that it's not.

Last week, I wrote a longish email listing out the common problems users
have with our current partitioning as a way of benchmarking the plan for
new partitioning.  While some people responded to that post, absolutely
nobody discussed the list of issues I gave.  Is that because there's
universal agreement that I got the major issues right?  Seems doubtful.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
Heikki Linnakangas
Date:
On 12/17/2014 08:53 PM, Josh Berkus wrote:
> Last week, I wrote a longish email listing out the common problems users
> have with our current partitioning as a way of benchmarking the plan for
> new partitioning.  While some people responded to that post, absolutely
> nobody discussed the list of issues I gave.  Is that because there's
> universal agreement that I got the major issues right?  Seems doubtful.

That was a good list.

- Heikki




Re: On partitioning

From
Josh Berkus
Date:
On 12/17/2014 11:19 AM, Heikki Linnakangas wrote:
> On 12/17/2014 08:53 PM, Josh Berkus wrote:
>> Last week, I wrote a longish email listing out the common problems users
>> have with our current partitioning as a way of benchmarking the plan for
>> new partitioning.  While some people responded to that post, absolutely
>> nobody discussed the list of issues I gave.  Is that because there's
>> universal agreement that I got the major issues right?  Seems doubtful.
> 
> That was a good list.

;-)

Ok, that made my morning.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: On partitioning

From
Robert Haas
Date:
On Wed, Dec 17, 2014 at 1:53 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/16/2014 07:35 PM, Robert Haas wrote:
>> On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> Anyway, what I'm saying is that I personally regard the inability to
>>> handle even moderately complex expressions a major failing of our
>>> existing partitioning scheme (possibly its worst single failing), and I
>>> would regard any new partitioning feature which didn't address that
>>> issue as suspect.
>>
>> I understand, but I think you need to be careful not to stonewall all
>> progress in the name of getting what you want.  Getting the
>> partitioning metadata into the system catalogs in a suitable format
>> will be a huge step forward regardless of whether it solves this
>> particular problem right away or not, because it will make it possible
>> to solve this problem in a highly-efficient way, which is quite hard
>> to do right now.
>
> Sure.  But there's a big difference between "we're going to take these
> steps and that problem will be fixable eventually" and "we're going to
> retain features of the current partitioning system which make that
> problem impossible to fix."  The drift of discussion on this thread
> *sounded* like the latter, and I've been calling attention to the issue
> in an effort to make sure that it's not.
>
> Last week, I wrote a longish email listing out the common problems users
> have with our current partitioning as a way of benchmarking the plan for
> new partitioning.  While some people responded to that post, absolutely
> nobody discussed the list of issues I gave.  Is that because there's
> universal agreement that I got the major issues right?  Seems doubtful.

I agreed with many of the things you listed but not all of them.
However, I don't think it's realistic to burden whatever patch Amit
writes with the duty of, for example, making global indexes work.
That's a huge problem all of its own.  Now, conceivably, we could try
to solve that as part of the next patch by insisting that the
"partitions" have to really be block number ranges within a single
relfilenode rather than separate relfilenodes as they are today ...
but I think that's a bad design which we would likely regret bitterly.
I also think that it would likely make what's being talked about here
so complicated that it will never go anywhere.  I think it's better
that we focus on solving one problem really well - storing metadata
for partition boundaries in the catalog so that we can do efficient
tuple routing and partition pruning - and leave the other problems for
later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: On partitioning

From
Amit Langote
Date:
On 18-12-2014 AM 04:52, Robert Haas wrote:
> On Wed, Dec 17, 2014 at 1:53 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>
>> Sure.  But there's a big difference between "we're going to take these
>> steps and that problem will be fixable eventually" and "we're going to
>> retain features of the current partitioning system which make that
>> problem impossible to fix."  The drift of discussion on this thread
>> *sounded* like the latter, and I've been calling attention to the issue
>> in an effort to make sure that it's not.
>>
>> Last week, I wrote a longish email listing out the common problems users
>> have with our current partitioning as a way of benchmarking the plan for
>> new partitioning.  While some people responded to that post, absolutely
>> nobody discussed the list of issues I gave.  Is that because there's
>> universal agreement that I got the major issues right?  Seems doubtful.
> 
> I agreed with many of the things you listed but not all of them.
> However, I don't think it's realistic to burden whatever patch Amit
> writes with the duty of, for example, making global indexes work.
> That's a huge problem all of its own.  Now, conceivably, we could try
> to solve that as part of the next patch by insisting that the
> "partitions" have to really be block number ranges within a single
> relfilenode rather than separate relfilenodes as they are today ...
> but I think that's a bad design which we would likely regret bitterly.
> I also think that it would likely make what's being talked about here
> so complicated that it will never go anywhere.  I think it's better
> that we focus on solving one problem really well - storing metadata
> for partition boundaries in the catalog so that we can do efficient
> tuple routing and partition pruning - and leave the other problems for
> later.
> 

Yes, I think partitioning as a whole is a BIG enough project that we
need to tackle it as a series of steps each of which is a discussion of
its own. The first step might as well be discussing how we represent a
partitioned table. We have a number of design decisions to make during
this step itself and we would definitely want to reach a consensus on
these points.

Things like where we indicate that a table is partitioned (pg_class), what
the partition key looks like, where it is stored, what a partition
definition looks like, where it is stored, how we represent an arbitrary
number of levels in a partitioning hierarchy, how we ensure that only
leaf-level relations in a hierarchy have storage, what the implications
of all these choices are, etc. Some of these points are being discussed.

I agree that while we are discussing these points, we could also be
discussing how we solve the problems of the existing partitioning
implementation using whatever the above things end up being. Proposed approaches to
solve those problems might be useful to drive the first step as well or
perhaps that's how it should be done anyway.

Thanks,
Amit




Partitioning: issues/ideas (Was: Re: On partitioning)

From
Amit Langote
Date:
On 06-01-2015 PM 03:40, Amit Langote wrote:
> 
> I agree that while we are discussing these points, we could also be
> discussing how we solve problems of existing partitioning implementation
> using whatever the above things end up being. Proposed approaches to
> solve those problems might be useful to drive the first step as well or
> perhaps that's how it should be done anyway.
> 

I realize the discussion has not quite brought us to *conclusions* so
far, though surely there has been valuable input from people. Anyway,
starting a new thread with a summary of what has been discussed (please
note that the order in which the points are listed does not necessarily
connote their priority):

* It has been repeatedly pointed out that we may want to decouple
partitioning from inheritance because implementing partitioning as an
extension of inheritance mechanism means that we have to keep all the
existing semantics which might limit what we want to do with the special
case of it which is partitioning; in other words, we would find
ourselves in difficult position where we have to inject a special case
code into a very generalized mechanism that is inheritance.
Specifically, do we regard partitions as pg_inherits children of their
partitioning parent?

* Syntax: do we want to make it similar to one of the many other
databases out there? Or we could invent our own? I like the syntax that
Robert suggested that covers the cases of RANGE and LIST partitioning
without actually having to use those keywords explicitly; something like
the following:

CREATE TABLE parent PARTITION ON (column [ USING opclass ] [, ... ]);

CREATE TABLE child PARTITION OF parent_name  FOR VALUES { (value, ...) [ TO (value, ...) ] }

So instead of making a hard distinction between range and list
partitioning, you can say:

CREATE TABLE child_name PARTITION OF parent_name FOR VALUES (3, 5, 7);

wherein, child is effectively a LIST partition

CREATE TABLE child PARTITION OF parent_name FOR VALUES (8) TO (12);

wherein, child is effectively a RANGE partition on one column

CREATE TABLE child PARTITION OF parent_name FOR VALUES (20, 120) TO (30,
130);

wherein, child is effectively a RANGE partition on two columns

I wonder if we could add a clause like DISTRIBUTED BY to complement
PARTITION ON, representing a hash distributed/partitioned table (that
could perhaps become syntax to support sharded tables; we would
definitely want to move ahead in that direction eventually).

* Catalog: We would like to have a catalog structure suitable to
implement capabilities like multi-column partitioning, sub-partitioning
(with arbitrary number of levels in the hierarchy). I had suggested
that we create two new catalogs viz. pg_partitioned_rel,
pg_partition_def to store metadata about a partition key of a
partitioned relation and partition bound info of a partition,
respectively. Also, see the point below about the on-disk
representation of partition bounds.

* It is desirable to treat partitions as pg_class relations, with
perhaps a new relkind (or relkinds). We may want to choose an
implementation where only the lowest-level relations in a partitioning
hierarchy have storage; those at the upper levels are mere placeholder
relations, though of course with associated constraints determined by
the partitioning criteria (with appropriate metadata entered into the
additional catalogs). I am not quite sure whether each kind of relation
involved in the partitioning scheme has a separate namespace and, if
so, how we would implement that

* In the initial implementation, we could just live with partitioning on
a set of columns (and not arbitrary expressions of them)

* We perhaps do not need multi-column LIST partitions as they are not
very widely used and may complicate the implementation

* There are a number of suggestions about how we represent partition
bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
associated with the relation itself), etc. An important point to
consider here is that the partition key may contain more than one
column

* How do we represent the partition definitions in memory (for a given
partitioned relation)? An important point to remember is that such a
representation should be efficient to iterate through or
binary-search. Also see the points about tuple-routing and
partition-pruning

* Overflow/catchall partition: it seems we do not want/need them. It
might seem desirable, for example, in cases where a big transaction
enters a large number of tuples, all but one of which find a defined
partition; we may not want to error out in such a case, but instead
enter the erring tuple into the overflow partition. If we choose to
implement that, we would also want to implement the capability to move
such tuples into the appropriate partition once it is defined. Related
is the notion of automatically creating a partition if one is not
already defined for a just-entered tuple; but there may be locking
troubles if many concurrent sessions try to do that

* Tuple-routing: based on the internal representation of partition
bounds for the partitions of a given partitioned table, there should be
a way to map a just-entered tuple to the partition it belongs to. The
BRIN-like machinery mentioned below could be made to work

* Partition-pruning: again, based on the internal representation of
partition bounds for the partitions of a given partitioned table, there
should be a way to prune partitions deemed unnecessary per the scan
quals. One notable suggestion is to consider BRIN(-like) machinery. For
example, it is able to tell from the scan quals whether a particular
block range of a given heap needs to be scanned or not, based on the
summary info index tuple for that block range. However, the interface
is currently suited to covering a single heap with blocks in the range
0 to N-1 of that heap. What we are looking for here is a hypothetical
PartitionMemTuple (or PartitionBound) that is a summary of a whole
relation (in this case, the partition), NOT a block range. But I guess
the infrastructure is generalized enough that we could make that work.
Related then would be an equivalent of ScanKey for the partitioning
case. Just as ScanKeyData has a correspondence with the index being
used, the hypothetical PartitionScanKeyData (which may be an entirely
bad/half-baked idea!) would represent the application of a comparison
operator between a table column (partitioning key column) and a
constant (as per the quals).


Please help bridge the gaps in my understanding of these points. I hope
we can put the discussion on a concrete footing so that it leads to a
path toward implementation sooner rather than later. Some points need
more immediate attention, as we would like to tackle the issue of
partition metadata first. Reusing existing infrastructure should be
encouraged, with obvious enhancements as we see fit. I am beginning to
feel there is a need to prototype a good-enough solution that
incorporates the suggestions that have already been provided or will
be. That may be the only way forward, though I think it is definitely
worthwhile to spend some time arriving at such a set of good-enough
ideas on the various aspects.

Thanks,
Amit




Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Robert Haas
Date:
On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> * It has been repeatedly pointed out that we may want to decouple
> partitioning from inheritance because implementing partitioning as an
> extension of inheritance mechanism means that we have to keep all the
> existing semantics which might limit what we want to do with the special
> case of it which is partitioning; in other words, we would find
> ourselves in difficult position where we have to inject a special case
> code into a very generalized mechanism that is inheritance.
> Specifically, do we regard a partitions as pg_inherits children of its
> partitioning parent?

I don't think this is totally an all-or-nothing decision.  I think
everyone is agreed that we need to not break things that work today --
e.g. Merge Append.  What that implies for pg_inherits isn't altogether
clear.

> * Syntax: do we want to make it similar to one of the many other
> databases out there? Or we could invent our own?

Well, what I think we don't want is something that is *almost* like
some other database but not quite.  I lean toward inventing our own
since I'm not aware of something that we'd want to copy exactly.

> I wonder if we could add a clause like DISTRIBUTED BY to complement
> PARTITION ON that represents a hash distributed/partitioned table (that
> could be a syntax to support sharded tables maybe; we would definitely
> want to move ahead in that direction I guess)

Maybe eventually, but let's not complicate things by worrying too much
about that now.

> * Catalog: We would like to have a catalog structure suitable to
> implement capabilities like multi-column partitioning, sub-partitioning
> (with arbitrary number of levels in the hierarchy). I had suggested
> that we create two new catalogs viz. pg_partitioned_rel,
> pg_partition_def to store metadata about a partition key of a
> partitioned relation and partition bound info of a partition,
> respectively. Also, see the point about on-disk representation of
> partition bounds

I'm not convinced that there is any benefit in spreading this
information across two tables neither of which exist today.  If the
representation of the partitioning scheme is going to be a node tree,
then there's no point in taking what would otherwise have been a List
and storing each element of it in a separate tuple. The overarching
point here is that the system catalog structure should be whatever is
most convenient for the system internals; I'm not sure we understand
what that is yet.

> * It is desirable to treat partitions as pg_class relations with perhaps
> a new relkind(s). We may want to choose an implementation where only the
> lowest level relations in a partitioning hierarchy have storage; those
> at the upper layers are mere placeholder relations though of course with
> associated constraints determined by partitioning criteria (with
> appropriate metadata entered into the additional catalogs).

I think the storage-less parents need a new relkind precisely to
denote that they have no storage; I am not convinced that there's any
reason to change the relkind for the leaf nodes.  But that's been
proposed, so evidently someone thinks there's a reason to do it.

> I am not
> quite sure if each kind of the relations involved in the partitioning
> scheme have separate namespaces and, if they are, how we implement that

I am in favor of having all of the nodes in the hierarchy have names
just as relations do today -- pg_class.relname.  Anything else seems
to me to be complex to implement and of very marginal benefit.  But
again, it's been proposed.

> * In the initial implementation, we could just live with partitioning on
> a set of columns (and not arbitrary expressions of them)

Seems quite fair.

> * We perhaps do not need multi-column LIST partitions as they are not
> very widely used and may complicate the implementation

I agree that the use case is marginal; but I'm not sure it needs to
complicate the implementation much.  Depending on how the
implementation shakes out, prohibiting it might come to seem like more
of a wart than allowing it.

> * There are a number of suggestions about how we represent partition
> bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
> associated with the relation itself), etc. Important point to consider
> here may be that partition key may contain more than one column

Yep.

> * How we represent partition definition in memory (for a given
> partitioned relation) - important point to remember is that such a
> representation should be efficient to iterate through or
> binary-searchable. Also see the points about tuple-routing and
> partition-pruning

Yep.

> * Overflow/catchall partition: it seems we do not want/need them. It
> might seem desirable for example in cases where a big transaction enters
> a large number of tuples all but one of which find a defined partition;
> we may not want to error out in such case but instead enter that erring
> tuple into the overflow partition instead. If we choose to implement
> that, we would like to also implement the capability to move the tuples
> into the appropriate partition once it's defined. Related is the notion
> of automatically creating partitions if one is not already defined for a
> just entered tuple; but there may be locking troubles if many concurrent
> sessions try to do that

I think that dynamically creating new partitions is way beyond the
scope of what this patch should be trying to do.  If we ever do it at
all, it should not be now.  The value of a default partition (aka
overflow partition) seems to me to be debatable.  For range
partitioning, it doesn't seem entirely necessary provided that you can
define a range with only one endpoint (e.g. partition A has values 1
to 10, B has 11 and up, and C has 0 and down).  For list partitioning,
though, you might well want something like that.  But is it a
must-have?  Dunno.

> * Tuple-routing: based on the internal representation of partition
> bounds for the partitions of a given partitioned table, there should be
> a way to map a just entered tuple to partition id it belongs to. Below
> mentioned BRIN-like machinery could be made to work
>
> * Partition-pruning: again, based on the internal representation of
> partition bounds for the partitions of a given partitioned table, there
> should be a way to prune partitions deemed unnecessary per scan quals.
> One notable suggestion is to consider BRIN (-like) machinery. For
> example, it is able to tell from the scan quals whether a particular
> block range of a given heap needs to be scanned or not based on summary
> info index tuple for the block range. Though, the interface is currently
> suitable to cover a single heap with blocks in range 0 to N-1 of that
> heap. What we are looking for here is a hypothetical PartitionMemTuple
> (or PartitionBound) that is a summary of a whole relation (in this case,
> the partition) NOT a block range. But I guess the infrastructure is
> generalized enough that we could make that work. Related then would be
> an equivalent of ScanKey for the partitioning case. Just as ScanKeyData
> has correspondence with the index being used, the hypothetical
> PartitionScanKeyData (which may be an entirely bad/half-baked idea!)
> would represent the application of comparison operator between table
> column (partitioning key column) and a constant (as per quals).

I'm not going to say this couldn't be done, but how is any of it
better than having a list of the partition bounds and binary-searching
it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Ashutosh Bapat
Date:


On Fri, Jan 16, 2015 at 11:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> * It has been repeatedly pointed out that we may want to decouple
> partitioning from inheritance because implementing partitioning as an
> extension of inheritance mechanism means that we have to keep all the
> existing semantics which might limit what we want to do with the special
> case of it which is partitioning; in other words, we would find
> ourselves in difficult position where we have to inject a special case
> code into a very generalized mechanism that is inheritance.
> Specifically, do we regard a partitions as pg_inherits children of its
> partitioning parent?

I don't think this is totally an all-or-nothing decision.  I think
everyone is agreed that we need to not break things that work today --
e.g. Merge Append.  What that implies for pg_inherits isn't altogether
clear.

> * Syntax: do we want to make it similar to one of the many other
> databases out there? Or we could invent our own?

Well, what I think we don't want is something that is *almost* like
some other database but not quite.  I lean toward inventing our own
since I'm not aware of something that we'd want to copy exactly.

> I wonder if we could add a clause like DISTRIBUTED BY to complement
> PARTITION ON that represents a hash distributed/partitioned table (that
> could be a syntax to support sharded tables maybe; we would definitely
> want to move ahead in that direction I guess)

Maybe eventually, but let's not complicate things by worrying too much
about that now.

Instead, we might want to specify which server (foreign or local) each of the partitions goes to, perhaps with something like a LOCATED ON clause for each partition, defaulting to the local server.
 

> * Catalog: We would like to have a catalog structure suitable to
> implement capabilities like multi-column partitioning, sub-partitioning
> (with arbitrary number of levels in the hierarchy). I had suggested
> that we create two new catalogs viz. pg_partitioned_rel,
> pg_partition_def to store metadata about a partition key of a
> partitioned relation and partition bound info of a partition,
> respectively. Also, see the point about on-disk representation of
> partition bounds

I'm not convinced that there is any benefit in spreading this
information across two tables neither of which exist today.  If the
representation of the partitioning scheme is going to be a node tree,
then there's no point in taking what would otherwise have been a List
and storing each element of it in a separate tuple. The overarching
point here is that the system catalog structure should be whatever is
most convenient for the system internals; I'm not sure we understand
what that is yet.

> * It is desirable to treat partitions as pg_class relations with perhaps
> a new relkind(s). We may want to choose an implementation where only the
> lowest level relations in a partitioning hierarchy have storage; those
> at the upper layers are mere placeholder relations though of course with
> associated constraints determined by partitioning criteria (with
> appropriate metadata entered into the additional catalogs).

I think the storage-less parents need a new relkind precisely to
denote that they have no storage; I am not convinced that there's any
reason to change the relkind for the leaf nodes.  But that's been
proposed, so evidently someone thinks there's a reason to do it.

> I am not
> quite sure if each kind of the relations involved in the partitioning
> scheme have separate namespaces and, if they are, how we implement that

I am in favor of having all of the nodes in the hierarchy have names
just as relations do today -- pg_class.relname.  Anything else seems
to me to be complex to implement and of very marginal benefit.  But
again, it's been proposed.

> * In the initial implementation, we could just live with partitioning on
> a set of columns (and not arbitrary expressions of them)

Seems quite fair.

> * We perhaps do not need multi-column LIST partitions as they are not
> very widely used and may complicate the implementation

I agree that the use case is marginal; but I'm not sure it needs to
complicate the implementation much.  Depending on how the
implementation shakes out, prohibiting it might come to seem like more
of a wart than allowing it.

> * There are a number of suggestions about how we represent partition
> bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
> associated with the relation itself), etc. Important point to consider
> here may be that partition key may contain more than one column

Yep.

> * How we represent partition definition in memory (for a given
> partitioned relation) - important point to remember is that such a
> representation should be efficient to iterate through or
> binary-searchable. Also see the points about tuple-routing and
> partition-pruning

Yep.

> * Overflow/catchall partition: it seems we do not want/need them. It
> might seem desirable for example in cases where a big transaction enters
> a large number of tuples all but one of which find a defined partition;
> we may not want to error out in such case but instead enter that erring
> tuple into the overflow partition instead. If we choose to implement
> that, we would like to also implement the capability to move the tuples
> into the appropriate partition once it's defined. Related is the notion
> of automatically creating partitions if one is not already defined for a
> just entered tuple; but there may be locking troubles if many concurrent
> sessions try to do that

I think that dynamically creating new partitions is way beyond the
scope of what this patch should be trying to do.  If we ever do it at
all, it should not be now.  The value of a default partition (aka
overflow partition) seems to me to be debatable.  For range
partitioning, it doesn't seem entirely necessary provided that you can
define a range with only one endpoint (e.g. partition A has values 1
to 10, B has 11 and up, and C has 0 and down).  For list partitioning,
though, you might well want something like that.  But is it a
must-have?  Dunno.

> * Tuple-routing: based on the internal representation of partition
> bounds for the partitions of a given partitioned table, there should be
> a way to map a just entered tuple to partition id it belongs to. Below
> mentioned BRIN-like machinery could be made to work
>
> * Partition-pruning: again, based on the internal representation of
> partition bounds for the partitions of a given partitioned table, there
> should be a way to prune partitions deemed unnecessary per scan quals.
> One notable suggestion is to consider BRIN (-like) machinery. For
> example, it is able to tell from the scan quals whether a particular
> block range of a given heap needs to be scanned or not based on summary
> info index tuple for the block range. Though, the interface is currently
> suitable to cover a single heap with blocks in range 0 to N-1 of that
> heap. What we are looking for here is a hypothetical PartitionMemTuple
> (or PartitionBound) that is a summary of a whole relation (in this case,
> the partition) NOT a block range. But I guess the infrastructure is
> generalized enough that we could make that work. Related then would be
> an equivalent of ScanKey for the partitioning case. Just as ScanKeyData
> has correspondence with the index being used, the hypothetical
> PartitionScanKeyData (which may be an entirely bad/half-baked idea!)
> would represent the application of comparison operator between table
> column (partitioning key column) and a constant (as per quals).

I'm not going to say this couldn't be done, but how is any of it
better than having a list of the partition bounds and binary-searching
it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Amit Langote
Date:
On 19-01-2015 PM 12:37, Ashutosh Bapat wrote:
> On Fri, Jan 16, 2015 at 11:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 
>> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>>> I wonder if we could add a clause like DISTRIBUTED BY to complement
>>> PARTITION ON that represents a hash distributed/partitioned table (that
>>> could be a syntax to support sharded tables maybe; we would definitely
>>> want to move ahead in that direction I guess)
>>
>> Maybe eventually, but let's not complicate things by worrying too much
>> about that now.
>>
> 
> Instead we might want to specify which server (foreign or local) each of
> the partition go to, something like LOCATED ON clause for each of the
> partitions with default as local server.
> 

Given how things stand today, we do not allow DDL with the FDW
interface, unless I'm missing something. So, we are restricted to only
going the other way around, say,

CREATE FOREIGN TABLE partXX PARTITION OF parent SERVER ...;

assuming we like the proposed syntax -

CREATE TABLE child PARTITION OF parent;

I think this also assumes we are relying on foreign table inheritance;
that is, both that partitioning is based on inheritance and that
foreign tables support inheritance (which should be the case soon).

Still, I think Robert may be correct that it will be a while before we
integrate foreign tables with the partitioning scheme (I guess mostly
the syntax aspect of it).

Thanks,
Amit




Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Amit Langote
Date:
On 17-01-2015 AM 02:34, Robert Haas wrote:
> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> * It has been repeatedly pointed out that we may want to decouple
>> partitioning from inheritance because implementing partitioning as an
>> extension of inheritance mechanism means that we have to keep all the
>> existing semantics which might limit what we want to do with the special
>> case of it which is partitioning; in other words, we would find
>> ourselves in difficult position where we have to inject a special case
>> code into a very generalized mechanism that is inheritance.
>> Specifically, do we regard a partitions as pg_inherits children of its
>> partitioning parent?
> 
> I don't think this is totally an all-or-nothing decision.  I think
> everyone is agreed that we need to not break things that work today --
> e.g. Merge Append.  What that implies for pg_inherits isn't altogether
> clear.
> 

One point is that an implementation may end up establishing the
parent-partition hierarchy somewhere other than (or in addition to)
pg_inherits. One intention would be to avoid tying the partitioning
scheme to certain inheritance features that use pg_inherits. For
example, consider the call sites of find_all_inheritors(). One notable
example is Append/MergeAppend, which would be of interest to
partitioning. We would want to reuse that part of the infrastructure,
but we might as well write an equivalent, say find_all_partitions(),
which scans something other than pg_inherits to get all partitions.

Now, we may not want to do that and instead add special-case code to
prevent partitioning from fiddling with unnecessary inheritance
features in the inheritance code paths. This seems like an important
decision to make.

>> * Syntax: do we want to make it similar to one of the many other
>> databases out there? Or we could invent our own?
> 
> Well, what I think we don't want is something that is *almost* like
> some other database but not quite.  I lean toward inventing our own
> since I'm not aware of something that we'd want to copy exactly.
> 
>> I wonder if we could add a clause like DISTRIBUTED BY to complement
>> PARTITION ON that represents a hash distributed/partitioned table (that
>> could be a syntax to support sharded tables maybe; we would definitely
>> want to move ahead in that direction I guess)
> 
> Maybe eventually, but let's not complicate things by worrying too much
> about that now.
> 

Agree that we may not want to mix the two too much at this point.

>> * Catalog: We would like to have a catalog structure suitable to
>> implement capabilities like multi-column partitioning, sub-partitioning
>> (with arbitrary number of levels in the hierarchy). I had suggested
>> that we create two new catalogs viz. pg_partitioned_rel,
>> pg_partition_def to store metadata about a partition key of a
>> partitioned relation and partition bound info of a partition,
>> respectively. Also, see the point about on-disk representation of
>> partition bounds
> 
> I'm not convinced that there is any benefit in spreading this
> information across two tables neither of which exist today.  If the
> representation of the partitioning scheme is going to be a node tree,
> then there's no point in taking what would otherwise have been a List
> and storing each element of it in a separate tuple. The overarching
> point here is that the system catalog structure should be whatever is
> most convenient for the system internals; I'm not sure we understand
> what that is yet.
> 

Agreed that some concrete idea of the internal representation should
help guide the catalog structure. If we are going to cache the
partitioning info in the relcache (which we most definitely will), then
we should make sure to consider the scenario of having a lot of
partitioned tables with a lot of individual partitions. It looks like
it would be similar to a scenario where there are a lot of inheritance
hierarchies. But the availability of the partitioning feature would
definitely cause these numbers to grow larger. Perhaps this is an
important point driving this discussion.

I guess this remains tied to the decision we would like to make
regarding inheritance (pg_inherits, etc.)

>> * It is desirable to treat partitions as pg_class relations with perhaps
>> a new relkind(s). We may want to choose an implementation where only the
>> lowest level relations in a partitioning hierarchy have storage; those
>> at the upper layers are mere placeholder relations though of course with
>> associated constraints determined by partitioning criteria (with
>> appropriate metadata entered into the additional catalogs).
> 
> I think the storage-less parents need a new relkind precisely to
> denote that they have no storage; I am not convinced that there's any
> reason to change the relkind for the leaf nodes.  But that's been
> proposed, so evidently someone thinks there's a reason to do it.
> 

Again, this remains partly tied to the decisions we make regarding the
catalog structure.

I am not sure, but wouldn't we ever need to tell from a pg_class entry
that a leaf relation has partition bounds associated with it? One
reason we may not need to is that we would rather use relispartitioned
of a non-leaf relation to trigger finding all its partitions and their
associated bounds; we don't need to know (or reserve a field for the
fact) that a relation has partition bounds associated with it. The
bounds can be stored in pg_partition, indexed by relid. Maybe relkind
is not the right field for this anyway.

With that said, would we be comfortable with putting the partition key
into pg_class (maybe as a pg_node_tree also encapsulating the opclass),
so that if relispartitioned, we also look at relpartkey?

>> I am not
>> quite sure if each kind of the relations involved in the partitioning
>> scheme have separate namespaces and, if they are, how we implement that
> 
> I am in favor of having all of the nodes in the hierarchy have names
> just as relations do today -- pg_class.relname.  Anything else seems
> to me to be complex to implement and of very marginal benefit.  But
> again, it's been proposed.
> 

The same follows from my other comments.

>> * In the initial implementation, we could just live with partitioning on
>> a set of columns (and not arbitrary expressions of them)
> 
> Seems quite fair.
> 
>> * We perhaps do not need multi-column LIST partitions as they are not
>> very widely used and may complicate the implementation
> 
> I agree that the use case is marginal; but I'm not sure it needs to
> complicate the implementation much.  Depending on how the
> implementation shakes out, prohibiting it might come to seem like more
> of a wart than allowing it.
> 

Hmm, I guess the implementation may turn out to be generalized enough
that prohibiting it would become a special case and more work.

>> * There are a number of suggestions about how we represent partition
>> bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
>> associated with the relation itself), etc. Important point to consider
>> here may be that partition key may contain more than one column
> 
> Yep.
> 
>> * How we represent partition definition in memory (for a given
>> partitioned relation) - important point to remember is that such a
>> representation should be efficient to iterate through or
>> binary-searchable. Also see the points about tuple-routing and
>> partition-pruning
> 
> Yep.
> 
>> * Overflow/catchall partition: it seems we do not want/need them. It
>> might seem desirable for example in cases where a big transaction enters
>> a large number of tuples all but one of which find a defined partition;
>> we may not want to error out in such case but instead enter that erring
>> tuple into the overflow partition instead. If we choose to implement
>> that, we would like to also implement the capability to move the tuples
>> into the appropriate partition once it's defined. Related is the notion
>> of automatically creating partitions if one is not already defined for a
>> just entered tuple; but there may be locking troubles if many concurrent
>> sessions try to do that
> 
> I think that dynamically creating new partitions is way beyond the
> scope of what this patch should be trying to do.  If we ever do it at
> all, it should not be now.  The value of a default partition (aka
> overflow partition) seems to me to be debatable.  For range
> partitioning, it doesn't seem entirely necessary provided that you can
> define a range with only one endpoint (e.g. partition A has values 1
> to 10, B has 11 and up, and C has 0 and down).  For list partitioning,
> though, you might well want something like that.  But is it a
> must-have?  Dunno.
> 
>> * Tuple-routing: based on the internal representation of partition
>> bounds for the partitions of a given partitioned table, there should be
>> a way to map a just entered tuple to partition id it belongs to. Below
>> mentioned BRIN-like machinery could be made to work
>>
>> * Partition-pruning: again, based on the internal representation of
>> partition bounds for the partitions of a given partitioned table, there
>> should be a way to prune partitions deemed unnecessary per scan quals.
>> One notable suggestion is to consider BRIN (-like) machinery. For
>> example, it is able to tell from the scan quals whether a particular
>> block range of a given heap needs to be scanned or not based on summary
>> info index tuple for the block range. Though, the interface is currently
>> suitable to cover a single heap with blocks in range 0 to N-1 of that
>> heap. What we are looking for here is a hypothetical PartitionMemTuple
>> (or PartitionBound) that is a summary of a whole relation (in this case,
>> the partition) NOT a block range. But I guess the infrastructure is
>> generalized enough that we could make that work. Related then would be
>> an equivalent of ScanKey for the partitioning case. Just as ScanKeyData
>> has correspondence with the index being used, the hypothetical
>> PartitionScanKeyData (which may be an entirely bad/half-baked idea!)
>> would represent the application of comparison operator between table
>> column (partitioning key column) and a constant (as per quals).
> 
> I'm not going to say this couldn't be done, but how is any of it
> better than having a list of the partition bounds and binary-searching
> it?
> 

Of course, my description of it is pretty hand-wavy.

A primary question for me about partition-pruning is when do we do it?
Should we model it after relation_excluded_by_constraints() and hence
totally plan-time? But, the tone of the discussion is that we postpone
partition-pruning to execution-time and hence my perhaps misdirected
attempts to inject it into some executor machinery.

Thanks,
Amit




Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Amit Langote
Date:
On 20-01-2015 AM 10:48, Amit Langote wrote:
> On 17-01-2015 AM 02:34, Robert Haas wrote:
>> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> * It is desirable to treat partitions as pg_class relations with perhaps
>>> a new relkind(s). We may want to choose an implementation where only the
>>> lowest level relations in a partitioning hierarchy have storage; those
>>> at the upper layers are mere placeholder relations though of course with
>>> associated constraints determined by partitioning criteria (with
>>> appropriate metadata entered into the additional catalogs).
>>
>> I think the storage-less parents need a new relkind precisely to
>> denote that they have no storage; I am not convinced that there's any
>> reason to change the relkind for the leaf nodes.  But that's been
>> proposed, so evidently someone thinks there's a reason to do it.
>>
> 
> Again, this remains partly tied to decisions we make regarding catalog
> structure.
> 
> I am not sure, but would we ever need to tell from a pg_class entry
> that a leaf relation has partition bounds associated with it? One reason
> I can see that we may not need to is that we would rather use
> relispartitioned of a non-leaf relation to trigger finding all its
> partitions and their associated bounds; we don't need to know (or
> reserve a field for the fact) that a relation has partition bounds
> associated with it. The bounds can be stored in pg_partition, indexed
> by relid. Maybe relkind is not the right field for this anyway.
> 
> With that said, would we be comfortable with putting the partition key
> into pg_class (maybe as a pg_node_tree, also encapsulating the opclass)
> so that if relispartitioned is set, we also look at relpartkey?
> 

This paints a picture in which our leaf relations would be plain old
relations. They are similar in almost all respects (how they are
planned, modified, maintained, ...). They just have the additional
property that the values they can contain are restricted by, say,
pg_partition.values; but that does not affect how they are planned.
Planning-related changes are confined to the upper layers of the
hierarchy instead. It is like saying that instead of doing
relation_excluded_by_constraints(childrel), we would say
prune_useless_partitions(&partitionedrel), possibly at some other site
than its counterpart. I guess that illustrates the point.

Again, I am not sure whether we want to limit access to individual
partitions except via some special syntax, and what that would mean for
the above; we have been discussing that. Such access limiting could
(only) be facilitated by a new relkind.

On the other hand, the non-leaf relations are a slightly new kind of
relation in that they do not have storage (they could have a tablespace,
which would be the default tablespace for their underlying partitions).
Obviously they do not have indexes pointing at them. Because they are
further partitioned, they are planned differently - most probably Append
with partition-pruning (almost like Append with constraint-exclusion,
but supposedly quicker because of the explicit access to partition
definitions, and perhaps done at execution time). INSERT/COPY on these
involves routing each tuple to the appropriate leaf relation.

Not surprisingly, this is very similar to the picture that Alvaro had
presented, modulo some differences.

Thanks,
Amit




Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Robert Haas
Date:
On Mon, Jan 19, 2015 at 8:48 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Specifically, do we regard partitions as pg_inherits children of their
>>> partitioning parent?
>>
>> I don't think this is totally an all-or-nothing decision.  I think
>> everyone is agreed that we need to not break things that work today --
>> e.g. Merge Append.  What that implies for pg_inherits isn't altogether
>> clear.
>
> One point is that an implementation may end up establishing the
> parent-partition hierarchy somewhere other than (or in addition to)
> pg_inherits. One intention would be to avoid tying partitioning scheme
> to certain inheritance features that use pg_inherits. For example,
> consider call sites of find_all_inheritors(). One notable example is
> Append/MergeAppend which would be of interest to partitioning. We would
> want to reuse that part of the infrastructure, but we could just as well
> write an equivalent, say find_all_partitions(), which scans something
> other than pg_inherits to get all partitions.

IMHO, there's little reason to avoid putting pg_inherits entries in
for the partitions, and then this just works.  We can find other ways
to make it work if that turns out to be better, but if we don't have
one, there's no reason to complicate things.

> Agree that some concrete idea of internal representation should help
> guide the catalog structure. If we are going to cache the partitioning
> info in relcache (which we most definitely will), then we should try to
> make sure to consider the scenario of having a lot of partitioned tables
> with a lot of individual partitions. It looks like it would be similar
> to a scenario where there are a lot of inheritance hierarchies. But
> availability of a partitioning feature would definitely cause these
> numbers to grow larger. Perhaps this is an important point driving this
> discussion.

Yeah, it would be good if the costs of supporting, say, 1000
partitions were negligible.

> A primary question for me about partition-pruning is when do we do it?
> Should we model it after relation_excluded_by_constraints() and hence
> totally plan-time? But, the tone of the discussion is that we postpone
> partition-pruning to execution-time and hence my perhaps misdirected
> attempts to inject it into some executor machinery.

It's useful to prune partitions at plan time, because then you only
have to do the work once.  But sometimes you don't know enough to do
it at plan time, so it's useful to do it at execution time, too.
Then, you can do it differently for every tuple based on the actual
value you have.  There's no point in doing 999 unnecessary relation
scans if we can tell which partition the actual run-time value must be
in.  But I think execution-time pruning can be a follow-on patch.  If
you don't restrict the scope of the first patch as much as possible,
you're not going to have much luck getting this committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Amit Langote
Date:
On 21-01-2015 AM 01:42, Robert Haas wrote:
> On Mon, Jan 19, 2015 at 8:48 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> Specifically, do we regard partitions as pg_inherits children of their
>>>> partitioning parent?
>>>
>>> I don't think this is totally an all-or-nothing decision.  I think
>>> everyone is agreed that we need to not break things that work today --
>>> e.g. Merge Append.  What that implies for pg_inherits isn't altogether
>>> clear.
>>
>> One point is that an implementation may end up establishing the
>> parent-partition hierarchy somewhere other than (or in addition to)
>> pg_inherits. One intention would be to avoid tying partitioning scheme
>> to certain inheritance features that use pg_inherits. For example,
>> consider call sites of find_all_inheritors(). One notable example is
>> Append/MergeAppend which would be of interest to partitioning. We would
>> want to reuse that part of the infrastructure, but we could just as well
>> write an equivalent, say find_all_partitions(), which scans something
>> other than pg_inherits to get all partitions.
> 
> IMHO, there's little reason to avoid putting pg_inherits entries in
> for the partitions, and then this just works.  We can find other ways
> to make it work if that turns out to be better, but if we don't have
> one, there's no reason to complicate things.
> 

Ok, I will go forward and stick to the pg_inherits approach for now.
Perhaps the concerns I am expressing have other solutions that don't
require abandoning the pg_inherits approach altogether.

>> Agree that some concrete idea of internal representation should help
>> guide the catalog structure. If we are going to cache the partitioning
>> info in relcache (which we most definitely will), then we should try to
>> make sure to consider the scenario of having a lot of partitioned tables
>> with a lot of individual partitions. It looks like it would be similar
>> to a scenario where there are a lot of inheritance hierarchies. But
>> availability of a partitioning feature would definitely cause these
>> numbers to grow larger. Perhaps this is an important point driving this
>> discussion.
> 
> Yeah, it would be good if the costs of supporting, say, 1000
> partitions were negligible.
> 
>> A primary question for me about partition-pruning is when do we do it?
>> Should we model it after relation_excluded_by_constraints() and hence
>> totally plan-time? But, the tone of the discussion is that we postpone
>> partition-pruning to execution-time and hence my perhaps misdirected
>> attempts to inject it into some executor machinery.
> 
> It's useful to prune partitions at plan time, because then you only
> have to do the work once.  But sometimes you don't know enough to do
> it at plan time, so it's useful to do it at execution time, too.
> Then, you can do it differently for every tuple based on the actual
> value you have.  There's no point in doing 999 unnecessary relation
> scans if we can tell which partition the actual run-time value must be
> in.  But I think execution-time pruning can be a follow-on patch.  If
> you don't restrict the scope of the first patch as much as possible,
> you're not going to have much luck getting this committed.
> 

Ok, I will limit myself to focusing on the following things at the moment:

* Provide syntax in CREATE TABLE to declare partition key
* Provide syntax in CREATE TABLE to declare a table as a partition of a
partitioned table and the values it contains
* Arrange to have partition key and values stored in appropriate
catalogs (existing or new)
* Arrange to cache partitioning info of partitioned tables in relcache

Thanks,
Amit




Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Amit Langote
Date:
On 21-01-2015 PM 07:26, Amit Langote wrote:
> Ok, I will limit myself to focusing on following things at the moment:
> 
> * Provide syntax in CREATE TABLE to declare partition key

While working on this, I stumbled upon the question of how we deal with
any index definitions following from constraints defined in a CREATE
statement. I think we do not want a physical index created for a table
that is partitioned (in other words, has no heap of its own). As the
current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE and
EXCLUSION constraints are enforced as indexes. It seems there are really
two decisions to make here:

1) How do we deal with any index definitions (either explicit, or
implicit following from constraints defined on the table) - do we allow
them by marking them specially, say, in pg_index, as being mere
placeholders/templates, or invent some other mechanism?

2) As a short-term solution, do we simply reject creating any indexes
(or any constraints that require them) on a table whose definition also
includes a PARTITION ON clause, and instead define them on its
partitions (or any relations in the hierarchy that are not further
partitioned)?

Or maybe I'm missing something...

Thanks,
Amit




Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Jim Nasby
Date:
On 1/25/15 7:42 PM, Amit Langote wrote:
> On 21-01-2015 PM 07:26, Amit Langote wrote:
>> Ok, I will limit myself to focusing on following things at the moment:
>>
>> * Provide syntax in CREATE TABLE to declare partition key
>
> While working on this, I stumbled upon the question of how we deal with
> any index definitions following from constraints defined in a CREATE
> statement. I think we do not want to have a physical index created for a
> table that is partitioned (in other words, has no heap of itself). As
> the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE,
> EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really
> two decisions to make here:
>
> 1) how do we deal with any index definitions (either explicit or
> implicit following from constraints defined on it) - do we allow them by
> marking them specially, say, in pg_index, as being mere
> placeholders/templates or invent some other mechanism?
>
> 2) As a short-term solution, do we simply reject creating any indexes
> (/any constraints that require them) on a table whose definition also
> includes PARTITION ON clause? Instead define them on its partitions (or
> any relations in hierarchy that are not further partitioned).
>
> Or maybe I'm missing something...

Wasn't the idea that the parent table in a partitioned table wouldn't
actually have a heap of its own? If there's no heap there can't be an
index.

That said, I think this is premature optimization that could be done later.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From
Amit Langote
Date:
On 27-01-2015 AM 05:46, Jim Nasby wrote:
> On 1/25/15 7:42 PM, Amit Langote wrote:
>> On 21-01-2015 PM 07:26, Amit Langote wrote:
>>> Ok, I will limit myself to focusing on following things at the moment:
>>>
>>> * Provide syntax in CREATE TABLE to declare partition key
>>
>> While working on this, I stumbled upon the question of how we deal with
>> any index definitions following from constraints defined in a CREATE
>> statement. I think we do not want to have a physical index created for a
>> table that is partitioned (in other words, has no heap of itself). As
>> the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE,
>> EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really
>> two decisions to make here:
>>
>> 1) how do we deal with any index definitions (either explicit or
>> implicit following from constraints defined on it) - do we allow them by
>> marking them specially, say, in pg_index, as being mere
>> placeholders/templates or invent some other mechanism?
>>
>> 2) As a short-term solution, do we simply reject creating any indexes
>> (/any constraints that require them) on a table whose definition also
>> includes PARTITION ON clause? Instead define them on its partitions (or
>> any relations in hierarchy that are not further partitioned).
>>
>> Or maybe I'm missing something...
> 
> Wasn't the idea that the parent table in a partitioned table wouldn't
> actually have a heap of its own? If there's no heap there can't be an
> index.
>

Yes, that's right. Perhaps, as you say below, we should not look at the
heap-less partitioned relation thingy too soon.

> That said, I think this is premature optimization that could be done later.

It seems so.

Thanks,
Amit