Thread: Declarative partitioning
On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi,
I would like to propose $SUBJECT for this development cycle. Attached is a
WIP patch that implements most if not all of what's described below. Some
yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
Syntax
======
1. Creating a partitioned table
CREATE TABLE table_name
PARTITION BY {RANGE|LIST}
ON (column_list);
Where column_list consists of simple column names or expressions:
PARTITION BY LIST ON (name)
PARTITION BY RANGE ON (year, month)
PARTITION BY LIST ON ((lower(left(name, 2))))
PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))
Note: LIST partition key supports only one column.
For each column, you can optionally specify an operator class name:
PARTITION BY LIST/RANGE ON (colname [USING] opclass_name)
If not specified, the default btree operator class based on the type of each
key column is used. If none of the available btree operator classes are
compatible with the partitioning strategy (list/range), an error is thrown.
Built-in btree operator classes cover a good number of types for list and
range partitioning in practical scenarios.
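For illustration, a non-default btree operator class could be named
explicitly, e.g. (a hypothetical sketch using the proposed syntax):
CREATE TABLE persons (name text)
PARTITION BY RANGE ON (name USING text_pattern_ops);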
A table created using this form is of proposed new relkind
RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is
created to store partition key info.
Note: A table cannot be partitioned after-the-fact using ALTER TABLE.
Normal dependencies are created between the partitioned table and the
operator classes and any objects (such as functions) referenced in
partition expressions.
2. Creating a partition of a partitioned table
CREATE TABLE table_name
PARTITION OF partitioned_table_name
FOR VALUES values_spec;
Where values_spec is:
listvalues: [IN] (val1, ...)
rangevalues: START (col1min, ... ) END (col1max, ... )
| START (col1min, ... )
| END (col1max, ... )
A table created using this form has proposed pg_class.relispartition set
to true. An entry in pg_partition (see below) is created to store the
partition bound info.
The values_spec should match the partitioning strategy of the partitioned
table. In case of a range partition, the values in START and/or END should
match columns in the partition key.
Defining a list partition is fairly straightforward - just spell out the
list of comma-separated values. An error is thrown if the list of values
overlaps with that of an existing partition.
CREATE TABLE persons_by_state (name text, state text)
PARTITION BY LIST ON (state);
CREATE TABLE persons_IL
PARTITION OF persons_by_state
FOR VALUES IN ('IL');
CREATE TABLE persons_fail
PARTITION OF persons_by_state
FOR VALUES IN ('IL');
ERROR: cannot create partition that overlaps with an existing one
For a range partition, there is more than one way:
Specify both START and END bounds: the resulting range must not overlap with
the range(s) covered by existing partitions; an error is thrown otherwise.
Although rare in practice, gaps between ranges are OK.
CREATE TABLE measurement(logdate date NOT NULL)
PARTITION BY RANGE ON (logdate);
CREATE TABLE measurement_y2006m02
PARTITION OF measurement
FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success
CREATE TABLE measurement_fail
PARTITION OF measurement
FOR VALUES START ('2006-02-15') END ('2006-03-01');
ERROR: cannot create partition that overlaps with an existing one
Specify only the START bound: add the partition on the left of some range
covered by existing partitions provided no overlap occurs (also
considering gaps between ranges, if any). If no such range exists, the new
partition will cover the range [START, +INFINITY) and become the rightmost
partition. Error is thrown if the specified START causes overlap.
CREATE TABLE measurement_y2006m01
PARTITION OF measurement
FOR VALUES START ('2006-01-01'); --success
CREATE TABLE measurement_fail
PARTITION OF measurement
FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one
Specify only the END bound: add the partition on the right of some range
covered by existing partitions provided no overlap occurs (also
considering gaps between ranges, if any). If no such range exists, the new
partition would cover the range (-INFINITY, END) and become the leftmost
partition. Error is thrown if the specified END causes overlap.
CREATE TABLE measurement_y2006m03
PARTITION OF measurement
FOR VALUES END ('2006-04-01'); --success
CREATE TABLE measurement_fail
PARTITION OF measurement
FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one
For each partition, START and END bound values are stored in the
catalog. Note that the lower bound is inclusive, whereas the upper bound
is exclusive.
Note: At most one range partition can have a null min bound, in which case
it covers the range (-INFINITY, END). Also, at most one range partition can
have a null max bound, in which case it covers the range [START, +INFINITY).
A normal dependency is created between the parent and the partition.
3. Multi-level partitioning
CREATE TABLE table_name
PARTITION OF partitioned_table_name
FOR VALUES values_spec
PARTITION BY {RANGE|LIST} ON (columns_list)
This variant implements a form of so-called composite or sub-partitioning
with an arbitrarily deep partitioning structure. A table created using this
form has both the relkind RELKIND_PARTITIONED_REL and
pg_class.relispartition set to true.
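As a sketch using the syntax proposed above (the table and partition names
here are purely illustrative), a range-partitioned table whose partitions
are themselves list-partitioned could be declared as:
CREATE TABLE orders (region text, ordered_at date NOT NULL)
PARTITION BY RANGE ON (ordered_at);
CREATE TABLE orders_2006
PARTITION OF orders
FOR VALUES START ('2006-01-01') END ('2007-01-01')
PARTITION BY LIST ON (region);
CREATE TABLE orders_2006_east
PARTITION OF orders_2006
FOR VALUES IN ('east');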
4. (yet unimplemented) Attach partition (from existing table)
ALTER TABLE partitioned_table
ATTACH PARTITION partition_name
FOR VALUES values_spec
USING [TABLE] table_name;
ALTER TABLE table_name
SET VALID PARTITION OF <parent>;
The first of the above pair of commands would attach table_name as a (yet)
'invalid' partition of partitioned_table (after confirming that it matches
the schema and does not overlap with other partitions per FOR VALUES
spec). It would also record the FOR VALUES part in the partition catalog
and set pg_class.relispartition to true for table_name.
After the first command is done, the second command would take an exclusive
lock on table_name, scan the table to check whether it contains any values
outside the boundaries defined by the FOR VALUES clause specified previously,
throw an error if so, and mark it as a valid partition of the parent if not.
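For illustration, assuming a hypothetical pre-populated standalone table
measurement_y2006m04, the proposed (yet unimplemented) two-step sequence
might look like:
ALTER TABLE measurement
ATTACH PARTITION measurement_y2006m04
FOR VALUES START ('2006-04-01') END ('2006-05-01')
USING TABLE measurement_y2006m04;
-- then, as a separate step:
ALTER TABLE measurement_y2006m04
SET VALID PARTITION OF measurement;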
Does that make sense?
5. Detach partition
ALTER TABLE partitioned_table
DETACH PARTITION partition_name [USING table_name]
This removes partition_name as a partition of partitioned_table. The table
continues to exist with the same name or 'table_name', if specified.
pg_class.relispartition is set to false for the table, so it behaves like
a normal table.
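For example (again a sketch of the proposed syntax, with a made-up new
name), a partition could be detached and kept around as an ordinary table:
ALTER TABLE measurement
DETACH PARTITION measurement_y2006m01 USING measurement_y2006m01_old;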
System catalogs
===============
1. pg_partitioned_rel
CATALOG(pg_partitioned_rel,oid) BKI_WITHOUT_OIDS
{
Oid partrelid; /* partitioned table pg_class.oid */
char partstrategy; /* partitioning strategy 'l'/'r' */
int16 partnatts; /* number of partition columns */
int2vector partkey; /* column numbers of partition columns;
* 0 where specified column is an
* expression */
oidvector partclass; /* operator class to compare keys */
pg_node_tree partexprs; /* expression trees for partition key
* members that are not simple column
* references; one for each zero entry
* in partkey[] */
};
2. pg_partition (omits partisvalid alluded to above)
CATALOG(pg_partition,oid) BKI_WITHOUT_OIDS
{
Oid partitionid; /* partition oid */
Oid partparent; /* parent oid */
anyarray partlistvalues; /* list of allowed values of the only
* partition column */
anyarray partrangebounds; /* list of bounds of ranges of
* allowed values per partition key
* column */
};
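Assuming the proposed catalogs above, the partitioning metadata could be
inspected with ordinary queries; a hypothetical example using the column
names as defined above:
-- partition key of a partitioned table
SELECT partstrategy, partnatts, partkey, partclass
FROM pg_partitioned_rel
WHERE partrelid = 'measurement'::regclass;
-- bounds of its partitions
SELECT partitionid::regclass AS partition, partrangebounds
FROM pg_partition
WHERE partparent = 'measurement'::regclass;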
Further notes
=============
There are a number of restrictions on performing after-the-fact changes
using ALTER TABLE to partitions (ie, relispartition=true):
* Cannot add/drop column
* Cannot set/drop OIDs
* Cannot set/drop NOT NULL
* Cannot set/drop default
* Cannot alter column type
* Cannot add/drop/alter constraints (table level)
* Cannot change persistence
* Cannot change inheritance
* Cannot link to a composite type
Such changes should be made to the topmost parent in the partitioning
hierarchy (hereafter referred to as just parent). These are recursively
applied to all the tables in the hierarchy. Although the last two items
cannot be performed on the parent either.
Dropping a partition using DROP TABLE is not allowed. It needs to be detached
using ALTER TABLE on the parent before it can be dropped as a normal table.
Triggers on partitions are not allowed. They should be defined on the
parent. That said, I could not figure out a way to implement row-level
AFTER triggers on partitioned tables (more about that in a moment); so
they are currently not allowed:
CREATE TRIGGER audit_trig
AFTER INSERT ON persons
FOR EACH ROW EXECUTE PROCEDURE audit_func();
ERROR: Row-level AFTER triggers are not supported on partitioned tables
Column/table constraints on partitions are not allowed. They should be
defined on the parent. Foreign key constraints are not allowed due to the
above limitation (no row-level AFTER triggers).
A partitioning parent (RELKIND_PARTITIONED_REL) does not have storage.
Creating indexes on the parent is not allowed. They should be defined on (leaf)
partitions. Because of this limitation, primary keys are not allowed on a
partitioned table. Perhaps, we should be able to just create a dummy
entry somewhere to represent an index on parent (which every partition
then copies.) Then by restricting primary key to contain all partition key
columns, we can implement unique constraint over the whole partitioned
table. That will in turn allow us to use partitioned tables as PK rels in
a foreign key constraint provided row-level AFTER trigger issue is resolved.
VACUUM/ANALYZE on individual partitions should work like normal tables.
I've not implemented something like inheritance tree sampling for
partitioning tree in this patch. Autovacuum has been taught to ignore
parent tables and vacuum/analyze partitions normally.
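For example, a single partition can be processed directly, just like a
normal table:
VACUUM ANALYZE measurement_y2006m02;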
Dropping a partitioned table should perhaps unconditionally drop all its
partitions, but currently the patch uses dependencies, so CASCADE must be
specified to achieve the same.
What should TRUNCATE on partitioned table do?
Ownership, privileges/permissions, and RLS should be managed through the
parent table, although this is not comprehensively addressed in the patch.
There is no need to define tuple routing triggers. CopyFrom() and
ExecInsert() determine target partition just before performing
heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and
constraints (on the parent) are executed for a tuple before it is routed to
a partition. If no partition can be found, it's an error.
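For instance, with the measurement partitions defined earlier, the intended
behavior would be roughly as follows (a sketch, not output from the patch):
INSERT INTO measurement VALUES ('2006-02-14'); -- routed to measurement_y2006m02
INSERT INTO measurement VALUES ('2007-02-14'); -- fails: no partition covers this date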
Because row-level AFTER triggers need to save ItemPointers in trigger
event data and defining triggers on partitions (which is where tuples
really go) is not allowed, I could not find a straightforward way to
implement them. So, perhaps we should allow (only) row-level AFTER
triggers on partitions or think of modifying trigger.c to know about this
twist explicitly.
Internal representations
========================
For RELKIND_PARTITIONED_REL relations, RelationData gets a few new
fields to store the partitioning metadata. That includes the partition key
tuple (from pg_partitioned_rel) along with some derived info (opfamily,
opcintype, compare proc FmgrInfos, partition expression trees).
Additionally, it also includes a PartitionInfo object which includes
partition OIDs array, partition bound arrays (if range partitioned,
rangemax is sorted in ascending order and OIDs are likewise ordered). It
is built from information in pg_partition catalog.
While RelationBuildDesc() initializes the basic key info, fields like the
expression trees and PartitionInfo are built on demand and cached. For
example, InitResultRelInfo() builds them to populate the newly added
ri_PartitionKeyInfo and ri_Partitions fields, respectively.
The PartitionInfo object is rebuilt on every cache invalidation of the rel,
which includes when a partition is added/attached/detached.
Planner and executor considerations
=====================================
The patch does not yet implement any planner changes for partitioned
tables, although I'm working on that and will post updates as soon as
possible. That means it is not possible to run SELECT/UPDATE/DELETE
queries on partitioned tables without getting:
postgres=# SELECT * FROM persons;
ERROR: could not open file "base/13244/106975": No such file or directory
Given that there would be more direct ways of making partition pruning
decisions with the proposed scheme, it would be nice to utilize them.
Specifically, I would like to avoid having to rely on constraint exclusion
for partition pruning whereby subquery_planner() builds append_rel_list
and the later steps exclude useless partitions.
By extending RelOptInfo to include partitioning info for partitioned rels,
it might be possible to perform partition pruning directly without
previously having to expand them. Although, as things stand now, it's not
clear how that might work - when would partition RTEs be added to the
rtable? The rtable is assumed not to change after
setup_simple_rel_arrays() has done its job which is much earlier than when
it would be desirable for the partitioned table expansion (along with
partition pruning) to happen. Moreover, if that means we might not be able
to build RelOptInfo's for partitions, how do we choose the best paths for them
(index paths or not, etc.)?
I'm also hoping we don't require something like inheritance_planner() for
when partitioned tables are target rels. I assume considerations for why
the special processing is necessary for inheritance trees in that scenario
don't apply to partitioning trees. So, if grouping_planner() returns an
Append plan (among other options) for the partitioning tree, tacking a
ModifyTable node on top should do the trick?
Suggestions greatly welcome in this area.
Other items
===========
Will include the following once we start reaching consensus on main parts
of the proposed design/implementation:
* New regression tests
* Documentation updates
* pg_dump, psql, etc.
For reference, some immediately previous discussions:
* On partitioning:
http://www.postgresql.org/message-id/20140829155607.GF7705@eldon.alvh.no-ip.org
* Partitioning WIP patch:
http://www.postgresql.org/message-id/54EC32B6.9070605@lab.ntt.co.jp
Comments welcome!
Thanks,
Amit
Wow, didn't expect to see that email this morning.
A very quick test:
CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item text) partition by range on ((extract(year from purchase_time)),(extract(month from purchase_time)));
ERROR: referenced relation "purchases" is not a table or foreign table
Thom
Hi Thom,
On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown <thom@linux.com> wrote:
> Wow, didn't expect to see that email this morning.
> A very quick test:
> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item
> text) partition by range on ((extract(year from
> purchase_time)),(extract(month from purchase_time)));
> ERROR: referenced relation "purchases" is not a table or foreign table
Thanks for the quick test. Damn, I somehow missed adding the new relkind
to a check in process_owned_by(). Will fix this and look for any such
oversights.
Thanks,
Amit
On Tue, Aug 18, 2015 at 07:30:20PM +0900, Amit Langote wrote:
> Hi,
> I would like to propose $SUBJECT for this development cycle. Attached is a
> WIP patch that implements most if not all of what's described below. Some
> yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
Thanks for pushing this forward! We've needed this done for at least a decade.
> 4. (yet unimplemented) Attach partition (from existing table)
> ALTER TABLE partitioned_table
> ATTACH PARTITION partition_name
> FOR VALUES values_spec
> USING [TABLE] table_name;
> ALTER TABLE table_name
> SET VALID PARTITION OF <parent>;
> The first of the above pair of commands would attach table_name as a (yet)
> 'invalid' partition of partitioned_table (after confirming that it matches
> the schema and does not overlap with other partitions per FOR VALUES
> spec). It would also record the FOR VALUES part in the partition catalog
> and set pg_class.relispartition to true for table_name.
> After the first command is done, the second command would take exclusive
> lock on table_name, scan the table to check if it contains any values
> outside the boundaries defined by FOR VALUES clause defined previously,
> throw error if so, mark as valid partition of parent if not.
One small change to make this part more efficient:
1. Take the access exclusive lock on table_name.
2. Check for a matching constraint on it.
3. If it's there, mark it as a valid partition.
4. If not, check for values outside the boundaries as above.
Should that be a *valid* constraint? Perhaps that should be parameterized,
as I'm not yet seeing a compelling argument either direction. I'm
picturing something like:
ALTER TABLE table_name SET VALID PARTITION OF <parent> [TRUST]
where TRUST would mean that an existing constraint need not be VALID.
> Does that make sense?
Yep.
> 5. Detach partition
> ALTER TABLE partitioned_table
> DETACH PARTITION partition_name [USING table_name]
> This removes partition_name as partition of partitioned_table. The table
> continues to exist with the same name or 'table_name', if specified.
> pg_class.relispartition is set to false for the table, so it behaves like
> a normal table.
Could this take anything short of an access exclusive lock on the parent?
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Amit,
> I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
First of all, wow! Really happy to see this.
> Syntax
> ======
> 1. Creating a partitioned table
> CREATE TABLE table_name
> PARTITION BY {RANGE|LIST}
> ON (column_list);
> Where column_list consists of simple column names or expressions:
> PARTITION BY LIST ON (name)
> PARTITION BY RANGE ON (year, month)
> PARTITION BY LIST ON ((lower(left(name, 2))))
> PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))
So far so good. Have you given any thought as to how a user will determine which partition corresponds to which values (for purposes of dropping/maintaining a partition)?
Also, won't doing things like extract() for range partitions make it much harder for you to develop the planner parts of this solution?
What about defining an interval instead, such as:
PARTITION BY RANGE USING ( interval ) ON ( column );
i.e.
PARTITION BY RANGE USING ( INTERVAL '1 month' ) ON ( submitted_date );
PARTITION BY RANGE USING ( 100000 ) ON ( user_id );
This would make it easy for you to construct range type values defining the range of each partition, which would then make the planner work much easier than calling a function would, no? Or am I misunderstanding how you're using ranges here? It kind of seems like you're still leaving specific range definitions up to the user, which is (from my perspective) unsatisfactory (see below).
I'm assuming that all range partitions will be [ closed, open ) ranges.
> 2. Creating a partition of a partitioned table
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec;
> Where values_spec is:
> listvalues: [IN] (val1, ...)
> rangevalues: START (col1min, ... ) END (col1max, ... )
> | START (col1min, ... )
> | END (col1max, ... )
So, one thing I missed in here is anything about automated partitioning of tables; that is, creating new partitions based on incoming data or a simple statement which doesn't require knowledge of the partitioning scheme. It's possible (and entirely acceptable) that you're considering automated partition creation outside of the scope of this patch.
However, for range partitions, it would be *really* useful to have this syntax:
CREATE NEXT PARTITION ON parent_table;
Which would just create the "next" partition based on whatever the range partitioning scheme is, instead of requiring the user to calculate start and end values which might or might not match the parent partitioning scheme, and might leave gaps. Also this would be useful for range partitions:
CREATE PARTITION ON parent_table USING ( start_value );
... where start_value is the start range of the new partition. Again, easier for users to get correct.
Both of these require the idea of regular intervals for range partitions, that is, on a table partitioned by month on a timestamptz column, each partition will have the range [ month:1, nextmonth:1 ). This is the most common use-case for range partitions (like, 95% of all partitioning cases I've seen), so a new partitioning scheme ought to address it.
While there are certainly users who desire the ability to define arbitrary ranges for each range partition, these are by far the minority and could be accommodated by a different path with more complex syntax.
Further, I'd wager that most users who want to define arbitrary ranges for range partitions aren't going to be satisfied with the other restrictions on declarative partitioning (e.g. same constraints, columns for all partitions) and are going to use inheritance partitioning anyway.
> 5. Detach partition
> ALTER TABLE partitioned_table
> DETACH PARTITION partition_name [USING table_name]
> This removes partition_name as partition of partitioned_table. The table continues to exist with the same name or 'table_name', if specified. pg_class.relispartition is set to false for the table, so it behaves like a normal table.
What about DROPping partitions? Do they need to be detached first?
> Creating index on parent is not allowed. They should be defined on (leaf) partitions. Because of this limitation, primary keys are not allowed on a partitioned table. Perhaps, we should be able to just create a dummy entry somewhere to represent an index on parent (which every partition then copies.)
This would be preferable, yes. Making users remember to manually create indexes on each partition is undesirable.
> What should TRUNCATE on partitioned table do?
On the master table? Truncate all individual partitions. Do not drop the partitions.
On a partition? Truncate just that partition.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
great :-)
> 2. Creating a partition of a partitioned table
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec;
> Where values_spec is:
> listvalues: [IN] (val1, ...)
Would it make sense to allow one complementary partition to the listvalues?
listvalues: [[NOT] IN] (val1, ...)
I've thought a few times about moving data with some most common values to dedicated partitions and keeping the rest in a separate one...
best regards,
Marc Mamin
On Tue, Aug 18, 2015 at 6:30 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Quick thoughts borne of years of slugging it out with partitions on Oracle:
- Finally!!!!!!!!!!!
- Your range partitioning will need to express exclusive/inclusive bounds, or go to the Oracle model where every partition is a cascading "values less than" test context dependent on the partitions defined before it. I would suggest that leveraging existing range types (or allowing the user to specify a range type, like for a specific collation of a text range) would allow for the most flexible and postgres-ish range definition. You seem to do this with the "[USING] opclass_name" bit, but don't follow through on the START(...) and END(...). Something like FOR VALUES <@ '[''2014-01-01'',''2015-01-01'')'::daterange would cover most needs succinctly, though I admit the syntax for complex ranges could be cumbersome, but something like FOR VALUES <@ '[(''a'',1),(''b'',1))'::letter_then_number_range is still readable.
- No partitioning scheme survives first contact with reality. So you will need a facility for splitting and joining existing partitions. For splitting partitions, it's sufficient to require that the new partition share either an upper/lower bound (with the same inclusivity/exclusivity) of an existing partition, thus uniquely identifying the partition to be split, and require that the other bound be within the range of the partition to be split. Similarly, it's fair to require that the partitions to be joined be adjacent in range. In both cases, range operators make these tests simple.
- Your features 4 and 5 are implemented in Oracle with SWAP PARTITION, which is really neat for doing ETLs and index rebuilds offline in a copy table, and then swapping the data segment of that table with the partition specified. That could be considered cheating because none of the partition metadata changed, just the pointers to the segments. We already do this with adding/removing INHERIT. I'm not saying they can't be separate functionality, but keeping an atomic SWAP operation would be grand.
On 2015-08-18 PM 10:43, David Fetter wrote:
>> After the first command is done, the second command would take exclusive
>> lock on table_name, scan the table to check if it contains any values
>> outside the boundaries defined by FOR VALUES clause defined previously,
>> throw error if so, mark as valid partition of parent if not.
>
> One small change to make this part more efficient:
> 1. Take the access exclusive lock on table_name.
> 2. Check for a matching constraint on it.
> 3. If it's there, mark it as a valid partition.
> 4. If not, check for values outside the boundaries as above.
That's an interesting idea. Thanks!
By a matching constraint, I guess you mean a 'valid' constraint from which the declared partition constraint can be proven to follow. For (a simple) example, from a CHECK (a >= 100 AND a < 150) on table_name, the partition constraint implied by FOR VALUES START (100) END (200) can be assumed to hold (see the sketch at the end of this message).
> Should that be a *valid* constraint? Perhaps that should be
> parameterized, as I'm not yet seeing a compelling argument either
> direction. I'm picturing something like:
> ALTER TABLE table_name SET VALID PARTITION OF <parent> [TRUST]
> where TRUST would mean that an existing constraint need not be VALID.
Hmm, I'd think this step must be able to assert the partition constraint beyond any doubt. If the DBA added the constraint and marked it invalid, she should first VALIDATE the constraint to make it valid by performing whatever steps are necessary. IOW, a full heap scan at least once is inevitable (the reason why we might want to make this a two-step process at all). Am I missing something?
>> 5. Detach partition
>> ALTER TABLE partitioned_table
>> DETACH PARTITION partition_name [USING table_name]
>> This removes partition_name as partition of partitioned_table. The table
>> continues to exist with the same name or 'table_name', if specified.
>> pg_class.relispartition is set to false for the table, so it behaves like
>> a normal table.
> Could this take anything short of an access exclusive lock on the
> parent?
Yes, both step 1 of the ATTACH command and the DETACH command take an access exclusive lock on the parent. They are rather quick metadata changes, so should not stall others significantly, I think.
Thanks,
Amit
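To make the constraint-matching case above concrete, here is a hypothetical
sketch in the proposed syntax (table and partition names invented purely for
illustration):
CREATE TABLE t (a int) PARTITION BY RANGE ON (a);
CREATE TABLE t_stage (a int, CHECK (a >= 100 AND a < 150));
-- ... populate t_stage ...
ALTER TABLE t
ATTACH PARTITION t_p1
FOR VALUES START (100) END (200)
USING TABLE t_stage;
-- under David's suggestion, the existing valid CHECK constraint already
-- implies the partition bounds, so SET VALID could skip the full scan:
ALTER TABLE t_stage SET VALID PARTITION OF t;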
On 2015-08-19 AM 02:57, Marc Mamin wrote:
>> 2. Creating a partition of a partitioned table
>> CREATE TABLE table_name
>> PARTITION OF partitioned_table_name
>> FOR VALUES values_spec;
>> Where values_spec is:
>> listvalues: [IN] (val1, ...)
>
> Would it make sense to allow one complementary partition to the listvalues?
> listvalues: [[NOT] IN] (val1, ...)
> I've thought a few times about moving data with some most common values to dedicated partitions
> and keeping the rest in a separate one...
Thanks, that's definitely something to consider. I have been thinking of a sort of default list partition for the "rest" of the values. Would you rather declare that with something like the below than having to enumerate all the values in a NOT IN list? Or is the NOT IN way more intuitive/friendly?
CREATE TABLE _rest
PARTITION OF table_name
FOR VALUES [ IN ] DEFAULT
Of course, at most one such partition would be allowed.
Thanks,
Amit
On 18 August 2015 at 18:31, Josh Berkus <josh@agliodbs.com> wrote:
--
> 2. Creating a partition of a partitioned table
>
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec;
>
> Where values_spec is:
>
> listvalues: [IN] (val1, ...)
>
> rangevalues: START (col1min, ... ) END (col1max, ... )
> | START (col1min, ... )
> | END (col1max, ... )
So, one thing I missed in here is anything about automated partitioning
of tables; that is, creating new partitions based on incoming data or a
simple statement which doesn't require knowledge of the partitioning
scheme. It's possible (and entirely acceptable) that you're
considering automated partition creation outside of the scope of this
patch.
I would like to make automatic partitioning outside the scope of this first patch.
However, for range partitions, it would be *really* useful to
have this syntax:
CREATE NEXT PARTITION ON parent_table;
Which would just create the "next" partition based on whatever the range
partitioning scheme is, instead of requiring the user to calculate start
and end values which might or might not match the parent partitioning
scheme, and might leave gaps. Also this would be useful for range
partitions:
CREATE PARTITION ON parent_table USING ( start_value );
... where start_value is the start range of the new partition. Again,
easier for users to get correct.
Both of these require the idea of regular intervals for range
partitions, that is, on a table partitioned by month on a timestamptz
column, each partition will have the range [ month:1, nextmonth:1 ).
This is the most common use-case for range partitions (like, 95% of all
partitioning cases I've seen), so a new partitioning scheme ought to
address it.
While there are certainly users who desire the ability to define
arbitrary ranges for each range partition, these are by far the minority
and could be accommodated by a different path with more complex syntax.
Further, I'd wager that most users who want to define arbitrary ranges
for range partitions aren't going to be satisfied with the other
restrictions on declarative partitioning (e.g. same constraints, columns
for all partitions) and are going to use inheritance partitioning anyway.
I like the idea of a regular partitioning step because it is how you design such tables - "lets use monthly partitions".
This gives sanely terse syntax, rather than specifying pages and pages of exact values in DDL....
PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START WITH value;
borrowing the same concepts from sequence syntax.
> Creating index on parent is not allowed. They should be defined on (leaf)
> partitions. Because of this limitation, primary keys are not allowed on a
> partitioned table. Perhaps, we should be able to just create a dummy
> entry somewhere to represent an index on parent (which every partition
> then copies.)
This would be preferable, yes. Making users remember to manually create
indexes on each partition is undesirable.
I think it is useful to allow additional indexes on partitions, if desired, but we should always automatically build the indexes that are defined on the master when we create a new partition.
Presumably unique indexes will be allowed on partitions. So if the partition key is unique, we can say the whole partitioned table is unique and call that a Primary Key.
I would want individual partitions to be placed on separate tablespaces, but not by default.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
There is no need to define tuple routing triggers. CopyFrom() and
ExecInsert() determine target partition just before performing
heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and
constraints (on parent) are executed for tuple before being routed to a
partition. If no partition can be found, it's an error.
Because row-level AFTER triggers need to save ItemPointers in trigger
event data and defining triggers on partitions (which is where tuples
really go) is not allowed, I could not find a straightforward way to
implement them. So, perhaps we should allow (only) row-level AFTER
triggers on partitions or think of modifying trigger.c to know about this
twist explicitly.
I think tables will eventually need FK support; it's not sustainable as a long-term restriction, though perhaps it's something we can do in a later patch.
You haven't specified what would happen if an UPDATE would change a row's partition. I'm happy to add this to the list of restrictions by saying that the partition key cannot be updated.
We'll need regression tests that cover each restriction and docs that match. This is not something we should leave until last. People read the docs to understand the feature, helping them to reach consensus. So it is for you to provide the docs before, not wait until later. I will begin a code review once you tell me docs and tests are present. We all want the feature, so it's all about the details now.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 19, 2015 at 04:30:39PM +0900, Amit Langote wrote: > On 2015-08-18 PM 10:43, David Fetter wrote: > >> > >> After the first command is done, the second command would take exclusive > >> lock on table_name, scan the table to check if it contains any values > >> outside the boundaries defined by FOR VALUES clause defined previously, > >> throw error if so, mark as valid partition of parent if not. > > > > One small change to make this part more efficient: > > > > 1. Take the access exclusive lock on table_name. > > 2. Check for a matching constraint on it. > > 3. If it's there, mark it as a valid partition. > > 4. If not, check for values outside the boundaries as above. > > > > That's an interesting idea. Thanks! I hope I'm advancing this feature rather than bogging it down... > By a matching constraint, I guess you mean a 'valid' constraint from > which the declared partition constraint can be proven to follow. For > (a simple) example, from a CHECK (a >= 100 AND a < 150) on > table_name, the partition constraint implied by FOR VALUES START > (100) END (200) can be assumed to hold. Well, I was assuming an exact match, but a stricter match seems like a nice-to-have...possibly later. > > Should the be a *valid* constraint? Perhaps that should be > > parameterized, as I'm not yet seeing a compelling argument either > > direction. I'm picturing something like: > > > > ALTER TABLE table_name SET VALID PARTITION OF <parent> [TRUST] > > > > where TRUST would mean that an existing constraint need not be VALID. > > Hmm, I'd think this step must be able to assert the partition > constraint beyond any doubt. If the DBA added the constraint and > marked it invalid, she should first VALIDATE the constraint to make > it valid by performing whatever steps necessary before. IOW, a full > heap scan at least once is inevitable (the reason why we might want > to make this a two step process at all). Am I missing something? There are use cases where we need to warn people that their assertions need to be true, and if those assertions are not true, this will explode, leaving them to pick the resulting shrapnel out of their faces. There are other parts of the system where this is true, as when people write UDFs in C. As I understand it, NOT VALID means, "I assert that the tuples already here fit the constraint. Any changes will be checked against the constraint." I've seen cases where a gigantic amount of data is coming out of some distributed system which holds the constraint as an invariant. This let a DBA decide to add a NOT VALID constraint in order not to take the hit of a second full scan of the data, which might have made the import, and possibly the entire project, untenable. See above. > >> 5. Detach partition > >> > >> ALTER TABLE partitioned_table > >> DETACH PARTITION partition_name [USING table_name] > >> > >> This removes partition_name as partition of partitioned_table. > >> The table continues to exist with the same name or 'table_name', > >> if specified. pg_class.relispartition is set to false for the > >> table, so it behaves like a normal table. > > > > Could this take anything short of an access exclusive lock on the > > parent? > > Yes, both the step 1 of ATTACH command and DETACH command take > access exclusive lock on the parent. They are rather quick metadata > changes, so should not stall others significantly, I think. So no. Weakening required locks has been something of an ongoing project, project-wide, and need not be part of the first cut of this long-needed feature. 
Thanks so much for working on this! Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
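For context, the existing NOT VALID machinery that this TRUST idea leans on looks like this (table and constraint names are illustrative):

-- assert the invariant without scanning existing rows; only new or
-- updated rows are checked from here on:
ALTER TABLE staged_data ADD CONSTRAINT staged_data_key_check
  CHECK (a >= 100 AND a < 200) NOT VALID;

-- if a provably valid constraint is required later, this performs
-- the full scan exactly once:
ALTER TABLE staged_data VALIDATE CONSTRAINT staged_data_key_check;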
>On 2015-08-19 AM 02:57, Marc Mamin wrote: >>> 2. Creating a partition of a partitioned table >>> >>> CREATE TABLE table_name >>> PARTITION OF partitioned_table_name >>> FOR VALUES values_spec; >>> >>> Where values_spec is: >>> >>> listvalues: [IN] (val1, ...) >>> >> >> Would it make sense to allow one complementary partition to the listvalues? >> >> listvalues: [[NOT] IN] (val1, ...) >> >> I've thought a few times about moving data with some most common values to dedicated partitions >> and keeping the rest in a separate one... >> > >Thanks, that's definitely something to consider. > >I have been thinking of a sort of default list partition for the "rest" of >values. Would you rather declare that with something like the below than >having to enumerate all the values in a NOT IN list? Or the NOT IN way is >more intuitive/friendly? > >CREATE TABLE _rest PARTITION OF table_name FOR VALUES [ IN ] DEFAULT > >Of course, at most one such partition would be allowed. On the one hand I guess it will be easiest to check for partition overlapping if their definitions all contain the exact allowed values. But this could be generalized to range partitions too: CREATE TABLE _rest FALLBACK PARTITION OF table_name The need for it for range partitions seems very narrow at first glance, but I remember boring administrative work in order to ensure that there always was a partition available for incoming data (from a very old time when I was still working with Oracle). To have it comfortable and nevertheless allow defining new partitions, this would require to always check/move data from the default partition to new partitions at create time. and 2 other thoughts: - In your proposal, the parent table is not materialized at all. Could it be used for the fallback partition? - what about always having a fallback partition? This would reduce the risk of unexpected failures and somewhat help Postgres stand out from the crowd :) regards, Marc Mamin
On 08/19/2015 04:59 AM, Simon Riggs wrote: > I like the idea of a regular partitioning step because it is how you > design such tables - "lets use monthly partitions". > > This gives sanely terse syntax, rather than specifying pages and pages > of exact values in DDL.... > > PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) > START WITH value; Oh, I like that syntax! How would it work if there were multiple columns? Maybe we don't want to allow that for this form? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 19 August 2015 at 21:10, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/19/2015 04:59 AM, Simon Riggs wrote:
>> I like the idea of a regular partitioning step because it is how you
>> design such tables - "lets use monthly partitions".
>>
>> This gives sanely terse syntax, rather than specifying pages and pages
>> of exact values in DDL....
>>
>> PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' )
>> START WITH value;
>
> Oh, I like that syntax!
>
> How would it work if there were multiple columns? Maybe we don't want
> to allow that for this form?
If we went with that, and had:
CREATE TABLE orders (order_id serial, order_date date, item text)
PARTITION BY RANGE ON (order_date) INCREMENT BY (INTERVAL '1 month')
START WITH '2015-01-01';
Where would the following go?
INSERT INTO orders (order_date, item) VALUES ('2014-11-12', 'Old item');
Would there automatically be an "others" partition? Or would it produce an error and act like a constraint?
Thom
On 08/19/2015 01:18 PM, Thom Brown wrote: > On 19 August 2015 at 21:10, Josh Berkus <josh@agliodbs.com > <mailto:josh@agliodbs.com>> wrote: > > On 08/19/2015 04:59 AM, Simon Riggs wrote: > > I like the idea of a regular partitioning step because it is how you > > design such tables - "lets use monthly partitions". > > > > This gives sanely terse syntax, rather than specifying pages and pages > > of exact values in DDL.... > > > > PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) > > START WITH value; > > Oh, I like that syntax! > > How would it work if there were multiple columns? Maybe we don't want > to allow that for this form? > > > If we went with that, and had: > > CREATE TABLE orders (order_id serial, order_date date, item text) > PARTITION BY RANGE ON (order_date) INCREMENT BY (INTERVAL '1 month') > START WITH '2015-01-01'; > > Where would the following go? > > INSERT INTO orders (order_date, item) VALUES ('2014-11-12', 'Old item'); > > Would there automatically be an "others" partition? Or would it produce > an error and act like a constraint? The "others" partition was brought up upthread, as an addition to the original proposal. I really think that an "others" partition needs to be up to the DBA; I've seen apps where they'd want to capture it, and apps where they'd want such an insert to error. I, for one, would be OK with a new partitioning which didn't address the "others" partition issue until 9.7; I see it as a wholly separable improvement. Plus, you can always *manually* add high/low catchall partitions. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
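For instance, manual catchall partitions could look like this under the proposed START-only/END-only forms (names and dates are illustrative):

-- leftmost catchall, covers (-INFINITY, '2015-01-01'):
CREATE TABLE orders_low PARTITION OF orders
  FOR VALUES END ('2015-01-01');

-- rightmost catchall, covers ['2016-01-01', +INFINITY):
CREATE TABLE orders_high PARTITION OF orders
  FOR VALUES START ('2016-01-01');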
On 2015-08-19 PM 09:23, Simon Riggs wrote: > On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > wrote: > >> >> There is no need to define tuple routing triggers. CopyFrom() and >> ExecInsert() determine target partition just before performing >> heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and >> constraints (on parent) are executed for tuple before being routed to a >> partition. If no partition can be found, it's an error. >> >> Because row-level AFTER triggers need to save ItemPointers in trigger >> event data and defining triggers on partitions (which is where tuples >> really go) is not allowed, I could not find a straightforward way to >> implement them. So, perhaps we should allow (only) row-level AFTER >> triggers on partitions or think of modifying trigger.c to know about this >> twist explicitly. >> > > I think tables will eventually need FK support; its not sustainable as a > long term restriction, though perhaps its something we can do in a later > patch. > Sure. Solving the row-level AFTER trigger problem should hopefully open up the possibility of partitioned-table-as-FK-rel implementation. > You haven't specified what would happen if an UPDATE would change a row's > partition. I'm happy to add this to the list of restrictions by saying that > the partition key cannot be updated. > UPDATEs that change a row's partition would cause an error. I haven't implemented that yet but will do so in the next patch. By the last sentence, do you mean only UPDATEs to the partition key that cause rows to jump partitions or simply any UPDATEs to the partition key? > We'll need regression tests that cover each restriction and docs that > match. This is not something we should leave until last. People read the > docs to understand the feature, helping them to reach consensus. So it is > for you to provide the docs before, not wait until later. I will begin a > code review once you tell me docs and tests are present. We all want the > feature, so its all about the details now. > Sorry, should have added tests and docs already. I will add them in the next version of the patch. Thanks for being willing to review. Thanks, Amit
On Thu, Aug 20, 2015 at 11:16 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2015-08-19 PM 09:23, Simon Riggs wrote:
> We'll need regression tests that cover each restriction and docs that
> match. This is not something we should leave until last. People read the
> docs to understand the feature, helping them to reach consensus. So it is
> for you to provide the docs before, not wait until later. I will begin a
> code review once you tell me docs and tests are present. We all want the
> feature, so its all about the details now.
>
Sorry, should have added tests and docs already. I will add them in the
next version of the patch.
Yes, those would be really good to have before any review so that it is possible to get a grasp of what this patch does. I would like to look at it in more depth as well.
Thanks for willing to review.
--
Michael
On 20 August 2015 at 03:16, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
Sorry, should have added tests and docs already. I will add them in the
next version of the patch. Thanks for willing to review.
Thanks for picking up this challenge. It's easier if you have someone interested all the way.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-20 AM 05:10, Josh Berkus wrote: > On 08/19/2015 04:59 AM, Simon Riggs wrote: >> I like the idea of a regular partitioning step because it is how you >> design such tables - "lets use monthly partitions". >> >> This gives sanely terse syntax, rather than specifying pages and pages >> of exact values in DDL.... >> >> PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) >> START WITH value; > > Oh, I like that syntax! > > How would it work if there were multiple columns? Maybe we don't want > to allow that for this form? > Yea, we could simply restrict it to the single column case, which does not sound like a major restriction. Thanks, Amit
On Tue, Aug 18, 2015 at 4:00 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
Hi,
I would like propose $SUBJECT for this development cycle. Attached is a
WIP patch that implements most if not all of what's described below. Some
yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
Syntax
======
1. Creating a partitioned table
CREATE TABLE table_name
PARTITION BY {RANGE|LIST}
ON (column_list);
Where column_list consists of simple column names or expressions:
PARTITION BY LIST ON (name)
PARTITION BY RANGE ON (year, month)
PARTITION BY LIST ON ((lower(left(name, 2)))
PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))
How about HASH partitioning? Are there plans to support foreign tables as partitions?
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20 August 2015 at 10:10, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
On 2015-08-20 AM 05:10, Josh Berkus wrote:
> On 08/19/2015 04:59 AM, Simon Riggs wrote:
>> I like the idea of a regular partitioning step because it is how you
>> design such tables - "lets use monthly partitions".
>>
>> This gives sanely terse syntax, rather than specifying pages and pages
>> of exact values in DDL....
>>
>> PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' )
>> START WITH value;
>
> Oh, I like that syntax!
>
> How would it work if there were multiple columns? Maybe we don't want
> to allow that for this form?
>
Yea, we could simply restrict it to the single column case, which does not
sound like a major restriction.
PARTITION BY ...
SUBPARTITION BY ...
We should plan for that in the way we develop the internals, but full support can wait until later patches.
My view has long been that the internals are the key aspect here, not the syntax. We need to be able to have a very fast partition-selection mechanism that can be used in the planner or executor for each tuple. Working backwards, we need a relcache representation that allows that, and a catalog representation that allows that and syntax to match.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-20 PM 06:27, Pavan Deolasee wrote: > On Tue, Aug 18, 2015 at 4:00 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp >> wrote: >> >> PARTITION BY LIST ON (name) >> PARTITION BY RANGE ON (year, month) >> >> PARTITION BY LIST ON ((lower(left(name, 2))) >> PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d))) >> >> > > How about HASH partitioning? Are there plans to support foreign tables as > partitions? > I've not given HASH partitioning a lot of thought yet. About the latter, it seems it should not be too difficult to incorporate into the proposed partitioning internals. Thanks, Amit
On 2015-08-20 PM 06:34, Simon Riggs wrote: > On 20 August 2015 at 10:10, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > wrote: > >> On 2015-08-20 AM 05:10, Josh Berkus wrote: >>>> PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) >>>> START WITH value; >>> >>> Oh, I like that syntax! >>> >>> How would it work if there were multiple columns? Maybe we don't want >>> to allow that for this form? >>> >> >> Yea, we could simply restrict it to the single column case, which does not >> sound like a major restriction. >> > > PARTITION BY ... > SUBPARTITION BY ... > > We should plan for that in the way we develop the internals, but full > support can wait until later patches. > At the moment, a form of SUBPARTITION BY is to allow PARTITION BY in a partition definition. But I can see that may not be what people would expect. > My view has long been that the internals are they aspect here, not the > syntax. We need to be able to have a very fast partition-selection > mechanism that can be used in the planner or executor for each tuple. > Working backwards, we need a relcache representation that allows that, and > a catalog representation that allows that and syntax to match. > Agreed. Thanks, Amit
On 2015-08-19 AM 02:59, Corey Huinker wrote: > > Quick thoughts borne of years of slugging it out with partitions on Oracle: > > - Finally!!!!!!!!!!! > > - Your range partitioning will need to express exclusive/inclusive bounds, > or go to the Oracle model where every partition is a cascading "values less > than" test context dependent on the partitions defined before it. I would > suggest that leveraging existing range types (or allowing the user to > specify a range type, like for a specific collation of a text range) would > allow for the most flexible and postgres-ish range definition. You seem to > do this with the "[USING] opclass_name" bit, but don't follow through on > the START(...) and END(...). Something like FOR VALUES <@ > '[''2014-01-01'',''2015-01-01)'::daterange would cover most needs > succinctly, though I admit the syntax for complex ranges could be > cumbersome, though something like FOR VALUES <@ > '[(''a'',1),(''b'',1))'::letter_then_number_range is still readable. > It seems the way of specifying per-partition definition/constraint, especially for range partitioning, would have a number of interesting alternatives. By the way, the [USING opclass_name] bit is just a way of telling that a particular key column has user-defined notion of "ordering" in case of range partitioning and "equality" for list partitioning. The opclass would eventually determine which WHERE clauses (looking at operators, operand types) are candidates to help prune partitions. If we use the range_op range_literal::range_type notation to describe partition constraint for each partition, it might not offer much beyond the readability. We are not really going to detect range operators being applied in WHERE conditions to trigger partition pruning, for example. Although I may be missing something... > - No partitioning scheme survives first contact with reality. So you will > need a facility for splitting and joining existing partitions. For > splitting partitions, it's sufficient to require that the new partition > share either a upper/lower bound (with the same inclusivity/exclusivity) of > an existing partition, thus uniquely identifying the partition to be split, > and require that the other bound be within the range of the partition to be > split. Similarly, it's fair to require that the partitions to be joined be > adjacent in range. In both cases, range operators make these tests simple. > SPLIT/MERGE can be done in later patches/release, I think. > - Your features 4 and 5 are implemented in Oracle with SWAP PARTITION, > which is really neat for doing ETLs and index rebuilds offline in a copy > table, and then swapping the data segment of that table with the partition > specified. Which could be considered cheating because none of the partition > metadata changed, just the pointers to the segments. We already do this > with adding removing INHERIT. I'm not saying they can't be separate > functionality, but keeping an atomic SWAP operation would be grand. > I think we can manage to find ways to make the proposed ATTACH/DETACH as useful and convenient. Thanks for reminding of SWAP PARTITION. Regards, Amit
On 2015-08-19 PM 09:52, David Fetter wrote: > On Wed, Aug 19, 2015 at 04:30:39PM +0900, Amit Langote wrote: >>> >>> One small change to make this part more efficient: >>> >>> 1. Take the access exclusive lock on table_name. >>> 2. Check for a matching constraint on it. >>> 3. If it's there, mark it as a valid partition. >>> 4. If not, check for values outside the boundaries as above. >>> >> >> That's an interesting idea. Thanks! > > I hope I'm advancing this feature rather than bogging it down... > Definitely advancing. >> By a matching constraint, I guess you mean a 'valid' constraint from >> which the declared partition constraint can be proven to follow. For >> (a simple) example, from a CHECK (a >= 100 AND a < 150) on >> table_name, the partition constraint implied by FOR VALUES START >> (100) END (200) can be assumed to hold. > > Well, I was assuming an exact match, but a stricter match seems like a > nice-to-have...possibly later. > >>> Should the be a *valid* constraint? Perhaps that should be >>> parameterized, as I'm not yet seeing a compelling argument either >>> direction. I'm picturing something like: >>> >>> ALTER TABLE table_name SET VALID PARTITION OF <parent> [TRUST] >>> >>> where TRUST would mean that an existing constraint need not be VALID. >> >> Hmm, I'd think this step must be able to assert the partition >> constraint beyond any doubt. If the DBA added the constraint and >> marked it invalid, she should first VALIDATE the constraint to make >> it valid by performing whatever steps necessary before. IOW, a full >> heap scan at least once is inevitable (the reason why we might want >> to make this a two step process at all). Am I missing something? > > There are use cases where we need to warn people that their assertions > need to be true, and if those assertions are not true, this will > explode, leaving them to pick the resulting shrapnel out of their > faces. There are other parts of the system where this is true, as > when people write UDFs in C. > > As I understand it, NOT VALID means, "I assert that the tuples already > here fit the constraint. Any changes will be checked against the > constraint." > > I've seen cases where a gigantic amount of data is coming out of some > distributed system which holds the constraint as an invariant. This > let a DBA decide to add a NOT VALID constraint in order not to take > the hit of a second full scan of the data, which might have made the > import, and possibly the entire project, untenable. > > See above. > Ah, I understand the point of parameterization (TRUST). Seems like it would be good to have with appropriate documentation of the same. Perhaps, it might as well a parameter to the step 1 itself. >>>> 5. Detach partition >>>> >>>> ALTER TABLE partitioned_table >>>> DETACH PARTITION partition_name [USING table_name] >>>> >>>> This removes partition_name as partition of partitioned_table. >>>> The table continues to exist with the same name or 'table_name', >>>> if specified. pg_class.relispartition is set to false for the >>>> table, so it behaves like a normal table. >>> >>> Could this take anything short of an access exclusive lock on the >>> parent? >> >> Yes, both the step 1 of ATTACH command and DETACH command take >> access exclusive lock on the parent. They are rather quick metadata >> changes, so should not stall others significantly, I think. > > So no. Weakening required locks has been something of an ongoing > project, project-wide, and need not be part of the first cut of this > long-needed feature. 
Do you mean ATTACH and DETACH, if they require access exclusive lock on the parent, should not be in the first cut? Or am I misreading? If so, there is no way to drop partitions. With the patch, it would be achieved with detach and drop (if required).
> Thanks so much for working on this!
Thanks for the feedback and suggestions! Regards, Amit
On 8/20/15 5:45 AM, Amit Langote wrote: > On 2015-08-20 PM 06:27, Pavan Deolasee wrote: >> On Tue, Aug 18, 2015 at 4:00 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp >>> wrote: >>> >>> PARTITION BY LIST ON (name) >>> PARTITION BY RANGE ON (year, month) >>> >>> PARTITION BY LIST ON ((lower(left(name, 2))) >>> PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d))) >>> >>> >> >> How about HASH partitioning? Are there plans to support foreign tables as >> partitions? >> > > I've not given HASH partitioning a lot of thought yet. About the latter, > it seems it should not be too difficult to incorporate into the proposed > partitioning internals. Hash partitioning seems like it could wait. In fact, I've nearly always implemented hash partitions as list partitions. This gives me control over the hash function and allows me to use properties of the key to my advantage. For instance - if your key is a sha1 hash there's no need to rehash, just grab the required number of bits off the end of the key. My experiences with Oracle's hash function were generally not good - there's a reason many hash algorithms exist. If/when we do hash partitioning in Postgres I'd like to see the hash function be user-definable. Meanwhile, I think list and range are a good start. I'd prefer to see sub-partitioning before hash partitioning. -- -David david@pgmasters.net
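A sketch of that list-as-hash approach under the proposed syntax, partitioning on the last hex digit of a pre-hashed key (table and column names are illustrative):

CREATE TABLE events (id_sha1 text, payload text)
PARTITION BY LIST ON ((right(id_sha1, 1)));

CREATE TABLE events_p0 PARTITION OF events FOR VALUES IN ('0', '1', '2', '3');
CREATE TABLE events_p1 PARTITION OF events FOR VALUES IN ('4', '5', '6', '7');
CREATE TABLE events_p2 PARTITION OF events FOR VALUES IN ('8', '9', 'a', 'b');
CREATE TABLE events_p3 PARTITION OF events FOR VALUES IN ('c', 'd', 'e', 'f');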
On Thu, Aug 20, 2015 at 06:58:24PM +0900, Amit Langote wrote: > On 2015-08-19 PM 09:52, David Fetter wrote: > > On Wed, Aug 19, 2015 at 04:30:39PM +0900, Amit Langote wrote: > > > > There are use cases where we need to warn people that their assertions > > need to be true, and if those assertions are not true, this will > > explode, leaving them to pick the resulting shrapnel out of their > > faces. There are other parts of the system where this is true, as > > when people write UDFs in C. > > > > As I understand it, NOT VALID means, "I assert that the tuples already > > here fit the constraint. Any changes will be checked against the > > constraint." > > > > I've seen cases where a gigantic amount of data is coming out of some > > distributed system which holds the constraint as an invariant. This > > let a DBA decide to add a NOT VALID constraint in order not to take > > the hit of a second full scan of the data, which might have made the > > import, and possibly the entire project, untenable. > > Ah, I understand the point of parameterization (TRUST). Seems like it > would be good to have with appropriate documentation of the same. Perhaps, > it might as well a parameter to the step 1 itself. So TRUST would obviate step 2? Better still! > >>>> 5. Detach partition > >>>> > >>>> ALTER TABLE partitioned_table > >>>> DETACH PARTITION partition_name [USING table_name] > >>>> > >>>> This removes partition_name as partition of partitioned_table. > >>>> The table continues to exist with the same name or 'table_name', > >>>> if specified. pg_class.relispartition is set to false for the > >>>> table, so it behaves like a normal table. > >>> > >>> Could this take anything short of an access exclusive lock on the > >>> parent? > >> > >> Yes, both the step 1 of ATTACH command and DETACH command take > >> access exclusive lock on the parent. They are rather quick metadata > >> changes, so should not stall others significantly, I think. > > > > So no. Weakening required locks has been something of an ongoing > > project, project-wide, and need not be part of the first cut of this > > long-needed feature. > > > > Do you mean ATTACH and DETACH, if they require access exclusive lock on > the parent, should not be in the first cut? Or am I misreading? Sorry I was unclear. ATTACH and DETACH should be in the first cut even if they require an access exclusive lock. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
It seems the way of specifying per-partition definition/constraint,
especially for range partitioning, would have a number of interesting
alternatives.
By the way, the [USING opclass_name] bit is just a way of telling that a
particular key column has user-defined notion of "ordering" in case of
range partitioning and "equality" for list partitioning. The opclass would
eventually determine which WHERE clauses (looking at operators, operand
types) are candidates to help prune partitions. If we use the range_op
range_literal::range_type notation to describe partition constraint for
each partition, it might not offer much beyond the readability. We are not
really going to detect range operators being applied in WHERE conditions
to trigger partition pruning, for example. Although I may be missing
something...
I don't think it's important that ranges be used internally for partition pruning, though I don't see a reason why they couldn't. I was making the case for range types being used at partition creation/declaration time, because you were proposing new syntax to describe a range of values in text when we already have that in range types. We should eat our own dog food.
But mostly, I wanted to make sure that range partitions could have inclusive and exclusive bounds.
> - No partitioning scheme survives first contact with reality. So you will
> need a facility for splitting and joining existing partitions. For
> splitting partitions, it's sufficient to require that the new partition
> share either a upper/lower bound (with the same inclusivity/exclusivity) of
> an existing partition, thus uniquely identifying the partition to be split,
> and require that the other bound be within the range of the partition to be
> split. Similarly, it's fair to require that the partitions to be joined be
> adjacent in range. In both cases, range operators make these tests simple.
>
SPLIT/MERGE can be done in later patches/release, I think.
Later patches for sure. When a partition "fills up", it's a critical performance issue, so the fewer steps needed to amend it the better.
My experiences with Oracle's hash function were generally not good -
there's a reason many hash algorithms exist. If/when we do hash
partitioning in Postgres I'd like to see the hash function be
user-definable.
+1
In my experience, hash partitioning had one use: When you had run out of ways to logically partition the data, AND you had two tables in a foreign key relationship, AND that relationship was the most common use-case for selecting the two tables. In which case, your one big join became 8 or 16 little ones, and all was well in the universe...
...until those little joins started to spill to disk, and now you need to repartition by 2x hashes, and that basically means a reload of your primary fact table, and a talk with a boss about why you painted yourself into a corner when you first did that.
Meanwhile, I think list and range are a good start. I'd prefer to see
sub-partitioning before hash partitioning.
+1
On 08/20/2015 06:19 AM, David Fetter wrote: > On Thu, Aug 20, 2015 at 06:58:24PM +0900, Amit Langote wrote: >> Do you mean ATTACH and DETACH, if they require access exclusive lock on >> the parent, should not be in the first cut? Or am I misreading? > > Sorry I was unclear. > > ATTACH and DETACH should be in the first cut even if they require an > access exclusive lock. > > Cheers, > David. I don't see a way for them to *ever* not require an access exclusive lock. We could eventually implement: DETACH PARTITION CONCURRENTLY ... but that's the only way I can see around it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Aug 20, 2015 at 11:16:37AM +0900, Amit Langote wrote: > On 2015-08-19 PM 09:23, Simon Riggs wrote: > > On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > > wrote: > > You haven't specified what would happen if an UPDATE would change a row's > > partition. I'm happy to add this to the list of restrictions by saying that > > the partition key cannot be updated. > > UPDATEs that change a row's partition would cause error. I haven't > implemented that yet but will that way in the next patch. > > By the last sentence, do you mean only UPDATEs to the partition key that > cause rows to jump partitions or simply any UPDATEs to the partition key? I don't know what Simon had in mind, but it seems to me that we have the following in descending order of convenience to users, and I presume, descending order of difficulty of implementation: 1. Updates propagate transparently, moving rows between partitions if needed. 2. Updates fail only if the change to a partition key would cause the row to have to move to another partition. 3. All updates to the partition key fail. Whichever of these we choose, we should document it with great clarity. > > We'll need regression tests that cover each restriction and docs > > that match. This is not something we should leave until last. > > People read the docs to understand the feature, helping them to > > reach consensus. So it is for you to provide the docs before, not > > wait until later. I will begin a code review once you tell me docs > > and tests are present. We all want the feature, so its all about > > the details now. > > Sorry, should have added tests and docs already. I will add them in > the next version of the patch. Thanks for willing to review. Docs and tests are crucial, and thanks again for taking this on. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 2015-08-20 PM 10:19, David Fetter wrote: > On Thu, Aug 20, 2015 at 06:58:24PM +0900, Amit Langote wrote: >> >> Ah, I understand the point of parameterization (TRUST). Seems like it >> would be good to have with appropriate documentation of the same. Perhaps, >> it might as well a parameter to the step 1 itself. > > So TRUST would obviate step 2? Better still! > I really wish we could do this in one step, but... As you know, primary intention behind keeping step 2 separate is to do the heavy validation work with access exclusive lock on the other table instead of on the master table. Assuming we add TRUST TO step 1 (with implications clarified in docs), when/if the user does not use TRUST, the default behavior being to perform the validation, would it be OK as it would have to take the expensive lock on the master table? Motivation for keeping the step 2 (and TRUST parameter with it) is really to avoid this last concern. All that said, we'd still need some way to tell rows that came from a TRUSTed table but do not really belong to the partition because we skipped checking that at all. Obviously one can always write a query on the partition that can find them, but some form of DDL would be what people might expect. >> >> Do you mean ATTACH and DETACH, if they require access exclusive lock on >> the parent, should not be in the first cut? Or am I misreading? > > Sorry I was unclear. > > ATTACH and DETACH should be in the first cut even if they require an > access exclusive lock. > Ah, got it. Thanks, Amit
On 2015-08-21 AM 06:27, David Fetter wrote: >> By the last sentence, do you mean only UPDATEs to the partition key that >> cause rows to jump partitions or simply any UPDATEs to the partition key? > > I don't know what Simon had in mind, but it seems to me that we have > the following in descending order of convenience to users, and I > presume, descending order of difficulty of implementation: > > 1. Updates propagate transparently, moving rows between partitions if needed. > > 2. Updates fail only if the change to a partition key would cause the > row to have to move to another partition. > > 3. All updates to the partition key fail. > I was thinking I'd implement 2. There was some discussion last year[1] about how 1 could be realized. Thanks, Amit [1] http://www.postgresql.org/message-id/20140829172216.GF10109@awork2.anarazel.de
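A sketch of how option 2 would behave from a user's point of view (table and values are illustrative; this is the proposed behavior, not something the patch implements yet):

-- t is list-partitioned ON (state); the row currently lives in the 'IL' partition
UPDATE t SET name = 'foo' WHERE id = 1;    -- partition key unchanged: allowed
UPDATE t SET state = 'WI' WHERE id = 1;    -- row would have to move to another
                                           -- partition, so this would be rejected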
On Fri, Aug 21, 2015 at 11:22 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2015-08-21 AM 06:27, David Fetter wrote:
>> By the last sentence, do you mean only UPDATEs to the partition key that
>> cause rows to jump partitions or simply any UPDATEs to the partition key?
>
> I don't know what Simon had in mind, but it seems to me that we have
> the following in descending order of convenience to users, and I
> presume, descending order of difficulty of implementation:
>
> 1. Updates propagate transparently, moving rows between partitions if needed.
>
> 2. Updates fail only if the change to a partition key would cause the
> row to have to move to another partition.
>
> 3. All updates to the partition key fail.
>
I was thinking I'd implement 2.
+1. IMHO that's a good starting point.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8/18/15 12:31 PM, Josh Berkus wrote: > Also this would be useful for range > partitions: > > CREATE PARTITION ON parent_table USING ( start_value ); > > ... where start_value is the start range of the new partition. Again, > easier for users to get correct. Instead of that, I think it would be more foolproof to do CREATE PARTITION ON parent_table FOR ( value1, ... ); instead of trusting the user to get the exact start value correct. Though... I guess there could be value in allowing an exact start value but throwing an error if it doesn't sit exactly on a boundary. Might make it less likely to accidentally create the wrong partition. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
On 08/21/2015 08:34 PM, Jim Nasby wrote: > On 8/18/15 12:31 PM, Josh Berkus wrote: >> Also this would be useful for range >> partitions: >> >> CREATE PARTITION ON parent_table USING ( start_value ); >> >> ... where start_value is the start range of the new partition. Again, >> easier for users to get correct. > > Instead of that, I think it would be more foolproof to do > > CREATE PARTITION ON parent_table FOR ( value1, ... ); > > instead of trusting the user to get the exact start value correct. > > Though... I guess there could be value in allowing an exact start value > but throwing an error if it doesn't sit exactly on a boundary. Might > make it less likely to accidentally create the wrong partition. Well, I'm figuring that most people would use "CREATE NEXT PARTITION" instead. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 24 August 2015 at 00:53, Josh Berkus <josh@agliodbs.com> wrote:
--
On 08/21/2015 08:34 PM, Jim Nasby wrote:
> On 8/18/15 12:31 PM, Josh Berkus wrote:
>> Also this would be useful for range
>> partitions:
>>
>> CREATE PARTITION ON parent_table USING ( start_value );
>>
>> ... where start_value is the start range of the new partition. Again,
>> easier for users to get correct.
>
> Instead of that, I think it would be more foolproof to do
>
> CREATE PARTITION ON parent_table FOR ( value1, ... );
>
> instead of trusting the user to get the exact start value correct.
>
> Though... I guess there could be value in allowing an exact start value
> but throwing an error if it doesn't sit exactly on a boundary. Might
> make it less likely to accidentally create the wrong partition.
Well, I'm figuring that most people would use "CREATE NEXT PARTITION"
instead.
ALTER TABLE foo ADD PARTITION NEXT;
when a sequence of partitions has been defined
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/24/2015 03:46 AM, Simon Riggs wrote: > > > ALTER TABLE foo ADD PARTITION NEXT; > > when a sequence of partitions has been defined > > Or perhaps ALTER TABLE foo ADD PARTITION NEXT nn; So you can set up, say, a week's worth of daily partitions at once. I could also imagine a variant that allows you to specify partitions up to the one including some value. cheers andrew
On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
I would like propose $SUBJECT for this development cycle. Attached is a
WIP patch that implements most if not all of what's described below. Some
yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
Thanks for working on this. It's a great start.
3. Multi-level partitioning
CREATE TABLE table_name
PARTITION OF partitioned_table_name
FOR VALUES values_spec
PARTITION BY {RANGE|LIST} ON (columns_list)
This variant implements a form of so called composite or sub-partitioning
with arbitrarily deep partitioning structure. A table created using this
form has both the relkind RELKIND_PARTITIONED_REL and
pg_class.relispartition set to true.
Multi-level partitioning is probably going to complicate things beyond sanity.
One RELKIND_PARTITIONED_REL with lots of partitions sounds best to me. We can still have N dimensions of partitioning (or partitioning and subpartitioning, if you prefer that term)
The patch does not yet implement any planner changes for partitioned
tables, although I'm working on the same and post updates as soon as
possible. That means, it is not possible to run SELECT/UPDATE/DELETE
queries on partitioned tables without getting:
postgres=# SELECT * FROM persons;
ERROR: could not open file "base/13244/106975": No such file or directory
Given that there would be more direct ways of performing partition pruning
decisions with the proposed, it would be nice to utilize them.
Specifically, I would like to avoid having to rely on constraint exclusion
for partition pruning whereby subquery_planner() builds append_rel_list
and the later steps exclude useless partitions.
This is really the heart of this patch/design. You can work for months on all the rest of this, but you will live or die by how the optimization works because that is the thing we really need to work well. Previous attempts ignored this aspect and didn't get committed. It's hard, perhaps even scary, but it's critical. It's the 80/20 rule in reverse - 20% of the code is 80% of the difficulty.
I suggest you write a partition query test script .sql and work towards making this work. Not exhaustive and weird tests, but 5-10 key queries that need to be optimized precisely and quickly. I'm sure that's been done before.
Will include the following once we start reaching consensus on main parts
of the proposed design/implementation:
* New regression tests
* Documentation updates
* pg_dump, psql, etc.
For reference, some immediately previous discussions:
* On partitioning *
http://www.postgresql.org/message-id/20140829155607.GF7705@eldon.alvh.no-ip.org
* Partitioning WIP patch *
http://www.postgresql.org/message-id/54EC32B6.9070605@lab.ntt.co.jp
If you want to achieve consensus, please write either docs or README files that explain how this works.
It took me a few seconds to notice deviations from Alvaro's original post. I shouldn't have to read a full thread to see what the conclusions were, you need to record them coherently.
Some great examples of such things are
src/backend/optimizer/README
src/backend/access/nbtree/README
Please imagine how far such code would have got without them, then look at the code comments on the top of each of the functions in that area for examples of the clarity of design this needs.
Comments welcome!
Yes, comments in code are indeed welcome, as well as the README/docs.
I couldn't see why you invented a new form of Alter Table recursion.
We will need to support multi-row batched COPY.
I'm pleased to see this patch and will stay with it to completion, perhaps others also. We have 3 more CFs in this release, Nov, Jan, Mar - so this has a great chance of making it into 9.6. The current patch implements a bunch of stuff, but it's hard to say what, how or why it does it, and without the planner stuff it's all moot. My recommendation is we say "Returned with Feedback" on this now, looking forward to the next patch.
If you submit another patch before Nov, I will review it without waiting for Nov 1.
There will be much discussion on syntax, but that is not the key point. DDL Support routines are usually pretty straightforward too, so that can be left for now.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Needs planner work and tests of that. ALTER TABLE etc can wait.

The new status of this patch is: Waiting on Author
Sorry about the long delay in replying, to this message or the others posted in the last few days. I should have notified in advance of my vacation with rather limited Internet access. On 2015-08-26 PM 11:00, Simon Riggs wrote: > On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > >> 3. Multi-level partitioning >> >> CREATE TABLE table_name >> PARTITION OF partitioned_table_name >> FOR VALUES values_spec >> PARTITION BY {RANGE|LIST} ON (columns_list) >> >> This variant implements a form of so called composite or sub-partitioning >> with arbitrarily deep partitioning structure. A table created using this >> form has both the relkind RELKIND_PARTITIONED_REL and >> pg_class.relispartition set to true. >> > > Multi-level partitioning is probably going to complicate things beyond > sanity. > > One RELKIND_PARTITIONED_REL with lots of partitions sounds best to me. We > can still have N dimensions of partitioning (or partitioning and > subpartitioning, if you prefer that term) > Implementation in the patch lets RELKIND_PARTITIONED_REL under another and so on to which appears to give a capability to have arbitrarily deep partitioning structure. That might feel unnecessary. Initial motivation to go for such design was to not have to complicate the partition key catalog with details of how to accommodate some notion of sub-partition template. ISTM, most usecases (as covered by other databases) would not require to use more than 2 levels. Maybe, we should consider catering to only that set of usecases using some sub-partition template based design (and then some specialized syntax). Or, we could just not handle sub-partitioning initially, by which I mean composite partitioning where you can have schemes such as range-list, list-range, range-hash, etc. We do still have multi-column range partitioning. In any case, we have to have a design that will cater to the multi-level partitioning. > >> The patch does not yet implement any planner changes for partitioned >> tables, although I'm working on the same and post updates as soon as >> possible. That means, it is not possible to run SELECT/UPDATE/DELETE >> queries on partitioned tables without getting: >> >> postgres=# SELECT * FROM persons; >> ERROR: could not open file "base/13244/106975": No such file or directory >> >> Given that there would be more direct ways of performing partition pruning >> decisions with the proposed, it would be nice to utilize them. >> Specifically, I would like to avoid having to rely on constraint exclusion >> for partition pruning whereby subquery_planner() builds append_rel_list >> and the later steps exclude useless partitions. >> > > This is really the heart of this patch/design. You can work for months on > all the rest of this, but you will live or die by how the optimization > works because that is the thing we really need to work well. Previous > attempts ignored this aspect and didn't get committed. It's hard, perhaps > even scary, but its critical. It's the 80/20 rule in reverse - 20% of the > code is 80% of the difficulty. > > I suggest you write a partition query test script .sql and work towards > making this work. Not exhaustive and weird tests, but 5-10 key queries that > need to be optimized precisely and quickly. I'm sure that's been done > before. > Yes, I am working on this and hope to have something to show soon. > > I couldn't see why you invented a new form of Alter Table recursion. 
> It was intended to keep the ALTER TABLE considerations for inherited tables (and typed tables) separate from those for partitioned tables. But... This begs a larger question that I did not try to answer in this design/patch - for partitions, do we need to have any catalog entries other than the pg_class tuple? If we manage to not require them, we would not need any AT recursion business at all except for cases that require pg_class tuple updates. The pg_class tuple would be the only authoritative catalog entry for partitions. Everything else belongs with the master table. That includes catalog entries for attributes, constraints, triggers, etc. Alvaro had mentioned something like this in his proposal. Although, I hope that such radical design is not very difficult to realize/code. Also, we might have to go back to the slightly controversial question of whether partitions share the same namespace as normal tables (partitioned or not). In the patch, partitions are created using CREATE TABLE, altered with ALTER TABLE (albeit with several restrictions). How about rather partitions are created/altered using: ALTER TABLE master CREATE PARTITION <name> ..., ALTER TABLE master MODIFY PARTITION <name> ... (as mentioned above) AT commands covered by the latter should only ever require updating the pg_class tuple for the named partition. ALTER TABLE <name> directly on a partition can be made to say something like the following under this scheme: ERROR: <name> is not a table Although, I'm not thinking of hiding partitions from everyone. For example, maintenance commands like VACUUM/ANALYZE (including autovacuum, of course) would be able to see them. > We will need to support multi-row batched COPY. Currently, tuple-routing may switch the ResultRelInfo for every consecutive tuple. So, at once I abandoned any hope of adapting the routing mechanism to heap_multi_insert() API. But, perhaps there is a smarter way. Let me think about that. > > I'm pleased to see this patch and will stay with it to completion, perhaps > others also. We have 3 more CFs in this release, Nov, Jan, Mar - so this > has a great chance of making it into 9.6. The current patch implements a > bunch of stuff, but its hard to say what, how or why it does it and without > the planner stuff its all moot. My recommendation is we say "Returned with > Feedback" on this now, looking forward to next patch. > > If you submit another patch before Nov, I will review it without waiting > for Nov 1. > > There will be much discussion on syntax, but that is not the key point. DDL > Support routines are usually pretty straightforward too, so that can be > left for now. > Sure, let me address number of points you have raised. In the upcoming versions, I will try to provide extensive documentation (both internal and user). As for the syntax and DDL changes you think can be dealt with later, I tend to think let's keep the bare essentials necessary. Thanks, Amit
On 4 September 2015 at 06:51, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
Sorry about the long delay in replying, to this message or the others
posted in the last few days. I should have notified in advance of my
vacation with rather limited Internet access.
No problem, I'm on leave too.
>> The patch does not yet implement any planner changes for partitioned
>> tables, although I'm working on the same and post updates as soon as
>> possible.
...
>
> This is really the heart of this patch/design. You can work for months on
> all the rest of this, but you will live or die by how the optimization
> works because that is the thing we really need to work well. Previous
> attempts ignored this aspect and didn't get committed. It's hard, perhaps
> even scary, but its critical. It's the 80/20 rule in reverse - 20% of the
> code is 80% of the difficulty.
>
> I suggest you write a partition query test script .sql and work towards
> making this work. Not exhaustive and weird tests, but 5-10 key queries that
> need to be optimized precisely and quickly. I'm sure that's been done
> before.
>
Yes, I am working on this and hope to have something to show soon.
No rush, no pressure; let's get this right.
>
> I couldn't see why you invented a new form of Alter Table recursion.
>
It was intended to keep the ALTER TABLE considerations for inherited
tables (and typed tables) separate from those for partitioned tables. But...
This begs a larger question that I did not try to answer in this
design/patch - for partitions, do we need to have any catalog entries
other than the pg_class tuple?
Everything should start from the requirements of the optimization approach. Once we have that clear, we can confirm other requirements.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 18 August 2015 at 12:23, Amit Langote <amitlangote09@gmail.com> wrote:
> Hi Thom,
>
> On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown <thom@linux.com> wrote:
>> Wow, didn't expect to see that email this morning.
>>
>> A very quick test:
>>
>> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item
>> text) partition by range on ((extract(year from
>> purchase_time)),(extract(month from purchase_time)));
>> ERROR: referenced relation "purchases" is not a table or foreign table
>
> Thanks for the quick test.
>
> Damn, I somehow missed adding the new relkind to a check in
> process_owned_by(). Will fix this and look for any such oversights.

This doesn't seem to have materialised. Are you still working on this?

Thom
On Wednesday, 21 October 2015, Thom Brown <thom@linux.com> wrote:
On 18 August 2015 at 12:23, Amit Langote <amitlangote09@gmail.com> wrote:
> Hi Thom,
>
> On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown <thom@linux.com> wrote:
>>
>>
>> Wow, didn't expect to see that email this morning.
>>
>> A very quick test:
>>
>> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item
>> text) partition by range on ((extract(year from
>> purchase_time)),(extract(month from purchase_time)));
>> ERROR: referenced relation "purchases" is not a table or foreign table
>>
>
> Thanks for the quick test.
>
> Damn, I somehow missed adding the new relkind to a check in
> process_owned_by(). Will fix this and look for any such oversights.
This doesn't seem to have materialised. Are you still working on this?
Yes, I will be posting to this thread soon. Sorry about the silence.
Thanks,
Amit
On 20 October 2015 at 18:34, Amit Langote <amitlangote09@gmail.com> wrote:
> On Wednesday, 21 October 2015, Thom Brown <thom@linux.com> wrote:
>> On 18 August 2015 at 12:23, Amit Langote <amitlangote09@gmail.com> wrote:
>>> Hi Thom,
>>>
>>> On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown <thom@linux.com> wrote:
>>>> Wow, didn't expect to see that email this morning.
>>>>
>>>> A very quick test:
>>>>
>>>> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp,
>>>> item text) partition by range on ((extract(year from
>>>> purchase_time)),(extract(month from purchase_time)));
>>>> ERROR: referenced relation "purchases" is not a table or foreign table
>>>
>>> Thanks for the quick test.
>>>
>>> Damn, I somehow missed adding the new relkind to a check in
>>> process_owned_by(). Will fix this and look for any such oversights.
>>
>> This doesn't seem to have materialised. Are you still working on this?
>
> Yes, I will be posting to this thread soon. Sorry about the silence.

Thanks. I'd like to test it again.

Thom
On 2015/08/26 23:01, Simon Riggs wrote:
> The following review has been posted through the commitfest application:
> ...
>
> Needs planner work and tests of that. ALTER TABLE etc can wait.
>
> The new status of this patch is: Waiting on Author

Alright, after spending much time on this I came up with the attached. I'm sincerely sorry this took so long. I'll describe below what's in those patches.

The DDL and catalogs part is not much different from what I had last described, though I took a few steps to simplify things. I dropped the multi-level partitioning bit and changed the range partition creation syntax a little. The ability to specify both min and max bounds for every range partition had made it much more difficult to construct and use the internal representation. Anyway, the attached documentation patch (and the regression tests, to some degree) should give a picture of what the user commands look like as of now, although things are very much subject to overhaul in this general area of user-visible commands and features.

As pointed out by Simon, there is a need to invent a better internal representation of partitioning constraints/metadata to address the planning considerations more effectively and scalably. To that end, I have tried to implement a number of things -

There is a proposed set_partitioned_rel_size() that calls the planner's (proposed) partitioning module, which takes the RelOptInfo of a partitioned table and returns a list of OIDs of the partitions selected to be scanned based on the baserestrictinfo (other partitions are pruned!). The partitioned rel's size is then computed using size estimates for the selected partitions. Later, the proposed set_partitioned_rel_pathlist() will create an Append/MergeAppend path with the cheapest paths for the individual partitions as its subpaths.

It seems important to point out that the selected partitions each get simple_rte_array and simple_rel_array slots (note that those arrays need to be repalloc'd to make those slots available). The partition reloptinfos are of kind RELOPT_PARTITION_REL, which is functionally quite similar to RELOPT_OTHER_MEMBER_REL but different in that it does not require using any auxiliary structures like AppendRelInfo and append_rel_list. The parent-partition mapping is still necessary and is achieved using a field called parent_relid in the partition reloptinfo.

The proposed partitioning module (in the planner) figures out which of the clauses to use for partition pruning. The rules for matching clauses to the partition key look similar to those used to match them to an index, at least with respect to the key representation. Key info includes attribute number, expression tree (for expression key columns) and operator family OID, per key column. While clauses matched to an index are returned as-is to make index paths with, things work differently for clauses matched to partition key column(s). For each matched clause, the Const operand and the operator are added to an operand list and an operator list, respectively, for the matched column. Operands are later evaluated to datums. The datum arrays along with the operator OIDs are passed along to another (general-purpose) partitioning module (proposed). That module returns a list of OIDs of selected partitions after comparing the clause operand datums with the partition bounds. The operator strategies (of the passed operators) coupled with the partitioning method (list or range, currently) determine how partitions are selected. For some partitioning methods, faster partition selection is enabled; for example, being able to use binary search helps make selecting range partitions reasonably fast.

Note that the clauses matched for a given column are effectively ANDed with one another. At the moment, using OR clauses disables partition pruning. But it would be a matter of improving the structure used to pass around the operands and operators to also convey the logical structure of the clauses. Also, even though range partitioning allows a multi-column key, only the clauses referencing the first key column are used to select partitions. A multi-column range partition pruning implementation is saved for later.

Note that the partition bounds are themselves stored as arrays of datums computed on demand (cached in RelationData as part of a structure called PartitionDesc, which stores other info about partitions like their OIDs).

There are not many executor changes currently, except those made to ensure sane I/U/D behaviors (made to the respective ExecI/U/D functions). We can certainly think about implementing a specialized scan node for partitioned tables, but I have put that off for now.

The attached user documentation and regression test patches, though not quite comprehensive, should help one get started.

Thanks,
Amit
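[Editorial note: a hypothetical illustration of the pruning behaviour described above; the table and partition names are invented, and the FOR VALUES spelling follows the earlier proposal rather than whatever the latest patch uses.]

CREATE TABLE events (evt_time timestamp NOT NULL, payload text)
  PARTITION BY RANGE ON (evt_time);

CREATE TABLE events_2015_10 PARTITION OF events
  FOR VALUES START ('2015-10-01') END ('2015-11-01');
CREATE TABLE events_2015_11 PARTITION OF events
  FOR VALUES START ('2015-11-01') END ('2015-12-01');

-- Both clauses reference the leading (and only) range key column, so the
-- planner module would compare their operands against the stored bounds
-- and only events_2015_10 would survive pruning.
SELECT count(*) FROM events
WHERE evt_time >= '2015-10-05' AND evt_time < '2015-10-12';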
On Fri, Oct 30, 2015 at 6:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The DDL and catalogs part are not much different from what I had last
> described though I took a few steps to simplify things. I dropped the
> multi-level partitioning bit

Hmm, that doesn't sound good to me. I think multi-level partitioning is a reasonably important use case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015/11/06 1:29, Robert Haas wrote:
> On Fri, Oct 30, 2015 at 6:08 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The DDL and catalogs part are not much different from what I had last
>> described though I took a few steps to simplify things. I dropped the
>> multi-level partitioning bit
>
> Hmm, that doesn't sound good to me. I think multi-level partitioning
> is a reasonably important use case.

I agree. I'm in the process of reformulating this proposal from the syntax, catalog and DDL-centric perspective and will re-incorporate the multi-level partitioning notion into it. It was a mistake to drop it.

I am thinking of introducing an explicit notion of a sub-partition key and sub-partitions (of the top parent, as far as the syntactic notation is concerned). I guess it would not be unreasonable to think that most use cases that multi-level partitioning serves require at most 2 levels. That will enable us to use a more intuitive syntax and make the internals easier to manage.

Thanks,
Amit
On 20 November 2015 at 09:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
On 2015/11/06 1:29, Robert Haas wrote:
> On Fri, Oct 30, 2015 at 6:08 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The DDL and catalogs part are not much different from what I had last
>> described though I took a few steps to simplify things. I dropped the
>> multi-level partitioning bit
>
> Hmm, that doesn't sound good to me. I think multi-level partitioning
> is a reasonably important use case.
I agree. I'm in the process of reformulating this proposal from the
syntax, catalog and DDL -centric perspective and will re-incorporate
multi-level partitioning notion into it. It was a mistake to drop it.
Drop it?? I think he means "in this initial patch", right Amit L ?
I don't really understand why parallel query was pursued in small pieces, but partitioning needs to happen all in one huge patch. Wishing too many things is going to slow down this feature.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 20, 2015 at 7:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 20 November 2015 at 09:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2015/11/06 1:29, Robert Haas wrote:
>>> On Fri, Oct 30, 2015 at 6:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> The DDL and catalogs part are not much different from what I had last described though I took a few steps to simplify things. I dropped the multi-level partitioning bit
>>>
>>> Hmm, that doesn't sound good to me. I think multi-level partitioning is a reasonably important use case.
>>
>> I agree. I'm in the process of reformulating this proposal from the syntax, catalog and DDL-centric perspective and will re-incorporate multi-level partitioning notion into it. It was a mistake to drop it.
>
> Drop it?? I think he means "in this initial patch", right Amit L ?

Yes, there was some notion of multi-level partitioning in the earlier patch but I removed it from the version I posted on Oct 30. I do intend to re-introduce it though.

Thanks,
Amit
Amit Langote wrote:
> On Fri, Nov 20, 2015 at 7:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Drop it?? I think he means "in this initial patch", right Amit L ?
>
> Yes, there was some notion of multi-level partitioning in the earlier
> patch but I removed it from the version I posted on Oct 30. I do
> intend to re-introduce it though.

I'm with Simon. In my own experience, it's crazy to try to introduce a huge feature such as this one in a single enormous commit. The last patch you posted was 300 kb without any SGML changes.

The way parallel query is being introduced is a good example to follow (or logical multi-master replication, for that matter) --- one infrastructure piece at a time.

Let's build incrementally.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 20, 2015 at 6:44 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Amit Langote wrote:
>> On Fri, Nov 20, 2015 at 7:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Drop it?? I think he means "in this initial patch", right Amit L ?
>>
>> Yes, there was some notion of multi-level partitioning in the earlier
>> patch but I removed it from the version I posted on Oct 30. I do
>> intend to re-introduce it though.
>
> I'm with Simon. In my own experience, it's crazy to try to introduce a
> huge feature such as this one in a single enormous commit. The last
> patch you posted was 300 kb without any SGML changes.
>
> The way parallel query is being introduced is a good example to follow
> (or logical multi-master replication, for that matter) --- one
> infrastructure piece at a time.
>
> Let's build incrementally.

I support building incrementally, but I don't see why we want to change the catalog structure and then change it again. That seems like it makes the project more work, not less.

To me, it seems like there is a pretty obvious approach here: each table can be either a plain table, or a partition root (which can look just like an empty inheritance parent). Then multi-level partitioning falls right out of that design without needing to do anything extra. If you want a single level of partitioning, you partition the original table. If you want two levels of partitioning, you partition the partitions. If you want three levels of partitioning, you partition those. It's being made out here that limiting ourselves to a single level of partitioning makes things simpler, but it's not very clear to me that this is actually true.

I think it is also worth getting the syntax right from the beginning. Even if we decide that patch #1 won't support multi-level partitioning, we should have a plan for the syntax that can be extended to multi-level partitioning. If we decide after releasing partitioning with one syntax that we really wish we'd used some other syntax, that is going to be a really big problem - deprecating the use of => or introducing standard_conforming_strings were projects that took many years to complete. We really only get one shot to get that right. That doesn't mean it's all got to be there in version one, but there had better be a way to extend it to all the things we want to do later or we are going to be deeply, deeply sad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
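[Editorial note: a minimal sketch of the "partition the partitions" idea above, using the CREATE TABLE ... PARTITION OF syntax from the original proposal; the names, bounds and final spelling are all provisional.]

CREATE TABLE orders (order_id bigint, order_date date, region text)
  PARTITION BY RANGE ON (order_date);

-- Level 1: a partition that is itself a partition root
CREATE TABLE orders_2015 PARTITION OF orders
  FOR VALUES START ('2015-01-01') END ('2016-01-01')
  PARTITION BY LIST ON (region);

-- Level 2: a partition of the partition
CREATE TABLE orders_2015_emea PARTITION OF orders_2015
  FOR VALUES IN ('EMEA');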
Robert Haas wrote:
> I support building incrementally, but I don't see why we want to
> change the catalog structure and then change it again. That seems
> like it makes the project more work, not less.

I agree with what you say. I thought you were saying that the implementation had to provide multi-level partitioning from the get-go, not just the design.

> To me, it seems like there is a pretty obvious approach here: each
> table can be either a plain table, or a partition root (which can look
> just like an empty inheritance parent). Then multi-level partitioning
> falls right out of that design without needing to do anything extra.

Sounds reasonable.

> I think it is also worth getting the syntax right from the beginning.

Yes, that's critical. We could implement the whole thing in gram.y and then have the unsupported cases throw errors; then it's easy to see that there are no grammar conflicts to deal with later.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 23, 2015 at 1:44 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> I support building incrementally, but I don't see why we want to
>> change the catalog structure and then change it again. That seems
>> like it makes the project more work, not less.
>
> I agree with what you say. I thought you were saying that the
> implementation had to provide multi-partitioning from the get-go, not
> just the design.

Well, I *hope* that's going to fall out naturally. If it doesn't, I can live with that. But I hope it will.

>> To me, it seems like there is a pretty obvious approach here: each
>> table can be either a plain table, or a partition root (which can look
>> just like an empty inheritance parent). Then multi-level partitioning
>> falls right out of that design without needing to do anything extra.
>
> Sounds reasonable.

Cool.

>> I think it is also worth getting the syntax right from the beginning.
>
> Yes, that's critical. We could implement the whole thing in gram.y and
> then have the unsupported cases throw errors; then it's easy to see that
> there are no grammar conflicts to deal with later.

That's worth considering, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 24, 2015 at 7:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 23, 2015 at 1:44 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Robert Haas wrote:
>>> I support building incrementally, but I don't see why we want to change the catalog structure and then change it again. That seems like it makes the project more work, not less.
>>
>> I agree with what you say. I thought you were saying that the implementation had to provide multi-partitioning from the get-go, not just the design.
>
> Well, I *hope* that's going to fall out naturally. If it doesn't, I can live with that. But I hope it will.
>
>>> To me, it seems like there is a pretty obvious approach here: each table can be either a plain table, or a partition root (which can look just like an empty inheritance parent). Then multi-level partitioning falls right out of that design without needing to do anything extra.
>>
>> Sounds reasonable.
>
> Cool.
>
>>> I think it is also worth getting the syntax right from the beginning.
>>
>> Yes, that's critical. We could implement the whole thing in gram.y and then have the unsupported cases throw errors; then it's easy to see that there are no grammar conflicts to deal with later.
>
> That's worth considering, too.

It seems that the consensus is to rework this patch a bit more. Returned with feedback then?

--
Michael
On 2015/12/02 15:41, Michael Paquier wrote: > It seems that the consensus is to rework a bit more this patch. > Returned with feedback then? Yes, as far as this commitfest is concerned. Or "moved to the next commitfest"? Not sure exactly which makes sense. Thanks, Amit
On Wed, Dec 2, 2015 at 3:55 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2015/12/02 15:41, Michael Paquier wrote: >> It seems that the consensus is to rework a bit more this patch. >> Returned with feedback then? > > Yes, as far as this commitfest is concerned. Or "moved to the next > commitfest"? Not sure exactly which makes sense. OK, then, switched as returned with feedback. -- Michael
On 2015/11/24 2:23, Robert Haas wrote:
> To me, it seems like there is a pretty obvious approach here: each table can be either a plain table, or a partition root (which can look just like an empty inheritance parent). Then multi-level partitioning falls right out of that design without needing to do anything extra. If you want a single level of partitioning, you partition the original table. If you want two levels of partitioning, you partition the partitions. If you want three levels of partitioning, you partition those. It's being made out here that limiting ourselves to a single level of partitioning makes things simpler, but it's not very clear to me that this is actually true.
>
> I think it is also worth getting the syntax right from the beginning. Even if we decide that patch #1 won't support multi-level partitioning, we should have a plan for the syntax that can be extended to multi-level partitioning. If we decide after releasing partitioning with one syntax that we really wish we'd used some other syntax, that is going to be a really big problem - deprecating the use of => or introducing standard_conforming_strings were projects that took many years to complete. We really only get one shot to get that right. That doesn't mean it's all got to be there in version one, but there had better be a way to extend it to all the things we want to do later or we are going to be deeply, deeply sad.

The previously proposed design would support multi-level partitioning with some adjustments. One of the reasons not to include it in the recent patch was a lack of clarity about the proper syntax and catalog organization. That wasn't really nice. So, here is a revised proposal for the syntax and some general notes. I will adjust my previous patch to follow whatever feedback I receive here.

I think we can provide an explicit SUBPARTITION keyword-based syntax for multi-level partitioning. It's not unreasonable to expect that all partitions at a given level are partitioned on the same key. IOW, why require declaring the partition key for each partition separately? Instead, using SUBPARTITION BY on the master table seems better.

Syntax to create a partitioned table (up to 2 levels of partitioning):

CREATE TABLE foo (
...
)
PARTITION BY R/L ON (key0)
SUBPARTITION BY R/L ON (key1)
[(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
    [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>], ...)],
  ...)];

The above creates two pg_partitioned_rel entries for foo, with partlevel 0 and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this creates pg_partition entries, with foo and foo_1 as partparent, respectively.

Why just 2 levels? - it seems commonplace and makes the syntax more intuitive? I guess it might be possible to generalize the syntax for multi-level partitioning. Ideas? If we want to support the notion of a sub-partition template in the future, that would require some thought, more importantly proper catalog organization for the same.

To add a partition to the table after the fact:

ALTER TABLE foo
CREATE PARTITION foo1 FOR VALUES <val> [<storage_params>] [<tblspc>]
    [(SUBPARTITION foo11 FOR VALUES <val> [<storage_params>] [<tblspc>], ...)];

To add a sub-partition to an existing partition:

ALTER TABLE foo
MODIFY PARTITION foo_1 CREATE SUBPARTITION foo_1_ FOR VALUES (val) [<storage_params>] [<tblspc>];

I considered ADD/ALTER instead of CREATE/MODIFY, but there exist grammar conflicts with ADD/ALTER column.

What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied directly to partitions on a case-by-case basis (they are tables under the hood, after all), we should restrict AT to the master table. Most of the AT changes implicitly propagate from the master table to its partitions. Some of them could be directly applied to partitions and/or sub-partitions, such as rename and storage manipulations like changing the tablespace, storage parameters (reloptions), etc.:

ALTER TABLE foo
RENAME PARTITION <partition-name> TO <new-name>;

ALTER TABLE foo
RENAME SUBPARTITION <sub-partition-name> TO <new-name>;

ALTER TABLE foo
SET TABLESPACE ... [DEFAULT] FOR PARTITION <partition-name>;

ALTER TABLE foo
SET TABLESPACE ... FOR SUBPARTITION <sub-partition-name>;

ALTER TABLE foo
SET (storage_parameter = value) [DEFAULT] FOR PARTITION <partition-name>;

ALTER TABLE foo
SET (storage_parameter = value) FOR SUBPARTITION <sub-partition-name>;

Note that the keyword DEFAULT in some cases above means: do not apply the new setting to existing sub-partitions of the partition, rather use the new setting for future sub-partitions. In the case of SET TABLESPACE, if the FOR PARTITION clause is not specified, all partitions (actually, "leaf" partitions) of the master table are moved to the new tablespace; is that necessary, or should we just disallow that and instead output an error asking to use [DEFAULT] FOR PARTITION/FOR SUBPARTITION to move only specific partitions/sub-partitions?

By the way, should we also allow changing the logging of partitions/sub-partitions as follows?

ALTER TABLE foo
MODIFY PARTITION <partition-name> SET {LOGGED | UNLOGGED};

ALTER TABLE foo
MODIFY SUBPARTITION <sub-partition-name> SET {LOGGED | UNLOGGED};

What about index constraints, ie, PRIMARY KEY, UNIQUE and EXCLUSION constraints? Two things must be clear here: these constraints cannot be created on individual partitions, and all partition columns (key0 + key1) must be the leading columns of the key. On a related note, creating an index on the master table should create the index on all "leaf" partitions. The index on the master table itself would be just a logical index. Should we allow creating or dropping indexes on partitions directly?

It would be interesting to talk about features like SPLIT, MERGE, EXCHANGE, but I'm inclined to leave them as future enhancements. For functionality similar to EXCHANGE, there are commands like ATTACH/DETACH in the latest patch. We could extend them to also consider sub-partitions:

ALTER TABLE foo
ATTACH PARTITION <partition-name> FOR VALUES (val)
USING [ TABLE ] <table-name> [SET VALID];

ALTER TABLE foo
MODIFY PARTITION foo_1 ATTACH SUBPARTITION <sub-partition-name> FOR VALUES (val)
USING [ TABLE ] <table-name> [SET VALID];

ALTER TABLE foo
DETACH PARTITION foo_1 USING [ TABLE ] <table-name>;

ALTER TABLE foo
MODIFY PARTITION foo_1 DETACH SUBPARTITION foo_1_1 USING [ TABLE ] <table-name>;

Note that all the commands described above refer to sub-partitions directly using names. These names refer to pg_class.relname of the corresponding relations, so any partition/sub-partition names have to be unique within a given schema. Maybe, if we stored the partition name in pg_partition, the system could generate a suitable pg_class name itself and relieve that restriction. It would then be possible to refer to a sub-partition using its pg_class name or its given name qualified by the partition name.

Let me also outline some significant internal implementation details:

* Introduce a new relkind - RELKIND_PARTITIONED_REL. Master/root tables in partitioning hierarchies are of this relkind. Following are some key properties of such relations -

They have a pg_partitioned_rel catalog entry containing information about the partition key (and sub-partition key, if any). RelationBuildDesc() initializes the partition key fields in RelationData; this consists of a linked list of structs corresponding to the keys at successive levels. If such a relation happens to be the target of I/U/D (or COPY FROM), the corresponding ResultRelInfo's partition key and partition descriptor fields are initialized. The latter consists of information about its immediate (level-1) partitions. In the case of U/D, the rewrite phase adds a tableoid junk attribute to the target list in addition to the traditional one (ctid) for ExecUpdate and ExecDelete.

* Next up is the notion of relations that are partitions at any level (including non-leaf levels). Following are some key properties of such relations -

They each have a pg_partition entry. They cannot be targets in I/U and COPY FROM. One cannot define rules, triggers, and RLS policies on them. Although, AR triggers defined on a partitioned master table are propagated to the "leaf" partitions. Note that RELKIND_PARTITIONED_REL relations and non-leaf partitions do not have any storage.

* How to handle multi-level partitioning - To map tuples to a "leaf" partition, CopyFrom(), ExecInsert() (also, ExecUpdate()) invoke a function called get_partition_for_tuple(). The function encapsulates details of partitioning such as the partitioning method (R/L) and multi-level recursion. A similar recursive approach could be used within the planner, too. A single invocation of the function get_rel_partitions() from, say, set_partitioned_rel_size() would return the list of "leaf" partitions of the rel, excluding pruned partitions (and sub-partitions thereof). That is, quals are matched against partition keys at all levels in the same invocation.

* Revised catalog structures:

pg_partitioned_rel:
  partrelid      Oid
  partmethod     char
  partlevel      int16
  partnatts      int16
  partkey        int2vector
  partclass      oidvector
  partexprs      pg_node_tree

pg_partition:
  partrelid        Oid
  partparent       Oid
  partname         NameData
  partlistvals     anyarray
  partrangemaxs    anyarray
  partvalid        bool
  partsubtemplate  bool
  partreloptions   text
  parttablespace   Oid

The fields of pg_partition following partrangemaxs are yet unused.

Thoughts?

Thanks,
Amit
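[Editorial note: for concreteness, a hypothetical instantiation of the syntax proposed above; the column names, bound values and the exact FOR VALUES spelling are only illustrative.]

CREATE TABLE foo (
    id       int,
    sale_dt  date,
    region   text
)
PARTITION BY RANGE ON (sale_dt)
SUBPARTITION BY LIST ON (region)
(PARTITION foo_2015 FOR VALUES START ('2015-01-01') END ('2016-01-01')
    (SUBPARTITION foo_2015_emea FOR VALUES IN ('EMEA'),
     SUBPARTITION foo_2015_apac FOR VALUES IN ('APAC')),
 PARTITION foo_2016 FOR VALUES START ('2016-01-01') END ('2017-01-01'));

-- Adding another level-2 partition later, per the ALTER TABLE sketch above:
ALTER TABLE foo
MODIFY PARTITION foo_2016 CREATE SUBPARTITION foo_2016_emea FOR VALUES ('EMEA');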
On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Syntax to create a partitioned table (up to 2 levels of partitioning):
>
> CREATE TABLE foo (
> ...
> )
> PARTITION BY R/L ON (key0)
> SUBPARTITION BY R/L ON (key1)
> [(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
> [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>],
> ...)], ...)];
>
> The above creates two pg_partitioned_rel entries for foo with partlevel 0 and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this creates pg_partition entries, with foo and foo_1 as partparent, respectively.
>
> Why just 2 levels? - it seems commonplace and makes the syntax more intuitive? I guess it might be possible to generalize the syntax for multi-level partitioning. Ideas? If we want to support the notion of sub-partition template in future, that would require some thought, more importantly proper catalog organization for the same.

I do not think this is a particularly good idea. You're going to need to dump each partition separately at least in --binary-upgrade mode, because each is going to have its own magic OIDs that need to be restored, and also because there will most likely be at least some properties that are going to vary between partitions. You could require that every partition have exactly the same set of columns, constraints, rules, triggers, policies, attribute defaults, comments, column comments, and everything else that might be different from one partition to another, and further require that they have exactly matching indexes. It would take a fair amount of code to prohibit all that, but it could be done. However, do we really want that? There may well be some things where we want to enforce that the parent and the child are exactly identical, but I doubt we want that for absolutely every property, current and future, of the partition. And even if you did, because of the --binary-upgrade stuff, you still need to be able to dump them separately.

Therefore, I believe it is a whole lot better to make the primary syntax for table partitioning something where you issue a CREATE statement for the parent and then a CREATE statement for each child. If we want to also have a convenience syntax so that people who want to create a parent and a bunch of children in one fell swoop can do so, fine.

I would not choose to model the syntax for creating partitions on Oracle. I don't find that syntax particularly nice or easy to remember. I say PARTITION BY RANGE, and then inside the parentheses I use the PARTITION keyword for each partition? Really? But I think copying the style while having the details be incompatible is an even worse idea.

> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied directly to partitions on case-by-case basis (they are tables under the hood after all), we should restrict AT to the master table. Most of the AT changes implicitly propagate from the master table to its partitions. Some of them could be directly applied to partitions and/or sub-partitions such as rename, storage manipulations like - changing tablespace, storage parameters (reloptions), etc.:
>
> ALTER TABLE foo
> RENAME PARTITION <partition-name> TO <new-name>;
>
> ALTER TABLE foo
> RENAME SUBPARTITION <sub-partition-name> TO <new-name>;
>
> ALTER TABLE foo
> SET TABLESPACE ... [DEFAULT] FOR PARTITION <partition-name>;
>
> ALTER TABLE foo
> SET TABLESPACE ... FOR SUBPARTITION <sub-partition-name>;
>
> ALTER TABLE foo
> SET (storage_parameter = value) [DEFAULT] FOR PARTITION <partition-name>;
>
> ALTER TABLE foo
> SET (storage_parameter = value) FOR SUBPARTITION <sub-partition-name>;

I don't think this is a very good idea. This is basically proposing that for every DDL command that you can apply to a table, you have to spell it differently for a partition. That seems like a lot of extra work for no additional functionality.

> By the way, should we also allow changing the logging of partitions/sub-partitions as follows?

Again, I think you're coming at this from the wrong direction. Instead of saying we're going to disallow all changes to the partitions and then deciding we need to allow certain changes after all, I think we should allow everything that is currently allowed for an inherited table and then decide which of those things we need to prohibit, and why. For example, if you insist that a child table has to have a tuple descriptor that matches the parent, that can improve efficiency: Append won't need to project, and so on. But it now becomes very difficult to support taking a stand-alone table and making it a partition of an existing partitioned table, because the set of dropped columns might not match. Having to give an error in that case amounts to "we're sorry, we can't attach your partition to the partitioning hierarchy because of some invisible state that you can't see", which isn't very nice. Now I'm not saying that isn't the right decision, but I think the design choices here need to be carefully thought about.

Stepping away from that particular example, a blanket prohibition on changing any attribute of a child table seems like it will prohibit a lot of useful things that really ought to work. And again, I don't think it's a good idea to implement separate syntax for changing a partition vs. changing a table. If I want to set a partition as unlogged, I should be able to say ALTER TABLE partition_name UNLOGGED or maybe ALTER PARTITION partition_name UNLOGGED, not be forced to use some new grammar production that looks completely different.

> What about index constraints, ie, PRIMARY KEY, UNIQUE and EXCLUSION constraints - 2 things must be clear here: cannot create these constraints on individual partitions and all partition columns (key0 + key1) must be the leading columns of the key. On a related note, creating index on the master table should create the index on all "leaf" partitions. The index on the master table itself would be just a logical index. Should we allow creating or dropping indexes on partitions directly?

I don't find this to be particularly clear. You are assuming that nobody wants to create a constraint that a certain value is unique within a partition. That might not be a real common thing to want to do, but it could certainly be useful to somebody, and the current system with table inheritance allows it. For example, suppose that we have orders partitioned on the order_date column, by month. The user might want to create a UNIQUE index on order_id on each partition. Maybe they start over with order_id 1 at the beginning of each month. But even if, as is more likely, the order IDs keep counting up from month to month, they don't want to be forced to include the whole partitioning key in the index in order to have it marked UNIQUE. That may be enough, in practice, to ensure the global uniqueness of order IDs even though the system doesn't technically enforce it in all cases.

If you want an index created on the parent to cascade down to all children, that's a big project to get right. Suppose I create an index on the parent table. After a while, I notice that it's getting bloated, so I create another index with the same definition. Now, I drop one of the two indexes. One of the two indexes from each child table needs to go away, and moreover it can't be picked arbitrarily - it has to be the one that was created at the same time as the parent index I'm dropping. If you want it to behave like this, you need a whole system of bookkeeping to make it work right.

For version 1, I'd go the other way and prohibit index definitions on the empty parent rels. Let people create indexes on the children as they wish. In a future version, we can add stuff to cascade from parents to children.

> It would be interesting to talk about features like SPLIT, MERGE, EXCHANGE but I'm inclined to leave them as future enhancements. For a functionality similar to EXCHANGE, there are commands like ATTACH/DETACH in the latest patch. We could extend them to also consider sub-partitions:

We don't need to have these in the first version, but we have to make some architectural decisions that affect how feasible they are to implement and in which cases, as noted above.

> One cannot define rules, triggers, and RLS policies on them. Although, AR triggers defined on a partitioned master table are propagated to the "leaf" partitions.

What value do you see us getting out of restricting these particular things?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
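[Editorial note: the orders scenario above is something today's inheritance-based partitioning already allows; a small hypothetical example (names invented) of a per-partition UNIQUE index that does not include the partitioning column.]

CREATE TABLE orders (order_id bigint, order_date date NOT NULL, total numeric);

CREATE TABLE orders_2015_12 (
    CHECK (order_date >= DATE '2015-12-01' AND order_date < DATE '2016-01-01')
) INHERITS (orders);

-- Uniqueness is enforced only within this child; with monotonically
-- increasing order IDs that is usually enough in practice, even though
-- the system does not enforce global uniqueness.
CREATE UNIQUE INDEX orders_2015_12_order_id_key ON orders_2015_12 (order_id);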
On 2015/12/18 3:56, Robert Haas wrote:
> On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Syntax to create a partitioned table (up to 2 levels of partitioning):
>>
>> CREATE TABLE foo (
>> ...
>> )
>> PARTITION BY R/L ON (key0)
>> SUBPARTITION BY R/L ON (key1)
>> [(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
>> [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>],
>> ...)], ...)];
>>
>> The above creates two pg_partitioned_rel entries for foo with partlevel 0 and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this creates pg_partition entries, with foo and foo_1 as partparent, respectively.
>>
>> Why just 2 levels? - it seems commonplace and makes the syntax more intuitive? I guess it might be possible to generalize the syntax for multi-level partitioning. Ideas? If we want to support the notion of sub-partition template in future, that would require some thought, more importantly proper catalog organization for the same.
>
> I do not think this is a particularly good idea. You're going to need to dump each partition separately at least in --binary-upgrade mode, because each is going to have its own magic OIDs that need to be restored, and also because there will most likely be at least some properties that are going to vary between partitions. You could require that every partition have exactly the same set of columns, constraints, rules, triggers, policies, attribute defaults, comments, column comments, and everything else that might be different from one partition to another, and further require that they have exactly matching indexes. It would take a fair amount of code to prohibit all that, but it could be done. However, do we really want that? There may well be some things where we want to enforce that the parent and the child are exactly identical, but I doubt we want that for absolutely every property, current and future, of the partition. And even if you did, because of the --binary-upgrade stuff, you still need to be able to dump them separately.
>
> Therefore, I believe it is a whole lot better to make the primary syntax for table partitioning something where you issue a CREATE statement for the parent and then a CREATE statement for each child. If we want to also have a convenience syntax so that people who want to create a parent and a bunch of children in one fell swoop can do so, fine.

Regarding --binary-upgrade dump mode, how about we teach pg_dump to dump each partition separately using ALTER TABLE parent ADD PARTITION, especially for the "magic OIDs" reason? It may very well be a CREATE PARTITION-style command though. Note that each such command could specify properties that can be different per partition. I said in my email, perhaps not so clearly, that "only" WITH options, tablespace and relpersistence can be different per partition. But I can see why that may be severely restrictive at this stage.

By the way, what do you think about the SUBPARTITION keyword-based syntax for multi-level partitioning? Should we instead require that each partition has its own PARTITION BY in its creation command?

> I would not choose to model the syntax for creating partitions on Oracle. I don't find that syntax particularly nice or easy to remember. I say PARTITION BY RANGE, and then inside the parentheses I use the PARTITION keyword for each partition? Really? But I think copying the style while having the details be incompatible is an even worse idea.

As for the convenience syntax (if at all), how about:

CREATE TABLE foo (
...
)
PARTITION BY ... ON (...)
SUBPARTITION BY ... ON (...)
opt_partition_list;

where opt_partition_list is:

PARTITIONS (
  partname FOR VALUES ... [WITH] [TABLESPACE] opt_subpart_list
  [, ...]
)

where opt_subpart_list is:

SUBPARTITIONS (
  subpartname FOR VALUES ... [WITH] [TABLESPACE]
  [, ...]
)

PARTITIONS and SUBPARTITIONS would be new unreserved keywords. Or we can do away with them.

>> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied directly to partitions on case-by-case basis (they are tables under the hood after all), we should restrict AT to the master table. Most of the AT changes implicitly propagate from the master table to its partitions. Some of them could be directly applied to partitions and/or sub-partitions such as rename, storage manipulations like - changing tablespace, storage parameters (reloptions), etc.: [ ... ]
>
> I don't think this is a very good idea. This is basically proposing that for every DDL command that you can apply to a table, you have to spell it differently for a partition. That seems like a lot of extra work for no additional functionality.

I didn't mean to propose special syntax for every available DDL command - just those that modify properties that could be different per partition. I thought there would be very few such properties and hence not a lot of special commands. But then again, I may be wrong about what those properties are.

>> By the way, should we also allow changing the logging of partitions/sub-partitions as follows?
>
> Again, I think you're coming at this from the wrong direction. Instead of saying we're going to disallow all changes to the partitions and then deciding we need to allow certain changes after all, I think we should allow everything that is currently allowed for an inherited table and then decide which of those things we need to prohibit, and why. For example, if you insist that a child table has to have a tuple descriptor that matches the parent, that can improve efficiency: Append won't need to project, and so on. But it now becomes very difficult to support taking a stand-alone table and making it a partition of an existing partitioned table, because the set of dropped columns might not match. Having to give an error in that case amounts to "we're sorry, we can't attach your partition to the partitioning hierarchy because of some invisible state that you can't see", which isn't very nice. Now I'm not saying that isn't the right decision, but I think the design choices here need to be carefully thought about.

Yeah, I am concerned about the ATTACH PARTITION USING TABLE case for the very point you mention. And I can see how it may result from the restrictive model I propose. FWIW, other databases impose a number of restrictions on the partition roll-in case, though I'm not sure whether it is for the same internal reasons we might have.

> Stepping away from that particular example, a blanket prohibition on changing any attribute of a child table seems like it will prohibit a lot of useful things that really ought to work. And again, I don't think it's a good idea to implement separate syntax for changing a partition vs. changing a table. If I want to set a partition as unlogged, I should be able to say ALTER TABLE partition_name UNLOGGED or maybe ALTER PARTITION partition_name UNLOGGED, not be forced to use some new grammar production that looks completely different.

Okay. ALTER PARTITION may be the way to go.

>> What about index constraints, ie, PRIMARY KEY, UNIQUE and EXCLUSION constraints - 2 things must be clear here: cannot create these constraints on individual partitions and all partition columns (key0 + key1) must be the leading columns of the key. On a related note, creating index on the master table should create the index on all "leaf" partitions. The index on the master table itself would be just a logical index. Should we allow creating or dropping indexes on partitions directly?
>
> I don't find this to be particularly clear. You are assuming that nobody wants to create a constraint that a certain value is unique within a partition. That might not be a real common thing to want to do, but it could certainly be useful to somebody, and the current system with table inheritance allows it. For example, suppose that we have orders partitioned on the order_date column, by month. The user might want to create a UNIQUE index on order_id on each partition. Maybe they start over with order_id 1 at the beginning of each month. But even if, as is more likely, the order IDs keep counting up from month to month, they don't want to be forced to include the whole partitioning key in the index in order to have it marked UNIQUE. That may be enough, in practice, to ensure the global uniqueness of order IDs even though the system doesn't technically enforce it in all cases.

Okay, I didn't consider that one may want to create a UNIQUE constraint per partition. Do they need to be allowed to be different per partition? Sorry, I could not understand your last sentence regarding global uniqueness. Given the restrictions on its definition, in what ways could the system fail to enforce it?

> If you want an index created on the parent to cascade down to all children, that's a big project to get right. Suppose I create an index on the parent table. After a while, I notice that it's getting bloated, so I create another index with the same definition. Now, I drop one of the two indexes. One of the two indexes from each child table needs to go away, and moreover it can't be picked arbitrarily - it has to be the one that was created at the same time as the parent index I'm dropping. If you want it to behave like this, you need a whole system of bookkeeping to make it work right.
>
> For version 1, I'd go the other way and prohibit index definitions on the empty parent rels. Let people create indexes on the children as they wish. In a future version, we can add stuff to cascade from parents to children.

Okay, I may be missing the subtleties of the global uniqueness case. I agree we can leave it out of version 1.

>> It would be interesting to talk about features like SPLIT, MERGE, EXCHANGE but I'm inclined to leave them as future enhancements. For a functionality similar to EXCHANGE, there are commands like ATTACH/DETACH in the latest patch. We could extend them to also consider sub-partitions:
>
> We don't need to have these in the first version, but we have to make some architectural decisions that affect how feasible they are to implement and in which cases, as noted above.
>
>> One cannot define rules, triggers, and RLS policies on them. Although, AR triggers defined on a partitioned master table are propagated to the "leaf" partitions.
>
> What value do you see us getting out of restricting these particular things?

Perhaps I'm wrong in thinking that we should limit these to be associated with only the top-level partitioned table.

Thanks a lot for the comments!

Regards,
Amit
Thought a bit more on some points (see below). To anyone interested in getting involved with the review: I'm working on getting a revised version of the patch out soon. However, I must also mention that we need to reach consensus on some broader design issues before any meaningful (or fruitful) code review can be done. Please feel free to post your thoughts about design, syntax, etc.

On 2015/12/22 10:51, Amit Langote wrote:
> On 2015/12/18 3:56, Robert Haas wrote:
>> On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote wrote:
>>> Syntax to create a partitioned table (up to 2 levels of partitioning):
>>>
>>> CREATE TABLE foo (
>>> ...
>>> )
>>> PARTITION BY R/L ON (key0)
>>> SUBPARTITION BY R/L ON (key1)
>>> [(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
>>> [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>],
>>> ...)], ...)];
>>> [ ... ]
>>
>> Therefore, I believe it is a whole lot better to make the primary syntax for table partitioning something where you issue a CREATE statement for the parent and then a CREATE statement for each child. If we want to also have a convenience syntax so that people who want to create a parent and a bunch of children in one fell swoop can do so, fine.
> [ ... ]
>
> By the way, what do you think about SUBPARTITION keyword-based syntax for multi-level partitioning? Should we instead require that each partition has its own PARTITION BY in its creation command?

If we have a CREATE statement for each partition, how do we generalize that to partitions at different levels? For example, if we use something like the following to create a partition of parent_name:

CREATE PARTITION partition_name OF parent_name FOR VALUES ...
WITH ... TABLESPACE ...

Do we then say:

CREATE PARTITION subpartition_name OF partition_name ...

to create a level 2 partition (sub-partition) of parent_name? Obviously, as is readily apparent from the command, it is still a direct partition of partition_name for all internal purposes (consider partition list caching in relcache, recursive tuple routing, etc.), save some others.

I ask that also because it's related to the choice of syntax to use to declare the partition key for the multi-level case. I'm considering the SUBPARTITION BY notation and perhaps we could generalize it to more than just 2 levels. So, for the above case, parent_name would have been created as:

CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ...

Needless to say, when subpartition_name is created with the command we saw a moment ago, the root partitioned table would be locked. In fact, adding a partition anywhere in the hierarchy needs an exclusive lock on the root table. Also, the partition rule (the FOR VALUES clause) would be validated against the PARTITION BY or SUBPARTITION BY clause at the respective level.

Although, I must admit I feel a little uneasy about the inherent asymmetry of using SUBPARTITION BY for the key declaration while piggybacking CREATE PARTITION for creating sub-partitions. Is there a better way?

> As for the convenience syntax (if at all), how about:
>
> CREATE TABLE foo (
> ...
> )
> PARTITION BY ... ON (...)
> SUBPARTITION BY ... ON (...)
> opt_partition_list;
>
> where opt_partition_list is:
>
> PARTITIONS (
> partname FOR VALUES ... [WITH] [TABLESPACE] opt_subpart_list
> [, ...]
> )
>
> where opt_subpart_list is:
>
> SUBPARTITIONS (
> subpartname FOR VALUES ... [WITH] [TABLESPACE]
> [, ...]
> )

Do we want this at all? It seems difficult to generalize this to a multi-level hierarchy of more than 2 levels.

>>> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied directly to partitions on case-by-case basis (they are tables under the hood after all), we should restrict AT to the master table. Most of the AT changes implicitly propagate from the master table to its partitions. Some of them could be directly applied to partitions and/or sub-partitions such as rename, storage manipulations like - changing tablespace, storage parameters (reloptions), etc.: [ ... ]
>>
>> I don't think this is a very good idea. This is basically proposing that for every DDL command that you can apply to a table, you have to spell it differently for a partition. That seems like a lot of extra work for no additional functionality.
> [ ... ]
>>> By the way, should we also allow changing the logging of partitions/sub-partitions as follows?
>>
>> Again, I think you're coming at this from the wrong direction. Instead of saying we're going to disallow all changes to the partitions and then deciding we need to allow certain changes after all, I think we should allow everything that is currently allowed for an inherited table and then decide which of those things we need to prohibit, and why. For example, if you insist that a child table has to have a tuple descriptor that matches the parent, that can improve efficiency: Append won't need to project, and so on. But it now becomes very difficult to support taking a stand-alone table and making it a partition of an existing partitioned table, because the set of dropped columns might not match. Having to give an error in that case amounts to "we're sorry, we can't attach your partition to the partitioning hierarchy because of some invisible state that you can't see", which isn't very nice. Now I'm not saying that isn't the right decision, but I think the design choices here need to be carefully thought about.
>
> Yeah, I am concerned about the ATTACH PARTITION USING TABLE case for the very point you mention. And I can see how it may result from the restrictive model I propose. FWIW, other databases impose a number of restrictions on the partition roll-in case but not sure if for the internal reasons we might want to.

After thinking some more on this - I think that identical tuple descriptors may not just be a nice-to-have but critical in some cases. For example, consider built-in/trigger-less tuple routing. I'd imagine that the partition to insert a tuple into would be determined just before calling heap_insert() in ExecInsert() and CopyFrom(). That means the HeapTuple that is passed to heap_insert() to insert into the partition would be based on the root table's tuple descriptor. Note also that the tuple would have passed through the BR and IR triggers and constraints of the root table. When the data is eventually queried from the partitions directly, or even via the root table (considering existing executor capabilities), the partition's tuple descriptor at that point had better match the data that went onto the disk. That means we had better keep at least the following things in sync: the number of attributes, and the name, position (attnum), type and not-null-ness of individual attributes.

So, in order to do that, recursively apply ADD/DROP COLUMN, SET WITH/WITHOUT OIDS, RENAME COLUMN, ALTER COLUMN TYPE, SET/DROP NOT NULL on the root table to all the partitions, and prevent those sub-commands from being directly applied to any table (partition) in the partitioning hierarchy other than the root. I further don't see the point of allowing column defaults to be set (or dropped) on partitions, because INSERT and COPY FROM can no longer be applied directly to partitions. A similar argument could be made for BR and IR triggers and CHECK constraints. Am I missing something in all of this?

An alternative to doing any of that may very well be to design trigger-less tuple routing to be smarter about possible mismatches of the tuple descriptors, but I haven't given that a lot of thought. Is that really an alternative worth looking into?

Coming to the ATTACH PARTITION case, I am not quite sure how we can make it as forgiving as ALTER TABLE INHERIT is, given that we'd expect things like trigger-less tuple routing to work out of the box as described above. That said, I would not dare propose an INCLUDING DROPPED extension to LIKE - or would it stick if I did? ;)

Unrelated to the above, I have been thinking about the planner considerations for partition hierarchies with minimal changes, considering that we'd want to keep it that way for the first cut:

1) At least some optimizer/prep changes to handle adding partitions to PlannerInfo's append_rel_list would be necessary.

2) To be able to use constraint exclusion, we'd need to create equivalent CHECK constraints (non-dumpable) when partitions are created and perform all the necessary bookkeeping, for the time being.

You might wonder why (1) requires any special handling - wouldn't "partition hierarchies" still be "inheritance hierarchies" and so work with existing code, albeit with unnecessary overheads like Var translation lists? Let me think out loud about the general background on this so someone can correct me if it seems I'm going at it all wrong -

On one hand, I think continuing to treat "partition hierarchies" as "inheritance hierarchies" might have some issues. I am afraid that the documented inheritance semantics may not be what we want to keep using for the new partitioned tables. By that, I mean all the user-facing behaviors where inheritance has some bearing. Should it also affect the new partitioned tables? Consider whether inheritance semantics would render infeasible some of the things that we'd like to introduce for the new partitioned tables, such as automatic tuple routing, or keep us from improving planner smarts and executor capabilities for partitioned tables over what we already have.

OTOH, I may be looking at it wrongly. We would not be required to enforce user-facing inheritance behaviors on the new partitioned tables after all. That is to say - the new partitioned tables could still use the relevant inheritance infrastructure behind the scenes for planning, execution and a few other things, without abiding by regular inheritance semantics. I should just go ahead and add special cases in all places where existing inheritance handling code stands to cause trouble down the line for partitioned tables. We might want to mention somewhere in the documentation that we do so, and also note that regular inheritance semantics do not apply. While that probably sounds fine as the implementation for the initially released feature, a day will inevitably come when this behind-the-scenes implementation will be changed to something more amenable to better optimization. But that's for the future...

There's a great chance that not everyone cares right now about this part of the new partitioning, but I just wanted to put it out there. There are more contentious issues like the syntax, the partitioning maintenance commands that we plan to support (now or later), and such. Any advice is greatly appreciated.

Thanks,
Amit
If we have a CREATE statement for each partition, how do we generalize
that to partitions at different levels? For example, if we use something
like the following to create a partition of parent_name:
CREATE PARTITION partition_name OF parent_name FOR VALUES ...
WITH ... TABLESPACE ...
Do we then say:
CREATE PARTITION subpartition_name OF partition_name ...
That's how I'd want it for partitions created after the initial partitioned table is created.
I'd like to be able to identify the parent partition by its own partitioning parameters rather than name, like the way we can derive the name of an index in ON CONFLICT. But I see no clean way to do that, and if one did come up, we'd simply allow the user to replace
<partition_name>
with
table_name PARTITION partition_spec [ ...PARTITION partition_spec [ ...PARTITION turtles_all_the_way_down]].
Again, totally fine with forcing the maintenance script to know or discover the name of the partition to be subpartitioned...for now.
to create a level 2 partition (sub-partition) of parent_name? Obviously,
as is readily apparent from the command, it is still a direct partition of
partition_name for all internal purposes (consider partition list caching
in relcache, recursive tuple routing, etc.) save some others.
I ask that also because it's related to the choice of syntax to use to
declare the partition key for the multi-level case. I'm considering the
SUBPARTITION BY notation and perhaps we could generalize it to more than
just 2 levels. So, for the above case, parent_name would have been created as:
CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ...
Needless to say, when subpartition_name is created with the command we saw
a moment ago, the root partitioned table would be locked. In fact, adding
a partition anywhere in the hierarchy needs an exclusive lock on the root
table. Also, partition rule (the FOR VALUES clause) would be validated
against PARTITION BY or SUBPARTITION BY clause at the respective level.
Although, I must admit I feel a little uneasy about the inherent asymmetry
in using SUBPARTITION BY for key declaration whereas piggybacking CREATE
PARTITION for creating sub-partitions. Is there a better way?
Provided that the syntax allows for N levels of partitioning, I don't care if it's
PARTITION BY.., PARTITION BY..., PARTITION BY ...
or
PARTITION BY.., SUBPARTITION BY..., SUBPARTITION BY ...
The first is probably better for meta-coding purposes, but the second makes it clear which partition layer is first.
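To make that concrete, here is a rough sketch of the two spellings for a hypothetical three-level key (table, column, and key choices are made up; neither form exists yet):

CREATE TABLE sales (region text, yr int, mon int, amt numeric)
PARTITION BY LIST ON (region)
PARTITION BY RANGE ON (yr)
PARTITION BY RANGE ON (mon);   -- repeated PARTITION BY, one clause per level

-- versus

CREATE TABLE sales (region text, yr int, mon int, amt numeric)
PARTITION BY LIST ON (region)
SUBPARTITION BY RANGE ON (yr)
SUBPARTITION BY RANGE ON (mon);   -- SUBPARTITION BY marks every level below the first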
> As for the convenience syntax (if at all), how about:
>
> CREATE TABLE foo (
> ...
> )
> PARTITION BY ... ON (...)
> SUBPARTITION BY ... ON (...)
> opt_partition_list;
>
> where opt_partition_list is:
>
> PARTITIONS (
> partname FOR VALUES ... [WITH] [TABLESPACE] opt_subpart_list
> [, ...]
> )
>
> where opt_subpart_list is:
>
> SUBPARTITIONS (
> subpartname FOR VALUES ... [WITH] [ TABLESPACE]
> [, ...]
> )
Do we want this at all? It seems difficult to generalize this to a
multi-level hierarchy of more than 2 levels.
I want this.
Granted, the syntax for 3+ levels of partitioning would be cumbersome, but it is what the user wanted, and the nested PARTITIONS/SUBPARTITIONS clauses make the levels explicit (sketched below). In those cases, the user might opt not to create more than the default first subpartition to keep the syntax sane, or we might auto-generate default partitions (with a VALUES clause of whatever "all values" is for that datatype; again, this is an area where leveraging range types would be most valuable).
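For illustration only, the quoted convenience syntax written out for two levels might look something like this (table, partition, and tablespace names are invented, and this form is not part of the posted patch):

CREATE TABLE measurements (logdate date NOT NULL, city text)
PARTITION BY RANGE ON (logdate)
SUBPARTITION BY LIST ON (city)
PARTITIONS (
    m_y2006 FOR VALUES START ('2006-01-01') END ('2007-01-01') TABLESPACE fast_disks
        SUBPARTITIONS (
            m_y2006_tokyo FOR VALUES IN ('Tokyo'),
            m_y2006_rest  FOR VALUES IN ('Osaka', 'Nagoya')
        ),
    m_y2007 FOR VALUES START ('2007-01-01') END ('2008-01-01')
);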
On one hand, I think to keep treating "partition hierarchies" as
"inheritance hierachies" might have some issues. I am afraid that
documented inheritance semantics may not be what we want to keep using for
the new partitioned tables. By that, I mean all the user-facing behaviors
where inheritance has some bearing. Should it also affect new partitioned
tables? Consider whether inheritance semantics would render infeasible
some of the things that we'd like to introduce for the new partitioned
tables such as automatic tuple routing, or keep us from improving planner
smarts and executor capabilities for partitioned tables over what we
already have.
I feel that automatic tuple routing should be considered the key benefit of "real" partitions over inherited tables. Trigger maintenance is most of the work in custom partitioning schemes, at least the ones I've written.
There's a great chance that not everyone cares right now about this part
of the new partitioning but just want to put it out there. There are more
contentious issues like the syntax, partitioning maintenance commands that
we plan to support (now or later) and such.
What I've read so far addresses most of my concerns.
Still somewhat on my mind:
1. ability to describe partition bounds via range types, regardless of whether the Automatic Tuple Routing uses those types internally.
2. syntax for splitting a partition in two, merging two adjacent partitions (you probably touched on these earlier and I missed it or forgot).
3. ability to swap a partition with a table not currently associated with the partitioned table.
4. The applicability of this syntax to materialized views, allowing us to do REFRESH CONCURRENTLY a few parts at a time, or only refreshing the data we know needs it.
Items 2 and 3 don't have to be implemented right away, as they're separate ALTER commands. 4 is a pipe dream. With Item 1 I ask only that we don't pick a syntax that prevents description via range types.
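To give items 2 and 3 some shape: item 3 roughly corresponds to the ATTACH/DETACH commands already proposed in this thread, while the split/merge commands for item 2 are purely hypothetical here (the command names are only borrowed from what other systems call these operations, and april_data is an invented table):

-- item 3: roll-in/roll-out as proposed earlier in the thread (ATTACH not yet implemented)
ALTER TABLE measurement ATTACH PARTITION measurement_y2006m04
    FOR VALUES START ('2006-04-01') END ('2006-05-01') USING TABLE april_data;
ALTER TABLE measurement DETACH PARTITION measurement_y2006m02 USING measurement_y2006m02_old;

-- item 2: hypothetical split/merge commands, not in any posted patch
ALTER TABLE measurement SPLIT PARTITION measurement_y2006m02
    AT ('2006-02-15') INTO (measurement_y2006m02a, measurement_y2006m02b);
ALTER TABLE measurement MERGE PARTITIONS measurement_y2006m02a, measurement_y2006m02b
    INTO measurement_y2006m02;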
Hi, On 2016/01/17 9:47, Corey Huinker wrote: >> If we have a CREATE statement for each partition, how do we generalize >> that to partitions at different levels? For example, if we use something >> like the following to create a partition of parent_name: >> >> CREATE PARTITION partition_name OF parent_name FOR VALUES ... >> WITH ... TABLESPACE ... >> >> Do we then say: >> >> CREATE PARTITION subpartition_name OF partition_name ... > > That's how I'd want it for partitions created after the initial partitioned > table is created. > > I'd like to be able to identify the parent partition by it's own > partitioning parameters rather than name, like the way we can derive the > name of an index in ON CONFLICT. But I see no clean way to do that, and if > one did come up, we'd simply allow the user to replace > <partition_name> > with > table_name PARTITION partition_spec [...PARTITION partition_spec [ > ...PARTITION turtles_all_the_way_down]]). > > Again, totally fine with forcing the maintenance script to know or discover > the name of the partition to be subpartitioned...for now. I'm thinking of going with this last option. So for now, you create an empty partitioned table specifying all the partition keys without being able to define any partitions in the same statement. Partitions (and partitions thereof, if any) will be created using CREATE PARTITION statements, one for each. >> to create a level 2 partition (sub-partition) of parent_name? Obviously, >> as is readily apparent from the command, it is still a direct partition of >> partition_name for all internal purposes (consider partition list caching >> in relcache, recursive tuple routing, etc.) save some others. >> >> I ask that also because it's related to the choice of syntax to use to >> declare the partition key for the multi-level case. I'm considering the >> SUBPARTITION BY notation and perhaps we could generalize it to more than >> just 2 levels. So, for the above case, parent_name would have been created >> as: >> >> CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ... > >> Needless to say, when subpartition_name is created with the command we saw >> a moment ago, the root partitioned table would be locked. In fact, adding >> a partition anywhere in the hierarchy needs an exclusive lock on the root >> table. Also, partition rule (the FOR VALUES clause) would be validated >> against PARTITION BY or SUBPARTITION BY clause at the respective level. >> >> Although, I must admit I feel a little uneasy about the inherent asymmetry >> in using SUBPARTITION BY for key declaration whereas piggybacking CREATE >> PARTITION for creating sub-partitions. Is there a better way? > > Provided that the syntax allows for N levels of partitioning, I don't care > if it's > PARTITION BY.., PARTITION BY..., PARTITION BY ... > or > PARTITION BY.., SUBPARTITION BY..., SUBPARTITION BY ... > > The first is probably better for meta-coding purposes, but the second makes > it clear which partition layer is first. I prefer second one too, for clarity. In any case, the PARTITION / SUBPARTITION BY clauses are ordered. That is, successive clauses specify partition key for corresponding successive levels of the hierarchy. > There's a great chance that not everyone cares right now about this part >> of the new partitioning but just want to put it out there. There are more >> contentious issues like the syntax, partitioning maintenance commands that >> we plan to support (now or later) and such. >> > > What I've read so far addresses most of my concerns. 
> > Still somewhat on my mind: > > 1. ability to describe partition bounds via range types, regardless of > whether the Automatic Tuple Routing uses those types internally. Specifying range partitioning bound as PARTITION FOR RANGE <range-literal> sounds like it offers some flexibility, which can be seen as a good thing. But it tends to make internal logic slightly complicated. Whereas, saying PARTITION FOR VALUES LESS THAN (max1, max2, ...) is notationally simpler still and easier to work with internally. Also, there will be no confusion about exclusivity of the bound if we document it so. > 2. syntax for splitting a partition in two, merging two adjacent partitions > (you probably touched on these earlier and I missed it or forgot). Maintenance operations like split, merge would not be the first cut. > 3. ability to swap a partition with a table not currently associated with > the partitioned table. EXCHANGE PARTITION-like ability is something on my list. There are a few hurdles to get there. Especially, source table would have to be physical tuple descriptor (ie, including dropped columns) compatible with the partitioned table given some other parts of the design (I touched on those in my earlier message). > 4. The applicability of this syntax to materialized views, allowing us to > do REFRESH CONCURRENTLY a few parts at a time, or only refreshing the data > we know needs it. > > Items 2 and 3 don't have to be implemented right away, as they're separate > ALTER commands. 4 is a pipe dream. With Item 1 I ask only that we don't > pick a syntax that prevents description via range types. Thanks a lot for taking time to comment. Thanks, Amit
Hi Amit, thanks for working on this. Seems the last version of the patch was submitted more than 2 months ago and I believe large parts of it will get reworked based on the extensive discussion on this list, so I haven't looked at the code at all. I'd like to comment on the one thing and that's the syntax. It seems to me we're really trying to reinvent the wheel and come up with our own version of the syntax. Is there a particular reason why not to look at the syntax of the other databases and adapt as much of the existing syntax as possible? I think that's for a few reasons - firstly it makes the life much easier for the DBAs and users who are either migrating to PostgreSQL or have to manage a mix of databases. Secondly, it serves as a valuable source of engineering info, preventing the "I haven't thought of this use case" problem. An example of this is the proposed syntax for adding a partition CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-15') END ('2006-03-01'); which seems a bit awkward as both the databases I'm familiar with (Oracle and Sybase) use ALTER TABLE to do this ALTER TABLE measurement ADD PARTITION measurement_fail VALUES LESS THAN ( ... ) And so on for the other commands. That being said, I entirely agree with Simon (and others) that getting the planner part work is the crucial part of the patch. But I also think that a proper abstraction (thanks to good syntax) may be a valuable hint how to define the catalogs and such. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So for now, you create an empty partitioned table specifying all the
partition keys without being able to define any partitions in the same
statement. Partitions (and partitions thereof, if any) will be created
using CREATE PARTITION statements, one for each.
...and I would assume that any attempt to insert into a partitioned table with no partitions (or lacking partitions at a defined level) would be an error? If so, I'd be ok with that.
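For instance, with nothing attached below the parent, I would expect something along these lines (the exact error wording is only a guess):

CREATE TABLE measurement (logdate date NOT NULL) PARTITION BY RANGE ON (logdate);
INSERT INTO measurement VALUES ('2006-02-20');
ERROR: no partition of "measurement" found for the inserted row   -- hypothetical message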
Specifying range partitioning bound as PARTITION FOR RANGE <range-literal>
sounds like it offers some flexibility, which can be seen as a good thing.
But it tends to make internal logic slightly complicated.
Whereas, saying PARTITION FOR VALUES LESS THAN (max1, max2, ...) is
notationally simpler still and easier to work with internally. Also, there
will be no confusion about exclusivity of the bound if we document it so.
I understand wanting the internal rules to be simple. Oracle clearly went with VALUES LESS THAN waterfalls for that reason.
What I'm hoping to avoid is:
- having to identify my "year2014" partition by VALUES LESS THAN '2015-01-01', a bit of cognitive dissonance defining data by what it's not.
- and then hoping that there is a year2013 partition created by someone with similar sensibilities, the partition definition being incomplete outside of the context of other partition definitions.
- and then further hoping that nobody drops the year2013 partition, thus causing new 2013 rows to fall into the year2014 partition, a side effect of an operation that did not mention the year2014 partition.
Range types do that, and if we're concerned about range type overhead, we're only dealing with the ranges at DDL time; we can break down the ATR rules into a more easily digestible form once a partition is modified.
Range continuity can be tested with -|-, but we'd only need to test for overlaps: gaps in ranges are sometimes a feature, not a bug (ex: I don't want any rows from future dates and we weren't in business before 1997).
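Both tests map directly onto the existing range operators; for example, using daterange purely for illustration:

SELECT daterange('2014-01-01', '2015-01-01') && daterange('2014-06-01', '2014-07-01');   -- t: bounds overlap, reject
SELECT daterange('2013-01-01', '2014-01-01') -|- daterange('2014-01-01', '2015-01-01');  -- t: adjacent, no gap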
Also, VALUES LESS THAN forces us to use discrete values. There is no way to express with VALUES LESS THAN partitions that have float values for temperature:
ice (,0.0), water [0.0,212.0], steam (212.0,3000.0], plasma (3000.0,).
Yes, I can calculate the day after the last day in a year, I can use 212.0000000001, I can write code to rigorously check that all partitions are in place. I'd just rather not.
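For what it's worth, the four ranges above are trivially spelled with an existing range type, which is the flexibility I'm after (numrange is used purely for illustration, not as a proposal for the catalog representation):

SELECT numrange(NULL,   0.0,    '()');   -- ice    (,0.0)
SELECT numrange(0.0,    212.0,  '[]');   -- water  [0.0,212.0]
SELECT numrange(212.0,  3000.0, '(]');   -- steam  (212.0,3000.0]
SELECT numrange(3000.0, NULL,   '()');   -- plasma (3000.0,)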
p.s. I'm really excited about what this will bring to Postgres in general and my organization in particular. This feature alone will help chip away at our needs for Vertica and Redshift clusters. Let me know if there's anything I can do to help.
Hi, On 2016/01/23 3:42, Corey Huinker wrote: >> So for now, you create an empty partitioned table specifying all the >> partition keys without being able to define any partitions in the same >> statement. Partitions (and partitions thereof, if any) will be created >> using CREATE PARTITION statements, one for each. >> > > ...and I would assume that any attempt to insert into a partitioned table > with no partitions (or lacking partitions at a defined level) would be an > error? If so, I'd be ok with that. Yes. Basically, for a tuple insert to succeed, there should exist a leaf partition that the inserted tuple would map to. >> Specifying range partitioning bound as PARTITION FOR RANGE <range-literal> >> sounds like it offers some flexibility, which can be seen as a good thing. >> But it tends to make internal logic slightly complicated. >> >> Whereas, saying PARTITION FOR VALUES LESS THAN (max1, max2, ...) is >> notationally simpler still and easier to work with internally. Also, there >> will be no confusion about exclusivity of the bound if we document it so. > > > I understand wanting the internal rules to be simple. Oracle clearly went > with VALUES LESS THAN waterfalls for that reason. > > What I'm hoping to avoid is: > - having to identify my "year2014" partition by VALUES LESS THAN > '2015-01-01', a bit of cognitive dissonance defining data by what it's not. > - and then hoping that there is a year2013 partition created by someone > with similar sensibilities, the partition definition being incomplete > outside of the context of other partition definitions. > - and then further hoping that nobody drops the year2013 partition, thus > causing new 2013 rows to fall into the year2014 partition, a side effect of > an operation that did not mention the year2014 partition. I somewhat agree with these concerns. So whenever a left range partition is deleted, a partition would begin accepting tuples that would have been mapped to the partition just deleted. Then as you say, partition would contain data that does not really fit the name. A keen user could then probably rename the partition to something suitable in the same transaction. I'd imagine that a partition deleted in this way would be the leftmost (oldest partition) in most scenarios and there wouldn't be any future tuples that would map to such partition. Admittedly, that may just be wishful thinking though. On the other hand, giving user the ability to specify both min and max requires a lot of bookkeeping on the system's part. One of my previous patches had that syntax and I wasn't quite happy with overall flakiness of the implementation. Granted it would have been slightly easier if I had used range type machinery and I tried a little but didn't stick with it for too long. You may have read the following thread: http://www.postgresql.org/message-id/15387.1423496163@sss.pgh.pa.us > Range types do that, and if we're concerned about range type overhead, > we're only dealing with the ranges at DDL time, we can break down the ATR > rules into a more easily digestible form once the partition is modified. We still end up with min, max pair per partition along with the inclusivity of both as being part of the persisted partition rule. > Range continuity can be tested with -|-, but we'd only need to test for > overlaps: gaps in ranges are sometimes a feature, not a bug (ex: I don't > want any rows from future dates and we weren't in business before 1997). > > Also, VALUES LESS THAN forces us to use discrete values. 
There is no way > with to express with VALUES LESS THAN partitions that have float values for > temperature: > ice (,0.0), water [0.0,212.0], steam (212.0,3000.0], plasma (3000.0,). > > Yes, I can calculate the day after the last day in a year, I can use > 212.0000000001, I can write code to rigorously check that all partitions > are in place. I'd just rather not. Hmm, this may be a concern. I have read discussion threads of range types development and one of the contending issues was around semantics for continuous types. So here, if all we have is VALUES LESS THAN to define a range partition, one cannot assign ranges to partitions that cannot be reduced to [) form, right? > p.s. I'm really excited about what this will bring to Postgres in general > and my organization in particular. This feature alone will help chip away > at our needs for Vertica and Redshift clusters. Let me know if there's > anything I can do to help. I am really glad to hear that. Your comments so far are much appreciated. Thanks, Amit
Hi Tomas, Thanks for your comments and sorry for replying so late. On 2016/01/22 22:54, Tomas Vondra wrote: > thanks for working on this. Seems the last version of the patch was > submitted more than 2 months ago and I believe large parts of it will get > reworked based on the extensive discussion on this list, so I haven't > looked at the code at all. > > I'd like to comment on the one thing and that's the syntax. It seems to me > we're really trying to reinvent the wheel and come up with our own version > of the syntax. Is there a particular reason why not to look at the syntax > of the other databases and adapt as much of the existing syntax as possible? > > I think that's for a few reasons - firstly it makes the life much easier > for the DBAs and users who are either migrating to PostgreSQL or have to > manage a mix of databases. Secondly, it serves as a valuable source of > engineering info, preventing the "I haven't thought of this use case" > problem. > > An example of this is the proposed syntax for adding a partition > > CREATE TABLE measurement_fail > PARTITION OF measurement > FOR VALUES START ('2006-02-15') END ('2006-03-01'); > > which seems a bit awkward as both the databases I'm familiar with (Oracle > and Sybase) use ALTER TABLE to do this > > ALTER TABLE measurement > ADD PARTITION measurement_fail VALUES LESS THAN ( ... ) Syntax like the one you mention allows to create/modify/move/drop partitions at 2 levels (generally) using PARTITION and SUBPARTITION keywords. That might be limiting to some users. I don't have a clear picture of what a syntax that's general enough would look like, but I proposed something like what follows: CREATE TABLE parent ( a int, b char(10) ) PARTITION BY RANGE ON (a) SUBPARTITION BY LIST ON ((substring(b from 1 for 2))); CREATE PARTITION partname OF parent FOR VALUES LESS THAN (100); CREATE PARTITION subpartname OF partname FOR VALUES IN ('ab'); The latter of the CREATE PARTITION commands lets us create the so-called sub-partition of 'parent'. Remember that in this scheme, all level 1 partitions are not actually physical tables themselves; only level 2 partitions are. If you stick one more SUBPARTITION BY in parent's definition, you can: CREATE PARTITION subsubpartname OF subpartname FOR VALUES ...; This is something that the Oracle-like syntax won't be able to support. Although, if we all agree that we'd never want to support such a case then let's implement something that's familiar viz. the following: CREATE TABLE parent ( a int, b char(10) ) PARTITION BY RANGE ON (a) SUBPARTITION BY LIST ON ((substring(b from 1 for 2))); ALTER TABLE parent ADD PARTITION partname FOR VALUES LESS THAN (100); ALTER TABLE parent MODIFY PARTITION partname ADD SUBPARTITION subpartname FOR VALUES IN ('ab'); ALTER TABLE parent MODIFY PARTITION partname DROP SUBPARTITION subpartname; ALTER TABLE parent DROP PARTITION partname; > And so on for the other commands. > > That being said, I entirely agree with Simon (and others) that getting the > planner part work is the crucial part of the patch. But I also think that > a proper abstraction (thanks to good syntax) may be a valuable hint how to > define the catalogs and such. I tried to do that in the November commitfest but decided to just work on the syntax as it became clear that throwing too many changes at the reviewers/committers like that may not be such a great idea. Syntax itself is a big enough change to discuss and reach consensus on. 
Let's get the syntax, catalog and some infrastructure for basic features of the new partitioning to work. We can think about planner enhancements for partitioned tables that are waiting for declarative partitioning to get in later. Thanks, Amit
Hi, Attached find a patch series to implement the following syntax: 1. Syntax for defining the partition key as a new clause in CREATE TABLE: [ PARTITION BY {RANGE | LIST} ON ( { column_name | ( expression ) } [ opclass ] [, ...] ) [ SUBPARTITION BY {RANGE | LIST} ON ( { column_name | ( expression ) } [ opclass ] [, ...] ) ] ] 2. Syntax to create partitions and sub-partitions (as ALTER TABLE commands on a partitioned table): ADD PARTITION name FOR VALUES partition_bound_spec [ WITH ( storage_parameter [= value] [, ... ] )] [ TABLESPACE tablespace_name ] MODIFY PARTITION partition_name ADD SUBPARTITION name FOR VALUES partition_bound_spec [ WITH ( storage_parameter [= value] [, ... ] )] [ TABLESPACE tablespace_name ] DROP PARTITION name DROP SUBPARTITION name ATTACH PARTITION name FOR VALUES partition_bound_spec USING [ TABLE ] table_name [ WITH ( storage_parameter [= value] [, ... ] )] [ TABLESPACE tablespace_name ] MODIFY PARTITION partition_name ATTACH SUBPARTITION name FOR VALUES partition_bound_spec USING [ TABLE ] table_name [ WITH ( storage_parameter [= value] [, ... ] )] [ TABLESPACE tablespace_name ] Where partition_bound_spec is: FOR VALUES [ IN ] ( expression [, ...] ) FOR VALUES LESS THAN ( expression [, ...] ) DETACH PARTITION name [ USING [ TABLE ] table_name] DETACH SUBPARTITION name [ USING [ TABLE ] table_name] Please note that ATTACH PARTITION ... USING TABLE command shown above is not yet implemented. As mentioned previously, this uses two system catalogs pg_partitioned_rel and pg_partition. This is complete with tuple routing so that no triggers or rules are required. There is also basic planner support but no support yet to enable constraint exclusion on partitions (which will be fulfilled shortly by installing equivalent check constraints on partitions). There is documentation for the new syntax and catalogs. I feel more detailed documentation is required so I will keep on improving it in subsequent versions as we build consensus about the syntax and general design. There are no regression tests at all in the attached patches which I will add in the next version along with constraint exclusion support. pg_dump support will also be added shortly. The individual patches have commit messages that describe code changes. This is registered in the upcoming CF. Feedback and review is greatly welcomed! Thanks, Amit
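As a rough usage sketch of how the commands above might be combined (not taken from the patch itself; the partition names, the month expression, and the tablespace are made up):

CREATE TABLE measurement (logdate date NOT NULL, peaktemp int)
PARTITION BY RANGE ON (logdate)
SUBPARTITION BY LIST ON ((extract(month from logdate)));

ALTER TABLE measurement
    ADD PARTITION measurement_y2006 FOR VALUES LESS THAN ('2007-01-01');
ALTER TABLE measurement
    MODIFY PARTITION measurement_y2006
    ADD SUBPARTITION measurement_y2006m02 FOR VALUES IN (2) TABLESPACE fast_disks;
ALTER TABLE measurement DROP SUBPARTITION measurement_y2006m02;
ALTER TABLE measurement DROP PARTITION measurement_y2006;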
Attachment
- 0001-Add-relkind-RELKIND_PARTITIONED_REL-to-pg_class.h.patch
- 0002-Add-partitioned-table-system-catalog.patch
- 0003-Introduce-partition-key-clause-in-CREATE-TABLE.patch
- 0004-Infrastructure-for-creation-of-partitioned-tables.patch
- 0005-Add-partition-system-catalog.patch
- 0006-Add-partition-creation-management-syntax.patch
- 0007-Add-infrastructure-to-actually-create-and-manage-par.patch
- 0008-Rework-code-manipulating-the-PartitionKey-data-struc.patch
- 0009-Tuple-routing-functionality.patch
- 0010-Embarrasingly-small-optimizer-patch-to-work-with-par.patch
The individual patches have commit messages that describe code changes.
This is registered in the upcoming CF. Feedback and review is greatly
welcomed!
Thanks,
Amit
We have a system that is currently a mix of tables, each of which is range partitioned into approximately 15 partitions (using the pgxn range partitioning extension), and those partitions are themselves date-series partitioned via pg_partman. The largest table ingests about 100M rows per day in a single ETL. I will try this patch out and see how well it compares in handling the workload. Do you have any areas of interest or concern that I should monitor?
Hi Corey, On 2016/02/16 5:15, Corey Huinker wrote: >> >> The individual patches have commit messages that describe code changes. >> This is registered in the upcoming CF. Feedback and review is greatly >> welcomed! >> > We have a current system that is currently a mix of tables, each of which > is range partitioned into approximately 15 partitions (using the pgxn range > partitioning extension), and those partitions are themselves date-series > partitioned via pg_partman. The largest table ingests about 100M rows per > day in a single ETL. I will try this patch out and see how well it compares > in handling the workload. Do you have any areas of interest or concern that > I should monitor? Thanks a lot for willing to give it a spin! I would say this patch series is more geared toward usability. For example, you won't have to write a trigger to route tuples to correct partitions. You can try your mentioned ETL load and see if the patch's implementation of tuple routing fares any better than existing trigger-based approach. Maybe, you were expecting something like load into a stand-alone table and then ALTER TABLE INHERIT to instantly load into the partitioned table (roll-in), but that command is not yet implemented (ATTACH PARTITION command will show a "not implemented" error message). Also, you won't see any optimizer and executor changes. Queries will still use the same plans as existing inheritance-based partitioned tables, although as I mentioned, constraint exclusion won't yet kick in. That will be fixed very shortly. And of course, comments on syntax are welcome as well. Thanks, Amit
Also, you won't see any optimizer and executor changes. Queries will still
use the same plans as existing inheritance-based partitioned tables,
although as I mentioned, constraint exclusion won't yet kick in. That will
be fixed very shortly.
And of course, comments on syntax are welcome as well.
Thanks,
Amit
Good to know the current limitations/expectations.
Our ETL has a great number of workers that do something like this:
1. grab a file
2. based on some metadata of that file, determine the partition that would receive ALL of the rows in that file. It's actually multiple tables, all of which are partitioned, all of which fully expect the file data to fit in exactly one partition.
3. \copy into a temp table
4. Transform the data and insert the relevant bits into each of the target partitions derived in #2.
So while ATR is a major feature of true partitioning, it's not something we'd actually need in our current production environment, but I can certainly code it that way to benchmark ATR vs "know the destination partition ahead of time" vs "insane layered range_partitioning trigger + pg_partman trigger".
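Concretely, the benchmark would compare something like the following two insert paths (the staging table, file path, and column list are placeholders):

CREATE TEMP TABLE staging (logdate date NOT NULL);
\copy staging FROM 'datafile.csv' WITH (FORMAT csv)

-- destination partition known from the file's metadata: insert straight into the leaf
INSERT INTO measurement_y2006m02 SELECT logdate FROM staging;

-- same rows routed through the parent, exercising automatic tuple routing
INSERT INTO measurement SELECT logdate FROM staging;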
Currently we don't do anything like table swapping, but I've done that enough in the past that I could probably concoct a test of that too, once it's implemented.
As for the syntax, I'm not quite sure your patch addresses the concerns I voiced earlier: specifically whether VALUES IN works for RANGE as well as LIST, but I figured that would become clearer once I tried to actually use it. Currently we have partitioning on C-collated text ranges (no, they don't ship with postgres, I had to make a custom type), something like this:
part0: (,BIG_CLIENT)
part1: [BIG_CLIENT,BIG_CLIENT]
part2: (BIG_CLIENT,L)
part3: [L,MONSTROUSLY_BIG_CLIENT)
part4: [MONSTROUSLY_BIG_CLIENT,MONSTROUSLY_BIG_CLIENT]
part5: (MONSTROUSLY_BIG_CLIENT,RANDOM_CLIENT_LATE_IN_ALPHABET]
part6: (RANDOM_CLIENT_LATE_IN_ALPHABET,)
I can't implement that with a simple VALUES LESS THAN clause, unless I happen to know 'x' in 'BIG_CLIENTx', where 'x' is the exact first character in the collation sequence, which has to be something unprintable, and that would make those who later read my code say something unprintable. So yeah, I'm hoping there's some way to cleanly represent such ranges.
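To show why range notation fits this case: assuming the custom C-collated text range type I mentioned (call it clientrange here, an invented name), the single-client partition is just a degenerate closed range, and overlap between two proposed bounds is a one-operator test:

SELECT 'BIG_CLIENT' <@ clientrange('BIG_CLIENT', 'BIG_CLIENT', '[]');                   -- t: part1 holds exactly one value
SELECT clientrange(NULL, 'BIG_CLIENT', '()') && clientrange('BIG_CLIENT', 'L', '()');   -- f: part0 and part2 do not overlap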
On 02/15/2016 04:28 PM, Amit Langote wrote: > Also, you won't see any optimizer and executor changes. Queries will still > use the same plans as existing inheritance-based partitioned tables, > although as I mentioned, constraint exclusion won't yet kick in. That will > be fixed very shortly. We're not going to use CE for the new partitioning long-term, are we? This is just the first version, right? -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
Hi Josh, On 2016/02/16 11:41, Josh berkus wrote: > On 02/15/2016 04:28 PM, Amit Langote wrote: >> Also, you won't see any optimizer and executor changes. Queries will still >> use the same plans as existing inheritance-based partitioned tables, >> although as I mentioned, constraint exclusion won't yet kick in. That will >> be fixed very shortly. > > We're not going to use CE for the new partitioning long-term, are we? This > is just the first version, right? Yes. My approach in previous versions of stuffing major planner changes in with the syntax patch was not quite proper in retrospect. So, I thought I'd propose any major planner (and executor) changes later. Thanks, Amit
On Fri, Jan 15, 2016 at 5:48 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > If we have a CREATE statement for each partition, how do we generalize > that to partitions at different levels? For example, if we use something > like the following to create a partition of parent_name: > > CREATE PARTITION partition_name OF parent_name FOR VALUES ... > WITH ... TABLESPACE ... > > Do we then say: > > CREATE PARTITION subpartition_name OF partition_name ... > > to create a level 2 partition (sub-partition) of parent_name? Yes, exactly. Personally, I would be more inclined to make this a CREATE TABLE statement, like CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... CREATE TABLE subpartition_name PARTITION OF partition_name FOR VALUES ... > I ask that also because it's related to the choice of syntax to use to > declare the partition key for the multi-level case. I'm considering the > SUBPARTITION BY notation and perhaps we could generalize it to more than > just 2 levels. So, for the above case, parent_name would have been created as: > > CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ... > > Needless to say, when subpartition_name is created with the command we saw > a moment ago, the root partitioned table would be locked. In fact, adding > a partition anywhere in the hierarchy needs an exclusive lock on the root > table. Also, partition rule (the FOR VALUES clause) would be validated > against PARTITION BY or SUBPARTITION BY clause at the respective level. > > Although, I must admit I feel a little uneasy about the inherent asymmetry > in using SUBPARTITION BY for key declaration whereas piggybacking CREATE > PARTITION for creating sub-partitions. Is there a better way? I think if you've got SUBPARTITION as a keyword in the syntax anywhere, you're doing it wrong. The toplevel object shouldn't really care whether its children are themselves partitioned or not. > Do we want this at all? It seems difficult to generalize this to > multi-level hierarchy of more than 2 levels. It doesn't do anything for me. There may be somebody who wants it, but I don't see much value myself. > After thinking some more on this - I think that identical tuple > descriptors may not just be a nice-to-have but critical in some cases. For > example, consider built-in/trigger-less tuple routing. I'd imagine that > the partition to insert a tuple into would be determined just before > calling heap_insert() in ExecInsert() and CopyFrom(). That means the > HeapTuple that is passed to heap_insert() to insert into the partition > would be based on the root table's tuple descriptor. Note also that the > tuple would have passed through BR, IR triggers, constraints of the root > table. When the data is eventually queried from partitions directly, or > well even via the root table (considering existing executor capabilities), > partition's tuple descriptor at that point had better match the data that > went onto the disk. That means we had better keep at least the following > things in sync: number of attributes, name, position (attnum), type, > notnull-ness of individual attributes. So in order to do that, recursively > apply ADD/DROP COLUMN, SET WITH/WITHOUT OIDS, RENAME COLUMN, ALTER COLUMN > TYPE, SET/DROP NOT NULL on the root table to all the partitions and > prevent those sub-commands to be directly applied to any table > (partitions) in the partitioning hierarchy but the root. 
I further don't > see the point of allowing to set (or drop) column defaults in partitions > because now INSERT or COPY FROM cannot be directly applied to partitions. > Similar argument could be made for BR, IR triggers and CHECK constraints. > Am I missing something in all of this? Well, in the end there are basically two choices. Either tuple descriptors have to match exactly, and then you can reuse a tuple intended for one partition for some other partition without projection; or else they don't, and you need to project. I'm not sure that projection is expensive enough to make disallowing mismatched tuple descriptors a necessary design choice - and certainly that design choice is awkward from a UI standpoint, because we will sometimes not be able to attach a partition for a reason that the user can neither see in the \d output nor correct. But on the flip side, not having to worry about projection is nice. > An alternative to doing any of that very well may be to design > trigger-less tuple routing to be smarter about possible mismatch of the > tuple descriptors but I haven't given that a lot of thought. Is that > really an alternative worth looking into? Yes. > On one hand, I think to keep treating "partition hierarchies" as > "inheritance hierachies" might have some issues. I am afraid that > documented inheritance semantics may not be what we want to keep using for > the new partitioned tables. By that, I mean all the user-facing behaviors > where inheritance has some bearing. Should it also affect new partitioned > tables? Consider whether inheritance semantics would render infeasible > some of the things that we'd like to introduce for the new partitioned > tables such as automatic tuple routing, or keep us from improving planner > smarts and executor capabilities for partitioned tables over what we > already have. > > OTOH, I may be looking at it wrongly. We would not be required to enforce > user-facing inheritance behaviors on the new partitioned tables after all. > That is to say - it's just that new partitioned tables could still use > relevant inheritance infrastructure behind-the-scenes for planning, > execution and a few other things and not care about abiding by regular > inheritance semantics. I should just go ahead and add special cases in all > places where existing inheritance handling code stands to cause trouble > down the line for partitioned tables. We might want to mention that we do > so somewhere in documentation and also note that regular inheritance > semantics does not apply. While it sounds probably fine as implementation > for the feature released initially, a day will inevitably come when this > behind-the-scenes implementation will be changed to something more > amenable to better optimization. But that's for future... This seems pretty vague to me. I don't know that I have an opinion without a more specific list of questions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 22, 2016 at 8:54 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I'd like to comment on the one thing and that's the syntax. It seems to me > we're really trying to reinvent the wheel and come up with our own version > of the syntax. Is there a particular reason why not to look at the syntax of > the other databases and adapt as much of the existing syntax as possible? I'm not sure this is at all standardized. How do we decide whose syntax to adopt? Oracle's just because it's Oracle? I don't particularly like inheriting the limitations of Oracle's syntax - such as at most 2 levels of partitioning, weird VALUES LESS THAN syntax that doesn't look like anything else we do in PG and has no way to specify an opclass, etc. > I think that's for a few reasons - firstly it makes the life much easier for > the DBAs and users who are either migrating to PostgreSQL or have to manage > a mix of databases. Secondly, it serves as a valuable source of engineering > info, preventing the "I haven't thought of this use case" problem. > > An example of this is the proposed syntax for adding a partition > > CREATE TABLE measurement_fail > PARTITION OF measurement > FOR VALUES START ('2006-02-15') END ('2006-03-01'); > > which seems a bit awkward as both the databases I'm familiar with (Oracle > and Sybase) use ALTER TABLE to do this > > ALTER TABLE measurement > ADD PARTITION measurement_fail VALUES LESS THAN ( ... ) > > And so on for the other commands. One thing to think about is that pg_dump --binary-upgrade needs to decorate each command that creates a relation. I guess you could decorate ALTER TABLE with the same stuff we currently apply to CREATE TABLE but... > That being said, I entirely agree with Simon (and others) that getting the > planner part work is the crucial part of the patch. But I also think that a > proper abstraction (thanks to good syntax) may be a valuable hint how to > define the catalogs and such. No argument on that from me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016/02/15 10:55, Amit Langote wrote: > required. There is also basic planner support but no support yet to enable > constraint exclusion on partitions (which will be fulfilled shortly by > installing equivalent check constraints on partitions). Just to follow up on this - attached now adds equivalent check constraint with partition creation. Constraint exclusion should work. No regression tests yet though. >From now on, instead of attaching multiple files like in the previous message, I will send a single tar.gz which will contain patches created by git-format-patch. Thanks, Amit
Attachment
On 2016/02/16 21:57, Robert Haas wrote: > On Fri, Jan 15, 2016 at 5:48 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> If we have a CREATE statement for each partition, how do we generalize >> that to partitions at different levels? For example, if we use something >> like the following to create a partition of parent_name: >> >> CREATE PARTITION partition_name OF parent_name FOR VALUES ... >> WITH ... TABLESPACE ... >> >> Do we then say: >> >> CREATE PARTITION subpartition_name OF partition_name ... >> >> to create a level 2 partition (sub-partition) of parent_name? > > Yes, exactly. > > Personally, I would be more inclined to make this a CREATE TABLE statement, like > > CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... > CREATE TABLE subpartition_name PARTITION OF partition_name FOR VALUES ... I guess the first of which would actually be: CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... PARTITION BY ... Some might think that writing potentially the same PARTITION BY clause 100 times for 100 level-1 partitions could be cumbersome. That is what SUBPARTITION BY notation may be good as a shorthand for. > I think if you've got SUBPARTITION as a keyword in the syntax > anywhere, you're doing it wrong. The toplevel object shouldn't really > care whether its children are themselves partitioned or not. This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It might as well be just cascaded PARTITION BYs. The point is to specify as many of those with CREATE TABLE toplevel as the number of levels of partitioning we want. That does however fix the number of levels in advance. In the patch I have posted, here are some details of the tuple routing implementation for instance - the top-level parent knows only of its immediate partitions. Once a level-1 partition from that list is determined using a tuple's level-1 key, the tuple is passed down to choose one of its own partitions using the level-2 key. The key descriptor list is associated with the top-level parent alone and the recursive code knows to iterate the key list to apply nth key for nth level. The recursion happens in the partition module. Now there are also functions to let obtain, say *all* or only *leaf* partitions (OIDs) of a top-level parent but they are used for certain DDL scenarios. As far as DML is concerned, the level-at-a-time recursive approach as described above is used. Queries not yet because the plan is a flattened append of leaf partitions anyway. If such notation convenience at the expense of loss of generality is not worth it, I'm willing to go ahead and implement SUBPARTITION-less syntax. CREATE TABLE toplevel() PARTITION BY CREATE TABLE partition PARTITION OF toplevel FOR VALUES ... PARTITION BY CREATE TABLE subpartition PARTITION OF partition FOR VALUES ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE ALTER TABLE partitioned DETACH PARTITION name [ USING newname ] While we are on the syntax story, how about FOR VALUES syntax for range partitions (sorry for piggybacking it here in this message). At least some people dislike LESS THAN notation. Corey Huinker says we should be using range type literals for that. It's not clear though that using range type literals directly is a way ahead. For one, range type API expects there to exist a range type with given element type. Whereas all we require for range partitioning proper is for column type to have a btree operator class. Should we require it to have an associated range type as well? 
Don't think that there exists an API to check that either. All in all, range types are good to implement things in applications but not so much within the backend (unless I'm missing something). I know reinventing the wheel is disliked as well but perhaps we could offer something like the following because Corey offered some examples which would help from the flexibility: START [ EXCL ] (startval) END [ INCL ] (endval) That is, in range type notation, '[startval, endval)' is the default behavior. So for each partition, there is at least the following pieces of metadata: Datum *startval; bool startexcl; Datum *endval; bool endincl; That requires quite some gymnastics during DDL (with no help from range type infrastructure) to maintain the invariant that no two range partitions overlap. Even gaps can result which are considered undesirable, so maybe, the invariant to maintain would include no gaps in addition to no overlap. That would have us looking for a definition of a "gap" for all sorts of btree supporting data types. Some people said something akin to interval partitioning would be good like: PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START WITH value; But that could be just a UI addition to the design where each partition has [startval, endval) bounds. In any case, not a version 1 material I'd think. >> Do we want this at all? It seems difficult to generalize this to >> multi-level hierarchy of more than 2 levels. > > It doesn't do anything for me. There may be somebody who wants it, > but I don't see much value myself. OK. >> After thinking some more on this - I think that identical tuple >> descriptors may not just be a nice-to-have but critical in some cases. For >> example, consider built-in/trigger-less tuple routing. I'd imagine that >> the partition to insert a tuple into would be determined just before >> calling heap_insert() in ExecInsert() and CopyFrom(). That means the >> HeapTuple that is passed to heap_insert() to insert into the partition >> would be based on the root table's tuple descriptor. Note also that the >> tuple would have passed through BR, IR triggers, constraints of the root >> table. When the data is eventually queried from partitions directly, or >> well even via the root table (considering existing executor capabilities), >> partition's tuple descriptor at that point had better match the data that >> went onto the disk. That means we had better keep at least the following >> things in sync: number of attributes, name, position (attnum), type, >> notnull-ness of individual attributes. So in order to do that, recursively >> apply ADD/DROP COLUMN, SET WITH/WITHOUT OIDS, RENAME COLUMN, ALTER COLUMN >> TYPE, SET/DROP NOT NULL on the root table to all the partitions and >> prevent those sub-commands to be directly applied to any table >> (partitions) in the partitioning hierarchy but the root. I further don't >> see the point of allowing to set (or drop) column defaults in partitions >> because now INSERT or COPY FROM cannot be directly applied to partitions. >> Similar argument could be made for BR, IR triggers and CHECK constraints. >> Am I missing something in all of this? > > Well, in the end there are basically two choices. Either tuple > descriptors have to match exactly, and then you can reuse a tuple > intended for one partition for some other partition without > projection; or else they don't, and you need to project. 
I'm not sure > that projection is expensive enough to make disallowing mismatched > tuple descriptors a necessary design choice - and certainly that > design choice is awkward from a UI standpoint, because we will > sometimes not be able to attach a partition for a reason that the user > can neither see in the \d output nor correct. But on the flip side, > not having to worry about projection is nice. > >> An alternative to doing any of that very well may be to design >> trigger-less tuple routing to be smarter about possible mismatch of the >> tuple descriptors but I haven't given that a lot of thought. Is that >> really an alternative worth looking into? > > Yes. I see the trade-off. I agree that considering the significance for attach partition case is quite important. So, the tuple routing code should be ready to use projection if there happens to be a partition with differing tuple descriptor. In the code I posted, a ResultRelInfo is lazily built afresh for each inserted tuple in ExecInsert's case and for each tuple where the chosen partition is different from the previous tuple's in CopyFrom's case. One can feel that there is a certain overhead to that approach for the bulk-loading case (almost every CopyFrom invocation). Now if projection enters this picture, we have to consider that too. Should we initialize ResultRelInfo's and corresponding ProjectionInfo's for all partitions beforehand? Consider how ModifyTable currently handles update on inheritance set, for example. That would incur unnecessary overhead if only a single tuple is inserted. But it would certainly help bulk-loading case. Am I missing something? >> On one hand, I think to keep treating "partition hierarchies" as >> "inheritance hierachies" might have some issues. I am afraid that >> documented inheritance semantics may not be what we want to keep using for >> the new partitioned tables. By that, I mean all the user-facing behaviors >> where inheritance has some bearing. Should it also affect new partitioned >> tables? Consider whether inheritance semantics would render infeasible >> some of the things that we'd like to introduce for the new partitioned >> tables such as automatic tuple routing, or keep us from improving planner >> smarts and executor capabilities for partitioned tables over what we >> already have. >> >> OTOH, I may be looking at it wrongly. We would not be required to enforce >> user-facing inheritance behaviors on the new partitioned tables after all. >> That is to say - it's just that new partitioned tables could still use >> relevant inheritance infrastructure behind-the-scenes for planning, >> execution and a few other things and not care about abiding by regular >> inheritance semantics. I should just go ahead and add special cases in all >> places where existing inheritance handling code stands to cause trouble >> down the line for partitioned tables. We might want to mention that we do >> so somewhere in documentation and also note that regular inheritance >> semantics does not apply. While it sounds probably fine as implementation >> for the feature released initially, a day will inevitably come when this >> behind-the-scenes implementation will be changed to something more >> amenable to better optimization. But that's for future... > > This seems pretty vague to me. I don't know that I have an opinion > without a more specific list of questions. Sorry for rather incoherent articulation in the previous message. Let me try to say something that's perhaps more specific. 
Consider that we create partitions as inheritance children, that is, creating a partition using: CREATE TABLE partition PARTITION OF parent FOR VALUES ... is equivalent to saying CREATE TABLE partition (...) INHERITS (parent) except that the latter allows partition to have its own column definitions, table constraints and multiple parents as long as things are conflict-free. The former requires specifying partition bounding values. The common thing between the two then is StoreCatalogInheritance(partOid, parentOid) that will mark partition as inheritance child of parent in pg_inherits. So, our new partitions are really inheritance children but that's not apparent to users (where this last bit is important). Then consider ALTER TABLE partition - should we need to handle it in way different from existing inheritance code would do - * Prevent something whereas regular inheritance wouldn't?* Do something instead of/in addition to whatever regular inheritancedoes? Consider adding a column to partition - regular inheritance wouldn't prevent it because adding a column to child table doesn't get in the way of how inheritance works. Now we can make it so that if table is a partition created with CRATE TABLE PARTITION OF (which we can tell because the table has pg_partition entry), we should *actively* prevent adding a column. Inheritance already prevents dropping an inherited attribute in child but says "cannot drop inherited column". IMO in a partition's case, it should say "cannot drop columns of a partition". OTOH, adding a column to parent propagates to all partitions by way of ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is necessary for partitioned tables (which we can tell from relkind or maybe because table has pg_partitioned_rel entry) - especially no need to propagate column default and check constraint, etc. Because one cannot apply insert/update on partitions anyway. Then because defaults and check constraints on partitions are really useless, when a user adds them to a partition directly, we should say, "cannot alter column default for a partition" and "cannot add check constraint to a partition" with "instead do that operation on the root parent" hint. But what is a "root parent"? How to determine one for a given partition? Perhaps, it's not much of a POLA violation a user must have done something to have ended up in the error occurring, that is, created the table as partition of (partition of ...) something. Thanks a lot for your time! Thanks, Amit
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Feb 18, 2016 at 12:41 AM, Amit Langote <span dir="ltr"><<ahref="mailto:Langote_Amit_f8@lab.ntt.co.jp" target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 00 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="a3s" id=":1pe" style="overflow:hidden">START [ EXCL ] (startval)END [ INCL ] (endval)<br /><br /> That is, in range type notation, '[startval, endval)' is the default<br /> behavior.So for each partition, there is at least the following pieces of<br /> metadata:</div></blockquote></div><div class="gmail_extra"><br/></div><div class="gmail_extra">This is really close, and if it is what we ended up with we wouldbe able to use it.</div><div class="gmail_extra"><br /></div><div class="gmail_extra">I suggest that the range notationcan be used even when no suitable range type exists.<br /></div><div class="gmail_extra"><br /></div><div class="gmail_extra">Iassume the code for parsing a range spec regardless of data type already exists, but in case it doesn't,take a range spec of unknown type:</div><div class="gmail_extra"><br /></div><div class="gmail_extra">[x,y)</div><divclass="gmail_extra"><br /></div><div class="gmail_extra">x and y are either going to beraw strings or doublequoted strings with possible doublequote escapes, each of which would be coercible into the the typeof the partition column.</div><div class="gmail_extra"><br /></div><div class="gmail_extra">In other words, if your stringvalues were 'blah , blah ' and 'fabizzle', the [) range spec would be ["blah , blah ",fabizzle).</div><div class="gmail_extra"><br/></div><div class="gmail_extra">Using regular range specs syntax also allows for the range to beunbounded in either or both directions, which is a possibility, especially in newer tables where the expected distributionof data is unknown.</div><div class="gmail_extra"><br /></div>p.s. Sorry I haven't been able to kick the tiresjust yet. We have a very good use case for this, it's just a matter of getting a machine and the time to devote to it.</div></div>
On 2/16/16 9:56 PM, Amit Langote wrote: > From now on, instead of attaching multiple files like in the previous > message, I will send a single tar.gz which will contain patches created by > git-format-patch. Please don't do that.
On Sat, Feb 20, 2016 at 1:41 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 2/16/16 9:56 PM, Amit Langote wrote: >> From now on, instead of attaching multiple files like in the previous >> message, I will send a single tar.gz which will contain patches created by >> git-format-patch. > > Please don't do that. OK, will remember. Sorry about that. Thanks, Amit
On Thu, Feb 18, 2016 at 11:11 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Personally, I would be more inclined to make this a CREATE TABLE statement, like >> >> CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... >> CREATE TABLE subpartition_name PARTITION OF partition_name FOR VALUES ... > > I guess the first of which would actually be: > > CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... > PARTITION BY ... > > Some might think that writing potentially the same PARTITION BY clause 100 > times for 100 level-1 partitions could be cumbersome. That is what > SUBPARTITION BY notation may be good as a shorthand for. I think that anybody who is doing that much partitioning is going to use a tool to generate the DDL anyway, so it doesn't really matter. >> I think if you've got SUBPARTITION as a keyword in the syntax >> anywhere, you're doing it wrong. The toplevel object shouldn't really >> care whether its children are themselves partitioned or not. > > This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It > might as well be just cascaded PARTITION BYs. The point is to specify as > many of those with CREATE TABLE toplevel as the number of levels of > partitioning we want. That does however fix the number of levels in advance. Which I think is not good. If we say that a table can be partitioned, and a table that is a partition can also be partitioned, we've got a nice general system. Fixing the number of partitioning levels for the sake of a little syntactic sugar is, IMHO, getting our priorities backwards. > In the patch I have posted, here are some details of the tuple routing > implementation for instance - the top-level parent knows only of its > immediate partitions. Once a level-1 partition from that list is > determined using a tuple's level-1 key, the tuple is passed down to choose > one of its own partitions using the level-2 key. The key descriptor list > is associated with the top-level parent alone and the recursive code knows > to iterate the key list to apply nth key for nth level. The recursion > happens in the partition module. I haven't looked at the code, but that sounds like the right approach. > Now there are also functions to let obtain, say *all* or only *leaf* > partitions (OIDs) of a top-level parent but they are used for certain DDL > scenarios. As far as DML is concerned, the level-at-a-time recursive > approach as described above is used. Queries not yet because the plan is a > flattened append of leaf partitions anyway. OK. > If such notation convenience at the expense of loss of generality is not > worth it, I'm willing to go ahead and implement SUBPARTITION-less syntax. > > CREATE TABLE toplevel() PARTITION BY > CREATE TABLE partition PARTITION OF toplevel FOR VALUES ... PARTITION BY > CREATE TABLE subpartition PARTITION OF partition FOR VALUES > ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE > ALTER TABLE partitioned DETACH PARTITION name [ USING newname ] > > While we are on the syntax story, how about FOR VALUES syntax for range > partitions (sorry for piggybacking it here in this message). At least some > people dislike LESS THAN notation. Corey Huinker says we should be using > range type literals for that. It's not clear though that using range type > literals directly is a way ahead. For one, range type API expects there to > exist a range type with given element type. 
Whereas all we require for > range partitioning proper is for column type to have a btree operator > class. Should we require it to have an associated range type as well? > Don't think that there exists an API to check that either. All in all, > range types are good to implement things in applications but not so much > within the backend (unless I'm missing something). I know reinventing the > wheel is disliked as well but perhaps we could offer something like the > following because Corey offered some examples which would help from the > flexibility: > > START [ EXCL ] (startval) END [ INCL ] (endval) I don't think using range type literals is gonna work. There's no guarantee that the necessary range types exist. However, we could possibly use a syntax inspired by the syntax for range types. I'm a little nervous about asking people to type commands with mismatching braces: CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES [ 1, 10 ); ...but maybe it's a good idea. It certainly has the advantage of being more compact than a lot of there ways we might choose to write that. And I think LESS THAN sucks. It's just clunky and awkward syntax. > Some people said something akin to interval partitioning would be good like: > > PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START > WITH value; > > But that could be just a UI addition to the design where each partition > has [startval, endval) bounds. In any case, not a version 1 material I'd > think. Agreed; not right now. > I see the trade-off. I agree that considering the significance for attach > partition case is quite important. > > So, the tuple routing code should be ready to use projection if there > happens to be a partition with differing tuple descriptor. In the code I > posted, a ResultRelInfo is lazily built afresh for each inserted tuple in > ExecInsert's case and for each tuple where the chosen partition is > different from the previous tuple's in CopyFrom's case. One can feel that > there is a certain overhead to that approach for the bulk-loading case > (almost every CopyFrom invocation). Now if projection enters this picture, > we have to consider that too. Should we initialize ResultRelInfo's and > corresponding ProjectionInfo's for all partitions beforehand? Consider how > ModifyTable currently handles update on inheritance set, for example. That > would incur unnecessary overhead if only a single tuple is inserted. But > it would certainly help bulk-loading case. Am I missing something? I'd say you are missing benchmarking data. :-) Why should the single-tuple case be harmed here? > Sorry for rather incoherent articulation in the previous message. Let me > try to say something that's perhaps more specific. > > Consider that we create partitions as inheritance children, that is, > creating a partition using: > > CREATE TABLE partition PARTITION OF parent FOR VALUES ... > > is equivalent to saying > > CREATE TABLE partition (...) INHERITS (parent) > > except that the latter allows partition to have its own column > definitions, table constraints and multiple parents as long as things are > conflict-free. The former requires specifying partition bounding values. > The common thing between the two then is StoreCatalogInheritance(partOid, > parentOid) that will mark partition as inheritance child of parent in > pg_inherits. So, our new partitions are really inheritance children but > that's not apparent to users (where this last bit is important). So far, this sounds good to me. 
> Then consider ALTER TABLE partition - should we need to handle it in way > different from existing inheritance code would do - > > * Prevent something whereas regular inheritance wouldn't? > * Do something instead of/in addition to whatever regular inheritance does? > > Consider adding a column to partition - regular inheritance wouldn't > prevent it because adding a column to child table doesn't get in the way > of how inheritance works. Now we can make it so that if table is a > partition created with CRATE TABLE PARTITION OF (which we can tell because > the table has pg_partition entry), we should *actively* prevent adding a > column. Inheritance already prevents dropping an inherited attribute in > child but says "cannot drop inherited column". IMO in a partition's case, > it should say "cannot drop columns of a partition". Agreed. So maybe pg_inherits gets an additional column that indicates whether this is "regular" inheritance or "partitioning" inheritance, and in the latter case there are additional restrictions. > OTOH, adding a column to parent propagates to all partitions by way of > ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is > necessary for partitioned tables (which we can tell from relkind or maybe > because table has pg_partitioned_rel entry) - especially no need to > propagate column default and check constraint, etc. Because one cannot > apply insert/update on partitions anyway. Hmm. I feel like maybe that should be allowed. If the user wants to bulk-load a bunch of data and knows that it goes in a particular partition, why should they have to pay the overhead of determining that anew for each row? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
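In catalog terms, the pg_inherits suggestion above would amount to a new boolean column on that catalog, used along these lines (the column name inhispartition is made up here purely for illustration):

SELECT inhparent::regclass
FROM pg_inherits
WHERE inhrelid = 'part1'::regclass
  AND inhispartition;   -- hypothetical flag: true only for "partitioning" inheritance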
On 2016/02/20 5:06, Corey Huinker wrote:
> On Thu, Feb 18, 2016 at 12:41 AM, Amit Langote wrote:
>
>> START [ EXCL ] (startval) END [ INCL ] (endval)
>>
>> That is, in range type notation, '[startval, endval)' is the default
>> behavior. So for each partition, there is at least the following pieces of
>> metadata:
>
> This is really close, and if it is what we ended up with we would be able
> to use it.
>
> I suggest that the range notation can be used even when no suitable range
> type exists.
>
> I assume the code for parsing a range spec regardless of data type already
> exists, but in case it doesn't, take a range spec of unknown type:
>
> [x,y)
>
> x and y are either going to be raw strings or doublequoted strings with
> possible doublequote escapes, each of which would be coercible into the
> type of the partition column.
>
> In other words, if your string values were 'blah , blah ' and 'fabizzle',
> the [) range spec would be ["blah , blah ",fabizzle).

Hm, I see. How about multi-column keys? Do we care enough about that use case? I don't see a nice-enough-looking range literal for such keys. Consider, for instance, the partitioned table defined as:

CREATE TABLE foo(c1 char(2), c2 char(2)) PARTITION BY RANGE (c1, c2);

where we'd use LESS THAN as:

CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES LESS THAN ('b', '2');
CREATE TABLE foo_ax2x PARTITION OF foo FOR VALUES LESS THAN ('b', '3');
CREATE TABLE foo_ax3x PARTITION OF foo FOR VALUES LESS THAN ('b', '4');
CREATE TABLE foo_bx1x PARTITION OF foo FOR VALUES LESS THAN ('c', '2');
CREATE TABLE foo_bx2x PARTITION OF foo FOR VALUES LESS THAN ('c', '3');
CREATE TABLE foo_bx3x PARTITION OF foo FOR VALUES LESS THAN ('c', '4');

I guess it would be the following with the new range spec:

CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES ['b', 'b'], ['1', '2');
CREATE TABLE foo_ax2x PARTITION OF foo FOR VALUES ['b', 'b'], ['2', '3');
CREATE TABLE foo_ax3x PARTITION OF foo FOR VALUES ['b', 'b'], ['3', '4');
CREATE TABLE foo_bx1x PARTITION OF foo FOR VALUES ['c', 'c'], ['1', '2');
CREATE TABLE foo_bx2x PARTITION OF foo FOR VALUES ['c', 'c'], ['2', '3');
CREATE TABLE foo_bx3x PARTITION OF foo FOR VALUES ['c', 'c'], ['3', '4');

The first column does not unambiguously select a partition with such keys. Now I can imagine that it is possible to "emulate" such a multi-column range key with k columns by k levels of range-range-* partitioning:

CREATE TABLE foo(c1 char(2), c2 char(2)) PARTITION BY RANGE (c1);

CREATE TABLE foo_ax PARTITION OF foo FOR VALUES ['a', 'b') PARTITION BY RANGE (c2);
CREATE TABLE foo_ax1x PARTITION OF foo_ax FOR VALUES ['1', '2');
CREATE TABLE foo_ax2x PARTITION OF foo_ax FOR VALUES ['2', '3');
CREATE TABLE foo_ax3x PARTITION OF foo_ax FOR VALUES ['3', '4');

CREATE TABLE foo_bx PARTITION OF foo FOR VALUES ['b', 'c') PARTITION BY RANGE (c2);
CREATE TABLE foo_bx1x PARTITION OF foo_bx FOR VALUES ['1', '2');
CREATE TABLE foo_bx2x PARTITION OF foo_bx FOR VALUES ['2', '3');
CREATE TABLE foo_bx3x PARTITION OF foo_bx FOR VALUES ['3', '4');

But IIRC, I have been contradicted once before regarding whether the two ways are exactly the same.

> Using regular range specs syntax also allows for the range to be unbounded
> in either or both directions, which is a possibility, especially in newer
> tables where the expected distribution of data is unknown.

We would also want to think about what subset of the many permutations of this syntax to accept as range specs for new partitions.
Mostly to preserve the non-overlapping invariant, and I think it would also be nice to prevent gaps. Consider that once we create:

PARTITION FOR VALUES [current_date,);

Now to create a new partition starting at a later date, we have to have a "split partition" feature which would scan the above partition to distribute the existing data rows appropriately to the resulting two partitions. Right? IOW, one shouldn't create an unbounded partition if more partitions in the unbounded direction are expected to be created. It would be OK for unbounded partitions to be on the lower end most of the time.

> p.s. Sorry I haven't been able to kick the tires just yet. We have a very
> good use case for this, it's just a matter of getting a machine and the
> time to devote to it.

I would appreciate it. You could wait a little more for my next submission which will contain some revisions to the tuple routing code.

Thanks,
Amit
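Concretely, the situation being described would play out like this under the proposed syntax (logs and its partitions are hypothetical names used only for illustration):

CREATE TABLE logs (ts date NOT NULL) PARTITION BY RANGE ON (ts);
CREATE TABLE logs_2016 PARTITION OF logs FOR VALUES START ('2016-01-01');
-- logs_2016 now covers ['2016-01-01', +INFINITY)
CREATE TABLE logs_2017 PARTITION OF logs FOR VALUES START ('2017-01-01');
-- ERROR: cannot create partition that overlaps with an existing one
-- The only way out would be a (yet unimplemented) "split partition" command
-- that moves the 2017 rows out of logs_2016 into the new partition.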
> Hm, I see. How about multi-column keys? Do we care enough about that use
> case? I don't see a nice-enough-looking range literal for such keys.
> Consider for instance,
> With the partitioned table defined as:
> CREATE TABLE foo(c1 char(2), c2 char(2)) PARTITION BY RANGE (c1, c2);
Good question! I would assume that we'd use a syntax that presumes c1 and c2 are a hypothetical composite type. But what does that look like?
To answer it, I tried this:
# create type duple as (a text, b text);
CREATE TYPE
# create type duplerange as range (subtype = duple);
CREATE TYPE
# select '(beebop,alula)'::duple;
duple
----------------
(beebop,alula)
(1 row)
# select '("hey ho","letsgo")'::duple;
duple
-------------------
("hey ho",letsgo)
(1 row)
analytics=# select duplerange('(beebop,alula)','("hey ho","letsgo")','(]');
duplerange
------------------------------------------
("(beebop,alula)","(""hey ho"",letsgo)"]
(1 row)
So I would assume that we'd use a syntax that presumed the columns were in a composite range type.
Which means your creates would look like (following Robert Haas's implied suggestion that we leave off the string literal quotes):
CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES ( , (b,2) );
CREATE TABLE foo_ax2x PARTITION OF foo FOR VALUES [ (b,2), (b,3) );
CREATE TABLE foo_ax3x PARTITION OF foo FOR VALUES [ (b,3), (b,4) );
CREATE TABLE foo_bx1x PARTITION OF foo FOR VALUES [ (b,4), (c,2) );
CREATE TABLE foo_bx2x PARTITION OF foo FOR VALUES [ (c,2), (c,3) );
CREATE TABLE foo_bx3x PARTITION OF foo FOR VALUES [ (c,3), (c,4) );
That's not terrible looking.
> We would want to also think about what subset of many permutations of this
> syntax to accept range specs for new partitions. Mostly to preserve the
> non-overlapping invariant and I think it would also be nice to prevent gaps.
Gaps *might* be intentional. I can certainly see where we'd want to set up warnings for discontinuity, or perhaps some utility functions:
pg_partitions_ranges_are_continuous('master_table_name')
pg_partitions_are_adjacent('master_table_name','p1','p2')
But for the most part, range partitions evolve from splits when one partition grows too big, so that won't be such a problem.
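The continuity check itself is easy enough to express; assuming the bounds were exposed somewhere as a relation partition_bounds(partname, lb, ub) (a purely hypothetical name), the core query of such a utility function might be:

SELECT count(*) = 0 AS continuous
FROM (SELECT ub, lead(lb) OVER (ORDER BY lb) AS next_lb
      FROM partition_bounds) s
WHERE next_lb IS NOT NULL AND next_lb <> ub;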
> Consider that once we create:
> PARTITION FOR VALUES [current_date,);
> Now to create a new partition starting at later date, we have to have a
> "split partition" feature which would scan the above partition to
> distribute the existing data rows appropriately to the resulting two
> partitions. Right?
Correct. And depending on policy, that might be desirable and might not be.
If the table were for death records, we'd very much want to reject rows in the future, if only to avoid upsetting the person.
If the table were of movie release dates, we'd *expect* that only dates (,current_date] would be entered, but if someone chose to leak a release date, we'd want to capture that and deal with it later.
So yeah, we're going to (eventually) need a SPLIT PARTITION that migrates rows to a new partition.
> IOW, one shouldn't create an unbounded partition if more partitions in the
> unbounded direction are expected to be created. It would be OK for
> unbounded partitions to be on the lower end most of the times.
On this I'll have to disagree. My own use case where I use my range_partitioning extension starts off with a single partition () and all new partitions are splits of that. The ranges evolve over time as partitions grow and slow down. It's nice because we're not trying to predict where growth will be, we split where growth is.
> p.s. Sorry I haven't been able to kick the tires just yet. We have a very
> good use case for this, it's just a matter of getting a machine and the
> time to devote to it.
> I would appreciate it. You could wait a little more for my next
> submission which will contain some revisions to the tuple routing code.
Ok, I'll wait a bit. In the meantime I can tell you a bit about the existing production system I'm hoping to replicate with true partitioning. It looks like this:
Big Master Table:
    Range partition by C collated text
        Date Range
        Date Range
        Date Range
        ...
    Range partition by C collated text
        Date Range
        Date Range
        ...
    ...
Currently this is accomplished through my range_partitioning module, and then using pg_partman on those partitions. It works, but it's a lot of moving parts.
The machine will be a 32 core AWS box. As per usual with AWS, it will have ample memory and CPU, and be somewhat starved for I/O.
Question: I haven't dove into the code, but I was curious about your tuple routing algorithm. Is there any way for the algorithm to begin its scan of candidate partitions based on the destination of the last row inserted by this statement? I ask because most use cases (that I am aware of) have data that would naturally cluster in the same partition.
Why not base it on "Exclusion Constraints"?

Most discussions as of late seem to focus on range overlaps, whose appeal (I would think) is that it supports both "equality" and "overlaps", two popular partitioning schemes. "Equality" as in "value1 = value2" can be implemented with "range overlaps" as "range(value1,value) = range(value,value2)".

I would think that Partitioning schemes can be Declarative, Efficient and not restricted to Equality and Overlaps as long as all partitions (of a partitioned table) are using a single partitioning definition expressed as:
- An Immutable Expression on tuple columns, in the simplest case a single column
- An Operator, in the simplest case, "equality"

That seems very close to the semantics of "Constraint Exclusion" as described here:
http://thoughts.davisjeff.com/2010/09/25/exclusion-constraints-are-generalized-sql-unique/

If partitioning could be based on EC, it would bring these additional benefits:
- The choice of operator, as long as it is boolean, commutative and indexable
- The use of an Expression/Function and not just bare columns

Jean-Pierre Pelletier
On 2016/02/23 22:51, Robert Haas wrote: > On Thu, Feb 18, 2016 at 11:11 AM, Amit Langote wrote: >> Some might think that writing potentially the same PARTITION BY clause 100 >> times for 100 level-1 partitions could be cumbersome. That is what >> SUBPARTITION BY notation may be good as a shorthand for. > > I think that anybody who is doing that much partitioning is going to > use a tool to generate the DDL anyway, so it doesn't really matter. >>> I think if you've got SUBPARTITION as a keyword in the syntax >>> anywhere, you're doing it wrong. The toplevel object shouldn't really >>> care whether its children are themselves partitioned or not. >> >> This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It >> might as well be just cascaded PARTITION BYs. The point is to specify as >> many of those with CREATE TABLE toplevel as the number of levels of >> partitioning we want. That does however fix the number of levels in advance. > > Which I think is not good. If we say that a table can be partitioned, > and a table that is a partition can also be partitioned, we've got a > nice general system. Fixing the number of partitioning levels for the > sake of a little syntactic sugar is, IMHO, getting our priorities > backwards. OK. To reiterate the syntax: CREATE TABLE parent(...) PARTITION BY CREATE TABLE partition PARTITION OF parent FOR VALUES ... [ PARTITION BY ] ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE source_table [ NO VALIDATE ] ALTER TABLE partitioned DETACH PARTITION name [ WITH new_name ] A note about NO VALIDATE in ATTACH PARTITION: If specified, it means user is telling the system to "trust" that none of the rows contained in the source table lie outside partition boundary specification (the FOR VALUES clause). Which is not the same thing as adding a NOT VALID check constraint because the check constraint is assumed invalid by the optimizer until explicitly validated by VALIDATE CONSTRAINT. The default behavior is to validate by scanning the source table to check for violating rows and fail adding the partition, if any. Because adding a partition requires an exclusive lock on the parent, the default behavior may cause us to have the lock for long durations which may be undesirable. Should do better than that? >> While we are on the syntax story, how about FOR VALUES syntax for range >> partitions (sorry for piggybacking it here in this message). At least some >> people dislike LESS THAN notation. Corey Huinker says we should be using >> range type literals for that. It's not clear though that using range type >> literals directly is a way ahead. For one, range type API expects there to >> exist a range type with given element type. Whereas all we require for >> range partitioning proper is for column type to have a btree operator >> class. Should we require it to have an associated range type as well? >> Don't think that there exists an API to check that either. All in all, >> range types are good to implement things in applications but not so much >> within the backend (unless I'm missing something). I know reinventing the >> wheel is disliked as well but perhaps we could offer something like the >> following because Corey offered some examples which would help from the >> flexibility: >> >> START [ EXCL ] (startval) END [ INCL ] (endval) > > I don't think using range type literals is gonna work. There's no > guarantee that the necessary range types exist. However, we could > possibly use a syntax inspired by the syntax for range types. 
I'm a > little nervous about asking people to type commands with mismatching > braces: > > CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES [ 1, 10 ); > > ...but maybe it's a good idea. It certainly has the advantage of > being more compact than a lot of there ways we might choose to write > that. And I think LESS THAN sucks. It's just clunky and awkward > syntax. Slightly concerned about multi-column range partition key but as suggested by Corey, we can use record-like notation. CREATE TABLE foo(c1 char(2), c2 char(2)) PARTITION BY RANGE (c1, c2); CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ ('a', 1), ('a', 2) ); CREATE TABLE foo_ax2x PARTITION OF foo FOR VALUES [ ('a', 2), ('a', 3) ); CREATE TABLE foo_ax3x PARTITION OF foo FOR VALUES [ ('a', 3), ('a', 4) ); CREATE TABLE foo_bx1x PARTITION OF foo FOR VALUES [ ('b', 1), ('b', 2) ); CREATE TABLE foo_bx2x PARTITION OF foo FOR VALUES [ ('b', 2), ('b', 3) ); CREATE TABLE foo_bx3x PARTITION OF foo FOR VALUES [ ('b', 3), ('b', 4) ); >> I see the trade-off. I agree that considering the significance for attach >> partition case is quite important. >> >> So, the tuple routing code should be ready to use projection if there >> happens to be a partition with differing tuple descriptor. In the code I >> posted, a ResultRelInfo is lazily built afresh for each inserted tuple in >> ExecInsert's case and for each tuple where the chosen partition is >> different from the previous tuple's in CopyFrom's case. One can feel that >> there is a certain overhead to that approach for the bulk-loading case >> (almost every CopyFrom invocation). Now if projection enters this picture, >> we have to consider that too. Should we initialize ResultRelInfo's and >> corresponding ProjectionInfo's for all partitions beforehand? Consider how >> ModifyTable currently handles update on inheritance set, for example. That >> would incur unnecessary overhead if only a single tuple is inserted. But >> it would certainly help bulk-loading case. Am I missing something? > > I'd say you are missing benchmarking data. :-) > > Why should the single-tuple case be harmed here? I think setting up N ResultRelInfos in advance where the tuple would only ever require one might be superfluous. But that may point to some flaw in my original design or thinking about the case. As I said, my approach so far is to "lazily" set up a ResultRelInfo once a valid leaf partition for the tuple is found and OID returned. Now there will also be a ProjectionInfo if the target partition's physical descriptor happens to differ from the root table. All we ever start with in ExecInsert() is root table's ResultRelInfo which comes with executor state related to partition key (expression states, if any) and information about root table's partitions. ResultRelInfos for actual partitions that the tuple is inserted into are ephemeral and are created for ExecInsertIndexTuples's perusal only, that is, switched back to root table's right after inserting index tuples. Maybe, there should be an array in EState of ResultRelInfos and ProjectionInfos of size N where N is the number of leaf partitions and indexed by leaf partition sequence number. The sequence number would be based on OID order. That means the partition module now returns the sequence number instead of OID. I indeed should collect some performance numbers for any resulting implementation. 
>> Consider that we create partitions as inheritance children, that is, >> creating a partition using: >> >> CREATE TABLE partition PARTITION OF parent FOR VALUES ... >> >> is equivalent to saying >> >> CREATE TABLE partition (...) INHERITS (parent) >> >> except that the latter allows partition to have its own column >> definitions, table constraints and multiple parents as long as things are >> conflict-free. The former requires specifying partition bounding values. >> The common thing between the two then is StoreCatalogInheritance(partOid, >> parentOid) that will mark partition as inheritance child of parent in >> pg_inherits. So, our new partitions are really inheritance children but >> that's not apparent to users (where this last bit is important). > > So far, this sounds good to me. > >> Then consider ALTER TABLE partition - should we need to handle it in way >> different from existing inheritance code would do - >> >> * Prevent something whereas regular inheritance wouldn't? >> * Do something instead of/in addition to whatever regular inheritance does? >> >> Consider adding a column to partition - regular inheritance wouldn't >> prevent it because adding a column to child table doesn't get in the way >> of how inheritance works. Now we can make it so that if table is a >> partition created with CRATE TABLE PARTITION OF (which we can tell because >> the table has pg_partition entry), we should *actively* prevent adding a >> column. Inheritance already prevents dropping an inherited attribute in >> child but says "cannot drop inherited column". IMO in a partition's case, >> it should say "cannot drop columns of a partition". > > Agreed. So maybe pg_inherits gets an additional column that indicates > whether this is "regular" inheritance or "partitioning" inheritance, > and in the latter case there are additional restrictions. So far how I have thought it would work is that "partitions" have a pg_partition entry. So, wherever we want to check if an inheritance child is a partition, we just look up the OID in the syscache on pg_partition.partrelid. >> OTOH, adding a column to parent propagates to all partitions by way of >> ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is >> necessary for partitioned tables (which we can tell from relkind or maybe >> because table has pg_partitioned_rel entry) - especially no need to >> propagate column default and check constraint, etc. Because one cannot >> apply insert/update on partitions anyway. > > Hmm. I feel like maybe that should be allowed. If the user wants to > bulk-load a bunch of data and knows that it goes in a particular > partition, why should they have to pay the overhead of determining > that anew for each row? Hm, I thought tuple routing through the parent is the (only) way to ensure that a given partition only contains rows meant for that partition. IOW, there are no CHECK constraints per partition guarding the rows going in. However, there is a provision in the latest patch to "derive" a CHECK constraint by combining the key info and bound info and maybe ExecConstraints() should check those in addition to TupleConstr.has_not_null and TupleConstr.check constraints. Thoughts? Thanks, Amit
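In SQL terms, both points would look something like this (pg_partition and its partrelid column are the proposed catalog; part1 and its bounds are hypothetical, and the derived constraint is shown only as the equivalent expression, not something actually stored):

-- "is this table a partition?" becomes a lookup like:
SELECT EXISTS (SELECT 1 FROM pg_partition
               WHERE partrelid = 'part1'::regclass) AS is_partition;

-- and for a range partition created with FOR VALUES START ('2016-01-01') END ('2017-01-01')
-- on key column ts, the constraint derived from key + bound info is equivalent to:
--   CHECK (ts >= '2016-01-01' AND ts < '2017-01-01')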
Hi,

Thanks for your feedback.

On 2016/02/26 0:43, Jean-Pierre Pelletier wrote:
> Why not based it on "Exclusion Constraint" ?
>
> Most discussions as of late seems to focus on Range overlaps which appeal
> (I would think) is that it supports both "equality" and "overlaps", two
> popular partitioning schemes.
>
> "Equality" as in "value1 = value2" can be implemented with "range overlaps"
> as "range(value1,value) = range(value,value2)".
>
> I would think that Partitioning schemes can be Declarative, Efficient and
> not restricted to Equality and Overlaps as long as all partitions (of a
> partitioned table) are using a single partitioning definition expressed as:
> - An Immutable Expression on tuple columns, in the simplest case a single column
> - An Operator, in the simplest case, "equality"
>
> That seems very close to the semantic of "Constraint Exclusion" as described here:
> http://thoughts.davisjeff.com/2010/09/25/exclusion-constraints-are-generalized-sql-unique/
>
> If partitioning could be based on EC, it would bring these additional benefits:
> - The choice of operator as long as it is boolean, commutative and Indexable
> - The use of Expression/Function and not just bare columns

Note that the proposed patch does more or less what you say we should be doing, minus the "exclusion constraint" part. With the proposed syntax, you can specify expression(s)/column(s) as the partition key along with an "operator class" for the column like below:

CREATE TABLE foo (a type_name) PARTITION BY LIST (a USING opclass_name);
CREATE TABLE bar (a type_name) PARTITION BY RANGE (a USING opclass_name);

Right now, supported partition strategies are RANGE and LIST where "btree operators" suffice. So in the above example, type_name must have a suitable btree operator class defined in the system, which could be opclass_name. If opclass_name was created as the default for type_name, one need not write USING opclass_name.

Then when you create a partition of foo:

CREATE TABLE foo_partition PARTITION OF foo FOR VALUES IN (val1, val2);

the system enforces that foo_partition only contains values such that:

a = ANY ( ARRAY [val1, val2] ),

where the operator "=" refers to an operator belonging to the operator class opclass_name (hence can have a user-defined notion of "equality").

And when you create a partition of bar:

CREATE TABLE bar_partition PARTITION OF bar FOR VALUES [val1, val2);

the system enforces that bar_partition only contains values such that:

val1 <= a < val2,

where operators "<=" and "<" refer to the operators belonging to the operator class opclass_name (hence can have a user-defined notion of ordering).

Further, the system can also optimize queries based on its knowledge of operators appearing in query clauses and the implicit constraints just mentioned above. Specifically, it can exclude partitions using "constraint exclusion", which is NOT the same thing as "exclusion constraints", as you might be aware.
"Exclusion constraints" depend on having suitable a index (just like unique constraint enforcing btree index) that uses the specified operator to enforce the constraint: postgres=# CREATE TABLE circles ( postgres(# c circle, postgres(# EXCLUDE USING gist (c WITH &&) postgres(# ); CREATE TABLE postgres=# \d+ circles Table "public.circles"Column | Type | Modifiers | Storage | Stats target |Description-------+--------+-----------+---------+--------------+-------------c | circle | | plain | | Indexes: "circles_c_excl" EXCLUDE USING gist (c WITH &&) The talk of "non-overlapping partitions" in this thread refers to the invariant that partition DDL should maintain which uses ad-hoc logic to do that but which is based on the semantics of the specified operators. Thanks, Amit
16/02/16 07:46, Amit Langote wrote: > Hi Josh, > > On 2016/02/16 11:41, Josh berkus wrote: >> On 02/15/2016 04:28 PM, Amit Langote wrote: >>> Also, you won't see any optimizer and executor changes. Queries will still >>> use the same plans as existing inheritance-based partitioned tables, >>> although as I mentioned, constraint exclusion won't yet kick in. That will >>> be fixed very shortly. >> We're not going to use CE for the new partitioning long-term, are we? This >> is just the first version, right? > Yes. My approach in previous versions of stuffing major planner changes in > with the syntax patch was not quite proper in retrospect. So, I thought > I'd propose any major planner (and executor) changes later. > > Thanks, > Amit > Hello Amit, Thank you for your work. I'm currently working on extension aimed at planner optimization for partitioned tables (https://github.com/postgrespro/pg_pathman). At this moment I have an implementation of binary search for range partitioned tables with basic partitioning keys (date, timestamp, integers etc). And I'd like to try to combine your syntax and infrastructure with my binary search implementation. There likely will be changes in range syntax and partitions cache structure based on discussion. So looking forward for your next patch. Ildar
Hi Ildar, On 2016/02/29 7:14, Ildar Musin wrote: > 16/02/16 07:46, Amit Langote wrote: >> On 2016/02/16 11:41, Josh berkus wrote: >>> We're not going to use CE for the new partitioning long-term, are we? This >>> is just the first version, right? >> Yes. My approach in previous versions of stuffing major planner changes in >> with the syntax patch was not quite proper in retrospect. So, I thought >> I'd propose any major planner (and executor) changes later. >> >> Thanks, >> Amit >> > Hello Amit, > > Thank you for your work. I'm currently working on extension aimed at > planner optimization for partitioned tables > (https://github.com/postgrespro/pg_pathman). At this moment I have an > implementation of binary search for range partitioned tables with basic > partitioning keys (date, timestamp, integers etc). And I'd like to try to > combine your syntax and infrastructure with my binary search implementation. > There likely will be changes in range syntax and partitions cache > structure based on discussion. So looking forward for your next patch. Sure, thanks! I will look at your extension as well. Thanks, Amit
Hi Corey, Sorry for replying so late. On 2016/02/25 3:31, Corey Huinker wrote: > [ ... ] > So I would assume that we'd use a syntax that presumed the columns were in > a composite range type. > > Which means your creates would look like (following Robert Haas's implied > suggestion that we leave off the string literal quotes): > > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES ( , (b,2) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,2), (b,3) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,3), (b,4) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,4), (c,2) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (c,2), (c,3) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (c,3), (c,4) ); > > That's not terrible looking. So I tried a grammar that looked like the following: range_spec: lb_inc bound ',' bound ub_inc {<create-range-struct>} lb_inc: '[' { $$ = true; } | '(' { $$ = false; } ub_inc: ']' { $$ = true; } | ')' { $$ = false; } bound: a_expr { if (IsA($1, RowExpr)) /* (key1, key2, ...) */ $$ = ((RowExpr) $1)->args; else /* key */ $$ = list_make1($1); } | /* EMPTY */ { $$ = NIL; } Everything seemed to go dandy until I tried FOR VALUES (blah , blah], where psql wouldn't send the command string without accepting the closing parenthesis, :(. So maybe I should try to put the whole thing in '', that is, accept the full range_spec in a string, but then we are back to requiring full-blown range parse function which I was trying to avoid by using the aforementioned grammar. So, I decided to move ahead with the following grammar for time being: START (lower-bound) [ EXCLUSIVE ] | END (upper-bound) [ INCLUSIVE ] | START (lower-bound) [ EXCLUSIVE ] END (upper-bound) [ INCLUSIVE ] Where, *-bound: a_expr | *-bound ',' a_expr Note that in the absence of explicit specification, lower-bound is inclusive and upper-bound is exclusive. So the verbosity couldn't be helped unless accepting range literal in string form and exporting rangetypes.c:range_parse() with range partitioning-specific hacks (consider composite bounds) to parse it are acceptable things. >> IOW, one shouldn't create an unbounded partition if more partitions in the >> unbounded direction are expected to be created. It would be OK for >> unbounded partitions to be on the lower end most of the times. >> > > On this I'll have to disagree. My own use case where I use my > range_partitioning extension starts off with a single partition () and all > new partitions are splits of that. The ranges evolve over time as > partitions grow and slow down. It's nice because we're not trying to > predict where growth will be, we split where growth is. Okay, perhaps I should not presume a certain usage. However, as you know, the usage like yours requires some mechanism of data redistribution (also not without some syntax), which I am not targeting with the initial patch.If that was the only way of creating partitions,matters would be a little easier - you only specify a split point and have some rule about inclusivity around the split point. But we have to start with the usage where each new partition is separately created with explicit partition bound specification that is complete in itself and that's where the logic to check partition invariants may get a bit complicated. > Ok, I'll wait a bit. 
In the mean time I can tell you a bit about the > existing production system I'm hoping to replicate in true partitioning > looks like this: > > Big Master Table: > Range partition by C collated text > Date Range > Date Range > ... > Range partition by C collated text > Date Range > Date Range > ... > ... > > Currently this is accomplished through my range_partitioning module, and > then using pg_partman on those partitions. It works, but it's a lot of > moving parts. > > The machine will be a 32 core AWS box. As per usual with AWS, it will be > have ample memory and CPU, and be somewhat starved for I/O. > > Question: I haven't dove into the code, but I was curious about your tuple > routing algorithm. Is there any way for the algorithm to begin it's scan of > candidate partitions based on the destination of the last row inserted this > statement? I ask because most use cases (that I am aware of) have data that > would naturally cluster in the same partition. No. Actually the tuple-routing function starts afresh for each row. For range partitions, it's binary search over an array of upper bounds. There is no row-to-row state caching in the partition module itself. Thanks, Amit
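Spelled out, the START/END grammar adopted earlier in this message would accept bounds like these (a sketch of the proposed syntax only; the table and its partitions are made-up names):

CREATE TABLE t (c1 text, c2 int) PARTITION BY RANGE ON (c1, c2);

-- defaults: START is inclusive, END is exclusive
CREATE TABLE t_a PARTITION OF t FOR VALUES START ('a', 1) END ('b', 1);
-- overriding the defaults
CREATE TABLE t_b PARTITION OF t FOR VALUES START ('b', 1) END ('c', 1) INCLUSIVE;
-- only one bound given: unbounded in the other direction
CREATE TABLE t_c PARTITION OF t FOR VALUES START ('c', 1) EXCLUSIVE;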
> Sorry for replying so late.
No worries! We have jobs to do aside from this.
> Everything seemed to go dandy until I tried FOR VALUES (blah , blah],
> where psql wouldn't send the command string without accepting the closing
> parenthesis, :(. So maybe I should try to put the whole thing in '', that
> is, accept the full range_spec in a string, but then we are back to
> requiring full-blown range parse function which I was trying to avoid by
> using the aforementioned grammar. So, I decided to move ahead with the
> following grammar for time being:
> START (lower-bound) [ EXCLUSIVE ]
> | END (upper-bound) [ INCLUSIVE ]
> | START (lower-bound) [ EXCLUSIVE ] END (upper-bound) [ INCLUSIVE ]
> Where,
> *-bound: a_expr
> | *-bound ',' a_expr
> Note that in the absence of explicit specification, lower-bound is
> inclusive and upper-bound is exclusive.
Thanks for trying. I agree that it would require a full-blown range parser, and I'm not yet advanced enough to help you with that.
So presently partitions that are unbounded on the lower end aren't possible, but that's a creation syntax issue, not an infrastructure issue. Correct?
> Okay, perhaps I should not presume a certain usage. However, as you know,
> the usage like yours requires some mechanism of data redistribution (also
> not without some syntax), which I am not targeting with the initial patch.
I'm quite fine with limitations in this initial patch, especially if they don't limit what's possible in the future.
> Question: I haven't dove into the code, but I was curious about your tuple
> routing algorithm. Is there any way for the algorithm to begin it's scan of
> candidate partitions based on the destination of the last row inserted this
> statement? I ask because most use cases (that I am aware of) have data that
> would naturally cluster in the same partition.
> No. Actually the tuple-routing function starts afresh for each row. For
> range partitions, it's binary search over an array of upper bounds. There
> is no row-to-row state caching in the partition module itself.
bsearch should be fine, that's what I've used in my own custom partitioning schemes.
Was there a new patch, and if so, is it the one you want me to kick the tires on?
Hi, On 2016/03/09 9:17, Corey Huinker wrote: >> >> Sorry for replying so late. > No worries! We have jobs to do aside from this. Thanks! >> Everything seemed to go dandy until I tried FOR VALUES (blah , blah], >> where psql wouldn't send the command string without accepting the closing >> parenthesis, :(. So maybe I should try to put the whole thing in '', that >> is, accept the full range_spec in a string, but then we are back to >> requiring full-blown range parse function which I was trying to avoid by >> using the aforementioned grammar. So, I decided to move ahead with the >> following grammar for time being: >> >> START (lower-bound) [ EXCLUSIVE ] >> | END (upper-bound) [ INCLUSIVE ] >> | START (lower-bound) [ EXCLUSIVE ] END (upper-bound) [ INCLUSIVE ] >> >> Where, >> >> *-bound: a_expr >> | *-bound ',' a_expr >> >> Note that in the absence of explicit specification, lower-bound is >> inclusive and upper-bound is exclusive. > > Thanks for trying. I agree that it would be a full blown range parser, and > I'm not yet advanced enough to help you with that. > > So presently partitions that are unbounded on the lower end aren't > possible, but that's a creation syntax issue, not an infrastructure issue. > Correct? In case it wasn't apparent, you can create those: FOR VALUES END (upper-bound) [INCLUSIVE] which is equivalent to: FOR VALUES '(, upper-bound)' or FOR VALUES '(, upper-bound]' >>> Question: I haven't dove into the code, but I was curious about your >> tuple >>> routing algorithm. Is there any way for the algorithm to begin it's scan >> of >>> candidate partitions based on the destination of the last row inserted >> this >>> statement? I ask because most use cases (that I am aware of) have data >> that >>> would naturally cluster in the same partition. >> >> No. Actually the tuple-routing function starts afresh for each row. For >> range partitions, it's binary search over an array of upper bounds. There >> is no row-to-row state caching in the partition module itself. >> >> > bsearch should be fine, that's what I've used in my own custom partitioning > schemes. > > Was there a new patch, and if so, is it the one you want me to kick the > tires on? You didn't miss any. I need a little more time to send the next revision with significant design overhaul. Thanks, Amit
> So presently partitions that are unbounded on the lower end aren't
> possible, but that's a creation syntax issue, not an infrastructure issue.
> Correct?
> In case it wasn't apparent, you can create those:
> FOR VALUES END (upper-bound) [INCLUSIVE]
> which is equivalent to:
> FOR VALUES '(, upper-bound)' or FOR VALUES '(, upper-bound]'
Thanks for clarifying. My BNF-fu is weak.
On Mon, Mar 7, 2016 at 8:39 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Which means your creates would look like (following Robert Haas's implied >> suggestion that we leave off the string literal quotes): >> >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES ( , (b,2) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,2), (b,3) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,3), (b,4) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,4), (c,2) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (c,2), (c,3) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (c,3), (c,4) ); >> >> That's not terrible looking. > > So I tried a grammar that looked like the following: > > range_spec: lb_inc bound ',' bound ub_inc {<create-range-struct>} > > lb_inc: '[' { $$ = true; } | '(' { $$ = false; } > ub_inc: ']' { $$ = true; } | ')' { $$ = false; } > > bound: a_expr > { > if (IsA($1, RowExpr)) /* (key1, key2, ...) */ > $$ = ((RowExpr) $1)->args; > else /* key */ > $$ = list_make1($1); > } > | /* EMPTY */ { $$ = NIL; } I think we're converging on a good syntax, but I don't think the choice of nothingness to represent an open range is a good idea, both because it will probably create grammar conflicts now or later and also because it actually is sort of confusing and unintuitive to read given the rest of our syntax. I suggest using UNBOUNDED instead. Also, I don't think allowing an a_expr as a bound is remotely sensible - I think you should allow only Sconst, NumericOnly, NULL, and UNBOUNDED. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
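Under that suggestion, the open ends would be spelled out rather than left empty, along these lines (a sketch only; the syntax is not settled and the partition names are invented, reusing the foo example from earlier in the thread):

CREATE TABLE foo_low  PARTITION OF foo FOR VALUES [ UNBOUNDED, (b,2) );
CREATE TABLE foo_high PARTITION OF foo FOR VALUES [ (c,4), UNBOUNDED );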
On Fri, Feb 26, 2016 at 1:05 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I think setting up N ResultRelInfos in advance where the tuple would only > ever require one might be superfluous. But that may point to some flaw in > my original design or thinking about the case. You have a point. But maybe we should get it working first and optimize it later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> I think we're converging on a good syntax, but I don't think the
> choice of nothingness to represent an open range is a good idea, both
> because it will probably create grammar conflicts now or later and
> also because it actually is sort of confusing and unintuitive to read
> given the rest of our syntax. I suggest using UNBOUNDED instead.
As much as it reminds me of the window syntax I loathe (ROWS BETWEEN UNBOUNDED ....gah), I'm inclined to agree with Robert here.
It also probably helps for code forensics in the sense that it's easier to text search for a something than a nothing.
Hi, Amit!
I tried to apply your patch. It still applies, but has some duplicate oids. After fixing the duplicate oids, I noticed the following warnings during compilation with clang-700.1.81.
scan.c:10308:23: warning: unused variable 'yyg' [-Wunused-variable]
struct yyguts_t * yyg = (struct yyguts_t*)yyscanner; /* This var may be unused depending upon options. */
^
tablecmds.c:12922:6: warning: variable 'is_subpart' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
if (parent != NULL)
^~~~~~~~~~~~~~
tablecmds.c:12931:12: note: uninitialized use occurs here
partKey = is_subpart ? list_nth(rootParent->rd_partkeys, parent_level) :
^~~~~~~~~~
tablecmds.c:12922:2: note: remove the 'if' if its condition is always true
if (parent != NULL)
^~~~~~~~~~~~~~~~~~~
tablecmds.c:12912:19: note: initialize the variable 'is_subpart' to silence this warning
bool is_subpart;
^
= '\0'
tablecmds.c:13375:37: warning: variable 'operoid' is uninitialized when used here [-Wuninitialized]
comp_left_expr = make_opclause(operoid, BOOLOID, false,
^~~~~~~
tablecmds.c:13326:17: note: initialize the variable 'operoid' to silence this warning
Oid operoid;
^
= 0
Regression tests passed cleanly for me. I also examined the code a bit. As I understand it, for DML queries, declarative partitioning works like inheritance. It just provides an alternative way of collecting the append_rel_list.
We're working on the other side of the partitioning problem. Without touching the syntax problem, we're solving performance problems in the pg_pathman extension: https://github.com/postgrespro/pg_pathman. We already have interesting results which you can see in the blog posts [1], [2], [3].
We already have a fast algorithm for partition selection in the optimizer [1] and effective optimization of filter conditions [3]. And we're planning to implement the following features:
- Execute-time partition selection (useful for nested loops and prepared statements);
- Optimization of ordered output from partitioned tables (use Append instead of MergeAppend when possible);
- Optimization of hash join when both tables are partitioned by the join key.
9.6 feature freeze is coming, and we're planning our development resources for 9.7. Besides providing an extension, we would like these features to eventually arrive in core. In order to achieve this we need to port our functionality over to your declarative syntax. At first, we would need some way of caching partition metadata suitable for fast partition selection. For range partitioning it could be a sorted array or an RB-tree of partition bounds. When we have this infrastructure, we can start porting pieces of pg_pathman functionality to declarative partitioning.
So, our draft plan of patches would be the following:
- Implement a partition metadata cache suitable for fast partition selection.
- Fast partition selection using the metadata cache.
- Optimization of filter conditions passed to partitions.
- Execute-time partition selection (useful for nested loops and prepared statements);
- Optimization of ordered output from partitioned tables (use Append instead of MergeAppend when possible);
- Optimization of hash join when both tables are partitioned by the join key.
I'd like to validate that this development plan doesn't overlap with your plans. If our plans are not overlapping then let's accept this plan of work for 9.7.
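For reference, these are the query shapes the last three plan items are aimed at (a sketch; the tables are hypothetical and assumed to be partitioned as stated in the comments):

-- execute-time partition selection: the matching partition is only known at run time
PREPARE q(date) AS SELECT * FROM measurement WHERE logdate = $1;

-- ordered output: each range partition is already ordered by logdate and the ranges
-- don't overlap, so an Append in bound order can replace a MergeAppend
SELECT * FROM measurement ORDER BY logdate LIMIT 100;

-- partition-wise hash join: both sides partitioned on the join key order_id,
-- so matching partitions can be joined pairwise
SELECT o.order_id, sum(l.amount)
FROM orders o JOIN order_lines l ON o.order_id = l.order_id
GROUP BY o.order_id;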
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Alexander, Thanks a lot for taking a look! On Wed, Mar 16, 2016 at 10:56 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > I tried to apply your patch. It still applies, but has some duplicate oids. Actually, a reworked version I will post hopefully early next week will have fixed this. > After fixing duplicate oids, I've noticed following warning during > compilation by clang-700.1.81. > > scan.c:10308:23: warning: unused variable 'yyg' [-Wunused-variable] > struct yyguts_t * yyg = (struct yyguts_t*)yyscanner; /* This var may be > unused depending upon options. */ > ^ > tablecmds.c:12922:6: warning: variable 'is_subpart' is used uninitialized > whenever 'if' condition is false [-Wsometimes-uninitialized] > if (parent != NULL) > ^~~~~~~~~~~~~~ > tablecmds.c:12931:12: note: uninitialized use occurs here > partKey = is_subpart ? list_nth(rootParent->rd_partkeys, > parent_level) : > ^~~~~~~~~~ > tablecmds.c:12922:2: note: remove the 'if' if its condition is always true > if (parent != NULL) > ^~~~~~~~~~~~~~~~~~~ > tablecmds.c:12912:19: note: initialize the variable 'is_subpart' to silence > this warning > bool is_subpart; > ^ > = '\0' > tablecmds.c:13375:37: warning: variable 'operoid' is uninitialized when used > here [-Wuninitialized] > comp_left_expr = > make_opclause(operoid, BOOLOID, false, > > ^~~~~~~ > tablecmds.c:13326:17: note: initialize the variable 'operoid' to silence > this warning > Oid operoid; > ^ > = 0 Oh, I will find and fix these issues if they are still left in the newer version. Thanks for reporting. > Regression tests passed cleanly for me. I also examined code a bit. As I > get, for DML queries, declarative partitioning works like inheritance. It > just provides alternative way for collecting append_rel_list. Right, my latest patch doesn't touch this part at all. > We're working on the other side of partitioning problem. Without solving > syntax problem, we're solving performance problems in pg_pathman extension: > https://github.com/postgrespro/pg_pathman. We already have interesting > results which you can see in blog posts [1], [2], [3]. I have been following the blog posts. Results look promising, :) > We already have fast algorithm for partition selection in optimizer [1] and > effective optimization of filter conditions [3]. And we're planning to > implement following features: > > Execute time partitions selection (useful for nested loops and prepared > statements); > Optimization of ordered output from patitioned tables (use Append instead of > MergeAppend when possible); > Optimization of hash join when both tables are patitioned by join key. > > 9.6 feature freeze in coming, and we're planning our development resources > for 9.7. Besides providing an extension, we would like these features to > eventually arrive to core. In order to achieve this we need to port out > functionality over your declarative syntax. At first, we would need some > way for caching partition metadata suitable for fast partition selection. > For range partition it could be sorted array or RB-tree of partition bounds. > When we have this infrastructure, we can start porting pieces of pg_pathman > functionality to declarative partitiong. I had to think about the internal metadata representation (and its caching) when developing the tuple routing solution. I am hopeful that it is suitable for other executor mechanisms we will build for partitioned tables. 
> So, our draft plan of patches would be following: > > Implement partition metadata cache suitable for fast partition selection. > Fast partition selection using metadata cache. > Optimization of filter conditions passed to partitions. > Execute time partitions selection (useful for nested loops and prepared > statements); > Optimization of ordered output from patitioned tables (use Append instead of > MergeAppend when possible); > Optimization of hash join when both tables are patitioned by join key. > > I'd like to validate that this development plan doesn't overlaps with your > plans. If out plans are not overlapping then let's accept this plan of work > for 9.7. It looks OK to me. Thanks for sharing it. Thanks, Amit
On Wed, Mar 16, 2016 at 5:29 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> 9.6 feature freeze in coming, and we're planning our development resources
> for 9.7. Besides providing an extension, we would like these features to
> eventually arrive to core. In order to achieve this we need to port out
> functionality over your declarative syntax. At first, we would need some
> way for caching partition metadata suitable for fast partition selection.
> For range partition it could be sorted array or RB-tree of partition bounds.
> When we have this infrastructure, we can start porting pieces of pg_pathman
> functionality to declarative partitioning.
I had to think about the internal metadata representation (and its
caching) when developing the tuple routing solution. I am hopeful
that it is suitable for other executor mechanisms we will build for
partitioned tables.
Yes, it appears that I missed it. You already have a sorted array for range partitioning and a binary search implementation. This is good.
I'm a bit worried about list partitioning, because in that case you scan the array sequentially. We could optimize this case for better handling of many list partitions. This is probably not the most common case, though.
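For illustration, here is a minimal, self-contained sketch (not the patch's code) of the kind of optimization being suggested: keep the list-partition values in a sorted array and select a partition by binary search instead of a sequential scan. The struct and function names, and the use of plain ints in place of Datums and OIDs, are assumptions made only for this example.

/*
 * Minimal sketch only: list-partition selection via binary search over a
 * sorted array of list values, instead of a sequential scan.  Plain ints
 * stand in for Datums and partition OIDs.
 */
#include <stdio.h>

typedef struct ListPartMap
{
    int     nvalues;
    int    *values;         /* sorted list-partition values */
    int    *part_index;     /* values[i] belongs to partition part_index[i] */
} ListPartMap;

/* Return the partition index for 'key', or -1 if no partition accepts it. */
static int
list_partition_for(const ListPartMap *map, int key)
{
    int     lo = 0;
    int     hi = map->nvalues - 1;

    while (lo <= hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (map->values[mid] == key)
            return map->part_index[mid];
        else if (map->values[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}

int
main(void)
{
    int         values[] = {10, 20, 30, 40, 50};
    int         parts[]  = {0, 1, 0, 2, 1};     /* value -> owning partition */
    ListPartMap map = {5, values, parts};

    printf("30 -> partition %d\n", list_partition_for(&map, 30));  /* 0 */
    printf("35 -> partition %d\n", list_partition_for(&map, 35));  /* -1 */
    return 0;
}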
> So, our draft plan of patches would be following:
>
> Implement partition metadata cache suitable for fast partition selection.
> Fast partition selection using metadata cache.
> Optimization of filter conditions passed to partitions.
> Execute time partitions selection (useful for nested loops and prepared
> statements);
> Optimization of ordered output from partitioned tables (use Append instead of
> MergeAppend when possible);
> Optimization of hash join when both tables are partitioned by join key.
>
> I'd like to validate that this development plan doesn't overlap with your
> plans. If our plans are not overlapping then let's accept this plan of work
> for 9.7.
It looks OK to me. Thanks for sharing it.
Great! Let's work together.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Mar 16, 2016 at 10:49 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>> > I'd like to validate that this development plan doesn't overlap with
>> > your plans. If our plans are not overlapping then let's accept this
>> > plan of work for 9.7.
>>
>> It looks OK to me. Thanks for sharing it.
>
> Great! Let's work together.

So, the last patch on this thread was posted on February 17th, and the CF entry was marked Waiting on Author on March 2nd. Even if we had a new patch in hand at this point, I don't think there's any real chance of being able to get this done for 9.6; there are too many things left to do here in terms of figuring out syntax and scope, and of course performance testing. Moreover, when this goes in, it's going to open up lots of opportunities for follow-up optimizations that we surely do not have time to follow up on for 9.6. And, as it is, the patch hasn't been updated in over a month and is clearly not in final form as it exists today.

Therefore, I have marked this Returned with Feedback. I look forward to returning to this topic for 9.7, and I'm willing to step up to the plate and review this more aggressively at that time, with an eye toward committing it when we've got it in good shape. But I don't think there's any way to proceed with it for 9.6.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 21 March 2016 at 19:55, Robert Haas <robertmhaas@gmail.com> wrote:
--
On Wed, Mar 16, 2016 at 10:49 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> > I'd like to validate that this development plan doesn't overlap with
>> > your
>> > plans. If our plans are not overlapping then let's accept this plan of
>> > work
>> > for 9.7.
>>
>> It looks OK to me. Thanks for sharing it.
>
>
> Great! Let's work together.
So, the last patch on this thread was posted on February 17th, and the
CF entry was marked Waiting on Author on March 2nd. Even if we had a
new patch in hand at this point, I don't think there's any real chance
of being able to get this done for 9.6; there are too many things left
to do here in terms of figuring out syntax and scope, and of course
performance testing. Moreover, when this goes in, it's going to open
up lots of opportunities for follow-up optimizations that we surely do
not have time to follow up on for 9.6. And, as it is, the patch
hasn't been updated in over a month and is clearly not in final form
as it exists today.
Therefore, I have marked this Returned with Feedback. I look forward
to returning to this topic for 9.7, and I'm willing to step up to the
plate and review this more aggressively at that time, with an eye
toward committing it when we've got it in good shape. But I don't
think there's any way to proceed with it for 9.6.
Good decision.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016/03/22 4:55, Robert Haas wrote:
> So, the last patch on this thread was posted on February 17th, and the
> CF entry was marked Waiting on Author on March 2nd. Even if we had a
> new patch in hand at this point, I don't think there's any real chance
> of being able to get this done for 9.6; there are too many things left
> to do here in terms of figuring out syntax and scope, and of course
> performance testing. Moreover, when this goes in, it's going to open
> up lots of opportunities for follow-up optimizations that we surely do
> not have time to follow up on for 9.6. And, as it is, the patch
> hasn't been updated in over a month and is clearly not in final form
> as it exists today.
>
> Therefore, I have marked this Returned with Feedback. I look forward
> to returning to this topic for 9.7, and I'm willing to step up to the
> plate and review this more aggressively at that time, with an eye
> toward committing it when we've got it in good shape. But I don't
> think there's any way to proceed with it for 9.6.

OK. I agree with the decision. Actually, I was just about to post a patch set today, but I figure it's too late for that. Anyway, I look forward to working on this for 9.7.

Thanks,
Amit
Hi Amit,
Here are some random comments. You said that you were about to post a new patch, so you might have already taken care of these comments. But anyway, here they are.
1. The following patches do not apply cleanly for me:
0001
0003 - conflicts at #include for partition.h in rel.h.
0004 - conflicts in src/backend/catalog/Makefile
0005 - conflicts in src/backend/parser/gram.y
2. The patches use OIDs 3318-3323, which are now already in use. The corresponding objects need new, unused OIDs.
3. In patch 0001-*, you are using indkey instead of partkey in one of the comments and also in the documentation.
4. After all patches are applied I am getting several errors like "error: #error "lock.h may not be included from frontend code"" while building rmgrdesc.c. This seems to be because rel.h includes partition.h, which leads to the inclusion of lock.h in rmgrdesc.c. Are you getting the same error message? It looks like we need a separate header file for declaring the functions that are used at execution time, which is anyway better irrespective of the compiler error.
5. In expand_partitioned_rtentry(), instead of finding all the leaf level relations, it might be better to build the RTEs, RelOptInfo and paths for
intermediate relations as well. This helps in partition pruning. In your introductory write-up you have mentioned this. It might be better if v1 includes this change, so that the partition hierarchy is visible in EXPLAIN output as well.
6. Explain output of scan on a partitioned table shows Append node with individual table scans as sub-plans. May be we should annotate Append node with
the name of the partitioned table to make EXPLAIN output more readable.
7. \d+ output of partitioned table does not show partitioning information, something necessary for V1.
8. Instead of storing partition key column numbers and expressions separately, can we store everything as expression; columns being a single Var node expression? That will avoid constructing Var nodes to lookup in equivalence classes for partition pruning or join optimizations.
10. The code distinguishes between the top level table and its partitions which in turn are partitioned. We should try to minimize this distinction as much as possible so as to use recursive functions for operating on partitions. E.g. each of the partitioned partitions may be labelled RELKIND_PARTITIONED_REL? A 0/NULL parent would distinguish between root partition and child partitions.
On Tue, Mar 22, 2016 at 7:53 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/03/22 4:55, Robert Haas wrote:
> So, the last patch on this thread was posted on February 17th, and the
> CF entry was marked Waiting on Author on March 2nd. Even if we had a
> new patch in hand at this point, I don't think there's any real chance
> of being able to get this done for 9.6; there are too many things left
> to do here in terms of figuring out syntax and scope, and of course
> performance testing. Moreover, when this goes in, it's going to open
> up lots of opportunities for follow-up optimizations that we surely do
> not have time to follow up on for 9.6. And, as it is, the patch
> hasn't been updated in over a month and is clearly not in final form
> as it exists today.
>
> Therefore, I have marked this Returned with Feedback. I look forward
> to returning to this topic for 9.7, and I'm willing to step up to the
> plate and review this more aggressively at that time, with an eye
> toward committing it when we've got it in good shape. But I don't
> think there's any way to proceed with it for 9.6.
OK. I agree with the decision.
Actually, I was just about to post a patch set today, but I figure it's
too late for that. Anyway, I look forward to working on this for 9.7.
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Hi Ashutosh,

On 2016/04/14 21:34, Ashutosh Bapat wrote:
> Hi Amit,
> Here are some random comments. You said that you were about to post a new
> patch, so you might have already taken care of these comments. But anyway,
> here they are.

Thanks a lot for the comments. The patch set changed quite a bit since the last version. Once the CF entry was marked returned with feedback on March 22, I held off sending the new version at all. Perhaps, it would have been OK. Anyway here it is, if you are interested. I will create an entry in CF 2016-09 for the same. Also, see below replies to your individual comments.

> 1. The following patches do not apply cleanly for me:
> 0001
> 0003 - conflicts at #include for partition.h in rel.h.
> 0004 - conflicts in src/backend/catalog/Makefile
> 0005 - conflicts in src/backend/parser/gram.y

These should be fixed, although the attached one is a significantly different patch set.

> 2. The patches use OIDs 3318-3323, which are now already in use. The
> corresponding objects need new, unused OIDs.

Right, fixed.

> 3. In patch 0001-*, you are using indkey instead of partkey in one of the
> comments and also in the documentation.

Fixed.

> 4. After all patches are applied I am getting several errors like "error:
> #error "lock.h may not be included from frontend code"" while building
> rmgrdesc.c. This seems to be because rel.h includes partition.h, which
> leads to the inclusion of lock.h in rmgrdesc.c. Are you getting the same
> error message? It looks like we need a separate header file for declaring
> the functions that are used at execution time, which is anyway better
> irrespective of the compiler error.

Yeah, I too am beginning to feel that dividing partition.h into separate headers would be a good idea in the long term. For the time being, I have managed to get rid of partition.h being #included in rel.h, which solves the issue.

> 5. In expand_partitioned_rtentry(), instead of finding all the leaf level
> relations, it might be better to build the RTEs, RelOptInfo and paths for
> intermediate relations as well. This helps in partition pruning. In your
> introductory write-up you have mentioned this. It might be better if v1
> includes this change, so that the partition hierarchy is visible in EXPLAIN
> output as well.
>
> 6. Explain output of scan on a partitioned table shows Append node with
> individual table scans as sub-plans. May be we should annotate Append node
> with the name of the partitioned table to make EXPLAIN output more readable.

I got rid of all the optimizer changes in the new version (except a line or two). I did that by switching to storing parent-child relationships in pg_inherits so that all the existing optimizer code for inheritance sets works unchanged. Also, the implementation detail that required putting only leaf partitions in the append list is gone. Previously no storage was allocated for partitioned tables (either the root or any of the internal partitions), so it was being done that way.

Regarding 6, it seems to me that it's because Append does not have an associated relid (like scan nodes have with scanrelid). Maybe we need to either fix Append or create some enhanced version of Append which would also support dynamic pruning.

> 7. \d+ output of partitioned table does not show partitioning information,
> something necessary for V1.

This has been fixed in the attached.

> 8. Instead of storing partition key column numbers and expressions
> separately, can we store everything as expression; columns being a single
> Var node expression? That will avoid constructing Var nodes to lookup in
> equivalence classes for partition pruning or join optimizations.

Hmm, let me consider that. FWIW, if you look at my proposed patch (or the description thereof) back in CF 2015-11 [1], you will see that I had modeled matching clauses to the partition key on the lines of how similar processing is done for indexes (as in how indexes are matched with clauses - down all the way to match_clause_to_indexcol()). None of that exists in the current patch set, but when we get to that (the optimizer patch, that is), I perhaps wouldn't do it radically differently. I admit however that I hadn't thought really hard about the join optimization stuff, so I may be missing something.

> 10. The code distinguishes between the top level table and its partitions
> which in turn are partitioned. We should try to minimize this distinction
> as much as possible so as to use recursive functions for operating on
> partitions. E.g. each of the partitioned partitions may be labelled
> RELKIND_PARTITIONED_REL? A 0/NULL parent would distinguish between root
> partition and child partitions.

Exactly how it's done in the attached. Any table that is partitioned is now a RELKIND_PARTITIONED_REL.

Thanks,
Amit

[1] http://www.postgresql.org/message-id/563341AE.5010207@lab.ntt.co.jp
Attachment
With the new set of patches, I am getting following warnings but no compilation failures. Patches apply smoothly.
partition.c:1216:21: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
partition.c:1637:20: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
For creating partition the documentation seems to suggest the syntax
create table t1_p1 partition of t1 for values {start {0} end {100} exclusive;
but following syntax works
create table t1_p1 partition of t1 for values start (0) end (100) exclusive;
> 5. In expand_partitioned_rtentry(), instead of finding all the leaf level
> relations, it might be better to build the RTEs, RelOptInfo and paths for
> intermediate relations as well. This helps in partition pruning. In your
> introductory write-up you have mentioned this. It might be better if v1
> includes this change, so that the partition hierarchy is visible in EXPLAIN
> output as well.
>
> 6. Explain output of scan on a partitioned table shows Append node with
> individual table scans as sub-plans. May be we should annotate Append node
> with
> the name of the partitioned table to make EXPLAIN output more readable.
I got rid of all the optimizer changes in the new version (except a line
or two). I did that by switching to storing parent-child relationships in
pg_inherits so that all the existing optimizer code for inheritance sets
works unchanged. Also, the implementation detail that required to put
only leaf partitions in the append list is also gone. Previously no
storage was allocated for partitioned tables (either root or any of the
internal partitions), so it was being done that way.
With this set of patches we are still flattening all the partitions.
postgres=# create table t1 (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table t1_p1 partition of t1 for values start (0) end (100) exclusive partition by range(b);
CREATE TABLE
postgres=# create table t1_p1_p1 partition of t1_p1 for values start (0) end (100) exclusive;
CREATE TABLE
postgres=# explain verbose select * from t1;
QUERY PLAN
-------------------------------------------------------------------------
Append (cost=0.00..32.60 rows=2262 width=8)
-> Seq Scan on public.t1 (cost=0.00..0.00 rows=1 width=8)
Output: t1.a, t1.b
-> Seq Scan on public.t1_p1 (cost=0.00..0.00 rows=1 width=8)
Output: t1_p1.a, t1_p1.b
-> Seq Scan on public.t1_p1_p1 (cost=0.00..32.60 rows=2260 width=8)
Output: t1_p1_p1.a, t1_p1_p1.b
(7 rows)
Retaining the partition hierarchy would help to push-down join across partition hierarchy effectively.
Regarding 6, it seems to me that because Append does not have a associated
relid (like scan nodes have with scanrelid). Maybe we need to either fix
Append or create some enhanced version of Append which would also support
dynamic pruning.
Right, I think, Append might store the relid of the parent table as well as partitioning information at that level along-with the subplans.
Some more comments:
1. Would it be better to declare PartitionDescData as
{
int nparts;
PartitionInfo *partinfo; /* array of partition information structures. */
}
2. The new syntax allows CREATE TABLE to be specified as partition of an already partitioned table. Is it possible to do the same for CREATE FOREIGN TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.
3. PartitionKeyData contains KeyTypeCollInfo, whose contents can be obtained by calling functions exprType, exprTypemod on partexprs. Why do we need to store that information as a separate member?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
With the new set of patches, I am getting following warnings but no compilation failures. Patches apply smoothly.
partition.c:1216:21: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
partition.c:1637:20: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
For creating partition the documentation seems to suggest the syntax
create table t1_p1 partition of t1 for values {start {0} end {100} exclusive;
but following syntax works
create table t1_p1 partition of t1 for values start (0) end (100) exclusive;
> 5. In expand_partitioned_rtentry(), instead of finding all the leaf level
> relations, it might be better to build the RTEs, RelOptInfo and paths for
> intermediate relations as well. This helps in partition pruning. In your
> introductory write-up you have mentioned this. It might be better if v1
> includes this change, so that the partition hierarchy is visible in EXPLAIN
> output as well.
>
> 6. Explain output of scan on a partitioned table shows Append node with
> individual table scans as sub-plans. May be we should annotate Append node
> with
> the name of the partitioned table to make EXPLAIN output more readable.
I got rid of all the optimizer changes in the new version (except a line
or two). I did that by switching to storing parent-child relationships in
pg_inherits so that all the existing optimizer code for inheritance sets
works unchanged. Also, the implementation detail that required to put
only leaf partitions in the append list is also gone. Previously no
storage was allocated for partitioned tables (either root or any of the
internal partitions), so it was being done that way.
With this set of patches we are still flattening all the partitions.
postgres=# create table t1 (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table t1_p1 partition of t1 for values start (0) end (100) exclusive partition by range(b);
CREATE TABLE
postgres=# create table t1_p1_p1 partition of t1_p1 for values start (0) end (100) exclusive;
CREATE TABLE
postgres=# explain verbose select * from t1;
QUERY PLAN
-------------------------------------------------------------------------
Append (cost=0.00..32.60 rows=2262 width=8)
-> Seq Scan on public.t1 (cost=0.00..0.00 rows=1 width=8)
Output: t1.a, t1.b
-> Seq Scan on public.t1_p1 (cost=0.00..0.00 rows=1 width=8)
Output: t1_p1.a, t1_p1.b
-> Seq Scan on public.t1_p1_p1 (cost=0.00..32.60 rows=2260 width=8)
Output: t1_p1_p1.a, t1_p1_p1.b
(7 rows)
Retaining the partition hierarchy would help to push-down join across partition hierarchy effectively.
Regarding 6, it seems to me that because Append does not have a associated
relid (like scan nodes have with scanrelid). Maybe we need to either fix
Append or create some enhanced version of Append which would also support
dynamic pruning.
Right, I think, Append might store the relid of the parent table as well as partitioning information at that level along-with the subplans.
Some more comments
Would it be better to declare PartitionDescData as
{
int nparts;
PartitionInfo *partinfo; /* array of partition information structures. */
}
The new syntax allows CREATE TABLE to be specified as partition of an already partitioned table. Is it possible to do the same for CREATE FOREIGN TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Apr 15, 2016 at 5:46 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> Retaining the partition hierarchy would help to push-down join across
> partition hierarchy effectively.

-1. You don't get to insert cruft into the final plan for the convenience of the optimizer. I think the AppendPath needs to be annotated with sufficient information to do whatever query planning optimizations we want, and some or all of that may need to carry over to the Append plan to allow run-time partition pruning. But I think that flattening nests of Appends is a good optimization and we should preserve it. If that makes the additional information that any given Append needs to carry a bit more complex, so be it.

I also think it's very good that Amit has kept the query planner unchanged in this initial patch. Let's leave that work to phase two. What I suggest we do when the time comes is invent new nodes RangePartitionMap, ListPartitionMap, HashPartitionMap. Each contains minimal metadata needed for tuple routing or planner transformation. For example, RangePartitionMap can contain an array of partition boundaries - represented as Datums - and an array of mappings, each a Node *. The associated value can be another PartitionMap object if there is subpartitioning in use, or an OID. This can be used both for matching up partitions for join pushdown, and also for fast tuple routing and runtime partition pruning.

> 2. The new syntax allows CREATE TABLE to be specified as partition of an
> already partitioned table. Is it possible to do the same for CREATE FOREIGN
> TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.

+1 for making CREATE FOREIGN TABLE support that also, and in version 1. And same for ATTACH PARTITION.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
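To make the proposal above a little more concrete, here is a rough, self-contained sketch of the kind of structure a RangePartitionMap could be. The field names, the use of plain ints in place of Datums and OIDs, and the linear scan over bounds are illustrative assumptions only; this is not the actual design and not PostgreSQL code.

/*
 * Rough sketch of the RangePartitionMap idea: an array of range boundaries
 * plus, for each range, either a leaf partition id or a nested map for
 * sub-partitioning.  ints stand in for Datums/OIDs.
 */
#include <stdio.h>

typedef struct RangePartitionMap RangePartitionMap;

typedef struct RangePartitionMapping
{
    int                 leaf_oid;   /* valid when submap == NULL */
    RangePartitionMap  *submap;     /* non-NULL when sub-partitioned */
} RangePartitionMapping;

struct RangePartitionMap
{
    int                     nbounds;    /* number of upper bounds */
    int                    *bounds;     /* sorted, exclusive upper bounds */
    RangePartitionMapping  *mapping;    /* nbounds entries */
};

/* Route a key to a leaf partition id, recursing through sub-partition maps. */
static int
route(const RangePartitionMap *map, int key)
{
    for (int i = 0; i < map->nbounds; i++)
    {
        if (key < map->bounds[i])
        {
            const RangePartitionMapping *m = &map->mapping[i];

            return m->submap ? route(m->submap, key) : m->leaf_oid;
        }
    }
    return -1;                          /* no partition covers this key */
}

int
main(void)
{
    int                     sub_bounds[] = {50, 100};
    RangePartitionMapping   sub_map[] = {{1001, NULL}, {1002, NULL}};
    RangePartitionMap       sub = {2, sub_bounds, sub_map};

    int                     top_bounds[] = {100, 200};
    RangePartitionMapping   top_map[] = {{0, &sub}, {2001, NULL}};
    RangePartitionMap       top = {2, top_bounds, top_map};

    printf("75  -> %d\n", route(&top, 75));    /* 1002 */
    printf("150 -> %d\n", route(&top, 150));   /* 2001 */
    return 0;
}

A real implementation would presumably binary-search the bounds array rather than scan it linearly; the linear loop is only to keep the sketch short.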
Hi Ashutosh,

On 2016/04/15 18:48, Ashutosh Bapat wrote:
> With the new set of patches, I am getting following warnings but no
> compilation failures. Patches apply smoothly.
> partition.c:1216:21: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
> partition.c:1637:20: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]

Ah, will fix.

> For creating partition the documentation seems to suggest the syntax
> create table t1_p1 partition of t1 for values {start {0} end {100} exclusive;
> but following syntax works
> create table t1_p1 partition of t1 for values start (0) end (100) exclusive;

Hmm, I see the following in the docs:

... and partition_bound_spec is:

FOR VALUES { list_spec | range_spec }

... list_spec in FOR VALUES is:

{ IN ( expression [, ...] ) }

range_spec in FOR VALUES is:

{ START (lower-bound) [ INCLUSIVE | EXCLUSIVE ] |
  END (upper-bound) [ INCLUSIVE | EXCLUSIVE ] |
  START (lower-bound) [ INCLUSIVE | EXCLUSIVE ] END (upper-bound) [ INCLUSIVE | EXCLUSIVE ] }

By the way, I see that you are always specifying "exclusive" for the end bound in your examples. The specification is optional, if that wasn't apparent. The default (ie, without explicit specification) is [start, end).

>> I got rid of all the optimizer changes in the new version (except a line
>> or two). I did that by switching to storing parent-child relationships in
>> pg_inherits so that all the existing optimizer code for inheritance sets
>> works unchanged. Also, the implementation detail that required putting
>> only leaf partitions in the append list is gone. Previously no storage
>> was allocated for partitioned tables (either the root or any of the
>> internal partitions), so it was being done that way.
>
> With this set of patches we are still flattening all the partitions.
> postgres=# create table t1 (a int, b int) partition by range(a);
> CREATE TABLE
> postgres=# create table t1_p1 partition of t1 for values start (0) end (100) exclusive partition by range(b);
> CREATE TABLE
> postgres=# create table t1_p1_p1 partition of t1_p1 for values start (0) end (100) exclusive;
> CREATE TABLE
> postgres=# explain verbose select * from t1;
> QUERY PLAN
> -------------------------------------------------------------------------
> Append (cost=0.00..32.60 rows=2262 width=8)
> -> Seq Scan on public.t1 (cost=0.00..0.00 rows=1 width=8)
> Output: t1.a, t1.b
> -> Seq Scan on public.t1_p1 (cost=0.00..0.00 rows=1 width=8)
> Output: t1_p1.a, t1_p1.b
> -> Seq Scan on public.t1_p1_p1 (cost=0.00..32.60 rows=2260 width=8)
> Output: t1_p1_p1.a, t1_p1_p1.b
> (7 rows)
> Retaining the partition hierarchy would help to push-down join across
> partition hierarchy effectively.

Yes, it's still a flattened append (basically what the optimizer currently creates for arbitrary inheritance trees). I didn't modify any of that yet.

>> Regarding 6, it seems to me that it's because Append does not have an
>> associated relid (like scan nodes have with scanrelid). Maybe we need to
>> either fix Append or create some enhanced version of Append which would
>> also support dynamic pruning.
>
> Right, I think, Append might store the relid of the parent table as well as
> partitioning information at that level along-with the subplans.

For the time being, I will leave this as yet unaddressed (I am thinking about what is reasonable to do for this also considering Robert's comment). Is that OK?

> Some more comments:
> Would it be better to declare PartitionDescData as
> {
>     int nparts;
>     PartitionInfo *partinfo; /* array of partition information structures. */
> }

I think that might be better. Will do it the way you suggest.

> The new syntax allows CREATE TABLE to be specified as partition of an
> already partitioned table. Is it possible to do the same for CREATE FOREIGN
> TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.

OK, I will address this in the next version. One question though: should foreign tables only be allowed to be leaf partitions (ie, no PARTITION BY clause in CREATE FOREIGN TABLE ... PARTITION OF)? By the way, ATTACH PARTITION only allows leaf partitions anyway.

Thanks a lot for the review!

Thanks,
Amit
Hi,

On 2016/04/15 18:46, Ashutosh Bapat wrote:
>
> 3. PartitionKeyData contains KeyTypeCollInfo, whose contents can be
> obtained by calling functions exprType, exprTypemod on partexprs. Why do we
> need to store that information as a separate member?

There was no KeyTypeCollInfo in the early days of the patch and then I found myself doing a lot of:

partexprs_item = list_head(key->partexprs);
for (attr in key->partattrs)
{
    if (attr->attnum != 0)
    {
        // simple column reference, get type from attr
    }
    else
    {
        // expression, get type using exprType, etc.
        partexprs_item = lnext(partexprs_item);
    }
}

That ended up being quite a few places (though I managed to reduce the number of places over time). So, I created this struct which is initialized when the partition key is built (on first open of the partitioned table).

Thanks,
Amit
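As an illustration of the caching idea discussed here, below is a toy, self-contained sketch: per-key-column type information is derived once when the key is built, so later code only indexes an array instead of repeating the attnum-versus-expression branch each time. The names, fields, and stand-in lookup functions are assumptions made for the example, not the patch's definitions.

/*
 * Toy sketch of "compute the per-column type info once at key-build time".
 * int stands in for Oid throughout.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct PartitionKey
{
    int     partnatts;
    int    *partattrs;      /* attnum per column, or 0 for an expression */
    int    *cached_typid;   /* filled once at key-build time */
} PartitionKey;

/* Stand-ins for catalog / exprType() lookups. */
static int attr_type(int attnum) { return 23; }   /* say, int4 */
static int expr_type(int colno)  { return 25; }   /* say, text */

/* Compute per-column type info once, so callers just index the array. */
static void
cache_key_types(PartitionKey *key)
{
    key->cached_typid = malloc(sizeof(int) * key->partnatts);
    for (int i = 0; i < key->partnatts; i++)
        key->cached_typid[i] = key->partattrs[i] != 0
            ? attr_type(key->partattrs[i])
            : expr_type(i);
}

int
main(void)
{
    int          attrs[] = {1, 0, 3};   /* column, expression, column */
    PartitionKey key = {3, attrs, NULL};

    cache_key_types(&key);
    for (int i = 0; i < key.partnatts; i++)
        printf("key column %d: type %d\n", i, key.cached_typid[i]);
    free(key.cached_typid);
    return 0;
}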
On Fri, Apr 15, 2016 at 10:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Apr 15, 2016 at 5:46 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Retaining the partition hierarchy would help to push-down join across
> partition hierarchy effectively.
-1. You don't get to insert cruft into the final plan for the
convenience of the optimizer. I think the AppendPath needs to be
annotated with sufficient information to do whatever query planning
optimizations we want, and some or all of that may need to carry over
to the Append plan to allow run-time partition pruning. But I think
that flattening nests of Appends is a good optimization and we should
preserve it. If that makes the additional information that any given
Append needs to carry a bit more complex, so be it.
I also think it's very good that Amit has kept the query planner
unchanged in this initial patch. Let's leave that work to phase two.
What I suggest we do when the time comes is invent new nodes
RangePartitionMap, ListPartitionMap, HashPartitionMap. Each contains
minimal metadata needed for tuple routing or planner transformation.
For example, RangePartitionMap can contain an array of partition
boundaries - represented as Datums - and an array of mappings, each a
Node *. The associated value can be another PartitionMap object if
there is subpartitioning in use, or an OID.
range table index instead of OID to make it easy to lookup the
This can be used both for
matching up partitions for join pushdown, and also for fast tuple
routing and runtime partition pruning.
I was thinking about join pushdown (in the partitioning hierarchy) of multiple tables which have a similar partitioning structure for a few upper levels but whose entire partitioning hierarchies do not match. Pushing the join down as far as possible into the partition hierarchy will help. We might end up with joins between a plain table and an Append relation at the leaf level. For such join push-down it looks like we will have to construct the corresponding Append hierarchy, push the joins down and then (maybe) collapse it to just collect the results of the joins at the leaf level. But preparing for that kind of optimization need not be part of this work.
> 2. The new syntax allows CREATE TABLE to be specified as partition of an
> already partitioned table. Is it possible to do the same for CREATE FOREIGN
> TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.
+1 for making CREATE FOREIGN TABLE support that also, and in version
1. And same for ATTACH PARTITION.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
>> Regarding 6, it seems to me that because Append does not have a associated
>> relid (like scan nodes have with scanrelid). Maybe we need to either fix
>> Append or create some enhanced version of Append which would also support
>> dynamic pruning.
>>
>
> Right, I think, Append might store the relid of the parent table as well as
> partitioning information at that level along-with the subplans.
For time being, I will leave this as yet unaddressed (I am thinking about
what is reasonable to do for this also considering Robert's comment). Is
that OK?
Right now EXPLAIN of select * from t1, where t1 is partitioned table shows
Append
-> Seq Scan on t1
-> Seq scan on partition 1
-> seq scan on partition 2
...
...
which shows t1 as well as all the partitions on the same level. Users might have accepted that for inheritance hierarchy but for partitioning that can be confusing, esp. when all the error messages and documentation indicate that t1 is an empty (shell?) table. Instead showing it like
Append for t1 -- (essentially show that this is Append for partitioned table t1, exact text might vary)
-> Seq scan on partition 1
-> ....
-> ....
would be more readable. Similarly, if we are going to collapse all of the Append hierarchy, it might get even more confusing. Listing all the intermediate partitioned tables as Seq Scans on them would be confusing for the reasons mentioned above, and not listing them might make the user wonder about the reasons for their disappearance. I am not sure what the solution is there.
> Some more comments
> Would it be better to declare PartitionDescData as
> {
> int nparts;
> PartitionInfo *partinfo; /* array of partition information structures. */
> }
I think that might be better. Will do it the way you suggest.
> The new syntax allows CREATE TABLE to be specified as partition of an
> already partitioned table. Is it possible to do the same for CREATE FOREIGN
> TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.
OK, I will address this in the next version. One question though: should
foreign table be only allowed to be leaf partitions (ie, no PARTITION BY
clause in CREATE FOREIGN TABLE ... PARTITION OF)?
That seems a better way. Otherwise users might wonder whether we keep the partitions of a foreign table on the foreign server which won't be true. But then we allow foreign tables to have local tables as children in inheritance, so somebody from that background might find it incompatible. I think we shouldn't let the connection between partitioning and inheritance linger longer than necessary.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/04/15 18:46, Ashutosh Bapat wrote:
>
> 3. PartitionKeyData contains KeyTypeCollInfo, whose contents can be
> obtained by calling functions exprType, exprTypemod on partexprs. Why do we
> need to store that information as a separate member?
There was no KeyTypeCollInfo in early days of the patch and then I found
myself doing a lot of:
partexprs_item = list_head(key->partexprs);
for (attr in key->partattrs)
{
if (attr->attnum != 0)
{
// simple column reference, get type from attr
}
else
{
// expression, get type using exprType, etc.
partexprs_item = lnext(partexprs_item);
}
}
At least the two loops can be flattened into a single loop if we keep only an expressions list, with plain attributes being just Var nodes. exprType() etc. would then work seamlessly. A toy sketch of that alternative follows below.
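For illustration, here is a small, self-contained sketch of that single-list alternative: every key column is a node in one list (a Var-like node for a plain column, some other expression node otherwise), so a single exprType()-style accessor and a single loop suffice. All names below are assumptions made for the example, not actual PostgreSQL node definitions.

/*
 * Toy sketch: one list of nodes for all key columns, and one accessor that
 * works for both plain columns and expressions.
 */
#include <stdio.h>

typedef enum { T_Var, T_FuncExpr } NodeTag;

typedef struct Node
{
    NodeTag         tag;
    int             typid;      /* result type; int stands in for Oid */
    struct Node    *next;       /* next key column */
} Node;

/* One accessor works for Var and non-Var alike, like exprType() would. */
static int
expr_type(const Node *node)
{
    return node->typid;
}

int
main(void)
{
    Node    f = {T_FuncExpr, 25, NULL};     /* e.g. lower(left(name, 2)) */
    Node    v = {T_Var, 23, &f};            /* e.g. a plain int column   */

    for (const Node *col = &v; col != NULL; col = col->next)
        printf("%s key column: type %d\n",
               col->tag == T_Var ? "plain" : "expression", expr_type(col));
    return 0;
}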
That ended up being quite a few places (though I managed to reduce the
number of places over time). So, I created this struct which is
initialized when partition key is built (on first open of the partitioned
table).
Hmm, I am just afraid that we might end up with some code using cached information and some using exprType, exprTypmod etc.
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/04/18 15:38, Ashutosh Bapat wrote:
>> There was no KeyTypeCollInfo in the early days of the patch and then I
>> found myself doing a lot of:
>>
>> partexprs_item = list_head(key->partexprs);
>> for (attr in key->partattrs)
>> {
>>     if (attr->attnum != 0)
>>     {
>>         // simple column reference, get type from attr
>>     }
>>     else
>>     {
>>         // expression, get type using exprType, etc.
>>         partexprs_item = lnext(partexprs_item);
>>     }
>> }
>
> At least the two loops can be flattened into a single loop if we keep only
> an expressions list, with plain attributes being just Var nodes. exprType()
> etc. would then work seamlessly.

I didn't say anything about your suggestion to use a Node * list as a representation for the cached partition key information. IIUC, you mean that instead of the AttrNumber[partnatts] array with a non-zero attnum for a named column slot and 0 for an expressional column slot, we create a Node * list with Var nodes for simple column references and Expr nodes for expressions.

I would mention that the same information is also being used in contexts where having simple attnums may be better (for example, when extracting the key of a tuple slot during tuple routing). Moreover, this is cached information and I thought it may be better to follow the format that other similar information uses (index keys and such). Furthermore, looking at the qual matching code for indexes and the recently introduced foreign key optimization, it seems we will want to use a similar representation within the optimizer for partition keys. IndexOptInfo has int ncolumns and int * indexkeys, and then match_index_to_operand() compares index key attnums with the varattno of Vars in the qual. It's perhaps speculative at the moment because there is not much code wanting to use it yet other than partition DDL and tuple routing, and the cached info seems to work as-is for the latter.

>> That ended up being quite a few places (though I managed to reduce the
>> number of places over time). So, I created this struct which is
>> initialized when the partition key is built (on first open of the
>> partitioned table).
>
> Hmm, I am just afraid that we might end up with some code using cached
> information and some using exprType, exprTypmod etc.

Well, exprType() etc. are used for partition keys in no more than a few places; all the places that need this information always use the cached values. Mostly partitioning DDL stuff so far. Tuple routing considers the collation of individual key columns when comparing an input value with partition bounds. Am I missing something?

Thanks,
Amit
On Mon, Apr 18, 2016 at 1:23 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/04/18 15:38, Ashutosh Bapat wrote:
>> There was no KeyTypeCollInfo in early days of the patch and then I found
>> myself doing a lot of:
>>
>> partexprs_item = list_head(key->partexprs);
>> for (attr in key->partattrs)
>> {
>> if (attr->attnum != 0)
>> {
>> // simple column reference, get type from attr
>> }
>> else
>> {
>> // expression, get type using exprType, etc.
>> partexprs_item = lnext(partexprs_item);
>> }
>> }
>>
>
> At least the two loops can be flattened to a single loop if we keep only
> expressions list with attributes being just Var nodes. exprType() etc.
> would then work seemlessly.
I didn't say anything about your suggestion to use a Node * list as a
representation for the cached partition key information. IIUC, you mean
instead of the AttrNumber[partnatts] array with non-zero attnum for a
named column slot and 0 for a expressional column slot, create a Node *
list with Var nodes for simple column references and Expr nodes for
expressions.
I would mention that the same information is also being used in contexts
where having simple attnums may be better (for example, when extracting
key of a tuple slot during tuple routing). Moreover, this is cached
information and I thought it may be better to follow the format that other
similar information uses (index key and such). Furthermore, looking at
qual matching code for indexes and recently introduced foreign key
optimization, it seems we will want to use a similar representation within
optimizer for partition keys. IndexOptInfo has int ncolumns and int *
indexkeys and then match_index_to_operand() compares index key attnums
with varattno of vars in qual. It's perhaps speculative at the moment
because there is not much code wanting to use it yet other than partition
DDL and tuple-routing and cached info seems to work as-is for the latter.
Ok.
>> That ended up being quite a few places (though I managed to reduce the
>> number of places over time). So, I created this struct which is
>> initialized when partition key is built (on first open of the partitioned
>> table).
>>
>
> Hmm, I am just afraid that we might end up with some code using cached
> information and some using exprType, exprTypmod etc.
Well, you never use exprType(), etc. for partition keys in other than a
few places. All places that do always use the cached values. Mostly
partitioning DDL stuff so far. Tuple routing considers collation of
individual key columns when comparing input value with partition bounds.
I am not worried about the current code. But there will be a lot of code added after version 1. I am worried about that.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/04/18 15:33, Ashutosh Bapat wrote:
>> For the time being, I will leave this as yet unaddressed (I am thinking
>> about what is reasonable to do for this also considering Robert's
>> comment). Is that OK?
>
> Right now EXPLAIN of select * from t1, where t1 is partitioned table shows
> Append
> -> Seq Scan on t1
> -> Seq scan on partition 1
> -> seq scan on partition 2
> ...
> which shows t1 as well as all the partitions on the same level. Users might
> have accepted that for inheritance hierarchy but for partitioning that can
> be confusing, esp. when all the error messages and documentation indicate
> that t1 is an empty (shell?) table. Instead showing it like
> Append for t1 -- (essentially show that this is Append for partitioned
> table t1, exact text might vary)
> -> Seq scan on partition 1
> -> ....
> would be more readable. Similarly, if we are going to collapse all of the
> Append hierarchy, it might get even more confusing. Listing all the
> intermediate partitioned tables as Seq Scans on them would be confusing for
> the reasons mentioned above, and not listing them might make the user wonder
> about the reasons for their disappearance. I am not sure what the solution
> is there.

Yes, things remain confusing with Append plans for partitioned tables. Note that currently if an internal partition doesn't show up, none of its partitions do (given the way CHECK constraints are generated for each table in the partition tree).

>> OK, I will address this in the next version. One question though: should
>> foreign tables only be allowed to be leaf partitions (ie, no PARTITION BY
>> clause in CREATE FOREIGN TABLE ... PARTITION OF)?
>
> That seems a better way. Otherwise users might wonder whether we keep the
> partitions of a foreign table on the foreign server which won't be true.

Quite true. We can't assume anything other than what the FDW for a given foreign table lets us assume about what the remote table looks like. The existing FDW API allows us to treat a foreign table as nothing more than a regular table, at best.

> But then we allow foreign tables to have local tables as children in
> inheritance, so somebody from that background might find it incompatible. I
> think we shouldn't let the connection between partitioning and inheritance
> linger longer than necessary.

I, too, think quite a bit about that last sentence. But I have only considered DDL so far.

Thanks,
Amit
Hi Amit,

On 15.04.2016 07:35, Amit Langote wrote:
> Thanks a lot for the comments. The patch set changed quite a bit since
> the last version. Once the CF entry was marked returned with feedback
> on March 22, I held off sending the new version at all. Perhaps, it
> would have been OK. Anyway here it is, if you are interested. I will
> create an entry in CF 2016-09 for the same. Also, see below replies to
> your individual comments.

Thanks for your new patch! I've tried it and discovered some strange behavior for partitioning by composite key. Here is an example of my setup:

create table test(a int, b int) partition by range (a, b);
create table test_1 partition of test for values start (0, 0) end (100, 100);
create table test_2 partition of test for values start (100, 100) end (200, 200);
create table test_3 partition of test for values start (200, 200) end (300, 300);

It's alright so far. But if we try to insert a record in which attribute 'a' belongs to one partition and attribute 'b' belongs to another, then the record will be inserted into the first one:

insert into test(a, b) values (150, 50);

select tableoid::regclass, * from test;
 tableoid |  a  | b
----------+-----+----
 test_2   | 150 | 50
(1 row)

I think there should be an error because the value for 'b' violates the range constraint for test_2. Now if we query data from 'test' and add the filter b < 100, then the planner will exclude partitions 'test_2' (which actually contains the inserted row) and 'test_3' and return nothing:

select * from test where b < 100;
 a | b
---+---
(0 rows)

explain (costs off) select * from test where b < 100;
        QUERY PLAN
---------------------------
 Append
   ->  Seq Scan on test
         Filter: (b < 100)
   ->  Seq Scan on test_1
         Filter: (b < 100)
(5 rows)

--
Ildar Musin
i.musin@postgrespro.ru
On Tue, Apr 19, 2016 at 4:57 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:
On 15.04.2016 07:35, Amit Langote wrote:Thanks a lot for the comments. The patch set changed quite a bit since the last version. Once the CF entry was marked returned with feedback on March 22, I held off sending the new version at all. Perhaps, it would have been OK. Anyway here it is, if you are interested. I will create an entry in CF 2016-09 for the same. Also, see below replies to you individual comments.
Thanks for your new patch! I've tried it and discovered some strange behavior for partitioning by composite key. Here is an example of my setup:
create table test(a int, b int) partition by range (a, b);
create table test_1 partition of test for values start (0, 0) end (100, 100);
create table test_2 partition of test for values start (100, 100) end (200, 200);
create table test_3 partition of test for values start (200, 200) end (300, 300);
It's alright so far. But if we try to insert record in which attribute 'a' belongs to one partition and attribute 'b' belongs to another then record will be inserted in the first one:
insert into test(a, b) values (150, 50);
select tableoid::regclass, * from test;
tableoid | a | b
----------+-----+----
test_2 | 150 | 50
(1 row)
That's how composite keys work. The first subkey is checked; if it's equal, the second subkey is checked, and so on.
# SELECT (100, 100) < (150, 50), (150, 50) < (200, 200);
?column? | ?column?
----------+----------
t | t
(1 row)
Another question is that it might NOT be what users expect. From the syntax side it very much looks like defining box-like regions for the two keys, which could be a replacement for subpartitioning. But it isn't so.
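To illustrate the row-wise comparison described above, here is a small self-contained sketch showing why (150, 50) falls inside the bounds [(100, 100), (200, 200)) even though 50 lies outside [100, 200) when the columns are taken one by one. Plain ints stand in for Datums and the function names are illustrative only.

/*
 * Row-wise (lexicographic) comparison vs. per-column "box" comparison.
 */
#include <stdio.h>

/* Lexicographic comparison of two n-column keys: -1, 0 or +1. */
static int
cmp_key(const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (a[i] < b[i]) return -1;
        if (a[i] > b[i]) return 1;
    }
    return 0;
}

/* True if lower <= key < upper, all compared row-wise. */
static int
in_range(const int *key, const int *lower, const int *upper, int n)
{
    return cmp_key(key, lower, n) >= 0 && cmp_key(key, upper, n) < 0;
}

int
main(void)
{
    int key[]   = {150, 50};
    int lower[] = {100, 100};   /* bounds of test_2 in the example above */
    int upper[] = {200, 200};

    printf("row-wise:   %s\n",
           in_range(key, lower, upper, 2) ? "in test_2" : "not in test_2");
    /* A per-column ("box") reading would reject it, since 50 < 100. */
    printf("per-column: b = %d is %s [100, 200)\n",
           key[1], key[1] >= 100 && key[1] < 200 ? "inside" : "outside");
    return 0;
}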
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Ildar, Alexander,

On Tue, Apr 19, 2016 at 11:26 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> On Tue, Apr 19, 2016 at 4:57 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:
>>
>> Thanks for your new patch! I've tried it and discovered some strange
>> behavior for partitioning by composite key. Here is an example of my setup:
>>
>> create table test(a int, b int) partition by range (a, b);
>> create table test_1 partition of test for values start (0, 0) end (100, 100);
>> create table test_2 partition of test for values start (100, 100) end (200, 200);
>> create table test_3 partition of test for values start (200, 200) end (300, 300);
>>
>> It's alright so far. But if we try to insert a record in which attribute 'a'
>> belongs to one partition and attribute 'b' belongs to another, then the
>> record will be inserted into the first one:
>>
>> insert into test(a, b) values (150, 50);
>>
>> select tableoid::regclass, * from test;
>>  tableoid |  a  | b
>> ----------+-----+----
>>  test_2   | 150 | 50
>> (1 row)
>
> That's how composite keys work. The first subkey is checked; if it's equal,
> the second subkey is checked, and so on.
>
> # SELECT (100, 100) < (150, 50), (150, 50) < (200, 200);
>  ?column? | ?column?
> ----------+----------
>  t        | t
> (1 row)

Yes.

> Another question is that it might NOT be what users expect. From the syntax
> side it very much looks like defining box-like regions for the two keys,
> which could be a replacement for subpartitioning. But it isn't so.

I need to check why the query with the qual b < 100 behaves the way it does. Something's going wrong there with the constraints (partition predicates) that are being generated internally (as mentioned before, still driven by constraint exclusion using the constraints generated on-the-fly).

As for the composite range partition bounds in Ildar's example, it's as if the second value in the key never determines the fate of a row going into some partition, therefore no constraints should have been generated for column b of the key. I'm afraid that's not the case as per the latest patch. Will fix.

Thanks a lot for trying it out and the report.

Thanks,
Amit
On 2016/04/19 23:52, Amit Langote wrote:
> On Tue, Apr 19, 2016 at 11:26 PM, Alexander Korotkov
>> Another question is that it might NOT be what users expect. From the
>> syntax side it very much looks like defining box-like regions for the two
>> keys, which could be a replacement for subpartitioning. But it isn't so.
>
> I need to check why the query with the qual b < 100 behaves the way it
> does. Something's going wrong there with the constraints (partition
> predicates) that are being generated internally (as mentioned before,
> still driven by constraint exclusion using the constraints generated
> on-the-fly).
>
> As for the composite range partition bounds in Ildar's example, it's
> as if the second value in the key never determines the fate of a row
> going into some partition, therefore no constraints should have been
> generated for column b of the key. I'm afraid that's not the case as
> per the latest patch. Will fix.

The strange behavior that Ildar reported should have been fixed with the attached updated set of patches (v2):

create table test(a int, b int) partition by range (a, b);
create table test_1 partition of test for values start (0, 0) end (100, 100);
create table test_2 partition of test for values start (100, 100) end (200, 200);
create table test_3 partition of test for values start (200, 200) end (300, 300);
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE

insert into test(a, b) values (150, 50);
INSERT 0 1

select * from test where b < 100;
  a  | b
-----+----
 150 | 50
(1 row)

explain (costs off) select * from test where b < 100;
        QUERY PLAN
---------------------------
 Append
   ->  Seq Scan on test
         Filter: (b < 100)
   ->  Seq Scan on test_1
         Filter: (b < 100)
   ->  Seq Scan on test_2
         Filter: (b < 100)
   ->  Seq Scan on test_3
         Filter: (b < 100)
(9 rows)

Multi-column range partitioning seems a bit tricky as far as generating constraints on individual columns using a partition's lower and upper bounds (both composite values) is concerned. I mentally pictured something like the following example scenario:

create table test(a int, b int, c int) partition by range (a, b, c);
create table test_1 partition of test for values start (0, 0, 0) end (0, 2, 0);
create table test_2 partition of test for values start (0, 2, 0) end (0, 3, 0);
create table test_3 partition of test for values start (0, 3, 0) end (0, 4, 0);
create table test_4 partition of test for values start (0, 4, 0) end (1, 0, 0);
create table test_5 partition of test for values start (1, 0, 0) end (1, 2, 0);
create table test_6 partition of test for values start (1, 2, 0) end (1, 3, 0);
create table test_7 partition of test for values start (1, 3, 0) end (1, 4, 0);
create table test_8 partition of test for values start (1, 4, 0) end (2, 0, 0);

It is useful to think of the above as the sequence of ranges [000, 020), [020, 030), [030, 040), [040, 100), [100, 120), [120, 130), [130, 140), [140, 200) for purposes of finding the partition for a row.

Then the constraints generated internally for each partition would be:

test_1: a = 0 AND b >= 0 AND b <= 2
test_2: a = 0 AND b >= 2 AND b <= 3
test_3: a = 0 AND b >= 3 AND b <= 4
test_4: a >= 0 AND a <= 1
test_5: a = 1 AND b >= 0 AND b <= 2
test_6: a = 1 AND b >= 2 AND b <= 3
test_7: a = 1 AND b >= 3 AND b <= 4
test_8: a >= 1 AND a <= 2

I will try further to poke holes in my thinking about this. Please feel free to point out if you find any.

Thanks,
Amit
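As an illustration of the constraint-derivation scheme sketched in this mail (and only of the scheme as described here, not of the patch's actual code), the following small self-contained sketch emits a per-column constraint for a partition given its composite lower and upper bounds: equalities are emitted while the two bounds agree on a column, then a range on the first column where they differ, and the remaining columns are ignored. Plain ints stand in for Datums; all names are assumptions.

/*
 * Sketch: derive a per-column constraint string from composite range bounds.
 */
#include <stdio.h>

static void
print_partition_constraint(const char **cols, const int *lower,
                           const int *upper, int ncols)
{
    for (int i = 0; i < ncols; i++)
    {
        if (i > 0)
            printf(" AND ");
        if (lower[i] == upper[i])
            printf("%s = %d", cols[i], lower[i]);
        else
        {
            printf("%s >= %d AND %s <= %d",
                   cols[i], lower[i], cols[i], upper[i]);
            break;      /* later columns no longer constrain the row */
        }
    }
    printf("\n");
}

int
main(void)
{
    const char *cols[] = {"a", "b", "c"};
    int         lo1[] = {0, 0, 0}, hi1[] = {0, 2, 0};   /* test_1 above */
    int         lo4[] = {0, 4, 0}, hi4[] = {1, 0, 0};   /* test_4 above */

    print_partition_constraint(cols, lo1, hi1, 3);  /* a = 0 AND b >= 0 AND b <= 2 */
    print_partition_constraint(cols, lo4, hi4, 3);  /* a >= 0 AND a <= 1 */
    return 0;
}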
Attachment
Hi Amit,

On 20.04.2016 13:28, Amit Langote wrote:
> The strange behavior that Ildar reported should have been fixed with the
> attached updated set of patches (v2):
>
> create table test(a int, b int) partition by range (a, b);
> create table test_1 partition of test for values start (0, 0) end (100, 100);
> create table test_2 partition of test for values start (100, 100) end (200, 200);
> create table test_3 partition of test for values start (200, 200) end (300, 300);
>
> insert into test(a, b) values (150, 50);
> INSERT 0 1
>
> select * from test where b < 100;
>   a  | b
> -----+----
>  150 | 50
> (1 row)
>
> Multi-column range partitioning seems a bit tricky as far as generating
> constraints on individual columns using a partition's lower and upper
> bounds (both composite values) is concerned. [...]
>
> I will try further to poke holes in my thinking about this. Please feel
> free to point out if you find any.
>
> Thanks,
> Amit

Thanks for the clarification! I tried the updated patch, and now it works correctly.

I encountered another problem that concerns expressions as the partitioning key. Probably there is still some work in progress. But if it helps, here is my case:

create table inh(a int, b int) partition by range ((a+b));
create table inh_1 partition of inh for values start (0) end (10);
create table inh_2 partition of inh for values start (10) end (20);

Then if we run any SELECT query it crashes postgres:

select * from inh;

The crash occurs in get_check_expr_from_partbound(). It seems that the function is not yet expecting an expression key and is designed to handle only simple attribute keys. Backtrace:

#0  strlen () at ../sysdeps/x86_64/strlen.S:106
#1  0x00000000004add8a in hashname (fcinfo=0x7ffdbdb9c760) at hashfunc.c:145
#2  0x000000000099cc08 in DirectFunctionCall1Coll (func=0x4add66 <hashname>, collation=0, arg1=0) at fmgr.c:1027
#3  0x00000000009724dd in CatalogCacheComputeHashValue (cache=0x26590b0, nkeys=2, cur_skey=0x7ffdbdb9cbf0) at catcache.c:207
#4  0x0000000000974979 in SearchCatCache (cache=0x26590b0, v1=32807, v2=0, v3=0, v4=0) at catcache.c:1151
#5  0x0000000000988e35 in SearchSysCache (cacheId=6, key1=32807, key2=0, key3=0, key4=0) at syscache.c:1006
#6  0x0000000000988fe3 in SearchSysCacheAttName (relid=32807, attname=0x0) at syscache.c:1106
#7  0x000000000098a744 in get_attnum (relid=32807, attname=0x0) at lsyscache.c:825
#8  0x000000000056afd2 in get_check_expr_from_partbound (rel=0x7f868601ca20, parent=0x7f868601b770, bound=0x26e6ac8) at partition.c:1427
#9  0x000000000056bc9e in generate_partition_check_expr (rel=0x7f868601ca20) at partition.c:1788
#10 0x000000000056bb5f in RelationGetPartitionCheckExpr (rel=0x7f868601ca20) at partition.c:1746
#11 0x0000000000782b5f in get_relation_constraints (root=0x268f1b8, relationObjectId=32807, rel=0x26e5cd8, include_notnull=1 '\001') at plancat.c:1209
#12 0x0000000000782d74 in relation_excluded_by_constraints (root=0x268f1b8, rel=0x26e5cd8, rte=0x268ebf0) at plancat.c:1302
#13 0x000000000072a18d in set_append_rel_size (root=0x268f1b8, rel=0x26e5690, rti=1, rte=0x268ea80) at allpaths.c:947
...

--
Ildar Musin
i.musin@postgrespro.ru
Hi Ildar, On Wed, Apr 20, 2016 at 11:46 PM, Ildar Musin <i.musin@postgrespro.ru> wrote: > Thanks for clarification! I tried the updated patch, now it works correctly. Great, thanks! > I encountered another problem that concerns expressions as partitioning key. > Probably there is still some work in progress. But if it will help here is > my case: > > create table inh(a int, b int) partition by range ((a+b)); > create table inh_1 partition of inh for values start (0) end (10); > create table inh_2 partition of inh for values start (10) end (20); > > Then if we run any SELECT query it crashes postgres: > > select * from inh; > > Crash occurs in get_check_expr_from_partbound(). It seems that function is > not yet expecting an expression key and designed to handle only simple > attributes keys. Backtrace: You're right, silly mistake. :-( Will fix Thanks, Amit
Hi Ildar, On 2016/04/21 1:06, Amit Langote wrote: > On Wed, Apr 20, 2016 at 11:46 PM, Ildar Musin <i.musin@postgrespro.ru> wrote: >> Crash occurs in get_check_expr_from_partbound(). It seems that function is >> not yet expecting an expression key and designed to handle only simple >> attributes keys. Backtrace: > > You're right, silly mistake. :-( > > Will fix Attached updated version fixes this. I'll take time to send the next version but I'd very much appreciate it if you keep reporting anything that doesn't look/work right like you did so far. Thanks, Amit
Attachment
On 2016-04-15 04:35, Amit Langote wrote: A quick test with: > 0001-Add-syntax-to-specify-partition-key-v3.patch > 0002-Infrastructure-for-creation-of-partitioned-tables-v3.patch > 0003-Add-syntax-to-create-partitions-v3.patch > 0004-Infrastructure-for-partition-metadata-storage-and-ma-v3.patch > 0005-Introduce-tuple-routing-for-partitioned-tables-v3.patch patches apply, build and make check ok. There is somwthing wrong with indexes on child tables (and only with higher rowcounts). Surely the below code should give 6 rows; it actually does return 6 rows without the indexes. With indexes it returns 0 rows. (but when doing the same test with low rowcounts, things are OK.) thanks, Erik Rijkers (Linux Centos 6.6) --------------------------------------- create table inh(a int, b int) partition by range ((a+b)); create table inh_1 partition of inh for values start ( 0) end ( 10000); create table inh_2 partition of inh for values start ( 10000) end ( 20000); create table inh_3 partition of inh for values start ( 20000) end ( 100000); create index inh_1_a_idx on inh_1 (a); create index inh_2_a_idx on inh_2 (a); create index inh_3_a_idx on inh_3 (a); insert into inh select i, i as j from generate_series(1, 10000) as f(i); analyze inh_1; analyze inh_2; analyze inh_3; select 'inh' , count(*) from inh union all select 'inh_1', count(*) from inh_1 union all select 'inh_2', count(*) from inh_2 union all select 'inh_3', count(*) from inh_3 ; explain analyze select * from inh where a between 10110 and 10115; --------------------------------------- # output : create table inh(a int, b int) partition by range ((a+b)); create table inh_1 partition of inh for values start ( 0) end ( 10000); create index inh_1_a_idx on inh_1 (a); create table inh_2 partition of inh for values start ( 10000) end ( 20000); create index inh_2_a_idx on inh_2 (a); create table inh_3 partition of inh for values start ( 20000) end ( 100000); create index inh_3_a_idx on inh_3 (a); insert into inh select i, i as j from generate_series(1, 10000) as f(i); analyze inh_1; analyze inh_2; analyze inh_3; select 'inh' , count(*) from inh union all select 'inh_1', count(*) from inh_1 union all select 'inh_2', count(*) from inh_2 union all select 'inh_3', count(*) from inh_3 ; ?column? | count ----------+------- inh | 10000 inh_1 | 4999 inh_2 | 5000 inh_3 | 1 (4 rows) explain analyze select * from inh where a between 10110 and 10115; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Append (cost=0.00..17.37 rows=4 width=8) (actual time=0.023..0.023 rows=0 loops=1) -> Seq Scan on inh (cost=0.00..0.00 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1) Filter: ((a >= 10110) AND (a <= 10115)) -> Index Scan using inh_1_a_idx on inh_1 (cost=0.16..8.18 rows=1 width=8) (actual time=0.007..0.007 rows=0 loops=1) Index Cond: ((a >= 10110) AND (a <= 10115)) -> Index Scan usinginh_2_a_idx on inh_2 (cost=0.16..8.18 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=1) Index Cond: ((a >= 10110) AND (a <= 10115)) -> Seq Scan oninh_3 (cost=0.00..1.01 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=1) Filter: ((a >= 10110) AND (a <= 10115)) Rows Removed by Filter: 1 Planningtime: 0.858 ms Execution time: 0.093 ms (12 rows)
Hi Amit,
I am trying multi-column/expression partitions.
create table t1_multi_col (a int, b int) partition by range (a, b);
create table t1_mc_p1 partition of t1_multi_col for values start (1, 200) end (100, 300);
create table t1_mc_p2 partition of t1_multi_col for values start (200, 1) end (300, 100);
insert into t1_multi_col values (1, 250);
insert into t1_multi_col values (250, 1);
insert into t1_multi_col values (100, 100);
select tableoid::regclass, * from t1_multi_col;
 tableoid |  a  |  b
----------+-----+-----
 t1_mc_p1 |   1 | 250
 t1_mc_p1 | 100 | 100
 t1_mc_p2 | 250 |   1
The row (100, 100) landed in t1_mc_p1, which has partition bounds (1, 200) and (100, 300) and so should not accept a row with b = 100. It looks like the binary search got confused with the reversed order of ranges (should that be allowed?)
The semantics of multiple columns for ranges (and maybe lists as well) look confusing. The current scheme doesn't allow an overlapping range for one of the partitioning key columns even if the combined range is non-overlapping.
create table t1_multi_col (a int, b int) partition by range (a, b);
create table t1_mc_p1 partition of t1_multi_col for values start (1, 100) end (100, 200);
create table t1_mc_p2 partition of t1_multi_col for values start (1, 200) end (100, 300);
ERROR: new partition's range overlaps with that of partition "t1_mc_p1" of "t1_multi_col"
HINT: Please specify a range that does not overlap with any existing partition's range.
create table t1_mc_p2 partition of t1_multi_col for values start (1, 300) end (100, 400);
ERROR: new partition's range overlaps with that of partition "t1_mc_p1" of "t1_multi_col"
HINT: Please specify a range that does not overlap with any existing partition's range.
That would be better realised using subpartitioning on b, as sketched below. The question is, if one column's value is enough to identify the partition (since partitions cannot contain overlapping values for that column), why do we need multiple columns/expressions as partition keys? IIUC, all the other column does is disallow a certain range of values for that column, which could be done better with a CHECK constraint. It looks like Oracle considers the combined range and not just one column.
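A sketch of the subpartitioned layout I have in mind, reusing the same columns (the partition names and bounds here are only illustrative):

create table t1_sub (a int, b int) partition by range (a);
create table t1_sub_a1 partition of t1_sub for values start (1) end (100) partition by range (b);
create table t1_sub_a1_b1 partition of t1_sub_a1 for values start (100) end (200);
create table t1_sub_a1_b2 partition of t1_sub_a1 for values start (200) end (300);

Here each combination of ranges for a and b gets its own leaf partition, without relying on composite-value comparison.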
On Thu, Apr 21, 2016 at 7:35 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi Ildar,
On 2016/04/21 1:06, Amit Langote wrote:
> On Wed, Apr 20, 2016 at 11:46 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:
>> Crash occurs in get_check_expr_from_partbound(). It seems that function is
>> not yet expecting an expression key and designed to handle only simple
>> attributes keys. Backtrace:
>
> You're right, silly mistake. :-(
>
> Will fix
Attached updated version fixes this. I'll take time to send the next
version but I'd very much appreciate it if you keep reporting anything
that doesn't look/work right like you did so far.
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Hi Ashutosh, On 2016/05/09 20:21, Ashutosh Bapat wrote: > Hi Amit, > I am trying multi-column/expression partitions. Thanks for the tests. > create table t1_multi_col (a int, b int) partition by range (a, b); > create table t1_mc_p1 partition of t1_multi_col for values start (1, 200) > end (100, 300); > create table t1_mc_p2 partition of t1_multi_col for values start (200, 1) > end (300, 100); > insert into t1_multi_col values (1, 250); > insert into t1_multi_col values (250, 1); > insert into t1_multi_col values (100, 100); > select tableoid::regclass, * from t1_multi_col; > tableoid | a | b > ----------+-----+----- > t1_mc_p1 | 1 | 250 > t1_mc_p1 | 100 | 100 > t1_mc_p2 | 250 | 1 > The row (100, 100) landed in t1_mc_p1 which has partition bounds as (1, > 200) and (100, 300) which should not accept a row with b = 100. It looks > like the binary search got confused with the reversed order of ranges > (should that be allowed?) It's useful to think of multi-column key as defining ranges of composite values (tuples) instead of thinking in terms of ranges of values of individual columns. That's how a row's partition key is compared against individual partition bounds until a suitable partition is found (with binary search that is), which uses record comparison logic as shown below: postgres=# select (1, 200) <= (100, 100) AND (100, 100) < (100, 300);?column? ----------t (1 row) Which means the row (100, 100) belongs in the partition with the start bound (1, 200) and the end bound (100, 300). Just like in composite value case, comparison stops at some leading column that returns != 0 result. So, in comparison (1, 200) <= (100, 100), the second column plays no role. > Symantec of multiple columns for ranges (may be list as well) looks Note that list partition key does not support multiple columns. > confusing. The current scheme doesn't allow overlapping range for one of > the partitioning keys even if the combined range is non-overlapping. > create table t1_multi_col (a int, b int) partition by range (a, b); > create table t1_mc_p1 partition of t1_multi_col for values start (1, 100) > end (100, 200); > create table t1_mc_p2 partition of t1_multi_col for values start (1, 200) > end (100, 300); > ERROR: new partition's range overlaps with that of partition "t1_mc_p1" of > "t1_multi_col" > HINT: Please specify a range that does not overlap with any existing > partition's range. > create table t1_mc_p2 partition of t1_multi_col for values start (1, 300) > end (100, 400); > ERROR: new partition's range overlaps with that of partition "t1_mc_p1" of > "t1_multi_col" > HINT: Please specify a range that does not overlap with any existing > partition's range. Ranges [ (1, 100), (100, 200) ) and [ (1, 200), (100, 300) ) do overlap: postgres=# select (1, 100) <= (1, 200) AND (1, 200) < (100, 200);?column? ----------t (1 row) That is, (1, 200) is both the start element of the 2nd partition's range and is contained in the first partition's range as illustrated above. > That should be better realised using subpartitioning on b. The question is, > if one column's value is enough to identify partition (since they can not > contain overlapping values for that column), why do we need mutliple > columns/expressions as partition keys? IIUC, all the other column does is > to disallow certain range of values for that column, which can better be > done by a CHECK constraint. It looks like Oracle looks at combined range > and not just one column. 
A more familiar example I have seen around the web illustrating multi-column range partitioning is for something like (year, month, day) triple. Consider the following example: create table parted(year int, month int, day int) partition by range (year, month, day); create table part201605week1 partition of parted for values start (2016, 5, 1) end (2016, 5, 8); create table part201605week2 partition of parted for values start (2016, 5, 8) end (2016, 5, 15); create table part201605week3 partition of parted for values start (2016, 5, 15) end (2016, 5, 22); create table part201605week4 partition of parted for values start (2016, 5, 22) end (2016, 5, 29); create table part201605week5 partition of parted for values start (2016, 5, 29) end (2016, 5, 31) inclusive; create table part201606week1 partition of parted for values start (2016, 6, 1) end (2016, 6, 8); create table part201606week2 partition of parted for values start (2016, 6, 8) end (2016, 6, 15); create table part201606week3 partition of parted for values start (2016, 6, 15) end (2016, 6, 22); create table part201606week4 partition of parted for values start (2016, 6, 2) end (2016, 6, 29); create table part201606week4 partition of parted for values start (2016, 6, 22) end (2016, 6, 29); create table part201606week5 partition of parted for values start (2016, 6, 29) end (2016, 6, 30) inclusive; explain (costs off) select * from parted where day between 4 and 10; QUERY PLAN ----------------------------------------------Append -> Seq Scan on parted Filter: ((day >= 4) AND (day <= 10)) -> Seq Scan on part201605week1 Filter: ((day >= 4) AND (day <= 10)) -> Seq Scan on part201605week2 Filter: ((day >= 4) AND (day <= 10)) -> Seq Scan on part201606week1 Filter: ((day >= 4) AND (day <= 10)) -> SeqScan on part201606week2 Filter: ((day >= 4) AND (day <= 10)) (11 rows) explain (costs off) select * from parted where day between 4 and 10 and month = 5; QUERY PLAN --------------------------------------------------------------Append -> Seq Scan on parted Filter: ((day >= 4) AND(day <= 10) AND (month = 5)) -> Seq Scan on part201605week1 Filter: ((day >= 4) AND (day <= 10) AND (month =5)) -> Seq Scan on part201605week2 Filter: ((day >= 4) AND (day <= 10) AND (month = 5)) (7 rows) explain (costs off) select * from parted where month = 6; QUERY PLAN -----------------------------------Append -> Seq Scan on parted Filter: (month = 6) -> Seq Scan on part201606week1 Filter: (month = 6) -> Seq Scan on part201606week2 Filter: (month = 6) -> Seq Scan on part201606week3 Filter: (month = 6) -> Seq Scan on part201606week4 Filter: (month = 6) -> Seq Scan on part201606week5 Filter: (month = 6) (13 rows) Thanks, Amit
Hi Erik, On 2016/04/26 17:46, Erik Rijkers wrote: > On 2016-04-15 04:35, Amit Langote wrote: > > A quick test with: > >> 0001-Add-syntax-to-specify-partition-key-v3.patch >> 0002-Infrastructure-for-creation-of-partitioned-tables-v3.patch >> 0003-Add-syntax-to-create-partitions-v3.patch >> 0004-Infrastructure-for-partition-metadata-storage-and-ma-v3.patch >> 0005-Introduce-tuple-routing-for-partitioned-tables-v3.patch > > patches apply, build and make check ok. Thanks for testing! > There is somwthing wrong with indexes on child tables (and only with > higher rowcounts). There was an oversight in patch 0005 that caused partition indexes to not be opened and tuples inserted into. Attached new version should have fixed it. > Surely the below code should give 6 rows; it actually does return 6 rows > without the indexes. > With indexes it returns 0 rows. > > (but when doing the same test with low rowcounts, things are OK.) ... > --------------------------------------- > create table inh(a int, b int) partition by range ((a+b)); > create table inh_1 partition of inh for values start ( 0) end ( 10000); > create table inh_2 partition of inh for values start ( 10000) end ( 20000); > create table inh_3 partition of inh for values start ( 20000) end ( 100000); > > create index inh_1_a_idx on inh_1 (a); > create index inh_2_a_idx on inh_2 (a); > create index inh_3_a_idx on inh_3 (a); > > insert into inh select i, i as j from generate_series(1, 10000) as f(i); > > analyze inh_1; > analyze inh_2; > analyze inh_3; > > select 'inh' , count(*) from inh > union all select 'inh_1', count(*) from inh_1 > union all select 'inh_2', count(*) from inh_2 > union all select 'inh_3', count(*) from inh_3 > ; > > explain analyze select * from inh where a between 10110 and 10115; Hmm, this last query should return 0 rows because: select max(a) from inh; max =------ 10000 (1 row) Did you by any chance mean to write the following: explain analyze select * from inh where a + b between 10110 and 10115; In which case: explain analyze select * from inh where a + b between 10110 and 10115; QUERY PLAN =------------------------------------------------------------------------------------------------------- Append (cost=0.00..123.00 rows=26 width=8) (actual time=0.119..6.407 rows=3 loops=1) -> Seq Scan on inh (cost=0.00..0.00 rows=1 width=8) (actual time=0.015..0.015 rows=0 loops=1) Filter: (((a + b) >= 10110) AND ((a + b) <= 10115)) -> Seq Scan on inh_2 (cost=0.00..123.00 rows=25 width=8) (actual time=0.076..6.198 rows=3 loops=1) Filter: (((a + b) >= 10110) AND ((a + b) <= 10115)) Rows Removed by Filter: 4997 Planning time: 0.521 ms Execution time: 6.572 ms (8 rows) select * from inh where a + b between 10110 and 10115; a | b =-----+------ 5055 | 5055 5056 | 5056 5057 | 5057 (3 rows) Now that doesn't use index for the obvious reason (mismatched key). 
So, let's try one which will: explain analyze select * from inh where a = 4567; QUERY PLAN =------------------------------------------------------------------------------------------------------------------------ Append (cost=0.00..17.61 rows=4 width=8) (actual time=0.189..0.293 rows=1 loops=1) -> Seq Scan on inh (cost=0.00..0.00 rows=1 width=8) (actual time=0.016..0.016 rows=0 loops=1) Filter: (a = 4567) -> Index Scan using inh_1_a_idx on inh_1 (cost=0.28..8.30 rows=1 width=8) (actual time=0.043..0.056 rows=1 loops=1) Index Cond: (a = 4567) -> Index Scan using inh_2_a_idx on inh_2 (cost=0.28..8.30 rows=1 width=8) (actual time=0.024..0.024 rows=0 loops=1) Index Cond: (a = 4567) -> Seq Scan on inh_3 (cost=0.00..1.01 rows=1 width=8) (actual time=0.029..0.029 rows=0 loops=1) Filter: (a = 4567) Rows Removed by Filter: 1 Planning time: 0.589 ms Execution time: 0.433 ms select * from inh where a = 4567; a | b =-----+------ 4567 | 4567 (1 row) No pruning occurs this time for the obvious reason (mismatched key). Does that help clarify? Thanks, Amit
Attachment
Hello Amit, In the example >create table part201606week4 partition of parted >for values start (2016, 6, 2) end (2016, 6, 29); seems to be a typo regards Sameer -- View this message in context: http://postgresql.nabble.com/Declarative-partitioning-tp5862462p5903204.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Hi, On 2016/05/12 17:42, Sameer Thakur-2 wrote: > Hello Amit, > In the example >> create table part201606week4 partition of parted >> for values start (2016, 6, 2) end (2016, 6, 29); > > seems to be a typo Oops, there indeed is. create table part201606week4 partition of parted for values start (2016, 6, 2) end (2016, 6, 29); create table part201606week4 partition of parted for values start (2016, 6, 22) end (2016, 6, 29); While copy-pasting part201606week4's definition from a previous line, ended up pasting twice and then edited the second one to the correct definition. Thanks, Amit
Hi Amit, I'm running some experiments based on your infrastructure, trying to optimize SELECT queries. At some point I need to get the PartitionDesc for a relation, and to do it I'm using the RelationGetPartitionDesc() function. The problem is that this function copies relcache data, which can be quite slow for large numbers (thousands) of partitions. The comment on the function says that we cannot use the relation->rd_partdesc pointer into the relcache because of the possibility of relcache invalidation. Could you please tell me whether relcache invalidation can occur during a SELECT/UPDATE/DELETE query? Thanks! -- Ildar Musin i.musin@postgrespro.ru
Hi Ildar, On 2016/05/16 22:12, Ildar Musin wrote: > Hi Amit, > > I'm running some experiments based on your infrastructure trying to > optimize SELECT queries. At some point I need to get PartitionDesc for > relation and to do it I'm using RelationGetPartitionDesc() function. > Problem is that this function copies relcache data and it can be quite > slow for large amounts (thousands) of partitions. The comment to the > function says that we cannot use relation->rd_partdesc pointer to relcache > because of possibility of relcache invalidation. Could you please tell is > it possible that relcache invalidation occurs during SELECT/UPDATE/DELETE > query? Hmm, I think invalidation would not occur mid-query since it would have acquired a lock on the table. So the copying may be unnecessary though I may be wrong. I will study other similar code and see if that is so. Thanks, Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2016/05/16 22:12, Ildar Musin wrote: >> Could you please tell is >> it possible that relcache invalidation occurs during SELECT/UPDATE/DELETE >> query? > Hmm, I think invalidation would not occur mid-query since it would have > acquired a lock on the table. This is incorrect: invalidation can occur anyway (for example, if autoanalyze updates the relation's statistics). If you are holding a lock, you can expect that the relation's schema will not change more than your lock would allow --- but a cache flush and rebuild could happen underneath you, so keeping a pointer to any subsidiary relcache data structure is very dangerous. The two ways that we've dealt with this type of hazard are to copy data out of the relcache before using it; or to give the relcache the responsibility of not moving a particular portion of data if it did not change. From memory, the latter applies to the tuple descriptor and trigger data, but we've done most other things the first way. regards, tom lane
On 2016/05/18 2:22, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> On 2016/05/16 22:12, Ildar Musin wrote: >>> Could you please tell is >>> it possible that relcache invalidation occurs during SELECT/UPDATE/DELETE >>> query? > >> Hmm, I think invalidation would not occur mid-query since it would have >> acquired a lock on the table. > > This is incorrect: invalidation can occur anyway (for example, if > autoanalyze updates the relation's statistics). If you are holding > a lock, you can expect that the relation's schema will not change more > than your lock would allow --- but a cache flush and rebuild could happen > underneath you, so keeping a pointer to any subsidiary relcache data > structure is very dangerous. I see. Thanks for clarifying. > The two ways that we've dealt with this type of hazard are to copy data > out of the relcache before using it; or to give the relcache the > responsibility of not moving a particular portion of data if it did not > change. From memory, the latter applies to the tuple descriptor and > trigger data, but we've done most other things the first way. It seems that tuple descriptor is reference-counted; however trigger data is copied. The former seems to have been done on performance grounds (I found 06e10abc). So for a performance-sensitive relcache data structure, refcounting is the way to go (although done quite rarely)? In this particular case, it is a "partition descriptor" that could get big for a partitioned table with partitions in hundreds or thousands. Thanks, Amit
Hi Amit and all,
Here is an experimental patch that optimizes planning time for range partitioned tables (it could be considered a "proof of concept"). The patch should be applied on top of Amit's declarative partitioning patch. It handles only a special (though commonly used) case where the partitioning key consists of just a single attribute and doesn't contain expressions.
The main idea is the following:
* we look for clauses like 'VAR OP CONST' (where VAR is a partitioning key attribute and OP is a comparison operator);
* using binary search, we find the partition (X) whose range contains the CONST value;
* based on the OP operator, we determine which other partitions are also covered by the clause. The possible cases are:
1. If OP is '<' or '<=' then we need X and the partitions to its left
2. If OP is '>' or '>=' then we need X and the partitions to its right
3. If OP is '=' then we need only partition X
(for '<' and '>' we also check whether the CONST value equals X's lower or upper boundary, respectively, and if so exclude X).
For boolean expressions we evaluate the left and right sides according to the algorithm above and then, based on the boolean operator, take the intersection (for AND) or union (for OR) of the resulting partition sets (a toy SQL sketch of this lookup is below).
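To make the idea concrete, here is a toy SQL sketch of the same lookup over a made-up bounds table (this is just an illustration, not the patch code; the bounds mirror the abc_N partitions used in the benchmark below):

create table part_bounds (partno int, lo int, hi int);
insert into part_bounds select i, (i - 1) * 1000, i * 1000 from generate_series(1, 100) i;

-- the partition X that fits CONST = 4567 (the binary-search step):
select partno from part_bounds where lo <= 4567 and 4567 < hi;

-- partitions covered by a clause like "a < 4567" (X plus everything to its left):
select partno from part_bounds where lo < 4567 order by partno;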
I ran some benchmarks on:
1. original constraint exclusion mechanism,
2. optimized version (this patch) and
3. optimized version using relation->rd_partdesc pointer instead of RelationGetPartitionDesc() function (see previous discussion).
Initial conditions:
CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY RANGE (a);
CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (0) END (1000);
CREATE TABLE abc_2 PARTITION OF abc FOR VALUES START (1000) END (2000);
...
etc
INSERT INTO %s (a) SELECT generate_series(0, <partitions_count> * 1000);
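(The partitions themselves can be generated with a DO block along these lines; the loop bound is simply whatever partition count is being tested:)

DO $$
BEGIN
  FOR i IN 1..100 LOOP  -- or however many partitions are being tested
    EXECUTE format(
      'CREATE TABLE abc_%s PARTITION OF abc FOR VALUES START (%s) END (%s)',
      i, (i - 1) * 1000, i * 1000);
  END LOOP;
END
$$;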
pgbench scripts: https://gist.github.com/zilder/872e634a8eeb405bd045465fc9527e53 (where :partitions is the number of partitions).
The first script tests fetching a single row from the partitioned table. Results (tps):
 # of partitions | constraint excl. | optimized | optimized (using pointer)
-----------------+------------------+-----------+---------------------------
             100 |              658 |      2906 |                      3079
            1000 |               45 |      2174 |                      3021
            2000 |               22 |      1667 |                      2919
The second script tests fetching all data from a single partition. Results (tps):
 # of partitions | constraint excl. | optimized | optimized (using pointer)
-----------------+------------------+-----------+---------------------------
             100 |              317 |      1001 |                      1051
            1000 |               34 |       941 |                      1023
            2000 |               15 |       813 |                      1016
The optimized version works much faster with a large number of partitions and degrades more slowly than constraint exclusion. But there is still a noticeable performance hit from copying the PartitionDesc structure: with 2000 partitions, the RelationGetPartitionDesc() function spent more than 40% of all execution time on copying in the first benchmark (measured with `perf`). Using reference counting as Amit suggests would significantly reduce this degradation.
Any comments and suggestions are very welcome. Thanks!
On 18.05.2016 04:26, Amit Langote wrote:
> On 2016/05/18 2:22, Tom Lane wrote:
>> The two ways that we've dealt with this type of hazard are to copy data
>> out of the relcache before using it; or to give the relcache the
>> responsibility of not moving a particular portion of data if it did not
>> change. From memory, the latter applies to the tuple descriptor and
>> trigger data, but we've done most other things the first way.
> It seems that tuple descriptor is reference-counted; however trigger data
> is copied. The former seems to have been done on performance grounds (I
> found 06e10abc). So for a performance-sensitive relcache data structure,
> refcounting is the way to go (although done quite rarely)? In this
> particular case, it is a "partition descriptor" that could get big for a
> partitioned table with partitions in hundreds or thousands.
> Thanks, Amit
-- Ildar Musin i.musin@postgrespro.ru
Attachment
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2016/05/18 2:22, Tom Lane wrote: >> The two ways that we've dealt with this type of hazard are to copy data >> out of the relcache before using it; or to give the relcache the >> responsibility of not moving a particular portion of data if it did not >> change. From memory, the latter applies to the tuple descriptor and >> trigger data, but we've done most other things the first way. After actually looking at the code, we do things that way for the tupledesc, the relation's rules if any, and RLS policies --- see RelationClearRelation(). > It seems that tuple descriptor is reference-counted; however trigger data > is copied. The former seems to have been done on performance grounds (I > found 06e10abc). We do refcount tuple descriptors, but we've been afraid to try to rely completely on that; there are too many places that assume a relcache entry's tupdesc is safe to reference. It's not that easy to go over to a fully refcounted approach, because that creates a new problem of being sure that refcounts are decremented when necessary --- that's a pain, particularly when a query is abandoned due to an error. > So for a performance-sensitive relcache data structure, refcounting is the > way to go (although done quite rarely)? I'd be suspicious of this because of the cleanup problem. The don't-replace-unless-changed approach is the one that's actually battle tested. regards, tom lane
On 2016/05/19 2:48, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> On 2016/05/18 2:22, Tom Lane wrote: >>> The two ways that we've dealt with this type of hazard are to copy data >>> out of the relcache before using it; or to give the relcache the >>> responsibility of not moving a particular portion of data if it did not >>> change. From memory, the latter applies to the tuple descriptor and >>> trigger data, but we've done most other things the first way. > > After actually looking at the code, we do things that way for the > tupledesc, the relation's rules if any, and RLS policies --- see > RelationClearRelation(). I think I confused refcounting method of keeping things around with the RelationClearRelation()'s method. I now understand that you meant the latter in your original message. >> It seems that tuple descriptor is reference-counted; however trigger data >> is copied. The former seems to have been done on performance grounds (I >> found 06e10abc). > > We do refcount tuple descriptors, but we've been afraid to try to rely > completely on that; there are too many places that assume a relcache > entry's tupdesc is safe to reference. It's not that easy to go over to > a fully refcounted approach, because that creates a new problem of being > sure that refcounts are decremented when necessary --- that's a pain, > particularly when a query is abandoned due to an error. I see. >> So for a performance-sensitive relcache data structure, refcounting is the >> way to go (although done quite rarely)? > > I'd be suspicious of this because of the cleanup problem. The > don't-replace-unless-changed approach is the one that's actually battle > tested. OK, I will try the RelationClearRelation()'s method of keeping partition descriptor data around so that no repeated copying is necessary. Thanks, Amit
Hi Ildar, On 2016/05/19 0:36, Ildar Musin wrote: > > Here is an experimental patch that optimizes planning time for range > partitioned tables (it could be considered as a "proof of concept"). Patch > should be applied on top of Amit's declarative partitioning patch. It > handles only a very special case (often used though) where partitioning > key consists of just a single attribute and doesn't contain expressions. Great, thanks! I understand that it's still PoC and the point may be just to consider performance implications of excessive partdesc copying but I'm wondering about a few things about the patch in general. See below. > The main idea is the following: > * we are looking for clauses like 'VAR OP CONST' (where VAR is > partitioning key attribute, OP is a comparison operator); > * using binary search find a partition (X) that fits CONST value; > * based on OP operator determine which partitions are also covered by > clause. There are possible cases: > 1. If OP is '<' or '<=' then we need partitions standing left from X > (including) > 2. If OP is '>' or '>=' then we need partitions standing right from X > (including) > 3. If OP is '=' the we need only X partition > (for '<' and '>' operators we also check if CONST value is equal to a > lower or upper boundary (accordingly) and if it's true then exclude X). > > For boolean expressions we evaluate left and right sides accordingly to > algorithm above and then based on boolean operator find intersection (for > AND) or union (for OR). Perhaps you're already aware but may I also suggest looking at how clauses are matched to indexes? For example, consider how match_clauses_to_index() in src/backend/optimizer/path/indxpath.c works. Moreover, instead of pruning partitions in planner prep phase, might it not be better to do that when considering paths for the (partitioned) rel? IOW, instead of looking at parse->jointree, we should rather be working with rel->baserestrictinfo. Although, that would require some revisions to how append_rel_list, simple_rel_list, etc. are constructed and manipulated in a given planner invocation. Maybe it's time for that... Again, you may have already considered these things. > I run some benchmarks on: > 1. original constraint exclusion mechanism, > 2. optimized version (this patch) and > 3. optimized version using relation->rd_partdesc pointer instead of > RelationGetPartitionDesc() function (see previous discussion). > > Initial conditions: > > CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY > RANGE (a); > CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (0) END (1000); > CREATE TABLE abc_2 PARTITION OF abc FOR VALUES START (1000) END (2000); > ... > etc > INSERT INTO %s (a) SELECT generate_series(0, <partitions_count> * 1000); > > pgbench scripts: > https://gist.github.com/zilder/872e634a8eeb405bd045465fc9527e53 (where > :partitions is a number of partitions). > The first script tests fetching a single row from the partitioned table. > Results (tps): > > # of partitions | constraint excl. | optimized | optimized (using > pointer) > ----------------+------------------+---------------+---------------------------- > > 100 | 658 | 2906 | 3079 > 1000 | 45 | 2174 | 3021 > 2000 | 22 | 1667 | 2919 > > > The second script tests fetching all data from a single partition. Results > (tps): > > # of partitions | constraint excl. 
| optimized | optimized (using > pointer) > ----------------+------------------+---------------+---------------------------- > > 100 | 317 | 1001 | 1051 > 1000 | 34 | 941 | 1023 > 2000 | 15 | 813 | 1016 > > Optimized version works much faster on large amount of partitions and > degradates slower than constraint exclusion. But still there is a > noticeable performance degradation from copying PartitionDesc structure: > with 2000 partitions RelationGetPartitionDesc() function spent more than > 40% of all execution time on copying in first benchmark (measured with > `perf`). Using reference counting as Amit suggests will allow to > significantily decrease performance degradation. Could you try with the attached updated set of patches? I changed partition descriptor relcache code to eliminate excessive copying in previous versions. Thanks, Amit
Attachment
- 0001-Add-syntax-to-specify-partition-key-v5.patch
- 0002-Add-a-IGNORE-dependency-type-v5.patch
- 0003-Infrastructure-for-creation-of-partitioned-tables-v5.patch
- 0004-Add-syntax-to-create-partitions-v5.patch
- 0005-Infrastructure-for-partition-metadata-storage-and-ma-v5.patch
- 0006-Introduce-tuple-routing-for-partitioned-tables-v5.patch
Hi Amit,

On 20.05.2016 11:37, Amit Langote wrote:
> Perhaps you're already aware but may I also suggest looking at how
> clauses are matched to indexes? For example, consider how
> match_clauses_to_index() in src/backend/optimizer/path/indxpath.c works.

Thanks, I'll take a closer look at it.

> Moreover, instead of pruning partitions in planner prep phase, might it
> not be better to do that when considering paths for the (partitioned) rel?
> IOW, instead of looking at parse->jointree, we should rather be working
> with rel->baserestrictinfo. Although, that would require some revisions
> to how append_rel_list, simple_rel_list, etc. are constructed and
> manipulated in a given planner invocation. Maybe it's time for that...
> Again, you may have already considered these things.

Yes, you're right, this is how we did it in the pg_pathman extension. But for this patch it requires further consideration and I'll do it in the future!

> Could you try with the attached updated set of patches? I changed
> partition descriptor relcache code to eliminate excessive copying in
> previous versions.
>
> Thanks,
> Amit

I tried your new patch and got the following results, which are quite close to the ones using a pointer to the PartitionDesc structure (TPS):

 # of partitions | single row | single partition
-----------------+------------+------------------
             100 |       3014 |             1024
            1000 |       2964 |             1001
            2000 |       2874 |             1000

However, I've encountered a problem: postgres crashes occasionally while creating partitions. Here is a function that reproduces this behaviour:

CREATE OR REPLACE FUNCTION fail()
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
DROP TABLE IF EXISTS abc CASCADE;
CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY RANGE (a);
CREATE INDEX ON abc (a);
CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000);
CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (1000) END (2000);
END
$$;

SELECT fail();

It happens not every time but quite often. It doesn't happen if I execute these commands one by one in psql.
Backtrace:

#0  range_overlaps_existing_partition (key=0x7f1097504410, range_spec=0x1d0f400, pdesc=0x1d32200, with=0x7ffe437ead00) at partition.c:747
#1  0x000000000054c2a5 in StorePartitionBound (relid=245775, parentId=245770, bound=0x1d0f400) at partition.c:578
#2  0x000000000061bfc4 in DefineRelation (stmt=0x1d0dfe0, relkind=114 'r', ownerId=10, typaddress=0x0) at tablecmds.c:739
#3  0x00000000007f4473 in ProcessUtilitySlow (parsetree=0x1d1a150, queryString=0x1d1d940 "CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000)", context=PROCESS_UTILITY_QUERY, params=0x0, dest=0xdb5ca0 <spi_printtupDR>, completionTag=0x7ffe437eb500 "") at utility.c:983
#4  0x00000000007f425e in standard_ProcessUtility (parsetree=0x1d1a150, queryString=0x1d1d940 "CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000)", context=PROCESS_UTILITY_QUERY, params=0x0, dest=0xdb5ca0 <spi_printtupDR>, completionTag=0x7ffe437eb500 "") at utility.c:907
#5  0x00000000007f3354 in ProcessUtility (parsetree=0x1d1a150, queryString=0x1d1d940 "CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000)", context=PROCESS_UTILITY_QUERY, params=0x0, dest=0xdb5ca0 <spi_printtupDR>, completionTag=0x7ffe437eb500 "") at utility.c:336
#6  0x000000000069f8b2 in _SPI_execute_plan (plan=0x1d19cf0, paramLI=0x0, snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000', fire_triggers=1 '\001', tcount=0) at spi.c:2200
#7  0x000000000069c735 in SPI_execute_plan_with_paramlist (plan=0x1d19cf0, params=0x0, read_only=0 '\000', tcount=0) at spi.c:450
#8  0x00007f108cc6266f in exec_stmt_execsql (estate=0x7ffe437eb8e0, stmt=0x1d05318) at pl_exec.c:3517
#9  0x00007f108cc5e5fc in exec_stmt (estate=0x7ffe437eb8e0, stmt=0x1d05318) at pl_exec.c:1503
#10 0x00007f108cc5e318 in exec_stmts (estate=0x7ffe437eb8e0, stmts=0x1d04c98) at pl_exec.c:1398
#11 0x00007f108cc5e1af in exec_stmt_block (estate=0x7ffe437eb8e0, block=0x1d055e0) at pl_exec.c:1336
#12 0x00007f108cc5c35d in plpgsql_exec_function (func=0x1cc2a90, fcinfo=0x1cf7f50, simple_eval_estate=0x0) at pl_exec.c:434
...

Thanks

--
Ildar Musin
i.musin@postgrespro.ru
Hi Ildar, On 2016/05/21 0:29, Ildar Musin wrote: > On 20.05.2016 11:37, Amit Langote wrote: >> Moreover, instead of pruning partitions in planner prep phase, might it >> not be better to do that when considering paths for the (partitioned) rel? >> IOW, instead of looking at parse->jointree, we should rather be working >> with rel->baserestrictinfo. Although, that would require some revisions >> to how append_rel_list, simple_rel_list, etc. are constructed and >> manipulated in a given planner invocation. Maybe it's time for that... >> Again, you may have already considered these things. >> > Yes, you're right, this is how we did it in pg_pathman extension. But for > this patch it requires further consideration and I'll do it in future! OK, sure. >> Could you try with the attached updated set of patches? I changed >> partition descriptor relcache code to eliminate excessive copying in >> previous versions. [...] > However I've encountered a problem which is that postgres crashes > occasionally while creating partitions. Here is function that reproduces > this behaviour: > > CREATE OR REPLACE FUNCTION fail() > RETURNS void > LANGUAGE plpgsql > AS $$ > BEGIN > DROP TABLE IF EXISTS abc CASCADE; > CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY > RANGE (a); > CREATE INDEX ON abc (a); > CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000); > CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (1000) END (2000); > END > $$; > > SELECT fail(); > > It happens not every time but quite often. It doesn't happen if I execute > this commands one by one in psql. Backtrace: > > #0 range_overlaps_existing_partition (key=0x7f1097504410, > range_spec=0x1d0f400, pdesc=0x1d32200, with=0x7ffe437ead00) at > partition.c:747 [...] I made a mistake in the last version of the patch which caused a relcache field to be pfree'd unexpectedly. Attached updated patches. Thanks, Amit
Attachment
- 0005-Infrastructure-for-partition-metadata-storage-and-ma-v5.patch
- 0001-Add-syntax-to-specify-partition-key-v5.patch
- 0002-Add-a-IGNORE-dependency-type-v5.patch
- 0003-Infrastructure-for-creation-of-partitioned-tables-v5.patch
- 0004-Add-syntax-to-create-partitions-v5.patch
- 0006-Introduce-tuple-routing-for-partitioned-tables-v5.patch
0003-... patch does not apply cleanly. It has some conflicts in pg_dump.c. I have tried fixing the conflict in attached patch.
On Mon, May 23, 2016 at 3:35 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi Ildar,
On 2016/05/21 0:29, Ildar Musin wrote:
> On 20.05.2016 11:37, Amit Langote wrote:
>> Moreover, instead of pruning partitions in planner prep phase, might it
>> not be better to do that when considering paths for the (partitioned) rel?
>> IOW, instead of looking at parse->jointree, we should rather be working
>> with rel->baserestrictinfo. Although, that would require some revisions
>> to how append_rel_list, simple_rel_list, etc. are constructed and
>> manipulated in a given planner invocation. Maybe it's time for that...
>> Again, you may have already considered these things.
>>
> Yes, you're right, this is how we did it in pg_pathman extension. But for
> this patch it requires further consideration and I'll do it in future!
OK, sure.
>> Could you try with the attached updated set of patches? I changed
>> partition descriptor relcache code to eliminate excessive copying in
>> previous versions.
[...]
> However I've encountered a problem which is that postgres crashes
> occasionally while creating partitions. Here is function that reproduces
> this behaviour:
>
> CREATE OR REPLACE FUNCTION fail()
> RETURNS void
> LANGUAGE plpgsql
> AS $$
> BEGIN
> DROP TABLE IF EXISTS abc CASCADE;
> CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY
> RANGE (a);
> CREATE INDEX ON abc (a);
> CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000);
> CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (1000) END (2000);
> END
> $$;
>
> SELECT fail();
>
> It happens not every time but quite often. It doesn't happen if I execute
> this commands one by one in psql. Backtrace:
>
> #0 range_overlaps_existing_partition (key=0x7f1097504410,
> range_spec=0x1d0f400, pdesc=0x1d32200, with=0x7ffe437ead00) at
> partition.c:747
[...]
I made a mistake in the last version of the patch which caused a relcache
field to be pfree'd unexpectedly. Attached updated patches.
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
On 2016/06/08 22:22, Ashutosh Bapat wrote: > On Mon, May 23, 2016 at 3:35 PM, Amit Langote wrote >> >> [...] >> >> I made a mistake in the last version of the patch which caused a relcache >> field to be pfree'd unexpectedly. Attached updated patches. > > 0003-... patch does not apply cleanly. It has some conflicts in pg_dump.c. > I have tried fixing the conflict in attached patch. Thanks. See attached rebased patches. Regards, Amit
Attachment
- 0001-Add-syntax-to-specify-partition-key-v6.patch
- 0002-Add-a-IGNORE-dependency-type-v6.patch
- 0003-Infrastructure-for-creation-of-partitioned-tables-v6.patch
- 0004-Add-syntax-to-create-partitions-v6.patch
- 0005-Infrastructure-for-partition-metadata-storage-and-ma-v6.patch
- 0006-Introduce-tuple-routing-for-partitioned-tables-v6.patch
Hi Amit,
I tried creating a partitioned table by range on an expression like
CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + b);
ERROR: syntax error at or near "+"
LINE 1: ...E pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + b);
On Thu, Jun 9, 2016 at 7:20 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/06/08 22:22, Ashutosh Bapat wrote:
> On Mon, May 23, 2016 at 3:35 PM, Amit Langote wrote
>>
>> [...]
>>
>> I made a mistake in the last version of the patch which caused a relcache
>> field to be pfree'd unexpectedly. Attached updated patches.
>
> 0003-... patch does not apply cleanly. It has some conflicts in pg_dump.c.
> I have tried fixing the conflict in attached patch.
Thanks. See attached rebased patches.
Regards,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/06/21 16:49, Ashutosh Bapat wrote: > Hi Amit, > I tried creating partitioned table by range on an expression like > CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + > b); > ERROR: syntax error at or near > "+" > > LINE 1: ...E pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + b); > > But when I try to create partitioned table by range on expression chr(a) it > works. The expression needs to be parenthesized (function call expressions don't need though). So something like the following would work: CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE(add(a, b)); Thanks, Amit
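PS: The parenthesized form of your original expression works as well, for example:

CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE((a + b));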
Hi Amit,
I tried creating a 2-level partitioned table and then creating a simple table using CTAS from the partitioned table. It gives a cache lookup error. Here's the test:
CREATE TABLE pt1_l (a int, b varchar, c int) PARTITION BY RANGE(a);
CREATE TABLE pt1_l_p1 PARTITION OF pt1_l FOR VALUES START (1) END (250) INCLUSIVE PARTITION BY RANGE(b);
CREATE TABLE pt1_l_p2 PARTITION OF pt1_l FOR VALUES START (251) END (500) INCLUSIVE PARTITION BY RANGE(((a+c)/2));
CREATE TABLE pt1_l_p3 PARTITION OF pt1_l FOR VALUES START (501) END (600) INCLUSIVE PARTITION BY RANGE(c);
CREATE TABLE pt1_l_p1_p1 PARTITION OF pt1_l_p1 FOR VALUES START ('000001') END ('000125') INCLUSIVE;
CREATE TABLE pt1_l_p1_p2 PARTITION OF pt1_l_p1 FOR VALUES START ('000126') END ('000250') INCLUSIVE;
CREATE TABLE pt1_l_p2_p1 PARTITION OF pt1_l_p2 FOR VALUES START (251) END (375) INCLUSIVE;
CREATE TABLE pt1_l_p2_p2 PARTITION OF pt1_l_p2 FOR VALUES START (376) END (500) INCLUSIVE;
CREATE TABLE pt1_l_p3_p1 PARTITION OF pt1_l_p3 FOR VALUES START (501) END (550) INCLUSIVE;
CREATE TABLE pt1_l_p3_p2 PARTITION OF pt1_l_p3 FOR VALUES START (551) END (600) INCLUSIVE;
INSERT INTO pt1_l SELECT i, to_char(i, 'FM000000'), i FROM generate_series(1, 600, 2) i;
CREATE TABLE upt1_l AS SELECT * FROM pt1_l;
On Thu, Jun 9, 2016 at 7:20 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/06/08 22:22, Ashutosh Bapat wrote:
> On Mon, May 23, 2016 at 3:35 PM, Amit Langote wrote
>>
>> [...]
>>
>> I made a mistake in the last version of the patch which caused a relcache
>> field to be pfree'd unexpectedly. Attached updated patches.
>
> 0003-... patch does not apply cleanly. It has some conflicts in pg_dump.c.
> I have tried fixing the conflict in attached patch.
Thanks. See attached rebased patches.
Regards,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Hi Ashutosh, On 2016/06/24 23:08, Ashutosh Bapat wrote: > Hi Amit, > I tried creating 2-level partitioned table and tried to create simple table > using CTAS from the partitioned table. It gives a cache lookup error. > Here's the test > CREATE TABLE pt1_l (a int, b varchar, c int) PARTITION BY RANGE(a); > CREATE TABLE pt1_l_p1 PARTITION OF pt1_l FOR VALUES START (1) END (250) > INCLUSIVE PARTITION BY RANGE(b); > CREATE TABLE pt1_l_p2 PARTITION OF pt1_l FOR VALUES START (251) END (500) > INCLUSIVE PARTITION BY RANGE(((a+c)/2)); > CREATE TABLE pt1_l_p3 PARTITION OF pt1_l FOR VALUES START (501) END (600) > INCLUSIVE PARTITION BY RANGE(c); > CREATE TABLE pt1_l_p1_p1 PARTITION OF pt1_l_p1 FOR VALUES START ('000001') > END ('000125') INCLUSIVE; > CREATE TABLE pt1_l_p1_p2 PARTITION OF pt1_l_p1 FOR VALUES START ('000126') > END ('000250') INCLUSIVE; > CREATE TABLE pt1_l_p2_p1 PARTITION OF pt1_l_p2 FOR VALUES START (251) END > (375) INCLUSIVE; > CREATE TABLE pt1_l_p2_p2 PARTITION OF pt1_l_p2 FOR VALUES START (376) END > (500) INCLUSIVE; > CREATE TABLE pt1_l_p3_p1 PARTITION OF pt1_l_p3 FOR VALUES START (501) END > (550) INCLUSIVE; > CREATE TABLE pt1_l_p3_p2 PARTITION OF pt1_l_p3 FOR VALUES START (551) END > (600) INCLUSIVE; > INSERT INTO pt1_l SELECT i, to_char(i, 'FM000000'), i FROM > generate_series(1, 600, 2) i; > CREATE TABLE upt1_l AS SELECT * FROM pt1_l; > > The last statement gives error "ERROR: cache lookup failed for function > 0". Let me know if this problem is reproducible. Thanks for the test case. I can reproduce the error. It has to do with partition key column (b) being of type varchar whereas the default operator family for the type being text_ops and there not being an =(varchar, varchar) operator in text_ops. When creating a plan for SELECT * FROM pt1_l, which is an Append plan, partition check quals are generated internally to be used for constraint exclusion - such as, b >= '000001' AND b < '000125'. Individual OpExpr's are generated by using information from PartitionKey of the parent (PartitionKey) and pg_partition entry of the partition. When choosing the operator to use for =, >=, <, etc., opfamily and typid of corresponding columns are referred. As mentioned above, in this case, they happened to be text_ops and varchar, respectively, for column b. There doesn't exist an operator =(varchar, varchar) in text_ops, so InvalidOid is returned by get_opfamily_member which goes unchecked until the error in question occurs. I have tried to fix this in attached updated patch by using operator class input type for operator resolution in cases where column type didn't help. Thanks, Amit
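PS: As a rough catalog-level sanity check of the above (not part of the patch, just for reference), the following query looks for a btree equality member keyed on varchar in the text_ops family; it returns no rows, which is exactly why get_opfamily_member() came back with InvalidOid:

SELECT ao.amopopr::regoperator
FROM pg_amop ao
JOIN pg_opfamily opf ON opf.oid = ao.amopfamily
JOIN pg_am am ON am.oid = opf.opfmethod
WHERE am.amname = 'btree'
  AND opf.opfname = 'text_ops'
  AND ao.amoplefttype = 'varchar'::regtype
  AND ao.amoprighttype = 'varchar'::regtype
  AND ao.amopstrategy = 3;  -- btree equality strategy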
Attachment
- 0001-Add-syntax-to-specify-partition-key-v7.patch
- 0002-Add-a-IGNORE-dependency-type-v7.patch
- 0003-Infrastructure-for-creation-of-partitioned-tables-v7.patch
- 0004-Add-syntax-to-create-partitions-v7.patch
- 0005-Infrastructure-for-partition-metadata-storage-and-ma-v7.patch
- 0006-Introduce-tuple-routing-for-partitioned-tables-v7.patch
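For illustration, the fallback described above could take roughly the following shape. This is only a sketch under the assumptions stated in the message; the function name and error text are made up for the example and are not taken from the patch:

/*
 * Illustrative sketch: resolve the comparison operator for a partition
 * key column, falling back to the opclass's declared input type when the
 * column's own type (e.g. varchar) has no entry in the opfamily.
 */
static Oid
get_partition_operator(Oid opfamily, Oid opcintype, Oid coltype,
                       int16 strategy)
{
    Oid     oper;

    /* First try the column's own type, as before. */
    oper = get_opfamily_member(opfamily, coltype, coltype, strategy);

    /* Fall back to the opclass input type, e.g. text for varchar. */
    if (!OidIsValid(oper))
        oper = get_opfamily_member(opfamily, opcintype, opcintype, strategy);

    if (!OidIsValid(oper))
        elog(ERROR, "missing operator (strategy %d) in opfamily %u",
             strategy, opfamily);

    return oper;
}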
Hi Amit,
I observed that the ChangeVarNodes call at line 1229 in get_relation_constraints() (code below) changes the varno in the cached copy of the partition key expression, which is undesirable. The reason seems to be that RelationGetPartitionCheckQual() returns the partition key expression directly from the cache rather than a copy. This is mostly because get_check_qual_for_range() works directly on the cached copy of the partition key expressions, which it never should.

1223 /* Append partition check quals, if any */
1224 pcqual = RelationGetPartitionCheckQual(relation);
1225 if (pcqual)
1226 {
1227 /* Fix Vars to have the desired varno */
1228 if (varno != 1)
1229 ChangeVarNodes((Node *) pcqual, 1, varno, 0);
1230
1231 result = list_concat(result, pcqual);
1232 }
On Mon, Jun 27, 2016 at 3:56 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi Ashutosh,
On 2016/06/24 23:08, Ashutosh Bapat wrote:
> Hi Amit,
> I tried creating 2-level partitioned table and tried to create simple table
> using CTAS from the partitioned table. It gives a cache lookup error.
> Here's the test
> CREATE TABLE pt1_l (a int, b varchar, c int) PARTITION BY RANGE(a);
> CREATE TABLE pt1_l_p1 PARTITION OF pt1_l FOR VALUES START (1) END (250)
> INCLUSIVE PARTITION BY RANGE(b);
> CREATE TABLE pt1_l_p2 PARTITION OF pt1_l FOR VALUES START (251) END (500)
> INCLUSIVE PARTITION BY RANGE(((a+c)/2));
> CREATE TABLE pt1_l_p3 PARTITION OF pt1_l FOR VALUES START (501) END (600)
> INCLUSIVE PARTITION BY RANGE(c);
> CREATE TABLE pt1_l_p1_p1 PARTITION OF pt1_l_p1 FOR VALUES START ('000001')
> END ('000125') INCLUSIVE;
> CREATE TABLE pt1_l_p1_p2 PARTITION OF pt1_l_p1 FOR VALUES START ('000126')
> END ('000250') INCLUSIVE;
> CREATE TABLE pt1_l_p2_p1 PARTITION OF pt1_l_p2 FOR VALUES START (251) END
> (375) INCLUSIVE;
> CREATE TABLE pt1_l_p2_p2 PARTITION OF pt1_l_p2 FOR VALUES START (376) END
> (500) INCLUSIVE;
> CREATE TABLE pt1_l_p3_p1 PARTITION OF pt1_l_p3 FOR VALUES START (501) END
> (550) INCLUSIVE;
> CREATE TABLE pt1_l_p3_p2 PARTITION OF pt1_l_p3 FOR VALUES START (551) END
> (600) INCLUSIVE;
> INSERT INTO pt1_l SELECT i, to_char(i, 'FM000000'), i FROM
> generate_series(1, 600, 2) i;
> CREATE TABLE upt1_l AS SELECT * FROM pt1_l;
>
> The last statement gives error "ERROR: cache lookup failed for function
> 0". Let me know if this problem is reproducible.
Thanks for the test case. I can reproduce the error. It has to do with
partition key column (b) being of type varchar whereas the default
operator family for the type being text_ops and there not being an
=(varchar, varchar) operator in text_ops.
When creating a plan for SELECT * FROM pt1_l, which is an Append plan,
partition check quals are generated internally to be used for constraint
exclusion - such as, b >= '000001' AND b < '000125'. Individual OpExpr's
are generated by using information from PartitionKey of the parent
(PartitionKey) and pg_partition entry of the partition. When choosing the
operator to use for =, >=, <, etc., opfamily and typid of corresponding
columns are referred. As mentioned above, in this case, they happened to
be text_ops and varchar, respectively, for column b. There doesn't exist
an operator =(varchar, varchar) in text_ops, so InvalidOid is returned by
get_opfamily_member which goes unchecked until the error in question occurs.
I have tried to fix this in attached updated patch by using operator class
input type for operator resolution in cases where column type didn't help.
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/07/04 21:31, Ashutosh Bapat wrote:
> Hi Amit,
> I observed that the ChangeVarNodes call at line 1229 in
> get_relation_constraints() (code below) changes the varno in the cached
> copy of partition key expression, which is undesirable. The reason for this
> seems to be that the RelationGetPartitionCheckQual() returns the copy of
> partition key expression directly from the cache. This is mostly because
> get_check_qual_for_range() directly works on cached copy of partition key
> expressions, which it should never.

Yes, a copyObject() on key->partexprs items seems necessary. Will fix that.

> 1223 /* Append partition check quals, if any */
> 1224 pcqual = RelationGetPartitionCheckQual(relation);
> 1225 if (pcqual)
> 1226 {
> 1227 /* Fix Vars to have the desired varno */
> 1228 if (varno != 1)
> 1229 ChangeVarNodes((Node *) pcqual, 1, varno, 0);
> 1230
> 1231 result = list_concat(result, pcqual);
> 1232 }
>
> Because of this, the first time through the partition key expressions are
> valid, but then onwards they are restamped with the varno of the first
> partition.
>
> Please add testcases to your patch to catch such types of issues.

I will integrate tests into the patch(es) and add some more.

Thanks,
Amit
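For illustration, here is the same get_relation_constraints() fragment with a defensive copy added. This is only a sketch of one way to stop the scribbling; the fix actually being discussed is to copy inside the partitioning code (copyObject on the key->partexprs items) rather than in the caller:

    /* Append partition check quals, if any */
    pcqual = RelationGetPartitionCheckQual(relation);
    if (pcqual)
    {
        /*
         * Illustrative variant only: work on a copy so that ChangeVarNodes()
         * below cannot modify expressions cached in the relcache.
         */
        pcqual = (List *) copyObject(pcqual);

        /* Fix Vars to have the desired varno */
        if (varno != 1)
            ChangeVarNodes((Node *) pcqual, 1, varno, 0);

        result = list_concat(result, pcqual);
    }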
<div dir="ltr">I am seeing following warning with this set of patches.<br />gram.y:4734:24: warning: assignment from incompatiblepointer type [enabled by default]<br /></div><div class="gmail_extra"><br /><div class="gmail_quote">On Tue,Jul 5, 2016 at 10:18 AM, Amit Langote <span dir="ltr"><<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp" target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 00 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 2016/07/04 21:31, Ashutosh Bapat wrote:<br /> >Hi Amit,<br /></span><span class="">> I observed that the ChangeVarNodes call at line 1229 in<br /> > get_relation_constraints()(code below) changes the varno in the cached<br /> > copy of partition key expression, whichis undesirable. The reason for this<br /> > seems to be that the RelationGetPartitionCheckQual() returns the copyof<br /> > partition key expression directly from the cache. This is mostly because<br /> > get_check_qual_for_range()directly works on cached copy of partition key<br /> > expressions, which it should never.<br/><br /></span>Yes, a copyObject() on key->partexprs items seems necessary. Will fix that.<br /><span class=""><br/> > 1223 /* Append partition check quals, if any */<br /> > 1224 pcqual = RelationGetPartitionCheckQual(relation);<br/> > 1225 if (pcqual)<br /> > 1226 {<br /> > 1227 /*Fix Vars to have the desired varno */<br /> > 1228 if (varno != 1)<br /> > 1229 ChangeVarNodes((Node*) pcqual, 1, varno, 0);<br /> > 1230<br /> > 1231 result = list_concat(result, pcqual);<br/> > 1232 }<br /> ><br /> > Because of this, the first time through the partition key expressionsare<br /> > valid, but then onwards they are restamped with the varno of the first<br /> > partition.<br/> ><br /> > Please add testcases to your patch to catch such types of issues.<br /><br /></span>I willintegrate tests into the patch(es) and add some more.<br /><br /> Thanks,<br /> Amit<br /><br /><br /></blockquote></div><br/><br clear="all" /><br />-- <br /><div class="gmail_signature" data-smartmail="gmail_signature"><divdir="ltr">Best Wishes,<br />Ashutosh Bapat<br />EnterpriseDB Corporation<br />The PostgresDatabase Company<br /></div></div></div>
On 2016/07/19 22:53, Ashutosh Bapat wrote:
> I am seeing following warning with this set of patches.
> gram.y:4734:24: warning: assignment from incompatible pointer type [enabled
> by default]

Thanks, will fix. Was a copy-paste error.

Thanks,
Amit
<div dir="ltr">The lists for list partitioned tables are stored as they are specified by the user. While searching for apartition to route tuple to, we compare it with every list value of every partition. We might do something better similarto what's been done to range partitions. The list of values for a given partition can be stored in ascending/descendingsorted order. Thus a binary search can be used to check whether given row's partition key column hassame value as one in the list. The partitions can then be stored in the ascending/descending order of the least/greatestvalues of corresponding partitions. We might be able to eliminate search in a given partition if its lowestvalue is higher than the given value or its higher value is lower than the given value.<br /></div><div class="gmail_extra"><br/><div class="gmail_quote">On Thu, Jul 21, 2016 at 10:10 AM, Amit Langote <span dir="ltr"><<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp"target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On2016/07/19 22:53, Ashutosh Bapat wrote:<br /> > I am seeing following warning with this set of patches.<br/> > gram.y:4734:24: warning: assignment from incompatible pointer type [enabled<br /> > by default]<br/><br /><br /></span>Thanks, will fix. Was a copy-paste error.<br /><br /> Thanks,<br /> Amit<br /><br /><br/></blockquote></div><br /><br clear="all" /><br />-- <br /><div class="gmail_signature" data-smartmail="gmail_signature"><divdir="ltr">Best Wishes,<br />Ashutosh Bapat<br />EnterpriseDB Corporation<br />The PostgresDatabase Company<br /></div></div></div>
On Fri, Aug 5, 2016 at 6:53 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> The lists for list partitioned tables are stored as they are specified by
> the user. While searching for a partition to route tuple to, we compare it
> with every list value of every partition. We might do something better
> similar to what's been done to range partitions. The list of values for a
> given partition can be stored in ascending/descending sorted order. Thus a
> binary search can be used to check whether given row's partition key column
> has same value as one in the list. The partitions can then be stored in the
> ascending/descending order of the least/greatest values of corresponding
> partitions.

+1. Here as with range partitions, we must be sure to know which
opclass should be used for the comparisons.

> We might be able to eliminate search in a given partition if its
> lowest value is higher than the given value or its higher value is lower
> than the given value.

I don't think I understand this part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Fri, Aug 5, 2016 at 5:21 PM, Robert Haas<span dir="ltr"><<a href="mailto:robertmhaas@gmail.com" target="_blank">robertmhaas@gmail.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">OnFri, Aug 5, 2016 at 6:53 AM, Ashutosh Bapat<br /> <<a href="mailto:ashutosh.bapat@enterprisedb.com">ashutosh.bapat@enterprisedb.<wbr/>com</a>> wrote:<br /> > The lists forlist partitioned tables are stored as they are specified by<br /> > the user. While searching for a partition to routetuple to, we compare it<br /> > with every list value of every partition. We might do something better<br /> >similar to what's been done to range partitions. The list of values for a<br /> > given partition can be stored inascending/descending sorted order. Thus a<br /> > binary search can be used to check whether given row's partition keycolumn<br /> > has same value as one in the list. The partitions can then be stored in the<br /> > ascending/descendingorder of the least/greatest values of corresponding<br /> > partitions.<br /><br /></span>+1. Hereas with range partitions, we must be sure to know which<br /> opclass should be used for the comparisons.<br /><spanclass=""><br /> > We might be able to eliminate search in a given partition if its<br /> > lowest value is higherthan the given value or its higher value is lower<br /> > than the given value.<br /><br /></span>I don't thinkI understand this part.<br /></blockquote></div><br clear="all" /></div><div class="gmail_extra">Consider lists ('e','i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a list partitioned tables. I am suggesting that we arrange them as('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row (either for comparison or for inserting) has value'c', we will search for it in ('a','b','l') but will be eliminate other two partitions since the second's partition'slowest value is higher than 'c' and lowest values of rest of the partitions are higher than that of the secondpartition. Without this order among the partitions, we will have to compare lowest values of all the partitions.<br/></div><div class="gmail_extra">-- <br /><div class="gmail_signature" data-smartmail="gmail_signature"><divdir="ltr">Best Wishes,<br />Ashutosh Bapat<br />EnterpriseDB Corporation<br />The PostgresDatabase Company<br /></div></div></div></div>
On Fri, Aug 5, 2016 at 8:10 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Aug 5, 2016 at 5:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Aug 5, 2016 at 6:53 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >> > The lists for list partitioned tables are stored as they are specified >> > by >> > the user. While searching for a partition to route tuple to, we compare >> > it >> > with every list value of every partition. We might do something better >> > similar to what's been done to range partitions. The list of values for >> > a >> > given partition can be stored in ascending/descending sorted order. Thus >> > a >> > binary search can be used to check whether given row's partition key >> > column >> > has same value as one in the list. The partitions can then be stored in >> > the >> > ascending/descending order of the least/greatest values of corresponding >> > partitions. >> >> +1. Here as with range partitions, we must be sure to know which >> opclass should be used for the comparisons. >> >> > We might be able to eliminate search in a given partition if its >> > lowest value is higher than the given value or its higher value is lower >> > than the given value. >> >> I don't think I understand this part. > > Consider lists ('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a > list partitioned tables. I am suggesting that we arrange them as > ('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row (either > for comparison or for inserting) has value 'c', we will search for it in > ('a','b','l') but will be eliminate other two partitions since the second's > partition's lowest value is higher than 'c' and lowest values of rest of the > partitions are higher than that of the second partition. Without this order > among the partitions, we will have to compare lowest values of all the > partitions. I would think that for that case what we'd want to do is create two lists. One looks like this: [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ] The other looks like this: [3, 3, 2, 1, 1, 2, 1, 1, 3, 2] Given a particular value, bsearch the first list. If the value is not found, it's not in any partition. If it is found, then go look up the same array index in the second list; that's the containing partition. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
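A sketch of the two-array lookup described above, written with plain integers for brevity; a real implementation would compare Datums using the comparison function of the key column's btree opclass, and the names here are illustrative only:

/*
 * 'values' holds every list value of every partition, in sorted order;
 * partidx[i] is the number of the partition that values[i] belongs to.
 * Returns the containing partition's number, or -1 if no partition
 * accepts the given value.
 */
static int
route_list_value(int key, const int *values, const int *partidx, int nvalues)
{
    int     lo = 0;
    int     hi = nvalues - 1;

    while (lo <= hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (values[mid] == key)
            return partidx[mid];        /* containing partition */
        else if (values[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }

    return -1;                          /* not in any partition's list */
}

With this layout, routing a tuple costs one binary search over all list values instead of scanning every partition's list.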
> Consider lists ('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a
> list partitioned tables. I am suggesting that we arrange them as
> ('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row (either
> for comparison or for inserting) has value 'c', we will search for it in
> ('a','b','l') but will be eliminate other two partitions since the second's
> partition's lowest value is higher than 'c' and lowest values of rest of the
> partitions are higher than that of the second partition. Without this order
> among the partitions, we will have to compare lowest values of all the
> partitions.
I would think that for that case what we'd want to do is create two
lists. One looks like this:
[ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ]
The other looks like this:
[3, 3, 2, 1, 1, 2, 1, 1, 3, 2]
small correction; there's an extra 1 here. Every partition in the example has only three values.
Given a particular value, bsearch the first list. If the value is not
found, it's not in any partition. If it is found, then go look up the
same array index in the second list; that's the containing partition.
+1, if we could do it. It will need a change in the way Amit's patch stores partitioning scheme in PartitionDesc.
This way specifications {('e', 'i', 'f'), ('h', 'd', 'm'), ('l', 'b', 'a')} and {('h', 'd', 'm'), ('e', 'i', 'f'), ('l', 'b', 'a')}, which are essentially the same, are represented in different ways. It makes matching partitions for partition-wise join a bit tedious. We have to make sure that the first array matches for both the joining relations and then make sure that all the values belonging to the same partition in one table also belong to the same partition in the other table. Matching subsets of lists for partition-wise join would need some more complex logic still.
At least for straightforward partitioned table matching it helps to have both these arrays look the same independent of the user specification. From that point of view, the partitions could be ordered by their lowest or highest list values and the second array would be the index in the ordered set. For both the specifications above, the lists will look like
[ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ]
[1, 1, 2, 3, 3, 2, 3, 1, 2]
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/08/05 21:38, Ashutosh Bapat wrote: >>> Consider lists ('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a >>> list partitioned tables. I am suggesting that we arrange them as >>> ('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row >> (either >>> for comparison or for inserting) has value 'c', we will search for it in >>> ('a','b','l') but will be eliminate other two partitions since the >> second's >>> partition's lowest value is higher than 'c' and lowest values of rest of >> the >>> partitions are higher than that of the second partition. Without this >> order >>> among the partitions, we will have to compare lowest values of all the >>> partitions. >> >> I would think that for that case what we'd want to do is create two >> lists. One looks like this: >> >> [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ] >> >> The other looks like this: >> >> [3, 3, 2, 1, 1, 2, 1, 1, 3, 2] >> > > small correction; there's an extra 1 here. Every partition in the example > has only three values. > > >> >> Given a particular value, bsearch the first list. If the value is not >> found, it's not in any partition. If it is found, then go look up the >> same array index in the second list; that's the containing partition. >> > > +1, if we could do it. It will need a change in the way Amit's patch stores > partitioning scheme in PartitionDesc. Okay, I will try to implement this in the next version of the patch. One thing that comes to mind is what if a user wants to apply hash operator class equality to list partitioned key by specifying a hash operator class for the corresponding column. In that case, we would not have the ordering procedure with an hash operator class, hence any ordering based optimization becomes impossible to implement. The current patch rejects a column for partition key if its type does not have a btree operator class for both range and list methods, so this issue doesn't exist, however it could be seen as a limitation. > This way specifications {('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', > 'a')} and {('h', 'd','m') , ('e', 'i', 'f'), and ('l', 'b', 'a')} which are > essentially same, are represented in different ways. It makes matching > partitions for partition-wise join a bit tedius. We have to make sure that > the first array matches for both the joining relations and then make sure > that all the values belonging to the same partition for one table also > belong to the same partition in the other table. Some more complex logic > for matching subsets of lists for partition-wise join. So, we have 3 choices for the internal representation of list partitions: Choice 1 (the current approach): Load them in the same order as they are found in the partition catalog: Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} In this case, mismatch on the first list would make the two tables incompatibly partitioned, whereas they really aren't incompatible. Choice 2: Representation with 2 arrays: Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1] Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3] It still doesn't help the case of pairwise joins because it's hard to tell which value belongs to which partition (the 2nd array carries the original partition numbers). Although it might still work for tuple-routing. 
Choice 3: Order all lists' elements for each list individually and then order the lists themselves on their first values: Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} This representation makes pairing partitions for pairwise joining convenient but for tuple-routing we still need to visit each partition in the worst case. > At least for straight forward partitioned table matching it helps to have > both these array look same independent of the user specification. From that > point of view, the partition be ordered by their lowest or highest list > values and the second array is the index in the ordered set. For both the > specifications above, the list will look like > > [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ] > [1, 1, 2, 3, 3, 2, 3, 1, 2] IIUC, this seems like a combination of 2 and 3 above: So, we have the following list partitions (as read from the catalog) Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} By applying the method of 3: Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} Then applying 2: Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [3, 1, 2] Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [2, 3, 1] This is user-specification independent representation wherein the partition numbers in the 2nd array are based on canonical representation (ordered lists). To check pairwise join compatibility, simply compare the first two arrays. As to which partitions (think OIDs, RTEs whatever) pair with each other, simply pair corresponding elements of the 3rd array which are original partitions numbers (or OIDs). Also when routing a tuple, we find partition number in the array 2 and then look up the array 3 to get the actual partition to insert the tuple. Neither of these representations make the logic of checking pairwise-join compatibility and pairing a subset of partitions (on either side) any easier, although I haven't given it a good thought yet. Thanks, Amit
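Put as a data structure, the combined representation discussed above might look roughly like the following; the struct and field names are made up for illustration and are not the patch's actual PartitionDesc layout:

typedef struct ListPartitionMap
{
    int     nvalues;    /* number of list values across all partitions */
    Datum  *values;     /* all values, in the key opclass's sort order */
    int    *indexes;    /* for values[i], the canonical partition number */
    Oid    *partoids;   /* canonical partition number -> partition OID */
} ListPartitionMap;

Tuple routing would use values/indexes, while pairing partitions of two tables for a join would compare values and indexes and then pair the corresponding partoids entries.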
IIUC, this seems like a combination of 2 and 3 above:
So, we have the following list partitions (as read from the catalog)
Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
By applying the method of 3:
Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
Then applying 2:
Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [3, 1, 2]
Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [2, 3, 1]
We should arrange the OID arrays to follow canonical order. Assuming that the OIDs of p1, p2, and p3 of table 1 are t1o1, t1o2 and t1o3 respectively, and those of p1, p2 and p3 of table 2 are t2o1, t2o2, t2o3 respectively, the last arrays should be [t1o3, t1o2, t1o1] and [t2o2, t2o1, t2o3]. Thus the last arrays from both representations give the OIDs of children that should be joined pair-wise. IOW, the OID array should just follow the canonical order instead of specification order. AFAIU, your patch arranges the range partition OIDs in the canonical order and not specification order.
This is user-specification independent representation wherein the
partition numbers in the 2nd array are based on canonical representation
(ordered lists). To check pairwise join compatibility, simply compare the
first two arrays. As to which partitions (think OIDs, RTEs whatever) pair
with each other, simply pair corresponding elements of the 3rd array which
are original partitions numbers (or OIDs). Also when routing a tuple, we
find partition number in the array 2 and then look up the array 3 to get
the actual partition to insert the tuple.
Neither of these representations make the logic of checking pairwise-join
compatibility and pairing a subset of partitions (on either side) any
easier, although I haven't given it a good thought yet.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> +1, if we could do it. It will need a change in the way Amit's patch stores >> partitioning scheme in PartitionDesc. > > Okay, I will try to implement this in the next version of the patch. > > One thing that comes to mind is what if a user wants to apply hash > operator class equality to list partitioned key by specifying a hash > operator class for the corresponding column. In that case, we would not > have the ordering procedure with an hash operator class, hence any > ordering based optimization becomes impossible to implement. The current > patch rejects a column for partition key if its type does not have a btree > operator class for both range and list methods, so this issue doesn't > exist, however it could be seen as a limitation. Yes, I think you should expect careful scrutiny of that issue. It seems clear to me that range partitioning requires a btree opclass, that hash partitioning requires a hash opclass, and that list partitioning requires at least one of those things. It would probably be reasonable to pick one or the other and insist that list partitioning always requires exactly that, but I can't see how it's reasonable to insist that you must have both types of opclass for any type of partitioning. >> This way specifications {('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', >> 'a')} and {('h', 'd','m') , ('e', 'i', 'f'), and ('l', 'b', 'a')} which are >> essentially same, are represented in different ways. It makes matching >> partitions for partition-wise join a bit tedius. We have to make sure that >> the first array matches for both the joining relations and then make sure >> that all the values belonging to the same partition for one table also >> belong to the same partition in the other table. Some more complex logic >> for matching subsets of lists for partition-wise join. > > So, we have 3 choices for the internal representation of list partitions: > > Choice 1 (the current approach): Load them in the same order as they are > found in the partition catalog: > > Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} > Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} > > In this case, mismatch on the first list would make the two tables > incompatibly partitioned, whereas they really aren't incompatible. Such a limitation seems clearly unacceptable. We absolutely must be able to match up compatible partitioning schemes without getting confused by little details like the order of the partitions. > Choice 2: Representation with 2 arrays: > > Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1] > Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3] > > It still doesn't help the case of pairwise joins because it's hard to tell > which value belongs to which partition (the 2nd array carries the original > partition numbers). Although it might still work for tuple-routing. It's very good for tuple routing. It can also be used to match up partitions for pairwise joins. Compare the first arrays. If they are unequal, stop. Else, compare the second arrays, incrementally building a mapping between them and returning false if the mapping turns out to be non-bijective. For example, in this case, we look at index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3. We look at index 2 and decide 2 -> 1. We look at index 4 and find that we already have a mapping for 2, but it's compatible because we need 2 -> 1 and that's what is already there. 
Similarly for the remaining entries. This is really a pretty easy loop to write and it should run very quickly. > Choice 3: Order all lists' elements for each list individually and then > order the lists themselves on their first values: > > Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} > Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} > > This representation makes pairing partitions for pairwise joining > convenient but for tuple-routing we still need to visit each partition in > the worst case. I think this is clearly not good enough for tuple routing. If the algorithm I proposed above turns out to be too slow for matching partitions, then we could keep both this representation and the previous one. We are not limited to just one. But I don't think that's likely to be the case. Also, note that all of this presupposes we're doing range partitioning, or perhaps list partitioning with a btree opclass. For partitioning based on a hash opclass, you'd organize the data based on the hash values rather than range comparisons. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
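A sketch of the mapping check described above, assuming the two tables' canonical values arrays have already been found identical and that partition numbers are 0-based array indexes; names are illustrative only:

/*
 * idx1[i] and idx2[i] give, for the i'th canonical list value, the
 * partition containing it in table 1 and table 2 respectively.  map12 and
 * map21 are scratch arrays of sizes npart1 and npart2.  Returns true if
 * the partitions pair up one-to-one (map12 then gives the pairing).
 */
static bool
list_partitions_match(const int *idx1, const int *idx2, int nvalues,
                      int *map12, int npart1, int *map21, int npart2)
{
    int     i;

    for (i = 0; i < npart1; i++)
        map12[i] = -1;
    for (i = 0; i < npart2; i++)
        map21[i] = -1;

    for (i = 0; i < nvalues; i++)
    {
        int     p1 = idx1[i];
        int     p2 = idx2[i];

        if (map12[p1] == -1 && map21[p2] == -1)
        {
            map12[p1] = p2;     /* record the pairing, both directions */
            map21[p2] = p1;
        }
        else if (map12[p1] != p2 || map21[p2] != p1)
            return false;       /* the mapping is not one-to-one */
    }

    return true;
}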
On 2016/08/09 6:02, Robert Haas wrote: > On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> +1, if we could do it. It will need a change in the way Amit's patch stores >>> partitioning scheme in PartitionDesc. >> >> Okay, I will try to implement this in the next version of the patch. >> >> One thing that comes to mind is what if a user wants to apply hash >> operator class equality to list partitioned key by specifying a hash >> operator class for the corresponding column. In that case, we would not >> have the ordering procedure with an hash operator class, hence any >> ordering based optimization becomes impossible to implement. The current >> patch rejects a column for partition key if its type does not have a btree >> operator class for both range and list methods, so this issue doesn't >> exist, however it could be seen as a limitation. > > Yes, I think you should expect careful scrutiny of that issue. It > seems clear to me that range partitioning requires a btree opclass, > that hash partitioning requires a hash opclass, and that list > partitioning requires at least one of those things. It would probably > be reasonable to pick one or the other and insist that list > partitioning always requires exactly that, but I can't see how it's > reasonable to insist that you must have both types of opclass for any > type of partitioning. So because we intend to implement optimizations for list partition metadata that presuppose existence of corresponding btree operator class, we should just always require user to specify one (or error out if user doesn't specify and a default one doesn't exist). That way, we explicitly do not support specify hash equality operator for list partitioning. >> So, we have 3 choices for the internal representation of list partitions: >> >> Choice 1 (the current approach): Load them in the same order as they are >> found in the partition catalog: >> >> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} >> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} >> >> In this case, mismatch on the first list would make the two tables >> incompatibly partitioned, whereas they really aren't incompatible. > > Such a limitation seems clearly unacceptable. We absolutely must be > able to match up compatible partitioning schemes without getting > confused by little details like the order of the partitions. Agreed. Will change my patch to adopt the below method. >> Choice 2: Representation with 2 arrays: >> >> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1] >> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3] >> >> It still doesn't help the case of pairwise joins because it's hard to tell >> which value belongs to which partition (the 2nd array carries the original >> partition numbers). Although it might still work for tuple-routing. > > It's very good for tuple routing. It can also be used to match up > partitions for pairwise joins. Compare the first arrays. If they are > unequal, stop. Else, compare the second arrays, incrementally > building a mapping between them and returning false if the mapping > turns out to be non-bijective. For example, in this case, we look at > index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3. > We look at index 2 and decide 2 -> 1. We look at index 4 and find > that we already have a mapping for 2, but it's compatible because we > need 2 -> 1 and that's what is already there. Similarly for the > remaining entries. 
This is really a pretty easy loop to write and it > should run very quickly. I see, it does make sense to try to implement this way. >> Choice 3: Order all lists' elements for each list individually and then >> order the lists themselves on their first values: >> >> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} >> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} >> >> This representation makes pairing partitions for pairwise joining >> convenient but for tuple-routing we still need to visit each partition in >> the worst case. > > I think this is clearly not good enough for tuple routing. If the > algorithm I proposed above turns out to be too slow for matching > partitions, then we could keep both this representation and the > previous one. We are not limited to just one. But I don't think > that's likely to be the case. I agree. Let's see how the option 2 turns out. > Also, note that all of this presupposes we're doing range > partitioning, or perhaps list partitioning with a btree opclass. For > partitioning based on a hash opclass, you'd organize the data based on > the hash values rather than range comparisons. Yes, the current patch does not implement hash partitioning, although I have to think about how to support the hash case when designing the internal data structures. By the way, I am planning to start a new thread with the latest set of patches which I will post in a day or two. I have tried to implement all the bug fixes and improvements that have been suggested on this thread so far. Thanks to all those who reviewed and gave their comments. Please check this page to get a link to the new thread: https://commitfest.postgresql.org/10/611/ Thanks, Amit
<div dir="ltr">What strikes me odd about these patches is RelOptInfo has remained unmodified. For a base partitioned table,I would expect it to be marked as partitioned may be indicating the partitioning scheme. Instead of that, I see thatthe code directly deals with PartitionDesc, PartitionKey which are part of Relation. It uses other structures like PartitionKeyExecInfo.I don't have any specific observation as to why we need such information in RelOptInfo, but lack ofit makes me uncomfortable. It could be because inheritance code does not require any mark in RelOptInfo and your patchre-uses inheritance code. I am not sure.<br /></div><div class="gmail_extra"><br /><div class="gmail_quote">On Tue,Aug 9, 2016 at 9:17 AM, Amit Langote <span dir="ltr"><<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp" target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 00 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 2016/08/09 6:02, Robert Haas wrote:<br /> > OnMon, Aug 8, 2016 at 1:40 AM, Amit Langote<br /> > <<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp">Langote_Amit_f8@lab.ntt.co.jp</a><wbr/>> wrote:<br /> >>> +1, ifwe could do it. It will need a change in the way Amit's patch stores<br /> >>> partitioning scheme in PartitionDesc.<br/> >><br /> >> Okay, I will try to implement this in the next version of the patch.<br /> >><br/> >> One thing that comes to mind is what if a user wants to apply hash<br /> >> operator class equalityto list partitioned key by specifying a hash<br /> >> operator class for the corresponding column. In thatcase, we would not<br /> >> have the ordering procedure with an hash operator class, hence any<br /> >> orderingbased optimization becomes impossible to implement. The current<br /> >> patch rejects a column for partitionkey if its type does not have a btree<br /> >> operator class for both range and list methods, so this issuedoesn't<br /> >> exist, however it could be seen as a limitation.<br /> ><br /> > Yes, I think you shouldexpect careful scrutiny of that issue. It<br /> > seems clear to me that range partitioning requires a btree opclass,<br/> > that hash partitioning requires a hash opclass, and that list<br /> > partitioning requires at leastone of those things. It would probably<br /> > be reasonable to pick one or the other and insist that list<br />> partitioning always requires exactly that, but I can't see how it's<br /> > reasonable to insist that you musthave both types of opclass for any<br /> > type of partitioning.<br /><br /></span>So because we intend to implementoptimizations for list partition<br /> metadata that presuppose existence of corresponding btree operator class,<br/> we should just always require user to specify one (or error out if user<br /> doesn't specify and a default onedoesn't exist). 
That way, we explicitly<br /> do not support specify hash equality operator for list partitioning.<br/><span class=""><br /> >> So, we have 3 choices for the internal representation of list partitions:<br/> >><br /> >> Choice 1 (the current approach): Load them in the same order as they are<br />>> found in the partition catalog:<br /> >><br /> >> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a','e'}<br /> >> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}<br /> >><br /> >> In this case,mismatch on the first list would make the two tables<br /> >> incompatibly partitioned, whereas they really aren'tincompatible.<br /> ><br /> > Such a limitation seems clearly unacceptable. We absolutely must be<br /> >able to match up compatible partitioning schemes without getting<br /> > confused by little details like the orderof the partitions.<br /><br /></span>Agreed. Will change my patch to adopt the below method.<br /><span class=""><br/> >> Choice 2: Representation with 2 arrays:<br /> >><br /> >> Table 1: ['a', 'b', 'c', 'd','e', 'f'], [3, 1, 2, 2, 3, 1]<br /> >> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]<br /> >><br/> >> It still doesn't help the case of pairwise joins because it's hard to tell<br /> >> which valuebelongs to which partition (the 2nd array carries the original<br /> >> partition numbers). Although it mightstill work for tuple-routing.<br /> ><br /> > It's very good for tuple routing. It can also be used to matchup<br /> > partitions for pairwise joins. Compare the first arrays. If they are<br /> > unequal, stop. Else,compare the second arrays, incrementally<br /> > building a mapping between them and returning false if the mapping<br/> > turns out to be non-bijective. For example, in this case, we look at<br /> > index 0 and decide that3 -> 2. We look at index 1 and decide 1 -> 3.<br /> > We look at index 2 and decide 2 -> 1. We look atindex 4 and find<br /> > that we already have a mapping for 2, but it's compatible because we<br /> > need 2 ->1 and that's what is already there. Similarly for the<br /> > remaining entries. This is really a pretty easy loopto write and it<br /> > should run very quickly.<br /><br /></span>I see, it does make sense to try to implement thisway.<br /><span class=""><br /> >> Choice 3: Order all lists' elements for each list individually and then<br />>> order the lists themselves on their first values:<br /> >><br /> >> Table 1: p3 {'a', 'e'}, p2 {'b','f'}, p1 {'c', 'd'}<br /> >> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}<br /> >><br /> >>This representation makes pairing partitions for pairwise joining<br /> >> convenient but for tuple-routingwe still need to visit each partition in<br /> >> the worst case.<br /> ><br /> > I think this isclearly not good enough for tuple routing. If the<br /> > algorithm I proposed above turns out to be too slow for matching<br/> > partitions, then we could keep both this representation and the<br /> > previous one. We are not limitedto just one. But I don't think<br /> > that's likely to be the case.<br /><br /></span>I agree. Let's see howthe option 2 turns out.<br /><span class=""><br /> > Also, note that all of this presupposes we're doing range<br />> partitioning, or perhaps list partitioning with a btree opclass. 
For<br /> > partitioning based on a hash opclass,you'd organize the data based on<br /> > the hash values rather than range comparisons.<br /><br /></span>Yes,the current patch does not implement hash partitioning, although I<br /> have to think about how to supportthe hash case when designing the<br /> internal data structures.<br /><br /><br /> By the way, I am planning to starta new thread with the latest set of<br /> patches which I will post in a day or two. I have tried to implement all<br/> the bug fixes and improvements that have been suggested on this thread so<br /> far. Thanks to all those who reviewedand gave their comments. Please<br /> check this page to get a link to the new thread:<br /><a href="https://commitfest.postgresql.org/10/611/"rel="noreferrer" target="_blank">https://commitfest.postgresql.<wbr />org/10/611/</a><br/><br /> Thanks,<br /> Amit<br /><br /><br /></blockquote></div><br /><br clear="all" /><br />-- <br/><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Best Wishes,<br />Ashutosh Bapat<br />EnterpriseDBCorporation<br />The Postgres Database Company<br /></div></div></div>
FOR VALUE clause of a partition does not allow a constant expression like (10000/5 -1). It gives syntax error
regression=# create table pt1_p1 partition of pt1 for values start (0) end ((10000/5) - 1);
ERROR: syntax error at or near "("
LINE 1: ...pt1_p1 partition of pt1 for values start (0) end ((10000/5) ...
On Tue, Aug 9, 2016 at 12:48 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
What strikes me odd about these patches is RelOptInfo has remained unmodified. For a base partitioned table, I would expect it to be marked as partitioned may be indicating the partitioning scheme. Instead of that, I see that the code directly deals with PartitionDesc, PartitionKey which are part of Relation. It uses other structures like PartitionKeyExecInfo. I don't have any specific observation as to why we need such information in RelOptInfo, but lack of it makes me uncomfortable. It could be because inheritance code does not require any mark in RelOptInfo and your patch re-uses inheritance code. I am not sure.--On Tue, Aug 9, 2016 at 9:17 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: On 2016/08/09 6:02, Robert Haas wrote:
> On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> +1, if we could do it. It will need a change in the way Amit's patch stores
>>> partitioning scheme in PartitionDesc.
>>
>> Okay, I will try to implement this in the next version of the patch.
>>
>> One thing that comes to mind is what if a user wants to apply hash
>> operator class equality to list partitioned key by specifying a hash
>> operator class for the corresponding column. In that case, we would not
>> have the ordering procedure with an hash operator class, hence any
>> ordering based optimization becomes impossible to implement. The current
>> patch rejects a column for partition key if its type does not have a btree
>> operator class for both range and list methods, so this issue doesn't
>> exist, however it could be seen as a limitation.
>
> Yes, I think you should expect careful scrutiny of that issue. It
> seems clear to me that range partitioning requires a btree opclass,
> that hash partitioning requires a hash opclass, and that list
> partitioning requires at least one of those things. It would probably
> be reasonable to pick one or the other and insist that list
> partitioning always requires exactly that, but I can't see how it's
> reasonable to insist that you must have both types of opclass for any
> type of partitioning.
So because we intend to implement optimizations for list partition
metadata that presuppose existence of corresponding btree operator class,
we should just always require user to specify one (or error out if user
doesn't specify and a default one doesn't exist). That way, we explicitly
do not support specify hash equality operator for list partitioning.
>> So, we have 3 choices for the internal representation of list partitions:
>>
>> Choice 1 (the current approach): Load them in the same order as they are
>> found in the partition catalog:
>>
>> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
>> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>>
>> In this case, mismatch on the first list would make the two tables
>> incompatibly partitioned, whereas they really aren't incompatible.
>
> Such a limitation seems clearly unacceptable. We absolutely must be
> able to match up compatible partitioning schemes without getting
> confused by little details like the order of the partitions.
Agreed. Will change my patch to adopt the below method.
>> Choice 2: Representation with 2 arrays:
>>
>> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
>> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
>>
>> It still doesn't help the case of pairwise joins because it's hard to tell
>> which value belongs to which partition (the 2nd array carries the original
>> partition numbers). Although it might still work for tuple-routing.
>
> It's very good for tuple routing. It can also be used to match up
> partitions for pairwise joins. Compare the first arrays. If they are
> unequal, stop. Else, compare the second arrays, incrementally
> building a mapping between them and returning false if the mapping
> turns out to be non-bijective. For example, in this case, we look at
> index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3.
> We look at index 2 and decide 2 -> 1. We look at index 4 and find
> that we already have a mapping for 2, but it's compatible because we
> need 2 -> 1 and that's what is already there. Similarly for the
> remaining entries. This is really a pretty easy loop to write and it
> should run very quickly.
I see, it does make sense to try to implement this way.
>> Choice 3: Order all lists' elements for each list individually and then
>> order the lists themselves on their first values:
>>
>> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
>> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
>>
>> This representation makes pairing partitions for pairwise joining
>> convenient but for tuple-routing we still need to visit each partition in
>> the worst case.
>
> I think this is clearly not good enough for tuple routing. If the
> algorithm I proposed above turns out to be too slow for matching
> partitions, then we could keep both this representation and the
> previous one. We are not limited to just one. But I don't think
> that's likely to be the case.
I agree. Let's see how the option 2 turns out.
> Also, note that all of this presupposes we're doing range
> partitioning, or perhaps list partitioning with a btree opclass. For
> partitioning based on a hash opclass, you'd organize the data based on
> the hash values rather than range comparisons.
Yes, the current patch does not implement hash partitioning, although I
have to think about how to support the hash case when designing the
internal data structures.
By the way, I am planning to start a new thread with the latest set of
patches which I will post in a day or two. I have tried to implement all
the bug fixes and improvements that have been suggested on this thread so
far. Thanks to all those who reviewed and gave their comments. Please
check this page to get a link to the new thread:
https://commitfest.postgresql.org/10/611/
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/08/10 19:18, Ashutosh Bapat wrote:
> FOR VALUE clause of a partition does not allow a constant expression like
> (10000/5 -1). It gives syntax error
> regression=# create table pt1_p1 partition of pt1 for values start (0) end
> ((10000/5) - 1);
> ERROR: syntax error at or near "("
> LINE 1: ...pt1_p1 partition of pt1 for values start (0) end ((10000/5) ...
>
> Shouldn't we allow constant expressions here?
>
> If this has been already discussed, please forgive me and point out the
> relevant mail chain.

Yes, the only allowed values there are string literals, numeric values and
null. Decision to change to it followed from this comment:

"Also, I don't think allowing an a_expr as a bound is remotely sensible -
I think you should allow only Sconst, NumericOnly, NULL, and UNBOUNDED."

https://www.postgresql.org/message-id/CA%2BTgmoZ1ZMCyGR3b9yvGDq79xYLMnJQwhwn5GVs_GsvPiySDxw%40mail.gmail.com

Thanks,
Amit