Thread: Declarative partitioning
On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi,
I would like to propose $SUBJECT for this development cycle. Attached is a
WIP patch that implements most if not all of what's described below. Some
yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
Syntax
======
1. Creating a partitioned table
CREATE TABLE table_name
PARTITION BY {RANGE|LIST}
ON (column_list);
Where column_list consists of simple column names or expressions:
PARTITION BY LIST ON (name)
PARTITION BY RANGE ON (year, month)
PARTITION BY LIST ON ((lower(left(name, 2))))
PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))
Note: LIST partition key supports only one column.
For each column, you can optionally specify an operator class name:
PARTITION BY LIST/RANGE ON (colname [USING] opclass_name)
If not specified, the default btree operator class based on the type of each
key column is used. If none of the available btree operator classes are
compatible with the partitioning strategy (list/range), an error is thrown.
Built-in btree operator classes cover a good number of types for list and
range partitioning in practical scenarios.
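For illustration, a non-default btree operator class could be named
explicitly, e.g. (a hypothetical sketch using the proposed syntax):
CREATE TABLE persons (name text)
PARTITION BY RANGE ON (name USING text_pattern_ops);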
A table created using this form is of proposed new relkind
RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is
created to store partition key info.
Note: A table cannot be partitioned after-the-fact using ALTER TABLE.
Normal dependencies are created between the partitioned table and the
operator classes and any objects (such as functions) referenced in
partition expressions.
2. Creating a partition of a partitioned table
CREATE TABLE table_name
PARTITION OF partitioned_table_name
FOR VALUES values_spec;
Where values_spec is:
listvalues: [IN] (val1, ...)
rangevalues: START (col1min, ... ) END (col1max, ... )
| START (col1min, ... )
| END (col1max, ... )
A table created using this form has proposed pg_class.relispartition set
to true. An entry in pg_partition (see below) is created to store the
partition bound info.
The values_spec should match the partitioning strategy of the partitioned
table. In case of a range partition, the values in START and/or END should
match columns in the partition key.
Defining a list partition is fairly straightforward - just spell out the
list of comma-separated values. An error is thrown if the list of values
overlaps with that of an existing partition.
CREATE TABLE persons_by_state (name text, state text)
PARTITION BY LIST ON (state);
CREATE TABLE persons_IL
PARTITION OF persons_by_state
FOR VALUES IN ('IL');
CREATE TABLE persons_fail
PARTITION OF persons_by_state
FOR VALUES IN ('IL');
ERROR: cannot create partition that overlaps with an existing one
For a range partition, there is more than one way:
Specify both START and END bounds: the resulting range must not overlap with
the range(s) covered by existing partitions; an error is thrown otherwise.
Although rare in practice, gaps between ranges are OK.
CREATE TABLE measurement(logdate date NOT NULL)
PARTITION BY RANGE ON (logdate);
CREATE TABLE measurement_y2006m02
PARTITION OF measurement
FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success
CREATE TABLE measurement_fail
PARTITION OF measurement
FOR VALUES START ('2006-02-15') END ('2006-03-01');
ERROR: cannot create partition that overlaps with an existing one
Specify only the START bound: add the partition on the left of some range
covered by existing partitions provided no overlap occurs (also
considering gaps between ranges, if any). If no such range exists, the new
partition will cover the range [START, +INFINITY) and become the rightmost
partition. Error is thrown if the specified START causes overlap.
CREATE TABLE measurement_y2006m01
PARTITION OF measurement
FOR VALUES START ('2006-01-01'); --success
CREATE TABLE measurement_fail
PARTITION OF measurement
FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one
Specify only the END bound: add the partition on the right of some range
covered by existing partitions provided no overlap occurs (also
considering gaps between ranges, if any). If no such range exists, the new
partition would cover the range (-INFINITY, END) and become the leftmost
partition. Error is thrown if the specified END causes overlap.
CREATE TABLE measurement_y2006m03
PARTITION OF measurement
FOR VALUES END ('2006-04-01'); --success
CREATE TABLE measurement_fail
PARTITION OF measurement
FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one
For each partition, START and END bound values are stored in the
catalog. Note that the lower bound is inclusive, whereas the upper bound
is exclusive.
Note: At most one range partition can have a null min bound, in which case
it covers the range (-INFINITY, END). Also, at most one range partition can
have a null max bound, in which case it covers the range [START, +INFINITY).
A normal dependency is created between the parent and the partition.
3. Multi-level partitioning
CREATE TABLE table_name
PARTITION OF partitioned_table_name
FOR VALUES values_spec
PARTITION BY {RANGE|LIST} ON (columns_list)
This variant implements a form of so-called composite or sub-partitioning
with an arbitrarily deep partitioning structure. A table created using this
form has both the relkind RELKIND_PARTITIONED_REL and
pg_class.relispartition set to true.
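As a sketch using the syntax proposed above (the table and partition names
here are purely illustrative), a range-partitioned table whose partitions
are themselves list-partitioned could be declared as:
CREATE TABLE orders (region text, ordered_at date NOT NULL)
PARTITION BY RANGE ON (ordered_at);
CREATE TABLE orders_2006
PARTITION OF orders
FOR VALUES START ('2006-01-01') END ('2007-01-01')
PARTITION BY LIST ON (region);
CREATE TABLE orders_2006_east
PARTITION OF orders_2006
FOR VALUES IN ('east');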
4. (yet unimplemented) Attach partition (from existing table)
ALTER TABLE partitioned_table
ATTACH PARTITION partition_name
FOR VALUES values_spec
USING [TABLE] table_name;
ALTER TABLE table_name
SET VALID PARTITION OF <parent>;
The first of the above pair of commands would attach table_name as a (yet)
'invalid' partition of partitioned_table (after confirming that it matches
the schema and does not overlap with other partitions per FOR VALUES
spec). It would also record the FOR VALUES part in the partition catalog
and set pg_class.relispartition to true for table_name.
After the first command is done, the second command would take an exclusive
lock on table_name, scan the table to check whether it contains any values
outside the boundaries defined by the FOR VALUES clause specified previously,
throw an error if so, and mark it as a valid partition of the parent if not.
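For illustration, assuming a hypothetical pre-populated standalone table
measurement_y2006m04, the proposed (yet unimplemented) two-step sequence
might look like:
ALTER TABLE measurement
ATTACH PARTITION measurement_y2006m04
FOR VALUES START ('2006-04-01') END ('2006-05-01')
USING TABLE measurement_y2006m04;
-- then, as a separate step:
ALTER TABLE measurement_y2006m04
SET VALID PARTITION OF measurement;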
Does that make sense?
5. Detach partition
ALTER TABLE partitioned_table
DETACH PARTITION partition_name [USING table_name]
This removes partition_name as a partition of partitioned_table. The table
continues to exist with the same name or 'table_name', if specified.
pg_class.relispartition is set to false for the table, so it behaves like
a normal table.
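For example (again a sketch of the proposed syntax, with a made-up new
name), a partition could be detached and kept around as an ordinary table:
ALTER TABLE measurement
DETACH PARTITION measurement_y2006m01 USING measurement_y2006m01_old;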
System catalogs
===============
1. pg_partitioned_rel
CATALOG(pg_partitioned_rel,oid) BKI_WITHOUT_OIDS
{
Oid partrelid; /* partitioned table pg_class.oid */
char partstrategy; /* partitioning strategy 'l'/'r' */
int16 partnatts; /* number of partition columns */
int2vector partkey; /* column numbers of partition columns;
* 0 where specified column is an
* expression */
oidvector partclass; /* operator class to compare keys */
pg_node_tree partexprs; /* expression trees for partition key
* members that are not simple column
* references; one for each zero entry
* in partkey[] */
};
2. pg_partition (omits partisvalid alluded to above)
CATALOG(pg_partition,oid) BKI_WITHOUT_OIDS
{
Oid partitionid; /* partition oid */
Oid partparent; /* parent oid */
anyarray partlistvalues; /* list of allowed values of the only
* partition column */
anyarray partrangebounds; /* list of bounds of ranges of
* allowed values per partition key
* column */
};
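Assuming the proposed catalogs above, the partitioning metadata could be
inspected with ordinary queries; a hypothetical example using the column
names as defined above:
-- partition key of a partitioned table
SELECT partstrategy, partnatts, partkey, partclass
FROM pg_partitioned_rel
WHERE partrelid = 'measurement'::regclass;
-- bounds of its partitions
SELECT partitionid::regclass AS partition, partrangebounds
FROM pg_partition
WHERE partparent = 'measurement'::regclass;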
Further notes
=============
There are a number of restrictions on performing after-the-fact changes
using ALTER TABLE to partitions (ie, relispartition=true):
* Cannot add/drop column
* Cannot set/drop OIDs
* Cannot set/drop NOT NULL
* Cannot set/drop default
* Cannot alter column type
* Cannot add/drop/alter constraints (table level)
* Cannot change persistence
* Cannot change inheritance
* Cannot link to a composite type
Such changes should be made to the topmost parent in the partitioning
hierarchy (hereafter referred to as just parent). These are recursively
applied to all the tables in the hierarchy. Although the last two items
cannot be performed on the parent either.
Dropping a partition using DROP TABLE is not allowed. It needs to be detached
using ALTER TABLE on the parent before it can be dropped as a normal table.
Triggers on partitions are not allowed. They should be defined on the
parent. That said, I could not figure out a way to implement row-level
AFTER triggers on partitioned tables (more about that in a moment); so
they are currently not allowed:
CREATE TRIGGER audit_trig
AFTER INSERT ON persons
FOR EACH ROW EXECUTE PROCEDURE audit_func();
ERROR: Row-level AFTER triggers are not supported on partitioned tables
Column/table constraints on partitions are not allowed. They should be
defined on the parent. Foreign key constraints are not allowed due to the
above limitation (no row-level AFTER triggers).
A partitioning parent (RELKIND_PARTITIONED_REL) does not have storage.
Creating indexes on the parent is not allowed. They should be defined on (leaf)
partitions. Because of this limitation, primary keys are not allowed on a
partitioned table. Perhaps, we should be able to just create a dummy
entry somewhere to represent an index on parent (which every partition
then copies.) Then by restricting primary key to contain all partition key
columns, we can implement unique constraint over the whole partitioned
table. That will in turn allow us to use partitioned tables as PK rels in
a foreign key constraint provided row-level AFTER trigger issue is resolved.
VACUUM/ANALYZE on individual partitions should work like normal tables.
I've not implemented something like inheritance tree sampling for
partitioning tree in this patch. Autovacuum has been taught to ignore
parent tables and vacuum/analyze partitions normally.
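For example, a single partition can be processed directly, just like a
normal table:
VACUUM ANALYZE measurement_y2006m02;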
Dropping a partitioned table should perhaps unconditionally drop all its
partitions, but currently the patch uses dependencies, so CASCADE must be
specified to achieve the same.
What should TRUNCATE on partitioned table do?
Ownership, privileges/permissions, and RLS should be managed through the
parent table, although this is not comprehensively addressed in the patch.
There is no need to define tuple routing triggers. CopyFrom() and
ExecInsert() determine target partition just before performing
heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and
constraints (on the parent) are executed for a tuple before it is routed to
a partition. If no partition can be found, it's an error.
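For instance, with the measurement partitions defined earlier, the intended
behavior would be roughly as follows (a sketch, not output from the patch):
INSERT INTO measurement VALUES ('2006-02-14'); -- routed to measurement_y2006m02
INSERT INTO measurement VALUES ('2007-02-14'); -- fails: no partition covers this date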
Because row-level AFTER triggers need to save ItemPointers in trigger
event data and defining triggers on partitions (which is where tuples
really go) is not allowed, I could not find a straightforward way to
implement them. So, perhaps we should allow (only) row-level AFTER
triggers on partitions or think of modifying trigger.c to know about this
twist explicitly.
Internal representations
========================
For RELKIND_PARTITIONED_REL relations, RelationData gets a few new
fields to store the partitioning metadata. That includes the partition key
tuple (from pg_partitioned_rel) along with some derived info (opfamily,
opcintype, compare proc FmgrInfos, partition expression trees).
Additionally, it also includes a PartitionInfo object which includes
partition OIDs array, partition bound arrays (if range partitioned,
rangemax is sorted in ascending order and OIDs are likewise ordered). It
is built from information in pg_partition catalog.
While RelationBuildDesc() initializes the basic key info, fields like the
expression trees and PartitionInfo are built on demand and cached. For
example, InitResultRelInfo() builds them to populate the newly added
ri_PartitionKeyInfo and ri_Partitions fields, respectively.
The PartitionInfo object is rebuilt on every cache invalidation of the rel,
which includes when a partition is added/attached/detached.
Planner and executor considerations
=====================================
The patch does not yet implement any planner changes for partitioned
tables, although I'm working on that and will post updates as soon as
possible. That means it is not possible to run SELECT/UPDATE/DELETE
queries on partitioned tables without getting:
postgres=# SELECT * FROM persons;
ERROR: could not open file "base/13244/106975": No such file or directory
Given that there would be more direct ways of making partition pruning
decisions with the proposed scheme, it would be nice to utilize them.
Specifically, I would like to avoid having to rely on constraint exclusion
for partition pruning whereby subquery_planner() builds append_rel_list
and the later steps exclude useless partitions.
By extending RelOptInfo to include partitioning info for partitioned rels,
it might be possible to perform partition pruning directly without
previously having to expand them. Although, as things stand now, it's not
clear how that might work - when would partition RTEs be added to the
rtable? The rtable is assumed not to change after
setup_simple_rel_arrays() has done its job which is much earlier than when
it would be desirable for the partitioned table expansion (along with
partition pruning) to happen. Moreover, if that means we might not be able
to build RelOptInfo's for partitions, how do we choose the best paths for them
(index paths or not, etc.)?
I'm also hoping we don't require something like inheritance_planner() for
when partitioned tables are target rels. I assume considerations for why
the special processing is necessary for inheritance trees in that scenario
don't apply to partitioning trees. So, if grouping_planner() returns an
Append plan (among other options) for the partitioning tree, tacking a
ModifyTable node on top should do the trick?
Suggestions greatly welcome in this area.
Other items
===========
Will include the following once we start reaching consensus on main parts
of the proposed design/implementation:
* New regression tests
* Documentation updates
* pg_dump, psql, etc.
For reference, some immediately previous discussions:
* On partitioning:
http://www.postgresql.org/message-id/20140829155607.GF7705@eldon.alvh.no-ip.org
* Partitioning WIP patch:
http://www.postgresql.org/message-id/54EC32B6.9070605@lab.ntt.co.jp
Comments welcome!
Thanks,
Amit
Wow, didn't expect to see that email this morning.
A very quick test:
CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item text) partition by range on ((extract(year from purchase_time)),(extract(month from purchase_time)));
ERROR: referenced relation "purchases" is not a table or foreign table
Thom
Hi Thom,
On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown <thom@linux.com> wrote:
> Wow, didn't expect to see that email this morning.
> A very quick test:
> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item
> text) partition by range on ((extract(year from
> purchase_time)),(extract(month from purchase_time)));
> ERROR: referenced relation "purchases" is not a table or foreign table
Thanks for the quick test. Damn, I somehow missed adding the new relkind
to a check in process_owned_by(). Will fix this and look for any such
oversights.
Thanks,
Amit
On Tue, Aug 18, 2015 at 07:30:20PM +0900, Amit Langote wrote:
> Hi,
> I would like to propose $SUBJECT for this development cycle. Attached is a
> WIP patch that implements most if not all of what's described below. Some
> yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
Thanks for pushing this forward! We've needed this done for at least a decade.
> 4. (yet unimplemented) Attach partition (from existing table)
> ALTER TABLE partitioned_table
> ATTACH PARTITION partition_name
> FOR VALUES values_spec
> USING [TABLE] table_name;
> ALTER TABLE table_name
> SET VALID PARTITION OF <parent>;
> The first of the above pair of commands would attach table_name as a (yet)
> 'invalid' partition of partitioned_table (after confirming that it matches
> the schema and does not overlap with other partitions per FOR VALUES
> spec). It would also record the FOR VALUES part in the partition catalog
> and set pg_class.relispartition to true for table_name.
> After the first command is done, the second command would take exclusive
> lock on table_name, scan the table to check if it contains any values
> outside the boundaries defined by FOR VALUES clause defined previously,
> throw error if so, mark as valid partition of parent if not.
One small change to make this part more efficient:
1. Take the access exclusive lock on table_name.
2. Check for a matching constraint on it.
3. If it's there, mark it as a valid partition.
4. If not, check for values outside the boundaries as above.
Should that be a *valid* constraint? Perhaps that should be parameterized,
as I'm not yet seeing a compelling argument either direction. I'm
picturing something like:
ALTER TABLE table_name SET VALID PARTITION OF <parent> [TRUST]
where TRUST would mean that an existing constraint need not be VALID.
> Does that make sense?
Yep.
> 5. Detach partition
> ALTER TABLE partitioned_table
> DETACH PARTITION partition_name [USING table_name]
> This removes partition_name as partition of partitioned_table. The table
> continues to exist with the same name or 'table_name', if specified.
> pg_class.relispartition is set to false for the table, so it behaves like
> a normal table.
Could this take anything short of an access exclusive lock on the parent?
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Amit,
> I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
First of all, wow! Really happy to see this.
> Syntax
> ======
> 1. Creating a partitioned table
> CREATE TABLE table_name
> PARTITION BY {RANGE|LIST}
> ON (column_list);
> Where column_list consists of simple column names or expressions:
> PARTITION BY LIST ON (name)
> PARTITION BY RANGE ON (year, month)
> PARTITION BY LIST ON ((lower(left(name, 2))))
> PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))
So far so good. Have you given any thought as to how a user will determine which partition corresponds to which values (for purposes of dropping/maintaining a partition)?
Also, won't doing things like extract() for range partitions make it much harder for you to develop the planner parts of this solution?
What about defining an interval instead, such as:
PARTITION BY RANGE USING ( interval ) ON ( column );
i.e.
PARTITION BY RANGE USING ( INTERVAL '1 month' ) ON ( submitted_date );
PARTITION BY RANGE USING ( 100000 ) ON ( user_id );
This would make it easy for you to construct range type values defining the range of each partition, which would then make the planner work much easier than calling a function would, no? Or am I misunderstanding how you're using ranges here? It kind of seems like you're still leaving specific range definitions up to the user, which is (from my perspective) unsatisfactory (see below).
I'm assuming that all range partitions will be [ closed, open ) ranges.
> 2. Creating a partition of a partitioned table
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec;
> Where values_spec is:
> listvalues: [IN] (val1, ...)
> rangevalues: START (col1min, ... ) END (col1max, ... )
> | START (col1min, ... )
> | END (col1max, ... )
So, one thing I missed in here is anything about automated partitioning of tables; that is, creating new partitions based on incoming data or a simple statement which doesn't require knowledge of the partitioning scheme. It's possible (and entirely acceptable) that you're considering automated partition creation outside of the scope of this patch.
However, for range partitions, it would be *really* useful to have this syntax:
CREATE NEXT PARTITION ON parent_table;
Which would just create the "next" partition based on whatever the range partitioning scheme is, instead of requiring the user to calculate start and end values which might or might not match the parent partitioning scheme, and might leave gaps. Also this would be useful for range partitions:
CREATE PARTITION ON parent_table USING ( start_value );
... where start_value is the start range of the new partition. Again, easier for users to get correct.
Both of these require the idea of regular intervals for range partitions, that is, on a table partitioned by month on a timestamptz column, each partition will have the range [ month:1, nextmonth:1 ). This is the most common use-case for range partitions (like, 95% of all partitioning cases I've seen), so a new partitioning scheme ought to address it.
While there are certainly users who desire the ability to define arbitrary ranges for each range partition, these are by far the minority and could be accommodated by a different path with more complex syntax.
Further, I'd wager that most users who want to define arbitrary ranges for range partitions aren't going to be satisfied with the other restrictions on declarative partitioning (e.g. same constraints, columns for all partitions) and are going to use inheritance partitioning anyway.
> 5. Detach partition
> ALTER TABLE partitioned_table
> DETACH PARTITION partition_name [USING table_name]
> This removes partition_name as partition of partitioned_table. The table continues to exist with the same name or 'table_name', if specified. pg_class.relispartition is set to false for the table, so it behaves like a normal table.
What about DROPping partitions? Do they need to be detached first?
> Creating index on parent is not allowed. They should be defined on (leaf) partitions. Because of this limitation, primary keys are not allowed on a partitioned table. Perhaps, we should be able to just create a dummy entry somewhere to represent an index on parent (which every partition then copies.)
This would be preferable, yes. Making users remember to manually create indexes on each partition is undesirable.
> What should TRUNCATE on partitioned table do?
On the master table? Truncate all individual partitions. Do not drop the partitions.
On a partition? Truncate just that partition.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
great :-)
> 2. Creating a partition of a partitioned table
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec;
> Where values_spec is:
> listvalues: [IN] (val1, ...)
Would it make sense to allow one complementary partition to the listvalues?
listvalues: [[NOT] IN] (val1, ...)
I've thought a few times about moving data with some most common values to dedicated partitions and keeping the rest in a separate one...
best regards,
Marc Mamin
On Tue, Aug 18, 2015 at 6:30 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Quick thoughts borne of years of slugging it out with partitions on Oracle:
- Finally!!!!!!!!!!!
- Your range partitioning will need to express exclusive/inclusive bounds, or go to the Oracle model where every partition is a cascading "values less than" test context dependent on the partitions defined before it. I would suggest that leveraging existing range types (or allowing the user to specify a range type, like for a specific collation of a text range) would allow for the most flexible and postgres-ish range definition. You seem to do this with the "[USING] opclass_name" bit, but don't follow through on the START(...) and END(...). Something like FOR VALUES <@ '[''2014-01-01'',''2015-01-01'')'::daterange would cover most needs succinctly, though I admit the syntax for complex ranges could be cumbersome, but something like FOR VALUES <@ '[(''a'',1),(''b'',1))'::letter_then_number_range is still readable.
- No partitioning scheme survives first contact with reality. So you will need a facility for splitting and joining existing partitions. For splitting partitions, it's sufficient to require that the new partition share either an upper/lower bound (with the same inclusivity/exclusivity) of an existing partition, thus uniquely identifying the partition to be split, and require that the other bound be within the range of the partition to be split. Similarly, it's fair to require that the partitions to be joined be adjacent in range. In both cases, range operators make these tests simple.
- Your features 4 and 5 are implemented in Oracle with SWAP PARTITION, which is really neat for doing ETLs and index rebuilds offline in a copy table, and then swapping the data segment of that table with the partition specified. That could be considered cheating because none of the partition metadata changed, just the pointers to the segments. We already do this with adding/removing INHERIT. I'm not saying they can't be separate functionality, but keeping an atomic SWAP operation would be grand.
On 2015-08-18 PM 10:43, David Fetter wrote:
>> After the first command is done, the second command would take exclusive
>> lock on table_name, scan the table to check if it contains any values
>> outside the boundaries defined by FOR VALUES clause defined previously,
>> throw error if so, mark as valid partition of parent if not.
>
> One small change to make this part more efficient:
> 1. Take the access exclusive lock on table_name.
> 2. Check for a matching constraint on it.
> 3. If it's there, mark it as a valid partition.
> 4. If not, check for values outside the boundaries as above.
That's an interesting idea. Thanks!
By a matching constraint, I guess you mean a 'valid' constraint from which the declared partition constraint can be proven to follow. For (a simple) example, from a CHECK (a >= 100 AND a < 150) on table_name, the partition constraint implied by FOR VALUES START (100) END (200) can be assumed to hold (see the sketch at the end of this message).
> Should that be a *valid* constraint? Perhaps that should be
> parameterized, as I'm not yet seeing a compelling argument either
> direction. I'm picturing something like:
> ALTER TABLE table_name SET VALID PARTITION OF <parent> [TRUST]
> where TRUST would mean that an existing constraint need not be VALID.
Hmm, I'd think this step must be able to assert the partition constraint beyond any doubt. If the DBA added the constraint and marked it invalid, she should first VALIDATE the constraint to make it valid by performing whatever steps are necessary. IOW, a full heap scan at least once is inevitable (the reason why we might want to make this a two-step process at all). Am I missing something?
>> 5. Detach partition
>> ALTER TABLE partitioned_table
>> DETACH PARTITION partition_name [USING table_name]
>> This removes partition_name as partition of partitioned_table. The table
>> continues to exist with the same name or 'table_name', if specified.
>> pg_class.relispartition is set to false for the table, so it behaves like
>> a normal table.
> Could this take anything short of an access exclusive lock on the
> parent?
Yes, both step 1 of the ATTACH command and the DETACH command take an access exclusive lock on the parent. They are rather quick metadata changes, so should not stall others significantly, I think.
Thanks,
Amit
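To make the constraint-matching case above concrete, here is a hypothetical
sketch in the proposed syntax (table and partition names invented purely for
illustration):
CREATE TABLE t (a int) PARTITION BY RANGE ON (a);
CREATE TABLE t_stage (a int, CHECK (a >= 100 AND a < 150));
-- ... populate t_stage ...
ALTER TABLE t
ATTACH PARTITION t_p1
FOR VALUES START (100) END (200)
USING TABLE t_stage;
-- under David's suggestion, the existing valid CHECK constraint already
-- implies the partition bounds, so SET VALID could skip the full scan:
ALTER TABLE t_stage SET VALID PARTITION OF t;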
On 2015-08-19 AM 02:57, Marc Mamin wrote:
>> 2. Creating a partition of a partitioned table
>> CREATE TABLE table_name
>> PARTITION OF partitioned_table_name
>> FOR VALUES values_spec;
>> Where values_spec is:
>> listvalues: [IN] (val1, ...)
>
> Would it make sense to allow one complementary partition to the listvalues?
> listvalues: [[NOT] IN] (val1, ...)
> I've thought a few times about moving data with some most common values to dedicated partitions
> and keeping the rest in a separate one...
Thanks, that's definitely something to consider. I have been thinking of a sort of default list partition for the "rest" of the values. Would you rather declare that with something like the below than having to enumerate all the values in a NOT IN list? Or is the NOT IN way more intuitive/friendly?
CREATE TABLE _rest
PARTITION OF table_name
FOR VALUES [ IN ] DEFAULT
Of course, at most one such partition would be allowed.
Thanks,
Amit
On 18 August 2015 at 18:31, Josh Berkus <josh@agliodbs.com> wrote:
--
> 2. Creating a partition of a partitioned table
>
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec;
>
> Where values_spec is:
>
> listvalues: [IN] (val1, ...)
>
> rangevalues: START (col1min, ... ) END (col1max, ... )
> | START (col1min, ... )
> | END (col1max, ... )
So, one thing I missed in here is anything about automated partitioning
of tables; that is, creating new partitions based on incoming data or a
simple statement which doesn't require knowledge of the partitioning
scheme. It's possible (and entirely acceptable) that you're
considering automated partition creation outside of the scope of this
patch.
I would like to make automatic partitioning outside the scope of this first patch.
However, for range partitions, it would be *really* useful to
have this syntax:
CREATE NEXT PARTITION ON parent_table;
Which would just create the "next" partition based on whatever the range
partitioning scheme is, instead of requiring the user to calculate start
and end values which might or might not match the parent partitioning
scheme, and might leave gaps. Also this would be useful for range
partitions:
CREATE PARTITION ON parent_table USING ( start_value );
... where start_value is the start range of the new partition. Again,
easier for users to get correct.
Both of these require the idea of regular intervals for range
partitions, that is, on a table partitioned by month on a timestamptz
column, each partition will have the range [ month:1, nextmonth:1 ).
This is the most common use-case for range partitions (like, 95% of all
partitioning cases I've seen), so a new partitioning scheme ought to
address it.
While there are certainly users who desire the ability to define
arbitrary ranges for each range partition, these are by far the minority
and could be accommodated by a different path with more complex syntax.
Further, I'd wager that most users who want to define arbitrary ranges
for range partitions aren't going to be satisfied with the other
restrictions on declarative partitioning (e.g. same constraints, columns
for all partitions) and are going to use inheritance partitioning anyway.
I like the idea of a regular partitioning step because it is how you design such tables - "lets use monthly partitions".
This gives sanely terse syntax, rather than specifying pages and pages of exact values in DDL....
PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START WITH value;
borrowing the same concepts from sequence syntax.
> Creating index on parent is not allowed. They should be defined on (leaf)
> partitions. Because of this limitation, primary keys are not allowed on a
> partitioned table. Perhaps, we should be able to just create a dummy
> entry somewhere to represent an index on parent (which every partition
> then copies.)
This would be preferable, yes. Making users remember to manually create
indexes on each partition is undesirable.
I think it is useful to allow additional indexes on partitions, if desired, but we should always automatically build the indexes that are defined on the master when we create a new partition.
Presumably unique indexes will be allowed on partitions. So if the partition key is unique, we can say the whole partitioned table is unique and call that a Primary Key.
I would want individual partitions to be placed on separate tablespaces, but not by default.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
There is no need to define tuple routing triggers. CopyFrom() and
ExecInsert() determine target partition just before performing
heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and
constraints (on parent) are executed for tuple before being routed to a
partition. If no partition can be found, it's an error.
Because row-level AFTER triggers need to save ItemPointers in trigger
event data and defining triggers on partitions (which is where tuples
really go) is not allowed, I could not find a straightforward way to
implement them. So, perhaps we should allow (only) row-level AFTER
triggers on partitions or think of modifying trigger.c to know about this
twist explicitly.
I think tables will eventually need FK support; it's not sustainable as a long-term restriction, though perhaps it's something we can do in a later patch.
You haven't specified what would happen if an UPDATE would change a row's partition. I'm happy to add this to the list of restrictions by saying that the partition key cannot be updated.
We'll need regression tests that cover each restriction and docs that match. This is not something we should leave until last. People read the docs to understand the feature, helping them to reach consensus. So it is for you to provide the docs before, not wait until later. I will begin a code review once you tell me docs and tests are present. We all want the feature, so it's all about the details now.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 19, 2015 at 04:30:39PM +0900, Amit Langote wrote: > On 2015-08-18 PM 10:43, David Fetter wrote: > >> > >> After the first command is done, the second command would take exclusive > >> lock on table_name, scan the table to check if it contains any values > >> outside the boundaries defined by FOR VALUES clause defined previously, > >> throw error if so, mark as valid partition of parent if not. > > > > One small change to make this part more efficient: > > > > 1. Take the access exclusive lock on table_name. > > 2. Check for a matching constraint on it. > > 3. If it's there, mark it as a valid partition. > > 4. If not, check for values outside the boundaries as above. > > > > That's an interesting idea. Thanks! I hope I'm advancing this feature rather than bogging it down... > By a matching constraint, I guess you mean a 'valid' constraint from > which the declared partition constraint can be proven to follow. For > (a simple) example, from a CHECK (a >= 100 AND a < 150) on > table_name, the partition constraint implied by FOR VALUES START > (100) END (200) can be assumed to hold. Well, I was assuming an exact match, but a stricter match seems like a nice-to-have...possibly later. > > Should the be a *valid* constraint? Perhaps that should be > > parameterized, as I'm not yet seeing a compelling argument either > > direction. I'm picturing something like: > > > > ALTER TABLE table_name SET VALID PARTITION OF <parent> [TRUST] > > > > where TRUST would mean that an existing constraint need not be VALID. > > Hmm, I'd think this step must be able to assert the partition > constraint beyond any doubt. If the DBA added the constraint and > marked it invalid, she should first VALIDATE the constraint to make > it valid by performing whatever steps necessary before. IOW, a full > heap scan at least once is inevitable (the reason why we might want > to make this a two step process at all). Am I missing something? There are use cases where we need to warn people that their assertions need to be true, and if those assertions are not true, this will explode, leaving them to pick the resulting shrapnel out of their faces. There are other parts of the system where this is true, as when people write UDFs in C. As I understand it, NOT VALID means, "I assert that the tuples already here fit the constraint. Any changes will be checked against the constraint." I've seen cases where a gigantic amount of data is coming out of some distributed system which holds the constraint as an invariant. This let a DBA decide to add a NOT VALID constraint in order not to take the hit of a second full scan of the data, which might have made the import, and possibly the entire project, untenable. See above. > >> 5. Detach partition > >> > >> ALTER TABLE partitioned_table > >> DETACH PARTITION partition_name [USING table_name] > >> > >> This removes partition_name as partition of partitioned_table. > >> The table continues to exist with the same name or 'table_name', > >> if specified. pg_class.relispartition is set to false for the > >> table, so it behaves like a normal table. > > > > Could this take anything short of an access exclusive lock on the > > parent? > > Yes, both the step 1 of ATTACH command and DETACH command take > access exclusive lock on the parent. They are rather quick metadata > changes, so should not stall others significantly, I think. So no. Weakening required locks has been something of an ongoing project, project-wide, and need not be part of the first cut of this long-needed feature. 
Thanks so much for working on this! Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
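For context, the existing NOT VALID machinery that this TRUST idea leans on looks like this (table and constraint names are illustrative):

-- assert the invariant without scanning existing rows; only new or
-- updated rows are checked from here on:
ALTER TABLE staged_data ADD CONSTRAINT staged_data_key_check
  CHECK (a >= 100 AND a < 200) NOT VALID;

-- if a provably valid constraint is required later, this performs
-- the full scan exactly once:
ALTER TABLE staged_data VALIDATE CONSTRAINT staged_data_key_check;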
>On 2015-08-19 AM 02:57, Marc Mamin wrote: >>> 2. Creating a partition of a partitioned table >>> >>> CREATE TABLE table_name >>> PARTITION OF partitioned_table_name >>> FOR VALUES values_spec; >>> >>> Where values_spec is: >>> >>> listvalues: [IN] (val1, ...) >>> >> >> Would it make sense to allow one complementary partition to the listvalues? >> >> listvalues: [[NOT] IN] (val1, ...) >> >> I've thought a few times about moving data with some most common values to dedicated partitions >> and keeping the rest in a separate one... >> > >Thanks, that's definitely something to consider. > >I have been thinking of a sort of default list partition for the "rest" of >values. Would you rather declare that with something like the below than >having to enumerate all the values in a NOT IN list? Or the NOT IN way is >more intuitive/friendly? > >CREATE TABLE _rest PARTITION OF table_name FOR VALUES [ IN ] DEFAULT > >Of course, at most one such partition would be allowed. On the one hand I guess it will be easiest to check for partition overlapping if their definitions all contain the exact allowed values. But this could be generalized to range partitions too: CREATE TABLE _rest FALLBACK PARTITION OF table_name The need for it for range partitions seems very narrow at first glance, but I remember boring administrative work in order to ensure that there always was a partition available for incoming data (from a very old time when I was still working with Oracle). To have it comfortable and nevertheless allow defining new partitions, this would require to always check/move data from the default partition to new partitions at create time. and 2 other thoughts: - In your proposal, the parent table is not materialized at all. Could it be used for the fallback partition? - what about always having a fallback partition? This would reduce the risk of unexpected failures and somewhat help Postgres stand out from the crowd :) regards, Marc Mamin
On 08/19/2015 04:59 AM, Simon Riggs wrote: > I like the idea of a regular partitioning step because it is how you > design such tables - "lets use monthly partitions". > > This gives sanely terse syntax, rather than specifying pages and pages > of exact values in DDL.... > > PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) > START WITH value; Oh, I like that syntax! How would it work if there were multiple columns? Maybe we don't want to allow that for this form? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 19 August 2015 at 21:10, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/19/2015 04:59 AM, Simon Riggs wrote:
>> I like the idea of a regular partitioning step because it is how you
>> design such tables - "lets use monthly partitions".
>>
>> This gives sanely terse syntax, rather than specifying pages and pages
>> of exact values in DDL....
>>
>> PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' )
>> START WITH value;
>
> Oh, I like that syntax!
>
> How would it work if there were multiple columns? Maybe we don't want
> to allow that for this form?
If we went with that, and had:
CREATE TABLE orders (order_id serial, order_date date, item text)
PARTITION BY RANGE ON (order_date) INCREMENT BY (INTERVAL '1 month')
START WITH '2015-01-01';
Where would the following go?
INSERT INTO orders (order_date, item) VALUES ('2014-11-12', 'Old item');
Would there automatically be an "others" partition? Or would it produce an error and act like a constraint?
Thom
On 08/19/2015 01:18 PM, Thom Brown wrote: > On 19 August 2015 at 21:10, Josh Berkus <josh@agliodbs.com > <mailto:josh@agliodbs.com>> wrote: > > On 08/19/2015 04:59 AM, Simon Riggs wrote: > > I like the idea of a regular partitioning step because it is how you > > design such tables - "lets use monthly partitions". > > > > This gives sanely terse syntax, rather than specifying pages and pages > > of exact values in DDL.... > > > > PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) > > START WITH value; > > Oh, I like that syntax! > > How would it work if there were multiple columns? Maybe we don't want > to allow that for this form? > > > If we went with that, and had: > > CREATE TABLE orders (order_id serial, order_date date, item text) > PARTITION BY RANGE ON (order_date) INCREMENT BY (INTERVAL '1 month') > START WITH '2015-01-01'; > > Where would the following go? > > INSERT INTO orders (order_date, item) VALUES ('2014-11-12', 'Old item'); > > Would there automatically be an "others" partition? Or would it produce > an error and act like a constraint? The "others" partition was brought up upthread, as an addition to the original proposal. I really think that an "others" partition needs to be up to the DBA; I've seen apps where they'd want to capture it, and apps where they'd want such an insert to error. I, for one, would be OK with a new partitioning which didn't address the "others" partition issue until 9.7; I see it as a wholly separable improvement. Plus, you can always *manually* add high/low catchall partitions. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
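For instance, manual catchall partitions could look like this under the proposed START-only/END-only forms (names and dates are illustrative):

-- leftmost catchall, covers (-INFINITY, '2015-01-01'):
CREATE TABLE orders_low PARTITION OF orders
  FOR VALUES END ('2015-01-01');

-- rightmost catchall, covers ['2016-01-01', +INFINITY):
CREATE TABLE orders_high PARTITION OF orders
  FOR VALUES START ('2016-01-01');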
On 2015-08-19 PM 09:23, Simon Riggs wrote: > On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > wrote: > >> >> There is no need to define tuple routing triggers. CopyFrom() and >> ExecInsert() determine target partition just before performing >> heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and >> constraints (on parent) are executed for tuple before being routed to a >> partition. If no partition can be found, it's an error. >> >> Because row-level AFTER triggers need to save ItemPointers in trigger >> event data and defining triggers on partitions (which is where tuples >> really go) is not allowed, I could not find a straightforward way to >> implement them. So, perhaps we should allow (only) row-level AFTER >> triggers on partitions or think of modifying trigger.c to know about this >> twist explicitly. >> > > I think tables will eventually need FK support; its not sustainable as a > long term restriction, though perhaps its something we can do in a later > patch. > Sure. Solving the row-level AFTER trigger problem should hopefully open up the possibility of partitioned-table-as-FK-rel implementation. > You haven't specified what would happen if an UPDATE would change a row's > partition. I'm happy to add this to the list of restrictions by saying that > the partition key cannot be updated. > UPDATEs that change a row's partition would cause an error. I haven't implemented that yet but will do so in the next patch. By the last sentence, do you mean only UPDATEs to the partition key that cause rows to jump partitions or simply any UPDATEs to the partition key? > We'll need regression tests that cover each restriction and docs that > match. This is not something we should leave until last. People read the > docs to understand the feature, helping them to reach consensus. So it is > for you to provide the docs before, not wait until later. I will begin a > code review once you tell me docs and tests are present. We all want the > feature, so its all about the details now. > Sorry, should have added tests and docs already. I will add them in the next version of the patch. Thanks for being willing to review. Thanks, Amit
On Thu, Aug 20, 2015 at 11:16 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2015-08-19 PM 09:23, Simon Riggs wrote:
> We'll need regression tests that cover each restriction and docs that
> match. This is not something we should leave until last. People read the
> docs to understand the feature, helping them to reach consensus. So it is
> for you to provide the docs before, not wait until later. I will begin a
> code review once you tell me docs and tests are present. We all want the
> feature, so its all about the details now.
>
Sorry, should have added tests and docs already. I will add them in the
next version of the patch.
Yes, those would be really good to have before any review so that it is possible to get a grasp of what this patch does. I would like to look at it in more depth as well.
Thanks for willing to review.
--
Michael
On 20 August 2015 at 03:16, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
Sorry, should have added tests and docs already. I will add them in the
next version of the patch. Thanks for willing to review.
Thanks for picking up this challenge. It's easier if you have someone interested all the way.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-20 AM 05:10, Josh Berkus wrote: > On 08/19/2015 04:59 AM, Simon Riggs wrote: >> I like the idea of a regular partitioning step because it is how you >> design such tables - "lets use monthly partitions". >> >> This gives sanely terse syntax, rather than specifying pages and pages >> of exact values in DDL.... >> >> PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) >> START WITH value; > > Oh, I like that syntax! > > How would it work if there were multiple columns? Maybe we don't want > to allow that for this form? > Yea, we could simply restrict it to the single column case, which does not sound like a major restriction. Thanks, Amit
On Tue, Aug 18, 2015 at 4:00 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
Hi,
I would like propose $SUBJECT for this development cycle. Attached is a
WIP patch that implements most if not all of what's described below. Some
yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
Syntax
======
1. Creating a partitioned table
CREATE TABLE table_name
PARTITION BY {RANGE|LIST}
ON (column_list);
Where column_list consists of simple column names or expressions:
PARTITION BY LIST ON (name)
PARTITION BY RANGE ON (year, month)
PARTITION BY LIST ON ((lower(left(name, 2)))
PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))
How about HASH partitioning? Are there plans to support foreign tables as partitions?
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20 August 2015 at 10:10, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
On 2015-08-20 AM 05:10, Josh Berkus wrote:
> On 08/19/2015 04:59 AM, Simon Riggs wrote:
>> I like the idea of a regular partitioning step because it is how you
>> design such tables - "lets use monthly partitions".
>>
>> This gives sanely terse syntax, rather than specifying pages and pages
>> of exact values in DDL....
>>
>> PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' )
>> START WITH value;
>
> Oh, I like that syntax!
>
> How would it work if there were multiple columns? Maybe we don't want
> to allow that for this form?
>
Yea, we could simply restrict it to the single column case, which does not
sound like a major restriction.
PARTITION BY ...
SUBPARTITION BY ...
We should plan for that in the way we develop the internals, but full support can wait until later patches.
My view has long been that the internals are the key aspect here, not the syntax. We need to be able to have a very fast partition-selection mechanism that can be used in the planner or executor for each tuple. Working backwards, we need a relcache representation that allows that, and a catalog representation that allows that and syntax to match.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-20 PM 06:27, Pavan Deolasee wrote: > On Tue, Aug 18, 2015 at 4:00 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp >> wrote: >> >> PARTITION BY LIST ON (name) >> PARTITION BY RANGE ON (year, month) >> >> PARTITION BY LIST ON ((lower(left(name, 2))) >> PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d))) >> >> > > How about HASH partitioning? Are there plans to support foreign tables as > partitions? > I've not given HASH partitioning a lot of thought yet. About the latter, it seems it should not be too difficult to incorporate into the proposed partitioning internals. Thanks, Amit
On 2015-08-20 PM 06:34, Simon Riggs wrote: > On 20 August 2015 at 10:10, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > wrote: > >> On 2015-08-20 AM 05:10, Josh Berkus wrote: >>>> PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) >>>> START WITH value; >>> >>> Oh, I like that syntax! >>> >>> How would it work if there were multiple columns? Maybe we don't want >>> to allow that for this form? >>> >> >> Yea, we could simply restrict it to the single column case, which does not >> sound like a major restriction. >> > > PARTITION BY ... > SUBPARTITION BY ... > > We should plan for that in the way we develop the internals, but full > support can wait until later patches. > At the moment, a form of SUBPARTITION BY is to allow PARTITION BY in a partition definition. But I can see that may not be what people would expect. > My view has long been that the internals are they aspect here, not the > syntax. We need to be able to have a very fast partition-selection > mechanism that can be used in the planner or executor for each tuple. > Working backwards, we need a relcache representation that allows that, and > a catalog representation that allows that and syntax to match. > Agreed. Thanks, Amit
On 2015-08-19 AM 02:59, Corey Huinker wrote: > > Quick thoughts borne of years of slugging it out with partitions on Oracle: > > - Finally!!!!!!!!!!! > > - Your range partitioning will need to express exclusive/inclusive bounds, > or go to the Oracle model where every partition is a cascading "values less > than" test context dependent on the partitions defined before it. I would > suggest that leveraging existing range types (or allowing the user to > specify a range type, like for a specific collation of a text range) would > allow for the most flexible and postgres-ish range definition. You seem to > do this with the "[USING] opclass_name" bit, but don't follow through on > the START(...) and END(...). Something like FOR VALUES <@ > '[''2014-01-01'',''2015-01-01)'::daterange would cover most needs > succinctly, though I admit the syntax for complex ranges could be > cumbersome, though something like FOR VALUES <@ > '[(''a'',1),(''b'',1))'::letter_then_number_range is still readable. > It seems the way of specifying per-partition definition/constraint, especially for range partitioning, would have a number of interesting alternatives. By the way, the [USING opclass_name] bit is just a way of telling that a particular key column has user-defined notion of "ordering" in case of range partitioning and "equality" for list partitioning. The opclass would eventually determine which WHERE clauses (looking at operators, operand types) are candidates to help prune partitions. If we use the range_op range_literal::range_type notation to describe partition constraint for each partition, it might not offer much beyond the readability. We are not really going to detect range operators being applied in WHERE conditions to trigger partition pruning, for example. Although I may be missing something... > - No partitioning scheme survives first contact with reality. So you will > need a facility for splitting and joining existing partitions. For > splitting partitions, it's sufficient to require that the new partition > share either a upper/lower bound (with the same inclusivity/exclusivity) of > an existing partition, thus uniquely identifying the partition to be split, > and require that the other bound be within the range of the partition to be > split. Similarly, it's fair to require that the partitions to be joined be > adjacent in range. In both cases, range operators make these tests simple. > SPLIT/MERGE can be done in later patches/release, I think. > - Your features 4 and 5 are implemented in Oracle with SWAP PARTITION, > which is really neat for doing ETLs and index rebuilds offline in a copy > table, and then swapping the data segment of that table with the partition > specified. Which could be considered cheating because none of the partition > metadata changed, just the pointers to the segments. We already do this > with adding removing INHERIT. I'm not saying they can't be separate > functionality, but keeping an atomic SWAP operation would be grand. > I think we can manage to find ways to make the proposed ATTACH/DETACH as useful and convenient. Thanks for reminding of SWAP PARTITION. Regards, Amit
On 2015-08-19 PM 09:52, David Fetter wrote: > On Wed, Aug 19, 2015 at 04:30:39PM +0900, Amit Langote wrote: >>> >>> One small change to make this part more efficient: >>> >>> 1. Take the access exclusive lock on table_name. >>> 2. Check for a matching constraint on it. >>> 3. If it's there, mark it as a valid partition. >>> 4. If not, check for values outside the boundaries as above. >>> >> >> That's an interesting idea. Thanks! > > I hope I'm advancing this feature rather than bogging it down... > Definitely advancing. >> By a matching constraint, I guess you mean a 'valid' constraint from >> which the declared partition constraint can be proven to follow. For >> (a simple) example, from a CHECK (a >= 100 AND a < 150) on >> table_name, the partition constraint implied by FOR VALUES START >> (100) END (200) can be assumed to hold. > > Well, I was assuming an exact match, but a stricter match seems like a > nice-to-have...possibly later. > >>> Should the be a *valid* constraint? Perhaps that should be >>> parameterized, as I'm not yet seeing a compelling argument either >>> direction. I'm picturing something like: >>> >>> ALTER TABLE table_name SET VALID PARTITION OF <parent> [TRUST] >>> >>> where TRUST would mean that an existing constraint need not be VALID. >> >> Hmm, I'd think this step must be able to assert the partition >> constraint beyond any doubt. If the DBA added the constraint and >> marked it invalid, she should first VALIDATE the constraint to make >> it valid by performing whatever steps necessary before. IOW, a full >> heap scan at least once is inevitable (the reason why we might want >> to make this a two step process at all). Am I missing something? > > There are use cases where we need to warn people that their assertions > need to be true, and if those assertions are not true, this will > explode, leaving them to pick the resulting shrapnel out of their > faces. There are other parts of the system where this is true, as > when people write UDFs in C. > > As I understand it, NOT VALID means, "I assert that the tuples already > here fit the constraint. Any changes will be checked against the > constraint." > > I've seen cases where a gigantic amount of data is coming out of some > distributed system which holds the constraint as an invariant. This > let a DBA decide to add a NOT VALID constraint in order not to take > the hit of a second full scan of the data, which might have made the > import, and possibly the entire project, untenable. > > See above. > Ah, I understand the point of parameterization (TRUST). Seems like it would be good to have with appropriate documentation of the same. Perhaps, it might as well a parameter to the step 1 itself. >>>> 5. Detach partition >>>> >>>> ALTER TABLE partitioned_table >>>> DETACH PARTITION partition_name [USING table_name] >>>> >>>> This removes partition_name as partition of partitioned_table. >>>> The table continues to exist with the same name or 'table_name', >>>> if specified. pg_class.relispartition is set to false for the >>>> table, so it behaves like a normal table. >>> >>> Could this take anything short of an access exclusive lock on the >>> parent? >> >> Yes, both the step 1 of ATTACH command and DETACH command take >> access exclusive lock on the parent. They are rather quick metadata >> changes, so should not stall others significantly, I think. > > So no. Weakening required locks has been something of an ongoing > project, project-wide, and need not be part of the first cut of this > long-needed feature. 
Do you mean ATTACH and DETACH, if they require access exclusive lock on the parent, should not be in the first cut? Or am I misreading? If so, there is no way to drop partitions. With the patch, it would be achieved with detach and drop (if required).
> Thanks so much for working on this!
Thanks for the feedback and suggestions! Regards, Amit
On 8/20/15 5:45 AM, Amit Langote wrote: > On 2015-08-20 PM 06:27, Pavan Deolasee wrote: >> On Tue, Aug 18, 2015 at 4:00 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp >>> wrote: >>> >>> PARTITION BY LIST ON (name) >>> PARTITION BY RANGE ON (year, month) >>> >>> PARTITION BY LIST ON ((lower(left(name, 2))) >>> PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d))) >>> >>> >> >> How about HASH partitioning? Are there plans to support foreign tables as >> partitions? >> > > I've not given HASH partitioning a lot of thought yet. About the latter, > it seems it should not be too difficult to incorporate into the proposed > partitioning internals. Hash partitioning seems like it could wait. In fact, I've nearly always implemented hash partitions as list partitions. This gives me control over the hash function and allows me to use properties of the key to my advantage. For instance - if your key is a sha1 hash there's no need to rehash, just grab the required number of bits off the end of the key. My experiences with Oracle's hash function were generally not good - there's a reason many hash algorithms exist. If/when we do hash partitioning in Postgres I'd like to see the hash function be user-definable. Meanwhile, I think list and range are a good start. I'd prefer to see sub-partitioning before hash partitioning. -- -David david@pgmasters.net
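A sketch of that list-as-hash approach under the proposed syntax, partitioning on the last hex digit of a pre-hashed key (table and column names are illustrative):

CREATE TABLE events (id_sha1 text, payload text)
PARTITION BY LIST ON ((right(id_sha1, 1)));

CREATE TABLE events_p0 PARTITION OF events FOR VALUES IN ('0', '1', '2', '3');
CREATE TABLE events_p1 PARTITION OF events FOR VALUES IN ('4', '5', '6', '7');
CREATE TABLE events_p2 PARTITION OF events FOR VALUES IN ('8', '9', 'a', 'b');
CREATE TABLE events_p3 PARTITION OF events FOR VALUES IN ('c', 'd', 'e', 'f');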
On Thu, Aug 20, 2015 at 06:58:24PM +0900, Amit Langote wrote: > On 2015-08-19 PM 09:52, David Fetter wrote: > > On Wed, Aug 19, 2015 at 04:30:39PM +0900, Amit Langote wrote: > > > > There are use cases where we need to warn people that their assertions > > need to be true, and if those assertions are not true, this will > > explode, leaving them to pick the resulting shrapnel out of their > > faces. There are other parts of the system where this is true, as > > when people write UDFs in C. > > > > As I understand it, NOT VALID means, "I assert that the tuples already > > here fit the constraint. Any changes will be checked against the > > constraint." > > > > I've seen cases where a gigantic amount of data is coming out of some > > distributed system which holds the constraint as an invariant. This > > let a DBA decide to add a NOT VALID constraint in order not to take > > the hit of a second full scan of the data, which might have made the > > import, and possibly the entire project, untenable. > > Ah, I understand the point of parameterization (TRUST). Seems like it > would be good to have with appropriate documentation of the same. Perhaps, > it might as well a parameter to the step 1 itself. So TRUST would obviate step 2? Better still! > >>>> 5. Detach partition > >>>> > >>>> ALTER TABLE partitioned_table > >>>> DETACH PARTITION partition_name [USING table_name] > >>>> > >>>> This removes partition_name as partition of partitioned_table. > >>>> The table continues to exist with the same name or 'table_name', > >>>> if specified. pg_class.relispartition is set to false for the > >>>> table, so it behaves like a normal table. > >>> > >>> Could this take anything short of an access exclusive lock on the > >>> parent? > >> > >> Yes, both the step 1 of ATTACH command and DETACH command take > >> access exclusive lock on the parent. They are rather quick metadata > >> changes, so should not stall others significantly, I think. > > > > So no. Weakening required locks has been something of an ongoing > > project, project-wide, and need not be part of the first cut of this > > long-needed feature. > > > > Do you mean ATTACH and DETACH, if they require access exclusive lock on > the parent, should not be in the first cut? Or am I misreading? Sorry I was unclear. ATTACH and DETACH should be in the first cut even if they require an access exclusive lock. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
It seems the way of specifying per-partition definition/constraint,
especially for range partitioning, would have a number of interesting
alternatives.
By the way, the [USING opclass_name] bit is just a way of telling that a
particular key column has user-defined notion of "ordering" in case of
range partitioning and "equality" for list partitioning. The opclass would
eventually determine which WHERE clauses (looking at operators, operand
types) are candidates to help prune partitions. If we use the range_op
range_literal::range_type notation to describe partition constraint for
each partition, it might not offer much beyond the readability. We are not
really going to detect range operators being applied in WHERE conditions
to trigger partition pruning, for example. Although I may be missing
something...
I don't think it's important that ranges be used internally for partition pruning, though I don't see a reason why they couldn't. I was making the case for range types being used at partition creation/declaration time, because you were proposing new syntax to describe a range of values in text when we already have that in range types. We should eat our own dog food.
But mostly, I wanted to make sure that range partitions could have inclusive and exclusive bounds.
> - No partitioning scheme survives first contact with reality. So you will
> need a facility for splitting and joining existing partitions. For
> splitting partitions, it's sufficient to require that the new partition
> share either a upper/lower bound (with the same inclusivity/exclusivity) of
> an existing partition, thus uniquely identifying the partition to be split,
> and require that the other bound be within the range of the partition to be
> split. Similarly, it's fair to require that the partitions to be joined be
> adjacent in range. In both cases, range operators make these tests simple.
>
SPLIT/MERGE can be done in later patches/release, I think.
Later patches for sure. When a partition "fills up", it's a critical performance issue, so the fewer steps needed to amend it the better.
My experiences with Oracle's hash function were generally not good -
there's a reason many hash algorithms exist. If/when we do hash
partitioning in Postgres I'd like to see the hash function be
user-definable.
+1
In my experience, hash partitioning had one use: When you had run out of ways to logically partition the data, AND you had two tables in a foreign key relationship, AND that relationship was the most common use-case for selecting the two tables. In which case, your one big join became 8 or 16 little ones, and all was well in the universe...
...until those little joins started to spill to disk, and now you need to repartition by 2x hashes, and that basically means a reload of your primary fact table, and a talk with a boss about why you painted yourself into a corner when you first did that.
Meanwhile, I think list and range are a good start. I'd prefer to see
sub-partitioning before hash partitioning.
+1
On 08/20/2015 06:19 AM, David Fetter wrote: > On Thu, Aug 20, 2015 at 06:58:24PM +0900, Amit Langote wrote: >> Do you mean ATTACH and DETACH, if they require access exclusive lock on >> the parent, should not be in the first cut? Or am I misreading? > > Sorry I was unclear. > > ATTACH and DETACH should be in the first cut even if they require an > access exclusive lock. > > Cheers, > David. I don't see a way for them to *ever* not require an access exclusive lock. We could eventually implement: DETACH PARTITION CONCURRENTLY ... but that's the only way I can see around it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Aug 20, 2015 at 11:16:37AM +0900, Amit Langote wrote: > On 2015-08-19 PM 09:23, Simon Riggs wrote: > > On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > > wrote: > > You haven't specified what would happen if an UPDATE would change a row's > > partition. I'm happy to add this to the list of restrictions by saying that > > the partition key cannot be updated. > > UPDATEs that change a row's partition would cause error. I haven't > implemented that yet but will that way in the next patch. > > By the last sentence, do you mean only UPDATEs to the partition key that > cause rows to jump partitions or simply any UPDATEs to the partition key? I don't know what Simon had in mind, but it seems to me that we have the following in descending order of convenience to users, and I presume, descending order of difficulty of implementation: 1. Updates propagate transparently, moving rows between partitions if needed. 2. Updates fail only if the change to a partition key would cause the row to have to move to another partition. 3. All updates to the partition key fail. Whichever of these we choose, we should document it with great clarity. > > We'll need regression tests that cover each restriction and docs > > that match. This is not something we should leave until last. > > People read the docs to understand the feature, helping them to > > reach consensus. So it is for you to provide the docs before, not > > wait until later. I will begin a code review once you tell me docs > > and tests are present. We all want the feature, so its all about > > the details now. > > Sorry, should have added tests and docs already. I will add them in > the next version of the patch. Thanks for willing to review. Docs and tests are crucial, and thanks again for taking this on. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 2015-08-20 PM 10:19, David Fetter wrote: > On Thu, Aug 20, 2015 at 06:58:24PM +0900, Amit Langote wrote: >> >> Ah, I understand the point of parameterization (TRUST). Seems like it >> would be good to have with appropriate documentation of the same. Perhaps, >> it might as well a parameter to the step 1 itself. > > So TRUST would obviate step 2? Better still! > I really wish we could do this in one step, but... As you know, primary intention behind keeping step 2 separate is to do the heavy validation work with access exclusive lock on the other table instead of on the master table. Assuming we add TRUST TO step 1 (with implications clarified in docs), when/if the user does not use TRUST, the default behavior being to perform the validation, would it be OK as it would have to take the expensive lock on the master table? Motivation for keeping the step 2 (and TRUST parameter with it) is really to avoid this last concern. All that said, we'd still need some way to tell rows that came from a TRUSTed table but do not really belong to the partition because we skipped checking that at all. Obviously one can always write a query on the partition that can find them, but some form of DDL would be what people might expect. >> >> Do you mean ATTACH and DETACH, if they require access exclusive lock on >> the parent, should not be in the first cut? Or am I misreading? > > Sorry I was unclear. > > ATTACH and DETACH should be in the first cut even if they require an > access exclusive lock. > Ah, got it. Thanks, Amit
On 2015-08-21 AM 06:27, David Fetter wrote: >> By the last sentence, do you mean only UPDATEs to the partition key that >> cause rows to jump partitions or simply any UPDATEs to the partition key? > > I don't know what Simon had in mind, but it seems to me that we have > the following in descending order of convenience to users, and I > presume, descending order of difficulty of implementation: > > 1. Updates propagate transparently, moving rows between partitions if needed. > > 2. Updates fail only if the change to a partition key would cause the > row to have to move to another partition. > > 3. All updates to the partition key fail. > I was thinking I'd implement 2. There was some discussion last year[1] about how 1 could be realized. Thanks, Amit [1] http://www.postgresql.org/message-id/20140829172216.GF10109@awork2.anarazel.de
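A sketch of how option 2 would behave from a user's point of view (table and values are illustrative; this is the proposed behavior, not something the patch implements yet):

-- t is list-partitioned ON (state); the row currently lives in the 'IL' partition
UPDATE t SET name = 'foo' WHERE id = 1;    -- partition key unchanged: allowed
UPDATE t SET state = 'WI' WHERE id = 1;    -- row would have to move to another
                                           -- partition, so this would be rejected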
On Fri, Aug 21, 2015 at 11:22 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2015-08-21 AM 06:27, David Fetter wrote:
>> By the last sentence, do you mean only UPDATEs to the partition key that
>> cause rows to jump partitions or simply any UPDATEs to the partition key?
>
> I don't know what Simon had in mind, but it seems to me that we have
> the following in descending order of convenience to users, and I
> presume, descending order of difficulty of implementation:
>
> 1. Updates propagate transparently, moving rows between partitions if needed.
>
> 2. Updates fail only if the change to a partition key would cause the
> row to have to move to another partition.
>
> 3. All updates to the partition key fail.
>
I was thinking I'd implement 2.
+1. IMHO that's a good starting point.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8/18/15 12:31 PM, Josh Berkus wrote: > Also this would be useful for range > partitions: > > CREATE PARTITION ON parent_table USING ( start_value ); > > ... where start_value is the start range of the new partition. Again, > easier for users to get correct. Instead of that, I think it would be more foolproof to do CREATE PARTITION ON parent_table FOR ( value1, ... ); instead of trusting the user to get the exact start value correct. Though... I guess there could be value in allowing an exact start value but throwing an error if it doesn't sit exactly on a boundary. Might make it less likely to accidentally create the wrong partition. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
On 08/21/2015 08:34 PM, Jim Nasby wrote: > On 8/18/15 12:31 PM, Josh Berkus wrote: >> Also this would be useful for range >> partitions: >> >> CREATE PARTITION ON parent_table USING ( start_value ); >> >> ... where start_value is the start range of the new partition. Again, >> easier for users to get correct. > > Instead of that, I think it would be more foolproof to do > > CREATE PARTITION ON parent_table FOR ( value1, ... ); > > instead of trusting the user to get the exact start value correct. > > Though... I guess there could be value in allowing an exact start value > but throwing an error if it doesn't sit exactly on a boundary. Might > make it less likely to accidentally create the wrong partition. Well, I'm figuring that most people would use "CREATE NEXT PARTITION" instead. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 24 August 2015 at 00:53, Josh Berkus <josh@agliodbs.com> wrote:
--
On 08/21/2015 08:34 PM, Jim Nasby wrote:
> On 8/18/15 12:31 PM, Josh Berkus wrote:
>> Also this would be useful for range
>> partitions:
>>
>> CREATE PARTITION ON parent_table USING ( start_value );
>>
>> ... where start_value is the start range of the new partition. Again,
>> easier for users to get correct.
>
> Instead of that, I think it would be more foolproof to do
>
> CREATE PARTITION ON parent_table FOR ( value1, ... );
>
> instead of trusting the user to get the exact start value correct.
>
> Though... I guess there could be value in allowing an exact start value
> but throwing an error if it doesn't sit exactly on a boundary. Might
> make it less likely to accidentally create the wrong partition.
Well, I'm figuring that most people would use "CREATE NEXT PARTITION"
instead.
ALTER TABLE foo ADD PARTITION NEXT;
when a sequence of partitions has been defined
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/24/2015 03:46 AM, Simon Riggs wrote: > > > ALTER TABLE foo ADD PARTITION NEXT; > > when a sequence of partitions has been defined > > Or perhaps ALTER TABLE foo ADD PARTITION NEXT nn; So you can set up, say, a week's worth of daily partitions at once. I could also imagine a variant that allows you to specify partitions up to the one including some value. cheers andrew
On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
I would like propose $SUBJECT for this development cycle. Attached is a
WIP patch that implements most if not all of what's described below. Some
yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.
Thanks for working on this. It's a great start.
3. Multi-level partitioning
CREATE TABLE table_name
PARTITION OF partitioned_table_name
FOR VALUES values_spec
PARTITION BY {RANGE|LIST} ON (columns_list)
This variant implements a form of so called composite or sub-partitioning
with arbitrarily deep partitioning structure. A table created using this
form has both the relkind RELKIND_PARTITIONED_REL and
pg_class.relispartition set to true.
Multi-level partitioning is probably going to complicate things beyond sanity.
One RELKIND_PARTITIONED_REL with lots of partitions sounds best to me. We can still have N dimensions of partitioning (or partitioning and subpartitioning, if you prefer that term)
The patch does not yet implement any planner changes for partitioned
tables, although I'm working on the same and post updates as soon as
possible. That means, it is not possible to run SELECT/UPDATE/DELETE
queries on partitioned tables without getting:
postgres=# SELECT * FROM persons;
ERROR: could not open file "base/13244/106975": No such file or directory
Given that there would be more direct ways of performing partition pruning
decisions with the proposed, it would be nice to utilize them.
Specifically, I would like to avoid having to rely on constraint exclusion
for partition pruning whereby subquery_planner() builds append_rel_list
and the later steps exclude useless partitions.
This is really the heart of this patch/design. You can work for months on all the rest of this, but you will live or die by how the optimization works because that is the thing we really need to work well. Previous attempts ignored this aspect and didn't get committed. It's hard, perhaps even scary, but it's critical. It's the 80/20 rule in reverse - 20% of the code is 80% of the difficulty.
I suggest you write a partition query test script .sql and work towards making this work. Not exhaustive and weird tests, but 5-10 key queries that need to be optimized precisely and quickly. I'm sure that's been done before.
Will include the following once we start reaching consensus on main parts
of the proposed design/implementation:
* New regression tests
* Documentation updates
* pg_dump, psql, etc.
For reference, some immediately previous discussions:
* On partitioning *
http://www.postgresql.org/message-id/20140829155607.GF7705@eldon.alvh.no-ip.org
* Partitioning WIP patch *
http://www.postgresql.org/message-id/54EC32B6.9070605@lab.ntt.co.jp
If you want to achieve consensus, please write either docs or README files that explain how this works.
It took me a few seconds to notice deviations from Alvaro's original post. I shouldn't have to read a full thread to see what the conclusions were, you need to record them coherently.
Some great examples of such things are
src/backend/optimizer/README
src/backend/access/nbtree/README
Please imagine how far such code would have got without them, then look at the code comments on the top of each of the functions in that area for examples of the clarity of design this needs.
Comments welcome!
Yes, comments in code are indeed welcome, as well as the README/docs.
I couldn't see why you invented a new form of Alter Table recursion.
We will need to support multi-row batched COPY.
I'm pleased to see this patch and will stay with it to completion, perhaps others also. We have 3 more CFs in this release, Nov, Jan, Mar - so this has a great chance of making it into 9.6. The current patch implements a bunch of stuff, but it's hard to say what, how or why it does it, and without the planner stuff it's all moot. My recommendation is we say "Returned with Feedback" on this now, looking forward to the next patch.
If you submit another patch before Nov, I will review it without waiting for Nov 1.
There will be much discussion on syntax, but that is not the key point. DDL Support routines are usually pretty straightforward too, so that can be left for now.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Needs planner work and tests of that. ALTER TABLE etc can wait.

The new status of this patch is: Waiting on Author
Sorry about the long delay in replying, to this message or the others posted in the last few days. I should have notified in advance of my vacation with rather limited Internet access. On 2015-08-26 PM 11:00, Simon Riggs wrote: > On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > >> 3. Multi-level partitioning >> >> CREATE TABLE table_name >> PARTITION OF partitioned_table_name >> FOR VALUES values_spec >> PARTITION BY {RANGE|LIST} ON (columns_list) >> >> This variant implements a form of so called composite or sub-partitioning >> with arbitrarily deep partitioning structure. A table created using this >> form has both the relkind RELKIND_PARTITIONED_REL and >> pg_class.relispartition set to true. >> > > Multi-level partitioning is probably going to complicate things beyond > sanity. > > One RELKIND_PARTITIONED_REL with lots of partitions sounds best to me. We > can still have N dimensions of partitioning (or partitioning and > subpartitioning, if you prefer that term) > Implementation in the patch lets RELKIND_PARTITIONED_REL under another and so on to which appears to give a capability to have arbitrarily deep partitioning structure. That might feel unnecessary. Initial motivation to go for such design was to not have to complicate the partition key catalog with details of how to accommodate some notion of sub-partition template. ISTM, most usecases (as covered by other databases) would not require to use more than 2 levels. Maybe, we should consider catering to only that set of usecases using some sub-partition template based design (and then some specialized syntax). Or, we could just not handle sub-partitioning initially, by which I mean composite partitioning where you can have schemes such as range-list, list-range, range-hash, etc. We do still have multi-column range partitioning. In any case, we have to have a design that will cater to the multi-level partitioning. > >> The patch does not yet implement any planner changes for partitioned >> tables, although I'm working on the same and post updates as soon as >> possible. That means, it is not possible to run SELECT/UPDATE/DELETE >> queries on partitioned tables without getting: >> >> postgres=# SELECT * FROM persons; >> ERROR: could not open file "base/13244/106975": No such file or directory >> >> Given that there would be more direct ways of performing partition pruning >> decisions with the proposed, it would be nice to utilize them. >> Specifically, I would like to avoid having to rely on constraint exclusion >> for partition pruning whereby subquery_planner() builds append_rel_list >> and the later steps exclude useless partitions. >> > > This is really the heart of this patch/design. You can work for months on > all the rest of this, but you will live or die by how the optimization > works because that is the thing we really need to work well. Previous > attempts ignored this aspect and didn't get committed. It's hard, perhaps > even scary, but its critical. It's the 80/20 rule in reverse - 20% of the > code is 80% of the difficulty. > > I suggest you write a partition query test script .sql and work towards > making this work. Not exhaustive and weird tests, but 5-10 key queries that > need to be optimized precisely and quickly. I'm sure that's been done > before. > Yes, I am working on this and hope to have something to show soon. > > I couldn't see why you invented a new form of Alter Table recursion. 
> It was intended to keep the ALTER TABLE considerations for inherited tables (and typed tables) separate from those for partitioned tables. But... This begs a larger question that I did not try to answer in this design/patch - for partitions, do we need to have any catalog entries other than the pg_class tuple? If we manage to not require them, we would not need any AT recursion business at all except for cases that require pg_class tuple updates. The pg_class tuple would be the only authoritative catalog entry for partitions. Everything else belongs with the master table. That includes catalog entries for attributes, constraints, triggers, etc. Alvaro had mentioned something like this in his proposal. Although, I hope that such radical design is not very difficult to realize/code. Also, we might have to go back to the slightly controversial question of whether partitions share the same namespace as normal tables (partitioned or not). In the patch, partitions are created using CREATE TABLE, altered with ALTER TABLE (albeit with several restrictions). How about rather partitions are created/altered using: ALTER TABLE master CREATE PARTITION <name> ..., ALTER TABLE master MODIFY PARTITION <name> ... (as mentioned above) AT commands covered by the latter should only ever require updating the pg_class tuple for the named partition. ALTER TABLE <name> directly on a partition can be made to say something like the following under this scheme: ERROR: <name> is not a table Although, I'm not thinking of hiding partitions from everyone. For example, maintenance commands like VACUUM/ANALYZE (including autovacuum, of course) would be able to see them. > We will need to support multi-row batched COPY. Currently, tuple-routing may switch the ResultRelInfo for every consecutive tuple. So, at once I abandoned any hope of adapting the routing mechanism to heap_multi_insert() API. But, perhaps there is a smarter way. Let me think about that. > > I'm pleased to see this patch and will stay with it to completion, perhaps > others also. We have 3 more CFs in this release, Nov, Jan, Mar - so this > has a great chance of making it into 9.6. The current patch implements a > bunch of stuff, but its hard to say what, how or why it does it and without > the planner stuff its all moot. My recommendation is we say "Returned with > Feedback" on this now, looking forward to next patch. > > If you submit another patch before Nov, I will review it without waiting > for Nov 1. > > There will be much discussion on syntax, but that is not the key point. DDL > Support routines are usually pretty straightforward too, so that can be > left for now. > Sure, let me address number of points you have raised. In the upcoming versions, I will try to provide extensive documentation (both internal and user). As for the syntax and DDL changes you think can be dealt with later, I tend to think let's keep the bare essentials necessary. Thanks, Amit
On 4 September 2015 at 06:51, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
Sorry about the long delay in replying, to this message or the others
posted in the last few days. I should have notified in advance of my
vacation with rather limited Internet access.
No problem, I'm on leave too.
>> The patch does not yet implement any planner changes for partitioned
>> tables, although I'm working on the same and post updates as soon as
>> possible.
...
>
> This is really the heart of this patch/design. You can work for months on
> all the rest of this, but you will live or die by how the optimization
> works because that is the thing we really need to work well. Previous
> attempts ignored this aspect and didn't get committed. It's hard, perhaps
> even scary, but its critical. It's the 80/20 rule in reverse - 20% of the
> code is 80% of the difficulty.
>
> I suggest you write a partition query test script .sql and work towards
> making this work. Not exhaustive and weird tests, but 5-10 key queries that
> need to be optimized precisely and quickly. I'm sure that's been done
> before.
>
Yes, I am working on this and hope to have something to show soon.
No rush, no pressure; let's get this right.
>
> I couldn't see why you invented a new form of Alter Table recursion.
>
It was intended to keep the ALTER TABLE considerations for inherited
tables (and typed tables) separate from those for partitioned tables. But...
This begs a larger question that I did not try to answer in this
design/patch - for partitions, do we need to have any catalog entries
other than the pg_class tuple?
Everything should start from the requirements of the optimization approach. Once we have that clear, we can confirm other requirements.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 18 August 2015 at 12:23, Amit Langote <amitlangote09@gmail.com> wrote:
> Hi Thom,
>
> On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown <thom@linux.com> wrote:
>> Wow, didn't expect to see that email this morning.
>>
>> A very quick test:
>>
>> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item
>> text) partition by range on ((extract(year from
>> purchase_time)),(extract(month from purchase_time)));
>> ERROR: referenced relation "purchases" is not a table or foreign table
>
> Thanks for the quick test.
>
> Damn, I somehow missed adding the new relkind to a check in
> process_owned_by(). Will fix this and look for any such oversights.

This doesn't seem to have materialised. Are you still working on this?

Thom
On Wednesday, 21 October 2015, Thom Brown <thom@linux.com> wrote:
On 18 August 2015 at 12:23, Amit Langote <amitlangote09@gmail.com> wrote:
> Hi Thom,
>
> On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown <thom@linux.com> wrote:
>>
>>
>> Wow, didn't expect to see that email this morning.
>>
>> A very quick test:
>>
>> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item
>> text) partition by range on ((extract(year from
>> purchase_time)),(extract(month from purchase_time)));
>> ERROR: referenced relation "purchases" is not a table or foreign table
>>
>
> Thanks for the quick test.
>
> Damn, I somehow missed adding the new relkind to a check in
> process_owned_by(). Will fix this and look for any such oversights.
This doesn't seem to have materialised. Are you still working on this?
Yes, I will be posting to this thread soon. Sorry about the silence.
Thanks,
Amit
On 20 October 2015 at 18:34, Amit Langote <amitlangote09@gmail.com> wrote:
> On Wednesday, 21 October 2015, Thom Brown <thom@linux.com> wrote:
>> On 18 August 2015 at 12:23, Amit Langote <amitlangote09@gmail.com> wrote:
>>> Hi Thom,
>>>
>>> On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown <thom@linux.com> wrote:
>>>> Wow, didn't expect to see that email this morning.
>>>>
>>>> A very quick test:
>>>>
>>>> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp,
>>>> item text) partition by range on ((extract(year from
>>>> purchase_time)),(extract(month from purchase_time)));
>>>> ERROR: referenced relation "purchases" is not a table or foreign table
>>>
>>> Thanks for the quick test.
>>>
>>> Damn, I somehow missed adding the new relkind to a check in
>>> process_owned_by(). Will fix this and look for any such oversights.
>>
>> This doesn't seem to have materialised. Are you still working on this?
>
> Yes, I will be posting to this thread soon. Sorry about the silence.

Thanks. I'd like to test it again.

Thom
On 2015/08/26 23:01, Simon Riggs wrote:
> The following review has been posted through the commitfest application:
> ...
>
> Needs planner work and tests of that. ALTER TABLE etc can wait.
>
> The new status of this patch is: Waiting on Author

Alright, after spending much time on this I came up with the attached. I'm sincerely sorry this took so long. I'll describe below what's in those patches.

The DDL and catalogs part is not much different from what I had last described, though I took a few steps to simplify things. I dropped the multi-level partitioning bit and changed the range partition creation syntax a little. The ability to specify both min and max bounds for every range partition had made it much more difficult to construct and use the internal representation. Anyway, the attached documentation patch (and the regression tests, to some degree) should give a picture of what the user commands look like as of now, although things are very much subject to overhaul in this general area of user-visible commands and features.

As pointed out by Simon, there is a need to invent a better internal representation of partitioning constraints/metadata to address the planning considerations more effectively and scalably. To that end, I have tried to implement a number of things -

There is a proposed set_partitioned_rel_size() that calls the planner's (proposed) partitioning module, which takes the RelOptInfo of a partitioned table and returns a list of OIDs of the partitions selected to be scanned based on the baserestrictinfo (other partitions are pruned!). The partitioned rel's size is then computed using size estimates for the selected partitions. Later, the proposed set_partitioned_rel_pathlist() will create an Append/MergeAppend path with the cheapest paths for the individual partitions as its subpaths.

It seems important to point out that the selected partitions each get simple_rte_array and simple_rel_array slots (note that those arrays need to be repalloc'd to make those slots available). The partition reloptinfos are of kind RELOPT_PARTITION_REL, which is functionally quite similar to RELOPT_OTHER_MEMBER_REL but different in that it does not require using any auxiliary structures like AppendRelInfo and append_rel_list. The parent-partition mapping is still necessary and is achieved using a field called parent_relid in the partition reloptinfo.

The proposed partitioning module (in the planner) figures out which of the clauses to use for partition pruning. The rules for matching clauses to the partition key look similar to those used to match them to an index, at least with respect to the key representation. Key info includes attribute number, expression tree (for expression key columns) and operator family OID, per key column. While clauses matched to an index are returned as-is to make index paths with, things work differently for clauses matched to partition key column(s). For each matched clause, the Const operand and the operator are added to an operand list and an operator list, respectively, for the matched column. Operands are later evaluated to datums. The datum arrays along with the operator OIDs are passed along to another (general-purpose) partitioning module (proposed). That module returns a list of OIDs of selected partitions after comparing the clause operand datums with the partition bounds. The operator strategies (of the passed operators) coupled with the partitioning method (list or range, currently) determine how partitions are selected. For some partitioning methods, faster partition selection is enabled; for example, being able to use binary search helps make selecting range partitions reasonably fast.

Note that the clauses matched for a given column are effectively ANDed with one another. At the moment, using OR clauses disables partition pruning. But it would be a matter of improving the structure used to pass around the operands and operators to also convey the logical structure of the clauses. Also, even though range partitioning allows a multi-column key, only the clauses referencing the first key column are used to select partitions. A multi-column range partition pruning implementation is saved for later.

Note that the partition bounds are themselves stored as arrays of datums computed on demand (cached in RelationData as part of a structure called PartitionDesc, which stores other info about partitions like their OIDs).

There are not many executor changes currently, except those made to ensure sane I/U/D behaviors (made to the respective ExecI/U/D functions). We can certainly think about implementing a specialized scan node for partitioned tables, but I have put that off for now.

The attached user documentation and regression test patches, though not quite comprehensive, should help one get started.

Thanks,
Amit
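[Editorial note: a hypothetical illustration of the pruning behaviour described above; the table and partition names are invented, and the FOR VALUES spelling follows the earlier proposal rather than whatever the latest patch uses.]

CREATE TABLE events (evt_time timestamp NOT NULL, payload text)
  PARTITION BY RANGE ON (evt_time);

CREATE TABLE events_2015_10 PARTITION OF events
  FOR VALUES START ('2015-10-01') END ('2015-11-01');
CREATE TABLE events_2015_11 PARTITION OF events
  FOR VALUES START ('2015-11-01') END ('2015-12-01');

-- Both clauses reference the leading (and only) range key column, so the
-- planner module would compare their operands against the stored bounds
-- and only events_2015_10 would survive pruning.
SELECT count(*) FROM events
WHERE evt_time >= '2015-10-05' AND evt_time < '2015-10-12';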
On Fri, Oct 30, 2015 at 6:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The DDL and catalogs part are not much different from what I had last
> described though I took a few steps to simplify things. I dropped the
> multi-level partitioning bit

Hmm, that doesn't sound good to me. I think multi-level partitioning is a reasonably important use case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015/11/06 1:29, Robert Haas wrote:
> On Fri, Oct 30, 2015 at 6:08 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The DDL and catalogs part are not much different from what I had last
>> described though I took a few steps to simplify things. I dropped the
>> multi-level partitioning bit
>
> Hmm, that doesn't sound good to me. I think multi-level partitioning
> is a reasonably important use case.

I agree. I'm in the process of reformulating this proposal from the syntax, catalog and DDL-centric perspective and will re-incorporate the multi-level partitioning notion into it. It was a mistake to drop it.

I am thinking of introducing an explicit notion of a sub-partition key and sub-partitions (of the top parent, as far as the syntactic notation is concerned). I guess it would not be unreasonable to think that most use cases that multi-level partitioning serves require at most 2 levels. That will enable us to use a more intuitive syntax and make the internals easier to manage.

Thanks,
Amit
On 20 November 2015 at 09:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
--
On 2015/11/06 1:29, Robert Haas wrote:
> On Fri, Oct 30, 2015 at 6:08 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The DDL and catalogs part are not much different from what I had last
>> described though I took a few steps to simplify things. I dropped the
>> multi-level partitioning bit
>
> Hmm, that doesn't sound good to me. I think multi-level partitioning
> is a reasonably important use case.
I agree. I'm in the process of reformulating this proposal from the
syntax, catalog and DDL -centric perspective and will re-incorporate
multi-level partitioning notion into it. It was a mistake to drop it.
Drop it?? I think he means "in this initial patch", right Amit L ?
I don't really understand why parallel query was pursued in small pieces, but partitioning needs to happen all in one huge patch. Wishing too many things is going to slow down this feature.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 20, 2015 at 7:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 20 November 2015 at 09:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2015/11/06 1:29, Robert Haas wrote:
>>> On Fri, Oct 30, 2015 at 6:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> The DDL and catalogs part are not much different from what I had last described though I took a few steps to simplify things. I dropped the multi-level partitioning bit
>>>
>>> Hmm, that doesn't sound good to me. I think multi-level partitioning is a reasonably important use case.
>>
>> I agree. I'm in the process of reformulating this proposal from the syntax, catalog and DDL-centric perspective and will re-incorporate multi-level partitioning notion into it. It was a mistake to drop it.
>
> Drop it?? I think he means "in this initial patch", right Amit L ?

Yes, there was some notion of multi-level partitioning in the earlier patch but I removed it from the version I posted on Oct 30. I do intend to re-introduce it though.

Thanks,
Amit
Amit Langote wrote:
> On Fri, Nov 20, 2015 at 7:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Drop it?? I think he means "in this initial patch", right Amit L ?
>
> Yes, there was some notion of multi-level partitioning in the earlier
> patch but I removed it from the version I posted on Oct 30. I do
> intend to re-introduce it though.

I'm with Simon. In my own experience, it's crazy to try to introduce a huge feature such as this one in a single enormous commit. The last patch you posted was 300 kb without any SGML changes.

The way parallel query is being introduced is a good example to follow (or logical multi-master replication, for that matter) --- one infrastructure piece at a time.

Let's build incrementally.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 20, 2015 at 6:44 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Amit Langote wrote:
>> On Fri, Nov 20, 2015 at 7:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Drop it?? I think he means "in this initial patch", right Amit L ?
>>
>> Yes, there was some notion of multi-level partitioning in the earlier
>> patch but I removed it from the version I posted on Oct 30. I do
>> intend to re-introduce it though.
>
> I'm with Simon. In my own experience, it's crazy to try to introduce a
> huge feature such as this one in a single enormous commit. The last
> patch you posted was 300 kb without any SGML changes.
>
> The way parallel query is being introduced is a good example to follow
> (or logical multi-master replication, for that matter) --- one
> infrastructure piece at a time.
>
> Let's build incrementally.

I support building incrementally, but I don't see why we want to change the catalog structure and then change it again. That seems like it makes the project more work, not less.

To me, it seems like there is a pretty obvious approach here: each table can be either a plain table, or a partition root (which can look just like an empty inheritance parent). Then multi-level partitioning falls right out of that design without needing to do anything extra. If you want a single level of partitioning, you partition the original table. If you want two levels of partitioning, you partition the partitions. If you want three levels of partitioning, you partition those. It's being made out here that limiting ourselves to a single level of partitioning makes things simpler, but it's not very clear to me that this is actually true.

I think it is also worth getting the syntax right from the beginning. Even if we decide that patch #1 won't support multi-level partitioning, we should have a plan for the syntax that can be extended to multi-level partitioning. If we decide after releasing partitioning with one syntax that we really wish we'd used some other syntax, that is going to be a really big problem - deprecating the use of => or introducing standard_conforming_strings were projects that took many years to complete. We really only get one shot to get that right. That doesn't mean it's all got to be there in version one, but there had better be a way to extend it to all the things we want to do later or we are going to be deeply, deeply sad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
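[Editorial note: a minimal sketch of the "partition the partitions" idea above, using the CREATE TABLE ... PARTITION OF syntax from the original proposal; the names, bounds and final spelling are all provisional.]

CREATE TABLE orders (order_id bigint, order_date date, region text)
  PARTITION BY RANGE ON (order_date);

-- Level 1: a partition that is itself a partition root
CREATE TABLE orders_2015 PARTITION OF orders
  FOR VALUES START ('2015-01-01') END ('2016-01-01')
  PARTITION BY LIST ON (region);

-- Level 2: a partition of the partition
CREATE TABLE orders_2015_emea PARTITION OF orders_2015
  FOR VALUES IN ('EMEA');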
Robert Haas wrote:
> I support building incrementally, but I don't see why we want to
> change the catalog structure and then change it again. That seems
> like it makes the project more work, not less.

I agree with what you say. I thought you were saying that the implementation had to provide multi-level partitioning from the get-go, not just the design.

> To me, it seems like there is a pretty obvious approach here: each
> table can be either a plain table, or a partition root (which can look
> just like an empty inheritance parent). Then multi-level partitioning
> falls right out of that design without needing to do anything extra.

Sounds reasonable.

> I think it is also worth getting the syntax right from the beginning.

Yes, that's critical. We could implement the whole thing in gram.y and then have the unsupported cases throw errors; then it's easy to see that there are no grammar conflicts to deal with later.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 23, 2015 at 1:44 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> I support building incrementally, but I don't see why we want to
>> change the catalog structure and then change it again. That seems
>> like it makes the project more work, not less.
>
> I agree with what you say. I thought you were saying that the
> implementation had to provide multi-partitioning from the get-go, not
> just the design.

Well, I *hope* that's going to fall out naturally. If it doesn't, I can live with that. But I hope it will.

>> To me, it seems like there is a pretty obvious approach here: each
>> table can be either a plain table, or a partition root (which can look
>> just like an empty inheritance parent). Then multi-level partitioning
>> falls right out of that design without needing to do anything extra.
>
> Sounds reasonable.

Cool.

>> I think it is also worth getting the syntax right from the beginning.
>
> Yes, that's critical. We could implement the whole thing in gram.y and
> then have the unsupported cases throw errors; then it's easy to see that
> there are no grammar conflicts to deal with later.

That's worth considering, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 24, 2015 at 7:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 23, 2015 at 1:44 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Robert Haas wrote:
>>> I support building incrementally, but I don't see why we want to change the catalog structure and then change it again. That seems like it makes the project more work, not less.
>>
>> I agree with what you say. I thought you were saying that the implementation had to provide multi-partitioning from the get-go, not just the design.
>
> Well, I *hope* that's going to fall out naturally. If it doesn't, I can live with that. But I hope it will.
>
>>> To me, it seems like there is a pretty obvious approach here: each table can be either a plain table, or a partition root (which can look just like an empty inheritance parent). Then multi-level partitioning falls right out of that design without needing to do anything extra.
>>
>> Sounds reasonable.
>
> Cool.
>
>>> I think it is also worth getting the syntax right from the beginning.
>>
>> Yes, that's critical. We could implement the whole thing in gram.y and then have the unsupported cases throw errors; then it's easy to see that there are no grammar conflicts to deal with later.
>
> That's worth considering, too.

It seems that the consensus is to rework this patch a bit more. Returned with feedback then?

--
Michael
On 2015/12/02 15:41, Michael Paquier wrote: > It seems that the consensus is to rework a bit more this patch. > Returned with feedback then? Yes, as far as this commitfest is concerned. Or "moved to the next commitfest"? Not sure exactly which makes sense. Thanks, Amit
On Wed, Dec 2, 2015 at 3:55 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2015/12/02 15:41, Michael Paquier wrote: >> It seems that the consensus is to rework a bit more this patch. >> Returned with feedback then? > > Yes, as far as this commitfest is concerned. Or "moved to the next > commitfest"? Not sure exactly which makes sense. OK, then, switched as returned with feedback. -- Michael
On 2015/11/24 2:23, Robert Haas wrote:
> To me, it seems like there is a pretty obvious approach here: each table can be either a plain table, or a partition root (which can look just like an empty inheritance parent). Then multi-level partitioning falls right out of that design without needing to do anything extra. If you want a single level of partitioning, you partition the original table. If you want two levels of partitioning, you partition the partitions. If you want three levels of partitioning, you partition those. It's being made out here that limiting ourselves to a single level of partitioning makes things simpler, but it's not very clear to me that this is actually true.
>
> I think it is also worth getting the syntax right from the beginning. Even if we decide that patch #1 won't support multi-level partitioning, we should have a plan for the syntax that can be extended to multi-level partitioning. If we decide after releasing partitioning with one syntax that we really wish we'd used some other syntax, that is going to be a really big problem - deprecating the use of => or introducing standard_conforming_strings were projects that took many years to complete. We really only get one shot to get that right. That doesn't mean it's all got to be there in version one, but there had better be a way to extend it to all the things we want to do later or we are going to be deeply, deeply sad.

The previously proposed design would support multi-level partitioning with some adjustments. One of the reasons not to include it in the recent patch was a lack of clarity about the proper syntax and catalog organization. That wasn't really nice. So, here is a revised proposal for the syntax and some general notes. I will adjust my previous patch to follow whatever feedback I receive here.

I think we can provide an explicit SUBPARTITION keyword-based syntax for multi-level partitioning. It's not unreasonable to expect that all partitions at a given level are partitioned on the same key. IOW, why require declaring the partition key for each partition separately? Instead, using SUBPARTITION BY on the master table seems better.

Syntax to create a partitioned table (up to 2 levels of partitioning):

CREATE TABLE foo (
...
)
PARTITION BY R/L ON (key0)
SUBPARTITION BY R/L ON (key1)
[(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
    [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>], ...)],
  ...)];

The above creates two pg_partitioned_rel entries for foo, with partlevel 0 and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this creates pg_partition entries, with foo and foo_1 as partparent, respectively.

Why just 2 levels? - it seems commonplace and makes the syntax more intuitive? I guess it might be possible to generalize the syntax for multi-level partitioning. Ideas? If we want to support the notion of a sub-partition template in the future, that would require some thought, more importantly proper catalog organization for the same.

To add a partition to the table after the fact:

ALTER TABLE foo
CREATE PARTITION foo1 FOR VALUES <val> [<storage_params>] [<tblspc>]
    [(SUBPARTITION foo11 FOR VALUES <val> [<storage_params>] [<tblspc>], ...)];

To add a sub-partition to an existing partition:

ALTER TABLE foo
MODIFY PARTITION foo_1 CREATE SUBPARTITION foo_1_ FOR VALUES (val) [<storage_params>] [<tblspc>];

I considered ADD/ALTER instead of CREATE/MODIFY, but there exist grammar conflicts with ADD/ALTER column.

What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied directly to partitions on a case-by-case basis (they are tables under the hood, after all), we should restrict AT to the master table. Most of the AT changes implicitly propagate from the master table to its partitions. Some of them could be directly applied to partitions and/or sub-partitions, such as rename and storage manipulations like changing the tablespace, storage parameters (reloptions), etc.:

ALTER TABLE foo
RENAME PARTITION <partition-name> TO <new-name>;

ALTER TABLE foo
RENAME SUBPARTITION <sub-partition-name> TO <new-name>;

ALTER TABLE foo
SET TABLESPACE ... [DEFAULT] FOR PARTITION <partition-name>;

ALTER TABLE foo
SET TABLESPACE ... FOR SUBPARTITION <sub-partition-name>;

ALTER TABLE foo
SET (storage_parameter = value) [DEFAULT] FOR PARTITION <partition-name>;

ALTER TABLE foo
SET (storage_parameter = value) FOR SUBPARTITION <sub-partition-name>;

Note that the keyword DEFAULT in some cases above means: do not apply the new setting to existing sub-partitions of the partition, rather use the new setting for future sub-partitions. In the case of SET TABLESPACE, if the FOR PARTITION clause is not specified, all partitions (actually, "leaf" partitions) of the master table are moved to the new tablespace; is that necessary, or should we just disallow that and instead output an error asking to use [DEFAULT] FOR PARTITION/FOR SUBPARTITION to move only specific partitions/sub-partitions?

By the way, should we also allow changing the logging of partitions/sub-partitions as follows?

ALTER TABLE foo
MODIFY PARTITION <partition-name> SET {LOGGED | UNLOGGED};

ALTER TABLE foo
MODIFY SUBPARTITION <sub-partition-name> SET {LOGGED | UNLOGGED};

What about index constraints, ie, PRIMARY KEY, UNIQUE and EXCLUSION constraints? Two things must be clear here: these constraints cannot be created on individual partitions, and all partition columns (key0 + key1) must be the leading columns of the key. On a related note, creating an index on the master table should create the index on all "leaf" partitions. The index on the master table itself would be just a logical index. Should we allow creating or dropping indexes on partitions directly?

It would be interesting to talk about features like SPLIT, MERGE, EXCHANGE, but I'm inclined to leave them as future enhancements. For functionality similar to EXCHANGE, there are commands like ATTACH/DETACH in the latest patch. We could extend them to also consider sub-partitions:

ALTER TABLE foo
ATTACH PARTITION <partition-name> FOR VALUES (val)
USING [ TABLE ] <table-name> [SET VALID];

ALTER TABLE foo
MODIFY PARTITION foo_1 ATTACH SUBPARTITION <sub-partition-name> FOR VALUES (val)
USING [ TABLE ] <table-name> [SET VALID];

ALTER TABLE foo
DETACH PARTITION foo_1 USING [ TABLE ] <table-name>;

ALTER TABLE foo
MODIFY PARTITION foo_1 DETACH SUBPARTITION foo_1_1 USING [ TABLE ] <table-name>;

Note that all the commands described above refer to sub-partitions directly using names. These names refer to pg_class.relname of the corresponding relations, so any partition/sub-partition names have to be unique within a given schema. Maybe, if we stored the partition name in pg_partition, the system could generate a suitable pg_class name itself and relieve that restriction. It would then be possible to refer to a sub-partition using its pg_class name or its given name qualified by the partition name.

Let me also outline some significant internal implementation details:

* Introduce a new relkind - RELKIND_PARTITIONED_REL. Master/root tables in partitioning hierarchies are of this relkind. Following are some key properties of such relations -

They have a pg_partitioned_rel catalog entry containing information about the partition key (and sub-partition key, if any). RelationBuildDesc() initializes the partition key fields in RelationData; this consists of a linked list of structs corresponding to the keys at successive levels. If such a relation happens to be the target of I/U/D (or COPY FROM), the corresponding ResultRelInfo's partition key and partition descriptor fields are initialized. The latter consists of information about its immediate (level-1) partitions. In the case of U/D, the rewrite phase adds a tableoid junk attribute to the target list in addition to the traditional one (ctid) for ExecUpdate and ExecDelete.

* Next up is the notion of relations that are partitions at any level (including non-leaf levels). Following are some key properties of such relations -

They each have a pg_partition entry. They cannot be targets in I/U and COPY FROM. One cannot define rules, triggers, and RLS policies on them. Although, AR triggers defined on a partitioned master table are propagated to the "leaf" partitions. Note that RELKIND_PARTITIONED_REL relations and non-leaf partitions do not have any storage.

* How to handle multi-level partitioning - To map tuples to a "leaf" partition, CopyFrom(), ExecInsert() (also, ExecUpdate()) invoke a function called get_partition_for_tuple(). The function encapsulates details of partitioning such as the partitioning method (R/L) and multi-level recursion. A similar recursive approach could be used within the planner, too. A single invocation of the function get_rel_partitions() from, say, set_partitioned_rel_size() would return the list of "leaf" partitions of the rel, excluding pruned partitions (and sub-partitions thereof). That is, quals are matched against partition keys at all levels in the same invocation.

* Revised catalog structures:

pg_partitioned_rel:
  partrelid      Oid
  partmethod     char
  partlevel      int16
  partnatts      int16
  partkey        int2vector
  partclass      oidvector
  partexprs      pg_node_tree

pg_partition:
  partrelid        Oid
  partparent       Oid
  partname         NameData
  partlistvals     anyarray
  partrangemaxs    anyarray
  partvalid        bool
  partsubtemplate  bool
  partreloptions   text
  parttablespace   Oid

The fields of pg_partition following partrangemaxs are yet unused.

Thoughts?

Thanks,
Amit
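[Editorial note: for concreteness, a hypothetical instantiation of the syntax proposed above; the column names, bound values and the exact FOR VALUES spelling are only illustrative.]

CREATE TABLE foo (
    id       int,
    sale_dt  date,
    region   text
)
PARTITION BY RANGE ON (sale_dt)
SUBPARTITION BY LIST ON (region)
(PARTITION foo_2015 FOR VALUES START ('2015-01-01') END ('2016-01-01')
    (SUBPARTITION foo_2015_emea FOR VALUES IN ('EMEA'),
     SUBPARTITION foo_2015_apac FOR VALUES IN ('APAC')),
 PARTITION foo_2016 FOR VALUES START ('2016-01-01') END ('2017-01-01'));

-- Adding another level-2 partition later, per the ALTER TABLE sketch above:
ALTER TABLE foo
MODIFY PARTITION foo_2016 CREATE SUBPARTITION foo_2016_emea FOR VALUES ('EMEA');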
On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Syntax to create a partitioned table (up to 2 levels of partitioning):
>
> CREATE TABLE foo (
> ...
> )
> PARTITION BY R/L ON (key0)
> SUBPARTITION BY R/L ON (key1)
> [(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
> [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>],
> ...)], ...)];
>
> The above creates two pg_partitioned_rel entries for foo with partlevel 0 and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this creates pg_partition entries, with foo and foo_1 as partparent, respectively.
>
> Why just 2 levels? - it seems commonplace and makes the syntax more intuitive? I guess it might be possible to generalize the syntax for multi-level partitioning. Ideas? If we want to support the notion of sub-partition template in future, that would require some thought, more importantly proper catalog organization for the same.

I do not think this is a particularly good idea. You're going to need to dump each partition separately at least in --binary-upgrade mode, because each is going to have its own magic OIDs that need to be restored, and also because there will most likely be at least some properties that are going to vary between partitions. You could require that every partition have exactly the same set of columns, constraints, rules, triggers, policies, attribute defaults, comments, column comments, and everything else that might be different from one partition to another, and further require that they have exactly matching indexes. It would take a fair amount of code to prohibit all that, but it could be done. However, do we really want that? There may well be some things where we want to enforce that the parent and the child are exactly identical, but I doubt we want that for absolutely every property, current and future, of the partition. And even if you did, because of the --binary-upgrade stuff, you still need to be able to dump them separately.

Therefore, I believe it is a whole lot better to make the primary syntax for table partitioning something where you issue a CREATE statement for the parent and then a CREATE statement for each child. If we want to also have a convenience syntax so that people who want to create a parent and a bunch of children in one fell swoop can do so, fine.

I would not choose to model the syntax for creating partitions on Oracle. I don't find that syntax particularly nice or easy to remember. I say PARTITION BY RANGE, and then inside the parentheses I use the PARTITION keyword for each partition? Really? But I think copying the style while having the details be incompatible is an even worse idea.

> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied directly to partitions on case-by-case basis (they are tables under the hood after all), we should restrict AT to the master table. Most of the AT changes implicitly propagate from the master table to its partitions. Some of them could be directly applied to partitions and/or sub-partitions such as rename, storage manipulations like - changing tablespace, storage parameters (reloptions), etc.:
>
> ALTER TABLE foo
> RENAME PARTITION <partition-name> TO <new-name>;
>
> ALTER TABLE foo
> RENAME SUBPARTITION <sub-partition-name> TO <new-name>;
>
> ALTER TABLE foo
> SET TABLESPACE ... [DEFAULT] FOR PARTITION <partition-name>;
>
> ALTER TABLE foo
> SET TABLESPACE ... FOR SUBPARTITION <sub-partition-name>;
>
> ALTER TABLE foo
> SET (storage_parameter = value) [DEFAULT] FOR PARTITION <partition-name>;
>
> ALTER TABLE foo
> SET (storage_parameter = value) FOR SUBPARTITION <sub-partition-name>;

I don't think this is a very good idea. This is basically proposing that for every DDL command that you can apply to a table, you have to spell it differently for a partition. That seems like a lot of extra work for no additional functionality.

> By the way, should we also allow changing the logging of partitions/sub-partitions as follows?

Again, I think you're coming at this from the wrong direction. Instead of saying we're going to disallow all changes to the partitions and then deciding we need to allow certain changes after all, I think we should allow everything that is currently allowed for an inherited table and then decide which of those things we need to prohibit, and why. For example, if you insist that a child table has to have a tuple descriptor that matches the parent, that can improve efficiency: Append won't need to project, and so on. But it now becomes very difficult to support taking a stand-alone table and making it a partition of an existing partitioned table, because the set of dropped columns might not match. Having to give an error in that case amounts to "we're sorry, we can't attach your partition to the partitioning hierarchy because of some invisible state that you can't see", which isn't very nice. Now I'm not saying that isn't the right decision, but I think the design choices here need to be carefully thought about.

Stepping away from that particular example, a blanket prohibition on changing any attribute of a child table seems like it will prohibit a lot of useful things that really ought to work. And again, I don't think it's a good idea to implement separate syntax for changing a partition vs. changing a table. If I want to set a partition as unlogged, I should be able to say ALTER TABLE partition_name UNLOGGED or maybe ALTER PARTITION partition_name UNLOGGED, not be forced to use some new grammar production that looks completely different.

> What about index constraints, ie, PRIMARY KEY, UNIQUE and EXCLUSION constraints - 2 things must be clear here: cannot create these constraints on individual partitions and all partition columns (key0 + key1) must be the leading columns of the key. On a related note, creating index on the master table should create the index on all "leaf" partitions. The index on the master table itself would be just a logical index. Should we allow creating or dropping indexes on partitions directly?

I don't find this to be particularly clear. You are assuming that nobody wants to create a constraint that a certain value is unique within a partition. That might not be a real common thing to want to do, but it could certainly be useful to somebody, and the current system with table inheritance allows it. For example, suppose that we have orders partitioned on the order_date column, by month. The user might want to create a UNIQUE index on order_id on each partition. Maybe they start over with order_id 1 at the beginning of each month. But even if, as is more likely, the order IDs keep counting up from month to month, they don't want to be forced to include the whole partitioning key in the index in order to have it marked UNIQUE. That may be enough, in practice, to ensure the global uniqueness of order IDs even though the system doesn't technically enforce it in all cases.

If you want an index created on the parent to cascade down to all children, that's a big project to get right. Suppose I create an index on the parent table. After a while, I notice that it's getting bloated, so I create another index with the same definition. Now, I drop one of the two indexes. One of the two indexes from each child table needs to go away, and moreover it can't be picked arbitrarily - it has to be the one that was created at the same time as the parent index I'm dropping. If you want it to behave like this, you need a whole system of bookkeeping to make it work right.

For version 1, I'd go the other way and prohibit index definitions on the empty parent rels. Let people create indexes on the children as they wish. In a future version, we can add stuff to cascade from parents to children.

> It would be interesting to talk about features like SPLIT, MERGE, EXCHANGE but I'm inclined to leave them as future enhancements. For a functionality similar to EXCHANGE, there are commands like ATTACH/DETACH in the latest patch. We could extend them to also consider sub-partitions:

We don't need to have these in the first version, but we have to make some architectural decisions that affect how feasible they are to implement and in which cases, as noted above.

> One cannot define rules, triggers, and RLS policies on them. Although, AR triggers defined on a partitioned master table are propagated to the "leaf" partitions.

What value do you see us getting out of restricting these particular things?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
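[Editorial note: the orders scenario above is something today's inheritance-based partitioning already allows; a small hypothetical example (names invented) of a per-partition UNIQUE index that does not include the partitioning column.]

CREATE TABLE orders (order_id bigint, order_date date NOT NULL, total numeric);

CREATE TABLE orders_2015_12 (
    CHECK (order_date >= DATE '2015-12-01' AND order_date < DATE '2016-01-01')
) INHERITS (orders);

-- Uniqueness is enforced only within this child; with monotonically
-- increasing order IDs that is usually enough in practice, even though
-- the system does not enforce global uniqueness.
CREATE UNIQUE INDEX orders_2015_12_order_id_key ON orders_2015_12 (order_id);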
On 2015/12/18 3:56, Robert Haas wrote:
> On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Syntax to create a partitioned table (up to 2 levels of partitioning):
>>
>> CREATE TABLE foo (
>> ...
>> )
>> PARTITION BY R/L ON (key0)
>> SUBPARTITION BY R/L ON (key1)
>> [(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
>> [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>],
>> ...)], ...)];
>>
>> The above creates two pg_partitioned_rel entries for foo with partlevel 0 and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this creates pg_partition entries, with foo and foo_1 as partparent, respectively.
>>
>> Why just 2 levels? - it seems commonplace and makes the syntax more intuitive? I guess it might be possible to generalize the syntax for multi-level partitioning. Ideas? If we want to support the notion of sub-partition template in future, that would require some thought, more importantly proper catalog organization for the same.
>
> I do not think this is a particularly good idea. You're going to need to dump each partition separately at least in --binary-upgrade mode, because each is going to have its own magic OIDs that need to be restored, and also because there will most likely be at least some properties that are going to vary between partitions. You could require that every partition have exactly the same set of columns, constraints, rules, triggers, policies, attribute defaults, comments, column comments, and everything else that might be different from one partition to another, and further require that they have exactly matching indexes. It would take a fair amount of code to prohibit all that, but it could be done. However, do we really want that? There may well be some things where we want to enforce that the parent and the child are exactly identical, but I doubt we want that for absolutely every property, current and future, of the partition. And even if you did, because of the --binary-upgrade stuff, you still need to be able to dump them separately.
>
> Therefore, I believe it is a whole lot better to make the primary syntax for table partitioning something where you issue a CREATE statement for the parent and then a CREATE statement for each child. If we want to also have a convenience syntax so that people who want to create a parent and a bunch of children in one fell swoop can do so, fine.

Regarding --binary-upgrade dump mode, how about we teach pg_dump to dump each partition separately using ALTER TABLE parent ADD PARTITION, especially for the "magic OIDs" reason? It may very well be a CREATE PARTITION-style command though. Note that each such command could specify properties that can be different per partition. I said in my email, perhaps not so clearly, that "only" WITH options, tablespace and relpersistence can be different per partition. But I can see why that may be severely restrictive at this stage.

By the way, what do you think about the SUBPARTITION keyword-based syntax for multi-level partitioning? Should we instead require that each partition has its own PARTITION BY in its creation command?

> I would not choose to model the syntax for creating partitions on Oracle. I don't find that syntax particularly nice or easy to remember. I say PARTITION BY RANGE, and then inside the parentheses I use the PARTITION keyword for each partition? Really? But I think copying the style while having the details be incompatible is an even worse idea.

As for the convenience syntax (if at all), how about:

CREATE TABLE foo (
...
)
PARTITION BY ... ON (...)
SUBPARTITION BY ... ON (...)
opt_partition_list;

where opt_partition_list is:

PARTITIONS (
  partname FOR VALUES ... [WITH] [TABLESPACE] opt_subpart_list
  [, ...]
)

where opt_subpart_list is:

SUBPARTITIONS (
  subpartname FOR VALUES ... [WITH] [TABLESPACE]
  [, ...]
)

PARTITIONS and SUBPARTITIONS would be new unreserved keywords. Or we can do away with them.

>> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied directly to partitions on case-by-case basis (they are tables under the hood after all), we should restrict AT to the master table. Most of the AT changes implicitly propagate from the master table to its partitions. Some of them could be directly applied to partitions and/or sub-partitions such as rename, storage manipulations like - changing tablespace, storage parameters (reloptions), etc.: [ ... ]
>
> I don't think this is a very good idea. This is basically proposing that for every DDL command that you can apply to a table, you have to spell it differently for a partition. That seems like a lot of extra work for no additional functionality.

I didn't mean to propose special syntax for every available DDL command - just those that modify properties that could be different per partition. I thought there would be very few such properties and hence not a lot of special commands. But then again, I may be wrong about what those properties are.

>> By the way, should we also allow changing the logging of partitions/sub-partitions as follows?
>
> Again, I think you're coming at this from the wrong direction. Instead of saying we're going to disallow all changes to the partitions and then deciding we need to allow certain changes after all, I think we should allow everything that is currently allowed for an inherited table and then decide which of those things we need to prohibit, and why. For example, if you insist that a child table has to have a tuple descriptor that matches the parent, that can improve efficiency: Append won't need to project, and so on. But it now becomes very difficult to support taking a stand-alone table and making it a partition of an existing partitioned table, because the set of dropped columns might not match. Having to give an error in that case amounts to "we're sorry, we can't attach your partition to the partitioning hierarchy because of some invisible state that you can't see", which isn't very nice. Now I'm not saying that isn't the right decision, but I think the design choices here need to be carefully thought about.

Yeah, I am concerned about the ATTACH PARTITION USING TABLE case for the very point you mention. And I can see how it may result from the restrictive model I propose. FWIW, other databases impose a number of restrictions on the partition roll-in case, though I'm not sure whether it is for the same internal reasons we might have.

> Stepping away from that particular example, a blanket prohibition on changing any attribute of a child table seems like it will prohibit a lot of useful things that really ought to work. And again, I don't think it's a good idea to implement separate syntax for changing a partition vs. changing a table. If I want to set a partition as unlogged, I should be able to say ALTER TABLE partition_name UNLOGGED or maybe ALTER PARTITION partition_name UNLOGGED, not be forced to use some new grammar production that looks completely different.

Okay. ALTER PARTITION may be the way to go.

>> What about index constraints, ie, PRIMARY KEY, UNIQUE and EXCLUSION constraints - 2 things must be clear here: cannot create these constraints on individual partitions and all partition columns (key0 + key1) must be the leading columns of the key. On a related note, creating index on the master table should create the index on all "leaf" partitions. The index on the master table itself would be just a logical index. Should we allow creating or dropping indexes on partitions directly?
>
> I don't find this to be particularly clear. You are assuming that nobody wants to create a constraint that a certain value is unique within a partition. That might not be a real common thing to want to do, but it could certainly be useful to somebody, and the current system with table inheritance allows it. For example, suppose that we have orders partitioned on the order_date column, by month. The user might want to create a UNIQUE index on order_id on each partition. Maybe they start over with order_id 1 at the beginning of each month. But even if, as is more likely, the order IDs keep counting up from month to month, they don't want to be forced to include the whole partitioning key in the index in order to have it marked UNIQUE. That may be enough, in practice, to ensure the global uniqueness of order IDs even though the system doesn't technically enforce it in all cases.

Okay, I didn't consider that one may want to create a UNIQUE constraint per partition. Do they need to be allowed to be different per partition? Sorry, I could not understand your last sentence regarding global uniqueness. Given the restrictions on its definition, in what ways could the system fail to enforce it?

> If you want an index created on the parent to cascade down to all children, that's a big project to get right. Suppose I create an index on the parent table. After a while, I notice that it's getting bloated, so I create another index with the same definition. Now, I drop one of the two indexes. One of the two indexes from each child table needs to go away, and moreover it can't be picked arbitrarily - it has to be the one that was created at the same time as the parent index I'm dropping. If you want it to behave like this, you need a whole system of bookkeeping to make it work right.
>
> For version 1, I'd go the other way and prohibit index definitions on the empty parent rels. Let people create indexes on the children as they wish. In a future version, we can add stuff to cascade from parents to children.

Okay, I may be missing the subtleties of the global uniqueness case. I agree we can leave it out of version 1.

>> It would be interesting to talk about features like SPLIT, MERGE, EXCHANGE but I'm inclined to leave them as future enhancements. For a functionality similar to EXCHANGE, there are commands like ATTACH/DETACH in the latest patch. We could extend them to also consider sub-partitions:
>
> We don't need to have these in the first version, but we have to make some architectural decisions that affect how feasible they are to implement and in which cases, as noted above.
>
>> One cannot define rules, triggers, and RLS policies on them. Although, AR triggers defined on a partitioned master table are propagated to the "leaf" partitions.
>
> What value do you see us getting out of restricting these particular things?

Perhaps I'm wrong in thinking that we should limit these to be associated with only the top-level partitioned table.

Thanks a lot for the comments!

Regards,
Amit
Thought a bit more on some points (see below). To anyone interested in getting involved with the review: I'm working on getting a revised version of the patch out soon. However, I must also mention that we need to reach consensus on some broader design issues before any meaningful (or fruitful) code review can be done. Please feel free to post your thoughts about design, syntax, etc.

On 2015/12/22 10:51, Amit Langote wrote:
> On 2015/12/18 3:56, Robert Haas wrote:
>> On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote wrote:
>>> Syntax to create a partitioned table (up to 2 levels of partitioning):
>>>
>>> CREATE TABLE foo (
>>> ...
>>> )
>>> PARTITION BY R/L ON (key0)
>>> SUBPARTITION BY R/L ON (key1)
>>> [(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
>>> [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>],
>>> ...)], ...)];
>>> [ ... ]
>>
>> Therefore, I believe it is a whole lot better to make the primary syntax for table partitioning something where you issue a CREATE statement for the parent and then a CREATE statement for each child. If we want to also have a convenience syntax so that people who want to create a parent and a bunch of children in one fell swoop can do so, fine.
> [ ... ]
>
> By the way, what do you think about SUBPARTITION keyword-based syntax for multi-level partitioning? Should we instead require that each partition has its own PARTITION BY in its creation command?

If we have a CREATE statement for each partition, how do we generalize that to partitions at different levels? For example, if we use something like the following to create a partition of parent_name:

CREATE PARTITION partition_name OF parent_name FOR VALUES ...
WITH ... TABLESPACE ...

Do we then say:

CREATE PARTITION subpartition_name OF partition_name ...

to create a level 2 partition (sub-partition) of parent_name? Obviously, as is readily apparent from the command, it is still a direct partition of partition_name for all internal purposes (consider partition list caching in relcache, recursive tuple routing, etc.), save some others.

I ask that also because it's related to the choice of syntax to use to declare the partition key for the multi-level case. I'm considering the SUBPARTITION BY notation and perhaps we could generalize it to more than just 2 levels. So, for the above case, parent_name would have been created as:

CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ...

Needless to say, when subpartition_name is created with the command we saw a moment ago, the root partitioned table would be locked. In fact, adding a partition anywhere in the hierarchy needs an exclusive lock on the root table. Also, the partition rule (the FOR VALUES clause) would be validated against the PARTITION BY or SUBPARTITION BY clause at the respective level.

Although, I must admit I feel a little uneasy about the inherent asymmetry of using SUBPARTITION BY for the key declaration while piggybacking CREATE PARTITION for creating sub-partitions. Is there a better way?

> As for the convenience syntax (if at all), how about:
>
> CREATE TABLE foo (
> ...
> )
> PARTITION BY ... ON (...)
> SUBPARTITION BY ... ON (...)
> opt_partition_list;
>
> where opt_partition_list is:
>
> PARTITIONS (
> partname FOR VALUES ... [WITH] [TABLESPACE] opt_subpart_list
> [, ...]
> )
>
> where opt_subpart_list is:
>
> SUBPARTITIONS (
> subpartname FOR VALUES ... [WITH] [TABLESPACE]
> [, ...]
> )

Do we want this at all? It seems difficult to generalize this to a multi-level hierarchy of more than 2 levels.

>>> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied directly to partitions on case-by-case basis (they are tables under the hood after all), we should restrict AT to the master table. Most of the AT changes implicitly propagate from the master table to its partitions. Some of them could be directly applied to partitions and/or sub-partitions such as rename, storage manipulations like - changing tablespace, storage parameters (reloptions), etc.: [ ... ]
>>
>> I don't think this is a very good idea. This is basically proposing that for every DDL command that you can apply to a table, you have to spell it differently for a partition. That seems like a lot of extra work for no additional functionality.
> [ ... ]
>>> By the way, should we also allow changing the logging of partitions/sub-partitions as follows?
>>
>> Again, I think you're coming at this from the wrong direction. Instead of saying we're going to disallow all changes to the partitions and then deciding we need to allow certain changes after all, I think we should allow everything that is currently allowed for an inherited table and then decide which of those things we need to prohibit, and why. For example, if you insist that a child table has to have a tuple descriptor that matches the parent, that can improve efficiency: Append won't need to project, and so on. But it now becomes very difficult to support taking a stand-alone table and making it a partition of an existing partitioned table, because the set of dropped columns might not match. Having to give an error in that case amounts to "we're sorry, we can't attach your partition to the partitioning hierarchy because of some invisible state that you can't see", which isn't very nice. Now I'm not saying that isn't the right decision, but I think the design choices here need to be carefully thought about.
>
> Yeah, I am concerned about the ATTACH PARTITION USING TABLE case for the very point you mention. And I can see how it may result from the restrictive model I propose. FWIW, other databases impose a number of restrictions on the partition roll-in case but not sure if for the internal reasons we might want to.

After thinking some more on this - I think that identical tuple descriptors may not just be a nice-to-have but critical in some cases. For example, consider built-in/trigger-less tuple routing. I'd imagine that the partition to insert a tuple into would be determined just before calling heap_insert() in ExecInsert() and CopyFrom(). That means the HeapTuple that is passed to heap_insert() to insert into the partition would be based on the root table's tuple descriptor. Note also that the tuple would have passed through the BR and IR triggers and constraints of the root table. When the data is eventually queried from the partitions directly, or even via the root table (considering existing executor capabilities), the partition's tuple descriptor at that point had better match the data that went onto the disk. That means we had better keep at least the following things in sync: the number of attributes, and the name, position (attnum), type and not-null-ness of individual attributes.

So, in order to do that, recursively apply ADD/DROP COLUMN, SET WITH/WITHOUT OIDS, RENAME COLUMN, ALTER COLUMN TYPE, SET/DROP NOT NULL on the root table to all the partitions, and prevent those sub-commands from being directly applied to any table (partition) in the partitioning hierarchy other than the root. I further don't see the point of allowing column defaults to be set (or dropped) on partitions, because INSERT and COPY FROM can no longer be applied directly to partitions. A similar argument could be made for BR and IR triggers and CHECK constraints. Am I missing something in all of this?

An alternative to doing any of that may very well be to design trigger-less tuple routing to be smarter about possible mismatches of the tuple descriptors, but I haven't given that a lot of thought. Is that really an alternative worth looking into?

Coming to the ATTACH PARTITION case, I am not quite sure how we can make it as forgiving as ALTER TABLE INHERIT is, given that we'd expect things like trigger-less tuple routing to work out of the box as described above. That said, I would not dare propose an INCLUDING DROPPED extension to LIKE - or would it stick if I did? ;)

Unrelated to the above, I have been thinking about the planner considerations for partition hierarchies with minimal changes, considering that we'd want to keep it that way for the first cut:

1) At least some optimizer/prep changes to handle adding partitions to PlannerInfo's append_rel_list would be necessary.

2) To be able to use constraint exclusion, we'd need to create equivalent CHECK constraints (non-dumpable) when partitions are created and perform all the necessary bookkeeping, for the time being.

You might wonder why (1) requires any special handling - wouldn't "partition hierarchies" still be "inheritance hierarchies" and so work with existing code, albeit with unnecessary overheads like Var translation lists? Let me think out loud about the general background on this so someone can correct me if it seems I'm going at it all wrong -

On one hand, I think continuing to treat "partition hierarchies" as "inheritance hierarchies" might have some issues. I am afraid that the documented inheritance semantics may not be what we want to keep using for the new partitioned tables. By that, I mean all the user-facing behaviors where inheritance has some bearing. Should it also affect the new partitioned tables? Consider whether inheritance semantics would render infeasible some of the things that we'd like to introduce for the new partitioned tables, such as automatic tuple routing, or keep us from improving planner smarts and executor capabilities for partitioned tables over what we already have.

OTOH, I may be looking at it wrongly. We would not be required to enforce user-facing inheritance behaviors on the new partitioned tables after all. That is to say - the new partitioned tables could still use the relevant inheritance infrastructure behind the scenes for planning, execution and a few other things, without abiding by regular inheritance semantics. I should just go ahead and add special cases in all places where existing inheritance handling code stands to cause trouble down the line for partitioned tables. We might want to mention somewhere in the documentation that we do so, and also note that regular inheritance semantics do not apply. While that probably sounds fine as the implementation for the initially released feature, a day will inevitably come when this behind-the-scenes implementation will be changed to something more amenable to better optimization. But that's for the future...

There's a great chance that not everyone cares right now about this part of the new partitioning, but I just wanted to put it out there. There are more contentious issues like the syntax, the partitioning maintenance commands that we plan to support (now or later), and such. Any advice is greatly appreciated.

Thanks,
Amit
If we have a CREATE statement for each partition, how do we generalize
that to partitions at different levels? For example, if we use something
like the following to create a partition of parent_name:
CREATE PARTITION partition_name OF parent_name FOR VALUES ...
WITH ... TABLESPACE ...
Do we then say:
CREATE PARTITION subpartition_name OF partition_name ...
That's how I'd want it for partitions created after the initial partitioned table is created.
I'd like to be able to identify the parent partition by its own partitioning parameters rather than name, like the way we can derive the name of an index in ON CONFLICT. But I see no clean way to do that, and if one did come up, we'd simply allow the user to replace
<partition_name>
with
table_name PARTITION partition_spec [ ...PARTITION partition_spec [ ...PARTITION turtles_all_the_way_down]].
Again, totally fine with forcing the maintenance script to know or discover the name of the partition to be subpartitioned...for now.
to create a level 2 partition (sub-partition) of parent_name? Obviously,
as is readily apparent from the command, it is still a direct partition of
partition_name for all internal purposes (consider partition list caching
in relcache, recursive tuple routing, etc.) save some others.
I ask that also because it's related to the choice of syntax to use to
declare the partition key for the multi-level case. I'm considering the
SUBPARTITION BY notation and perhaps we could generalize it to more than
just 2 levels. So, for the above case, parent_name would have been created as:
CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ...
Needless to say, when subpartition_name is created with the command we saw
a moment ago, the root partitioned table would be locked. In fact, adding
a partition anywhere in the hierarchy needs an exclusive lock on the root
table. Also, partition rule (the FOR VALUES clause) would be validated
against PARTITION BY or SUBPARTITION BY clause at the respective level.
Although, I must admit I feel a little uneasy about the inherent asymmetry
in using SUBPARTITION BY for key declaration whereas piggybacking CREATE
PARTITION for creating sub-partitions. Is there a better way?
Provided that the syntax allows for N levels of partitioning, I don't care if it's
PARTITION BY.., PARTITION BY..., PARTITION BY ...
or
PARTITION BY.., SUBPARTITION BY..., SUBPARTITION BY ...
The first is probably better for meta-coding purposes, but the second makes it clear which partition layer is first.
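To make that concrete, here is a rough sketch of the two spellings for a hypothetical three-level key (table, column, and key choices are made up; neither form exists yet):

CREATE TABLE sales (region text, yr int, mon int, amt numeric)
PARTITION BY LIST ON (region)
PARTITION BY RANGE ON (yr)
PARTITION BY RANGE ON (mon);   -- repeated PARTITION BY, one clause per level

-- versus

CREATE TABLE sales (region text, yr int, mon int, amt numeric)
PARTITION BY LIST ON (region)
SUBPARTITION BY RANGE ON (yr)
SUBPARTITION BY RANGE ON (mon);   -- SUBPARTITION BY marks every level below the first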
> As for the convenience syntax (if at all), how about:
>
> CREATE TABLE foo (
> ...
> )
> PARTITION BY ... ON (...)
> SUBPARTITION BY ... ON (...)
> opt_partition_list;
>
> where opt_partition_list is:
>
> PARTITIONS (
> partname FOR VALUES ... [WITH] [TABLESPACE] opt_subpart_list
> [, ...]
> )
>
> where opt_subpart_list is:
>
> SUBPARTITIONS (
> subpartname FOR VALUES ... [WITH] [ TABLESPACE]
> [, ...]
> )
Do we want this at all? It seems difficult to generalize this to a
multi-level hierarchy of more than 2 levels.
I want this.
Granted, the syntax for 3+ levels of partitioning would be cumbersome, but it is what the user wanted, and the nested PARTITIONS/SUBPARTITIONS clauses make the levels explicit (sketched below). In those cases, the user might opt not to create more than the default first subpartition to keep the syntax sane, or we might auto-generate default partitions (with a VALUES clause of whatever "all values" is for that datatype; again, this is an area where leveraging range types would be most valuable).
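For illustration only, the quoted convenience syntax written out for two levels might look something like this (table, partition, and tablespace names are invented, and this form is not part of the posted patch):

CREATE TABLE measurements (logdate date NOT NULL, city text)
PARTITION BY RANGE ON (logdate)
SUBPARTITION BY LIST ON (city)
PARTITIONS (
    m_y2006 FOR VALUES START ('2006-01-01') END ('2007-01-01') TABLESPACE fast_disks
        SUBPARTITIONS (
            m_y2006_tokyo FOR VALUES IN ('Tokyo'),
            m_y2006_rest  FOR VALUES IN ('Osaka', 'Nagoya')
        ),
    m_y2007 FOR VALUES START ('2007-01-01') END ('2008-01-01')
);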
On one hand, I think to keep treating "partition hierarchies" as
"inheritance hierachies" might have some issues. I am afraid that
documented inheritance semantics may not be what we want to keep using for
the new partitioned tables. By that, I mean all the user-facing behaviors
where inheritance has some bearing. Should it also affect new partitioned
tables? Consider whether inheritance semantics would render infeasible
some of the things that we'd like to introduce for the new partitioned
tables such as automatic tuple routing, or keep us from improving planner
smarts and executor capabilities for partitioned tables over what we
already have.
I feel that automatic tuple routing should be considered the key benefit of "real" partitions over inherited tables. Trigger maintenance is most of the work in custom partitioning schemes, at least the ones I've written.
There's a great chance that not everyone cares right now about this part
of the new partitioning but just want to put it out there. There are more
contentious issues like the syntax, partitioning maintenance commands that
we plan to support (now or later) and such.
What I've read so far addresses most of my concerns.
Still somewhat on my mind:
1. ability to describe partition bounds via range types, regardless of whether the Automatic Tuple Routing uses those types internally.
2. syntax for splitting a partition in two, merging two adjacent partitions (you probably touched on these earlier and I missed it or forgot).
3. ability to swap a partition with a table not currently associated with the partitioned table.
4. The applicability of this syntax to materialized views, allowing us to do REFRESH CONCURRENTLY a few parts at a time, or only refreshing the data we know needs it.
Items 2 and 3 don't have to be implemented right away, as they're separate ALTER commands. 4 is a pipe dream. With Item 1 I ask only that we don't pick a syntax that prevents description via range types.
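To give items 2 and 3 some shape: item 3 roughly corresponds to the ATTACH/DETACH commands already proposed in this thread, while the split/merge commands for item 2 are purely hypothetical here (the command names are only borrowed from what other systems call these operations, and april_data is an invented table):

-- item 3: roll-in/roll-out as proposed earlier in the thread (ATTACH not yet implemented)
ALTER TABLE measurement ATTACH PARTITION measurement_y2006m04
    FOR VALUES START ('2006-04-01') END ('2006-05-01') USING TABLE april_data;
ALTER TABLE measurement DETACH PARTITION measurement_y2006m02 USING measurement_y2006m02_old;

-- item 2: hypothetical split/merge commands, not in any posted patch
ALTER TABLE measurement SPLIT PARTITION measurement_y2006m02
    AT ('2006-02-15') INTO (measurement_y2006m02a, measurement_y2006m02b);
ALTER TABLE measurement MERGE PARTITIONS measurement_y2006m02a, measurement_y2006m02b
    INTO measurement_y2006m02;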
Hi, On 2016/01/17 9:47, Corey Huinker wrote: >> If we have a CREATE statement for each partition, how do we generalize >> that to partitions at different levels? For example, if we use something >> like the following to create a partition of parent_name: >> >> CREATE PARTITION partition_name OF parent_name FOR VALUES ... >> WITH ... TABLESPACE ... >> >> Do we then say: >> >> CREATE PARTITION subpartition_name OF partition_name ... > > That's how I'd want it for partitions created after the initial partitioned > table is created. > > I'd like to be able to identify the parent partition by it's own > partitioning parameters rather than name, like the way we can derive the > name of an index in ON CONFLICT. But I see no clean way to do that, and if > one did come up, we'd simply allow the user to replace > <partition_name> > with > table_name PARTITION partition_spec [...PARTITION partition_spec [ > ...PARTITION turtles_all_the_way_down]]). > > Again, totally fine with forcing the maintenance script to know or discover > the name of the partition to be subpartitioned...for now. I'm thinking of going with this last option. So for now, you create an empty partitioned table specifying all the partition keys without being able to define any partitions in the same statement. Partitions (and partitions thereof, if any) will be created using CREATE PARTITION statements, one for each. >> to create a level 2 partition (sub-partition) of parent_name? Obviously, >> as is readily apparent from the command, it is still a direct partition of >> partition_name for all internal purposes (consider partition list caching >> in relcache, recursive tuple routing, etc.) save some others. >> >> I ask that also because it's related to the choice of syntax to use to >> declare the partition key for the multi-level case. I'm considering the >> SUBPARTITION BY notation and perhaps we could generalize it to more than >> just 2 levels. So, for the above case, parent_name would have been created >> as: >> >> CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ... > >> Needless to say, when subpartition_name is created with the command we saw >> a moment ago, the root partitioned table would be locked. In fact, adding >> a partition anywhere in the hierarchy needs an exclusive lock on the root >> table. Also, partition rule (the FOR VALUES clause) would be validated >> against PARTITION BY or SUBPARTITION BY clause at the respective level. >> >> Although, I must admit I feel a little uneasy about the inherent asymmetry >> in using SUBPARTITION BY for key declaration whereas piggybacking CREATE >> PARTITION for creating sub-partitions. Is there a better way? > > Provided that the syntax allows for N levels of partitioning, I don't care > if it's > PARTITION BY.., PARTITION BY..., PARTITION BY ... > or > PARTITION BY.., SUBPARTITION BY..., SUBPARTITION BY ... > > The first is probably better for meta-coding purposes, but the second makes > it clear which partition layer is first. I prefer second one too, for clarity. In any case, the PARTITION / SUBPARTITION BY clauses are ordered. That is, successive clauses specify partition key for corresponding successive levels of the hierarchy. > There's a great chance that not everyone cares right now about this part >> of the new partitioning but just want to put it out there. There are more >> contentious issues like the syntax, partitioning maintenance commands that >> we plan to support (now or later) and such. >> > > What I've read so far addresses most of my concerns. 
> > Still somewhat on my mind: > > 1. ability to describe partition bounds via range types, regardless of > whether the Automatic Tuple Routing uses those types internally. Specifying range partitioning bound as PARTITION FOR RANGE <range-literal> sounds like it offers some flexibility, which can be seen as a good thing. But it tends to make internal logic slightly complicated. Whereas, saying PARTITION FOR VALUES LESS THAN (max1, max2, ...) is notationally simpler still and easier to work with internally. Also, there will be no confusion about exclusivity of the bound if we document it so. > 2. syntax for splitting a partition in two, merging two adjacent partitions > (you probably touched on these earlier and I missed it or forgot). Maintenance operations like split, merge would not be the first cut. > 3. ability to swap a partition with a table not currently associated with > the partitioned table. EXCHANGE PARTITION-like ability is something on my list. There are a few hurdles to get there. Especially, source table would have to be physical tuple descriptor (ie, including dropped columns) compatible with the partitioned table given some other parts of the design (I touched on those in my earlier message). > 4. The applicability of this syntax to materialized views, allowing us to > do REFRESH CONCURRENTLY a few parts at a time, or only refreshing the data > we know needs it. > > Items 2 and 3 don't have to be implemented right away, as they're separate > ALTER commands. 4 is a pipe dream. With Item 1 I ask only that we don't > pick a syntax that prevents description via range types. Thanks a lot for taking time to comment. Thanks, Amit
Hi Amit, thanks for working on this. Seems the last version of the patch was submitted more than 2 months ago and I believe large parts of it will get reworked based on the extensive discussion on this list, so I haven't looked at the code at all. I'd like to comment on the one thing and that's the syntax. It seems to me we're really trying to reinvent the wheel and come up with our own version of the syntax. Is there a particular reason why not to look at the syntax of the other databases and adapt as much of the existing syntax as possible? I think that's for a few reasons - firstly it makes the life much easier for the DBAs and users who are either migrating to PostgreSQL or have to manage a mix of databases. Secondly, it serves as a valuable source of engineering info, preventing the "I haven't thought of this use case" problem. An example of this is the proposed syntax for adding a partition CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-15') END ('2006-03-01'); which seems a bit awkward as both the databases I'm familiar with (Oracle and Sybase) use ALTER TABLE to do this ALTER TABLE measurement ADD PARTITION measurement_fail VALUES LESS THAN ( ... ) And so on for the other commands. That being said, I entirely agree with Simon (and others) that getting the planner part work is the crucial part of the patch. But I also think that a proper abstraction (thanks to good syntax) may be a valuable hint how to define the catalogs and such. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So for now, you create an empty partitioned table specifying all the
partition keys without being able to define any partitions in the same
statement. Partitions (and partitions thereof, if any) will be created
using CREATE PARTITION statements, one for each.
...and I would assume that any attempt to insert into a partitioned table with no partitions (or lacking partitions at a defined level) would be an error? If so, I'd be ok with that.
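For instance, with nothing attached below the parent, I would expect something along these lines (the exact error wording is only a guess):

CREATE TABLE measurement (logdate date NOT NULL) PARTITION BY RANGE ON (logdate);
INSERT INTO measurement VALUES ('2006-02-20');
ERROR: no partition of "measurement" found for the inserted row   -- hypothetical message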
Specifying range partitioning bound as PARTITION FOR RANGE <range-literal>
sounds like it offers some flexibility, which can be seen as a good thing.
But it tends to make internal logic slightly complicated.
Whereas, saying PARTITION FOR VALUES LESS THAN (max1, max2, ...) is
notationally simpler still and easier to work with internally. Also, there
will be no confusion about exclusivity of the bound if we document it so.
I understand wanting the internal rules to be simple. Oracle clearly went with VALUES LESS THAN waterfalls for that reason.
What I'm hoping to avoid is:
- having to identify my "year2014" partition by VALUES LESS THAN '2015-01-01', a bit of cognitive dissonance defining data by what it's not.
- and then hoping that there is a year2013 partition created by someone with similar sensibilities, the partition definition being incomplete outside of the context of other partition definitions.
- and then further hoping that nobody drops the year2013 partition, thus causing new 2013 rows to fall into the year2014 partition, a side effect of an operation that did not mention the year2014 partition.
Range types do that, and if we're concerned about range type overhead, we're only dealing with the ranges at DDL time; we can break down the ATR rules into a more easily digestible form once a partition is modified.
Range continuity can be tested with -|-, but we'd only need to test for overlaps: gaps in ranges are sometimes a feature, not a bug (ex: I don't want any rows from future dates and we weren't in business before 1997).
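Both tests map directly onto the existing range operators; for example, using daterange purely for illustration:

SELECT daterange('2014-01-01', '2015-01-01') && daterange('2014-06-01', '2014-07-01');   -- t: bounds overlap, reject
SELECT daterange('2013-01-01', '2014-01-01') -|- daterange('2014-01-01', '2015-01-01');  -- t: adjacent, no gap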
Also, VALUES LESS THAN forces us to use discrete values. There is no way to express with VALUES LESS THAN partitions that have float values for temperature:
ice (,0.0), water [0.0,212.0], steam (212.0,3000.0], plasma (3000.0,).
Yes, I can calculate the day after the last day in a year, I can use 212.0000000001, I can write code to rigorously check that all partitions are in place. I'd just rather not.
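For what it's worth, the four ranges above are trivially spelled with an existing range type, which is the flexibility I'm after (numrange is used purely for illustration, not as a proposal for the catalog representation):

SELECT numrange(NULL,   0.0,    '()');   -- ice    (,0.0)
SELECT numrange(0.0,    212.0,  '[]');   -- water  [0.0,212.0]
SELECT numrange(212.0,  3000.0, '(]');   -- steam  (212.0,3000.0]
SELECT numrange(3000.0, NULL,   '()');   -- plasma (3000.0,)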
p.s. I'm really excited about what this will bring to Postgres in general and my organization in particular. This feature alone will help chip away at our needs for Vertica and Redshift clusters. Let me know if there's anything I can do to help.
Hi, On 2016/01/23 3:42, Corey Huinker wrote: >> So for now, you create an empty partitioned table specifying all the >> partition keys without being able to define any partitions in the same >> statement. Partitions (and partitions thereof, if any) will be created >> using CREATE PARTITION statements, one for each. >> > > ...and I would assume that any attempt to insert into a partitioned table > with no partitions (or lacking partitions at a defined level) would be an > error? If so, I'd be ok with that. Yes. Basically, for a tuple insert to succeed, there should exist a leaf partition that the inserted tuple would map to. >> Specifying range partitioning bound as PARTITION FOR RANGE <range-literal> >> sounds like it offers some flexibility, which can be seen as a good thing. >> But it tends to make internal logic slightly complicated. >> >> Whereas, saying PARTITION FOR VALUES LESS THAN (max1, max2, ...) is >> notationally simpler still and easier to work with internally. Also, there >> will be no confusion about exclusivity of the bound if we document it so. > > > I understand wanting the internal rules to be simple. Oracle clearly went > with VALUES LESS THAN waterfalls for that reason. > > What I'm hoping to avoid is: > - having to identify my "year2014" partition by VALUES LESS THAN > '2015-01-01', a bit of cognitive dissonance defining data by what it's not. > - and then hoping that there is a year2013 partition created by someone > with similar sensibilities, the partition definition being incomplete > outside of the context of other partition definitions. > - and then further hoping that nobody drops the year2013 partition, thus > causing new 2013 rows to fall into the year2014 partition, a side effect of > an operation that did not mention the year2014 partition. I somewhat agree with these concerns. So whenever a left range partition is deleted, a partition would begin accepting tuples that would have been mapped to the partition just deleted. Then as you say, partition would contain data that does not really fit the name. A keen user could then probably rename the partition to something suitable in the same transaction. I'd imagine that a partition deleted in this way would be the leftmost (oldest partition) in most scenarios and there wouldn't be any future tuples that would map to such partition. Admittedly, that may just be wishful thinking though. On the other hand, giving user the ability to specify both min and max requires a lot of bookkeeping on the system's part. One of my previous patches had that syntax and I wasn't quite happy with overall flakiness of the implementation. Granted it would have been slightly easier if I had used range type machinery and I tried a little but didn't stick with it for too long. You may have read the following thread: http://www.postgresql.org/message-id/15387.1423496163@sss.pgh.pa.us > Range types do that, and if we're concerned about range type overhead, > we're only dealing with the ranges at DDL time, we can break down the ATR > rules into a more easily digestible form once the partition is modified. We still end up with min, max pair per partition along with the inclusivity of both as being part of the persisted partition rule. > Range continuity can be tested with -|-, but we'd only need to test for > overlaps: gaps in ranges are sometimes a feature, not a bug (ex: I don't > want any rows from future dates and we weren't in business before 1997). > > Also, VALUES LESS THAN forces us to use discrete values. 
There is no way > with to express with VALUES LESS THAN partitions that have float values for > temperature: > ice (,0.0), water [0.0,212.0], steam (212.0,3000.0], plasma (3000.0,). > > Yes, I can calculate the day after the last day in a year, I can use > 212.0000000001, I can write code to rigorously check that all partitions > are in place. I'd just rather not. Hmm, this may be a concern. I have read discussion threads of range types development and one of the contending issues was around semantics for continuous types. So here, if all we have is VALUES LESS THAN to define a range partition, one cannot assign ranges to partitions that cannot be reduced to [) form, right? > p.s. I'm really excited about what this will bring to Postgres in general > and my organization in particular. This feature alone will help chip away > at our needs for Vertica and Redshift clusters. Let me know if there's > anything I can do to help. I am really glad to hear that. Your comments so far are much appreciated. Thanks, Amit
Hi Tomas, Thanks for your comments and sorry for replying so late. On 2016/01/22 22:54, Tomas Vondra wrote: > thanks for working on this. Seems the last version of the patch was > submitted more than 2 months ago and I believe large parts of it will get > reworked based on the extensive discussion on this list, so I haven't > looked at the code at all. > > I'd like to comment on the one thing and that's the syntax. It seems to me > we're really trying to reinvent the wheel and come up with our own version > of the syntax. Is there a particular reason why not to look at the syntax > of the other databases and adapt as much of the existing syntax as possible? > > I think that's for a few reasons - firstly it makes the life much easier > for the DBAs and users who are either migrating to PostgreSQL or have to > manage a mix of databases. Secondly, it serves as a valuable source of > engineering info, preventing the "I haven't thought of this use case" > problem. > > An example of this is the proposed syntax for adding a partition > > CREATE TABLE measurement_fail > PARTITION OF measurement > FOR VALUES START ('2006-02-15') END ('2006-03-01'); > > which seems a bit awkward as both the databases I'm familiar with (Oracle > and Sybase) use ALTER TABLE to do this > > ALTER TABLE measurement > ADD PARTITION measurement_fail VALUES LESS THAN ( ... ) Syntax like the one you mention allows to create/modify/move/drop partitions at 2 levels (generally) using PARTITION and SUBPARTITION keywords. That might be limiting to some users. I don't have a clear picture of what a syntax that's general enough would look like, but I proposed something like what follows: CREATE TABLE parent ( a int, b char(10) ) PARTITION BY RANGE ON (a) SUBPARTITION BY LIST ON ((substring(b from 1 for 2))); CREATE PARTITION partname OF parent FOR VALUES LESS THAN (100); CREATE PARTITION subpartname OF partname FOR VALUES IN ('ab'); The latter of the CREATE PARTITION commands lets us create the so-called sub-partition of 'parent'. Remember that in this scheme, all level 1 partitions are not actually physical tables themselves; only level 2 partitions are. If you stick one more SUBPARTITION BY in parent's definition, you can: CREATE PARTITION subsubpartname OF subpartname FOR VALUES ...; This is something that the Oracle-like syntax won't be able to support. Although, if we all agree that we'd never want to support such a case then let's implement something that's familiar viz. the following: CREATE TABLE parent ( a int, b char(10) ) PARTITION BY RANGE ON (a) SUBPARTITION BY LIST ON ((substring(b from 1 for 2))); ALTER TABLE parent ADD PARTITION partname FOR VALUES LESS THAN (100); ALTER TABLE parent MODIFY PARTITION partname ADD SUBPARTITION subpartname FOR VALUES IN ('ab'); ALTER TABLE parent MODIFY PARTITION partname DROP SUBPARTITION subpartname; ALTER TABLE parent DROP PARTITION partname; > And so on for the other commands. > > That being said, I entirely agree with Simon (and others) that getting the > planner part work is the crucial part of the patch. But I also think that > a proper abstraction (thanks to good syntax) may be a valuable hint how to > define the catalogs and such. I tried to do that in the November commitfest but decided to just work on the syntax as it became clear that throwing too many changes at the reviewers/committers like that may not be such a great idea. Syntax itself is a big enough change to discuss and reach consensus on. 
Let's get the syntax, catalog and some infrastructure for basic features of the new partitioning to work. We can think about planner enhancements for partitioned tables that are waiting for declarative partitioning to get in later. Thanks, Amit
Hi, Attached find a patch series to implement the following syntax: 1. Syntax for defining the partition key as a new clause in CREATE TABLE: [ PARTITION BY {RANGE | LIST} ON ( { column_name | ( expression ) } [ opclass ] [, ...] ) [ SUBPARTITION BY {RANGE | LIST} ON ( { column_name | ( expression ) } [ opclass ] [, ...] ) ] ] 2. Syntax to create partitions and sub-partitions (as ALTER TABLE commands on a partitioned table): ADD PARTITION name FOR VALUES partition_bound_spec [ WITH ( storage_parameter [= value] [, ... ] )] [ TABLESPACE tablespace_name ] MODIFY PARTITION partition_name ADD SUBPARTITION name FOR VALUES partition_bound_spec [ WITH ( storage_parameter [= value] [, ... ] )] [ TABLESPACE tablespace_name ] DROP PARTITION name DROP SUBPARTITION name ATTACH PARTITION name FOR VALUES partition_bound_spec USING [ TABLE ] table_name [ WITH ( storage_parameter [= value] [, ... ] )] [ TABLESPACE tablespace_name ] MODIFY PARTITION partition_name ATTACH SUBPARTITION name FOR VALUES partition_bound_spec USING [ TABLE ] table_name [ WITH ( storage_parameter [= value] [, ... ] )] [ TABLESPACE tablespace_name ] Where partition_bound_spec is: FOR VALUES [ IN ] ( expression [, ...] ) FOR VALUES LESS THAN ( expression [, ...] ) DETACH PARTITION name [ USING [ TABLE ] table_name] DETACH SUBPARTITION name [ USING [ TABLE ] table_name] Please note that ATTACH PARTITION ... USING TABLE command shown above is not yet implemented. As mentioned previously, this uses two system catalogs pg_partitioned_rel and pg_partition. This is complete with tuple routing so that no triggers or rules are required. There is also basic planner support but no support yet to enable constraint exclusion on partitions (which will be fulfilled shortly by installing equivalent check constraints on partitions). There is documentation for the new syntax and catalogs. I feel more detailed documentation is required so I will keep on improving it in subsequent versions as we build consensus about the syntax and general design. There are no regression tests at all in the attached patches which I will add in the next version along with constraint exclusion support. pg_dump support will also be added shortly. The individual patches have commit messages that describe code changes. This is registered in the upcoming CF. Feedback and review is greatly welcomed! Thanks, Amit
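As a rough usage sketch of how the commands above might be combined (not taken from the patch itself; the partition names, the month expression, and the tablespace are made up):

CREATE TABLE measurement (logdate date NOT NULL, peaktemp int)
PARTITION BY RANGE ON (logdate)
SUBPARTITION BY LIST ON ((extract(month from logdate)));

ALTER TABLE measurement
    ADD PARTITION measurement_y2006 FOR VALUES LESS THAN ('2007-01-01');
ALTER TABLE measurement
    MODIFY PARTITION measurement_y2006
    ADD SUBPARTITION measurement_y2006m02 FOR VALUES IN (2) TABLESPACE fast_disks;
ALTER TABLE measurement DROP SUBPARTITION measurement_y2006m02;
ALTER TABLE measurement DROP PARTITION measurement_y2006;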
Attachment
- 0001-Add-relkind-RELKIND_PARTITIONED_REL-to-pg_class.h.patch
- 0002-Add-partitioned-table-system-catalog.patch
- 0003-Introduce-partition-key-clause-in-CREATE-TABLE.patch
- 0004-Infrastructure-for-creation-of-partitioned-tables.patch
- 0005-Add-partition-system-catalog.patch
- 0006-Add-partition-creation-management-syntax.patch
- 0007-Add-infrastructure-to-actually-create-and-manage-par.patch
- 0008-Rework-code-manipulating-the-PartitionKey-data-struc.patch
- 0009-Tuple-routing-functionality.patch
- 0010-Embarrasingly-small-optimizer-patch-to-work-with-par.patch
The individual patches have commit messages that describe code changes.
This is registered in the upcoming CF. Feedback and review is greatly
welcomed!
Thanks,
Amit
We have a system that is currently a mix of tables, each of which is range partitioned into approximately 15 partitions (using the pgxn range partitioning extension), and those partitions are themselves date-series partitioned via pg_partman. The largest table ingests about 100M rows per day in a single ETL. I will try this patch out and see how well it compares in handling the workload. Do you have any areas of interest or concern that I should monitor?
Hi Corey, On 2016/02/16 5:15, Corey Huinker wrote: >> >> The individual patches have commit messages that describe code changes. >> This is registered in the upcoming CF. Feedback and review is greatly >> welcomed! >> > We have a current system that is currently a mix of tables, each of which > is range partitioned into approximately 15 partitions (using the pgxn range > partitioning extension), and those partitions are themselves date-series > partitioned via pg_partman. The largest table ingests about 100M rows per > day in a single ETL. I will try this patch out and see how well it compares > in handling the workload. Do you have any areas of interest or concern that > I should monitor? Thanks a lot for willing to give it a spin! I would say this patch series is more geared toward usability. For example, you won't have to write a trigger to route tuples to correct partitions. You can try your mentioned ETL load and see if the patch's implementation of tuple routing fares any better than existing trigger-based approach. Maybe, you were expecting something like load into a stand-alone table and then ALTER TABLE INHERIT to instantly load into the partitioned table (roll-in), but that command is not yet implemented (ATTACH PARTITION command will show a "not implemented" error message). Also, you won't see any optimizer and executor changes. Queries will still use the same plans as existing inheritance-based partitioned tables, although as I mentioned, constraint exclusion won't yet kick in. That will be fixed very shortly. And of course, comments on syntax are welcome as well. Thanks, Amit
Also, you won't see any optimizer and executor changes. Queries will still
use the same plans as existing inheritance-based partitioned tables,
although as I mentioned, constraint exclusion won't yet kick in. That will
be fixed very shortly.
And of course, comments on syntax are welcome as well.
Thanks,
Amit
Good to know the current limitations/expectations.
Our ETL has a great number of workers that do something like this:
1. grab a file
2. based on some metadata of that file, determine the partition that would receive ALL of the rows in that file. It's actually multiple tables, all of which are partitioned, all of which fully expect the file data to fit in exactly one partition.
3. \copy into a temp table
4. Transform the data and insert the relevant bits into each of the target partitions derived in #2.
So while ATR is a major feature of true partitioning, it's not something we'd actually need in our current production environment, but I can certainly code it that way to benchmark ATR vs "know the destination partition ahead of time" vs "insane layered range_partitioning trigger + pg_partman trigger".
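Concretely, the benchmark would compare something like the following two insert paths (the staging table, file path, and column list are placeholders):

CREATE TEMP TABLE staging (logdate date NOT NULL);
\copy staging FROM 'datafile.csv' WITH (FORMAT csv)

-- destination partition known from the file's metadata: insert straight into the leaf
INSERT INTO measurement_y2006m02 SELECT logdate FROM staging;

-- same rows routed through the parent, exercising automatic tuple routing
INSERT INTO measurement SELECT logdate FROM staging;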
Currently we don't do anything like table swapping, but I've done that enough in the past that I could probably concoct a test of that too, once it's implemented.
As for the syntax, I'm not quite sure your patch addresses the concerns I voiced earlier: specifically whether VALUES IN works for RANGE as well as LIST, but I figured that would become clearer once I tried to actually use it. Currently we have partitioning on C-collated text ranges (no, they don't ship with postgres, I had to make a custom type), something like this:
part0: (,BIG_CLIENT)
part1: [BIG_CLIENT,BIG_CLIENT]
part2: (BIG_CLIENT,L)
part3: [L,MONSTROUSLY_BIG_CLIENT)
part4: [MONSTROUSLY_BIG_CLIENT,MONSTROUSLY_BIG_CLIENT]
part5: (MONSTROUSLY_BIG_CLIENT,RANDOM_CLIENT_LATE_IN_ALPHABET]
part6: (RANDOM_CLIENT_LATE_IN_ALPHABET,)
I can't implement that with a simple VALUES LESS THAN clause, unless I happen to know 'x' in 'BIG_CLIENTx', where 'x' is the exact first character in the collation sequence, which has to be something unprintable, and that would make those who later read my code say something unprintable. So yeah, I'm hoping there's some way to cleanly represent such ranges.
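To show why range notation fits this case: assuming the custom C-collated text range type I mentioned (call it clientrange here, an invented name), the single-client partition is just a degenerate closed range, and overlap between two proposed bounds is a one-operator test:

SELECT 'BIG_CLIENT' <@ clientrange('BIG_CLIENT', 'BIG_CLIENT', '[]');                   -- t: part1 holds exactly one value
SELECT clientrange(NULL, 'BIG_CLIENT', '()') && clientrange('BIG_CLIENT', 'L', '()');   -- f: part0 and part2 do not overlap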
On 02/15/2016 04:28 PM, Amit Langote wrote: > Also, you won't see any optimizer and executor changes. Queries will still > use the same plans as existing inheritance-based partitioned tables, > although as I mentioned, constraint exclusion won't yet kick in. That will > be fixed very shortly. We're not going to use CE for the new partitioning long-term, are we? This is just the first version, right? -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
Hi Josh, On 2016/02/16 11:41, Josh berkus wrote: > On 02/15/2016 04:28 PM, Amit Langote wrote: >> Also, you won't see any optimizer and executor changes. Queries will still >> use the same plans as existing inheritance-based partitioned tables, >> although as I mentioned, constraint exclusion won't yet kick in. That will >> be fixed very shortly. > > We're not going to use CE for the new partitioning long-term, are we? This > is just the first version, right? Yes. My approach in previous versions of stuffing major planner changes in with the syntax patch was not quite proper in retrospect. So, I thought I'd propose any major planner (and executor) changes later. Thanks, Amit
On Fri, Jan 15, 2016 at 5:48 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > If we have a CREATE statement for each partition, how do we generalize > that to partitions at different levels? For example, if we use something > like the following to create a partition of parent_name: > > CREATE PARTITION partition_name OF parent_name FOR VALUES ... > WITH ... TABLESPACE ... > > Do we then say: > > CREATE PARTITION subpartition_name OF partition_name ... > > to create a level 2 partition (sub-partition) of parent_name? Yes, exactly. Personally, I would be more inclined to make this a CREATE TABLE statement, like CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... CREATE TABLE subpartition_name PARTITION OF partition_name FOR VALUES ... > I ask that also because it's related to the choice of syntax to use to > declare the partition key for the multi-level case. I'm considering the > SUBPARTITION BY notation and perhaps we could generalize it to more than > just 2 levels. So, for the above case, parent_name would have been created as: > > CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ... > > Needless to say, when subpartition_name is created with the command we saw > a moment ago, the root partitioned table would be locked. In fact, adding > a partition anywhere in the hierarchy needs an exclusive lock on the root > table. Also, partition rule (the FOR VALUES clause) would be validated > against PARTITION BY or SUBPARTITION BY clause at the respective level. > > Although, I must admit I feel a little uneasy about the inherent asymmetry > in using SUBPARTITION BY for key declaration whereas piggybacking CREATE > PARTITION for creating sub-partitions. Is there a better way? I think if you've got SUBPARTITION as a keyword in the syntax anywhere, you're doing it wrong. The toplevel object shouldn't really care whether its children are themselves partitioned or not. > Do we want this at all? It seems difficult to generalize this to > multi-level hierarchy of more than 2 levels. It doesn't do anything for me. There may be somebody who wants it, but I don't see much value myself. > After thinking some more on this - I think that identical tuple > descriptors may not just be a nice-to-have but critical in some cases. For > example, consider built-in/trigger-less tuple routing. I'd imagine that > the partition to insert a tuple into would be determined just before > calling heap_insert() in ExecInsert() and CopyFrom(). That means the > HeapTuple that is passed to heap_insert() to insert into the partition > would be based on the root table's tuple descriptor. Note also that the > tuple would have passed through BR, IR triggers, constraints of the root > table. When the data is eventually queried from partitions directly, or > well even via the root table (considering existing executor capabilities), > partition's tuple descriptor at that point had better match the data that > went onto the disk. That means we had better keep at least the following > things in sync: number of attributes, name, position (attnum), type, > notnull-ness of individual attributes. So in order to do that, recursively > apply ADD/DROP COLUMN, SET WITH/WITHOUT OIDS, RENAME COLUMN, ALTER COLUMN > TYPE, SET/DROP NOT NULL on the root table to all the partitions and > prevent those sub-commands to be directly applied to any table > (partitions) in the partitioning hierarchy but the root. 
I further don't > see the point of allowing to set (or drop) column defaults in partitions > because now INSERT or COPY FROM cannot be directly applied to partitions. > Similar argument could be made for BR, IR triggers and CHECK constraints. > Am I missing something in all of this? Well, in the end there are basically two choices. Either tuple descriptors have to match exactly, and then you can reuse a tuple intended for one partition for some other partition without projection; or else they don't, and you need to project. I'm not sure that projection is expensive enough to make disallowing mismatched tuple descriptors a necessary design choice - and certainly that design choice is awkward from a UI standpoint, because we will sometimes not be able to attach a partition for a reason that the user can neither see in the \d output nor correct. But on the flip side, not having to worry about projection is nice. > An alternative to doing any of that very well may be to design > trigger-less tuple routing to be smarter about possible mismatch of the > tuple descriptors but I haven't given that a lot of thought. Is that > really an alternative worth looking into? Yes. > On one hand, I think to keep treating "partition hierarchies" as > "inheritance hierachies" might have some issues. I am afraid that > documented inheritance semantics may not be what we want to keep using for > the new partitioned tables. By that, I mean all the user-facing behaviors > where inheritance has some bearing. Should it also affect new partitioned > tables? Consider whether inheritance semantics would render infeasible > some of the things that we'd like to introduce for the new partitioned > tables such as automatic tuple routing, or keep us from improving planner > smarts and executor capabilities for partitioned tables over what we > already have. > > OTOH, I may be looking at it wrongly. We would not be required to enforce > user-facing inheritance behaviors on the new partitioned tables after all. > That is to say - it's just that new partitioned tables could still use > relevant inheritance infrastructure behind-the-scenes for planning, > execution and a few other things and not care about abiding by regular > inheritance semantics. I should just go ahead and add special cases in all > places where existing inheritance handling code stands to cause trouble > down the line for partitioned tables. We might want to mention that we do > so somewhere in documentation and also note that regular inheritance > semantics does not apply. While it sounds probably fine as implementation > for the feature released initially, a day will inevitably come when this > behind-the-scenes implementation will be changed to something more > amenable to better optimization. But that's for future... This seems pretty vague to me. I don't know that I have an opinion without a more specific list of questions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 22, 2016 at 8:54 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I'd like to comment on the one thing and that's the syntax. It seems to me > we're really trying to reinvent the wheel and come up with our own version > of the syntax. Is there a particular reason why not to look at the syntax of > the other databases and adapt as much of the existing syntax as possible? I'm not sure this is at all standardized. How do we decide whose syntax to adopt? Oracle's just because it's Oracle? I don't particularly like inheriting the limitations of Oracle's syntax - such as at most 2 levels of partitioning, weird VALUES LESS THAN syntax that doesn't look like anything else we do in PG and has no way to specify an opclass, etc. > I think that's for a few reasons - firstly it makes the life much easier for > the DBAs and users who are either migrating to PostgreSQL or have to manage > a mix of databases. Secondly, it serves as a valuable source of engineering > info, preventing the "I haven't thought of this use case" problem. > > An example of this is the proposed syntax for adding a partition > > CREATE TABLE measurement_fail > PARTITION OF measurement > FOR VALUES START ('2006-02-15') END ('2006-03-01'); > > which seems a bit awkward as both the databases I'm familiar with (Oracle > and Sybase) use ALTER TABLE to do this > > ALTER TABLE measurement > ADD PARTITION measurement_fail VALUES LESS THAN ( ... ) > > And so on for the other commands. One thing to think about is that pg_dump --binary-upgrade needs to decorate each command that creates a relation. I guess you could decorate ALTER TABLE with the same stuff we currently apply to CREATE TABLE but... > That being said, I entirely agree with Simon (and others) that getting the > planner part work is the crucial part of the patch. But I also think that a > proper abstraction (thanks to good syntax) may be a valuable hint how to > define the catalogs and such. No argument on that from me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016/02/15 10:55, Amit Langote wrote: > required. There is also basic planner support but no support yet to enable > constraint exclusion on partitions (which will be fulfilled shortly by > installing equivalent check constraints on partitions). Just to follow up on this - attached now adds equivalent check constraint with partition creation. Constraint exclusion should work. No regression tests yet though. >From now on, instead of attaching multiple files like in the previous message, I will send a single tar.gz which will contain patches created by git-format-patch. Thanks, Amit
Attachment
On 2016/02/16 21:57, Robert Haas wrote: > On Fri, Jan 15, 2016 at 5:48 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> If we have a CREATE statement for each partition, how do we generalize >> that to partitions at different levels? For example, if we use something >> like the following to create a partition of parent_name: >> >> CREATE PARTITION partition_name OF parent_name FOR VALUES ... >> WITH ... TABLESPACE ... >> >> Do we then say: >> >> CREATE PARTITION subpartition_name OF partition_name ... >> >> to create a level 2 partition (sub-partition) of parent_name? > > Yes, exactly. > > Personally, I would be more inclined to make this a CREATE TABLE statement, like > > CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... > CREATE TABLE subpartition_name PARTITION OF partition_name FOR VALUES ... I guess the first of which would actually be: CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... PARTITION BY ... Some might think that writing potentially the same PARTITION BY clause 100 times for 100 level-1 partitions could be cumbersome. That is what SUBPARTITION BY notation may be good as a shorthand for. > I think if you've got SUBPARTITION as a keyword in the syntax > anywhere, you're doing it wrong. The toplevel object shouldn't really > care whether its children are themselves partitioned or not. This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It might as well be just cascaded PARTITION BYs. The point is to specify as many of those with CREATE TABLE toplevel as the number of levels of partitioning we want. That does however fix the number of levels in advance. In the patch I have posted, here are some details of the tuple routing implementation for instance - the top-level parent knows only of its immediate partitions. Once a level-1 partition from that list is determined using a tuple's level-1 key, the tuple is passed down to choose one of its own partitions using the level-2 key. The key descriptor list is associated with the top-level parent alone and the recursive code knows to iterate the key list to apply nth key for nth level. The recursion happens in the partition module. Now there are also functions to let obtain, say *all* or only *leaf* partitions (OIDs) of a top-level parent but they are used for certain DDL scenarios. As far as DML is concerned, the level-at-a-time recursive approach as described above is used. Queries not yet because the plan is a flattened append of leaf partitions anyway. If such notation convenience at the expense of loss of generality is not worth it, I'm willing to go ahead and implement SUBPARTITION-less syntax. CREATE TABLE toplevel() PARTITION BY CREATE TABLE partition PARTITION OF toplevel FOR VALUES ... PARTITION BY CREATE TABLE subpartition PARTITION OF partition FOR VALUES ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE ALTER TABLE partitioned DETACH PARTITION name [ USING newname ] While we are on the syntax story, how about FOR VALUES syntax for range partitions (sorry for piggybacking it here in this message). At least some people dislike LESS THAN notation. Corey Huinker says we should be using range type literals for that. It's not clear though that using range type literals directly is a way ahead. For one, range type API expects there to exist a range type with given element type. Whereas all we require for range partitioning proper is for column type to have a btree operator class. Should we require it to have an associated range type as well? 
Don't think that there exists an API to check that either. All in all, range types are good to implement things in applications but not so much within the backend (unless I'm missing something). I know reinventing the wheel is disliked as well but perhaps we could offer something like the following because Corey offered some examples which would help from the flexibility: START [ EXCL ] (startval) END [ INCL ] (endval) That is, in range type notation, '[startval, endval)' is the default behavior. So for each partition, there is at least the following pieces of metadata: Datum *startval; bool startexcl; Datum *endval; bool endincl; That requires quite some gymnastics during DDL (with no help from range type infrastructure) to maintain the invariant that no two range partitions overlap. Even gaps can result which are considered undesirable, so maybe, the invariant to maintain would include no gaps in addition to no overlap. That would have us looking for a definition of a "gap" for all sorts of btree supporting data types. Some people said something akin to interval partitioning would be good like: PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START WITH value; But that could be just a UI addition to the design where each partition has [startval, endval) bounds. In any case, not a version 1 material I'd think. >> Do we want this at all? It seems difficult to generalize this to >> multi-level hierarchy of more than 2 levels. > > It doesn't do anything for me. There may be somebody who wants it, > but I don't see much value myself. OK. >> After thinking some more on this - I think that identical tuple >> descriptors may not just be a nice-to-have but critical in some cases. For >> example, consider built-in/trigger-less tuple routing. I'd imagine that >> the partition to insert a tuple into would be determined just before >> calling heap_insert() in ExecInsert() and CopyFrom(). That means the >> HeapTuple that is passed to heap_insert() to insert into the partition >> would be based on the root table's tuple descriptor. Note also that the >> tuple would have passed through BR, IR triggers, constraints of the root >> table. When the data is eventually queried from partitions directly, or >> well even via the root table (considering existing executor capabilities), >> partition's tuple descriptor at that point had better match the data that >> went onto the disk. That means we had better keep at least the following >> things in sync: number of attributes, name, position (attnum), type, >> notnull-ness of individual attributes. So in order to do that, recursively >> apply ADD/DROP COLUMN, SET WITH/WITHOUT OIDS, RENAME COLUMN, ALTER COLUMN >> TYPE, SET/DROP NOT NULL on the root table to all the partitions and >> prevent those sub-commands to be directly applied to any table >> (partitions) in the partitioning hierarchy but the root. I further don't >> see the point of allowing to set (or drop) column defaults in partitions >> because now INSERT or COPY FROM cannot be directly applied to partitions. >> Similar argument could be made for BR, IR triggers and CHECK constraints. >> Am I missing something in all of this? > > Well, in the end there are basically two choices. Either tuple > descriptors have to match exactly, and then you can reuse a tuple > intended for one partition for some other partition without > projection; or else they don't, and you need to project. 
I'm not sure > that projection is expensive enough to make disallowing mismatched > tuple descriptors a necessary design choice - and certainly that > design choice is awkward from a UI standpoint, because we will > sometimes not be able to attach a partition for a reason that the user > can neither see in the \d output nor correct. But on the flip side, > not having to worry about projection is nice. > >> An alternative to doing any of that very well may be to design >> trigger-less tuple routing to be smarter about possible mismatch of the >> tuple descriptors but I haven't given that a lot of thought. Is that >> really an alternative worth looking into? > > Yes. I see the trade-off. I agree that considering the significance for attach partition case is quite important. So, the tuple routing code should be ready to use projection if there happens to be a partition with differing tuple descriptor. In the code I posted, a ResultRelInfo is lazily built afresh for each inserted tuple in ExecInsert's case and for each tuple where the chosen partition is different from the previous tuple's in CopyFrom's case. One can feel that there is a certain overhead to that approach for the bulk-loading case (almost every CopyFrom invocation). Now if projection enters this picture, we have to consider that too. Should we initialize ResultRelInfo's and corresponding ProjectionInfo's for all partitions beforehand? Consider how ModifyTable currently handles update on inheritance set, for example. That would incur unnecessary overhead if only a single tuple is inserted. But it would certainly help bulk-loading case. Am I missing something? >> On one hand, I think to keep treating "partition hierarchies" as >> "inheritance hierachies" might have some issues. I am afraid that >> documented inheritance semantics may not be what we want to keep using for >> the new partitioned tables. By that, I mean all the user-facing behaviors >> where inheritance has some bearing. Should it also affect new partitioned >> tables? Consider whether inheritance semantics would render infeasible >> some of the things that we'd like to introduce for the new partitioned >> tables such as automatic tuple routing, or keep us from improving planner >> smarts and executor capabilities for partitioned tables over what we >> already have. >> >> OTOH, I may be looking at it wrongly. We would not be required to enforce >> user-facing inheritance behaviors on the new partitioned tables after all. >> That is to say - it's just that new partitioned tables could still use >> relevant inheritance infrastructure behind-the-scenes for planning, >> execution and a few other things and not care about abiding by regular >> inheritance semantics. I should just go ahead and add special cases in all >> places where existing inheritance handling code stands to cause trouble >> down the line for partitioned tables. We might want to mention that we do >> so somewhere in documentation and also note that regular inheritance >> semantics does not apply. While it sounds probably fine as implementation >> for the feature released initially, a day will inevitably come when this >> behind-the-scenes implementation will be changed to something more >> amenable to better optimization. But that's for future... > > This seems pretty vague to me. I don't know that I have an opinion > without a more specific list of questions. Sorry for rather incoherent articulation in the previous message. Let me try to say something that's perhaps more specific. 
Consider that we create partitions as inheritance children, that is, creating a partition using: CREATE TABLE partition PARTITION OF parent FOR VALUES ... is equivalent to saying CREATE TABLE partition (...) INHERITS (parent) except that the latter allows partition to have its own column definitions, table constraints and multiple parents as long as things are conflict-free. The former requires specifying partition bounding values. The common thing between the two then is StoreCatalogInheritance(partOid, parentOid) that will mark partition as inheritance child of parent in pg_inherits. So, our new partitions are really inheritance children but that's not apparent to users (where this last bit is important). Then consider ALTER TABLE partition - should we need to handle it in way different from existing inheritance code would do - * Prevent something whereas regular inheritance wouldn't?* Do something instead of/in addition to whatever regular inheritancedoes? Consider adding a column to partition - regular inheritance wouldn't prevent it because adding a column to child table doesn't get in the way of how inheritance works. Now we can make it so that if table is a partition created with CRATE TABLE PARTITION OF (which we can tell because the table has pg_partition entry), we should *actively* prevent adding a column. Inheritance already prevents dropping an inherited attribute in child but says "cannot drop inherited column". IMO in a partition's case, it should say "cannot drop columns of a partition". OTOH, adding a column to parent propagates to all partitions by way of ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is necessary for partitioned tables (which we can tell from relkind or maybe because table has pg_partitioned_rel entry) - especially no need to propagate column default and check constraint, etc. Because one cannot apply insert/update on partitions anyway. Then because defaults and check constraints on partitions are really useless, when a user adds them to a partition directly, we should say, "cannot alter column default for a partition" and "cannot add check constraint to a partition" with "instead do that operation on the root parent" hint. But what is a "root parent"? How to determine one for a given partition? Perhaps, it's not much of a POLA violation a user must have done something to have ended up in the error occurring, that is, created the table as partition of (partition of ...) something. Thanks a lot for your time! Thanks, Amit
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Feb 18, 2016 at 12:41 AM, Amit Langote <span dir="ltr"><<ahref="mailto:Langote_Amit_f8@lab.ntt.co.jp" target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 00 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="a3s" id=":1pe" style="overflow:hidden">START [ EXCL ] (startval)END [ INCL ] (endval)<br /><br /> That is, in range type notation, '[startval, endval)' is the default<br /> behavior.So for each partition, there is at least the following pieces of<br /> metadata:</div></blockquote></div><div class="gmail_extra"><br/></div><div class="gmail_extra">This is really close, and if it is what we ended up with we wouldbe able to use it.</div><div class="gmail_extra"><br /></div><div class="gmail_extra">I suggest that the range notationcan be used even when no suitable range type exists.<br /></div><div class="gmail_extra"><br /></div><div class="gmail_extra">Iassume the code for parsing a range spec regardless of data type already exists, but in case it doesn't,take a range spec of unknown type:</div><div class="gmail_extra"><br /></div><div class="gmail_extra">[x,y)</div><divclass="gmail_extra"><br /></div><div class="gmail_extra">x and y are either going to beraw strings or doublequoted strings with possible doublequote escapes, each of which would be coercible into the the typeof the partition column.</div><div class="gmail_extra"><br /></div><div class="gmail_extra">In other words, if your stringvalues were 'blah , blah ' and 'fabizzle', the [) range spec would be ["blah , blah ",fabizzle).</div><div class="gmail_extra"><br/></div><div class="gmail_extra">Using regular range specs syntax also allows for the range to beunbounded in either or both directions, which is a possibility, especially in newer tables where the expected distributionof data is unknown.</div><div class="gmail_extra"><br /></div>p.s. Sorry I haven't been able to kick the tiresjust yet. We have a very good use case for this, it's just a matter of getting a machine and the time to devote to it.</div></div>
On 2/16/16 9:56 PM, Amit Langote wrote: > From now on, instead of attaching multiple files like in the previous > message, I will send a single tar.gz which will contain patches created by > git-format-patch. Please don't do that.
On Sat, Feb 20, 2016 at 1:41 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 2/16/16 9:56 PM, Amit Langote wrote: >> From now on, instead of attaching multiple files like in the previous >> message, I will send a single tar.gz which will contain patches created by >> git-format-patch. > > Please don't do that. OK, will remember. Sorry about that. Thanks, Amit
On Thu, Feb 18, 2016 at 11:11 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Personally, I would be more inclined to make this a CREATE TABLE statement, like >> >> CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... >> CREATE TABLE subpartition_name PARTITION OF partition_name FOR VALUES ... > > I guess the first of which would actually be: > > CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ... > PARTITION BY ... > > Some might think that writing potentially the same PARTITION BY clause 100 > times for 100 level-1 partitions could be cumbersome. That is what > SUBPARTITION BY notation may be good as a shorthand for. I think that anybody who is doing that much partitioning is going to use a tool to generate the DDL anyway, so it doesn't really matter. >> I think if you've got SUBPARTITION as a keyword in the syntax >> anywhere, you're doing it wrong. The toplevel object shouldn't really >> care whether its children are themselves partitioned or not. > > This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It > might as well be just cascaded PARTITION BYs. The point is to specify as > many of those with CREATE TABLE toplevel as the number of levels of > partitioning we want. That does however fix the number of levels in advance. Which I think is not good. If we say that a table can be partitioned, and a table that is a partition can also be partitioned, we've got a nice general system. Fixing the number of partitioning levels for the sake of a little syntactic sugar is, IMHO, getting our priorities backwards. > In the patch I have posted, here are some details of the tuple routing > implementation for instance - the top-level parent knows only of its > immediate partitions. Once a level-1 partition from that list is > determined using a tuple's level-1 key, the tuple is passed down to choose > one of its own partitions using the level-2 key. The key descriptor list > is associated with the top-level parent alone and the recursive code knows > to iterate the key list to apply nth key for nth level. The recursion > happens in the partition module. I haven't looked at the code, but that sounds like the right approach. > Now there are also functions to let obtain, say *all* or only *leaf* > partitions (OIDs) of a top-level parent but they are used for certain DDL > scenarios. As far as DML is concerned, the level-at-a-time recursive > approach as described above is used. Queries not yet because the plan is a > flattened append of leaf partitions anyway. OK. > If such notation convenience at the expense of loss of generality is not > worth it, I'm willing to go ahead and implement SUBPARTITION-less syntax. > > CREATE TABLE toplevel() PARTITION BY > CREATE TABLE partition PARTITION OF toplevel FOR VALUES ... PARTITION BY > CREATE TABLE subpartition PARTITION OF partition FOR VALUES > ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE > ALTER TABLE partitioned DETACH PARTITION name [ USING newname ] > > While we are on the syntax story, how about FOR VALUES syntax for range > partitions (sorry for piggybacking it here in this message). At least some > people dislike LESS THAN notation. Corey Huinker says we should be using > range type literals for that. It's not clear though that using range type > literals directly is a way ahead. For one, range type API expects there to > exist a range type with given element type. 
Whereas all we require for > range partitioning proper is for column type to have a btree operator > class. Should we require it to have an associated range type as well? > Don't think that there exists an API to check that either. All in all, > range types are good to implement things in applications but not so much > within the backend (unless I'm missing something). I know reinventing the > wheel is disliked as well but perhaps we could offer something like the > following because Corey offered some examples which would help from the > flexibility: > > START [ EXCL ] (startval) END [ INCL ] (endval) I don't think using range type literals is gonna work. There's no guarantee that the necessary range types exist. However, we could possibly use a syntax inspired by the syntax for range types. I'm a little nervous about asking people to type commands with mismatching braces: CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES [ 1, 10 ); ...but maybe it's a good idea. It certainly has the advantage of being more compact than a lot of there ways we might choose to write that. And I think LESS THAN sucks. It's just clunky and awkward syntax. > Some people said something akin to interval partitioning would be good like: > > PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START > WITH value; > > But that could be just a UI addition to the design where each partition > has [startval, endval) bounds. In any case, not a version 1 material I'd > think. Agreed; not right now. > I see the trade-off. I agree that considering the significance for attach > partition case is quite important. > > So, the tuple routing code should be ready to use projection if there > happens to be a partition with differing tuple descriptor. In the code I > posted, a ResultRelInfo is lazily built afresh for each inserted tuple in > ExecInsert's case and for each tuple where the chosen partition is > different from the previous tuple's in CopyFrom's case. One can feel that > there is a certain overhead to that approach for the bulk-loading case > (almost every CopyFrom invocation). Now if projection enters this picture, > we have to consider that too. Should we initialize ResultRelInfo's and > corresponding ProjectionInfo's for all partitions beforehand? Consider how > ModifyTable currently handles update on inheritance set, for example. That > would incur unnecessary overhead if only a single tuple is inserted. But > it would certainly help bulk-loading case. Am I missing something? I'd say you are missing benchmarking data. :-) Why should the single-tuple case be harmed here? > Sorry for rather incoherent articulation in the previous message. Let me > try to say something that's perhaps more specific. > > Consider that we create partitions as inheritance children, that is, > creating a partition using: > > CREATE TABLE partition PARTITION OF parent FOR VALUES ... > > is equivalent to saying > > CREATE TABLE partition (...) INHERITS (parent) > > except that the latter allows partition to have its own column > definitions, table constraints and multiple parents as long as things are > conflict-free. The former requires specifying partition bounding values. > The common thing between the two then is StoreCatalogInheritance(partOid, > parentOid) that will mark partition as inheritance child of parent in > pg_inherits. So, our new partitions are really inheritance children but > that's not apparent to users (where this last bit is important). So far, this sounds good to me. 
> Then consider ALTER TABLE partition - should we need to handle it in way > different from existing inheritance code would do - > > * Prevent something whereas regular inheritance wouldn't? > * Do something instead of/in addition to whatever regular inheritance does? > > Consider adding a column to partition - regular inheritance wouldn't > prevent it because adding a column to child table doesn't get in the way > of how inheritance works. Now we can make it so that if table is a > partition created with CRATE TABLE PARTITION OF (which we can tell because > the table has pg_partition entry), we should *actively* prevent adding a > column. Inheritance already prevents dropping an inherited attribute in > child but says "cannot drop inherited column". IMO in a partition's case, > it should say "cannot drop columns of a partition". Agreed. So maybe pg_inherits gets an additional column that indicates whether this is "regular" inheritance or "partitioning" inheritance, and in the latter case there are additional restrictions. > OTOH, adding a column to parent propagates to all partitions by way of > ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is > necessary for partitioned tables (which we can tell from relkind or maybe > because table has pg_partitioned_rel entry) - especially no need to > propagate column default and check constraint, etc. Because one cannot > apply insert/update on partitions anyway. Hmm. I feel like maybe that should be allowed. If the user wants to bulk-load a bunch of data and knows that it goes in a particular partition, why should they have to pay the overhead of determining that anew for each row? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
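In catalog terms, the pg_inherits suggestion above would amount to a new boolean column on that catalog, used along these lines (the column name inhispartition is made up here purely for illustration):

SELECT inhparent::regclass
FROM pg_inherits
WHERE inhrelid = 'part1'::regclass
  AND inhispartition;   -- hypothetical flag: true only for "partitioning" inheritance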
On 2016/02/20 5:06, Corey Huinker wrote:
> On Thu, Feb 18, 2016 at 12:41 AM, Amit Langote wrote:
>
>> START [ EXCL ] (startval) END [ INCL ] (endval)
>>
>> That is, in range type notation, '[startval, endval)' is the default
>> behavior. So for each partition, there is at least the following pieces of
>> metadata:
>
> This is really close, and if it is what we ended up with we would be able
> to use it.
>
> I suggest that the range notation can be used even when no suitable range
> type exists.
>
> I assume the code for parsing a range spec regardless of data type already
> exists, but in case it doesn't, take a range spec of unknown type:
>
> [x,y)
>
> x and y are either going to be raw strings or doublequoted strings with
> possible doublequote escapes, each of which would be coercible into the
> type of the partition column.
>
> In other words, if your string values were 'blah , blah ' and 'fabizzle',
> the [) range spec would be ["blah , blah ",fabizzle).

Hm, I see. How about multi-column keys? Do we care enough about that use case? I don't see a nice-enough-looking range literal for such keys. Consider, for instance, the partitioned table defined as:

CREATE TABLE foo(c1 char(2), c2 char(2)) PARTITION BY RANGE (c1, c2);

where we'd use LESS THAN as:

CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES LESS THAN ('b', '2');
CREATE TABLE foo_ax2x PARTITION OF foo FOR VALUES LESS THAN ('b', '3');
CREATE TABLE foo_ax3x PARTITION OF foo FOR VALUES LESS THAN ('b', '4');
CREATE TABLE foo_bx1x PARTITION OF foo FOR VALUES LESS THAN ('c', '2');
CREATE TABLE foo_bx2x PARTITION OF foo FOR VALUES LESS THAN ('c', '3');
CREATE TABLE foo_bx3x PARTITION OF foo FOR VALUES LESS THAN ('c', '4');

I guess it would be the following with the new range spec:

CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES ['b', 'b'], ['1', '2');
CREATE TABLE foo_ax2x PARTITION OF foo FOR VALUES ['b', 'b'], ['2', '3');
CREATE TABLE foo_ax3x PARTITION OF foo FOR VALUES ['b', 'b'], ['3', '4');
CREATE TABLE foo_bx1x PARTITION OF foo FOR VALUES ['c', 'c'], ['1', '2');
CREATE TABLE foo_bx2x PARTITION OF foo FOR VALUES ['c', 'c'], ['2', '3');
CREATE TABLE foo_bx3x PARTITION OF foo FOR VALUES ['c', 'c'], ['3', '4');

The first column does not unambiguously select a partition with such keys. Now I can imagine that it is possible to "emulate" such a multi-column range key with k columns by k levels of range-range-* partitioning:

CREATE TABLE foo(c1 char(2), c2 char(2)) PARTITION BY RANGE (c1);

CREATE TABLE foo_ax PARTITION OF foo FOR VALUES ['a', 'b') PARTITION BY RANGE (c2);
CREATE TABLE foo_ax1x PARTITION OF foo_ax FOR VALUES ['1', '2');
CREATE TABLE foo_ax2x PARTITION OF foo_ax FOR VALUES ['2', '3');
CREATE TABLE foo_ax3x PARTITION OF foo_ax FOR VALUES ['3', '4');

CREATE TABLE foo_bx PARTITION OF foo FOR VALUES ['b', 'c') PARTITION BY RANGE (c2);
CREATE TABLE foo_bx1x PARTITION OF foo_bx FOR VALUES ['1', '2');
CREATE TABLE foo_bx2x PARTITION OF foo_bx FOR VALUES ['2', '3');
CREATE TABLE foo_bx3x PARTITION OF foo_bx FOR VALUES ['3', '4');

But IIRC, I have been contradicted once before regarding whether the two ways are exactly the same.

> Using regular range specs syntax also allows for the range to be unbounded
> in either or both directions, which is a possibility, especially in newer
> tables where the expected distribution of data is unknown.

We would also want to think about what subset of the many permutations of this syntax to accept as range specs for new partitions.
Mostly to preserve the non-overlapping invariant, and I think it would also be nice to prevent gaps. Consider that once we create:

PARTITION FOR VALUES [current_date,);

Now to create a new partition starting at a later date, we have to have a "split partition" feature which would scan the above partition to distribute the existing data rows appropriately to the resulting two partitions. Right? IOW, one shouldn't create an unbounded partition if more partitions in the unbounded direction are expected to be created. It would be OK for unbounded partitions to be on the lower end most of the time.

> p.s. Sorry I haven't been able to kick the tires just yet. We have a very
> good use case for this, it's just a matter of getting a machine and the
> time to devote to it.

I would appreciate it. You could wait a little more for my next submission which will contain some revisions to the tuple routing code.

Thanks,
Amit
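Concretely, the situation being described would play out like this under the proposed syntax (logs and its partitions are hypothetical names used only for illustration):

CREATE TABLE logs (ts date NOT NULL) PARTITION BY RANGE ON (ts);
CREATE TABLE logs_2016 PARTITION OF logs FOR VALUES START ('2016-01-01');
-- logs_2016 now covers ['2016-01-01', +INFINITY)
CREATE TABLE logs_2017 PARTITION OF logs FOR VALUES START ('2017-01-01');
-- ERROR: cannot create partition that overlaps with an existing one
-- The only way out would be a (yet unimplemented) "split partition" command
-- that moves the 2017 rows out of logs_2016 into the new partition.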
> Hm, I see. How about multi-column keys? Do we care enough about that use
> case? I don't see a nice-enough-looking range literal for such keys.
> Consider for instance,
> With the partitioned table defined as:
> CREATE TABLE foo(c1 char(2), c2 char(2)) PARTITION BY RANGE (c1, c2);
Good question! I would assume that we'd use a syntax that presumes c1 and c2 are a hypothetical composite type. But what does that look like?
To answer it, I tried this:
# create type duple as (a text, b text);
CREATE TYPE
# create type duplerange as range (subtype = duple);
CREATE TYPE
# select '(beebop,alula)'::duple;
duple
----------------
(beebop,alula)
(1 row)
# select '("hey ho","letsgo")'::duple;
duple
-------------------
("hey ho",letsgo)
(1 row)
analytics=# select duplerange('(beebop,alula)','("hey ho","letsgo")','(]');
duplerange
------------------------------------------
("(beebop,alula)","(""hey ho"",letsgo)"]
(1 row)
So I would assume that we'd use a syntax that presumed the columns were in a composite range type.
Which means your creates would look like (following Robert Haas's implied suggestion that we leave off the string literal quotes):
CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES ( , (b,2) );
CREATE TABLE foo_ax2x PARTITION OF foo FOR VALUES [ (b,2), (b,3) );
CREATE TABLE foo_ax3x PARTITION OF foo FOR VALUES [ (b,3), (b,4) );
CREATE TABLE foo_bx1x PARTITION OF foo FOR VALUES [ (b,4), (c,2) );
CREATE TABLE foo_bx2x PARTITION OF foo FOR VALUES [ (c,2), (c,3) );
CREATE TABLE foo_bx3x PARTITION OF foo FOR VALUES [ (c,3), (c,4) );
That's not terrible looking.
> We would want to also think about what subset of many permutations of this
> syntax to accept range specs for new partitions. Mostly to preserve the
> non-overlapping invariant and I think it would also be nice to prevent gaps.
Gaps *might* be intentional. I can certainly see where we'd want to set up warnings for discontinuity, or perhaps some utility functions:
pg_partitions_ranges_are_continuous('master_table_name')
pg_partitions_are_adjacent('master_table_name','p1','p2')
But for the most part, range partitions evolve from splits when one partition grows too big, so that won't be such a problem.
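The continuity check itself is easy enough to express; assuming the bounds were exposed somewhere as a relation partition_bounds(partname, lb, ub) (a purely hypothetical name), the core query of such a utility function might be:

SELECT count(*) = 0 AS continuous
FROM (SELECT ub, lead(lb) OVER (ORDER BY lb) AS next_lb
      FROM partition_bounds) s
WHERE next_lb IS NOT NULL AND next_lb <> ub;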
> Consider that once we create:
> PARTITION FOR VALUES [current_date,);
> Now to create a new partition starting at later date, we have to have a
> "split partition" feature which would scan the above partition to
> distribute the existing data rows appropriately to the resulting two
> partitions. Right?
Correct. And depending on policy, that might be desirable and might not be.
If the table were for death records, we'd very much want to reject rows in the future, if only to avoid upsetting the person.
If the table were of movie release dates, we'd *expect* that only dates (,current_date] would be entered, but if someone chose to leak a release date, we'd want to capture that and deal with it later.
So yeah, we're going to (eventually) need a SPLIT PARTITION that migrates rows to a new partition.
> IOW, one shouldn't create an unbounded partition if more partitions in the
> unbounded direction are expected to be created. It would be OK for
> unbounded partitions to be on the lower end most of the times.
On this I'll have to disagree. My own use case where I use my range_partitioning extension starts off with a single partition () and all new partitions are splits of that. The ranges evolve over time as partitions grow and slow down. It's nice because we're not trying to predict where growth will be, we split where growth is.
> p.s. Sorry I haven't been able to kick the tires just yet. We have a very
> good use case for this, it's just a matter of getting a machine and the
> time to devote to it.
> I would appreciate it. You could wait a little more for my next
> submission which will contain some revisions to the tuple routing code.
Ok, I'll wait a bit. In the meantime I can tell you a bit about the existing production system I'm hoping to replicate with true partitioning. It looks like this:
Big Master Table:
    Range partition by C collated text
        Date Range
        Date Range
        Date Range
        ...
    Range partition by C collated text
        Date Range
        Date Range
        ...
    ...
Currently this is accomplished through my range_partitioning module, and then using pg_partman on those partitions. It works, but it's a lot of moving parts.
The machine will be a 32 core AWS box. As per usual with AWS, it will have ample memory and CPU, and be somewhat starved for I/O.
Question: I haven't dove into the code, but I was curious about your tuple routing algorithm. Is there any way for the algorithm to begin its scan of candidate partitions based on the destination of the last row inserted by this statement? I ask because most use cases (that I am aware of) have data that would naturally cluster in the same partition.
Why not base it on "Exclusion Constraints"?

Most discussions as of late seem to focus on range overlaps, whose appeal (I would think) is that it supports both "equality" and "overlaps", two popular partitioning schemes. "Equality" as in "value1 = value2" can be implemented with "range overlaps" as "range(value1,value) = range(value,value2)".

I would think that Partitioning schemes can be Declarative, Efficient and not restricted to Equality and Overlaps as long as all partitions (of a partitioned table) are using a single partitioning definition expressed as:
- An Immutable Expression on tuple columns, in the simplest case a single column
- An Operator, in the simplest case, "equality"

That seems very close to the semantics of "Constraint Exclusion" as described here:
http://thoughts.davisjeff.com/2010/09/25/exclusion-constraints-are-generalized-sql-unique/

If partitioning could be based on EC, it would bring these additional benefits:
- The choice of operator, as long as it is boolean, commutative and indexable
- The use of an Expression/Function and not just bare columns

Jean-Pierre Pelletier
On 2016/02/23 22:51, Robert Haas wrote: > On Thu, Feb 18, 2016 at 11:11 AM, Amit Langote wrote: >> Some might think that writing potentially the same PARTITION BY clause 100 >> times for 100 level-1 partitions could be cumbersome. That is what >> SUBPARTITION BY notation may be good as a shorthand for. > > I think that anybody who is doing that much partitioning is going to > use a tool to generate the DDL anyway, so it doesn't really matter. >>> I think if you've got SUBPARTITION as a keyword in the syntax >>> anywhere, you're doing it wrong. The toplevel object shouldn't really >>> care whether its children are themselves partitioned or not. >> >> This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It >> might as well be just cascaded PARTITION BYs. The point is to specify as >> many of those with CREATE TABLE toplevel as the number of levels of >> partitioning we want. That does however fix the number of levels in advance. > > Which I think is not good. If we say that a table can be partitioned, > and a table that is a partition can also be partitioned, we've got a > nice general system. Fixing the number of partitioning levels for the > sake of a little syntactic sugar is, IMHO, getting our priorities > backwards. OK. To reiterate the syntax: CREATE TABLE parent(...) PARTITION BY CREATE TABLE partition PARTITION OF parent FOR VALUES ... [ PARTITION BY ] ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE source_table [ NO VALIDATE ] ALTER TABLE partitioned DETACH PARTITION name [ WITH new_name ] A note about NO VALIDATE in ATTACH PARTITION: If specified, it means user is telling the system to "trust" that none of the rows contained in the source table lie outside partition boundary specification (the FOR VALUES clause). Which is not the same thing as adding a NOT VALID check constraint because the check constraint is assumed invalid by the optimizer until explicitly validated by VALIDATE CONSTRAINT. The default behavior is to validate by scanning the source table to check for violating rows and fail adding the partition, if any. Because adding a partition requires an exclusive lock on the parent, the default behavior may cause us to have the lock for long durations which may be undesirable. Should do better than that? >> While we are on the syntax story, how about FOR VALUES syntax for range >> partitions (sorry for piggybacking it here in this message). At least some >> people dislike LESS THAN notation. Corey Huinker says we should be using >> range type literals for that. It's not clear though that using range type >> literals directly is a way ahead. For one, range type API expects there to >> exist a range type with given element type. Whereas all we require for >> range partitioning proper is for column type to have a btree operator >> class. Should we require it to have an associated range type as well? >> Don't think that there exists an API to check that either. All in all, >> range types are good to implement things in applications but not so much >> within the backend (unless I'm missing something). I know reinventing the >> wheel is disliked as well but perhaps we could offer something like the >> following because Corey offered some examples which would help from the >> flexibility: >> >> START [ EXCL ] (startval) END [ INCL ] (endval) > > I don't think using range type literals is gonna work. There's no > guarantee that the necessary range types exist. However, we could > possibly use a syntax inspired by the syntax for range types. 
I'm a > little nervous about asking people to type commands with mismatching > braces: > > CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES [ 1, 10 ); > > ...but maybe it's a good idea. It certainly has the advantage of > being more compact than a lot of there ways we might choose to write > that. And I think LESS THAN sucks. It's just clunky and awkward > syntax. Slightly concerned about multi-column range partition key but as suggested by Corey, we can use record-like notation. CREATE TABLE foo(c1 char(2), c2 char(2)) PARTITION BY RANGE (c1, c2); CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ ('a', 1), ('a', 2) ); CREATE TABLE foo_ax2x PARTITION OF foo FOR VALUES [ ('a', 2), ('a', 3) ); CREATE TABLE foo_ax3x PARTITION OF foo FOR VALUES [ ('a', 3), ('a', 4) ); CREATE TABLE foo_bx1x PARTITION OF foo FOR VALUES [ ('b', 1), ('b', 2) ); CREATE TABLE foo_bx2x PARTITION OF foo FOR VALUES [ ('b', 2), ('b', 3) ); CREATE TABLE foo_bx3x PARTITION OF foo FOR VALUES [ ('b', 3), ('b', 4) ); >> I see the trade-off. I agree that considering the significance for attach >> partition case is quite important. >> >> So, the tuple routing code should be ready to use projection if there >> happens to be a partition with differing tuple descriptor. In the code I >> posted, a ResultRelInfo is lazily built afresh for each inserted tuple in >> ExecInsert's case and for each tuple where the chosen partition is >> different from the previous tuple's in CopyFrom's case. One can feel that >> there is a certain overhead to that approach for the bulk-loading case >> (almost every CopyFrom invocation). Now if projection enters this picture, >> we have to consider that too. Should we initialize ResultRelInfo's and >> corresponding ProjectionInfo's for all partitions beforehand? Consider how >> ModifyTable currently handles update on inheritance set, for example. That >> would incur unnecessary overhead if only a single tuple is inserted. But >> it would certainly help bulk-loading case. Am I missing something? > > I'd say you are missing benchmarking data. :-) > > Why should the single-tuple case be harmed here? I think setting up N ResultRelInfos in advance where the tuple would only ever require one might be superfluous. But that may point to some flaw in my original design or thinking about the case. As I said, my approach so far is to "lazily" set up a ResultRelInfo once a valid leaf partition for the tuple is found and OID returned. Now there will also be a ProjectionInfo if the target partition's physical descriptor happens to differ from the root table. All we ever start with in ExecInsert() is root table's ResultRelInfo which comes with executor state related to partition key (expression states, if any) and information about root table's partitions. ResultRelInfos for actual partitions that the tuple is inserted into are ephemeral and are created for ExecInsertIndexTuples's perusal only, that is, switched back to root table's right after inserting index tuples. Maybe, there should be an array in EState of ResultRelInfos and ProjectionInfos of size N where N is the number of leaf partitions and indexed by leaf partition sequence number. The sequence number would be based on OID order. That means the partition module now returns the sequence number instead of OID. I indeed should collect some performance numbers for any resulting implementation. 
>> Consider that we create partitions as inheritance children, that is, >> creating a partition using: >> >> CREATE TABLE partition PARTITION OF parent FOR VALUES ... >> >> is equivalent to saying >> >> CREATE TABLE partition (...) INHERITS (parent) >> >> except that the latter allows partition to have its own column >> definitions, table constraints and multiple parents as long as things are >> conflict-free. The former requires specifying partition bounding values. >> The common thing between the two then is StoreCatalogInheritance(partOid, >> parentOid) that will mark partition as inheritance child of parent in >> pg_inherits. So, our new partitions are really inheritance children but >> that's not apparent to users (where this last bit is important). > > So far, this sounds good to me. > >> Then consider ALTER TABLE partition - should we need to handle it in way >> different from existing inheritance code would do - >> >> * Prevent something whereas regular inheritance wouldn't? >> * Do something instead of/in addition to whatever regular inheritance does? >> >> Consider adding a column to partition - regular inheritance wouldn't >> prevent it because adding a column to child table doesn't get in the way >> of how inheritance works. Now we can make it so that if table is a >> partition created with CRATE TABLE PARTITION OF (which we can tell because >> the table has pg_partition entry), we should *actively* prevent adding a >> column. Inheritance already prevents dropping an inherited attribute in >> child but says "cannot drop inherited column". IMO in a partition's case, >> it should say "cannot drop columns of a partition". > > Agreed. So maybe pg_inherits gets an additional column that indicates > whether this is "regular" inheritance or "partitioning" inheritance, > and in the latter case there are additional restrictions. So far how I have thought it would work is that "partitions" have a pg_partition entry. So, wherever we want to check if an inheritance child is a partition, we just look up the OID in the syscache on pg_partition.partrelid. >> OTOH, adding a column to parent propagates to all partitions by way of >> ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is >> necessary for partitioned tables (which we can tell from relkind or maybe >> because table has pg_partitioned_rel entry) - especially no need to >> propagate column default and check constraint, etc. Because one cannot >> apply insert/update on partitions anyway. > > Hmm. I feel like maybe that should be allowed. If the user wants to > bulk-load a bunch of data and knows that it goes in a particular > partition, why should they have to pay the overhead of determining > that anew for each row? Hm, I thought tuple routing through the parent is the (only) way to ensure that a given partition only contains rows meant for that partition. IOW, there are no CHECK constraints per partition guarding the rows going in. However, there is a provision in the latest patch to "derive" a CHECK constraint by combining the key info and bound info and maybe ExecConstraints() should check those in addition to TupleConstr.has_not_null and TupleConstr.check constraints. Thoughts? Thanks, Amit
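In SQL terms, both points would look something like this (pg_partition and its partrelid column are the proposed catalog; part1 and its bounds are hypothetical, and the derived constraint is shown only as the equivalent expression, not something actually stored):

-- "is this table a partition?" becomes a lookup like:
SELECT EXISTS (SELECT 1 FROM pg_partition
               WHERE partrelid = 'part1'::regclass) AS is_partition;

-- and for a range partition created with FOR VALUES START ('2016-01-01') END ('2017-01-01')
-- on key column ts, the constraint derived from key + bound info is equivalent to:
--   CHECK (ts >= '2016-01-01' AND ts < '2017-01-01')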
Hi,

Thanks for your feedback.

On 2016/02/26 0:43, Jean-Pierre Pelletier wrote:
> Why not based it on "Exclusion Constraint" ?
>
> Most discussions as of late seems to focus on Range overlaps which appeal
> (I would think) is that it supports both "equality" and "overlaps", two
> popular partitioning schemes.
>
> "Equality" as in "value1 = value2" can be implemented with "range overlaps"
> as "range(value1,value) = range(value,value2)".
>
> I would think that Partitioning schemes can be Declarative, Efficient and
> not restricted to Equality and Overlaps as long as all partitions (of a
> partitioned table) are using a single partitioning definition expressed as:
> - An Immutable Expression on tuple columns, in the simplest case a single column
> - An Operator, in the simplest case, "equality"
>
> That seems very close to the semantic of "Constraint Exclusion" as described here:
> http://thoughts.davisjeff.com/2010/09/25/exclusion-constraints-are-generalized-sql-unique/
>
> If partitioning could be based on EC, it would bring these additional benefits:
> - The choice of operator as long as it is boolean, commutative and Indexable
> - The use of Expression/Function and not just bare columns

Note that the proposed patch does more or less what you say we should be doing, minus the "exclusion constraint" part. With the proposed syntax, you can specify expression(s)/column(s) as the partition key along with an "operator class" for the column like below:

CREATE TABLE foo (a type_name) PARTITION BY LIST (a USING opclass_name);
CREATE TABLE bar (a type_name) PARTITION BY RANGE (a USING opclass_name);

Right now, supported partition strategies are RANGE and LIST where "btree operators" suffice. So in the above example, type_name must have a suitable btree operator class defined in the system, which could be opclass_name. If opclass_name was created as the default for type_name, one need not write USING opclass_name.

Then when you create a partition of foo:

CREATE TABLE foo_partition PARTITION OF foo FOR VALUES IN (val1, val2);

the system enforces that foo_partition only contains values such that:

a = ANY ( ARRAY [val1, val2] ),

where the operator "=" refers to an operator belonging to the operator class opclass_name (hence can have a user-defined notion of "equality").

And when you create a partition of bar:

CREATE TABLE bar_partition PARTITION OF bar FOR VALUES [val1, val2);

the system enforces that bar_partition only contains values such that:

val1 <= a < val2,

where operators "<=" and "<" refer to the operators belonging to the operator class opclass_name (hence can have a user-defined notion of ordering).

Further, the system can also optimize queries based on its knowledge of operators appearing in query clauses and the implicit constraints just mentioned above. Specifically, it can exclude partitions using "constraint exclusion", which is NOT the same thing as "exclusion constraints", as you might be aware.
"Exclusion constraints" depend on having suitable a index (just like unique constraint enforcing btree index) that uses the specified operator to enforce the constraint: postgres=# CREATE TABLE circles ( postgres(# c circle, postgres(# EXCLUDE USING gist (c WITH &&) postgres(# ); CREATE TABLE postgres=# \d+ circles Table "public.circles"Column | Type | Modifiers | Storage | Stats target |Description-------+--------+-----------+---------+--------------+-------------c | circle | | plain | | Indexes: "circles_c_excl" EXCLUDE USING gist (c WITH &&) The talk of "non-overlapping partitions" in this thread refers to the invariant that partition DDL should maintain which uses ad-hoc logic to do that but which is based on the semantics of the specified operators. Thanks, Amit
16/02/16 07:46, Amit Langote wrote: > Hi Josh, > > On 2016/02/16 11:41, Josh berkus wrote: >> On 02/15/2016 04:28 PM, Amit Langote wrote: >>> Also, you won't see any optimizer and executor changes. Queries will still >>> use the same plans as existing inheritance-based partitioned tables, >>> although as I mentioned, constraint exclusion won't yet kick in. That will >>> be fixed very shortly. >> We're not going to use CE for the new partitioning long-term, are we? This >> is just the first version, right? > Yes. My approach in previous versions of stuffing major planner changes in > with the syntax patch was not quite proper in retrospect. So, I thought > I'd propose any major planner (and executor) changes later. > > Thanks, > Amit > Hello Amit, Thank you for your work. I'm currently working on extension aimed at planner optimization for partitioned tables (https://github.com/postgrespro/pg_pathman). At this moment I have an implementation of binary search for range partitioned tables with basic partitioning keys (date, timestamp, integers etc). And I'd like to try to combine your syntax and infrastructure with my binary search implementation. There likely will be changes in range syntax and partitions cache structure based on discussion. So looking forward for your next patch. Ildar
Hi Ildar, On 2016/02/29 7:14, Ildar Musin wrote: > 16/02/16 07:46, Amit Langote wrote: >> On 2016/02/16 11:41, Josh berkus wrote: >>> We're not going to use CE for the new partitioning long-term, are we? This >>> is just the first version, right? >> Yes. My approach in previous versions of stuffing major planner changes in >> with the syntax patch was not quite proper in retrospect. So, I thought >> I'd propose any major planner (and executor) changes later. >> >> Thanks, >> Amit >> > Hello Amit, > > Thank you for your work. I'm currently working on extension aimed at > planner optimization for partitioned tables > (https://github.com/postgrespro/pg_pathman). At this moment I have an > implementation of binary search for range partitioned tables with basic > partitioning keys (date, timestamp, integers etc). And I'd like to try to > combine your syntax and infrastructure with my binary search implementation. > There likely will be changes in range syntax and partitions cache > structure based on discussion. So looking forward for your next patch. Sure, thanks! I will look at your extension as well. Thanks, Amit
Hi Corey, Sorry for replying so late. On 2016/02/25 3:31, Corey Huinker wrote: > [ ... ] > So I would assume that we'd use a syntax that presumed the columns were in > a composite range type. > > Which means your creates would look like (following Robert Haas's implied > suggestion that we leave off the string literal quotes): > > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES ( , (b,2) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,2), (b,3) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,3), (b,4) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,4), (c,2) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (c,2), (c,3) ); > CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (c,3), (c,4) ); > > That's not terrible looking. So I tried a grammar that looked like the following: range_spec: lb_inc bound ',' bound ub_inc {<create-range-struct>} lb_inc: '[' { $$ = true; } | '(' { $$ = false; } ub_inc: ']' { $$ = true; } | ')' { $$ = false; } bound: a_expr { if (IsA($1, RowExpr)) /* (key1, key2, ...) */ $$ = ((RowExpr) $1)->args; else /* key */ $$ = list_make1($1); } | /* EMPTY */ { $$ = NIL; } Everything seemed to go dandy until I tried FOR VALUES (blah , blah], where psql wouldn't send the command string without accepting the closing parenthesis, :(. So maybe I should try to put the whole thing in '', that is, accept the full range_spec in a string, but then we are back to requiring full-blown range parse function which I was trying to avoid by using the aforementioned grammar. So, I decided to move ahead with the following grammar for time being: START (lower-bound) [ EXCLUSIVE ] | END (upper-bound) [ INCLUSIVE ] | START (lower-bound) [ EXCLUSIVE ] END (upper-bound) [ INCLUSIVE ] Where, *-bound: a_expr | *-bound ',' a_expr Note that in the absence of explicit specification, lower-bound is inclusive and upper-bound is exclusive. So the verbosity couldn't be helped unless accepting range literal in string form and exporting rangetypes.c:range_parse() with range partitioning-specific hacks (consider composite bounds) to parse it are acceptable things. >> IOW, one shouldn't create an unbounded partition if more partitions in the >> unbounded direction are expected to be created. It would be OK for >> unbounded partitions to be on the lower end most of the times. >> > > On this I'll have to disagree. My own use case where I use my > range_partitioning extension starts off with a single partition () and all > new partitions are splits of that. The ranges evolve over time as > partitions grow and slow down. It's nice because we're not trying to > predict where growth will be, we split where growth is. Okay, perhaps I should not presume a certain usage. However, as you know, the usage like yours requires some mechanism of data redistribution (also not without some syntax), which I am not targeting with the initial patch.If that was the only way of creating partitions,matters would be a little easier - you only specify a split point and have some rule about inclusivity around the split point. But we have to start with the usage where each new partition is separately created with explicit partition bound specification that is complete in itself and that's where the logic to check partition invariants may get a bit complicated. > Ok, I'll wait a bit. 
In the mean time I can tell you a bit about the > existing production system I'm hoping to replicate in true partitioning > looks like this: > > Big Master Table: > Range partition by C collated text > Date Range > Date Range > ... > Range partition by C collated text > Date Range > Date Range > ... > ... > > Currently this is accomplished through my range_partitioning module, and > then using pg_partman on those partitions. It works, but it's a lot of > moving parts. > > The machine will be a 32 core AWS box. As per usual with AWS, it will be > have ample memory and CPU, and be somewhat starved for I/O. > > Question: I haven't dove into the code, but I was curious about your tuple > routing algorithm. Is there any way for the algorithm to begin it's scan of > candidate partitions based on the destination of the last row inserted this > statement? I ask because most use cases (that I am aware of) have data that > would naturally cluster in the same partition. No. Actually the tuple-routing function starts afresh for each row. For range partitions, it's binary search over an array of upper bounds. There is no row-to-row state caching in the partition module itself. Thanks, Amit
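Spelled out, the START/END grammar adopted earlier in this message would accept bounds like these (a sketch of the proposed syntax only; the table and its partitions are made-up names):

CREATE TABLE t (c1 text, c2 int) PARTITION BY RANGE ON (c1, c2);

-- defaults: START is inclusive, END is exclusive
CREATE TABLE t_a PARTITION OF t FOR VALUES START ('a', 1) END ('b', 1);
-- overriding the defaults
CREATE TABLE t_b PARTITION OF t FOR VALUES START ('b', 1) END ('c', 1) INCLUSIVE;
-- only one bound given: unbounded in the other direction
CREATE TABLE t_c PARTITION OF t FOR VALUES START ('c', 1) EXCLUSIVE;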
> Sorry for replying so late.
No worries! We have jobs to do aside from this.
> Everything seemed to go dandy until I tried FOR VALUES (blah , blah],
> where psql wouldn't send the command string without accepting the closing
> parenthesis, :(. So maybe I should try to put the whole thing in '', that
> is, accept the full range_spec in a string, but then we are back to
> requiring full-blown range parse function which I was trying to avoid by
> using the aforementioned grammar. So, I decided to move ahead with the
> following grammar for time being:
> START (lower-bound) [ EXCLUSIVE ]
> | END (upper-bound) [ INCLUSIVE ]
> | START (lower-bound) [ EXCLUSIVE ] END (upper-bound) [ INCLUSIVE ]
> Where,
> *-bound: a_expr
> | *-bound ',' a_expr
> Note that in the absence of explicit specification, lower-bound is
> inclusive and upper-bound is exclusive.
Thanks for trying. I agree that it would require a full-blown range parser, and I'm not yet advanced enough to help you with that.
So presently partitions that are unbounded on the lower end aren't possible, but that's a creation syntax issue, not an infrastructure issue. Correct?
> Okay, perhaps I should not presume a certain usage. However, as you know,
> the usage like yours requires some mechanism of data redistribution (also
> not without some syntax), which I am not targeting with the initial patch.
I'm quite fine with limitations in this initial patch, especially if they don't limit what's possible in the future.
> Question: I haven't dove into the code, but I was curious about your tuple
> routing algorithm. Is there any way for the algorithm to begin it's scan of
> candidate partitions based on the destination of the last row inserted this
> statement? I ask because most use cases (that I am aware of) have data that
> would naturally cluster in the same partition.
> No. Actually the tuple-routing function starts afresh for each row. For
> range partitions, it's binary search over an array of upper bounds. There
> is no row-to-row state caching in the partition module itself.
bsearch should be fine, that's what I've used in my own custom partitioning schemes.
Was there a new patch, and if so, is it the one you want me to kick the tires on?
Hi, On 2016/03/09 9:17, Corey Huinker wrote: >> >> Sorry for replying so late. > No worries! We have jobs to do aside from this. Thanks! >> Everything seemed to go dandy until I tried FOR VALUES (blah , blah], >> where psql wouldn't send the command string without accepting the closing >> parenthesis, :(. So maybe I should try to put the whole thing in '', that >> is, accept the full range_spec in a string, but then we are back to >> requiring full-blown range parse function which I was trying to avoid by >> using the aforementioned grammar. So, I decided to move ahead with the >> following grammar for time being: >> >> START (lower-bound) [ EXCLUSIVE ] >> | END (upper-bound) [ INCLUSIVE ] >> | START (lower-bound) [ EXCLUSIVE ] END (upper-bound) [ INCLUSIVE ] >> >> Where, >> >> *-bound: a_expr >> | *-bound ',' a_expr >> >> Note that in the absence of explicit specification, lower-bound is >> inclusive and upper-bound is exclusive. > > Thanks for trying. I agree that it would be a full blown range parser, and > I'm not yet advanced enough to help you with that. > > So presently partitions that are unbounded on the lower end aren't > possible, but that's a creation syntax issue, not an infrastructure issue. > Correct? In case it wasn't apparent, you can create those: FOR VALUES END (upper-bound) [INCLUSIVE] which is equivalent to: FOR VALUES '(, upper-bound)' or FOR VALUES '(, upper-bound]' >>> Question: I haven't dove into the code, but I was curious about your >> tuple >>> routing algorithm. Is there any way for the algorithm to begin it's scan >> of >>> candidate partitions based on the destination of the last row inserted >> this >>> statement? I ask because most use cases (that I am aware of) have data >> that >>> would naturally cluster in the same partition. >> >> No. Actually the tuple-routing function starts afresh for each row. For >> range partitions, it's binary search over an array of upper bounds. There >> is no row-to-row state caching in the partition module itself. >> >> > bsearch should be fine, that's what I've used in my own custom partitioning > schemes. > > Was there a new patch, and if so, is it the one you want me to kick the > tires on? You didn't miss any. I need a little more time to send the next revision with significant design overhaul. Thanks, Amit
> So presently partitions that are unbounded on the lower end aren't
> possible, but that's a creation syntax issue, not an infrastructure issue.
> Correct?
> In case it wasn't apparent, you can create those:
> FOR VALUES END (upper-bound) [INCLUSIVE]
> which is equivalent to:
> FOR VALUES '(, upper-bound)' or FOR VALUES '(, upper-bound]'
Thanks for clarifying. My BNF-fu is weak.
On Mon, Mar 7, 2016 at 8:39 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Which means your creates would look like (following Robert Haas's implied >> suggestion that we leave off the string literal quotes): >> >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES ( , (b,2) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,2), (b,3) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,3), (b,4) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (b,4), (c,2) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (c,2), (c,3) ); >> CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ (c,3), (c,4) ); >> >> That's not terrible looking. > > So I tried a grammar that looked like the following: > > range_spec: lb_inc bound ',' bound ub_inc {<create-range-struct>} > > lb_inc: '[' { $$ = true; } | '(' { $$ = false; } > ub_inc: ']' { $$ = true; } | ')' { $$ = false; } > > bound: a_expr > { > if (IsA($1, RowExpr)) /* (key1, key2, ...) */ > $$ = ((RowExpr) $1)->args; > else /* key */ > $$ = list_make1($1); > } > | /* EMPTY */ { $$ = NIL; } I think we're converging on a good syntax, but I don't think the choice of nothingness to represent an open range is a good idea, both because it will probably create grammar conflicts now or later and also because it actually is sort of confusing and unintuitive to read given the rest of our syntax. I suggest using UNBOUNDED instead. Also, I don't think allowing an a_expr as a bound is remotely sensible - I think you should allow only Sconst, NumericOnly, NULL, and UNBOUNDED. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
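Under that suggestion, the open ends would be spelled out rather than left empty, along these lines (a sketch only; the syntax is not settled and the partition names are invented, reusing the foo example from earlier in the thread):

CREATE TABLE foo_low  PARTITION OF foo FOR VALUES [ UNBOUNDED, (b,2) );
CREATE TABLE foo_high PARTITION OF foo FOR VALUES [ (c,4), UNBOUNDED );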
On Fri, Feb 26, 2016 at 1:05 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I think setting up N ResultRelInfos in advance where the tuple would only > ever require one might be superfluous. But that may point to some flaw in > my original design or thinking about the case. You have a point. But maybe we should get it working first and optimize it later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> I think we're converging on a good syntax, but I don't think the
> choice of nothingness to represent an open range is a good idea, both
> because it will probably create grammar conflicts now or later and
> also because it actually is sort of confusing and unintuitive to read
> given the rest of our syntax. I suggest using UNBOUNDED instead.
As much as it reminds me of the window syntax I loathe (ROWS BETWEEN UNBOUNDED ....gah), I'm inclined to agree with Robert here.
It also probably helps for code forensics in the sense that it's easier to text search for a something than a nothing.
Hi, Amit!
I tried to apply your patch. It still applies, but has some duplicate oids. After fixing the duplicate oids, I noticed the following warnings during compilation with clang-700.1.81.
scan.c:10308:23: warning: unused variable 'yyg' [-Wunused-variable]
struct yyguts_t * yyg = (struct yyguts_t*)yyscanner; /* This var may be unused depending upon options. */
^
tablecmds.c:12922:6: warning: variable 'is_subpart' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
if (parent != NULL)
^~~~~~~~~~~~~~
tablecmds.c:12931:12: note: uninitialized use occurs here
partKey = is_subpart ? list_nth(rootParent->rd_partkeys, parent_level) :
^~~~~~~~~~
tablecmds.c:12922:2: note: remove the 'if' if its condition is always true
if (parent != NULL)
^~~~~~~~~~~~~~~~~~~
tablecmds.c:12912:19: note: initialize the variable 'is_subpart' to silence this warning
bool is_subpart;
^
= '\0'
tablecmds.c:13375:37: warning: variable 'operoid' is uninitialized when used here [-Wuninitialized]
comp_left_expr = make_opclause(operoid, BOOLOID, false,
^~~~~~~
tablecmds.c:13326:17: note: initialize the variable 'operoid' to silence this warning
Oid operoid;
^
= 0
Regression tests passed cleanly for me. I also examined the code a bit. As I understand it, for DML queries, declarative partitioning works like inheritance. It just provides an alternative way of collecting the append_rel_list.
We're working on the other side of the partitioning problem. Without touching the syntax problem, we're solving performance problems in the pg_pathman extension: https://github.com/postgrespro/pg_pathman. We already have interesting results which you can see in the blog posts [1], [2], [3].
We already have a fast algorithm for partition selection in the optimizer [1] and effective optimization of filter conditions [3]. And we're planning to implement the following features:
- Execute-time partition selection (useful for nested loops and prepared statements);
- Optimization of ordered output from partitioned tables (use Append instead of MergeAppend when possible);
- Optimization of hash join when both tables are partitioned by the join key.
9.6 feature freeze is coming, and we're planning our development resources for 9.7. Besides providing an extension, we would like these features to eventually arrive in core. In order to achieve this we need to port our functionality over to your declarative syntax. At first, we would need some way of caching partition metadata suitable for fast partition selection. For range partitioning it could be a sorted array or an RB-tree of partition bounds. When we have this infrastructure, we can start porting pieces of pg_pathman functionality to declarative partitioning.
So, our draft plan of patches would be the following:
- Implement a partition metadata cache suitable for fast partition selection.
- Fast partition selection using the metadata cache.
- Optimization of filter conditions passed to partitions.
- Execute-time partition selection (useful for nested loops and prepared statements);
- Optimization of ordered output from partitioned tables (use Append instead of MergeAppend when possible);
- Optimization of hash join when both tables are partitioned by the join key.
I'd like to validate that this development plan doesn't overlap with your plans. If our plans are not overlapping then let's accept this plan of work for 9.7.
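For reference, these are the query shapes the last three plan items are aimed at (a sketch; the tables are hypothetical and assumed to be partitioned as stated in the comments):

-- execute-time partition selection: the matching partition is only known at run time
PREPARE q(date) AS SELECT * FROM measurement WHERE logdate = $1;

-- ordered output: each range partition is already ordered by logdate and the ranges
-- don't overlap, so an Append in bound order can replace a MergeAppend
SELECT * FROM measurement ORDER BY logdate LIMIT 100;

-- partition-wise hash join: both sides partitioned on the join key order_id,
-- so matching partitions can be joined pairwise
SELECT o.order_id, sum(l.amount)
FROM orders o JOIN order_lines l ON o.order_id = l.order_id
GROUP BY o.order_id;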
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Alexander, Thanks a lot for taking a look! On Wed, Mar 16, 2016 at 10:56 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > I tried to apply your patch. It still applies, but has some duplicate oids. Actually, a reworked version I will post hopefully early next week will have fixed this. > After fixing duplicate oids, I've noticed following warning during > compilation by clang-700.1.81. > > scan.c:10308:23: warning: unused variable 'yyg' [-Wunused-variable] > struct yyguts_t * yyg = (struct yyguts_t*)yyscanner; /* This var may be > unused depending upon options. */ > ^ > tablecmds.c:12922:6: warning: variable 'is_subpart' is used uninitialized > whenever 'if' condition is false [-Wsometimes-uninitialized] > if (parent != NULL) > ^~~~~~~~~~~~~~ > tablecmds.c:12931:12: note: uninitialized use occurs here > partKey = is_subpart ? list_nth(rootParent->rd_partkeys, > parent_level) : > ^~~~~~~~~~ > tablecmds.c:12922:2: note: remove the 'if' if its condition is always true > if (parent != NULL) > ^~~~~~~~~~~~~~~~~~~ > tablecmds.c:12912:19: note: initialize the variable 'is_subpart' to silence > this warning > bool is_subpart; > ^ > = '\0' > tablecmds.c:13375:37: warning: variable 'operoid' is uninitialized when used > here [-Wuninitialized] > comp_left_expr = > make_opclause(operoid, BOOLOID, false, > > ^~~~~~~ > tablecmds.c:13326:17: note: initialize the variable 'operoid' to silence > this warning > Oid operoid; > ^ > = 0 Oh, I will find and fix these issues if they are still left in the newer version. Thanks for reporting. > Regression tests passed cleanly for me. I also examined code a bit. As I > get, for DML queries, declarative partitioning works like inheritance. It > just provides alternative way for collecting append_rel_list. Right, my latest patch doesn't touch this part at all. > We're working on the other side of partitioning problem. Without solving > syntax problem, we're solving performance problems in pg_pathman extension: > https://github.com/postgrespro/pg_pathman. We already have interesting > results which you can see in blog posts [1], [2], [3]. I have been following the blog posts. Results look promising, :) > We already have fast algorithm for partition selection in optimizer [1] and > effective optimization of filter conditions [3]. And we're planning to > implement following features: > > Execute time partitions selection (useful for nested loops and prepared > statements); > Optimization of ordered output from patitioned tables (use Append instead of > MergeAppend when possible); > Optimization of hash join when both tables are patitioned by join key. > > 9.6 feature freeze in coming, and we're planning our development resources > for 9.7. Besides providing an extension, we would like these features to > eventually arrive to core. In order to achieve this we need to port out > functionality over your declarative syntax. At first, we would need some > way for caching partition metadata suitable for fast partition selection. > For range partition it could be sorted array or RB-tree of partition bounds. > When we have this infrastructure, we can start porting pieces of pg_pathman > functionality to declarative partitiong. I had to think about the internal metadata representation (and its caching) when developing the tuple routing solution. I am hopeful that it is suitable for other executor mechanisms we will build for partitioned tables. 
> So, our draft plan of patches would be following: > > Implement partition metadata cache suitable for fast partition selection. > Fast partition selection using metadata cache. > Optimization of filter conditions passed to partitions. > Execute time partitions selection (useful for nested loops and prepared > statements); > Optimization of ordered output from patitioned tables (use Append instead of > MergeAppend when possible); > Optimization of hash join when both tables are patitioned by join key. > > I'd like to validate that this development plan doesn't overlaps with your > plans. If out plans are not overlapping then let's accept this plan of work > for 9.7. It looks OK to me. Thanks for sharing it. Thanks, Amit
On Wed, Mar 16, 2016 at 5:29 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> 9.6 feature freeze in coming, and we're planning our development resources
> for 9.7. Besides providing an extension, we would like these features to
> eventually arrive to core. In order to achieve this we need to port out
> functionality over your declarative syntax. At first, we would need some
> way for caching partition metadata suitable for fast partition selection.
> For range partition it could be sorted array or RB-tree of partition bounds.
> When we have this infrastructure, we can start porting pieces of pg_pathman
> functionality to declarative partitioning.
I had to think about the internal metadata representation (and its
caching) when developing the tuple routing solution. I am hopeful
that it is suitable for other executor mechanisms we will build for
partitioned tables.
Yes, it appears that I missed it. You already have a sorted array for range partitioning and a binary search implementation. This is good.
I'm a bit worried about list partitioning, because in that case you scan the array sequentially. We could optimize this case for better handling of many list partitions. This is probably not the most common case, though.
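For illustration, here is a minimal, self-contained sketch (not the patch's code) of the kind of optimization being suggested: keep the list-partition values in a sorted array and select a partition by binary search instead of a sequential scan. The struct and function names, and the use of plain ints in place of Datums and OIDs, are assumptions made only for this example.

/*
 * Minimal sketch only: list-partition selection via binary search over a
 * sorted array of list values, instead of a sequential scan.  Plain ints
 * stand in for Datums and partition OIDs.
 */
#include <stdio.h>

typedef struct ListPartMap
{
    int     nvalues;
    int    *values;         /* sorted list-partition values */
    int    *part_index;     /* values[i] belongs to partition part_index[i] */
} ListPartMap;

/* Return the partition index for 'key', or -1 if no partition accepts it. */
static int
list_partition_for(const ListPartMap *map, int key)
{
    int     lo = 0;
    int     hi = map->nvalues - 1;

    while (lo <= hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (map->values[mid] == key)
            return map->part_index[mid];
        else if (map->values[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}

int
main(void)
{
    int         values[] = {10, 20, 30, 40, 50};
    int         parts[]  = {0, 1, 0, 2, 1};     /* value -> owning partition */
    ListPartMap map = {5, values, parts};

    printf("30 -> partition %d\n", list_partition_for(&map, 30));  /* 0 */
    printf("35 -> partition %d\n", list_partition_for(&map, 35));  /* -1 */
    return 0;
}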
> So, our draft plan of patches would be following:
>
> Implement partition metadata cache suitable for fast partition selection.
> Fast partition selection using metadata cache.
> Optimization of filter conditions passed to partitions.
> Execute time partitions selection (useful for nested loops and prepared
> statements);
> Optimization of ordered output from partitioned tables (use Append instead of
> MergeAppend when possible);
> Optimization of hash join when both tables are partitioned by join key.
>
> I'd like to validate that this development plan doesn't overlap with your
> plans. If our plans are not overlapping then let's accept this plan of work
> for 9.7.
It looks OK to me. Thanks for sharing it.
Great! Let's work together.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Mar 16, 2016 at 10:49 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>> > I'd like to validate that this development plan doesn't overlap with
>> > your plans. If our plans are not overlapping then let's accept this
>> > plan of work for 9.7.
>>
>> It looks OK to me. Thanks for sharing it.
>
> Great! Let's work together.

So, the last patch on this thread was posted on February 17th, and the CF entry was marked Waiting on Author on March 2nd. Even if we had a new patch in hand at this point, I don't think there's any real chance of being able to get this done for 9.6; there are too many things left to do here in terms of figuring out syntax and scope, and of course performance testing. Moreover, when this goes in, it's going to open up lots of opportunities for follow-up optimizations that we surely do not have time to follow up on for 9.6. And, as it is, the patch hasn't been updated in over a month and is clearly not in final form as it exists today.

Therefore, I have marked this Returned with Feedback. I look forward to returning to this topic for 9.7, and I'm willing to step up to the plate and review this more aggressively at that time, with an eye toward committing it when we've got it in good shape. But I don't think there's any way to proceed with it for 9.6.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 21 March 2016 at 19:55, Robert Haas <robertmhaas@gmail.com> wrote:
--
On Wed, Mar 16, 2016 at 10:49 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> > I'd like to validate that this development plan doesn't overlap with
>> > your
>> > plans. If our plans are not overlapping then let's accept this plan of
>> > work
>> > for 9.7.
>>
>> It looks OK to me. Thanks for sharing it.
>
>
> Great! Let's work together.
So, the last patch on this thread was posted on February 17th, and the
CF entry was marked Waiting on Author on March 2nd. Even if we had a
new patch in hand at this point, I don't think there's any real chance
of being able to get this done for 9.6; there are too many things left
to do here in terms of figuring out syntax and scope, and of course
performance testing. Moreover, when this goes in, it's going to open
up lots of opportunities for follow-up optimizations that we surely do
not have time to follow up on for 9.6. And, as it is, the patch
hasn't been updated in over a month and is clearly not in final form
as it exists today.
Therefore, I have marked this Returned with Feedback. I look forward
to returning to this topic for 9.7, and I'm willing to step up to the
plate and review this more aggressively at that time, with an eye
toward committing it when we've got it in good shape. But I don't
think there's any way to proceed with it for 9.6.
Good decision.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016/03/22 4:55, Robert Haas wrote:
> So, the last patch on this thread was posted on February 17th, and the
> CF entry was marked Waiting on Author on March 2nd. Even if we had a
> new patch in hand at this point, I don't think there's any real chance
> of being able to get this done for 9.6; there are too many things left
> to do here in terms of figuring out syntax and scope, and of course
> performance testing. Moreover, when this goes in, it's going to open
> up lots of opportunities for follow-up optimizations that we surely do
> not have time to follow up on for 9.6. And, as it is, the patch
> hasn't been updated in over a month and is clearly not in final form
> as it exists today.
>
> Therefore, I have marked this Returned with Feedback. I look forward
> to returning to this topic for 9.7, and I'm willing to step up to the
> plate and review this more aggressively at that time, with an eye
> toward committing it when we've got it in good shape. But I don't
> think there's any way to proceed with it for 9.6.

OK. I agree with the decision. Actually, I was just about to post a patch set today, but I figure it's too late for that. Anyway, I look forward to working on this for 9.7.

Thanks,
Amit
Hi Amit,
Here are some random comments. You said that you were about to post a new patch, so you might have already taken care of these comments. But anyway, here they are.
1. The following patches do not apply cleanly for me:
0001
0003 - conflicts at #include for partition.h in rel.h.
0004 - conflicts in src/backend/catalog/Makefile
0005 - conflicts in src/backend/parser/gram.y
2. The patches use OIDs 3318-3323, which are now already in use. The corresponding objects need new, unused OIDs.
3. In patch 0001-*, you are using indkey instead of partkey in one of the comments and also in the documentation.
4. After all patches are applied I am getting several errors like "error: #error "lock.h may not be included from frontend code"" while building rmgrdesc.c. This seems to be because rel.h includes partition.h, which leads to the inclusion of lock.h in rmgrdesc.c. Are you getting the same error message? It looks like we need a separate header file for declaring the functions that are used at execution time, which is anyway better irrespective of the compiler error.
5. In expand_partitioned_rtentry(), instead of finding all the leaf level relations, it might be better to build the RTEs, RelOptInfo and paths for
intermediate relations as well. This helps in partition pruning. In your introductory write-up you have mentioned this. It might be better if v1 includes this change, so that the partition hierarchy is visible in EXPLAIN output as well.
6. Explain output of scan on a partitioned table shows Append node with individual table scans as sub-plans. May be we should annotate Append node with
the name of the partitioned table to make EXPLAIN output more readable.
7. \d+ output of partitioned table does not show partitioning information, something necessary for V1.
8. Instead of storing partition key column numbers and expressions separately, can we store everything as expression; columns being a single Var node expression? That will avoid constructing Var nodes to lookup in equivalence classes for partition pruning or join optimizations.
10. The code distinguishes between the top level table and its partitions which in turn are partitioned. We should try to minimize this distinction as much as possible so as to use recursive functions for operating on partitions. E.g. each of the partitioned partitions may be labelled RELKIND_PARTITIONED_REL? A 0/NULL parent would distinguish between root partition and child partitions.
On Tue, Mar 22, 2016 at 7:53 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/03/22 4:55, Robert Haas wrote:
> So, the last patch on this thread was posted on February 17th, and the
> CF entry was marked Waiting on Author on March 2nd. Even if we had a
> new patch in hand at this point, I don't think there's any real chance
> of being able to get this done for 9.6; there are too many things left
> to do here in terms of figuring out syntax and scope, and of course
> performance testing. Moreover, when this goes in, it's going to open
> up lots of opportunities for follow-up optimizations that we surely do
> not have time to follow up on for 9.6. And, as it is, the patch
> hasn't been updated in over a month and is clearly not in final form
> as it exists today.
>
> Therefore, I have marked this Returned with Feedback. I look forward
> to returning to this topic for 9.7, and I'm willing to step up to the
> plate and review this more aggressively at that time, with an eye
> toward committing it when we've got it in good shape. But I don't
> think there's any way to proceed with it for 9.6.
OK. I agree with the decision.
Actually, I was just about to post a patch set today, but I figure it's
too late for that. Anyway, I look forward to working on this for 9.7.
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Hi Ashutosh,

On 2016/04/14 21:34, Ashutosh Bapat wrote:
> Hi Amit,
> Here are some random comments. You said that you were about to post a new
> patch, so you might have already taken care of these comments. But anyway,
> here they are.

Thanks a lot for the comments. The patch set changed quite a bit since the last version. Once the CF entry was marked returned with feedback on March 22, I held off sending the new version at all. Perhaps, it would have been OK. Anyway here it is, if you are interested. I will create an entry in CF 2016-09 for the same. Also, see below replies to your individual comments.

> 1. The following patches do not apply cleanly for me:
> 0001
> 0003 - conflicts at #include for partition.h in rel.h.
> 0004 - conflicts in src/backend/catalog/Makefile
> 0005 - conflicts in src/backend/parser/gram.y

These should be fixed, although the attached one is a significantly different patch set.

> 2. The patches use OIDs 3318-3323, which are now already in use. The
> corresponding objects need new, unused OIDs.

Right, fixed.

> 3. In patch 0001-*, you are using indkey instead of partkey in one of the
> comments and also in the documentation.

Fixed.

> 4. After all patches are applied I am getting several errors like "error:
> #error "lock.h may not be included from frontend code"" while building
> rmgrdesc.c. This seems to be because rel.h includes partition.h, which
> leads to the inclusion of lock.h in rmgrdesc.c. Are you getting the same
> error message? It looks like we need a separate header file for declaring
> the functions that are used at execution time, which is anyway better
> irrespective of the compiler error.

Yeah, I too am beginning to feel that dividing partition.h into separate headers would be a good idea in the long term. For the time being, I have managed to get rid of partition.h being #included in rel.h, which solves the issue.

> 5. In expand_partitioned_rtentry(), instead of finding all the leaf level
> relations, it might be better to build the RTEs, RelOptInfo and paths for
> intermediate relations as well. This helps in partition pruning. In your
> introductory write-up you have mentioned this. It might be better if v1
> includes this change, so that the partition hierarchy is visible in EXPLAIN
> output as well.
>
> 6. Explain output of scan on a partitioned table shows Append node with
> individual table scans as sub-plans. May be we should annotate Append node
> with the name of the partitioned table to make EXPLAIN output more readable.

I got rid of all the optimizer changes in the new version (except a line or two). I did that by switching to storing parent-child relationships in pg_inherits so that all the existing optimizer code for inheritance sets works unchanged. Also, the implementation detail that required putting only leaf partitions in the append list is gone. Previously no storage was allocated for partitioned tables (either the root or any of the internal partitions), so it was being done that way.

Regarding 6, it seems to me that it's because Append does not have an associated relid (like scan nodes have with scanrelid). Maybe we need to either fix Append or create some enhanced version of Append which would also support dynamic pruning.

> 7. \d+ output of partitioned table does not show partitioning information,
> something necessary for V1.

This has been fixed in the attached.

> 8. Instead of storing partition key column numbers and expressions
> separately, can we store everything as expression; columns being a single
> Var node expression? That will avoid constructing Var nodes to lookup in
> equivalence classes for partition pruning or join optimizations.

Hmm, let me consider that. FWIW, if you look at my proposed patch (or the description thereof) back in CF 2015-11 [1], you will see that I had modeled matching clauses to the partition key on the lines of how similar processing is done for indexes (as in how indexes are matched with clauses - down all the way to match_clause_to_indexcol()). None of that exists in the current patch set, but when we get to that (the optimizer patch, that is), I perhaps wouldn't do it radically differently. I admit however that I hadn't thought really hard about the join optimization stuff, so I may be missing something.

> 10. The code distinguishes between the top level table and its partitions
> which in turn are partitioned. We should try to minimize this distinction
> as much as possible so as to use recursive functions for operating on
> partitions. E.g. each of the partitioned partitions may be labelled
> RELKIND_PARTITIONED_REL? A 0/NULL parent would distinguish between root
> partition and child partitions.

Exactly how it's done in the attached. Any table that is partitioned is now a RELKIND_PARTITIONED_REL.

Thanks,
Amit

[1] http://www.postgresql.org/message-id/563341AE.5010207@lab.ntt.co.jp
Attachment
With the new set of patches, I am getting following warnings but no compilation failures. Patches apply smoothly.
partition.c:1216:21: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
partition.c:1637:20: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
For creating partition the documentation seems to suggest the syntax
create table t1_p1 partition of t1 for values {start {0} end {100} exclusive;
but following syntax works
create table t1_p1 partition of t1 for values start (0) end (100) exclusive;
> 5. In expand_partitioned_rtentry(), instead of finding all the leaf level
> relations, it might be better to build the RTEs, RelOptInfo and paths for
> intermediate relations as well. This helps in partition pruning. In your
> introductory write-up you have mentioned this. It might be better if v1
> includes this change, so that the partition hierarchy is visible in EXPLAIN
> output as well.
>
> 6. Explain output of scan on a partitioned table shows Append node with
> individual table scans as sub-plans. May be we should annotate Append node
> with
> the name of the partitioned table to make EXPLAIN output more readable.
I got rid of all the optimizer changes in the new version (except a line
or two). I did that by switching to storing parent-child relationships in
pg_inherits so that all the existing optimizer code for inheritance sets
works unchanged. Also, the implementation detail that required to put
only leaf partitions in the append list is also gone. Previously no
storage was allocated for partitioned tables (either root or any of the
internal partitions), so it was being done that way.
With this set of patches we are still flattening all the partitions.
postgres=# create table t1 (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table t1_p1 partition of t1 for values start (0) end (100) exclusive partition by range(b);
CREATE TABLE
postgres=# create table t1_p1_p1 partition of t1_p1 for values start (0) end (100) exclusive;
CREATE TABLE
postgres=# explain verbose select * from t1;
QUERY PLAN
-------------------------------------------------------------------------
Append (cost=0.00..32.60 rows=2262 width=8)
-> Seq Scan on public.t1 (cost=0.00..0.00 rows=1 width=8)
Output: t1.a, t1.b
-> Seq Scan on public.t1_p1 (cost=0.00..0.00 rows=1 width=8)
Output: t1_p1.a, t1_p1.b
-> Seq Scan on public.t1_p1_p1 (cost=0.00..32.60 rows=2260 width=8)
Output: t1_p1_p1.a, t1_p1_p1.b
(7 rows)
Retaining the partition hierarchy would help to push-down join across partition hierarchy effectively.
Regarding 6, it seems to me that because Append does not have a associated
relid (like scan nodes have with scanrelid). Maybe we need to either fix
Append or create some enhanced version of Append which would also support
dynamic pruning.
Right, I think, Append might store the relid of the parent table as well as partitioning information at that level along-with the subplans.
Some more comments:
1. Would it be better to declare PartitionDescData as
{
int nparts;
PartitionInfo *partinfo; /* array of partition information structures. */
}
2. The new syntax allows CREATE TABLE to be specified as partition of an already partitioned table. Is it possible to do the same for CREATE FOREIGN TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.
3. PartitionKeyData contains KeyTypeCollInfo, whose contents can be obtained by calling functions exprType, exprTypemod on partexprs. Why do we need to store that information as a separate member?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
With the new set of patches, I am getting following warnings but no compilation failures. Patches apply smoothly.
partition.c:1216:21: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
partition.c:1637:20: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
For creating partition the documentation seems to suggest the syntax
create table t1_p1 partition of t1 for values {start {0} end {100} exclusive;
but following syntax works
create table t1_p1 partition of t1 for values start (0) end (100) exclusive;
> 5. In expand_partitioned_rtentry(), instead of finding all the leaf level
> relations, it might be better to build the RTEs, RelOptInfo and paths for
> intermediate relations as well. This helps in partition pruning. In your
> introductory write-up you have mentioned this. It might be better if v1
> includes this change, so that the partition hierarchy is visible in EXPLAIN
> output as well.
>
> 6. Explain output of scan on a partitioned table shows Append node with
> individual table scans as sub-plans. May be we should annotate Append node
> with
> the name of the partitioned table to make EXPLAIN output more readable.
I got rid of all the optimizer changes in the new version (except a line
or two). I did that by switching to storing parent-child relationships in
pg_inherits so that all the existing optimizer code for inheritance sets
works unchanged. Also, the implementation detail that required to put
only leaf partitions in the append list is also gone. Previously no
storage was allocated for partitioned tables (either root or any of the
internal partitions), so it was being done that way.
With this set of patches we are still flattening all the partitions.
postgres=# create table t1 (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table t1_p1 partition of t1 for values start (0) end (100) exclusive partition by range(b);
CREATE TABLE
postgres=# create table t1_p1_p1 partition of t1_p1 for values start (0) end (100) exclusive;
CREATE TABLE
postgres=# explain verbose select * from t1;
QUERY PLAN
-------------------------------------------------------------------------
Append (cost=0.00..32.60 rows=2262 width=8)
-> Seq Scan on public.t1 (cost=0.00..0.00 rows=1 width=8)
Output: t1.a, t1.b
-> Seq Scan on public.t1_p1 (cost=0.00..0.00 rows=1 width=8)
Output: t1_p1.a, t1_p1.b
-> Seq Scan on public.t1_p1_p1 (cost=0.00..32.60 rows=2260 width=8)
Output: t1_p1_p1.a, t1_p1_p1.b
(7 rows)
Retaining the partition hierarchy would help to push-down join across partition hierarchy effectively.
Regarding 6, it seems to me that because Append does not have a associated
relid (like scan nodes have with scanrelid). Maybe we need to either fix
Append or create some enhanced version of Append which would also support
dynamic pruning.
Right, I think, Append might store the relid of the parent table as well as partitioning information at that level along-with the subplans.
Some more comments
Would it be better to declare PartitionDescData as
{
int nparts;
PartitionInfo *partinfo; /* array of partition information structures. */
}
The new syntax allows CREATE TABLE to be specified as partition of an already partitioned table. Is it possible to do the same for CREATE FOREIGN TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Apr 15, 2016 at 5:46 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> Retaining the partition hierarchy would help to push-down join across
> partition hierarchy effectively.

-1. You don't get to insert cruft into the final plan for the convenience of the optimizer. I think the AppendPath needs to be annotated with sufficient information to do whatever query planning optimizations we want, and some or all of that may need to carry over to the Append plan to allow run-time partition pruning. But I think that flattening nests of Appends is a good optimization and we should preserve it. If that makes the additional information that any given Append needs to carry a bit more complex, so be it.

I also think it's very good that Amit has kept the query planner unchanged in this initial patch. Let's leave that work to phase two. What I suggest we do when the time comes is invent new nodes RangePartitionMap, ListPartitionMap, HashPartitionMap. Each contains minimal metadata needed for tuple routing or planner transformation. For example, RangePartitionMap can contain an array of partition boundaries - represented as Datums - and an array of mappings, each a Node *. The associated value can be another PartitionMap object if there is subpartitioning in use, or an OID. This can be used both for matching up partitions for join pushdown, and also for fast tuple routing and runtime partition pruning.

> 2. The new syntax allows CREATE TABLE to be specified as partition of an
> already partitioned table. Is it possible to do the same for CREATE FOREIGN
> TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.

+1 for making CREATE FOREIGN TABLE support that also, and in version 1. And same for ATTACH PARTITION.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
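To make the proposal above a little more concrete, here is a rough, self-contained sketch of the kind of structure a RangePartitionMap could be. The field names, the use of plain ints in place of Datums and OIDs, and the linear scan over bounds are illustrative assumptions only; this is not the actual design and not PostgreSQL code.

/*
 * Rough sketch of the RangePartitionMap idea: an array of range boundaries
 * plus, for each range, either a leaf partition id or a nested map for
 * sub-partitioning.  ints stand in for Datums/OIDs.
 */
#include <stdio.h>

typedef struct RangePartitionMap RangePartitionMap;

typedef struct RangePartitionMapping
{
    int                 leaf_oid;   /* valid when submap == NULL */
    RangePartitionMap  *submap;     /* non-NULL when sub-partitioned */
} RangePartitionMapping;

struct RangePartitionMap
{
    int                     nbounds;    /* number of upper bounds */
    int                    *bounds;     /* sorted, exclusive upper bounds */
    RangePartitionMapping  *mapping;    /* nbounds entries */
};

/* Route a key to a leaf partition id, recursing through sub-partition maps. */
static int
route(const RangePartitionMap *map, int key)
{
    for (int i = 0; i < map->nbounds; i++)
    {
        if (key < map->bounds[i])
        {
            const RangePartitionMapping *m = &map->mapping[i];

            return m->submap ? route(m->submap, key) : m->leaf_oid;
        }
    }
    return -1;                          /* no partition covers this key */
}

int
main(void)
{
    int                     sub_bounds[] = {50, 100};
    RangePartitionMapping   sub_map[] = {{1001, NULL}, {1002, NULL}};
    RangePartitionMap       sub = {2, sub_bounds, sub_map};

    int                     top_bounds[] = {100, 200};
    RangePartitionMapping   top_map[] = {{0, &sub}, {2001, NULL}};
    RangePartitionMap       top = {2, top_bounds, top_map};

    printf("75  -> %d\n", route(&top, 75));    /* 1002 */
    printf("150 -> %d\n", route(&top, 150));   /* 2001 */
    return 0;
}

A real implementation would presumably binary-search the bounds array rather than scan it linearly; the linear loop is only to keep the sketch short.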
Hi Ashutosh,

On 2016/04/15 18:48, Ashutosh Bapat wrote:
> With the new set of patches, I am getting following warnings but no
> compilation failures. Patches apply smoothly.
> partition.c:1216:21: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]
> partition.c:1637:20: warning: variable ‘form’ set but not used [-Wunused-but-set-variable]

Ah, will fix.

> For creating partition the documentation seems to suggest the syntax
> create table t1_p1 partition of t1 for values {start {0} end {100} exclusive;
> but following syntax works
> create table t1_p1 partition of t1 for values start (0) end (100) exclusive;

Hmm, I see the following in the docs:

... and partition_bound_spec is:

FOR VALUES { list_spec | range_spec }

... list_spec in FOR VALUES is:

{ IN ( expression [, ...] ) }

range_spec in FOR VALUES is:

{ START (lower-bound) [ INCLUSIVE | EXCLUSIVE ] |
  END (upper-bound) [ INCLUSIVE | EXCLUSIVE ] |
  START (lower-bound) [ INCLUSIVE | EXCLUSIVE ] END (upper-bound) [ INCLUSIVE | EXCLUSIVE ] }

By the way, I see that you are always specifying "exclusive" for the end bound in your examples. The specification is optional, if that wasn't apparent. The default (ie, without explicit specification) is [start, end).

>> I got rid of all the optimizer changes in the new version (except a line
>> or two). I did that by switching to storing parent-child relationships in
>> pg_inherits so that all the existing optimizer code for inheritance sets
>> works unchanged. Also, the implementation detail that required putting
>> only leaf partitions in the append list is gone. Previously no storage
>> was allocated for partitioned tables (either the root or any of the
>> internal partitions), so it was being done that way.
>
> With this set of patches we are still flattening all the partitions.
> postgres=# create table t1 (a int, b int) partition by range(a);
> CREATE TABLE
> postgres=# create table t1_p1 partition of t1 for values start (0) end (100) exclusive partition by range(b);
> CREATE TABLE
> postgres=# create table t1_p1_p1 partition of t1_p1 for values start (0) end (100) exclusive;
> CREATE TABLE
> postgres=# explain verbose select * from t1;
> QUERY PLAN
> -------------------------------------------------------------------------
> Append (cost=0.00..32.60 rows=2262 width=8)
> -> Seq Scan on public.t1 (cost=0.00..0.00 rows=1 width=8)
> Output: t1.a, t1.b
> -> Seq Scan on public.t1_p1 (cost=0.00..0.00 rows=1 width=8)
> Output: t1_p1.a, t1_p1.b
> -> Seq Scan on public.t1_p1_p1 (cost=0.00..32.60 rows=2260 width=8)
> Output: t1_p1_p1.a, t1_p1_p1.b
> (7 rows)
> Retaining the partition hierarchy would help to push-down join across
> partition hierarchy effectively.

Yes, it's still a flattened append (basically what the optimizer currently creates for arbitrary inheritance trees). I didn't modify any of that yet.

>> Regarding 6, it seems to me that it's because Append does not have an
>> associated relid (like scan nodes have with scanrelid). Maybe we need to
>> either fix Append or create some enhanced version of Append which would
>> also support dynamic pruning.
>
> Right, I think, Append might store the relid of the parent table as well as
> partitioning information at that level along-with the subplans.

For the time being, I will leave this as yet unaddressed (I am thinking about what is reasonable to do for this also considering Robert's comment). Is that OK?

> Some more comments:
> Would it be better to declare PartitionDescData as
> {
>     int nparts;
>     PartitionInfo *partinfo; /* array of partition information structures. */
> }

I think that might be better. Will do it the way you suggest.

> The new syntax allows CREATE TABLE to be specified as partition of an
> already partitioned table. Is it possible to do the same for CREATE FOREIGN
> TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.

OK, I will address this in the next version. One question though: should foreign tables only be allowed to be leaf partitions (ie, no PARTITION BY clause in CREATE FOREIGN TABLE ... PARTITION OF)? By the way, ATTACH PARTITION only allows leaf partitions anyway.

Thanks a lot for the review!

Thanks,
Amit
Hi,

On 2016/04/15 18:46, Ashutosh Bapat wrote:
>
> 3. PartitionKeyData contains KeyTypeCollInfo, whose contents can be
> obtained by calling functions exprType, exprTypemod on partexprs. Why do we
> need to store that information as a separate member?

There was no KeyTypeCollInfo in the early days of the patch and then I found myself doing a lot of:

partexprs_item = list_head(key->partexprs);
for (attr in key->partattrs)
{
    if (attr->attnum != 0)
    {
        // simple column reference, get type from attr
    }
    else
    {
        // expression, get type using exprType, etc.
        partexprs_item = lnext(partexprs_item);
    }
}

That ended up being quite a few places (though I managed to reduce the number of places over time). So, I created this struct which is initialized when the partition key is built (on first open of the partitioned table).

Thanks,
Amit
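As an illustration of the caching idea discussed here, below is a toy, self-contained sketch: per-key-column type information is derived once when the key is built, so later code only indexes an array instead of repeating the attnum-versus-expression branch each time. The names, fields, and stand-in lookup functions are assumptions made for the example, not the patch's definitions.

/*
 * Toy sketch of "compute the per-column type info once at key-build time".
 * int stands in for Oid throughout.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct PartitionKey
{
    int     partnatts;
    int    *partattrs;      /* attnum per column, or 0 for an expression */
    int    *cached_typid;   /* filled once at key-build time */
} PartitionKey;

/* Stand-ins for catalog / exprType() lookups. */
static int attr_type(int attnum) { return 23; }   /* say, int4 */
static int expr_type(int colno)  { return 25; }   /* say, text */

/* Compute per-column type info once, so callers just index the array. */
static void
cache_key_types(PartitionKey *key)
{
    key->cached_typid = malloc(sizeof(int) * key->partnatts);
    for (int i = 0; i < key->partnatts; i++)
        key->cached_typid[i] = key->partattrs[i] != 0
            ? attr_type(key->partattrs[i])
            : expr_type(i);
}

int
main(void)
{
    int          attrs[] = {1, 0, 3};   /* column, expression, column */
    PartitionKey key = {3, attrs, NULL};

    cache_key_types(&key);
    for (int i = 0; i < key.partnatts; i++)
        printf("key column %d: type %d\n", i, key.cached_typid[i]);
    free(key.cached_typid);
    return 0;
}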
On Fri, Apr 15, 2016 at 10:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Apr 15, 2016 at 5:46 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Retaining the partition hierarchy would help to push-down join across
> partition hierarchy effectively.
-1. You don't get to insert cruft into the final plan for the
convenience of the optimizer. I think the AppendPath needs to be
annotated with sufficient information to do whatever query planning
optimizations we want, and some or all of that may need to carry over
to the Append plan to allow run-time partition pruning. But I think
that flattening nests of Appends is a good optimization and we should
preserve it. If that makes the additional information that any given
Append needs to carry a bit more complex, so be it.
I also think it's very good that Amit has kept the query planner
unchanged in this initial patch. Let's leave that work to phase two.
What I suggest we do when the time comes is invent new nodes
RangePartitionMap, ListPartitionMap, HashPartitionMap. Each contains
minimal metadata needed for tuple routing or planner transformation.
For example, RangePartitionMap can contain an array of partition
boundaries - represented as Datums - and an array of mappings, each a
Node *. The associated value can be another PartitionMap object if
there is subpartitioning in use, or an OID.
range table index instead of OID to make it easy to lookup the
This can be used both for
matching up partitions for join pushdown, and also for fast tuple
routing and runtime partition pruning.
I was thinking about join pushdown (in the partitioning hierarchy) of multiple tables which have a similar partitioning structure for a few upper levels but whose entire partitioning hierarchies do not match. Pushing the join down as far as possible into the partition hierarchy will help. We might end up with joins between a plain table and an Append relation at the leaf level. For such join push-down it looks like we will have to construct the corresponding Append hierarchy, push the joins down and then (maybe) collapse it to just collect the results of the joins at the leaf level. But preparing for that kind of optimization need not be part of this work.
> 2. The new syntax allows CREATE TABLE to be specified as partition of an
> already partitioned table. Is it possible to do the same for CREATE FOREIGN
> TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.
+1 for making CREATE FOREIGN TABLE support that also, and in version
1. And same for ATTACH PARTITION.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
>> Regarding 6, it seems to me that because Append does not have a associated
>> relid (like scan nodes have with scanrelid). Maybe we need to either fix
>> Append or create some enhanced version of Append which would also support
>> dynamic pruning.
>>
>
> Right, I think, Append might store the relid of the parent table as well as
> partitioning information at that level along-with the subplans.
For time being, I will leave this as yet unaddressed (I am thinking about
what is reasonable to do for this also considering Robert's comment). Is
that OK?
Right now EXPLAIN of select * from t1, where t1 is partitioned table shows
Append
-> Seq Scan on t1
-> Seq scan on partition 1
-> seq scan on partition 2
...
...
which shows t1 as well as all the partitions on the same level. Users might have accepted that for inheritance hierarchy but for partitioning that can be confusing, esp. when all the error messages and documentation indicate that t1 is an empty (shell?) table. Instead showing it like
Append for t1 -- (essentially show that this is Append for partitioned table t1, exact text might vary)
-> Seq scan on partition 1
-> ....
-> ....
would be more readable. Similarly, if we are going to collapse all of the Append hierarchy, it might get even more confusing. Listing all the intermediate partitioned tables as Seq Scans on them would be confusing for the reasons mentioned above, and not listing them might make the user wonder about the reasons for their disappearance. I am not sure what the solution is there.
> Some more comments
> Would it be better to declare PartitionDescData as
> {
> int nparts;
> PartitionInfo *partinfo; /* array of partition information structures. */
> }
I think that might be better. Will do it the way you suggest.
> The new syntax allows CREATE TABLE to be specified as partition of an
> already partitioned table. Is it possible to do the same for CREATE FOREIGN
> TABLE? Or that's material for v2? Similarly for ATTACH PARTITION.
OK, I will address this in the next version. One question though: should
foreign table be only allowed to be leaf partitions (ie, no PARTITION BY
clause in CREATE FOREIGN TABLE ... PARTITION OF)?
That seems a better way. Otherwise users might wonder whether we keep the partitions of a foreign table on the foreign server which won't be true. But then we allow foreign tables to have local tables as children in inheritance, so somebody from that background might find it incompatible. I think we shouldn't let the connection between partitioning and inheritance linger longer than necessary.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/04/15 18:46, Ashutosh Bapat wrote:
>
> 3. PartitionKeyData contains KeyTypeCollInfo, whose contents can be
> obtained by calling functions exprType, exprTypemod on partexprs. Why do we
> need to store that information as a separate member?
There was no KeyTypeCollInfo in early days of the patch and then I found
myself doing a lot of:
partexprs_item = list_head(key->partexprs);
for (attr in key->partattrs)
{
if (attr->attnum != 0)
{
// simple column reference, get type from attr
}
else
{
// expression, get type using exprType, etc.
partexprs_item = lnext(partexprs_item);
}
}
At least the two loops can be flattened into a single loop if we keep only an expressions list, with plain attributes being just Var nodes. exprType() etc. would then work seamlessly. A toy sketch of that alternative follows below.
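For illustration, here is a small, self-contained sketch of that single-list alternative: every key column is a node in one list (a Var-like node for a plain column, some other expression node otherwise), so a single exprType()-style accessor and a single loop suffice. All names below are assumptions made for the example, not actual PostgreSQL node definitions.

/*
 * Toy sketch: one list of nodes for all key columns, and one accessor that
 * works for both plain columns and expressions.
 */
#include <stdio.h>

typedef enum { T_Var, T_FuncExpr } NodeTag;

typedef struct Node
{
    NodeTag         tag;
    int             typid;      /* result type; int stands in for Oid */
    struct Node    *next;       /* next key column */
} Node;

/* One accessor works for Var and non-Var alike, like exprType() would. */
static int
expr_type(const Node *node)
{
    return node->typid;
}

int
main(void)
{
    Node    f = {T_FuncExpr, 25, NULL};     /* e.g. lower(left(name, 2)) */
    Node    v = {T_Var, 23, &f};            /* e.g. a plain int column   */

    for (const Node *col = &v; col != NULL; col = col->next)
        printf("%s key column: type %d\n",
               col->tag == T_Var ? "plain" : "expression", expr_type(col));
    return 0;
}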
That ended up being quite a few places (though I managed to reduce the
number of places over time). So, I created this struct which is
initialized when partition key is built (on first open of the partitioned
table).
Hmm, I am just afraid that we might end up with some code using cached information and some using exprType, exprTypmod etc.
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/04/18 15:38, Ashutosh Bapat wrote:
>> There was no KeyTypeCollInfo in the early days of the patch and then I
>> found myself doing a lot of:
>>
>> partexprs_item = list_head(key->partexprs);
>> for (attr in key->partattrs)
>> {
>>     if (attr->attnum != 0)
>>     {
>>         // simple column reference, get type from attr
>>     }
>>     else
>>     {
>>         // expression, get type using exprType, etc.
>>         partexprs_item = lnext(partexprs_item);
>>     }
>> }
>
> At least the two loops can be flattened into a single loop if we keep only
> an expressions list, with plain attributes being just Var nodes. exprType()
> etc. would then work seamlessly.

I didn't say anything about your suggestion to use a Node * list as a representation for the cached partition key information. IIUC, you mean that instead of the AttrNumber[partnatts] array with a non-zero attnum for a named column slot and 0 for an expressional column slot, we create a Node * list with Var nodes for simple column references and Expr nodes for expressions.

I would mention that the same information is also being used in contexts where having simple attnums may be better (for example, when extracting the key of a tuple slot during tuple routing). Moreover, this is cached information and I thought it may be better to follow the format that other similar information uses (index keys and such). Furthermore, looking at the qual matching code for indexes and the recently introduced foreign key optimization, it seems we will want to use a similar representation within the optimizer for partition keys. IndexOptInfo has int ncolumns and int * indexkeys, and then match_index_to_operand() compares index key attnums with the varattno of Vars in the qual. It's perhaps speculative at the moment because there is not much code wanting to use it yet other than partition DDL and tuple routing, and the cached info seems to work as-is for the latter.

>> That ended up being quite a few places (though I managed to reduce the
>> number of places over time). So, I created this struct which is
>> initialized when the partition key is built (on first open of the
>> partitioned table).
>
> Hmm, I am just afraid that we might end up with some code using cached
> information and some using exprType, exprTypmod etc.

Well, exprType() etc. are used for partition keys in no more than a few places; all the places that need this information always use the cached values. Mostly partitioning DDL stuff so far. Tuple routing considers the collation of individual key columns when comparing an input value with partition bounds. Am I missing something?

Thanks,
Amit
On Mon, Apr 18, 2016 at 1:23 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/04/18 15:38, Ashutosh Bapat wrote:
>> There was no KeyTypeCollInfo in early days of the patch and then I found
>> myself doing a lot of:
>>
>> partexprs_item = list_head(key->partexprs);
>> for (attr in key->partattrs)
>> {
>> if (attr->attnum != 0)
>> {
>> // simple column reference, get type from attr
>> }
>> else
>> {
>> // expression, get type using exprType, etc.
>> partexprs_item = lnext(partexprs_item);
>> }
>> }
>>
>
> At least the two loops can be flattened to a single loop if we keep only
> expressions list with attributes being just Var nodes. exprType() etc.
> would then work seemlessly.
I didn't say anything about your suggestion to use a Node * list as a
representation for the cached partition key information. IIUC, you mean
instead of the AttrNumber[partnatts] array with non-zero attnum for a
named column slot and 0 for a expressional column slot, create a Node *
list with Var nodes for simple column references and Expr nodes for
expressions.
I would mention that the same information is also being used in contexts
where having simple attnums may be better (for example, when extracting
key of a tuple slot during tuple routing). Moreover, this is cached
information and I thought it may be better to follow the format that other
similar information uses (index key and such). Furthermore, looking at
qual matching code for indexes and recently introduced foreign key
optimization, it seems we will want to use a similar representation within
optimizer for partition keys. IndexOptInfo has int ncolumns and int *
indexkeys and then match_index_to_operand() compares index key attnums
with varattno of vars in qual. It's perhaps speculative at the moment
because there is not much code wanting to use it yet other than partition
DDL and tuple-routing and cached info seems to work as-is for the latter.
Ok.
>> That ended up being quite a few places (though I managed to reduce the
>> number of places over time). So, I created this struct which is
>> initialized when partition key is built (on first open of the partitioned
>> table).
>>
>
> Hmm, I am just afraid that we might end up with some code using cached
> information and some using exprType, exprTypmod etc.
Well, you never use exprType(), etc. for partition keys in other than a
few places. All places that do always use the cached values. Mostly
partitioning DDL stuff so far. Tuple routing considers collation of
individual key columns when comparing input value with partition bounds.
I am not worried about the current code. But there will be a lot of code added after version 1. I am worried about that.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/04/18 15:33, Ashutosh Bapat wrote:
>> For the time being, I will leave this as yet unaddressed (I am thinking
>> about what is reasonable to do for this also considering Robert's
>> comment). Is that OK?
>
> Right now EXPLAIN of select * from t1, where t1 is partitioned table shows
> Append
> -> Seq Scan on t1
> -> Seq scan on partition 1
> -> seq scan on partition 2
> ...
> which shows t1 as well as all the partitions on the same level. Users might
> have accepted that for inheritance hierarchy but for partitioning that can
> be confusing, esp. when all the error messages and documentation indicate
> that t1 is an empty (shell?) table. Instead showing it like
> Append for t1 -- (essentially show that this is Append for partitioned
> table t1, exact text might vary)
> -> Seq scan on partition 1
> -> ....
> would be more readable. Similarly, if we are going to collapse all of the
> Append hierarchy, it might get even more confusing. Listing all the
> intermediate partitioned tables as Seq Scans on them would be confusing for
> the reasons mentioned above, and not listing them might make the user wonder
> about the reasons for their disappearance. I am not sure what the solution
> is there.

Yes, things remain confusing with Append plans for partitioned tables. Note that currently if an internal partition doesn't show up, none of its partitions do (given the way CHECK constraints are generated for each table in the partition tree).

>> OK, I will address this in the next version. One question though: should
>> foreign tables only be allowed to be leaf partitions (ie, no PARTITION BY
>> clause in CREATE FOREIGN TABLE ... PARTITION OF)?
>
> That seems a better way. Otherwise users might wonder whether we keep the
> partitions of a foreign table on the foreign server which won't be true.

Quite true. We can't assume anything other than what the FDW for a given foreign table lets us assume about what the remote table looks like. The existing FDW API allows us to treat a foreign table as nothing more than a regular table, at best.

> But then we allow foreign tables to have local tables as children in
> inheritance, so somebody from that background might find it incompatible. I
> think we shouldn't let the connection between partitioning and inheritance
> linger longer than necessary.

I, too, think quite a bit about that last sentence. But I have only considered DDL so far.

Thanks,
Amit
Hi Amit,

On 15.04.2016 07:35, Amit Langote wrote:
> Thanks a lot for the comments. The patch set changed quite a bit since
> the last version. Once the CF entry was marked returned with feedback
> on March 22, I held off sending the new version at all. Perhaps, it
> would have been OK. Anyway here it is, if you are interested. I will
> create an entry in CF 2016-09 for the same. Also, see below replies to
> your individual comments.

Thanks for your new patch! I've tried it and discovered some strange behavior for partitioning by composite key. Here is an example of my setup:

create table test(a int, b int) partition by range (a, b);
create table test_1 partition of test for values start (0, 0) end (100, 100);
create table test_2 partition of test for values start (100, 100) end (200, 200);
create table test_3 partition of test for values start (200, 200) end (300, 300);

It's alright so far. But if we try to insert a record in which attribute 'a' belongs to one partition and attribute 'b' belongs to another, then the record will be inserted into the first one:

insert into test(a, b) values (150, 50);

select tableoid::regclass, * from test;
 tableoid |  a  | b
----------+-----+----
 test_2   | 150 | 50
(1 row)

I think there should be an error because the value for 'b' violates the range constraint for test_2. Now if we query data from 'test' and add the filter b < 100, then the planner will exclude partitions 'test_2' (which actually contains the inserted row) and 'test_3' and return nothing:

select * from test where b < 100;
 a | b
---+---
(0 rows)

explain (costs off) select * from test where b < 100;
        QUERY PLAN
---------------------------
 Append
   ->  Seq Scan on test
         Filter: (b < 100)
   ->  Seq Scan on test_1
         Filter: (b < 100)
(5 rows)

--
Ildar Musin
i.musin@postgrespro.ru
On Tue, Apr 19, 2016 at 4:57 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:
On 15.04.2016 07:35, Amit Langote wrote:Thanks a lot for the comments. The patch set changed quite a bit since the last version. Once the CF entry was marked returned with feedback on March 22, I held off sending the new version at all. Perhaps, it would have been OK. Anyway here it is, if you are interested. I will create an entry in CF 2016-09 for the same. Also, see below replies to you individual comments.
Thanks for your new patch! I've tried it and discovered some strange behavior for partitioning by composite key. Here is an example of my setup:
create table test(a int, b int) partition by range (a, b);
create table test_1 partition of test for values start (0, 0) end (100, 100);
create table test_2 partition of test for values start (100, 100) end (200, 200);
create table test_3 partition of test for values start (200, 200) end (300, 300);
It's alright so far. But if we try to insert record in which attribute 'a' belongs to one partition and attribute 'b' belongs to another then record will be inserted in the first one:
insert into test(a, b) values (150, 50);
select tableoid::regclass, * from test;
tableoid | a | b
----------+-----+----
test_2 | 150 | 50
(1 row)
That's how composite keys work. The first subkey is checked; if it's equal, the second subkey is checked, and so on.
# SELECT (100, 100) < (150, 50), (150, 50) < (200, 200);
?column? | ?column?
----------+----------
t | t
(1 row)
Another question is that it might NOT be what users expect. From the syntax side it very much looks like defining box-like regions for the two keys, which could be a replacement for subpartitioning. But it isn't so.
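To illustrate the row-wise comparison described above, here is a small self-contained sketch showing why (150, 50) falls inside the bounds [(100, 100), (200, 200)) even though 50 lies outside [100, 200) when the columns are taken one by one. Plain ints stand in for Datums and the function names are illustrative only.

/*
 * Row-wise (lexicographic) comparison vs. per-column "box" comparison.
 */
#include <stdio.h>

/* Lexicographic comparison of two n-column keys: -1, 0 or +1. */
static int
cmp_key(const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (a[i] < b[i]) return -1;
        if (a[i] > b[i]) return 1;
    }
    return 0;
}

/* True if lower <= key < upper, all compared row-wise. */
static int
in_range(const int *key, const int *lower, const int *upper, int n)
{
    return cmp_key(key, lower, n) >= 0 && cmp_key(key, upper, n) < 0;
}

int
main(void)
{
    int key[]   = {150, 50};
    int lower[] = {100, 100};   /* bounds of test_2 in the example above */
    int upper[] = {200, 200};

    printf("row-wise:   %s\n",
           in_range(key, lower, upper, 2) ? "in test_2" : "not in test_2");
    /* A per-column ("box") reading would reject it, since 50 < 100. */
    printf("per-column: b = %d is %s [100, 200)\n",
           key[1], key[1] >= 100 && key[1] < 200 ? "inside" : "outside");
    return 0;
}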
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Ildar, Alexander,

On Tue, Apr 19, 2016 at 11:26 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> On Tue, Apr 19, 2016 at 4:57 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:
>>
>> Thanks for your new patch! I've tried it and discovered some strange
>> behavior for partitioning by composite key. Here is an example of my setup:
>>
>> create table test(a int, b int) partition by range (a, b);
>> create table test_1 partition of test for values start (0, 0) end (100, 100);
>> create table test_2 partition of test for values start (100, 100) end (200, 200);
>> create table test_3 partition of test for values start (200, 200) end (300, 300);
>>
>> It's alright so far. But if we try to insert a record in which attribute 'a'
>> belongs to one partition and attribute 'b' belongs to another, then the
>> record will be inserted into the first one:
>>
>> insert into test(a, b) values (150, 50);
>>
>> select tableoid::regclass, * from test;
>>  tableoid |  a  | b
>> ----------+-----+----
>>  test_2   | 150 | 50
>> (1 row)
>
> That's how composite keys work. The first subkey is checked; if it's equal,
> the second subkey is checked, and so on.
>
> # SELECT (100, 100) < (150, 50), (150, 50) < (200, 200);
>  ?column? | ?column?
> ----------+----------
>  t        | t
> (1 row)

Yes.

> Another question is that it might NOT be what users expect. From the syntax
> side it very much looks like defining box-like regions for the two keys,
> which could be a replacement for subpartitioning. But it isn't so.

I need to check why the query with the qual b < 100 behaves the way it does. Something's going wrong there with the constraints (partition predicates) that are being generated internally (as mentioned before, still driven by constraint exclusion using the constraints generated on-the-fly).

As for the composite range partition bounds in Ildar's example, it's as if the second value in the key never determines the fate of a row going into some partition, therefore no constraints should have been generated for column b of the key. I'm afraid that's not the case as per the latest patch. Will fix.

Thanks a lot for trying it out and the report.

Thanks,
Amit
On 2016/04/19 23:52, Amit Langote wrote:
> On Tue, Apr 19, 2016 at 11:26 PM, Alexander Korotkov
>> Another question is that it might NOT be what users expect. From the
>> syntax side it very much looks like defining box-like regions for the two
>> keys, which could be a replacement for subpartitioning. But it isn't so.
>
> I need to check why the query with the qual b < 100 behaves the way it
> does. Something's going wrong there with the constraints (partition
> predicates) that are being generated internally (as mentioned before,
> still driven by constraint exclusion using the constraints generated
> on-the-fly).
>
> As for the composite range partition bounds in Ildar's example, it's
> as if the second value in the key never determines the fate of a row
> going into some partition, therefore no constraints should have been
> generated for column b of the key. I'm afraid that's not the case as
> per the latest patch. Will fix.

The strange behavior that Ildar reported should have been fixed with the attached updated set of patches (v2):

create table test(a int, b int) partition by range (a, b);
create table test_1 partition of test for values start (0, 0) end (100, 100);
create table test_2 partition of test for values start (100, 100) end (200, 200);
create table test_3 partition of test for values start (200, 200) end (300, 300);
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE

insert into test(a, b) values (150, 50);
INSERT 0 1

select * from test where b < 100;
  a  | b
-----+----
 150 | 50
(1 row)

explain (costs off) select * from test where b < 100;
        QUERY PLAN
---------------------------
 Append
   ->  Seq Scan on test
         Filter: (b < 100)
   ->  Seq Scan on test_1
         Filter: (b < 100)
   ->  Seq Scan on test_2
         Filter: (b < 100)
   ->  Seq Scan on test_3
         Filter: (b < 100)
(9 rows)

Multi-column range partitioning seems a bit tricky as far as generating constraints on individual columns using a partition's lower and upper bounds (both composite values) is concerned. I mentally pictured something like the following example scenario:

create table test(a int, b int, c int) partition by range (a, b, c);
create table test_1 partition of test for values start (0, 0, 0) end (0, 2, 0);
create table test_2 partition of test for values start (0, 2, 0) end (0, 3, 0);
create table test_3 partition of test for values start (0, 3, 0) end (0, 4, 0);
create table test_4 partition of test for values start (0, 4, 0) end (1, 0, 0);
create table test_5 partition of test for values start (1, 0, 0) end (1, 2, 0);
create table test_6 partition of test for values start (1, 2, 0) end (1, 3, 0);
create table test_7 partition of test for values start (1, 3, 0) end (1, 4, 0);
create table test_8 partition of test for values start (1, 4, 0) end (2, 0, 0);

It is useful to think of the above as the sequence of ranges [000, 020), [020, 030), [030, 040), [040, 100), [100, 120), [120, 130), [130, 140), [140, 200) for purposes of finding the partition for a row.

Then the constraints generated internally for each partition would be:

test_1: a = 0 AND b >= 0 AND b <= 2
test_2: a = 0 AND b >= 2 AND b <= 3
test_3: a = 0 AND b >= 3 AND b <= 4
test_4: a >= 0 AND a <= 1
test_5: a = 1 AND b >= 0 AND b <= 2
test_6: a = 1 AND b >= 2 AND b <= 3
test_7: a = 1 AND b >= 3 AND b <= 4
test_8: a >= 1 AND a <= 2

I will try further to poke holes in my thinking about this. Please feel free to point out if you find any.

Thanks,
Amit
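As an illustration of the constraint-derivation scheme sketched in this mail (and only of the scheme as described here, not of the patch's actual code), the following small self-contained sketch emits a per-column constraint for a partition given its composite lower and upper bounds: equalities are emitted while the two bounds agree on a column, then a range on the first column where they differ, and the remaining columns are ignored. Plain ints stand in for Datums; all names are assumptions.

/*
 * Sketch: derive a per-column constraint string from composite range bounds.
 */
#include <stdio.h>

static void
print_partition_constraint(const char **cols, const int *lower,
                           const int *upper, int ncols)
{
    for (int i = 0; i < ncols; i++)
    {
        if (i > 0)
            printf(" AND ");
        if (lower[i] == upper[i])
            printf("%s = %d", cols[i], lower[i]);
        else
        {
            printf("%s >= %d AND %s <= %d",
                   cols[i], lower[i], cols[i], upper[i]);
            break;      /* later columns no longer constrain the row */
        }
    }
    printf("\n");
}

int
main(void)
{
    const char *cols[] = {"a", "b", "c"};
    int         lo1[] = {0, 0, 0}, hi1[] = {0, 2, 0};   /* test_1 above */
    int         lo4[] = {0, 4, 0}, hi4[] = {1, 0, 0};   /* test_4 above */

    print_partition_constraint(cols, lo1, hi1, 3);  /* a = 0 AND b >= 0 AND b <= 2 */
    print_partition_constraint(cols, lo4, hi4, 3);  /* a >= 0 AND a <= 1 */
    return 0;
}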
Attachment
Hi Amit,

On 20.04.2016 13:28, Amit Langote wrote:
> The strange behavior that Ildar reported should have been fixed with the
> attached updated set of patches (v2):
>
> create table test(a int, b int) partition by range (a, b);
> create table test_1 partition of test for values start (0, 0) end (100, 100);
> create table test_2 partition of test for values start (100, 100) end (200, 200);
> create table test_3 partition of test for values start (200, 200) end (300, 300);
>
> insert into test(a, b) values (150, 50);
> INSERT 0 1
>
> select * from test where b < 100;
>   a  | b
> -----+----
>  150 | 50
> (1 row)
>
> Multi-column range partitioning seems a bit tricky as far as generating
> constraints on individual columns using a partition's lower and upper
> bounds (both composite values) is concerned. [...]
>
> I will try further to poke holes in my thinking about this. Please feel
> free to point out if you find any.
>
> Thanks,
> Amit

Thanks for the clarification! I tried the updated patch, and now it works correctly.

I encountered another problem that concerns expressions as the partitioning key. Probably there is still some work in progress. But if it helps, here is my case:

create table inh(a int, b int) partition by range ((a+b));
create table inh_1 partition of inh for values start (0) end (10);
create table inh_2 partition of inh for values start (10) end (20);

Then if we run any SELECT query it crashes postgres:

select * from inh;

The crash occurs in get_check_expr_from_partbound(). It seems that the function is not yet expecting an expression key and is designed to handle only simple attribute keys. Backtrace:

#0  strlen () at ../sysdeps/x86_64/strlen.S:106
#1  0x00000000004add8a in hashname (fcinfo=0x7ffdbdb9c760) at hashfunc.c:145
#2  0x000000000099cc08 in DirectFunctionCall1Coll (func=0x4add66 <hashname>, collation=0, arg1=0) at fmgr.c:1027
#3  0x00000000009724dd in CatalogCacheComputeHashValue (cache=0x26590b0, nkeys=2, cur_skey=0x7ffdbdb9cbf0) at catcache.c:207
#4  0x0000000000974979 in SearchCatCache (cache=0x26590b0, v1=32807, v2=0, v3=0, v4=0) at catcache.c:1151
#5  0x0000000000988e35 in SearchSysCache (cacheId=6, key1=32807, key2=0, key3=0, key4=0) at syscache.c:1006
#6  0x0000000000988fe3 in SearchSysCacheAttName (relid=32807, attname=0x0) at syscache.c:1106
#7  0x000000000098a744 in get_attnum (relid=32807, attname=0x0) at lsyscache.c:825
#8  0x000000000056afd2 in get_check_expr_from_partbound (rel=0x7f868601ca20, parent=0x7f868601b770, bound=0x26e6ac8) at partition.c:1427
#9  0x000000000056bc9e in generate_partition_check_expr (rel=0x7f868601ca20) at partition.c:1788
#10 0x000000000056bb5f in RelationGetPartitionCheckExpr (rel=0x7f868601ca20) at partition.c:1746
#11 0x0000000000782b5f in get_relation_constraints (root=0x268f1b8, relationObjectId=32807, rel=0x26e5cd8, include_notnull=1 '\001') at plancat.c:1209
#12 0x0000000000782d74 in relation_excluded_by_constraints (root=0x268f1b8, rel=0x26e5cd8, rte=0x268ebf0) at plancat.c:1302
#13 0x000000000072a18d in set_append_rel_size (root=0x268f1b8, rel=0x26e5690, rti=1, rte=0x268ea80) at allpaths.c:947
...

--
Ildar Musin
i.musin@postgrespro.ru
Hi Ildar, On Wed, Apr 20, 2016 at 11:46 PM, Ildar Musin <i.musin@postgrespro.ru> wrote: > Thanks for clarification! I tried the updated patch, now it works correctly. Great, thanks! > I encountered another problem that concerns expressions as partitioning key. > Probably there is still some work in progress. But if it will help here is > my case: > > create table inh(a int, b int) partition by range ((a+b)); > create table inh_1 partition of inh for values start (0) end (10); > create table inh_2 partition of inh for values start (10) end (20); > > Then if we run any SELECT query it crashes postgres: > > select * from inh; > > Crash occurs in get_check_expr_from_partbound(). It seems that function is > not yet expecting an expression key and designed to handle only simple > attributes keys. Backtrace: You're right, silly mistake. :-( Will fix Thanks, Amit
Hi Ildar, On 2016/04/21 1:06, Amit Langote wrote: > On Wed, Apr 20, 2016 at 11:46 PM, Ildar Musin <i.musin@postgrespro.ru> wrote: >> Crash occurs in get_check_expr_from_partbound(). It seems that function is >> not yet expecting an expression key and designed to handle only simple >> attributes keys. Backtrace: > > You're right, silly mistake. :-( > > Will fix Attached updated version fixes this. I'll take time to send the next version but I'd very much appreciate it if you keep reporting anything that doesn't look/work right like you did so far. Thanks, Amit
Attachment
On 2016-04-15 04:35, Amit Langote wrote: A quick test with: > 0001-Add-syntax-to-specify-partition-key-v3.patch > 0002-Infrastructure-for-creation-of-partitioned-tables-v3.patch > 0003-Add-syntax-to-create-partitions-v3.patch > 0004-Infrastructure-for-partition-metadata-storage-and-ma-v3.patch > 0005-Introduce-tuple-routing-for-partitioned-tables-v3.patch patches apply, build and make check ok. There is somwthing wrong with indexes on child tables (and only with higher rowcounts). Surely the below code should give 6 rows; it actually does return 6 rows without the indexes. With indexes it returns 0 rows. (but when doing the same test with low rowcounts, things are OK.) thanks, Erik Rijkers (Linux Centos 6.6) --------------------------------------- create table inh(a int, b int) partition by range ((a+b)); create table inh_1 partition of inh for values start ( 0) end ( 10000); create table inh_2 partition of inh for values start ( 10000) end ( 20000); create table inh_3 partition of inh for values start ( 20000) end ( 100000); create index inh_1_a_idx on inh_1 (a); create index inh_2_a_idx on inh_2 (a); create index inh_3_a_idx on inh_3 (a); insert into inh select i, i as j from generate_series(1, 10000) as f(i); analyze inh_1; analyze inh_2; analyze inh_3; select 'inh' , count(*) from inh union all select 'inh_1', count(*) from inh_1 union all select 'inh_2', count(*) from inh_2 union all select 'inh_3', count(*) from inh_3 ; explain analyze select * from inh where a between 10110 and 10115; --------------------------------------- # output : create table inh(a int, b int) partition by range ((a+b)); create table inh_1 partition of inh for values start ( 0) end ( 10000); create index inh_1_a_idx on inh_1 (a); create table inh_2 partition of inh for values start ( 10000) end ( 20000); create index inh_2_a_idx on inh_2 (a); create table inh_3 partition of inh for values start ( 20000) end ( 100000); create index inh_3_a_idx on inh_3 (a); insert into inh select i, i as j from generate_series(1, 10000) as f(i); analyze inh_1; analyze inh_2; analyze inh_3; select 'inh' , count(*) from inh union all select 'inh_1', count(*) from inh_1 union all select 'inh_2', count(*) from inh_2 union all select 'inh_3', count(*) from inh_3 ; ?column? | count ----------+------- inh | 10000 inh_1 | 4999 inh_2 | 5000 inh_3 | 1 (4 rows) explain analyze select * from inh where a between 10110 and 10115; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Append (cost=0.00..17.37 rows=4 width=8) (actual time=0.023..0.023 rows=0 loops=1) -> Seq Scan on inh (cost=0.00..0.00 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1) Filter: ((a >= 10110) AND (a <= 10115)) -> Index Scan using inh_1_a_idx on inh_1 (cost=0.16..8.18 rows=1 width=8) (actual time=0.007..0.007 rows=0 loops=1) Index Cond: ((a >= 10110) AND (a <= 10115)) -> Index Scan usinginh_2_a_idx on inh_2 (cost=0.16..8.18 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=1) Index Cond: ((a >= 10110) AND (a <= 10115)) -> Seq Scan oninh_3 (cost=0.00..1.01 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=1) Filter: ((a >= 10110) AND (a <= 10115)) Rows Removed by Filter: 1 Planningtime: 0.858 ms Execution time: 0.093 ms (12 rows)
Hi Amit,
I am trying multi-column/expression partitions.
create table t1_multi_col (a int, b int) partition by range (a, b);
create table t1_mc_p1 partition of t1_multi_col for values start (1, 200) end (100, 300);
create table t1_mc_p2 partition of t1_multi_col for values start (200, 1) end (300, 100);
insert into t1_multi_col values (1, 250);
insert into t1_multi_col values (250, 1);
insert into t1_multi_col values (100, 100);
select tableoid::regclass, * from t1_multi_col;
 tableoid |  a  |  b
----------+-----+-----
 t1_mc_p1 |   1 | 250
 t1_mc_p1 | 100 | 100
 t1_mc_p2 | 250 |   1
The row (100, 100) landed in t1_mc_p1, which has partition bounds (1, 200) and (100, 300) and so should not accept a row with b = 100. It looks like the binary search got confused with the reversed order of ranges (should that be allowed?)
The semantics of multiple columns for ranges (and maybe lists as well) look confusing. The current scheme doesn't allow an overlapping range for one of the partitioning key columns even if the combined range is non-overlapping.
create table t1_multi_col (a int, b int) partition by range (a, b);
create table t1_mc_p1 partition of t1_multi_col for values start (1, 100) end (100, 200);
create table t1_mc_p2 partition of t1_multi_col for values start (1, 200) end (100, 300);
ERROR: new partition's range overlaps with that of partition "t1_mc_p1" of "t1_multi_col"
HINT: Please specify a range that does not overlap with any existing partition's range.
create table t1_mc_p2 partition of t1_multi_col for values start (1, 300) end (100, 400);
ERROR: new partition's range overlaps with that of partition "t1_mc_p1" of "t1_multi_col"
HINT: Please specify a range that does not overlap with any existing partition's range.
That would be better realised using subpartitioning on b, as sketched below. The question is, if one column's value is enough to identify the partition (since partitions cannot contain overlapping values for that column), why do we need multiple columns/expressions as partition keys? IIUC, all the other column does is disallow a certain range of values for that column, which could be done better with a CHECK constraint. It looks like Oracle considers the combined range and not just one column.
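A sketch of the subpartitioned layout I have in mind, reusing the same columns (the partition names and bounds here are only illustrative):

create table t1_sub (a int, b int) partition by range (a);
create table t1_sub_a1 partition of t1_sub for values start (1) end (100) partition by range (b);
create table t1_sub_a1_b1 partition of t1_sub_a1 for values start (100) end (200);
create table t1_sub_a1_b2 partition of t1_sub_a1 for values start (200) end (300);

Here each combination of ranges for a and b gets its own leaf partition, without relying on composite-value comparison.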
On Thu, Apr 21, 2016 at 7:35 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi Ildar,
On 2016/04/21 1:06, Amit Langote wrote:
> On Wed, Apr 20, 2016 at 11:46 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:
>> Crash occurs in get_check_expr_from_partbound(). It seems that function is
>> not yet expecting an expression key and designed to handle only simple
>> attributes keys. Backtrace:
>
> You're right, silly mistake. :-(
>
> Will fix
Attached updated version fixes this. I'll take time to send the next
version but I'd very much appreciate it if you keep reporting anything
that doesn't look/work right like you did so far.
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Hi Ashutosh, On 2016/05/09 20:21, Ashutosh Bapat wrote: > Hi Amit, > I am trying multi-column/expression partitions. Thanks for the tests. > create table t1_multi_col (a int, b int) partition by range (a, b); > create table t1_mc_p1 partition of t1_multi_col for values start (1, 200) > end (100, 300); > create table t1_mc_p2 partition of t1_multi_col for values start (200, 1) > end (300, 100); > insert into t1_multi_col values (1, 250); > insert into t1_multi_col values (250, 1); > insert into t1_multi_col values (100, 100); > select tableoid::regclass, * from t1_multi_col; > tableoid | a | b > ----------+-----+----- > t1_mc_p1 | 1 | 250 > t1_mc_p1 | 100 | 100 > t1_mc_p2 | 250 | 1 > The row (100, 100) landed in t1_mc_p1 which has partition bounds as (1, > 200) and (100, 300) which should not accept a row with b = 100. It looks > like the binary search got confused with the reversed order of ranges > (should that be allowed?) It's useful to think of multi-column key as defining ranges of composite values (tuples) instead of thinking in terms of ranges of values of individual columns. That's how a row's partition key is compared against individual partition bounds until a suitable partition is found (with binary search that is), which uses record comparison logic as shown below: postgres=# select (1, 200) <= (100, 100) AND (100, 100) < (100, 300);?column? ----------t (1 row) Which means the row (100, 100) belongs in the partition with the start bound (1, 200) and the end bound (100, 300). Just like in composite value case, comparison stops at some leading column that returns != 0 result. So, in comparison (1, 200) <= (100, 100), the second column plays no role. > Symantec of multiple columns for ranges (may be list as well) looks Note that list partition key does not support multiple columns. > confusing. The current scheme doesn't allow overlapping range for one of > the partitioning keys even if the combined range is non-overlapping. > create table t1_multi_col (a int, b int) partition by range (a, b); > create table t1_mc_p1 partition of t1_multi_col for values start (1, 100) > end (100, 200); > create table t1_mc_p2 partition of t1_multi_col for values start (1, 200) > end (100, 300); > ERROR: new partition's range overlaps with that of partition "t1_mc_p1" of > "t1_multi_col" > HINT: Please specify a range that does not overlap with any existing > partition's range. > create table t1_mc_p2 partition of t1_multi_col for values start (1, 300) > end (100, 400); > ERROR: new partition's range overlaps with that of partition "t1_mc_p1" of > "t1_multi_col" > HINT: Please specify a range that does not overlap with any existing > partition's range. Ranges [ (1, 100), (100, 200) ) and [ (1, 200), (100, 300) ) do overlap: postgres=# select (1, 100) <= (1, 200) AND (1, 200) < (100, 200);?column? ----------t (1 row) That is, (1, 200) is both the start element of the 2nd partition's range and is contained in the first partition's range as illustrated above. > That should be better realised using subpartitioning on b. The question is, > if one column's value is enough to identify partition (since they can not > contain overlapping values for that column), why do we need mutliple > columns/expressions as partition keys? IIUC, all the other column does is > to disallow certain range of values for that column, which can better be > done by a CHECK constraint. It looks like Oracle looks at combined range > and not just one column. 
A more familiar example I have seen around the web illustrating multi-column range partitioning is for something like (year, month, day) triple. Consider the following example: create table parted(year int, month int, day int) partition by range (year, month, day); create table part201605week1 partition of parted for values start (2016, 5, 1) end (2016, 5, 8); create table part201605week2 partition of parted for values start (2016, 5, 8) end (2016, 5, 15); create table part201605week3 partition of parted for values start (2016, 5, 15) end (2016, 5, 22); create table part201605week4 partition of parted for values start (2016, 5, 22) end (2016, 5, 29); create table part201605week5 partition of parted for values start (2016, 5, 29) end (2016, 5, 31) inclusive; create table part201606week1 partition of parted for values start (2016, 6, 1) end (2016, 6, 8); create table part201606week2 partition of parted for values start (2016, 6, 8) end (2016, 6, 15); create table part201606week3 partition of parted for values start (2016, 6, 15) end (2016, 6, 22); create table part201606week4 partition of parted for values start (2016, 6, 2) end (2016, 6, 29); create table part201606week4 partition of parted for values start (2016, 6, 22) end (2016, 6, 29); create table part201606week5 partition of parted for values start (2016, 6, 29) end (2016, 6, 30) inclusive; explain (costs off) select * from parted where day between 4 and 10; QUERY PLAN ----------------------------------------------Append -> Seq Scan on parted Filter: ((day >= 4) AND (day <= 10)) -> Seq Scan on part201605week1 Filter: ((day >= 4) AND (day <= 10)) -> Seq Scan on part201605week2 Filter: ((day >= 4) AND (day <= 10)) -> Seq Scan on part201606week1 Filter: ((day >= 4) AND (day <= 10)) -> SeqScan on part201606week2 Filter: ((day >= 4) AND (day <= 10)) (11 rows) explain (costs off) select * from parted where day between 4 and 10 and month = 5; QUERY PLAN --------------------------------------------------------------Append -> Seq Scan on parted Filter: ((day >= 4) AND(day <= 10) AND (month = 5)) -> Seq Scan on part201605week1 Filter: ((day >= 4) AND (day <= 10) AND (month =5)) -> Seq Scan on part201605week2 Filter: ((day >= 4) AND (day <= 10) AND (month = 5)) (7 rows) explain (costs off) select * from parted where month = 6; QUERY PLAN -----------------------------------Append -> Seq Scan on parted Filter: (month = 6) -> Seq Scan on part201606week1 Filter: (month = 6) -> Seq Scan on part201606week2 Filter: (month = 6) -> Seq Scan on part201606week3 Filter: (month = 6) -> Seq Scan on part201606week4 Filter: (month = 6) -> Seq Scan on part201606week5 Filter: (month = 6) (13 rows) Thanks, Amit
Hi Erik, On 2016/04/26 17:46, Erik Rijkers wrote: > On 2016-04-15 04:35, Amit Langote wrote: > > A quick test with: > >> 0001-Add-syntax-to-specify-partition-key-v3.patch >> 0002-Infrastructure-for-creation-of-partitioned-tables-v3.patch >> 0003-Add-syntax-to-create-partitions-v3.patch >> 0004-Infrastructure-for-partition-metadata-storage-and-ma-v3.patch >> 0005-Introduce-tuple-routing-for-partitioned-tables-v3.patch > > patches apply, build and make check ok. Thanks for testing! > There is somwthing wrong with indexes on child tables (and only with > higher rowcounts). There was an oversight in patch 0005 that caused partition indexes to not be opened and tuples inserted into. Attached new version should have fixed it. > Surely the below code should give 6 rows; it actually does return 6 rows > without the indexes. > With indexes it returns 0 rows. > > (but when doing the same test with low rowcounts, things are OK.) ... > --------------------------------------- > create table inh(a int, b int) partition by range ((a+b)); > create table inh_1 partition of inh for values start ( 0) end ( 10000); > create table inh_2 partition of inh for values start ( 10000) end ( 20000); > create table inh_3 partition of inh for values start ( 20000) end ( 100000); > > create index inh_1_a_idx on inh_1 (a); > create index inh_2_a_idx on inh_2 (a); > create index inh_3_a_idx on inh_3 (a); > > insert into inh select i, i as j from generate_series(1, 10000) as f(i); > > analyze inh_1; > analyze inh_2; > analyze inh_3; > > select 'inh' , count(*) from inh > union all select 'inh_1', count(*) from inh_1 > union all select 'inh_2', count(*) from inh_2 > union all select 'inh_3', count(*) from inh_3 > ; > > explain analyze select * from inh where a between 10110 and 10115; Hmm, this last query should return 0 rows because: select max(a) from inh; max =------ 10000 (1 row) Did you by any chance mean to write the following: explain analyze select * from inh where a + b between 10110 and 10115; In which case: explain analyze select * from inh where a + b between 10110 and 10115; QUERY PLAN =------------------------------------------------------------------------------------------------------- Append (cost=0.00..123.00 rows=26 width=8) (actual time=0.119..6.407 rows=3 loops=1) -> Seq Scan on inh (cost=0.00..0.00 rows=1 width=8) (actual time=0.015..0.015 rows=0 loops=1) Filter: (((a + b) >= 10110) AND ((a + b) <= 10115)) -> Seq Scan on inh_2 (cost=0.00..123.00 rows=25 width=8) (actual time=0.076..6.198 rows=3 loops=1) Filter: (((a + b) >= 10110) AND ((a + b) <= 10115)) Rows Removed by Filter: 4997 Planning time: 0.521 ms Execution time: 6.572 ms (8 rows) select * from inh where a + b between 10110 and 10115; a | b =-----+------ 5055 | 5055 5056 | 5056 5057 | 5057 (3 rows) Now that doesn't use index for the obvious reason (mismatched key). 
So, let's try one which will: explain analyze select * from inh where a = 4567; QUERY PLAN =------------------------------------------------------------------------------------------------------------------------ Append (cost=0.00..17.61 rows=4 width=8) (actual time=0.189..0.293 rows=1 loops=1) -> Seq Scan on inh (cost=0.00..0.00 rows=1 width=8) (actual time=0.016..0.016 rows=0 loops=1) Filter: (a = 4567) -> Index Scan using inh_1_a_idx on inh_1 (cost=0.28..8.30 rows=1 width=8) (actual time=0.043..0.056 rows=1 loops=1) Index Cond: (a = 4567) -> Index Scan using inh_2_a_idx on inh_2 (cost=0.28..8.30 rows=1 width=8) (actual time=0.024..0.024 rows=0 loops=1) Index Cond: (a = 4567) -> Seq Scan on inh_3 (cost=0.00..1.01 rows=1 width=8) (actual time=0.029..0.029 rows=0 loops=1) Filter: (a = 4567) Rows Removed by Filter: 1 Planning time: 0.589 ms Execution time: 0.433 ms select * from inh where a = 4567; a | b =-----+------ 4567 | 4567 (1 row) No pruning occurs this time for the obvious reason (mismatched key). Does that help clarify? Thanks, Amit
Attachment
Hello Amit, In the example >create table part201606week4 partition of parted >for values start (2016, 6, 2) end (2016, 6, 29); seems to be a typo regards Sameer -- View this message in context: http://postgresql.nabble.com/Declarative-partitioning-tp5862462p5903204.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Hi, On 2016/05/12 17:42, Sameer Thakur-2 wrote: > Hello Amit, > In the example >> create table part201606week4 partition of parted >> for values start (2016, 6, 2) end (2016, 6, 29); > > seems to be a typo Oops, there indeed is. create table part201606week4 partition of parted for values start (2016, 6, 2) end (2016, 6, 29); create table part201606week4 partition of parted for values start (2016, 6, 22) end (2016, 6, 29); While copy-pasting part201606week4's definition from a previous line, ended up pasting twice and then edited the second one to the correct definition. Thanks, Amit
Hi Amit, I'm running some experiments based on your infrastructure, trying to optimize SELECT queries. At some point I need to get the PartitionDesc for a relation, and to do it I'm using the RelationGetPartitionDesc() function. The problem is that this function copies relcache data, which can be quite slow for large numbers (thousands) of partitions. The comment on the function says that we cannot use the relation->rd_partdesc pointer into the relcache because of the possibility of relcache invalidation. Could you please tell me whether relcache invalidation can occur during a SELECT/UPDATE/DELETE query? Thanks! -- Ildar Musin i.musin@postgrespro.ru
Hi Ildar, On 2016/05/16 22:12, Ildar Musin wrote: > Hi Amit, > > I'm running some experiments based on your infrastructure trying to > optimize SELECT queries. At some point I need to get PartitionDesc for > relation and to do it I'm using RelationGetPartitionDesc() function. > Problem is that this function copies relcache data and it can be quite > slow for large amounts (thousands) of partitions. The comment to the > function says that we cannot use relation->rd_partdesc pointer to relcache > because of possibility of relcache invalidation. Could you please tell is > it possible that relcache invalidation occurs during SELECT/UPDATE/DELETE > query? Hmm, I think invalidation would not occur mid-query since it would have acquired a lock on the table. So the copying may be unnecessary though I may be wrong. I will study other similar code and see if that is so. Thanks, Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2016/05/16 22:12, Ildar Musin wrote: >> Could you please tell is >> it possible that relcache invalidation occurs during SELECT/UPDATE/DELETE >> query? > Hmm, I think invalidation would not occur mid-query since it would have > acquired a lock on the table. This is incorrect: invalidation can occur anyway (for example, if autoanalyze updates the relation's statistics). If you are holding a lock, you can expect that the relation's schema will not change more than your lock would allow --- but a cache flush and rebuild could happen underneath you, so keeping a pointer to any subsidiary relcache data structure is very dangerous. The two ways that we've dealt with this type of hazard are to copy data out of the relcache before using it; or to give the relcache the responsibility of not moving a particular portion of data if it did not change. From memory, the latter applies to the tuple descriptor and trigger data, but we've done most other things the first way. regards, tom lane
On 2016/05/18 2:22, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> On 2016/05/16 22:12, Ildar Musin wrote: >>> Could you please tell is >>> it possible that relcache invalidation occurs during SELECT/UPDATE/DELETE >>> query? > >> Hmm, I think invalidation would not occur mid-query since it would have >> acquired a lock on the table. > > This is incorrect: invalidation can occur anyway (for example, if > autoanalyze updates the relation's statistics). If you are holding > a lock, you can expect that the relation's schema will not change more > than your lock would allow --- but a cache flush and rebuild could happen > underneath you, so keeping a pointer to any subsidiary relcache data > structure is very dangerous. I see. Thanks for clarifying. > The two ways that we've dealt with this type of hazard are to copy data > out of the relcache before using it; or to give the relcache the > responsibility of not moving a particular portion of data if it did not > change. From memory, the latter applies to the tuple descriptor and > trigger data, but we've done most other things the first way. It seems that tuple descriptor is reference-counted; however trigger data is copied. The former seems to have been done on performance grounds (I found 06e10abc). So for a performance-sensitive relcache data structure, refcounting is the way to go (although done quite rarely)? In this particular case, it is a "partition descriptor" that could get big for a partitioned table with partitions in hundreds or thousands. Thanks, Amit
Hi Amit and all,
Here is an experimental patch that optimizes planning time for range partitioned tables (it could be considered a "proof of concept"). The patch should be applied on top of Amit's declarative partitioning patch. It handles only a special (though commonly used) case where the partitioning key consists of just a single attribute and doesn't contain expressions.
The main idea is the following:
* we look for clauses like 'VAR OP CONST' (where VAR is a partitioning key attribute and OP is a comparison operator);
* using binary search, we find the partition (X) whose range contains the CONST value;
* based on the OP operator, we determine which other partitions are also covered by the clause. The possible cases are:
1. If OP is '<' or '<=' then we need X and the partitions to its left
2. If OP is '>' or '>=' then we need X and the partitions to its right
3. If OP is '=' then we need only partition X
(for '<' and '>' we also check whether the CONST value equals X's lower or upper boundary, respectively, and if so exclude X).
For boolean expressions we evaluate the left and right sides according to the algorithm above and then, based on the boolean operator, take the intersection (for AND) or union (for OR) of the resulting partition sets (a toy SQL sketch of this lookup is below).
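To make the idea concrete, here is a toy SQL sketch of the same lookup over a made-up bounds table (this is just an illustration, not the patch code; the bounds mirror the abc_N partitions used in the benchmark below):

create table part_bounds (partno int, lo int, hi int);
insert into part_bounds select i, (i - 1) * 1000, i * 1000 from generate_series(1, 100) i;

-- the partition X that fits CONST = 4567 (the binary-search step):
select partno from part_bounds where lo <= 4567 and 4567 < hi;

-- partitions covered by a clause like "a < 4567" (X plus everything to its left):
select partno from part_bounds where lo < 4567 order by partno;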
I ran some benchmarks on:
1. original constraint exclusion mechanism,
2. optimized version (this patch) and
3. optimized version using relation->rd_partdesc pointer instead of RelationGetPartitionDesc() function (see previous discussion).
Initial conditions:
CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY RANGE (a);
CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (0) END (1000);
CREATE TABLE abc_2 PARTITION OF abc FOR VALUES START (1000) END (2000);
...
etc
INSERT INTO %s (a) SELECT generate_series(0, <partitions_count> * 1000);
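(The partitions themselves can be generated with a DO block along these lines; the loop bound is simply whatever partition count is being tested:)

DO $$
BEGIN
  FOR i IN 1..100 LOOP  -- or however many partitions are being tested
    EXECUTE format(
      'CREATE TABLE abc_%s PARTITION OF abc FOR VALUES START (%s) END (%s)',
      i, (i - 1) * 1000, i * 1000);
  END LOOP;
END
$$;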
pgbench scripts: https://gist.github.com/zilder/872e634a8eeb405bd045465fc9527e53 (where :partitions is the number of partitions).
The first script tests fetching a single row from the partitioned table. Results (tps):
 # of partitions | constraint excl. | optimized | optimized (using pointer)
-----------------+------------------+-----------+---------------------------
             100 |              658 |      2906 |                      3079
            1000 |               45 |      2174 |                      3021
            2000 |               22 |      1667 |                      2919
The second script tests fetching all data from a single partition. Results (tps):
 # of partitions | constraint excl. | optimized | optimized (using pointer)
-----------------+------------------+-----------+---------------------------
             100 |              317 |      1001 |                      1051
            1000 |               34 |       941 |                      1023
            2000 |               15 |       813 |                      1016
The optimized version works much faster with a large number of partitions and degrades more slowly than constraint exclusion. But there is still a noticeable performance hit from copying the PartitionDesc structure: with 2000 partitions, the RelationGetPartitionDesc() function spent more than 40% of all execution time on copying in the first benchmark (measured with `perf`). Using reference counting as Amit suggests would significantly reduce this degradation.
Any comments and suggestions are very welcome. Thanks!
On 18.05.2016 04:26, Amit Langote wrote:
> On 2016/05/18 2:22, Tom Lane wrote:
>> The two ways that we've dealt with this type of hazard are to copy data
>> out of the relcache before using it; or to give the relcache the
>> responsibility of not moving a particular portion of data if it did not
>> change. From memory, the latter applies to the tuple descriptor and
>> trigger data, but we've done most other things the first way.
> It seems that tuple descriptor is reference-counted; however trigger data
> is copied. The former seems to have been done on performance grounds (I
> found 06e10abc). So for a performance-sensitive relcache data structure,
> refcounting is the way to go (although done quite rarely)? In this
> particular case, it is a "partition descriptor" that could get big for a
> partitioned table with partitions in hundreds or thousands.
> Thanks, Amit
-- Ildar Musin i.musin@postgrespro.ru
Attachment
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2016/05/18 2:22, Tom Lane wrote: >> The two ways that we've dealt with this type of hazard are to copy data >> out of the relcache before using it; or to give the relcache the >> responsibility of not moving a particular portion of data if it did not >> change. From memory, the latter applies to the tuple descriptor and >> trigger data, but we've done most other things the first way. After actually looking at the code, we do things that way for the tupledesc, the relation's rules if any, and RLS policies --- see RelationClearRelation(). > It seems that tuple descriptor is reference-counted; however trigger data > is copied. The former seems to have been done on performance grounds (I > found 06e10abc). We do refcount tuple descriptors, but we've been afraid to try to rely completely on that; there are too many places that assume a relcache entry's tupdesc is safe to reference. It's not that easy to go over to a fully refcounted approach, because that creates a new problem of being sure that refcounts are decremented when necessary --- that's a pain, particularly when a query is abandoned due to an error. > So for a performance-sensitive relcache data structure, refcounting is the > way to go (although done quite rarely)? I'd be suspicious of this because of the cleanup problem. The don't-replace-unless-changed approach is the one that's actually battle tested. regards, tom lane
On 2016/05/19 2:48, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> On 2016/05/18 2:22, Tom Lane wrote: >>> The two ways that we've dealt with this type of hazard are to copy data >>> out of the relcache before using it; or to give the relcache the >>> responsibility of not moving a particular portion of data if it did not >>> change. From memory, the latter applies to the tuple descriptor and >>> trigger data, but we've done most other things the first way. > > After actually looking at the code, we do things that way for the > tupledesc, the relation's rules if any, and RLS policies --- see > RelationClearRelation(). I think I confused refcounting method of keeping things around with the RelationClearRelation()'s method. I now understand that you meant the latter in your original message. >> It seems that tuple descriptor is reference-counted; however trigger data >> is copied. The former seems to have been done on performance grounds (I >> found 06e10abc). > > We do refcount tuple descriptors, but we've been afraid to try to rely > completely on that; there are too many places that assume a relcache > entry's tupdesc is safe to reference. It's not that easy to go over to > a fully refcounted approach, because that creates a new problem of being > sure that refcounts are decremented when necessary --- that's a pain, > particularly when a query is abandoned due to an error. I see. >> So for a performance-sensitive relcache data structure, refcounting is the >> way to go (although done quite rarely)? > > I'd be suspicious of this because of the cleanup problem. The > don't-replace-unless-changed approach is the one that's actually battle > tested. OK, I will try the RelationClearRelation()'s method of keeping partition descriptor data around so that no repeated copying is necessary. Thanks, Amit
Hi Ildar, On 2016/05/19 0:36, Ildar Musin wrote: > > Here is an experimental patch that optimizes planning time for range > partitioned tables (it could be considered as a "proof of concept"). Patch > should be applied on top of Amit's declarative partitioning patch. It > handles only a very special case (often used though) where partitioning > key consists of just a single attribute and doesn't contain expressions. Great, thanks! I understand that it's still PoC and the point may be just to consider performance implications of excessive partdesc copying but I'm wondering about a few things about the patch in general. See below. > The main idea is the following: > * we are looking for clauses like 'VAR OP CONST' (where VAR is > partitioning key attribute, OP is a comparison operator); > * using binary search find a partition (X) that fits CONST value; > * based on OP operator determine which partitions are also covered by > clause. There are possible cases: > 1. If OP is '<' or '<=' then we need partitions standing left from X > (including) > 2. If OP is '>' or '>=' then we need partitions standing right from X > (including) > 3. If OP is '=' the we need only X partition > (for '<' and '>' operators we also check if CONST value is equal to a > lower or upper boundary (accordingly) and if it's true then exclude X). > > For boolean expressions we evaluate left and right sides accordingly to > algorithm above and then based on boolean operator find intersection (for > AND) or union (for OR). Perhaps you're already aware but may I also suggest looking at how clauses are matched to indexes? For example, consider how match_clauses_to_index() in src/backend/optimizer/path/indxpath.c works. Moreover, instead of pruning partitions in planner prep phase, might it not be better to do that when considering paths for the (partitioned) rel? IOW, instead of looking at parse->jointree, we should rather be working with rel->baserestrictinfo. Although, that would require some revisions to how append_rel_list, simple_rel_list, etc. are constructed and manipulated in a given planner invocation. Maybe it's time for that... Again, you may have already considered these things. > I run some benchmarks on: > 1. original constraint exclusion mechanism, > 2. optimized version (this patch) and > 3. optimized version using relation->rd_partdesc pointer instead of > RelationGetPartitionDesc() function (see previous discussion). > > Initial conditions: > > CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY > RANGE (a); > CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (0) END (1000); > CREATE TABLE abc_2 PARTITION OF abc FOR VALUES START (1000) END (2000); > ... > etc > INSERT INTO %s (a) SELECT generate_series(0, <partitions_count> * 1000); > > pgbench scripts: > https://gist.github.com/zilder/872e634a8eeb405bd045465fc9527e53 (where > :partitions is a number of partitions). > The first script tests fetching a single row from the partitioned table. > Results (tps): > > # of partitions | constraint excl. | optimized | optimized (using > pointer) > ----------------+------------------+---------------+---------------------------- > > 100 | 658 | 2906 | 3079 > 1000 | 45 | 2174 | 3021 > 2000 | 22 | 1667 | 2919 > > > The second script tests fetching all data from a single partition. Results > (tps): > > # of partitions | constraint excl. 
| optimized | optimized (using > pointer) > ----------------+------------------+---------------+---------------------------- > > 100 | 317 | 1001 | 1051 > 1000 | 34 | 941 | 1023 > 2000 | 15 | 813 | 1016 > > Optimized version works much faster on large amount of partitions and > degradates slower than constraint exclusion. But still there is a > noticeable performance degradation from copying PartitionDesc structure: > with 2000 partitions RelationGetPartitionDesc() function spent more than > 40% of all execution time on copying in first benchmark (measured with > `perf`). Using reference counting as Amit suggests will allow to > significantily decrease performance degradation. Could you try with the attached updated set of patches? I changed partition descriptor relcache code to eliminate excessive copying in previous versions. Thanks, Amit
Attachment
- 0001-Add-syntax-to-specify-partition-key-v5.patch
- 0002-Add-a-IGNORE-dependency-type-v5.patch
- 0003-Infrastructure-for-creation-of-partitioned-tables-v5.patch
- 0004-Add-syntax-to-create-partitions-v5.patch
- 0005-Infrastructure-for-partition-metadata-storage-and-ma-v5.patch
- 0006-Introduce-tuple-routing-for-partitioned-tables-v5.patch
Hi Amit,

On 20.05.2016 11:37, Amit Langote wrote:
> Perhaps you're already aware but may I also suggest looking at how
> clauses are matched to indexes? For example, consider how
> match_clauses_to_index() in src/backend/optimizer/path/indxpath.c works.

Thanks, I'll take a closer look at it.

> Moreover, instead of pruning partitions in planner prep phase, might it
> not be better to do that when considering paths for the (partitioned) rel?
> IOW, instead of looking at parse->jointree, we should rather be working
> with rel->baserestrictinfo. Although, that would require some revisions
> to how append_rel_list, simple_rel_list, etc. are constructed and
> manipulated in a given planner invocation. Maybe it's time for that...
> Again, you may have already considered these things.

Yes, you're right, this is how we did it in the pg_pathman extension. But for this patch it requires further consideration and I'll do it in the future!

> Could you try with the attached updated set of patches? I changed
> partition descriptor relcache code to eliminate excessive copying in
> previous versions.
>
> Thanks,
> Amit

I tried your new patch and got the following results, which are quite close to the ones using a pointer to the PartitionDesc structure (TPS):

 # of partitions | single row | single partition
-----------------+------------+------------------
             100 |       3014 |             1024
            1000 |       2964 |             1001
            2000 |       2874 |             1000

However, I've encountered a problem: postgres crashes occasionally while creating partitions. Here is a function that reproduces this behaviour:

CREATE OR REPLACE FUNCTION fail()
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
DROP TABLE IF EXISTS abc CASCADE;
CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY RANGE (a);
CREATE INDEX ON abc (a);
CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000);
CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (1000) END (2000);
END
$$;

SELECT fail();

It happens not every time but quite often. It doesn't happen if I execute these commands one by one in psql.
Backtrace:

#0  range_overlaps_existing_partition (key=0x7f1097504410, range_spec=0x1d0f400, pdesc=0x1d32200, with=0x7ffe437ead00) at partition.c:747
#1  0x000000000054c2a5 in StorePartitionBound (relid=245775, parentId=245770, bound=0x1d0f400) at partition.c:578
#2  0x000000000061bfc4 in DefineRelation (stmt=0x1d0dfe0, relkind=114 'r', ownerId=10, typaddress=0x0) at tablecmds.c:739
#3  0x00000000007f4473 in ProcessUtilitySlow (parsetree=0x1d1a150, queryString=0x1d1d940 "CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000)", context=PROCESS_UTILITY_QUERY, params=0x0, dest=0xdb5ca0 <spi_printtupDR>, completionTag=0x7ffe437eb500 "") at utility.c:983
#4  0x00000000007f425e in standard_ProcessUtility (parsetree=0x1d1a150, queryString=0x1d1d940 "CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000)", context=PROCESS_UTILITY_QUERY, params=0x0, dest=0xdb5ca0 <spi_printtupDR>, completionTag=0x7ffe437eb500 "") at utility.c:907
#5  0x00000000007f3354 in ProcessUtility (parsetree=0x1d1a150, queryString=0x1d1d940 "CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000)", context=PROCESS_UTILITY_QUERY, params=0x0, dest=0xdb5ca0 <spi_printtupDR>, completionTag=0x7ffe437eb500 "") at utility.c:336
#6  0x000000000069f8b2 in _SPI_execute_plan (plan=0x1d19cf0, paramLI=0x0, snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000', fire_triggers=1 '\001', tcount=0) at spi.c:2200
#7  0x000000000069c735 in SPI_execute_plan_with_paramlist (plan=0x1d19cf0, params=0x0, read_only=0 '\000', tcount=0) at spi.c:450
#8  0x00007f108cc6266f in exec_stmt_execsql (estate=0x7ffe437eb8e0, stmt=0x1d05318) at pl_exec.c:3517
#9  0x00007f108cc5e5fc in exec_stmt (estate=0x7ffe437eb8e0, stmt=0x1d05318) at pl_exec.c:1503
#10 0x00007f108cc5e318 in exec_stmts (estate=0x7ffe437eb8e0, stmts=0x1d04c98) at pl_exec.c:1398
#11 0x00007f108cc5e1af in exec_stmt_block (estate=0x7ffe437eb8e0, block=0x1d055e0) at pl_exec.c:1336
#12 0x00007f108cc5c35d in plpgsql_exec_function (func=0x1cc2a90, fcinfo=0x1cf7f50, simple_eval_estate=0x0) at pl_exec.c:434
...

Thanks

--
Ildar Musin
i.musin@postgrespro.ru
Hi Ildar, On 2016/05/21 0:29, Ildar Musin wrote: > On 20.05.2016 11:37, Amit Langote wrote: >> Moreover, instead of pruning partitions in planner prep phase, might it >> not be better to do that when considering paths for the (partitioned) rel? >> IOW, instead of looking at parse->jointree, we should rather be working >> with rel->baserestrictinfo. Although, that would require some revisions >> to how append_rel_list, simple_rel_list, etc. are constructed and >> manipulated in a given planner invocation. Maybe it's time for that... >> Again, you may have already considered these things. >> > Yes, you're right, this is how we did it in pg_pathman extension. But for > this patch it requires further consideration and I'll do it in future! OK, sure. >> Could you try with the attached updated set of patches? I changed >> partition descriptor relcache code to eliminate excessive copying in >> previous versions. [...] > However I've encountered a problem which is that postgres crashes > occasionally while creating partitions. Here is function that reproduces > this behaviour: > > CREATE OR REPLACE FUNCTION fail() > RETURNS void > LANGUAGE plpgsql > AS $$ > BEGIN > DROP TABLE IF EXISTS abc CASCADE; > CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY > RANGE (a); > CREATE INDEX ON abc (a); > CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000); > CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (1000) END (2000); > END > $$; > > SELECT fail(); > > It happens not every time but quite often. It doesn't happen if I execute > this commands one by one in psql. Backtrace: > > #0 range_overlaps_existing_partition (key=0x7f1097504410, > range_spec=0x1d0f400, pdesc=0x1d32200, with=0x7ffe437ead00) at > partition.c:747 [...] I made a mistake in the last version of the patch which caused a relcache field to be pfree'd unexpectedly. Attached updated patches. Thanks, Amit
Attachment
- 0005-Infrastructure-for-partition-metadata-storage-and-ma-v5.patch
- 0001-Add-syntax-to-specify-partition-key-v5.patch
- 0002-Add-a-IGNORE-dependency-type-v5.patch
- 0003-Infrastructure-for-creation-of-partitioned-tables-v5.patch
- 0004-Add-syntax-to-create-partitions-v5.patch
- 0006-Introduce-tuple-routing-for-partitioned-tables-v5.patch
0003-... patch does not apply cleanly. It has some conflicts in pg_dump.c. I have tried fixing the conflict in attached patch.
On Mon, May 23, 2016 at 3:35 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi Ildar,
On 2016/05/21 0:29, Ildar Musin wrote:
> On 20.05.2016 11:37, Amit Langote wrote:
>> Moreover, instead of pruning partitions in planner prep phase, might it
>> not be better to do that when considering paths for the (partitioned) rel?
>> IOW, instead of looking at parse->jointree, we should rather be working
>> with rel->baserestrictinfo. Although, that would require some revisions
>> to how append_rel_list, simple_rel_list, etc. are constructed and
>> manipulated in a given planner invocation. Maybe it's time for that...
>> Again, you may have already considered these things.
>>
> Yes, you're right, this is how we did it in pg_pathman extension. But for
> this patch it requires further consideration and I'll do it in future!
OK, sure.
>> Could you try with the attached updated set of patches? I changed
>> partition descriptor relcache code to eliminate excessive copying in
>> previous versions.
[...]
> However I've encountered a problem which is that postgres crashes
> occasionally while creating partitions. Here is function that reproduces
> this behaviour:
>
> CREATE OR REPLACE FUNCTION fail()
> RETURNS void
> LANGUAGE plpgsql
> AS $$
> BEGIN
> DROP TABLE IF EXISTS abc CASCADE;
> CREATE TABLE abc (id SERIAL NOT NULL, a INT, dt TIMESTAMP) PARTITION BY
> RANGE (a);
> CREATE INDEX ON abc (a);
> CREATE TABLE abc_0 PARTITION OF abc FOR VALUES START (0) END (1000);
> CREATE TABLE abc_1 PARTITION OF abc FOR VALUES START (1000) END (2000);
> END
> $$;
>
> SELECT fail();
>
> It happens not every time but quite often. It doesn't happen if I execute
> this commands one by one in psql. Backtrace:
>
> #0 range_overlaps_existing_partition (key=0x7f1097504410,
> range_spec=0x1d0f400, pdesc=0x1d32200, with=0x7ffe437ead00) at
> partition.c:747
[...]
I made a mistake in the last version of the patch which caused a relcache
field to be pfree'd unexpectedly. Attached updated patches.
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
On 2016/06/08 22:22, Ashutosh Bapat wrote: > On Mon, May 23, 2016 at 3:35 PM, Amit Langote wrote >> >> [...] >> >> I made a mistake in the last version of the patch which caused a relcache >> field to be pfree'd unexpectedly. Attached updated patches. > > 0003-... patch does not apply cleanly. It has some conflicts in pg_dump.c. > I have tried fixing the conflict in attached patch. Thanks. See attached rebased patches. Regards, Amit
Attachment
- 0001-Add-syntax-to-specify-partition-key-v6.patch
- 0002-Add-a-IGNORE-dependency-type-v6.patch
- 0003-Infrastructure-for-creation-of-partitioned-tables-v6.patch
- 0004-Add-syntax-to-create-partitions-v6.patch
- 0005-Infrastructure-for-partition-metadata-storage-and-ma-v6.patch
- 0006-Introduce-tuple-routing-for-partitioned-tables-v6.patch
Hi Amit,
I tried creating a partitioned table by range on an expression like
CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + b);
ERROR: syntax error at or near "+"
LINE 1: ...E pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + b);
On Thu, Jun 9, 2016 at 7:20 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/06/08 22:22, Ashutosh Bapat wrote:
> On Mon, May 23, 2016 at 3:35 PM, Amit Langote wrote
>>
>> [...]
>>
>> I made a mistake in the last version of the patch which caused a relcache
>> field to be pfree'd unexpectedly. Attached updated patches.
>
> 0003-... patch does not apply cleanly. It has some conflicts in pg_dump.c.
> I have tried fixing the conflict in attached patch.
Thanks. See attached rebased patches.
Regards,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/06/21 16:49, Ashutosh Bapat wrote: > Hi Amit, > I tried creating partitioned table by range on an expression like > CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + > b); > ERROR: syntax error at or near > "+" > > LINE 1: ...E pt1_e (a int, b int, c varchar) PARTITION BY RANGE(a + b); > > But when I try to create partitioned table by range on expression chr(a) it > works. The expression needs to be parenthesized (function call expressions don't need though). So something like the following would work: CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE(add(a, b)); Thanks, Amit
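PS: The parenthesized form of your original expression works as well, for example:

CREATE TABLE pt1_e (a int, b int, c varchar) PARTITION BY RANGE((a + b));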
Hi Amit,
I tried creating a 2-level partitioned table and then creating a simple table using CTAS from the partitioned table. It gives a cache lookup error. Here's the test:
CREATE TABLE pt1_l (a int, b varchar, c int) PARTITION BY RANGE(a);
CREATE TABLE pt1_l_p1 PARTITION OF pt1_l FOR VALUES START (1) END (250) INCLUSIVE PARTITION BY RANGE(b);
CREATE TABLE pt1_l_p2 PARTITION OF pt1_l FOR VALUES START (251) END (500) INCLUSIVE PARTITION BY RANGE(((a+c)/2));
CREATE TABLE pt1_l_p3 PARTITION OF pt1_l FOR VALUES START (501) END (600) INCLUSIVE PARTITION BY RANGE(c);
CREATE TABLE pt1_l_p1_p1 PARTITION OF pt1_l_p1 FOR VALUES START ('000001') END ('000125') INCLUSIVE;
CREATE TABLE pt1_l_p1_p2 PARTITION OF pt1_l_p1 FOR VALUES START ('000126') END ('000250') INCLUSIVE;
CREATE TABLE pt1_l_p2_p1 PARTITION OF pt1_l_p2 FOR VALUES START (251) END (375) INCLUSIVE;
CREATE TABLE pt1_l_p2_p2 PARTITION OF pt1_l_p2 FOR VALUES START (376) END (500) INCLUSIVE;
CREATE TABLE pt1_l_p3_p1 PARTITION OF pt1_l_p3 FOR VALUES START (501) END (550) INCLUSIVE;
CREATE TABLE pt1_l_p3_p2 PARTITION OF pt1_l_p3 FOR VALUES START (551) END (600) INCLUSIVE;
INSERT INTO pt1_l SELECT i, to_char(i, 'FM000000'), i FROM generate_series(1, 600, 2) i;
CREATE TABLE upt1_l AS SELECT * FROM pt1_l;
On Thu, Jun 9, 2016 at 7:20 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/06/08 22:22, Ashutosh Bapat wrote:
> On Mon, May 23, 2016 at 3:35 PM, Amit Langote wrote
>>
>> [...]
>>
>> I made a mistake in the last version of the patch which caused a relcache
>> field to be pfree'd unexpectedly. Attached updated patches.
>
> 0003-... patch does not apply cleanly. It has some conflicts in pg_dump.c.
> I have tried fixing the conflict in attached patch.
Thanks. See attached rebased patches.
Regards,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Hi Ashutosh, On 2016/06/24 23:08, Ashutosh Bapat wrote: > Hi Amit, > I tried creating 2-level partitioned table and tried to create simple table > using CTAS from the partitioned table. It gives a cache lookup error. > Here's the test > CREATE TABLE pt1_l (a int, b varchar, c int) PARTITION BY RANGE(a); > CREATE TABLE pt1_l_p1 PARTITION OF pt1_l FOR VALUES START (1) END (250) > INCLUSIVE PARTITION BY RANGE(b); > CREATE TABLE pt1_l_p2 PARTITION OF pt1_l FOR VALUES START (251) END (500) > INCLUSIVE PARTITION BY RANGE(((a+c)/2)); > CREATE TABLE pt1_l_p3 PARTITION OF pt1_l FOR VALUES START (501) END (600) > INCLUSIVE PARTITION BY RANGE(c); > CREATE TABLE pt1_l_p1_p1 PARTITION OF pt1_l_p1 FOR VALUES START ('000001') > END ('000125') INCLUSIVE; > CREATE TABLE pt1_l_p1_p2 PARTITION OF pt1_l_p1 FOR VALUES START ('000126') > END ('000250') INCLUSIVE; > CREATE TABLE pt1_l_p2_p1 PARTITION OF pt1_l_p2 FOR VALUES START (251) END > (375) INCLUSIVE; > CREATE TABLE pt1_l_p2_p2 PARTITION OF pt1_l_p2 FOR VALUES START (376) END > (500) INCLUSIVE; > CREATE TABLE pt1_l_p3_p1 PARTITION OF pt1_l_p3 FOR VALUES START (501) END > (550) INCLUSIVE; > CREATE TABLE pt1_l_p3_p2 PARTITION OF pt1_l_p3 FOR VALUES START (551) END > (600) INCLUSIVE; > INSERT INTO pt1_l SELECT i, to_char(i, 'FM000000'), i FROM > generate_series(1, 600, 2) i; > CREATE TABLE upt1_l AS SELECT * FROM pt1_l; > > The last statement gives error "ERROR: cache lookup failed for function > 0". Let me know if this problem is reproducible. Thanks for the test case. I can reproduce the error. It has to do with partition key column (b) being of type varchar whereas the default operator family for the type being text_ops and there not being an =(varchar, varchar) operator in text_ops. When creating a plan for SELECT * FROM pt1_l, which is an Append plan, partition check quals are generated internally to be used for constraint exclusion - such as, b >= '000001' AND b < '000125'. Individual OpExpr's are generated by using information from PartitionKey of the parent (PartitionKey) and pg_partition entry of the partition. When choosing the operator to use for =, >=, <, etc., opfamily and typid of corresponding columns are referred. As mentioned above, in this case, they happened to be text_ops and varchar, respectively, for column b. There doesn't exist an operator =(varchar, varchar) in text_ops, so InvalidOid is returned by get_opfamily_member which goes unchecked until the error in question occurs. I have tried to fix this in attached updated patch by using operator class input type for operator resolution in cases where column type didn't help. Thanks, Amit
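PS: As a rough catalog-level sanity check of the above (not part of the patch, just for reference), the following query looks for a btree equality member keyed on varchar in the text_ops family; it returns no rows, which is exactly why get_opfamily_member() came back with InvalidOid:

SELECT ao.amopopr::regoperator
FROM pg_amop ao
JOIN pg_opfamily opf ON opf.oid = ao.amopfamily
JOIN pg_am am ON am.oid = opf.opfmethod
WHERE am.amname = 'btree'
  AND opf.opfname = 'text_ops'
  AND ao.amoplefttype = 'varchar'::regtype
  AND ao.amoprighttype = 'varchar'::regtype
  AND ao.amopstrategy = 3;  -- btree equality strategy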
Attachment
- 0001-Add-syntax-to-specify-partition-key-v7.patch
- 0002-Add-a-IGNORE-dependency-type-v7.patch
- 0003-Infrastructure-for-creation-of-partitioned-tables-v7.patch
- 0004-Add-syntax-to-create-partitions-v7.patch
- 0005-Infrastructure-for-partition-metadata-storage-and-ma-v7.patch
- 0006-Introduce-tuple-routing-for-partitioned-tables-v7.patch
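For illustration, the fallback described above could take roughly the following shape. This is only a sketch under the assumptions stated in the message; the function name and error text are made up for the example and are not taken from the patch:

/*
 * Illustrative sketch: resolve the comparison operator for a partition
 * key column, falling back to the opclass's declared input type when the
 * column's own type (e.g. varchar) has no entry in the opfamily.
 */
static Oid
get_partition_operator(Oid opfamily, Oid opcintype, Oid coltype,
                       int16 strategy)
{
    Oid     oper;

    /* First try the column's own type, as before. */
    oper = get_opfamily_member(opfamily, coltype, coltype, strategy);

    /* Fall back to the opclass input type, e.g. text for varchar. */
    if (!OidIsValid(oper))
        oper = get_opfamily_member(opfamily, opcintype, opcintype, strategy);

    if (!OidIsValid(oper))
        elog(ERROR, "missing operator (strategy %d) in opfamily %u",
             strategy, opfamily);

    return oper;
}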
Hi Amit,
I observed that the ChangeVarNodes call at line 1229 in get_relation_constraints() (code below) changes the varno in the cached copy of the partition key expression, which is undesirable. The reason seems to be that RelationGetPartitionCheckQual() returns the partition key expression directly from the cache rather than a copy. This is mostly because get_check_qual_for_range() works directly on the cached copy of the partition key expressions, which it never should.

1223 /* Append partition check quals, if any */
1224 pcqual = RelationGetPartitionCheckQual(relation);
1225 if (pcqual)
1226 {
1227 /* Fix Vars to have the desired varno */
1228 if (varno != 1)
1229 ChangeVarNodes((Node *) pcqual, 1, varno, 0);
1230
1231 result = list_concat(result, pcqual);
1232 }
On Mon, Jun 27, 2016 at 3:56 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi Ashutosh,
On 2016/06/24 23:08, Ashutosh Bapat wrote:
> Hi Amit,
> I tried creating 2-level partitioned table and tried to create simple table
> using CTAS from the partitioned table. It gives a cache lookup error.
> Here's the test
> CREATE TABLE pt1_l (a int, b varchar, c int) PARTITION BY RANGE(a);
> CREATE TABLE pt1_l_p1 PARTITION OF pt1_l FOR VALUES START (1) END (250)
> INCLUSIVE PARTITION BY RANGE(b);
> CREATE TABLE pt1_l_p2 PARTITION OF pt1_l FOR VALUES START (251) END (500)
> INCLUSIVE PARTITION BY RANGE(((a+c)/2));
> CREATE TABLE pt1_l_p3 PARTITION OF pt1_l FOR VALUES START (501) END (600)
> INCLUSIVE PARTITION BY RANGE(c);
> CREATE TABLE pt1_l_p1_p1 PARTITION OF pt1_l_p1 FOR VALUES START ('000001')
> END ('000125') INCLUSIVE;
> CREATE TABLE pt1_l_p1_p2 PARTITION OF pt1_l_p1 FOR VALUES START ('000126')
> END ('000250') INCLUSIVE;
> CREATE TABLE pt1_l_p2_p1 PARTITION OF pt1_l_p2 FOR VALUES START (251) END
> (375) INCLUSIVE;
> CREATE TABLE pt1_l_p2_p2 PARTITION OF pt1_l_p2 FOR VALUES START (376) END
> (500) INCLUSIVE;
> CREATE TABLE pt1_l_p3_p1 PARTITION OF pt1_l_p3 FOR VALUES START (501) END
> (550) INCLUSIVE;
> CREATE TABLE pt1_l_p3_p2 PARTITION OF pt1_l_p3 FOR VALUES START (551) END
> (600) INCLUSIVE;
> INSERT INTO pt1_l SELECT i, to_char(i, 'FM000000'), i FROM
> generate_series(1, 600, 2) i;
> CREATE TABLE upt1_l AS SELECT * FROM pt1_l;
>
> The last statement gives error "ERROR: cache lookup failed for function
> 0". Let me know if this problem is reproducible.
Thanks for the test case. I can reproduce the error. It has to do with
partition key column (b) being of type varchar whereas the default
operator family for the type being text_ops and there not being an
=(varchar, varchar) operator in text_ops.
When creating a plan for SELECT * FROM pt1_l, which is an Append plan,
partition check quals are generated internally to be used for constraint
exclusion - such as, b >= '000001' AND b < '000125'. Individual OpExpr's
are generated by using information from PartitionKey of the parent
(PartitionKey) and pg_partition entry of the partition. When choosing the
operator to use for =, >=, <, etc., opfamily and typid of corresponding
columns are referred. As mentioned above, in this case, they happened to
be text_ops and varchar, respectively, for column b. There doesn't exist
an operator =(varchar, varchar) in text_ops, so InvalidOid is returned by
get_opfamily_member which goes unchecked until the error in question occurs.
I have tried to fix this in attached updated patch by using operator class
input type for operator resolution in cases where column type didn't help.
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/07/04 21:31, Ashutosh Bapat wrote:
> Hi Amit,
> I observed that the ChangeVarNodes call at line 1229 in
> get_relation_constraints() (code below) changes the varno in the cached
> copy of partition key expression, which is undesirable. The reason for this
> seems to be that the RelationGetPartitionCheckQual() returns the copy of
> partition key expression directly from the cache. This is mostly because
> get_check_qual_for_range() directly works on cached copy of partition key
> expressions, which it should never.

Yes, a copyObject() on key->partexprs items seems necessary. Will fix that.

> 1223 /* Append partition check quals, if any */
> 1224 pcqual = RelationGetPartitionCheckQual(relation);
> 1225 if (pcqual)
> 1226 {
> 1227 /* Fix Vars to have the desired varno */
> 1228 if (varno != 1)
> 1229 ChangeVarNodes((Node *) pcqual, 1, varno, 0);
> 1230
> 1231 result = list_concat(result, pcqual);
> 1232 }
>
> Because of this, the first time through the partition key expressions are
> valid, but then onwards they are restamped with the varno of the first
> partition.
>
> Please add testcases to your patch to catch such types of issues.

I will integrate tests into the patch(es) and add some more.

Thanks,
Amit
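For illustration, here is the same get_relation_constraints() fragment with a defensive copy added. This is only a sketch of one way to stop the scribbling; the fix actually being discussed is to copy inside the partitioning code (copyObject on the key->partexprs items) rather than in the caller:

    /* Append partition check quals, if any */
    pcqual = RelationGetPartitionCheckQual(relation);
    if (pcqual)
    {
        /*
         * Illustrative variant only: work on a copy so that ChangeVarNodes()
         * below cannot modify expressions cached in the relcache.
         */
        pcqual = (List *) copyObject(pcqual);

        /* Fix Vars to have the desired varno */
        if (varno != 1)
            ChangeVarNodes((Node *) pcqual, 1, varno, 0);

        result = list_concat(result, pcqual);
    }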
<div dir="ltr">I am seeing following warning with this set of patches.<br />gram.y:4734:24: warning: assignment from incompatiblepointer type [enabled by default]<br /></div><div class="gmail_extra"><br /><div class="gmail_quote">On Tue,Jul 5, 2016 at 10:18 AM, Amit Langote <span dir="ltr"><<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp" target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 00 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 2016/07/04 21:31, Ashutosh Bapat wrote:<br /> >Hi Amit,<br /></span><span class="">> I observed that the ChangeVarNodes call at line 1229 in<br /> > get_relation_constraints()(code below) changes the varno in the cached<br /> > copy of partition key expression, whichis undesirable. The reason for this<br /> > seems to be that the RelationGetPartitionCheckQual() returns the copyof<br /> > partition key expression directly from the cache. This is mostly because<br /> > get_check_qual_for_range()directly works on cached copy of partition key<br /> > expressions, which it should never.<br/><br /></span>Yes, a copyObject() on key->partexprs items seems necessary. Will fix that.<br /><span class=""><br/> > 1223 /* Append partition check quals, if any */<br /> > 1224 pcqual = RelationGetPartitionCheckQual(relation);<br/> > 1225 if (pcqual)<br /> > 1226 {<br /> > 1227 /*Fix Vars to have the desired varno */<br /> > 1228 if (varno != 1)<br /> > 1229 ChangeVarNodes((Node*) pcqual, 1, varno, 0);<br /> > 1230<br /> > 1231 result = list_concat(result, pcqual);<br/> > 1232 }<br /> ><br /> > Because of this, the first time through the partition key expressionsare<br /> > valid, but then onwards they are restamped with the varno of the first<br /> > partition.<br/> ><br /> > Please add testcases to your patch to catch such types of issues.<br /><br /></span>I willintegrate tests into the patch(es) and add some more.<br /><br /> Thanks,<br /> Amit<br /><br /><br /></blockquote></div><br/><br clear="all" /><br />-- <br /><div class="gmail_signature" data-smartmail="gmail_signature"><divdir="ltr">Best Wishes,<br />Ashutosh Bapat<br />EnterpriseDB Corporation<br />The PostgresDatabase Company<br /></div></div></div>
On 2016/07/19 22:53, Ashutosh Bapat wrote:
> I am seeing following warning with this set of patches.
> gram.y:4734:24: warning: assignment from incompatible pointer type [enabled
> by default]

Thanks, will fix. Was a copy-paste error.

Thanks,
Amit
<div dir="ltr">The lists for list partitioned tables are stored as they are specified by the user. While searching for apartition to route tuple to, we compare it with every list value of every partition. We might do something better similarto what's been done to range partitions. The list of values for a given partition can be stored in ascending/descendingsorted order. Thus a binary search can be used to check whether given row's partition key column hassame value as one in the list. The partitions can then be stored in the ascending/descending order of the least/greatestvalues of corresponding partitions. We might be able to eliminate search in a given partition if its lowestvalue is higher than the given value or its higher value is lower than the given value.<br /></div><div class="gmail_extra"><br/><div class="gmail_quote">On Thu, Jul 21, 2016 at 10:10 AM, Amit Langote <span dir="ltr"><<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp"target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On2016/07/19 22:53, Ashutosh Bapat wrote:<br /> > I am seeing following warning with this set of patches.<br/> > gram.y:4734:24: warning: assignment from incompatible pointer type [enabled<br /> > by default]<br/><br /><br /></span>Thanks, will fix. Was a copy-paste error.<br /><br /> Thanks,<br /> Amit<br /><br /><br/></blockquote></div><br /><br clear="all" /><br />-- <br /><div class="gmail_signature" data-smartmail="gmail_signature"><divdir="ltr">Best Wishes,<br />Ashutosh Bapat<br />EnterpriseDB Corporation<br />The PostgresDatabase Company<br /></div></div></div>
On Fri, Aug 5, 2016 at 6:53 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> The lists for list partitioned tables are stored as they are specified by
> the user. While searching for a partition to route tuple to, we compare it
> with every list value of every partition. We might do something better
> similar to what's been done to range partitions. The list of values for a
> given partition can be stored in ascending/descending sorted order. Thus a
> binary search can be used to check whether given row's partition key column
> has same value as one in the list. The partitions can then be stored in the
> ascending/descending order of the least/greatest values of corresponding
> partitions.

+1. Here as with range partitions, we must be sure to know which
opclass should be used for the comparisons.

> We might be able to eliminate search in a given partition if its
> lowest value is higher than the given value or its higher value is lower
> than the given value.

I don't think I understand this part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Fri, Aug 5, 2016 at 5:21 PM, Robert Haas<span dir="ltr"><<a href="mailto:robertmhaas@gmail.com" target="_blank">robertmhaas@gmail.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">OnFri, Aug 5, 2016 at 6:53 AM, Ashutosh Bapat<br /> <<a href="mailto:ashutosh.bapat@enterprisedb.com">ashutosh.bapat@enterprisedb.<wbr/>com</a>> wrote:<br /> > The lists forlist partitioned tables are stored as they are specified by<br /> > the user. While searching for a partition to routetuple to, we compare it<br /> > with every list value of every partition. We might do something better<br /> >similar to what's been done to range partitions. The list of values for a<br /> > given partition can be stored inascending/descending sorted order. Thus a<br /> > binary search can be used to check whether given row's partition keycolumn<br /> > has same value as one in the list. The partitions can then be stored in the<br /> > ascending/descendingorder of the least/greatest values of corresponding<br /> > partitions.<br /><br /></span>+1. Hereas with range partitions, we must be sure to know which<br /> opclass should be used for the comparisons.<br /><spanclass=""><br /> > We might be able to eliminate search in a given partition if its<br /> > lowest value is higherthan the given value or its higher value is lower<br /> > than the given value.<br /><br /></span>I don't thinkI understand this part.<br /></blockquote></div><br clear="all" /></div><div class="gmail_extra">Consider lists ('e','i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a list partitioned tables. I am suggesting that we arrange them as('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row (either for comparison or for inserting) has value'c', we will search for it in ('a','b','l') but will be eliminate other two partitions since the second's partition'slowest value is higher than 'c' and lowest values of rest of the partitions are higher than that of the secondpartition. Without this order among the partitions, we will have to compare lowest values of all the partitions.<br/></div><div class="gmail_extra">-- <br /><div class="gmail_signature" data-smartmail="gmail_signature"><divdir="ltr">Best Wishes,<br />Ashutosh Bapat<br />EnterpriseDB Corporation<br />The PostgresDatabase Company<br /></div></div></div></div>
On Fri, Aug 5, 2016 at 8:10 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Aug 5, 2016 at 5:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Aug 5, 2016 at 6:53 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >> > The lists for list partitioned tables are stored as they are specified >> > by >> > the user. While searching for a partition to route tuple to, we compare >> > it >> > with every list value of every partition. We might do something better >> > similar to what's been done to range partitions. The list of values for >> > a >> > given partition can be stored in ascending/descending sorted order. Thus >> > a >> > binary search can be used to check whether given row's partition key >> > column >> > has same value as one in the list. The partitions can then be stored in >> > the >> > ascending/descending order of the least/greatest values of corresponding >> > partitions. >> >> +1. Here as with range partitions, we must be sure to know which >> opclass should be used for the comparisons. >> >> > We might be able to eliminate search in a given partition if its >> > lowest value is higher than the given value or its higher value is lower >> > than the given value. >> >> I don't think I understand this part. > > Consider lists ('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a > list partitioned tables. I am suggesting that we arrange them as > ('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row (either > for comparison or for inserting) has value 'c', we will search for it in > ('a','b','l') but will be eliminate other two partitions since the second's > partition's lowest value is higher than 'c' and lowest values of rest of the > partitions are higher than that of the second partition. Without this order > among the partitions, we will have to compare lowest values of all the > partitions. I would think that for that case what we'd want to do is create two lists. One looks like this: [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ] The other looks like this: [3, 3, 2, 1, 1, 2, 1, 1, 3, 2] Given a particular value, bsearch the first list. If the value is not found, it's not in any partition. If it is found, then go look up the same array index in the second list; that's the containing partition. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
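A sketch of the two-array lookup described above, written with plain integers for brevity; a real implementation would compare Datums using the comparison function of the key column's btree opclass, and the names here are illustrative only:

/*
 * 'values' holds every list value of every partition, in sorted order;
 * partidx[i] is the number of the partition that values[i] belongs to.
 * Returns the containing partition's number, or -1 if no partition
 * accepts the given value.
 */
static int
route_list_value(int key, const int *values, const int *partidx, int nvalues)
{
    int     lo = 0;
    int     hi = nvalues - 1;

    while (lo <= hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (values[mid] == key)
            return partidx[mid];        /* containing partition */
        else if (values[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }

    return -1;                          /* not in any partition's list */
}

With this layout, routing a tuple costs one binary search over all list values instead of scanning every partition's list.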
> Consider lists ('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a
> list partitioned tables. I am suggesting that we arrange them as
> ('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row (either
> for comparison or for inserting) has value 'c', we will search for it in
> ('a','b','l') but will be eliminate other two partitions since the second's
> partition's lowest value is higher than 'c' and lowest values of rest of the
> partitions are higher than that of the second partition. Without this order
> among the partitions, we will have to compare lowest values of all the
> partitions.
I would think that for that case what we'd want to do is create two
lists. One looks like this:
[ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ]
The other looks like this:
[3, 3, 2, 1, 1, 2, 1, 1, 3, 2]
small correction; there's an extra 1 here. Every partition in the example has only three values.
Given a particular value, bsearch the first list. If the value is not
found, it's not in any partition. If it is found, then go look up the
same array index in the second list; that's the containing partition.
+1, if we could do it. It will need a change in the way Amit's patch stores partitioning scheme in PartitionDesc.
This way specifications {('e', 'i', 'f'), ('h', 'd', 'm'), ('l', 'b', 'a')} and {('h', 'd', 'm'), ('e', 'i', 'f'), ('l', 'b', 'a')}, which are essentially the same, are represented in different ways. It makes matching partitions for partition-wise join a bit tedious. We have to make sure that the first array matches for both the joining relations and then make sure that all the values belonging to the same partition in one table also belong to the same partition in the other table. Matching subsets of lists for partition-wise join would need some more complex logic still.
At least for straightforward partitioned table matching it helps to have both these arrays look the same independent of the user specification. From that point of view, the partitions could be ordered by their lowest or highest list values and the second array would be the index in the ordered set. For both the specifications above, the lists will look like
[ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ]
[1, 1, 2, 3, 3, 2, 3, 1, 2]
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/08/05 21:38, Ashutosh Bapat wrote: >>> Consider lists ('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a >>> list partitioned tables. I am suggesting that we arrange them as >>> ('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row >> (either >>> for comparison or for inserting) has value 'c', we will search for it in >>> ('a','b','l') but will be eliminate other two partitions since the >> second's >>> partition's lowest value is higher than 'c' and lowest values of rest of >> the >>> partitions are higher than that of the second partition. Without this >> order >>> among the partitions, we will have to compare lowest values of all the >>> partitions. >> >> I would think that for that case what we'd want to do is create two >> lists. One looks like this: >> >> [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ] >> >> The other looks like this: >> >> [3, 3, 2, 1, 1, 2, 1, 1, 3, 2] >> > > small correction; there's an extra 1 here. Every partition in the example > has only three values. > > >> >> Given a particular value, bsearch the first list. If the value is not >> found, it's not in any partition. If it is found, then go look up the >> same array index in the second list; that's the containing partition. >> > > +1, if we could do it. It will need a change in the way Amit's patch stores > partitioning scheme in PartitionDesc. Okay, I will try to implement this in the next version of the patch. One thing that comes to mind is what if a user wants to apply hash operator class equality to list partitioned key by specifying a hash operator class for the corresponding column. In that case, we would not have the ordering procedure with an hash operator class, hence any ordering based optimization becomes impossible to implement. The current patch rejects a column for partition key if its type does not have a btree operator class for both range and list methods, so this issue doesn't exist, however it could be seen as a limitation. > This way specifications {('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', > 'a')} and {('h', 'd','m') , ('e', 'i', 'f'), and ('l', 'b', 'a')} which are > essentially same, are represented in different ways. It makes matching > partitions for partition-wise join a bit tedius. We have to make sure that > the first array matches for both the joining relations and then make sure > that all the values belonging to the same partition for one table also > belong to the same partition in the other table. Some more complex logic > for matching subsets of lists for partition-wise join. So, we have 3 choices for the internal representation of list partitions: Choice 1 (the current approach): Load them in the same order as they are found in the partition catalog: Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} In this case, mismatch on the first list would make the two tables incompatibly partitioned, whereas they really aren't incompatible. Choice 2: Representation with 2 arrays: Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1] Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3] It still doesn't help the case of pairwise joins because it's hard to tell which value belongs to which partition (the 2nd array carries the original partition numbers). Although it might still work for tuple-routing. 
Choice 3: Order all lists' elements for each list individually and then order the lists themselves on their first values: Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} This representation makes pairing partitions for pairwise joining convenient but for tuple-routing we still need to visit each partition in the worst case. > At least for straight forward partitioned table matching it helps to have > both these array look same independent of the user specification. From that > point of view, the partition be ordered by their lowest or highest list > values and the second array is the index in the ordered set. For both the > specifications above, the list will look like > > [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ] > [1, 1, 2, 3, 3, 2, 3, 1, 2] IIUC, this seems like a combination of 2 and 3 above: So, we have the following list partitions (as read from the catalog) Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} By applying the method of 3: Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} Then applying 2: Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [3, 1, 2] Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [2, 3, 1] This is user-specification independent representation wherein the partition numbers in the 2nd array are based on canonical representation (ordered lists). To check pairwise join compatibility, simply compare the first two arrays. As to which partitions (think OIDs, RTEs whatever) pair with each other, simply pair corresponding elements of the 3rd array which are original partitions numbers (or OIDs). Also when routing a tuple, we find partition number in the array 2 and then look up the array 3 to get the actual partition to insert the tuple. Neither of these representations make the logic of checking pairwise-join compatibility and pairing a subset of partitions (on either side) any easier, although I haven't given it a good thought yet. Thanks, Amit
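Put as a data structure, the combined representation discussed above might look roughly like the following; the struct and field names are made up for illustration and are not the patch's actual PartitionDesc layout:

typedef struct ListPartitionMap
{
    int     nvalues;    /* number of list values across all partitions */
    Datum  *values;     /* all values, in the key opclass's sort order */
    int    *indexes;    /* for values[i], the canonical partition number */
    Oid    *partoids;   /* canonical partition number -> partition OID */
} ListPartitionMap;

Tuple routing would use values/indexes, while pairing partitions of two tables for a join would compare values and indexes and then pair the corresponding partoids entries.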
IIUC, this seems like a combination of 2 and 3 above:
So, we have the following list partitions (as read from the catalog)
Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
By applying the method of 3:
Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
Then applying 2:
Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [3, 1, 2]
Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [2, 3, 1]
We should arrange the OID arrays to follow canonical order. Assuming that the OIDs of p1, p2, and p3 of table 1 are t1o1, t1o2 and t1o3 respectively, and those of p1, p2 and p3 of table 2 are t2o1, t2o2, t2o3 respectively, the last arrays should be [t1o3, t1o2, t1o1] and [t2o2, t2o1, t2o3]. Thus the last arrays from both representations give the OIDs of children that should be joined pair-wise. IOW, the OID array should just follow the canonical order instead of specification order. AFAIU, your patch arranges the range partition OIDs in the canonical order and not specification order.
This is user-specification independent representation wherein the
partition numbers in the 2nd array are based on canonical representation
(ordered lists). To check pairwise join compatibility, simply compare the
first two arrays. As to which partitions (think OIDs, RTEs whatever) pair
with each other, simply pair corresponding elements of the 3rd array which
are original partitions numbers (or OIDs). Also when routing a tuple, we
find partition number in the array 2 and then look up the array 3 to get
the actual partition to insert the tuple.
Neither of these representations make the logic of checking pairwise-join
compatibility and pairing a subset of partitions (on either side) any
easier, although I haven't given it a good thought yet.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> +1, if we could do it. It will need a change in the way Amit's patch stores >> partitioning scheme in PartitionDesc. > > Okay, I will try to implement this in the next version of the patch. > > One thing that comes to mind is what if a user wants to apply hash > operator class equality to list partitioned key by specifying a hash > operator class for the corresponding column. In that case, we would not > have the ordering procedure with an hash operator class, hence any > ordering based optimization becomes impossible to implement. The current > patch rejects a column for partition key if its type does not have a btree > operator class for both range and list methods, so this issue doesn't > exist, however it could be seen as a limitation. Yes, I think you should expect careful scrutiny of that issue. It seems clear to me that range partitioning requires a btree opclass, that hash partitioning requires a hash opclass, and that list partitioning requires at least one of those things. It would probably be reasonable to pick one or the other and insist that list partitioning always requires exactly that, but I can't see how it's reasonable to insist that you must have both types of opclass for any type of partitioning. >> This way specifications {('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', >> 'a')} and {('h', 'd','m') , ('e', 'i', 'f'), and ('l', 'b', 'a')} which are >> essentially same, are represented in different ways. It makes matching >> partitions for partition-wise join a bit tedius. We have to make sure that >> the first array matches for both the joining relations and then make sure >> that all the values belonging to the same partition for one table also >> belong to the same partition in the other table. Some more complex logic >> for matching subsets of lists for partition-wise join. > > So, we have 3 choices for the internal representation of list partitions: > > Choice 1 (the current approach): Load them in the same order as they are > found in the partition catalog: > > Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} > Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} > > In this case, mismatch on the first list would make the two tables > incompatibly partitioned, whereas they really aren't incompatible. Such a limitation seems clearly unacceptable. We absolutely must be able to match up compatible partitioning schemes without getting confused by little details like the order of the partitions. > Choice 2: Representation with 2 arrays: > > Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1] > Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3] > > It still doesn't help the case of pairwise joins because it's hard to tell > which value belongs to which partition (the 2nd array carries the original > partition numbers). Although it might still work for tuple-routing. It's very good for tuple routing. It can also be used to match up partitions for pairwise joins. Compare the first arrays. If they are unequal, stop. Else, compare the second arrays, incrementally building a mapping between them and returning false if the mapping turns out to be non-bijective. For example, in this case, we look at index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3. We look at index 2 and decide 2 -> 1. We look at index 4 and find that we already have a mapping for 2, but it's compatible because we need 2 -> 1 and that's what is already there. 
Similarly for the remaining entries. This is really a pretty easy loop to write and it should run very quickly. > Choice 3: Order all lists' elements for each list individually and then > order the lists themselves on their first values: > > Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} > Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} > > This representation makes pairing partitions for pairwise joining > convenient but for tuple-routing we still need to visit each partition in > the worst case. I think this is clearly not good enough for tuple routing. If the algorithm I proposed above turns out to be too slow for matching partitions, then we could keep both this representation and the previous one. We are not limited to just one. But I don't think that's likely to be the case. Also, note that all of this presupposes we're doing range partitioning, or perhaps list partitioning with a btree opclass. For partitioning based on a hash opclass, you'd organize the data based on the hash values rather than range comparisons. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
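A sketch of the mapping check described above, assuming the two tables' canonical values arrays have already been found identical and that partition numbers are 0-based array indexes; names are illustrative only:

/*
 * idx1[i] and idx2[i] give, for the i'th canonical list value, the
 * partition containing it in table 1 and table 2 respectively.  map12 and
 * map21 are scratch arrays of sizes npart1 and npart2.  Returns true if
 * the partitions pair up one-to-one (map12 then gives the pairing).
 */
static bool
list_partitions_match(const int *idx1, const int *idx2, int nvalues,
                      int *map12, int npart1, int *map21, int npart2)
{
    int     i;

    for (i = 0; i < npart1; i++)
        map12[i] = -1;
    for (i = 0; i < npart2; i++)
        map21[i] = -1;

    for (i = 0; i < nvalues; i++)
    {
        int     p1 = idx1[i];
        int     p2 = idx2[i];

        if (map12[p1] == -1 && map21[p2] == -1)
        {
            map12[p1] = p2;     /* record the pairing, both directions */
            map21[p2] = p1;
        }
        else if (map12[p1] != p2 || map21[p2] != p1)
            return false;       /* the mapping is not one-to-one */
    }

    return true;
}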
On 2016/08/09 6:02, Robert Haas wrote: > On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> +1, if we could do it. It will need a change in the way Amit's patch stores >>> partitioning scheme in PartitionDesc. >> >> Okay, I will try to implement this in the next version of the patch. >> >> One thing that comes to mind is what if a user wants to apply hash >> operator class equality to list partitioned key by specifying a hash >> operator class for the corresponding column. In that case, we would not >> have the ordering procedure with an hash operator class, hence any >> ordering based optimization becomes impossible to implement. The current >> patch rejects a column for partition key if its type does not have a btree >> operator class for both range and list methods, so this issue doesn't >> exist, however it could be seen as a limitation. > > Yes, I think you should expect careful scrutiny of that issue. It > seems clear to me that range partitioning requires a btree opclass, > that hash partitioning requires a hash opclass, and that list > partitioning requires at least one of those things. It would probably > be reasonable to pick one or the other and insist that list > partitioning always requires exactly that, but I can't see how it's > reasonable to insist that you must have both types of opclass for any > type of partitioning. So because we intend to implement optimizations for list partition metadata that presuppose existence of corresponding btree operator class, we should just always require user to specify one (or error out if user doesn't specify and a default one doesn't exist). That way, we explicitly do not support specify hash equality operator for list partitioning. >> So, we have 3 choices for the internal representation of list partitions: >> >> Choice 1 (the current approach): Load them in the same order as they are >> found in the partition catalog: >> >> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} >> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} >> >> In this case, mismatch on the first list would make the two tables >> incompatibly partitioned, whereas they really aren't incompatible. > > Such a limitation seems clearly unacceptable. We absolutely must be > able to match up compatible partitioning schemes without getting > confused by little details like the order of the partitions. Agreed. Will change my patch to adopt the below method. >> Choice 2: Representation with 2 arrays: >> >> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1] >> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3] >> >> It still doesn't help the case of pairwise joins because it's hard to tell >> which value belongs to which partition (the 2nd array carries the original >> partition numbers). Although it might still work for tuple-routing. > > It's very good for tuple routing. It can also be used to match up > partitions for pairwise joins. Compare the first arrays. If they are > unequal, stop. Else, compare the second arrays, incrementally > building a mapping between them and returning false if the mapping > turns out to be non-bijective. For example, in this case, we look at > index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3. > We look at index 2 and decide 2 -> 1. We look at index 4 and find > that we already have a mapping for 2, but it's compatible because we > need 2 -> 1 and that's what is already there. Similarly for the > remaining entries. 
This is really a pretty easy loop to write and it > should run very quickly. I see, it does make sense to try to implement this way. >> Choice 3: Order all lists' elements for each list individually and then >> order the lists themselves on their first values: >> >> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} >> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} >> >> This representation makes pairing partitions for pairwise joining >> convenient but for tuple-routing we still need to visit each partition in >> the worst case. > > I think this is clearly not good enough for tuple routing. If the > algorithm I proposed above turns out to be too slow for matching > partitions, then we could keep both this representation and the > previous one. We are not limited to just one. But I don't think > that's likely to be the case. I agree. Let's see how the option 2 turns out. > Also, note that all of this presupposes we're doing range > partitioning, or perhaps list partitioning with a btree opclass. For > partitioning based on a hash opclass, you'd organize the data based on > the hash values rather than range comparisons. Yes, the current patch does not implement hash partitioning, although I have to think about how to support the hash case when designing the internal data structures. By the way, I am planning to start a new thread with the latest set of patches which I will post in a day or two. I have tried to implement all the bug fixes and improvements that have been suggested on this thread so far. Thanks to all those who reviewed and gave their comments. Please check this page to get a link to the new thread: https://commitfest.postgresql.org/10/611/ Thanks, Amit
<div dir="ltr">What strikes me odd about these patches is RelOptInfo has remained unmodified. For a base partitioned table,I would expect it to be marked as partitioned may be indicating the partitioning scheme. Instead of that, I see thatthe code directly deals with PartitionDesc, PartitionKey which are part of Relation. It uses other structures like PartitionKeyExecInfo.I don't have any specific observation as to why we need such information in RelOptInfo, but lack ofit makes me uncomfortable. It could be because inheritance code does not require any mark in RelOptInfo and your patchre-uses inheritance code. I am not sure.<br /></div><div class="gmail_extra"><br /><div class="gmail_quote">On Tue,Aug 9, 2016 at 9:17 AM, Amit Langote <span dir="ltr"><<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp" target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 00 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 2016/08/09 6:02, Robert Haas wrote:<br /> > OnMon, Aug 8, 2016 at 1:40 AM, Amit Langote<br /> > <<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp">Langote_Amit_f8@lab.ntt.co.jp</a><wbr/>> wrote:<br /> >>> +1, ifwe could do it. It will need a change in the way Amit's patch stores<br /> >>> partitioning scheme in PartitionDesc.<br/> >><br /> >> Okay, I will try to implement this in the next version of the patch.<br /> >><br/> >> One thing that comes to mind is what if a user wants to apply hash<br /> >> operator class equalityto list partitioned key by specifying a hash<br /> >> operator class for the corresponding column. In thatcase, we would not<br /> >> have the ordering procedure with an hash operator class, hence any<br /> >> orderingbased optimization becomes impossible to implement. The current<br /> >> patch rejects a column for partitionkey if its type does not have a btree<br /> >> operator class for both range and list methods, so this issuedoesn't<br /> >> exist, however it could be seen as a limitation.<br /> ><br /> > Yes, I think you shouldexpect careful scrutiny of that issue. It<br /> > seems clear to me that range partitioning requires a btree opclass,<br/> > that hash partitioning requires a hash opclass, and that list<br /> > partitioning requires at leastone of those things. It would probably<br /> > be reasonable to pick one or the other and insist that list<br />> partitioning always requires exactly that, but I can't see how it's<br /> > reasonable to insist that you musthave both types of opclass for any<br /> > type of partitioning.<br /><br /></span>So because we intend to implementoptimizations for list partition<br /> metadata that presuppose existence of corresponding btree operator class,<br/> we should just always require user to specify one (or error out if user<br /> doesn't specify and a default onedoesn't exist). 
That way, we explicitly<br /> do not support specify hash equality operator for list partitioning.<br/><span class=""><br /> >> So, we have 3 choices for the internal representation of list partitions:<br/> >><br /> >> Choice 1 (the current approach): Load them in the same order as they are<br />>> found in the partition catalog:<br /> >><br /> >> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a','e'}<br /> >> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}<br /> >><br /> >> In this case,mismatch on the first list would make the two tables<br /> >> incompatibly partitioned, whereas they really aren'tincompatible.<br /> ><br /> > Such a limitation seems clearly unacceptable. We absolutely must be<br /> >able to match up compatible partitioning schemes without getting<br /> > confused by little details like the orderof the partitions.<br /><br /></span>Agreed. Will change my patch to adopt the below method.<br /><span class=""><br/> >> Choice 2: Representation with 2 arrays:<br /> >><br /> >> Table 1: ['a', 'b', 'c', 'd','e', 'f'], [3, 1, 2, 2, 3, 1]<br /> >> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]<br /> >><br/> >> It still doesn't help the case of pairwise joins because it's hard to tell<br /> >> which valuebelongs to which partition (the 2nd array carries the original<br /> >> partition numbers). Although it mightstill work for tuple-routing.<br /> ><br /> > It's very good for tuple routing. It can also be used to matchup<br /> > partitions for pairwise joins. Compare the first arrays. If they are<br /> > unequal, stop. Else,compare the second arrays, incrementally<br /> > building a mapping between them and returning false if the mapping<br/> > turns out to be non-bijective. For example, in this case, we look at<br /> > index 0 and decide that3 -> 2. We look at index 1 and decide 1 -> 3.<br /> > We look at index 2 and decide 2 -> 1. We look atindex 4 and find<br /> > that we already have a mapping for 2, but it's compatible because we<br /> > need 2 ->1 and that's what is already there. Similarly for the<br /> > remaining entries. This is really a pretty easy loopto write and it<br /> > should run very quickly.<br /><br /></span>I see, it does make sense to try to implement thisway.<br /><span class=""><br /> >> Choice 3: Order all lists' elements for each list individually and then<br />>> order the lists themselves on their first values:<br /> >><br /> >> Table 1: p3 {'a', 'e'}, p2 {'b','f'}, p1 {'c', 'd'}<br /> >> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}<br /> >><br /> >>This representation makes pairing partitions for pairwise joining<br /> >> convenient but for tuple-routingwe still need to visit each partition in<br /> >> the worst case.<br /> ><br /> > I think this isclearly not good enough for tuple routing. If the<br /> > algorithm I proposed above turns out to be too slow for matching<br/> > partitions, then we could keep both this representation and the<br /> > previous one. We are not limitedto just one. But I don't think<br /> > that's likely to be the case.<br /><br /></span>I agree. Let's see howthe option 2 turns out.<br /><span class=""><br /> > Also, note that all of this presupposes we're doing range<br />> partitioning, or perhaps list partitioning with a btree opclass. 
For<br /> > partitioning based on a hash opclass,you'd organize the data based on<br /> > the hash values rather than range comparisons.<br /><br /></span>Yes,the current patch does not implement hash partitioning, although I<br /> have to think about how to supportthe hash case when designing the<br /> internal data structures.<br /><br /><br /> By the way, I am planning to starta new thread with the latest set of<br /> patches which I will post in a day or two. I have tried to implement all<br/> the bug fixes and improvements that have been suggested on this thread so<br /> far. Thanks to all those who reviewedand gave their comments. Please<br /> check this page to get a link to the new thread:<br /><a href="https://commitfest.postgresql.org/10/611/"rel="noreferrer" target="_blank">https://commitfest.postgresql.<wbr />org/10/611/</a><br/><br /> Thanks,<br /> Amit<br /><br /><br /></blockquote></div><br /><br clear="all" /><br />-- <br/><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Best Wishes,<br />Ashutosh Bapat<br />EnterpriseDBCorporation<br />The Postgres Database Company<br /></div></div></div>
FOR VALUE clause of a partition does not allow a constant expression like (10000/5 -1). It gives syntax error
regression=# create table pt1_p1 partition of pt1 for values start (0) end ((10000/5) - 1);
ERROR: syntax error at or near "("
LINE 1: ...pt1_p1 partition of pt1 for values start (0) end ((10000/5) ...
On Tue, Aug 9, 2016 at 12:48 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
What strikes me odd about these patches is RelOptInfo has remained unmodified. For a base partitioned table, I would expect it to be marked as partitioned may be indicating the partitioning scheme. Instead of that, I see that the code directly deals with PartitionDesc, PartitionKey which are part of Relation. It uses other structures like PartitionKeyExecInfo. I don't have any specific observation as to why we need such information in RelOptInfo, but lack of it makes me uncomfortable. It could be because inheritance code does not require any mark in RelOptInfo and your patch re-uses inheritance code. I am not sure.--On Tue, Aug 9, 2016 at 9:17 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: On 2016/08/09 6:02, Robert Haas wrote:
> On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> +1, if we could do it. It will need a change in the way Amit's patch stores
>>> partitioning scheme in PartitionDesc.
>>
>> Okay, I will try to implement this in the next version of the patch.
>>
>> One thing that comes to mind is what if a user wants to apply hash
>> operator class equality to list partitioned key by specifying a hash
>> operator class for the corresponding column. In that case, we would not
>> have the ordering procedure with an hash operator class, hence any
>> ordering based optimization becomes impossible to implement. The current
>> patch rejects a column for partition key if its type does not have a btree
>> operator class for both range and list methods, so this issue doesn't
>> exist, however it could be seen as a limitation.
>
> Yes, I think you should expect careful scrutiny of that issue. It
> seems clear to me that range partitioning requires a btree opclass,
> that hash partitioning requires a hash opclass, and that list
> partitioning requires at least one of those things. It would probably
> be reasonable to pick one or the other and insist that list
> partitioning always requires exactly that, but I can't see how it's
> reasonable to insist that you must have both types of opclass for any
> type of partitioning.
So because we intend to implement optimizations for list partition
metadata that presuppose existence of corresponding btree operator class,
we should just always require user to specify one (or error out if user
doesn't specify and a default one doesn't exist). That way, we explicitly
do not support specify hash equality operator for list partitioning.
>> So, we have 3 choices for the internal representation of list partitions:
>>
>> Choice 1 (the current approach): Load them in the same order as they are
>> found in the partition catalog:
>>
>> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
>> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>>
>> In this case, mismatch on the first list would make the two tables
>> incompatibly partitioned, whereas they really aren't incompatible.
>
> Such a limitation seems clearly unacceptable. We absolutely must be
> able to match up compatible partitioning schemes without getting
> confused by little details like the order of the partitions.
Agreed. Will change my patch to adopt the below method.
>> Choice 2: Representation with 2 arrays:
>>
>> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
>> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
>>
>> It still doesn't help the case of pairwise joins because it's hard to tell
>> which value belongs to which partition (the 2nd array carries the original
>> partition numbers). Although it might still work for tuple-routing.
>
> It's very good for tuple routing. It can also be used to match up
> partitions for pairwise joins. Compare the first arrays. If they are
> unequal, stop. Else, compare the second arrays, incrementally
> building a mapping between them and returning false if the mapping
> turns out to be non-bijective. For example, in this case, we look at
> index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3.
> We look at index 2 and decide 2 -> 1. We look at index 4 and find
> that we already have a mapping for 2, but it's compatible because we
> need 2 -> 1 and that's what is already there. Similarly for the
> remaining entries. This is really a pretty easy loop to write and it
> should run very quickly.
I see, it does make sense to try to implement this way.
>> Choice 3: Order all lists' elements for each list individually and then
>> order the lists themselves on their first values:
>>
>> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
>> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
>>
>> This representation makes pairing partitions for pairwise joining
>> convenient but for tuple-routing we still need to visit each partition in
>> the worst case.
>
> I think this is clearly not good enough for tuple routing. If the
> algorithm I proposed above turns out to be too slow for matching
> partitions, then we could keep both this representation and the
> previous one. We are not limited to just one. But I don't think
> that's likely to be the case.
I agree. Let's see how the option 2 turns out.
> Also, note that all of this presupposes we're doing range
> partitioning, or perhaps list partitioning with a btree opclass. For
> partitioning based on a hash opclass, you'd organize the data based on
> the hash values rather than range comparisons.
Yes, the current patch does not implement hash partitioning, although I
have to think about how to support the hash case when designing the
internal data structures.
By the way, I am planning to start a new thread with the latest set of
patches which I will post in a day or two. I have tried to implement all
the bug fixes and improvements that have been suggested on this thread so
far. Thanks to all those who reviewed and gave their comments. Please
check this page to get a link to the new thread:
https://commitfest.postgresql.org/10/611/
Thanks,
Amit
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2016/08/10 19:18, Ashutosh Bapat wrote:
> FOR VALUE clause of a partition does not allow a constant expression like
> (10000/5 -1). It gives syntax error
> regression=# create table pt1_p1 partition of pt1 for values start (0) end
> ((10000/5) - 1);
> ERROR: syntax error at or near "("
> LINE 1: ...pt1_p1 partition of pt1 for values start (0) end ((10000/5) ...
>
> Shouldn't we allow constant expressions here?
>
> If this has been already discussed, please forgive me and point out the
> relevant mail chain.

Yes, the only allowed values there are string literals, numeric values and
null. Decision to change to it followed from this comment:

"Also, I don't think allowing an a_expr as a bound is remotely sensible -
I think you should allow only Sconst, NumericOnly, NULL, and UNBOUNDED."

https://www.postgresql.org/message-id/CA%2BTgmoZ1ZMCyGR3b9yvGDq79xYLMnJQwhwn5GVs_GsvPiySDxw%40mail.gmail.com

Thanks,
Amit