Re: Declarative partitioning - Mailing list pgsql-hackers
From | Amit Langote |
---|---|
Subject | Re: Declarative partitioning |
Date | |
Msg-id | 566E6C48.9080500@lab.ntt.co.jp Whole thread Raw |
In response to | Re: Declarative partitioning (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Declarative partitioning
|
List | pgsql-hackers |
On 2015/11/24 2:23, Robert Haas wrote: > To me, it seems like there is a pretty obvious approach here: each > table can be either a plain table, or a partition root (which can look > just like an empty inheritance parent). Then multi-level partitioning > falls right out of that design without needing to do anything extra. > If you want a single level of partitioning, you partition the original > table. If you want two levels of partitioning, you partition the > partitions. If you want three levels of partitioning, you partition > those. It's being made out here that limiting ourselves to a single > of partitioning makes things simpler, but it's not very clear to me > that this is actually true. > > I think it is also worth getting the syntax right from the beginning. > Even if we decide that patch #1 won't support multi-level > partitioning, we should have a plan for the syntax that can be > extended to multi-level partitioning. If we decide after releasing > partitioning with one syntax that we really wish we'd used some other > syntax, that is going to be a really big problem - deprecating the use > of => or introducing standard_conforming_strings were projects that > took many years to complete. We really only get one shot to get that > right. That doesn't mean it's all got to be there in version one, but > there had better be a way to extend it to all the things we want to do > later or we are going to be deeply, deeply sad. Previously proposed design would support multi-level partitioning with some adjustments. One of the reasons to not include it in the recent patch was a lack of clarity about proper syntax and catalog organization. That wasn't really nice. So, here is a revised proposal for the syntax and some general notes. I will adjust my previous patch to follow along any feedback I receive here. I think we can provide explicit SUBPARTITION keyword-based syntax for multi-level partitioning. It's not unreasonable to expect that all partitions at a given level are partitioned on the same key. IOW, why require to declare partition key for each partition separately? Instead, using SUBPARTITION BY on the master table seems better. Syntax to create a partitioned table (up to 2 levels of partitioning): CREATE TABLE foo ( ... ) PARTITION BY R/L ON (key0) SUBPARTITION BY R/L ON (key1) [(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>] [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>], ...)], ...)]; The above creates two pg_partitioned_rel entries for foo with partlevel 0 and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this creates pg_partition entries, with foo and foo_1 as partparent, respectively. Why just 2 levels? - it seems commonplace and makes the syntax more intuitive? I guess it might be possible to generalize the syntax for multi-level partitioning. Ideas? If we want to support the notion of sub-partition template in future, that would require some thought, more importantly proper catalog organization for the same. To add a partition to table after-the-fact: ALTER TABLE foo CREATE PARTITION foo1 FOR VALUES <val> [<storage_params>] [<tblspc>] [(SUBPARTITION foo11 FOR VALUES <val> [<storage_params>] [<tblspc>], ...)]; To add a sub-partition to an existing partition: ALTER TABLE foo MODIFY PARTITION foo_1 CREATE SUBPARTITION foo_1_ FOR VALUES (val) [<storage_params>] [<tblspc>]; I considered ADD/ALTER instead of CREATE/MODIFY, but there exist grammar conflicts with ADD/ALTER column. What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied directly to partitions on case-by-case basis (they are tables under the hood after all), we should restrict AT to the master table. Most of the AT changes implicitly propagate from the master table to its partitions. Some of them could be directly applied to partitions and/or sub-partitions such as rename, storage manipulations like - changing tablespace, storage parameters (reloptions), etc.: ALTER TABLE foo RENAME PARTITION <partition-name> TO <new-name>; ALTER TABLE foo RENAME SUBPARTITION <sub-partition-name> TO <new-name>; ALTER TABLE foo SET TABLESPACE ... [DEFAULT] FOR PARTITION <partition-name>; ALTER TABLE foo SET TABLESPACE ... FOR SUBPARTITION <sub-partition-name>; ALTER TABLE foo SET (storage_parameter = value) [DEFAULT] FOR PARTITION <partition-name>; ALTER TABLE foo SET (storage_parameter = value) FOR SUBPARTITION <sub-partition-name>; Note that the keyword DEFAULT in some cases above means, do not apply the new setting to existing sub-partitions of the partition, rather use the new setting for future sub-partitions. In case of SET TABLESPACE, if FOR PARTITION clause is not specified, all partitions (actually "leaf" partitions) of the master table are moved to the new tablespace; is that necessary or should we just disallow that and instead output an error asking to use [DEFAULT] FOR PARTITION/FOR SUBPARTITION to move only specific partitions/sub-partitions? By the way, should we also allow changing the logging of partitions/sub-partitions as follows? ALTER TABLE foo MODIFY PARTITION <partition-name> SET {LOGGED | UNLOGGED}; ALTER TABLE foo MODIFY SUBPARTITION <sub-partition-name> SET {LOGGED | UNLOGGED}; What about index constraints, ie, PRIMARY KEY, UNIQUE and EXCLUSION constraints - 2 things must be clear here: cannot create these constraints on individual partitions and all partition columns (key0 + key1) must be the leading columns of the key. On a related note, creating index on the master table should create the index on all "leaf" partitions. The index on the mater table itself would be just a logical index. Should we allow creating or dropping indexes on partitions directly? It would be interesting to talk about features like SPLIT, MERGE, EXCHANGE but I'm inclined to leave them as future enhancements. For a functionality similar to EXCHANGE, there are commands like ATTACH/DETACH in the latest patch. We could extend them to also consider sub-partitions: ALTER TABLE foo ATTACH PARTITION <partition-name> FOR VALUES (val) USING [ TABLE ] <table-name> [SET VALID]; ALTER TABLE foo MODIFY PARTITION foo_1 ATTACH SUBPARTITION <sub-partition-name> FOR VALUES (val) USING [ TABLE ] <table-name> [SET VALID]; ALTER TABLE foo DETACH PARTITION foo_1 USING [ TABLE ] <table-name>; ALTER TABLE foo MODIFY PARTITION foo_1 DETACH SUBPARTITION foo_1_1 USING [ TABLE ] <table-name>; Note that all commands described above refer to sub-partitions directly using names. These names refer to pg_class.relname of corresponding relations, so any partition/sub-partition names have to be unique across the database in a given schema. Maybe, if we stored partition name in pg_partition, system could generate suitable pg_class name itself and relieve that restriction. It would then be possible to refer to sub-partitions using its pg_class name or given name qualified by the partition name. Let me also outline some significant internal implementation details: * Introduce a new relkind - RELKIND_PARTITIONED_REL. Master/root tables in partitioning hierarchies are of this relkind. Following are some key properties of such relations - They have a pg_partitioned_rel catalog entry containing information about the partition key (and sub-partition key, if any). RelationBuildDesc() initializes the partition key fields in RelationData. It consists of linked-list of structs corresponding to keys at successive levels. If happen to be target in I/U/D (/COPY FROM), corresponding ResultRelInfo's partition key and partition descriptor fields are initialized. The latter consists of information about its immediate (level-1) partitions. In case of U/D, rewrite phase adds tableoid junk attribute to the target list in addition to a traditional one (ctid) for ExecUpdate and ExecDelete. * Next up is the notion of relations that are partitions at any level (including non-leaf levels). Following are some key properties of such relations - They each have a pg_partition entry. They cannot be targets in I/U and COPY FROM. One cannot define rules, triggers, and RLS policies on them. Although, AR triggers defined on a partitioned master table are propagated to the "leaf" partitions. Note that RELKIND_PARTITIONED_REL relations and non-leaf partitions do not have any storage. * How to handle multi-level partitioning - To map tuples to a "leaf" partition, CopyFrom(), ExecInsert() (also, ExecUpdate()) invoke a function called get_partition_for_tuple(). The function encapsulates details of partitioning such as partitioning method (R/L) and multi-level recursion. A similar recursive approach could be used within planner, too. Single invocation of function get_rel_partitions() from, say, set_partitioned_rel_size() would return the list of "leaf" partitions of rel excluding pruned partitions (and sub-partitions thereof). That is, quals are matched against partition keys at all levels in the same invocation. * Revised catalog structures: pg_partitioned_rel: partrelid Oid partmethod char partlevel int16 partnatts int16 partkey int2vector partclass oidvector partexprs pg_node_tree pg_partition: partrelid Oid partparent Oid partname NameData partlistvals anyarray partrangemaxs anyarray partvalid bool partsubtemplate bool partreloptions text parttablespace Oid Fields of pg_partition following partrangemaxs are yet unused. Thoughts? Thanks, Amit
pgsql-hackers by date: