Re: Declarative partitioning - Mailing list pgsql-hackers

From Amit Langote
Subject Re: Declarative partitioning
Date
Msg-id 566E6C48.9080500@lab.ntt.co.jp
Whole thread Raw
In response to Re: Declarative partitioning  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Declarative partitioning
List pgsql-hackers
On 2015/11/24 2:23, Robert Haas wrote:
> To me, it seems like there is a pretty obvious approach here: each
> table can be either a plain table, or a partition root (which can look
> just like an empty inheritance parent).  Then multi-level partitioning
> falls right out of that design without needing to do anything extra.
> If you want a single level of partitioning, you partition the original
> table.  If you want two levels of partitioning, you partition the
> partitions.  If you want three levels of partitioning, you partition
> those.  It's being made out here that limiting ourselves to a single
> of partitioning makes things simpler, but it's not very clear to me
> that this is actually true.
>
> I think it is also worth getting the syntax right from the beginning.
> Even if we decide that patch #1 won't support multi-level
> partitioning, we should have a plan for the syntax that can be
> extended to multi-level partitioning.  If we decide after releasing
> partitioning with one syntax that we really wish we'd used some other
> syntax, that is going to be a really big problem - deprecating the use
> of => or introducing standard_conforming_strings were projects that
> took many years to complete.  We really only get one shot to get that
> right.  That doesn't mean it's all got to be there in version one, but
> there had better be a way to extend it to all the things we want to do
> later or we are going to be deeply, deeply sad.

Previously proposed design would support multi-level partitioning with
some adjustments. One of the reasons to not include it in the recent patch
was a lack of clarity about proper syntax and catalog organization. That
wasn't really nice. So, here is a revised proposal for the syntax and some
general notes. I will adjust my previous patch to follow along any
feedback I receive here.


I think we can provide explicit SUBPARTITION keyword-based syntax for
multi-level partitioning. It's not unreasonable to expect that all
partitions at a given level are partitioned on the same key. IOW, why
require to declare partition key for each partition separately? Instead,
using SUBPARTITION BY on the master table seems better.

Syntax to create a partitioned table (up to 2 levels of partitioning):

CREATE TABLE foo (
...
)
PARTITION BY R/L ON (key0)
SUBPARTITION BY R/L ON (key1)
[(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
[(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>],
...)], ...)];

The above creates two pg_partitioned_rel entries for foo with partlevel 0
and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this
creates pg_partition entries, with foo and foo_1 as partparent,
respectively.

Why just 2 levels? - it seems commonplace and makes the syntax more
intuitive? I guess it might be possible to generalize the syntax for
multi-level partitioning. Ideas? If we want to support the notion of
sub-partition template in future, that would require some thought, more
importantly proper catalog organization for the same.

To add a partition to table after-the-fact:

ALTER TABLE foo
CREATE PARTITION foo1 FOR VALUES <val> [<storage_params>] [<tblspc>]
[(SUBPARTITION foo11 FOR VALUES <val> [<storage_params>] [<tblspc>], ...)];

To add a sub-partition to an existing partition:

ALTER TABLE foo
MODIFY PARTITION foo_1
CREATE SUBPARTITION foo_1_ FOR VALUES (val) [<storage_params>] [<tblspc>];

I considered ADD/ALTER instead of CREATE/MODIFY, but there exist grammar
conflicts with ADD/ALTER column.

What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied
directly to partitions on case-by-case basis (they are tables under the
hood after all), we should restrict AT to the master table. Most of the AT
changes implicitly propagate from the master table to its partitions. Some
of them could be directly applied to partitions and/or sub-partitions such
as rename, storage manipulations like - changing tablespace, storage
parameters (reloptions), etc.:

ALTER TABLE foo
RENAME PARTITION <partition-name> TO <new-name>;

ALTER TABLE foo
RENAME SUBPARTITION <sub-partition-name> TO <new-name>;

ALTER TABLE foo
SET TABLESPACE ... [DEFAULT] FOR PARTITION <partition-name>;

ALTER TABLE foo
SET TABLESPACE ... FOR SUBPARTITION <sub-partition-name>;

ALTER TABLE foo
SET (storage_parameter = value) [DEFAULT] FOR PARTITION <partition-name>;

ALTER TABLE foo
SET (storage_parameter = value) FOR SUBPARTITION <sub-partition-name>;

Note that the keyword DEFAULT in some cases above means, do not apply the
new setting to existing sub-partitions of the partition, rather use the
new setting for future sub-partitions. In case of SET TABLESPACE, if FOR
PARTITION clause is not specified, all partitions (actually "leaf"
partitions) of the master table are moved to the new tablespace; is that
necessary or should we just disallow that and instead output an error
asking to use [DEFAULT] FOR PARTITION/FOR SUBPARTITION to move only
specific partitions/sub-partitions?

By the way, should we also allow changing the logging of
partitions/sub-partitions as follows?

ALTER TABLE foo MODIFY PARTITION <partition-name> SET {LOGGED | UNLOGGED};

ALTER TABLE foo MODIFY SUBPARTITION <sub-partition-name> SET {LOGGED | UNLOGGED};

What about index constraints, ie, PRIMARY KEY, UNIQUE and EXCLUSION
constraints - 2 things must be clear here: cannot create these constraints
on individual partitions and all partition columns (key0 + key1) must be
the leading columns of the key. On a related note, creating index on the
master table should create the index on all "leaf" partitions. The index
on the mater table itself would be just a logical index. Should we allow
creating or dropping indexes on partitions directly?

It would be interesting to talk about features like SPLIT, MERGE, EXCHANGE
but I'm inclined to leave them as future enhancements. For a functionality
similar to EXCHANGE, there are commands like ATTACH/DETACH in the latest
patch. We could extend them to also consider sub-partitions:

ALTER TABLE foo
ATTACH PARTITION <partition-name> FOR VALUES (val)
USING [ TABLE ] <table-name> [SET VALID];

ALTER TABLE foo
MODIFY PARTITION foo_1 ATTACH SUBPARTITION <sub-partition-name> FOR VALUES
(val)
USING [ TABLE ] <table-name> [SET VALID];

ALTER TABLE foo
DETACH PARTITION foo_1
USING [ TABLE ] <table-name>;

ALTER TABLE foo
MODIFY PARTITION foo_1 DETACH SUBPARTITION foo_1_1
USING [ TABLE ] <table-name>;

Note that all commands described above refer to sub-partitions directly
using names. These names refer to pg_class.relname of corresponding
relations, so any partition/sub-partition names have to be unique across
the database in a given schema. Maybe, if we stored partition name in
pg_partition, system could generate suitable pg_class name itself and
relieve that restriction. It would then be possible to refer to
sub-partitions using its pg_class name or given name qualified by the
partition name.

Let me also outline some significant internal implementation details:

* Introduce a new relkind - RELKIND_PARTITIONED_REL. Master/root tables in
partitioning hierarchies are of this relkind. Following are some key
properties of such relations -

They have a pg_partitioned_rel catalog entry containing information about
the partition key (and sub-partition key, if any). RelationBuildDesc()
initializes the partition key fields in RelationData. It consists of
linked-list of structs corresponding to keys at successive levels.

If happen to be target in I/U/D (/COPY FROM), corresponding
ResultRelInfo's partition key and partition descriptor fields are
initialized. The latter consists of information about its immediate
(level-1) partitions.

In case of U/D, rewrite phase adds tableoid junk attribute to the target
list in addition to a traditional one (ctid) for ExecUpdate and
ExecDelete.

* Next up is the notion of relations that are partitions at any level
(including non-leaf levels). Following are some key properties of such
relations -

They each have a pg_partition entry.

They cannot be targets in I/U and COPY FROM.

One cannot define rules, triggers, and RLS policies on them. Although,
AR triggers defined on a partitioned master table are propagated to the
"leaf" partitions.

Note that RELKIND_PARTITIONED_REL relations and non-leaf partitions do not
have any storage.

* How to handle multi-level partitioning -

To map tuples to a "leaf" partition, CopyFrom(), ExecInsert() (also,
ExecUpdate()) invoke a function called get_partition_for_tuple(). The
function encapsulates details of partitioning such as partitioning method
(R/L) and multi-level recursion.

A similar recursive approach could be used within planner, too. Single
invocation of function get_rel_partitions() from, say,
set_partitioned_rel_size() would return the list of "leaf" partitions of
rel excluding pruned partitions (and sub-partitions thereof). That is,
quals are matched against partition keys at all levels in the same
invocation.

* Revised catalog structures:

pg_partitioned_rel:

partrelid       Oid
partmethod      char
partlevel       int16
partnatts       int16
partkey         int2vector
partclass       oidvector
partexprs       pg_node_tree

pg_partition:

partrelid       Oid
partparent      Oid
partname        NameData
partlistvals    anyarray
partrangemaxs   anyarray
partvalid       bool
partsubtemplate bool
partreloptions  text
parttablespace  Oid

Fields of pg_partition following partrangemaxs are yet unused.

Thoughts?

Thanks,
Amit





pgsql-hackers by date:

Previous
From: David Fetter
Date:
Subject: Re: [sqlsmith] Failed to generate plan on lateral subqueries
Next
From: Chapman Flack
Date:
Subject: Re: Proposal: custom compression methods