Declarative partitioning - another take

From
Amit Langote
Date:
Hi,

Attached is the latest set of patches to implement declarative
partitioning.  There is already a commitfest entry for the same:
https://commitfest.postgresql.org/10/611/

The old discussion is here:
http://www.postgresql.org/message-id/55D3093C.5010800@lab.ntt.co.jp/

Attached patches are described below:

0001-Catalog-and-DDL-for-partition-key.patch
0002-psql-and-pg_dump-support-for-partitioned-tables.patch

These patches create the infrastructure and DDL for partitioned
tables.

In addition to a catalog for storing the partition key information, this
adds a new relkind to pg_class.h.  A PARTITION BY clause is added to
CREATE TABLE.  Tables so created are RELKIND_PARTITIONED_REL relations,
which are special in a number of ways, especially with regard to their
interactions with regular table inheritance features.

PARTITION BY RANGE ({ column_name | ( expression ) } [ opclass ] [, ...])
PARTITION BY LIST ({ column_name | ( expression ) } [ opclass ])
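For illustration, the proposed syntax would allow definitions like the
following (table and column names are hypothetical, and this of course
requires a server built with the patches applied):

```sql
-- Range partitioning on a plain column (proposed syntax):
CREATE TABLE measurement (
    city_id int NOT NULL,
    logdate date NOT NULL
) PARTITION BY RANGE (logdate);

-- An expression in the key requires its own set of parentheses:
CREATE TABLE matrix (
    a int,
    b int
) PARTITION BY RANGE ((a + b));
```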


0003-Catalog-and-DDL-for-partition-bounds.patch
0004-psql-and-pg_dump-support-for-partitions.patch

These patches create the infrastructure and DDL for partitions.

Parent-child relationships of a partitioned table and its partitions are
managed behind-the-scenes with inheritance.  That means there is a
pg_inherits entry, and attributes, constraints, etc. are marked with
inheritance-related information appropriately.  However, this case differs
from a regular inheritance relationship in a number of ways.  While
regular inheritance imposes certain restrictions on what elements a
child's schema is allowed to contain (both at creation time and
after-the-fact), the partitioning code imposes further restrictions.
For example, while regular inheritance allows a child to contain its own
columns, the partitioning code disallows that.  Options such as the NO
INHERIT marking on check constraints and ONLY are ignored by the
partitioning code.

Partition DDL includes both a way to create a new partition and a way to
"attach" an existing table as a partition of a parent partitioned table.
An attempt to drop a partition using DROP TABLE causes an error; instead,
a partition must first be "detached" from the parent partitioned table.
On the other hand, dropping the parent drops all the partitions if
CASCADE is specified.

CREATE TABLE partition_name
    PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
[ PARTITION BY { RANGE | LIST } ( { column_name | ( expression ) } [ opclass
] [, ...] ) ]

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name
  PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
  SERVER server_name
[ OPTIONS ( option 'value' [, ... ] ) ]

ALTER TABLE parent ATTACH PARTITION partition_name partition_bound_spec [
VALIDATE | NO VALIDATE ]

ALTER TABLE parent DETACH PARTITION partition_name

partition_bound_spec is:

FOR VALUES { list_spec | range_spec }

list_spec in FOR VALUES is:

IN ( expression [, ...] )

range_spec in FOR VALUES is:

START lower-bound [ INCLUSIVE | EXCLUSIVE ] END upper-bound [ INCLUSIVE |
EXCLUSIVE ]

where lower-bound and upper-bound are:

{ ( expression [, ...] ) | UNBOUNDED }

expression can be a string literal, a numeric literal or NULL.

Note that one can specify PARTITION BY when creating a partition itself;
that allows creating multi-level partitioned tables.
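Putting the pieces above together, the DDL could look like this (a sketch
in the proposed syntax; the table names, bounds, and the pre-existing
parent table measurement are all hypothetical):

```sql
-- Create a partition that is itself partitioned (multi-level):
CREATE TABLE measurement_y2016
    PARTITION OF measurement
    FOR VALUES START ('2016-01-01') INCLUSIVE END ('2017-01-01') EXCLUSIVE
    PARTITION BY RANGE (logdate);

-- Attach an existing table as a partition, then detach it again:
ALTER TABLE measurement ATTACH PARTITION measurement_y2015
    FOR VALUES START ('2015-01-01') END ('2016-01-01') NO VALIDATE;
ALTER TABLE measurement DETACH PARTITION measurement_y2015;
```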


0005-Teach-a-few-places-to-use-partition-check-constraint.patch

A partition's bound implicitly constrains the values that are allowed in
the partition key of its rows.  The same constraint can be enforced when
inserting data *directly* into a partition, to make sure that only the
correct data is allowed in (if a tuple has been routed from the parent,
the check becomes unnecessary).  To that end, ExecConstraints() now
includes the above implicit check constraint in the list of constraints
it enforces.

Further, to enable constraint based partition exclusion on partitioned
tables, the planner code includes in its list of constraints the above
implicitly defined constraints.  This arrangement is temporary, however;
it will become unnecessary once future versions of this patch implement
special data structures and algorithms within the planner to use
partition metadata more effectively for partition exclusion.

Note that the "constraints" referred to above are not some on-disk
structures but those generated internally on-the-fly when requested by a
caller.
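As a rough illustration (hypothetical table and bounds; nothing like the
constraint below is actually stored, the patch builds the equivalent
expression tree on demand):

```sql
-- Hypothetical partition in the proposed syntax:
CREATE TABLE measurement_y2017
    PARTITION OF measurement
    FOR VALUES START ('2017-01-01') END ('2018-01-01');

-- Its bound behaves as if this constraint existed, enforced on direct
-- inserts by ExecConstraints() and visible to the planner for
-- constraint exclusion:
--   CHECK (logdate >= '2017-01-01' AND logdate < '2018-01-01')
```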

0006-Introduce-a-PartitionTreeNode-data-structure.patch
0007-Tuple-routing-for-partitioned-tables.patch

These patches enable routing of tuples inserted into a partitioned table
to one of its leaf partitions.  It applies to both COPY FROM and INSERT.
The first of these patches introduces a data structure that provides a
convenient means for the tuple routing code to step down a partition tree
one level at a time.  The second modifies copy.c and the executor to
implement actual tuple routing.  When inserting into a partition, its row
constraints and triggers are applied.  Note that the partition's
constraints also include the constraints defined on the parent.  This
arrangement means, however, that the parent's triggers are not currently
applied.

Updates are handled like they are now for inheritance sets; however, if an
update would cause a row to change partitions, an error is thrown.
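The level-at-a-time descent described above can be sketched as follows.
This is purely illustrative Python, not the patch's C code; the names and
the shape of the tree node are assumptions, loosely modeled on the
PartitionTreeNode idea:

```python
# Illustrative sketch of level-at-a-time tuple routing.  All names here
# are hypothetical; the patch implements this in C.
class PartitionNode:
    def __init__(self, name, key=None, children=None):
        self.name = name
        self.key = key                    # extracts this level's key value from a row
        self.children = children or []    # list of (matches, child) pairs

def route(node, row):
    """Step down the partition tree one level at a time until a leaf."""
    while node.children:
        value = node.key(row)
        for matches, child in node.children:
            if matches(value):
                node = child
                break
        else:
            raise ValueError("no partition found for row: %r" % (row,))
    return node.name

# Two-level tree: range on year at the root, list on city_id below it.
leaf_city1 = PartitionNode("y2016_city1")
leaf_rest = PartitionNode("y2016_rest")
y2016 = PartitionNode("y2016", key=lambda r: r["city_id"],
                      children=[(lambda v: v == 1, leaf_city1),
                                (lambda v: True, leaf_rest)])
root = PartitionNode("measurement", key=lambda r: r["year"],
                     children=[(lambda v: v == 2016, y2016)])
```

The "no partition found" branch corresponds to the error a direct insert
with an unroutable key would raise.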

0008-Update-DDL-Partitioning-chapter.patch

This patch updates the partitioning section of the DDL chapter to reflect
the new methods available for creating and managing partitioned tables and
their partitions, especially considering that it is no longer necessary to
manually define CHECK constraints and triggers/rules for constraint
exclusion and tuple routing, respectively.

TODO (in short term):
* Add more regression tests and docs
* Add PartitionOptInfo and use it to perform partition pruning more
effectively (the added infrastructure should also help the pairwise joins
patch proposed by Ashutosh Bapat [1])
* Fix internal representation of list partition bounds to be more efficient


Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj%3DEaDTSA%40mail.gmail.com


Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached is the latest set of patches to implement declarative
> partitioning.

Cool.  I would encourage you to give some thought to what is the least
committable subset of these patches, and think about whether it can be
reorganized to make that smaller.  Based on the descriptions, it
sounds to me like the least committable subset is probably all of 0001
- 0005 as one giant commit, and that's a lot of code.  Maybe there's
no real way to come up with something more compact than that, but it's
worth some thought.

> 0001-Catalog-and-DDL-for-partition-key.patch

+      <entry><link
linkend="catalog-pg-partitioned"><structname>pg_partitioned</structname></link></entry>

pg_partitioned seems like a slightly strange choice of name.  We have
no other catalog tables whose names end in "ed", so the use of a past
participle here is novel.  More generally, this patch seems like it's
suffering a bit of terminological confusion: while the catalog table
is called pg_partitioned, the relkind is RELKIND_PARTITION_REL.  And
get_relation_by_qualified_name() thinks that RELKIND_PARTITION_REL is
a "table", but getRelationDescription thinks it's a "partitioned
table".  I think we need to get all these bits and pieces on the same
page, or have some kind of consistent way of deciding what to do in
each case.  Maybe pg_partitioned_table, RELKIND_PARTITIONED_TABLE,
etc. for the internal stuff, but just "table" in user-facing messages.

Alternatively, I wonder if we should switch to calling this a
"partition root" rather than a "partitioned table".  It's not the
whole table, just the root.  And doesn't contain data - in fact it
probably shouldn't even have storage - so calling it a "table" might
be confusing.  But if we're using CREATE TABLE to create it in the
first place, then calling it something other than a table later on is
also confusing.  Hmm.

+      <entry><structfield>partexprs</structfield></entry>

There's a certain symmetry between this and what we do for indexes,
but I'm wondering whether there's a use case for partitioning a table
by an expression rather than a column value.  I suppose if you've
already done the work, there's no harm in supporting it.

+[ PARTITION BY {RANGE | LIST} ( { <replaceable
class="parameter">column_name</replaceable> | ( <replaceable
class="parameter">expression</replaceable> ) } [ <replaceable
class="parameter">opclass</replaceable> ] [, ...] )

The spacing isn't right here.  Should say { RANGE | LIST } rather than
{RANGE|LIST}.  Similarly elsewhere.

+      thus created is called <firstterm>partitioned</firstterm> table.  Key
+      consists of an ordered list of column names and/or expressions when
+      using the <literal>RANGE</> method, whereas only a single column or
+      expression can be specified when using the <literal>LIST</> method.

Why do we support range partitioning with multiple columns, but list
partitioning only with a single column?

+      The type of a key column or an expresion must have an associated

Typo.

+     <para>
+      Currently, there are following limitations on definition of partitioned
+      tables: one cannot specify any UNIQUE, PRIMARY KEY, EXCLUDE and/or
+      FOREIGN KEY constraints.
+     </para>

But I presume you can define these constraints on the partitions;
that's probably worth mentioning.  I'd say something like this
"Partitioned tables do not support UNIQUE, PRIMARY, EXCLUDE, or
FOREIGN KEY constraints; however, you can define these constraints on
individual data partitions."

+    if (partexprs)
+        recordDependencyOnSingleRelExpr(&myself,
+                                        (Node *) partexprs,
+                                        RelationGetRelid(rel),
+                                        DEPENDENCY_NORMAL,
+                                        DEPENDENCY_IGNORE);

I don't think introducing a new DEPENDENCY_IGNORE type is a good idea
here.  Instead, you could just add an additional Boolean argument to
recordDependencyOnSingleRelExpr.  That seems less likely to create
bugs in unrelated portions of the code.

+    /*
+     * Override specified inheritance option, if relation is a partitioned
+     * table and it's a target of INSERT.
+     */
+    if (alsoSource)
+        rte->inh |= relid_is_partitioned(rte->relid);

This seems extremely unlikely to be the right way to do this.  We're
just doing parse analysis here, so depending on properties of the
table that are (or might be made to be) changeable is not good.  It's
also an extra catalog lookup whether the table is partitioned or not
(and whether rte->inh is already true or not). Furthermore, it seems
to go against the comment explaining what rte->inh is supposed to
mean, which says:
 *        inh is TRUE for relation references that should be expanded to include
 *        inheritance children, if the rel has any.  This *must* be FALSE for
 *        RTEs other than RTE_RELATION entries.

I am not sure what problem you're trying to solve here, but I suspect
this needs to be ripped out altogether or handled later, during
planning.  Similarly for the related change in addRangeTableEntry.

+    {PartitionedRelationId,        /* PARTEDRELID */
+        PartitionedRelidIndexId,
+        1,
+        {
+            Anum_pg_partitioned_partedrelid,
+            0,
+            0,
+            0
+        },
+        128

I'd probably cut the initial size of this down a bit.  Odds are good
that not all of the tables you access will be partitioned.  Of course
that assumes we'll avoid probing this syscache for non-partitioned
tables, but I think that's pretty important.

+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+                     errmsg("cannot alter column named in partition key")));
+        else
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+                     errmsg("cannot alter column referenced in
partition key expression")));

Maybe "cannot alter type of column named in partition key"?  And
similarly for the other one.

+/*
+ * transformPartitionBy
+ *         Transform any expressions present in the partition key
+ */
+static PartitionBy *
+transformPartitionBy(Relation rel, PartitionBy *partitionby)

Shouldn't this be in src/backend/parser/parse_*.c instead of tablecmds.c?

Most of this (0001) looks pretty reasonable.  I'm sure that there is
more that needs fixing than what I've mentioned above, which is just
what I saw on an initial read-through, but overall I like the
approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
Thanks a lot for taking a look at this.

On 2016/08/11 3:22, Robert Haas wrote:
> On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is the latest set of patches to implement declarative
>> partitioning.
> 
> Cool.  I would encourage you to give some thought to what is the least
> committable subset of these patches, and think about whether it can be
> reorganized to make that smaller.  Based on the descriptions, it
> sounds to me like the least committable subset is probably all of 0001
> - 0005 as one giant commit, and that's a lot of code.  Maybe there's
> no real way to come up with something more compact than that, but it's
> worth some thought.

I will consider this and try to come up with a minimal patch set covering
what is now 0001 - 0005.

>> 0001-Catalog-and-DDL-for-partition-key.patch
> 
> +      <entry><link
> linkend="catalog-pg-partitioned"><structname>pg_partitioned</structname></link></entry>
> 
> pg_partitioned seems like a slightly strange choice of name.  We have
> no other catalog tables whose names end in "ed", so the use of a past
> participle here is novel.  More generally, this patch seems like it's
> suffering a bit of terminological confusion: while the catalog table
> is called pg_partitioned, the relkind is RELKIND_PARTITION_REL.  And
> get_relation_by_qualified_name() thinks that RELKIND_PARTITION_REL is
> a "table", but getRelationDescription thinks it's a "partitioned
> table".  I think we need to get all these bits and pieces on the same
> page, or have some kind of consistent way of deciding what to do in
> each case.  Maybe pg_partitioned_table, RELKIND_PARTITIONED_TABLE,
> etc. for the internal stuff, but just "table" in user-facing messages.

The name pg_partitioned does sound a little strange now that you mention it.

I agree we should make the terminology more consistent; to that end, your
suggestion to keep the user-facing term "table" while using
pg_partitioned_table and RELKIND_PARTITIONED_TABLE for the internals
sounds good.

> Alternatively, I wonder if we should switch to calling this a
> "partition root" rather than a "partitioned table".  It's not the
> whole table, just the root.  And doesn't contain data - in fact it
> probably shouldn't even have storage - so calling it a "table" might
> be confusing.  But if we're using CREATE TABLE to create it in the
> first place, then calling it something other than a table later on is
> also confusing.  Hmm.

I think it makes sense to keep calling it a table because it has all the
logical properties of a table even though it will differ from a regular
table on the basis of physical implementation details such as that it does
not own physical storage.  Am I missing something?

> 
> +      <entry><structfield>partexprs</structfield></entry>
> 
> There's a certain symmetry between this and what we do for indexes,
> but I'm wondering whether there's a use case for partitioning a table
> by an expression rather than a column value.  I suppose if you've
> already done the work, there's no harm in supporting it.

Yeah, it's not a whole lot of code to manage expressions alongside simple
column references.

> +[ PARTITION BY {RANGE | LIST} ( { <replaceable
> class="parameter">column_name</replaceable> | ( <replaceable
> class="parameter">expression</replaceable> ) } [ <replaceable
> class="parameter">opclass</replaceable> ] [, ...] )
> 
> The spacing isn't right here.  Should say { RANGE | LIST } rather than
> {RANGE|LIST}.  Similarly elsewhere.

Will fix.

> +      thus created is called <firstterm>partitioned</firstterm> table.  Key
> +      consists of an ordered list of column names and/or expressions when
> +      using the <literal>RANGE</> method, whereas only a single column or
> +      expression can be specified when using the <literal>LIST</> method.
> 
> Why do we support range partitioning with multiple columns, but list
> partitioning only with a single column?

This is mostly because I didn't find any other database supporting it,
and hence thought there may not be much use for it.

On the implementation side, it helps to think in terms of an ordering of
tuples (for a multi-column range key): we compare a new tuple's partition
key columns one by one until we find a column whose value differs from the
value of the corresponding column in the range upper bound of a partition
(usually, but not necessarily, the last column of the key).  Once we have
determined which side of the bound the value falls on, we repeat the same
process with either the lower bound of the same partition or the upper
bound of some partition in the other half, as determined by binary search.
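That column-by-column process amounts to a lexicographic comparison of key
tuples, which a binary search over partition bounds can then use.  A
minimal sketch (illustrative Python, not the patch's C code; function
names are made up, and the sketch assumes exclusive upper bounds sorted in
ascending order):

```python
# Illustrative sketch of multi-column range key comparison and the binary
# search it enables (hypothetical names, not the patch's C code).
def compare_key(key, bound):
    """Return -1/0/1 comparing key to bound, column by column."""
    for k, b in zip(key, bound):
        if k != b:
            return -1 if k < b else 1
    return 0

def find_partition(key, upper_bounds):
    """Binary-search the sorted, exclusive upper bounds for the first one
    strictly greater than the key; its index is the containing partition."""
    lo, hi = 0, len(upper_bounds)
    while lo < hi:
        mid = (lo + hi) // 2
        if compare_key(key, upper_bounds[mid]) < 0:
            hi = mid
        else:
            lo = mid + 1
    return lo  # may be len(upper_bounds) if the key is past the last bound
```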

Conversely, when determining the constraint on rows a range partition
contains, we emit a (keycol = lowerval) expression for each column i =
0..j-1 where range.lowerval[i] = range.upperval[i], and then emit (keycol
>/>= range.lowerval[j] AND keycol </<= range.upperval[j]) for the first
column j where lowerval[j] < upperval[j], stopping at that column (any
further columns are irrelevant).
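The emission rule just described can be sketched as follows (illustrative
Python; the patch builds actual expression nodes, the choice of >/>= and
</<= depends on INCLUSIVE/EXCLUSIVE, and this sketch simply assumes an
inclusive lower and exclusive upper bound):

```python
# Illustrative sketch of deriving a range partition's implicit constraint
# from its bounds, per the rule above (not the patch's C code).  Assumes
# an inclusive lower bound and an exclusive upper bound.
def range_constraint(cols, lower, upper):
    exprs = []
    for col, lo, hi in zip(cols, lower, upper):
        if lo == hi:
            # Leading columns whose bounds coincide pin the value exactly.
            exprs.append("%s = %s" % (col, lo))
        else:
            # First column where the bounds differ; any further columns
            # are irrelevant, so stop here.
            exprs.append("%s >= %s AND %s < %s" % (col, lo, col, hi))
            break
    return " AND ".join(exprs)
```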

When considering the same for list partitioning, where we consider set
membership of values (based on equality of a new tuple's value for the
column and a partition's list of values), implementing multi-column logic
does not seem as straightforward.  Also, it might make the optimizations
we recently discussed slightly more complicated to implement, although I
may be missing something.

> +      The type of a key column or an expresion must have an associated
> 
> Typo.

Will fix.

> 
> +     <para>
> +      Currently, there are following limitations on definition of partitioned
> +      tables: one cannot specify any UNIQUE, PRIMARY KEY, EXCLUDE and/or
> +      FOREIGN KEY constraints.
> +     </para>
> 
> But I presume you can define these constraints on the partitions;
> that's probably worth mentioning.  I'd say something like this
> "Partitioned tables do not support UNIQUE, PRIMARY, EXCLUDE, or
> FOREIGN KEY constraints; however, you can define these constraints on
> individual data partitions."

Okay, will mention that.

> +    if (partexprs)
> +        recordDependencyOnSingleRelExpr(&myself,
> +                                        (Node *) partexprs,
> +                                        RelationGetRelid(rel),
> +                                        DEPENDENCY_NORMAL,
> +                                        DEPENDENCY_IGNORE);
> 
> I don't think introducing a new DEPENDENCY_IGNORE type is a good idea
> here.  Instead, you could just add an additional Boolean argument to
> recordDependencyOnSingleRelExpr.  That seems less likely to create
> bugs in unrelated portions of the code.

I did consider a Boolean argument instead of a new DependencyType value,
however it felt a bit strange to pass a valid value for the fifth argument
(self_behavior) and then ask using a separate parameter that it (a
self-dependency) is to be ignored.  By the way, no pg_depend entry is
created on such a call, so the effect of the new type's usage seems
localized to me. Thoughts?

> 
> +    /*
> +     * Override specified inheritance option, if relation is a partitioned
> +     * table and it's a target of INSERT.
> +     */
> +    if (alsoSource)
> +        rte->inh |= relid_is_partitioned(rte->relid);
> 
> This seems extremely unlikely to be the right way to do this.  We're
> just doing parse analysis here, so depending on properties of the
> table that are (or might be made to be) changeable is not good.  It's
> also an extra catalog lookup whether the table is partitioned or not
> (and whether rte->inh is already true or not). Furthermore, it seems
> to go against the comment explaining what rte->inh is supposed to
> mean, which says:
> 
>  *        inh is TRUE for relation references that should be expanded to include
>  *        inheritance children, if the rel has any.  This *must* be FALSE for
>  *        RTEs other than RTE_RELATION entries.
> 
> I am not sure what problem you're trying to solve here, but I suspect
> this needs to be ripped out altogether or handled later, during
> planning.  Similarly for the related change in addRangeTableEntry.

Okay, I will see how this could rather be done within the planner.  My
intention here is to keep the code churn low by not modifying a lot of
places in the code where rte->inh is checked to turn inheritance on or
off.  I agree though that it's a bit ugly and perhaps wrong.

> +    {PartitionedRelationId,        /* PARTEDRELID */
> +        PartitionedRelidIndexId,
> +        1,
> +        {
> +            Anum_pg_partitioned_partedrelid,
> +            0,
> +            0,
> +            0
> +        },
> +        128
> 
> I'd probably cut the initial size of this down a bit.  Odds are good
> that not all of the tables you access will be partitioned.  Of course
> that assumes we'll avoid probing this syscache for non-partitioned
> tables, but I think that's pretty important.

Okay, I will turn that down to 64 maybe.

> 
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> +                     errmsg("cannot alter column named in partition key")));
> +        else
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> +                     errmsg("cannot alter column referenced in
> partition key expression")));
> 
> Maybe "cannot alter type of column named in partition key"?  And
> similarly for the other one.

Will fix.

> 
> +/*
> + * transformPartitionBy
> + *         Transform any expressions present in the partition key
> + */
> +static PartitionBy *
> +transformPartitionBy(Relation rel, PartitionBy *partitionby)
> 
> Shouldn't this be in src/backend/parser/parse_*.c instead of tablecmds.c?

It can be performed only after the new relation is opened, which can
happen only after we have called heap_create_with_catalog() in
tablecmds.c: DefineRelation().  I think that's because we call
transformExpr() on partition expressions which expects to be able to see
the table's columns.  Other callers such as transformIndexStmt and
transformAlterTableStmt can do transformations within parse_utilcmds.c
because they can open the relation there.

> 
> Most of this (0001) looks pretty reasonable.  I'm sure that there is
> more that needs fixing than what I've mentioned above, which is just
> what I saw on an initial read-through, but overall I like the
> approach.
> 

Okay, thanks.  I will post the updated patches soon.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:




> I think it makes sense to keep calling it a table because it has all the
> logical properties of a table even though it will differ from a regular
> table on the basis of physical implementation details such as that it does
> not own physical storage.  Am I missing something?
>
>> +      <entry><structfield>partexprs</structfield></entry>
>>
>> There's a certain symmetry between this and what we do for indexes,
>> but I'm wondering whether there's a use case for partitioning a table
>> by an expression rather than a column value.  I suppose if you've
>> already done the work, there's no harm in supporting it.
>
> Yeah, it's not a whole lot of code to manage expressions alongside simple
> column references.

Users who would like to partition their tables by "age" will partition those by the month or year extracted out of a date column e.g. order_date. They will find it convenient to use an expression (extract(month from date)) as a partition key, instead of storing month or year as a separate column.
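With the proposed syntax, such a key could look like this (hypothetical
table; note that extract() needs its own parentheses because it is an
expression rather than a plain column):

```sql
-- Hypothetical: list-partitioning orders by the month of order_date,
-- instead of storing the month in a separate column (proposed syntax).
CREATE TABLE orders (
    order_id   int,
    order_date date
) PARTITION BY LIST ((extract(month FROM order_date)));
```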

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Robert Eckhardt
Date:
On Tue, Aug 16, 2016 at 2:30 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:




>> I think it makes sense to keep calling it a table because it has all the
>> logical properties of a table even though it will differ from a regular
>> table on the basis of physical implementation details such as that it does
>> not own physical storage.  Am I missing something?
>>
>>> +      <entry><structfield>partexprs</structfield></entry>
>>>
>>> There's a certain symmetry between this and what we do for indexes,
>>> but I'm wondering whether there's a use case for partitioning a table
>>> by an expression rather than a column value.  I suppose if you've
>>> already done the work, there's no harm in supporting it.
>>
>> Yeah, it's not a whole lot of code to manage expressions alongside simple
>> column references.
>
> Users who would like to partition their tables by "age" will partition those by the month or year extracted out of a date column e.g. order_date. They will find it convenient to use an expression (extract(month from date)) as a partition key, instead of storing month or year as a separate column.

In GPDB we have partitioning.  It is almost always by date, and the
partitions are often of different sizes, i.e. by day for 30 days, then by
month for 3 years, then by year.  What we also support, but isn't super
performant, is sub-partitioning.

This is where some of the newer indexing strategies are interesting to me.
I see them as synergistic, not redundant.
 


Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 0002-psql-and-pg_dump-support-for-partitioned-tables.patch

+    if (pset.sversion >= 90600 && tableinfo.relkind == 'P')

Version check is redundant, right?

+) PARTITION BY RANGE ((a+b));
+\d describe_range_key
+Partitioned table "public.describe_range_key"
+ Column |  Type   | Modifiers
+--------+---------+-----------
+ a      | integer |
+ b      | integer |
+Partition Key: PARTITION BY RANGE (((a + b)))

I understand that it's probably difficult not to end up with two sets
of parentheses here, but can we avoid ending up with three sets?

Also, I wonder if pg_get_partkeydef() should omit "PARTITION BY" and
pg_dump can add that part back.  Then this could say:

Partition Key: RANGE ((a + b))

...which seems a good deal more natural than what you have now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 0003-Catalog-and-DDL-for-partition-bounds.patch
>
> Partition DDL includes both a way to create new partition and "attach" an
> existing table as a partition of parent partitioned table.  Attempt to
> drop a partition using DROP TABLE causes an error. Instead a partition
> needs first to be "detached" from parent partitioned table.  On the other
> hand, dropping the parent drops all the partitions if CASCADE is specified.

Hmm, I don't think I like this.  Why should it be necessary to detach
a partition before dropping it?  That seems like an unnecessary step.

    [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY
IMMEDIATE ]
+</synopsis>

Unnecessary hunk.

+     <para>
+      If this table is a partition, one cannot perform <literal>DROP
NOT NULL</>
+      on a column if it is marked not null in the parent table.
+      not null.
+     </para>

Sentence fragment.

+     <para>
+      Note that unlike the <literal>ATTACH PARTITION</> command, a partition
+      being detached can be itself partitioned.  In that case, it continues
+      to exist as such.
+     </para>

This is another restriction I don't understand.  Why can't I attach a
partitioned table?

+        indicate that descendant tables are included.  Note that whether
+        <literal>ONLY</> or <literal>*</> is specified has no effect in case
+        of a partitioned table; descendant tables (in this case, partitions)
+        are always included.

Ugh, why?  I think this should work exactly the same way for
partitioned tables that it does for any other inheritance hierarchy.
Sure, you'll get no rows, but who cares?

+CREATE FOREIGN TABLE measurement_y2016m07
+    PARTITION OF measurement FOR VALUES START ('2016-07-01') END
('2016-08-01');
+    SERVER server_07;

Extra semicolon?

+      A partition cannot have columns other than those inherited from the
+      parent.  That includes the <structfield>oid</> column, which can be

I think experience suggests that this is a good restriction, but then
why does the syntax synopsis indicate that PARTITION BY can be
specified along with column definitions?  Similarly for CREATE FOREIGN
TABLE.

+      When specifying for a table being created as partition, one needs to
+      use column names from the parent table as part of the key.

This is not very clear.

-       /* Remove NO INHERIT flag if rel is a partitioned table */
-       if (relid_is_partitioned(relid))
+       /* Discard NO INHERIT, if relation is a partitioned table or a
+          partition */
+       if (relid_is_partitioned(relid) || relid_is_partition(relid))
                is_no_inherit = false;

It might be right to disallow NO INHERIT in this case, but I don't
think it can be right to just silently ignore it.

+ * Not flushed from the cache by RelationClearRelation() unless changed because
+ * of addition or removal of partitions.

This seems unlikely to be safe, unless I'm missing something.

+       form = (Form_pg_inherits) GETSTRUCT(tuple);
+
+       systable_endscan(scan);
+       heap_close(catalogRelation, AccessShareLock);
+
+       return form->inhparent;

This is unsafe.  After systable_endscan, it is no longer OK to access
form->inhparent.

Try building with CLOBBER_CACHE_ALWAYS to find other cache flush hazards.

There should probably be a note in the function header comment that it
is unsafe to use this for an inheritance child that is not a
partition, because there could be more than one parent in that case.
Or maybe the whole idea of this function just isn't very sound...

+static List *
+get_partitions(Oid relid, int lockmode)
+{
+       return find_inheritance_children(relid, lockmode);
+}

What's the point?  If we're going to have a wrapper here at all, then
shouldn't it have a name that matches the existing convention - e.g.
find_partitions() or find_child_partitions()?  But I think you might
as well just use find_inheritance_children() directly.

+                * Happens when we have created the pg_inherits entry
but not the
+                * pg_partition entry yet.

Why do we ever allow the flow of control to reach this point while we
are in such an intermediate state?

+free_partition_info(PartitionInfoData **p, int num)

Seems very error-prone.  Isn't this why MemoryContextReset was invented?

+relid_is_partition(Oid relid)
+{
+       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
+}

This is used in a lot of places, and the overhead of checking it in
all of those places is not necessarily nil.  Syscache lookups aren't
free.  What if we didn't create a new catalog for this and instead
just added a relpartitionbound attribute to pg_class?  It seems a bit
silly to have a whole extra catalog to store one extra column...

       /*
+        * If this foreign table is a partition, check that the FDW supports
+        * insert.
+        */
+       if (stmt->base.partbound != NULL)
+       {
+               FdwRoutine *fdw_routine;
+
+               fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+               if (fdw_routine->ExecForeignInsert == NULL)
+                       ereport(ERROR,
+                                       (errcode(ERRCODE_FDW_NO_SCHEMAS),
+                                        errmsg("cannot create foreign table as partition"),
+                                        errdetail("foreign-data wrapper \"%s\" does not support insert",
+                                                       fdw->fdwname)));
+       }

Why?  This seems like an entirely arbitrary prohibition.  If inserts
aren't supported, then they'll fail at runtime.  Same for updates or
deletes, for which you have not added checks.  I think you should just
remove this.

+               /* Force inheritance recursion, if partitioned table. */
+               if (recurse || relid_is_partitioned(myrelid))

Disagree with this, too.  There's no reason for partitioned tables to
be special in this way.  Similarly, disagree with all of the places
that do something similar.

-                               errmsg("column \"%s\" in child table must be marked NOT NULL",
-                                          attributeName)));
+                               errmsg("column \"%s\" in %s table must be marked NOT NULL",
+                                          attributeName,
+                                          is_attach_partition ? "source" : "child")));

You have a few of these; they cause problems for translators, because
different languages have different word ordering.  Repeat the entire
message instead: is_attach_partition ? "column \"%s\" in source table
must be marked NOT NULL" : "column \"%s\" in child table must be
marked NOT NULL".

+-- XXX add ownership tests

So do that.  :-)

+ERROR:  column "b" is not null in parent
+HINT:  Please drop not null in the parent instead

Hmm.   That hint doesn't seem like project style, and I'm not sure
that it really makes sense to issue such a hint anyway.  Who knows
whether that is the right thing to do?  I think you should somehow be
complaining about the fact that this is a partition, rather than
complaining about the fact that the column is NOT NULL in the parent.
Are we insisting that the flags match exactly, or only that the child
may not allow nulls unless the parent does?

+ERROR:  new partition's list of values overlaps with partition
"lpart1" of "list_parted"

Maybe just:

ERROR: partitions must not overlap
-or-
ERROR: partition "%s" would overlap partition "%s"

As before, this is just an initial read-through, so apologies for
whatever I may have missed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:



+relid_is_partition(Oid relid)
+{
+       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
+}

This is used in a lot of places, and the overhead of checking it in
all of those places is not necessarily nil.  Syscache lookups aren't
free.  What if we didn't create a new catalog for this and instead
just added a relpartitionbound attribute to pg_class?  It seems a bit
silly to have a whole extra catalog to store one extra column...

 

It looks like in most of the places where this function is called, it is invoked as relid_is_partition(RelationGetRelid(rel)). Instead, we should probably check for the existence of rd_partdesc or rd_partkey within the Relation structure and make sure that those members are always set for a partitioned table. That would avoid the cache lookup and may give better performance.

That brings up another question. Can we have rd_partdesc non-NULL and rd_partkey NULL, or vice versa? If not, should we combine those into a single structure like Partition (similar to Relation)?



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/08/17 14:33, Ashutosh Bapat wrote:
>> +relid_is_partition(Oid relid)
>> +{
>> +       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
>> +}
>>
>> This is used in a lot of places, and the overhead of checking it in
>> all of those places is not necessarily nil.  Syscache lookups aren't
>> free.  What if we didn't create a new catalog for this and instead
>> just added a relpartitionbound attribute to pg_class?  It seems a bit
>> silly to have a whole extra catalog to store one extra column...
>>
> It looks like in most of the places where this function is called it's
> using relid_is_partition(RelationGetRelid(rel)). Instead probably we should
> check existence of rd_partdesc or rd_partkey within Relation() and make
> sure that those members are always set for a partitioned table. That will
> avoid cache lookup and may give better performance.

It seems you are talking about a *partitioned* relation here, whereas
relid_is_partition() is trying to check whether a relation is a *partition*
by looking up the pg_partition catalog (or the associated cache).  For the
former, either the test you suggest or an rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE test is enough.

I am slightly tempted to eliminate the pg_partition catalog and associated
syscache altogether and add a column to pg_class as Robert suggested.
That way, all relid_is_partition() calls will be replaced by
rel->rd_partbound != NULL check.  But one potential problem with that
approach is that now whenever a parent relation is opened, all the
partition relations must be opened to get the partbound value (to form the
PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
currently, we just look up the pg_partition catalog (or the associated
cache) for every partition and that gets us the partbound.

> That brings up another question. Can we have rd_partdesc non null and
> rd_partkey null or vice-versa. If not, should we club those into a single
> structure like Partition (similar to Relation)?

It's true that rd_partkey and rd_partdesc are both either NULL or
non-NULL, so combining them into a single struct is an idea worth considering.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:


On Wed, Aug 17, 2016 at 11:51 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/08/17 14:33, Ashutosh Bapat wrote:
>> +relid_is_partition(Oid relid)
>> +{
>> +       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
>> +}
>>
>> This is used in a lot of places, and the overhead of checking it in
>> all of those places is not necessarily nil.  Syscache lookups aren't
>> free.  What if we didn't create a new catalog for this and instead
>> just added a relpartitionbound attribute to pg_class?  It seems a bit
>> silly to have a whole extra catalog to store one extra column...
>>
> It looks like in most of the places where this function is called it's
> using relid_is_partition(RelationGetRelid(rel)). Instead probably we should
> check existence of rd_partdesc or rd_partkey within Relation() and make
> sure that those members are always set for a partitioned table. That will
> avoid cache lookup and may give better performance.

It seems you are talking about a *partitioned* relation here, whereas
relid_is_partition() is trying to check whether a relation is a *partition*
by looking up the pg_partition catalog (or the associated cache).  For the
former, either the test you suggest or an rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE test is enough.

Uh, you are right. Sorry for my misunderstanding.
 

I am slightly tempted to eliminate the pg_partition catalog and associated
syscache altogether and add a column to pg_class as Robert suggested.
That way, all relid_is_partition() calls will be replaced by
rel->rd_partbound != NULL check.  But one potential problem with that
approach is that now whenever a parent relation is opened, all the
partition relations must be opened to get the partbound value (to form the
PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
currently, we just look up the pg_partition catalog (or the associated
cache) for every partition and that gets us the partbound.

> That brings up another question. Can we have rd_partdesc non null and
> rd_partkey null or vice-versa. If not, should we club those into a single
> structure like Partition (similar to Relation)?

It's true that rd_partkey and rd_partdesc are both either NULL or
non-NULL, so combining them into a single struct is an idea worth considering.

Thanks,
Amit





--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Aug 17, 2016 at 2:21 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I am slightly tempted to eliminate the pg_partition catalog and associated
> syscache altogether and add a column to pg_class as Robert suggested.
> That way, all relid_is_partition() calls will be replaced by
> rel->rd_partbound != NULL check.  But one potential problem with that
> approach is that now whenever a parent relation is opened, all the
> partition relations must be opened to get the partbound value (to form the
> PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
> currently, we just look up the pg_partition catalog (or the associated
> cache) for every partition and that gets us the partbound.

Well, you could just look up the pg_class row without opening the
relation, too.  There is a system cache on pg_class.oid, after all.  I
think the issue is whether it's safe to read either one of those
things without a lock on the child relation.  If altering the
partitioning information for a relation requires holding only
AccessExclusiveLock on that relation, and no lock on the parent, then
you really can't read the information for any child relation without
taking at least AccessShareLock.  Otherwise, it might change under
you, and that would be bad.

I'm inclined to think that changing the partitioning information for a
child is going to require AccessExclusiveLock on both the child and
the parent.  That seems unfortunate from a concurrency point of view,
but we may be stuck with it: suppose you require only
ShareUpdateExclusiveLock on the parent.  Well, then a concurrent read
transaction might see the partition boundaries change when it does a
relcache rebuild, which would cause it to suddenly start expecting the
data to be in a different place in mid-transaction, perhaps even in
mid-scan.  Maybe that's survivable with really careful coding, but it
seems like it's probably a bad thing.  For example, it would mean that
the executor would be unable to rely on the partitioning information
in the relcache remaining stable underneath it.  Moreover, the
relcache is always going to be scanned with the most recent possible
MVCC snapshot, but the transaction snapshot may be older, so such a
system creates all sorts of nasty possibilities for there to be skew
between the snapshot being used to view the data and the snapshot being
used to read the metadata that says where the data is.

This may need some more thought, but if we go with that approach of
requiring an AccessExclusiveLock on both parent and child, then it
seems to me that maybe we should consider the partitioning information
to be a property of the parent rather than the child.  Just take all
the partitioning information for all children and put it in one big
node tree and store it in the pg_class or pg_partition_root entry for
the parent as one big ol' varlena.  Now you can open the parent and
get all of the partitioning information for all of the children
without needing any lock on any child, and that's *really* good,
because it means that some day we might be able to do partition
elimination before locking any of the children!  That would be
excellent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
The parent-child relationship of multi-level partitioned tables is not retained when creating the AppendRelInfo nodes. We create RelOptInfo nodes for all the leaf and intermediate tables. The AppendRelInfo nodes created for these RelOptInfos set the topmost table as the parent of all the leaf and child tables. Since the partitioning scheme/s at each level is/are associated with the parent/s at that level, we lose information about the immediate parents, and thus it becomes difficult to identify where in the partition hierarchy a given leaf node falls.

This stops us from doing any lump-sum partition pruning where we could eliminate all the partitions under a given parent-partition if that parent-partition gets pruned. It also prevents the partition-wise join technique from being applied to a partial partition hierarchy when the whole partitioning scheme of the joining tables does not match.

Maintaining a RelOptInfo hierarchy should not create a corresponding Append (of any kind) plan hierarchy, since accumulate_append_subpath() flattens any such hierarchy while creating paths. Can you please consider this point in your upcoming patch?


On Wed, Aug 10, 2016 at 4:39 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi,

Attached is the latest set of patches to implement declarative
partitioning.  There is already a commitfest entry for the same:
https://commitfest.postgresql.org/10/611/

The old discussion is here:
http://www.postgresql.org/message-id/55D3093C.5010800@lab.ntt.co.jp/

Attached patches are described below:

0001-Catalog-and-DDL-for-partition-key.patch
0002-psql-and-pg_dump-support-for-partitioned-tables.patch

These patches create the infrastructure and DDL for partitioned
tables.

In addition to a catalog for storing the partition key information, this
adds a new relkind to pg_class.h. PARTITION BY clause is added to CREATE
TABLE. Tables so created are RELKIND_PARTITIONED_REL relations which are
to be special in a number of ways, especially with regard to their
interactions with regular table inheritance features.

PARTITION BY RANGE ({ column_name | ( expression ) } [ opclass ] [, ...])
PARTITION BY LIST ({ column_name | ( expression ) } [ opclass ])


0003-Catalog-and-DDL-for-partition-bounds.patch
0004-psql-and-pg_dump-support-for-partitions.patch

These patches create the infrastructure and DDL for partitions.

Parent-child relationships of a partitioned table and its partitions are
managed behind-the-scenes with inheritance.  That means there is a
pg_inherits entry and attributes, constraints, etc. are marked with
inheritance related information appropriately.  However this case differs
from a regular inheritance relationship in a number of ways.  While the
regular inheritance imposes certain restrictions on what elements a
child's schema is allowed to contain (both at creation time and
after-the-fact), the partitioning related code imposes further
restrictions.  For example, while regular inheritance allows a child to
contain its own columns, the partitioning code disallows that.  Stuff like
NO INHERIT marking on check constraints and ONLY are ignored by the
partitioning code.

Partition DDL includes both a way to create a new partition and to "attach" an
existing table as a partition of a parent partitioned table.  An attempt to
drop a partition using DROP TABLE causes an error; instead, a partition
first needs to be "detached" from the parent partitioned table.  On the other
hand, dropping the parent drops all the partitions if CASCADE is specified.

CREATE TABLE partition_name
    PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
[ PARTITION BY {RANGE | LIST} ( { column_name | ( expression ) } [ opclass
] [, ...] ) ]

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name
  PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
  SERVER server_name
[ OPTIONS ( option 'value' [, ... ] ) ]

ALTER TABLE parent ATTACH PARTITION partition_name partition_bound_spec [
VALIDATE | NO VALIDATE ]

ALTER TABLE parent DETACH PARTITION partition_name

partition_bound_spec is:

FOR VALUES { list_spec | range_spec }

list_spec in FOR VALUES is:

IN ( expression [, ...] )

range_spec in FOR VALUES is:

START lower-bound [ INCLUSIVE | EXCLUSIVE ] END upper-bound [ INCLUSIVE |
EXCLUSIVE ]

where lower-bound and upper-bound are:

{ ( expression [, ...] ) | UNBOUNDED }

expression can be a string literal, a numeric literal or NULL.

Note that one can specify PARTITION BY when creating a partition
itself.  That is to allow creating multi-level partitioned tables.


0005-Teach-a-few-places-to-use-partition-check-constraint.patch

A partition's bound implicitly constrains the values that are allowed in
the partition key of its rows.  The same can be applied to partitions when
inserting data *directly* into them to make sure that only the correct
data is allowed in (if a tuple has been routed from the parent, that
becomes unnecessary). To that end, ExecConstraints() now includes the
above implicit check constraint in the list of constraints it enforces.

Further, to enable constraint based partition exclusion on partitioned
tables, the planner code includes in its list of constraints the above
implicitly defined constraints.  This arrangement is temporary however and
will be rendered unnecessary when we implement special data structures and
algorithms within the planner in future versions of this patch to use
partition metadata more effectively for partition exclusion.

Note that the "constraints" referred to above are not some on-disk
structures but those generated internally on-the-fly when requested by a
caller.

0006-Introduce-a-PartitionTreeNode-data-structure.patch
0007-Tuple-routing-for-partitioned-tables.patch

These patches enable routing of tuples inserted into a partitioned table
to one of its leaf partitions.  It applies to both COPY FROM and INSERT.
First of these patches introduces a data structure that provides a
convenient means for the tuple routing code to step down a partition tree
one level at a time.  The second one modifies copy.c and executor to
implement actual tuple routing.  When inserting into a partition, its row
constraints and triggers are applied.  Note that the partition's
constraints also include the constraints defined on the parent.  This
arrangement means, however, that the parent's triggers are not currently
applied.

Updates are handled like they are now for inheritance sets; however, if an
update causes a row to change partitions, an error will be thrown.

0008-Update-DDL-Partitioning-chapter.patch

This patch updates the partitioning section in the DDL chapter to reflect
the new methods made available for creating and managing partitioned table
and its partitions.  Especially considering that it is no longer necessary
to define CHECK constraints and triggers/rules manually for constraint
exclusion and tuple routing, respectively.

TODO (in short term):
* Add more regression tests and docs
* Add PartitionOptInfo and use it to perform partition pruning more
effectively (the added infrastructure should also help pairwise joins
patch proposed by Ashutosh Bapat [1])
* Fix internal representation of list partition bounds to be more efficient


Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj%3DEaDTSA%40mail.gmail.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers




--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/08/22 13:51, Ashutosh Bapat wrote:
> The parent-child relationship of multi-level partitioned tables is not
> retained when creating the AppendRelInfo nodes. We create RelOptInfo nodes
> for all the leaf and intermediate tables. The AppendRelInfo nodes created
> for these RelOptInfos set the topmost table as the parent of all the leaf
> and child tables. Since partitioning scheme/s at each level is/are
> associated with the parent/s at that level, we lose information about the
> immediate parents and thus it becomes difficult to identify which leaf node
> falls where in the partition hierarchy. This stops us from doing any
> lump-sum partition pruning where we can eliminate all the partitions under
> a given parent-partition if that parent-partition gets pruned. It also
> restricts partition-wise join technique from being applied to partial
> partition hierarchy when the whole partitioning scheme of joining tables
> does not match. Maintaining a RelOptInfo hierarchy should not create
> corresponding Append (all kinds) plan hierarchy since
> accumulate_append_subpath() flattens any such hierarchy while creating
> paths. Can you please consider this point in your upcoming patch?

I agree.  So there seem to be two things here:  a) when expanding a
partitioned table inheritance set, do it recursively such that resulting
AppendRelInfos preserve *immediate* parent-child relationship info.  b)
when accumulating append subpaths, do not flatten a subpath that is itself
an append when ((AppendPath *) subpath)->path.parent is a RelOptInfo with
non-NULL partitioning info.  Is the latter somehow necessary for
pairwise-join considerations?

I think I can manage to squeeze in (a) in the next version patch and will
also start working on (b), mainly the part about RelOptInfo getting some
partitioning info.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:


On Thu, Aug 25, 2016 at 12:22 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/08/22 13:51, Ashutosh Bapat wrote:
> The parent-child relationship of multi-level partitioned tables is not
> retained when creating the AppendRelInfo nodes. We create RelOptInfo nodes
> for all the leaf and intermediate tables. The AppendRelInfo nodes created
> for these RelOptInfos set the topmost table as the parent of all the leaf
> and child tables. Since partitioning scheme/s at each level is/are
> associated with the parent/s at that level, we lose information about the
> immediate parents and thus it becomes difficult to identify which leaf node
> falls where in the partition hierarchy. This stops us from doing any
> lump-sum partition pruning where we can eliminate all the partitions under
> a given parent-partition if that parent-partition gets pruned. It also
> restricts partition-wise join technique from being applied to partial
> partition hierarchy when the whole partitioning scheme of joining tables
> does not match. Maintaining a RelOptInfo hierarchy should not create
> corresponding Append (all kinds) plan hierarchy since
> accumulate_append_subpath() flattens any such hierarchy while creating
> paths. Can you please consider this point in your upcoming patch?

I agree.  So there seem to be two things here:  a) when expanding a
partitioned table inheritance set, do it recursively such that resulting
AppendRelInfos preserve *immediate* parent-child relationship info. 

Right.
 
b)
when accumulating append subpaths, do not flatten a subpath that is itself
an append when ((AppendPath *) subpath)->path.parent is a RelOptInfo with
non-NULL partitioning info.  Is the latter somehow necessary for
pairwise-join considerations?

I don't think you need to do anything in the path creation code for this. As is it flattens all AppendPath hierarchies whether for partitioning or inheritance or subqueries. We should leave it as it is.



I think I can manage to squeeze in (a) in the next version patch and will
also start working on (b), mainly the part about RelOptInfo getting some
partitioning info.

I am fine with b, where you would include some partitioning information in RelOptInfo. But you don't need to do what you said in (b) above.

In a private conversation, Robert Haas suggested a way slightly different from what my patch for partition-wise join does. He suggested that the partitioning schemes, i.e., the strategy, number of partitions, and bounds of the partitioned relations involved in the query, should be stored in PlannerInfo in the form of a list. Each partitioning scheme is annotated with the relids of the partitioned relations. The RelOptInfo of a partitioned relation will point to its partitioning scheme in PlannerInfo. Along with that, each RelOptInfo will need to store the partition keys for the corresponding relation.

This simplifies matching the partitioning schemes of the joining relations. It also reduces the number of copies of partition bounds floating around, since we expect that a query will involve multiple partitioned tables following similar partitioning schemes. Maybe you want to consider this idea while working on (b).


Thanks,
Amit





--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/08/18 5:23, Robert Haas wrote:
> On Wed, Aug 17, 2016 at 2:21 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I am slightly tempted to eliminate the pg_partition catalog and associated
>> syscache altogether and add a column to pg_class as Robert suggested.
>> That way, all relid_is_partition() calls will be replaced by
>> rel->rd_partbound != NULL check.  But one potential problem with that
>> approach is that now whenever a parent relation is opened, all the
>> partition relations must be opened to get the partbound value (to form the
>> PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
>> currently, we just look up the pg_partition catalog (or the associated
>> cache) for every partition and that gets us the partbound.
> 
> Well, you could just look up the pg_class row without opening the
> relation, too.  There is a system cache on pg_class.oid, after all.  I

Yes, I somehow didn't think of that.

> think the issue is whether it's safe to read either one of those
> things without a lock on the child relation.  If altering the
> partitioning information for a relation requires holding only
> AccessExclusiveLock on that relation, and no lock on the parent, then
> you really can't read the information for any child relation without
> taking at least AccessShareLock.  Otherwise, it might change under
> you, and that would be bad.

I'd imagine this won't be a problem because we take an AccessExclusiveLock
on the parent when adding/removing a partition.

> I'm inclined to think that changing the partitioning information for a
> child is going to require AccessExclusiveLock on both the child and
> the parent.  That seems unfortunate from a concurrency point of view,
> but we may be stuck with it: suppose you require only
> ShareUpdateExclusiveLock on the parent.  Well, then a concurrent read
> transaction might see the partition boundaries change when it does a
> relcache rebuild, which would cause it to suddenly start expecting the
> data to be in a different place in mid-transaction, perhaps even in
> mid-scan.  Maybe that's survivable with really careful coding, but it
> seems like it's probably a bad thing.  For example, it would mean that
> the executor would be unable to rely on the partitioning information
> in the relcache remaining stable underneath it.  Moreover, the
> relcache is always going to be scanned with the most recent possible
> MVCC snapshot, but the transaction snapshot may be older, so such a
> system creates all sorts of nasty possibilities for there to be skew
> between the snapshot being used to view the data and the snapshot being
> used to read the metadata that says where the data is.

We do take a lock on the parent because we would be changing its partition
descriptor (relcache).  I changed MergeAttributes() such that an
AccessExclusiveLock instead of ShareUpdateExclusiveLock is taken if the
parent is a partitioned table.

> This may need some more thought, but if we go with that approach of
> requiring an AccessExclusiveLock on both parent and child, then it
> seems to me that maybe we should consider the partitioning information
> to be a property of the parent rather than the child.  Just take all
> the partitioning information for all children and put it in one big
> node tree and store it in the pg_class or pg_partition_root entry for
> the parent as one big ol' varlena.  Now you can open the parent and
> get all of the partitioning information for all of the children
> without needing any lock on any child, and that's *really* good,
> because it means that some day we might be able to do partition
> elimination before locking any of the children!  That would be
> excellent.

If we need an AccessExclusiveLock on parent to add/remove a partition
(IOW, changing that child table's partitioning information), then do we
need to lock the individual partitions when reading partition's
information?  I mean to ask why the simple syscache look-ups to get each
partition's bound wouldn't do.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
Sorry it took me a while to reply.  Attached updated patches including the
review comments on 0001 at [1].

On 2016/08/17 3:54, Robert Haas wrote:
> On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> 0002-psql-and-pg_dump-support-for-partitioned-tables.patch
>
> +    if (pset.sversion >= 90600 && tableinfo.relkind == 'P')
>
> Version check is redundant, right?

Yep, fixed.

> +) PARTITION BY RANGE ((a+b));
> +\d describe_range_key
> +Partitioned table "public.describe_range_key"
> + Column |  Type   | Modifiers
> +--------+---------+-----------
> + a      | integer |
> + b      | integer |
> +Partition Key: PARTITION BY RANGE (((a + b)))
>
> I understand that it's probably difficult not to end up with two sets
> of parentheses here, but can we avoid ending up with three sets?

Fixed.  This code was copy-pasted from the index code which has other
considerations for adding surrounding parentheses which don't apply to the
partition key code.

> Also, I wonder if pg_get_partkeydef() should omit "PARTITION BY" and
> pg_dump can add that part back.  Then this could say:
>
> Partition Key: RANGE ((a + b))
>
> ...which seems a good deal more natural than what you have now.

Agreed, so done that way.

>> 0003-Catalog-and-DDL-for-partition-bounds.patch
>>
>> Partition DDL includes both a way to create new partition and "attach" an
>> existing table as a partition of parent partitioned table.  Attempt to
>> drop a partition using DROP TABLE causes an error. Instead a partition
>> needs first to be "detached" from parent partitioned table.  On the other
>> hand, dropping the parent drops all the partitions if CASCADE is specified.
>
> Hmm, I don't think I like this.  Why should it be necessary to detach
> a partition before dropping it?  That seems like an unnecessary step.

I thought we had better lock the parent table when removing one of its
partitions, and it seemed a bit odd to take that lock when dropping a
partition using DROP TABLE.  OTOH, with ALTER TABLE parent DETACH
PARTITION, the parent table is locked anyway.

>      [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY
> IMMEDIATE ]
> +
>  </synopsis>
>
> Unnecessary hunk.
>
> +     <para>
> +      If this table is a partition, one cannot perform <literal>DROP
> NOT NULL</>
> +      on a column if it is marked not null in the parent table.
> +      not null.
> +     </para>
>
> Sentence fragment.

Fixed.

>
> +     <para>
> +      Note that unlike the <literal>ATTACH PARTITION</> command, a partition
> +      being detached can be itself partitioned.  In that case, it continues
> +      to exist as such.
> +     </para>
>
> This is another restriction I don't understand.  Why can't I attach a
> partitioned table?

I removed this restriction.

ATExecAttachPartition() adds an AT work queue entry for the table being
attached to perform a heap scan in the rewrite phase.  The scan checks
that no row violates the partition boundary condition.  When attaching a
partitioned table as a partition, multiple AT work queue entries are now
added - one for each leaf partition of the table being attached.
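The per-leaf queuing described above can be sketched as follows (a
hypothetical, simplified model of the partition tree; the names and the
`leaf_partitions` helper are made up for illustration, not patch code):

```python
# Hypothetical sketch: when attaching a partitioned table, validation
# must scan every *leaf* partition, so one work item is queued per leaf.
def leaf_partitions(rel, children):
    """Return the leaf relations under `rel` in a partition tree.

    `children` maps a relation name to its list of partitions.
    """
    subparts = children.get(rel, [])
    if not subparts:                      # plain table: scan it directly
        return [rel]
    leaves = []
    for child in subparts:                # partitioned child: recurse
        leaves.extend(leaf_partitions(child, children))
    return leaves

# A two-level partition tree being attached:
tree = {"measurement": ["m_2016", "m_2017"],
        "m_2016": ["m_2016_h1", "m_2016_h2"]}
work_queue = leaf_partitions("measurement", tree)
# one heap-scan entry per leaf: m_2016_h1, m_2016_h2, m_2017
```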

> +        indicate that descendant tables are included.  Note that whether
> +        <literal>ONLY</> or <literal>*</> is specified has no effect in case
> +        of a partitioned table; descendant tables (in this case, partitions)
> +        are always included.
>
> Ugh, why?  I think this should work exactly the same way for
> partitioned tables that it does for any other inheritance hierarchy.
> Sure, you'll get no rows, but who cares?

Agreed, done that way.

> +CREATE FOREIGN TABLE measurement_y2016m07
> +    PARTITION OF measurement FOR VALUES START ('2016-07-01') END
> ('2016-08-01');
> +    SERVER server_07;
>
> Extra semicolon?

Fixed.

>
> +      A partition cannot have columns other than those inherited from the
> +      parent.  That includes the <structfield>oid</> column, which can be
>
> I think experience suggests that this is a good restriction, but then
> why does the syntax synopsis indicate that PARTITION BY can be
> specified along with column definitions?  Similarly for CREATE FOREIGN
> TABLE.

The syntax synopsis of CREATE TABLE ... PARTITION OF indicates that a list
of column WITH OPTION and/or table_constraint can be specified.  It does
not allow column definitions.

In this case, inherited columns will be listed in the PARTITION BY clause.

Do you mean that the CREATE TABLE ... PARTITION OF syntax should allow
column definitions just like INHERITS does and unlike regular inheritance,
throw error if columns other than those to be merged are found?

>
> +      When specifying for a table being created as partition, one needs to
> +      use column names from the parent table as part of the key.
>
> This is not very clear.

Sentence removed because it may be clear from the context that inherited
columns are to be used for the partition key.

> -       /* Remove NO INHERIT flag if rel is a partitioned table */
> -       if (relid_is_partitioned(relid))
> +       /* Discard NO INHERIT, if relation is a partitioned table or a
> partition */
> +       if (relid_is_partitioned(relid) || relid_is_partition(relid))
>                 is_no_inherit = false;
>
> It might be right to disallow NO INHERIT in this case, but I don't
> think it can be right to just silently ignore it.

OK, I changed this to instead throw an error if a NO INHERIT check
constraint is added to a partitioned table or a partition.

> + * Not flushed from the cache by RelationClearRelation() unless changed because
> + * of addition or removal of partitions.
>
> This seems unlikely to be safe, unless I'm missing something.

Like TupleDesc, a table's PartitionDesc is preserved across relcache
rebuilds.  PartitionDesc consists of arrays of OIDs and partition bounds
for a table's immediate partitions.  It can only change by adding or
removing partitions to/from the table, which requires an exclusive lock
on it.  Since this data can grow arbitrarily big, it seemed better not to
have to copy it around, so a direct pointer to the relcache field
(rd_partdesc) is given to callers.  Relcache rebuilds that do not
logically change the PartitionDesc leave it intact, so that users of it
are not left with dangling pointers.  Am I missing something?

>
> +       form = (Form_pg_inherits) GETSTRUCT(tuple);
> +
> +       systable_endscan(scan);
> +       heap_close(catalogRelation, AccessShareLock);
> +
> +       return form->inhparent;
>
> This is unsafe.  After systable_endscan, it is no longer OK to access
> form->inhparent.
>
> Try building with CLOBBER_CACHE_ALWAYS to find other cache flush hazards.
>
> There should probably be a note in the function header comment that it
> is unsafe to use this for an inheritance child that is not a
> partition, because there could be more than one parent in that case.
> Or maybe the whole idea of this function just isn't very sound...

Fixed unsafe coding and added a comment to the function saying it should
be called on tables known to be partitions.

>
> +static List *
> +get_partitions(Oid relid, int lockmode)
> +{
> +       return find_inheritance_children(relid, lockmode);
> +}
>
> What's the point?  If we're going to have a wrapper here at all, then
> shouldn't it have a name that matches the existing convention - e.g.
> find_partitions() or find_child_partitions()?  But I think you might
> as well just use find_inheritance_children() directly.

OK, changed to just use find_inheritance_children() directly.

>
> +                * Happens when we have created the pg_inherits entry
> but not the
> +                * pg_partition entry yet.
>
> Why do we ever allow the flow of control to reach this point while we
> are in such an intermediate state?

Fixed to prevent this from happening.

>
> +free_partition_info(PartitionInfoData **p, int num)
>
> Seems very error-prone.  Isn't this why MemoryContextReset was invented?

Got rid of this function.

>
> +relid_is_partition(Oid relid)
> +{
> +       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
> +}
>
> This is used in a lot of places, and the overhead of checking it in
> all of those places is not necessarily nil.  Syscache lookups aren't
> free.  What if we didn't create a new catalog for this and instead
> just added a relpartitionbound attribute to pg_class?  It seems a bit
> silly to have a whole extra catalog to store one extra column...

OK, I got rid of the pg_partition catalog and added a pg_class attribute
for storing the partition bound.  Because the newly added attribute is a
pg_node_tree and hence not readily accessible without a heap_getattr()
call, I added a boolean relispartition as well.  Now all the
relid_is_partition() calls have been replaced by checks using
relation->rd_rel->relispartition.

>
>         /*
> +        * If this foreign table is a partition, check that the FDW supports
> +        * insert.
> +        */
> +       if (stmt->base.partbound != NULL)
> +       {
> +               FdwRoutine *fdw_routine;
> +
> +               fdw_routine = GetFdwRoutine(fdw->fdwhandler);
> +               if (fdw_routine->ExecForeignInsert == NULL)
> +                       ereport(ERROR,
> +                                       (errcode(ERRCODE_FDW_NO_SCHEMAS),
> +                                        errmsg("cannot create foreign
> table as partition"),
> +                                        errdetail("foreign-data
> wrapper \"%s\" does not support insert",
> +                                                       fdw->fdwname)));
> +       }
>
> Why?  This seems like an entirely arbitrary prohibition.  If inserts
> aren't supported, then they'll fail at runtime.  Same for updates or
> deletes, for which you have not added checks.  I think you should just
> remove this.

OK, removed this check.

Instead, ExecInitModifyTable()/BeginCopy() call CheckValidResultRel() for
every leaf partition to check that they are ready for CMD_INSERT.

>
> +               /* Force inheritance recursion, if partitioned table. */
> +               if (recurse || relid_is_partitioned(myrelid))
>
> Disagree with this, too.  There's no reason for partitioned tables to
> be special in this way.  Similarly, disagree with all of the places
> that do something similar.

Removed the forced recursion bit here and a few other places.

>
> -                               errmsg("column \"%s\" in child table
> must be marked NOT NULL",
> -                                          attributeName)));
> +                               errmsg("column \"%s\" in %s table must
> be marked NOT NULL",
> +                                          attributeName,
> +                                          is_attach_partition ?
> "source" : "child")));
>
> You have a few of these; they cause problems for translators, because
> different languages have different word ordering.  Repeat the entire
> message instead: is_attach_partition ? "column \"%s\" in source table
> must be marked NOT NULL" : "column \"%s\" in child table must be
> marked NOT NULL".

Changed so that the entire message is repeated.

>
> +-- XXX add ownership tests
>
> So do that.  :-)

Done, sorry about that.

>
> +ERROR:  column "b" is not null in parent
> +HINT:  Please drop not null in the parent instead
>
> Hmm.   That hint doesn't seem like project style, and I'm not sure
> that it really makes sense to issue such a hint anyway.  Who knows
> whether that is the right thing to do?  I think you should somehow be
> complaining about the fact that this is a partition, rather than
> complaining about the fact that the column is NOT NULL in the parent.
> Are we insisting that the flags match exactly, or only that the child
> may not allow nulls unless the parent does?

I think the latter.  It is assumed that all the parent's constraints are
present in a child table, because in ExecInsert()/CopyFrom() we perform
ExecConstraints() using the child relation even if the actual insert was
on the parent.  Also, if an inherited NOT NULL constraint on a child's
column is dropped irrespective of the parent's, selecting the parent's
NOT NULL column might return nulls from the child table that no longer
has the constraint.

I recently came across a related proposal whereby dropping an *inherited*
NOT NULL constraint from child tables would be prevented.  Problems with
letting it be dropped are mentioned here:

https://www.postgresql.org/message-id/21633.1448383428@sss.pgh.pa.us

That proposal is probably being reworked such that NOT NULL constraints
get a pg_constraint entry with proper accounting of inheritance count.

> +ERROR:  new partition's list of values overlaps with partition
> "lpart1" of "list_parted"
>
> Maybe just:
>
> ERROR: partitions must not overlap
> -or-
> ERROR: partition "%s" would overlap partition "%s"

OK, I changed to the second message.

>
> As before, this is just an initial read-through, so apologies for
> whatever I may have missed.

Thanks a lot for the review.

By the way, I am still working on the following items, which will be
included in the next version of the patch:

* Fix internal representation of list partition bounds to be more efficient
* Add PartitionOptInfo

As mentioned in [2], I have refactored the inheritance expansion code
within the optimizer so that a partitioned table's inheritance hierarchy
is preserved in the resulting AppendRelInfos (patch 0005).  One immediate
benefit is that if constraint exclusion determines that an intermediate
partition is to be excluded, then all partitions underneath it are
excluded automatically.  With the current flattened model, all partitions
in that subtree would have been processed and excluded one-by-one.
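The subtree-skipping benefit can be sketched like this (a hypothetical
tree and a made-up `prune` helper, treating constraint exclusion as a
simple per-relation predicate test, not the actual planner code):

```python
# Sketch: with the hierarchy preserved, excluding an intermediate
# partition prunes its whole subtree without examining the partitions
# underneath it.
def prune(rel, children, excluded, tested):
    """Return surviving leaf rels; record each rel actually tested."""
    tested.append(rel)
    if rel in excluded:          # whole subtree skipped in one step
        return []
    kids = children.get(rel, [])
    if not kids:
        return [rel]
    out = []
    for k in kids:
        out.extend(prune(k, children, excluded, tested))
    return out

children = {"root": ["p1", "p2"], "p2": ["p2a", "p2b", "p2c"]}
tested = []
survivors = prune("root", children, excluded={"p2"}, tested=tested)
# p2's three children are never tested individually:
# survivors == ["p1"], tested == ["root", "p1", "p2"]
```

With the flattened model, p2a, p2b and p2c would each have been expanded
and tested as well.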

Regards,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZ008qTgd_Qg6_oZb3i0mOYrS6MdhncwgcqPKahixjarg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/f2a9592a-17e9-4c6a-e021-03b802195ce7%40lab.ntt.co.jp

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Aug 26, 2016 at 1:33 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> We do take a lock on the parent because we would be changing its partition
> descriptor (relcache).  I changed MergeAttributes() such that an
> AccessExclusiveLock instead of ShareUpdateExclusiveLock is taken if the
> parent is a partitioned table.

Hmm, that seems both good and bad.  On the good side, as mentioned,
being able to rely on the partition descriptor not changing under us
makes this sort of thing much easier to reason about.  On the bad
side, it isn't good for this feature to have worse concurrency than
regular inheritance.  Not sure what to do about this.

> If we need an AccessExclusiveLock on parent to add/remove a partition
> (IOW, changing that child table's partitioning information), then do we
> need to lock the individual partitions when reading partition's
> information?  I mean to ask why the simple syscache look-ups to get each
> partition's bound wouldn't do.

Well, if X can't be changed without having an AccessExclusiveLock on
the parent, then an AccessShareLock on the parent is sufficient to
read X, right?  Because those lock modes conflict.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/08/29 20:53, Robert Haas wrote:
> On Fri, Aug 26, 2016 at 1:33 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> We do take a lock on the parent because we would be changing its partition
>> descriptor (relcache).  I changed MergeAttributes() such that an
>> AccessExclusiveLock instead of ShareUpdateExclusiveLock is taken if the
>> parent is a partitioned table.
> 
> Hmm, that seems both good and bad.  On the good side, as mentioned,
> being able to rely on the partition descriptor not changing under us
> makes this sort of thing much easier to reason about.  On the bad
> side, it isn't good for this feature to have worse concurrency than
> regular inheritance.  Not sure what to do about this.
>
>> If we need an AccessExclusiveLock on parent to add/remove a partition
>> (IOW, changing that child table's partitioning information), then do we
>> need to lock the individual partitions when reading partition's
>> information?  I mean to ask why the simple syscache look-ups to get each
>> partition's bound wouldn't do.
> 
> Well, if X can't be changed without having an AccessExclusiveLock on
> the parent, then an AccessShareLock on the parent is sufficient to
> read X, right?  Because those lock modes conflict.

Yes.  And hence we can proceed with performing partition elimination
before locking any of the children.  A lock on the parent
(AccessShareLock) will prevent any existing partitions from being removed
and any new partitions from being added, because those operations require
AccessExclusiveLock on the parent.  What I was trying to understand is
why this would not be possible with a design where the partition bound is
stored in the catalog as a property of individual partitions, instead of
a design where we store the collection of partition bounds as a property
of the parent.
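The invariant being relied on here can be sketched as a tiny lock-mode
conflict check (only the two modes involved; the conflict pairs mirror
the documented behavior that ACCESS SHARE conflicts only with ACCESS
EXCLUSIVE):

```python
# Sketch: writers of the partition list take AccessExclusiveLock on the
# parent, readers AccessShareLock.  Because those modes conflict, a
# reader holding AccessShareLock can trust what it reads without
# locking each child first.
CONFLICTS = {
    ("AccessShareLock", "AccessExclusiveLock"),
    ("AccessExclusiveLock", "AccessShareLock"),
    ("AccessExclusiveLock", "AccessExclusiveLock"),
    # AccessShareLock does not conflict with itself
}

def conflicts(a, b):
    return (a, b) in CONFLICTS

# ATTACH/DETACH (writer) cannot run concurrently with a reader:
assert conflicts("AccessShareLock", "AccessExclusiveLock")
# ...but concurrent readers are fine:
assert not conflicts("AccessShareLock", "AccessShareLock")
```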

Thanks,
Amit





Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Aug 31, 2016 at 12:37 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> If we need an AccessExclusiveLock on parent to add/remove a partition
>>> (IOW, changing that child table's partitioning information), then do we
>>> need to lock the individual partitions when reading partition's
>>> information?  I mean to ask why the simple syscache look-ups to get each
>>> partition's bound wouldn't do.
>>
>> Well, if X can't be changed without having an AccessExclusiveLock on
>> the parent, then an AccessShareLock on the parent is sufficient to
>> read X, right?  Because those lock modes conflict.
>
> Yes.  And hence we can proceed with performing partition elimination
> before locking any of children.  Lock on parent (AccessShareLock) will
> prevent any of existing partitions to be removed and any new partitions to
> be added because those operations require AccessExclusiveLock on the
> parent.

Agreed.

> What I was trying to understand is why this would not be possible
> with a design where partition bound is stored in the catalog as a property
> of individual partitions instead of a design where we store collection of
> partition bounds as a property of the parent.

From the point of view of feasibility, I don't think it matters very
much where the property is stored; it's the locking that is the key
thing.  In other words, I think this *would* be possible if the
partition bound is stored as a property of individual partitions, as
long as it can't change without a lock on the parent.

However, it seems a lot better to make it a property of the parent
from a performance point of view.  Suppose there are 1000 partitions.
Reading one toasted value for pg_class and running stringToNode() on
it is probably a lot faster than scanning pg_inherits to find all of
the child partitions and then doing an index scan to find the pg_class
tuple for each and then decoding all of those tuples and assembling
them into some data structure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/08/31 16:17, Robert Haas wrote:
> On Wed, Aug 31, 2016 at 12:37 PM, Amit Langote wrote:
>> What I was trying to understand is why this would not be possible
>> with a design where partition bound is stored in the catalog as a property
>> of individual partitions instead of a design where we store collection of
>> partition bounds as a property of the parent.
> 
> From the point of view of feasibility, I don't think it matters very
> much where the property is stored; it's the locking that is the key
> thing.  In other words, I think this *would* be possible if the
> partition bound is stored as a property of individual partitions, as
> long as it can't change without a lock on the parent.
> 
> However, it seems a lot better to make it a property of the parent
> from a performance point of view.  Suppose there are 1000 partitions.
> Reading one toasted value for pg_class and running stringToNode() on
> it is probably a lot faster than scanning pg_inherits to find all of
> the child partitions and then doing an index scan to find the pg_class
> tuple for each and then decoding all of those tuples and assembling
> them into some data structure.

Seems worth trying.  One point that bothers me a bit is how we enforce
the partition bound condition on an individual-partition basis.  For
example, when a row is inserted into a partition directly, we had better
check that it does not fall outside the bounds and issue an error
otherwise.  With the current approach, we just look up a partition's
bound from the catalog and gin up a check constraint expression (cached
in the relcache) to be enforced in ExecConstraints().  With the new
approach, I guess we would need to look up the parent's partition
descriptor.  Note that this checking in ExecConstraints() is turned off
when routing a tuple from the parent.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/08/25 16:15, Ashutosh Bapat wrote:
> On Thu, Aug 25, 2016 at 12:22 PM, Amit Langote wrote:
>> b)
>> when accumulating append subpaths, do not flatten a subpath that is itself
>> an append when ((AppendPath *) subpath)->path.parent is a RelOptInfo with
>> non-NULL partitioning info.  Is the latter somehow necessary for
>> pairwise-join considerations?
> 
> I don't think you need to do anything in the path creation code for this.
> As is it flattens all AppendPath hierarchies whether for partitioning or
> inheritance or subqueries. We should leave it as it is.

I thought it would be convenient for pairwise join code to work with the
hierarchy intact even within the AppendPath tree.  If it turns out to be
so, maybe that patch can take care of it.

>> I think I can manage to squeeze in (a) in the next version patch and will
>> also start working on (b), mainly the part about RelOptInfo getting some
>> partitioning info.
> 
> I am fine with b, where you would include some partitioning information in
> RelOptInfo. But you don't need to do what you said in (b) above.
> 
> In a private conversation Robert Haas suggested a way slightly different
> than what my patch for partition-wise join does. He suggested that the
> partitioning schemes i.e strategy, number of partitions and bounds of the
> partitioned relations involved in the query should be stored in PlannerInfo
> in the form of a list. Each partitioning scheme is annotated with the
> relids of the partitioned relations. RelOptInfo of the partitioned relation
> will point to the partitioning scheme in PlannerInfo. Along-with that each
> RelOptInfo will need to store partition keys for corresponding relation.
> This simplifies matching the partitioning schemes of the joining relations.
> Also it reduces the number of copies of partition bounds floating around as
> we expect that a query will involve multiple partitioned tables following
> similar partitioning schemes. May be you want to consider this idea while
> working on (b).

So IIUC, a partitioned relation's (baserel or joinrel) RelOptInfo has
only the information about partition keys.  They will be matched with
query restriction quals, pruning away any unneeded partitions, which
happens individually for each such parent baserel (within
set_append_rel_size(), I suppose).  Further, two joining relations are
eligible to be considered for pairwise joining if they have identical
partition keys and the query's equi-join quals match them.  The resulting
joinrel will have the same partition key (as either joining relation) and
will have as many partitions as there are in the intersection of the sets
of partitions of the joining rels (the intersection proceeds by matching
partition bounds).
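For the range case, the bound matching could be sketched like this (a
deliberately simplified model where "matching" means identical bounds;
the names and dict representation are hypothetical, not the patch's
actual data structures):

```python
# Sketch: a pairwise join between two range-partitioned rels pairs up
# only partitions whose bounds match; unmatched partitions drop out of
# the joinrel's partition set.
def match_partitions(rel_a, rel_b):
    """Pair up partitions of two rels with identical (lo, hi) bounds.

    Each rel is a dict mapping (lo, hi) -> partition name.
    """
    common = rel_a.keys() & rel_b.keys()
    return sorted((rel_a[b], rel_b[b]) for b in common)

t1 = {(0, 250): "t1_p1", (250, 500): "t1_p2", (500, 600): "t1_p3"}
t2 = {(0, 250): "t2_p1", (250, 500): "t2_p2", (600, 800): "t2_p4"}
pairs = match_partitions(t1, t2)
# joinrel keeps only the matched bounds:
# [("t1_p1", "t2_p1"), ("t1_p2", "t2_p2")]
```

A real implementation would of course have to handle partially
overlapping ranges, not just identical bounds.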

"Partition scheme" structs go into a PlannerInfo list member, one
corresponding to each partitioned relation - baserel or joinrel, right?
As you say, each such struct has the following pieces of information:
strategy, num_partitions, bounds (and other auxiliary info).  Before
make_one_rel() starts, the list has one for each partitioned baserel.
After make_one_rel() has formed baserel pathlists and before
make_rel_from_joinlist() is called, are the partition scheme structs of
processed baserels marked with some information about the pruning activity
that occurred so far?  Then as we build successively higher levels of
joinrels, new entries will be made for those joinrels for which we added
pairwise join paths, with relids matching the corresponding joinrels.
Does that make sense?

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:



>
> I don't think you need to do anything in the path creation code for this.
> As is it flattens all AppendPath hierarchies whether for partitioning or
> inheritance or subqueries. We should leave it as it is.

> I thought it would be convenient for pairwise join code to work with the
> hierarchy intact even within the AppendPath tree.  If it turns out to be
> so, maybe that patch can take care of it.

Partition-wise join works with RelOptInfos, so it's fine if the AppendPath hierarchy is flattened out. We need the RelOptInfo hierarchy though.
 

>> I think I can manage to squeeze in (a) in the next version patch and will
>> also start working on (b), mainly the part about RelOptInfo getting some
>> partitioning info.
>
> I am fine with b, where you would include some partitioning information in
> RelOptInfo. But you don't need to do what you said in (b) above.
>
> In a private conversation Robert Haas suggested a way slightly different
> than what my patch for partition-wise join does. He suggested that the
> partitioning schemes i.e strategy, number of partitions and bounds of the
> partitioned relations involved in the query should be stored in PlannerInfo
> in the form of a list. Each partitioning scheme is annotated with the
> relids of the partitioned relations. RelOptInfo of the partitioned relation
> will point to the partitioning scheme in PlannerInfo. Along-with that each
> RelOptInfo will need to store partition keys for corresponding relation.
> This simplifies matching the partitioning schemes of the joining relations.
> Also it reduces the number of copies of partition bounds floating around as
> we expect that a query will involve multiple partitioned tables following
> similar partitioning schemes. May be you want to consider this idea while
> working on (b).

> So IIUC, a partitioned relation's (baserel or joinrel) RelOptInfo has only
> the information about partition keys.  They will be matched with query
> restriction quals pruning away any unneeded partitions which happens
> individually for each such parent baserel (within set_append_rel_size() I
> suppose).  Further, two joining relations are eligible to be considered
> for pairwise joining if they have identical partition keys and query
> equi-join quals match the same.  The resulting joinrel will have the same
> partition key (as either joining relation) and will have as many
> partitions as there are in the intersection of sets of partitions of
> joining rels (intersection proceeds by matching partition bounds).
>
> "Partition scheme" structs go into a PlannerInfo list member, one
> corresponding to each partitioned relation - baserel or joinrel, right?

Multiple relations (base or join) can share a Partition Scheme if they are partitioned the same way. Each partition scheme also stores the relids of the base relations partitioned by that scheme.
 
> As you say, each such struct has the following pieces of information:
> strategy, num_partitions, bounds (and other auxiliary info).  Before
> make_one_rel() starts, the list has one for each partitioned baserel.
> After make_one_rel() has formed baserel pathlists and before
> make_rel_from_joinlist() is called, are the partition scheme structs of
> processed baserels marked with some information about the pruning activity
> that occurred so far?

Right now pruned partitions are labelled as dummy rels (empty append paths). That's enough to detect a pruned partition. I haven't found a need to label the partitioning scheme with pruned partitions for partition-wise join.
 
> Then as we build successively higher levels of
> joinrels, new entries will be made for those joinrels for which we added
> pairwise join paths, with relids matching the corresponding joinrels.
> Does that make sense?


I don't think we will make any new partition scheme entries in PlannerInfo after all the base relations have been considered. Partition-wise join will pick the one suitable for the given join. But in case partition-wise join needs to make new entries, I will take care of that in my patch.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
Here's something I observed with your set of patches posted in June. I have not checked the latest set of patches, so if this has already been fixed, please ignore this mail, and sorry for being lazy.

prt1 is partitioned table and it shows following information with \d+

regression=# \d+ prt1
                        Partitioned table "public.prt1"
 Column |       Type        | Modifiers | Storage  | Stats target | Description
--------+-------------------+-----------+----------+--------------+-------------
 a      | integer           |           | plain    |              |
 b      | integer           |           | plain    |              |
 c      | character varying |           | extended |              |
Partition Key: PARTITION BY RANGE (a)
Indexes:
    "iprt1_a" btree (a)

Shouldn't we show all the partitions of this table and maybe their ranges or lists?

I found the partitions from the EXPLAIN plan:

regression=# explain verbose select * from prt1;
                              QUERY PLAN                              
-----------------------------------------------------------------------
 Append  (cost=0.00..6.00 rows=301 width=13)
   ->  Seq Scan on public.prt1  (cost=0.00..0.00 rows=1 width=40)
         Output: prt1.a, prt1.b, prt1.c
   ->  Seq Scan on public.prt1_p1  (cost=0.00..2.25 rows=125 width=13)
         Output: prt1_p1.a, prt1_p1.b, prt1_p1.c
   ->  Seq Scan on public.prt1_p3  (cost=0.00..1.50 rows=50 width=13)
         Output: prt1_p3.a, prt1_p3.b, prt1_p3.c
   ->  Seq Scan on public.prt1_p2  (cost=0.00..2.25 rows=125 width=13)
         Output: prt1_p2.a, prt1_p2.b, prt1_p2.c
(9 rows)

Then I did \d+ on each of those to find their ranges:

regression=# \d+ prt1_p1
                             Table "public.prt1_p1"
 Column |       Type        | Modifiers | Storage  | Stats target | Description
--------+-------------------+-----------+----------+--------------+-------------
 a      | integer           |           | plain    |              |
 b      | integer           |           | plain    |              |
 c      | character varying |           | extended |              |
Partition Of: prt1 FOR VALUES START (0) END (250)
Indexes:
    "iprt1_p1_a" btree (a)

regression=# \d+ prt1_p2
                             Table "public.prt1_p2"
 Column |       Type        | Modifiers | Storage  | Stats target | Description
--------+-------------------+-----------+----------+--------------+-------------
 a      | integer           |           | plain    |              |
 b      | integer           |           | plain    |              |
 c      | character varying |           | extended |              |
Partition Of: prt1 FOR VALUES START (250) END (500)
Indexes:
    "iprt1_p2_a" btree (a)

regression=# \d+ prt1_p3
                             Table "public.prt1_p3"
 Column |       Type        | Modifiers | Storage  | Stats target | Description
--------+-------------------+-----------+----------+--------------+-------------
 a      | integer           |           | plain    |              |
 b      | integer           |           | plain    |              |
 c      | character varying |           | extended |              |
Partition Of: prt1 FOR VALUES START (500) END (600)
Indexes:
    "iprt1_p3_a" btree (a)

As you will observe, the table prt1 cannot have any row with a < 0 or a >= 600. But when I execute

regression=# explain verbose select * from prt1 where a > 1000000;
                            QUERY PLAN                           
------------------------------------------------------------------
 Append  (cost=0.00..0.00 rows=1 width=40)
   ->  Seq Scan on public.prt1  (cost=0.00..0.00 rows=1 width=40)
         Output: prt1.a, prt1.b, prt1.c
         Filter: (prt1.a > 1000000)
(4 rows)

it correctly excluded all the partitions, but did not exclude the parent relation. I guess we have enough information to exclude it. Probably, we should add a check constraint on the parent which is the OR of the check constraints on all the partitions. So there are two problems here:

1. \d+ doesn't show partitions - this was probably reported earlier, I don't remember.
2. A combination of the constraints on the partitions should be applicable to the parent. We aren't doing that.
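The second point can be sketched like this (a hypothetical model where
the parent's implicit constraint is the OR of the children's [lo, hi)
ranges on column a; `parent_may_match` is made up for illustration):

```python
# Sketch: combine the partitions' range constraints into an implicit
# parent constraint, then test whether a qual like "a > 1000000" can
# possibly be satisfied by any row the partitioned table admits.
partitions = [(0, 250), (250, 500), (500, 600)]   # prt1_p1..p3: [lo, hi)

def parent_may_match(qual_lo):
    """Can 'a > qual_lo' hold under the OR of the child range constraints?

    True iff some child range [lo, hi) contains an integer > qual_lo.
    """
    return any(hi - 1 > qual_lo for lo, hi in partitions)

print(parent_may_match(1000000))   # False: the qual excludes the parent too
print(parent_may_match(100))       # True: some partitions may hold such rows
```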


On Wed, Aug 10, 2016 at 4:39 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi,

Attached is the latest set of patches to implement declarative
partitioning.  There is already a commitfest entry for the same:
https://commitfest.postgresql.org/10/611/

The old discussion is here:
http://www.postgresql.org/message-id/55D3093C.5010800@lab.ntt.co.jp/

Attached patches are described below:

0001-Catalog-and-DDL-for-partition-key.patch
0002-psql-and-pg_dump-support-for-partitioned-tables.patch

These patches create the infrastructure and DDL for partitioned
tables.

In addition to a catalog for storing the partition key information, this
adds a new relkind to pg_class.h. PARTITION BY clause is added to CREATE
TABLE. Tables so created are RELKIND_PARTITIONED_REL relations which are
to be special in a number of ways, especially with regard to their
interactions with regular table inheritance features.

PARTITION BY RANGE ({ column_name | ( expression ) } [ opclass ] [, ...])
PARTITION BY LIST ({ column_name | ( expression ) } [ opclass ])
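Under this grammar, a partitioned table might be declared as follows (a sketch using the patch's proposed syntax; the table and column names are illustrative):

```sql
-- Range partitioning on a column:
CREATE TABLE measurement (
    city_id   int NOT NULL,
    logdate   date NOT NULL,
    peaktemp  int
) PARTITION BY RANGE (logdate);

-- List partitioning on an expression:
CREATE TABLE orders (
    region  text,
    amount  numeric
) PARTITION BY LIST (lower(region));
```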


0003-Catalog-and-DDL-for-partition-bounds.patch
0004-psql-and-pg_dump-support-for-partitions.patch

These patches create the infrastructure and DDL for partitions.

Parent-child relationships of a partitioned table and its partitions are
managed behind-the-scenes with inheritance.  That means there is a
pg_inherits entry and attributes, constraints, etc. are marked with
inheritance related information appropriately.  However this case differs
from a regular inheritance relationship in a number of ways.  While the
regular inheritance imposes certain restrictions on what elements a
child's schema is allowed to contain (both at creation time and
after-the-fact), the partitioning related code imposes further
restrictions.  For example, while regular inheritance allows a child to
contain its own columns, the partitioning code disallows that.  Stuff like
NO INHERIT marking on check constraints and ONLY are ignored by the
partitioning code.

Partition DDL includes both a way to create a new partition and a way to
"attach" an existing table as a partition of a parent partitioned table.
An attempt to drop a partition using DROP TABLE causes an error; instead,
a partition first needs to be "detached" from the parent partitioned
table.  On the other hand, dropping the parent drops all the partitions
if CASCADE is specified.

CREATE TABLE partition_name
    PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
[ PARTITION BY {RANGE | LIST} ( { column_name | ( expression ) } [ opclass
] [, ...] ) ]

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name
  PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec
  SERVER server_name
[ OPTIONS ( option 'value' [, ... ] ) ]

ALTER TABLE parent ATTACH PARTITION partition_name partition_bound_spec [
VALIDATE | NO VALIDATE ]

ALTER TABLE parent DETACH PARTITION partition_name

partition_bound_spec is:

FOR VALUES { list_spec | range_spec }

list_spec in FOR VALUES is:

IN ( expression [, ...] )

range_spec in FOR VALUES is:

START lower-bound [ INCLUSIVE | EXCLUSIVE ] END upper-bound [ INCLUSIVE |
EXCLUSIVE ]

where lower-bound and upper-bound are:

{ ( expression [, ...] ) | UNBOUNDED }

expression can be a string literal, a numeric literal or NULL.

Note that one can specify PARTITION BY when creating a partition itself;
that allows creating multi-level partitioned tables.
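Putting the above pieces together, creating and managing partitions might look like this (a sketch using the proposed DDL; the measurement example is borrowed from later in the thread and the partition names are made up):

```sql
-- A new partition holding July 2016 data:
CREATE TABLE measurement_y2016m07
    PARTITION OF measurement
    FOR VALUES START ('2016-07-01') INCLUSIVE END ('2016-08-01') EXCLUSIVE;

-- Attach an existing table as the August partition, validating its rows:
ALTER TABLE measurement ATTACH PARTITION measurement_y2016m08
    FOR VALUES START ('2016-08-01') END ('2016-09-01') VALIDATE;

-- Detach it again (required before it can be dropped with DROP TABLE):
ALTER TABLE measurement DETACH PARTITION measurement_y2016m08;

-- A partition that is itself partitioned (multi-level partitioning):
CREATE TABLE measurement_y2016m09
    PARTITION OF measurement
    FOR VALUES START ('2016-09-01') END ('2016-10-01')
    PARTITION BY LIST (city_id);
```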


0005-Teach-a-few-places-to-use-partition-check-constraint.patch

A partition's bound implicitly constrains the values that are allowed in
the partition key of its rows.  The same can be applied to partitions when
inserting data *directly* into them to make sure that only the correct
data is allowed in (if a tuple has been routed from the parent, that
becomes unnecessary). To that end, ExecConstraints() now includes the
above implicit check constraint in the list of constraints it enforces.

Further, to enable constraint based partition exclusion on partitioned
tables, the planner code includes in its list of constraints the above
implicitly defined constraints.  This arrangement is temporary however and
will be rendered unnecessary when we implement special data structures and
algorithms within the planner in future versions of this patch to use
partition metadata more effectively for partition exclusion.

Note that the "constraints" referred to above are not on-disk structures;
they are generated internally on the fly when requested by a caller.
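Concretely, with the implicit bound constraint enforced by ExecConstraints(), a direct insert that violates a partition's bound should be rejected (a sketch using the prt1_p3 partition shown earlier in the thread; the exact error wording is illustrative):

```sql
-- prt1_p3 is bound to FOR VALUES START (500) END (600), so a direct
-- insert outside that range should now fail the implicit check:
INSERT INTO prt1_p3 VALUES (1000, 1, 'x');
-- ERROR (wording illustrative): new row violates the partition
-- boundary condition
```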

0006-Introduce-a-PartitionTreeNode-data-structure.patch
0007-Tuple-routing-for-partitioned-tables.patch

These patches enable routing of tuples inserted into a partitioned table
to one of its leaf partitions.  It applies to both COPY FROM and INSERT.
First of these patches introduces a data structure that provides a
convenient means for the tuple routing code to step down a partition tree
one level at a time.  The second one modifies copy.c and executor to
implement actual tuple routing.  When inserting into a partition, its row
constraints and triggers are applied.  Note that the partition's
constraints also include the constraints defined on the parent.  This
arrangement, however, means that the parent's triggers are not currently
applied.

Updates are handled as they are now for inheritance sets; however, if an
update causes a row to change partitions, an error is thrown.
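With tuple routing in place, inserts through the parent should land in the matching leaf partition, while a cross-partition update raises an error (a sketch against the prt1 example used earlier in the thread; the file path is made up):

```sql
-- Routed through the parent; lands in prt1_p3 since 550 is in [500, 600):
INSERT INTO prt1 VALUES (550, 1, 'x');

-- COPY through the parent routes each row the same way:
COPY prt1 FROM '/tmp/prt1_data.csv' WITH (FORMAT csv);

-- An update that would have to move the row to a different partition
-- is rejected rather than re-routed:
UPDATE prt1 SET a = 100 WHERE a = 550;
```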

0008-Update-DDL-Partitioning-chapter.patch

This patch updates the partitioning section in the DDL chapter to reflect
the new methods made available for creating and managing a partitioned
table and its partitions, especially considering that it is no longer
necessary to define CHECK constraints and triggers/rules manually for
constraint exclusion and tuple routing, respectively.

TODO (in short term):
* Add more regression tests and docs
* Add PartitionOptInfo and use it to perform partition pruning more
effectively (the added infrastructure should also help pairwise joins
patch proposed by Ashutosh Bapat [1])
* Fix internal representation of list partition bounds to be more efficient


Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj%3DEaDTSA%40mail.gmail.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers




--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/09/02 14:38, Ashutosh Bapat wrote:
> Here's something I observed with your set of patches posted in June. I have
> not checked the latest set of patches. So, if it's something fixed, please
> ignore the mail and sorry for me being lazy.
> 
> prt1 is partitioned table and it shows following information with \d+
> 
> regression=# \d+ prt1
>                         Partitioned table "public.prt1"
>  Column |       Type        | Modifiers | Storage  | Stats target |
> Description
> --------+-------------------+-----------+----------+--------------+-------------
>  a      | integer           |           | plain    |              |
>  b      | integer           |           | plain    |              |
>  c      | character varying |           | extended |              |
> Partition Key: PARTITION BY RANGE (a)
> Indexes:
>     "iprt1_a" btree (a)
> 
> Shouldn't we show all the partitions of this table and may be their ranges
> of lists?

Something I thought about as well.  I will implement that.

> I found the partitions from EXPLAIN plan
> 
> regression=# explain verbose select * from prt1;
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Append  (cost=0.00..6.00 rows=301 width=13)
>    ->  Seq Scan on public.prt1  (cost=0.00..0.00 rows=1 width=40)
>          Output: prt1.a, prt1.b, prt1.c
>    ->  Seq Scan on public.prt1_p1  (cost=0.00..2.25 rows=125 width=13)
>          Output: prt1_p1.a, prt1_p1.b, prt1_p1.c
>    ->  Seq Scan on public.prt1_p3  (cost=0.00..1.50 rows=50 width=13)
>          Output: prt1_p3.a, prt1_p3.b, prt1_p3.c
>    ->  Seq Scan on public.prt1_p2  (cost=0.00..2.25 rows=125 width=13)
>          Output: prt1_p2.a, prt1_p2.b, prt1_p2.c
> (9 rows)
> 
> Then did \d+ on each of those to find their ranges

[ ... ]

> 
> As you will observe that the table prt1 can not have any row with a < 0 and
> a > 600. But when I execute
> 
> regression=# explain verbose select * from prt1 where a > 1000000;
>                             QUERY PLAN
> ------------------------------------------------------------------
>  Append  (cost=0.00..0.00 rows=1 width=40)
>    ->  Seq Scan on public.prt1  (cost=0.00..0.00 rows=1 width=40)
>          Output: prt1.a, prt1.b, prt1.c
>          Filter: (prt1.a > 1000000)
> (4 rows)
> 
> it correctly excluded all the partitions, but did not exclude the parent
> relation. I guess, we have enough information to exclude it. Probably, we
> should add a check constraint on the parent which is OR of the check
> constraints on all the partitions. So there are two problems here
> 
> 1. \d+ doesn't show partitions - this is probably reported earlier, I don't
> remember.

You just did, :)

As I said, I will implement that along the lines of how inheritance
children are listed (with additional information, i.e., range or list).

> 2. A combination of constraints on the partitions should be applicable to
> the parent. We aren't doing that.

How about on seeing that a RELOPT_OTHER_MEMBER_REL is partitioned parent
table, we can have get_relation_constraints() include a constant false
clause in the list of constraints returned for
relation_excluded_by_constraints() to process so that it is not included
in the append result by way of constraint exclusion.  One more option is
to mark such rels dummy in set_rel_size().

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
>> 2. A combination of constraints on the partitions should be applicable to
>> the parent. We aren't doing that.
>
> How about on seeing that a RELOPT_OTHER_MEMBER_REL is partitioned parent
> table, we can have get_relation_constraints() include a constant false
> clause in the list of constraints returned for
> relation_excluded_by_constraints() to process so that it is not included
> in the append result by way of constraint exclusion.  One more option is
> to mark such rels dummy in set_rel_size().

I am not complaining about having parent relation there. For the people
who are used to seeing the parent relation in the list of append
relations, it may be awkward. But +1 if we can do that. If we can't do
that, we should at least mark with an OR of all constraints on the
partitions, so that constraint exclusion can exclude it if there are
conditions incompatible with constraints. This is what would happen in
inheritance case as well, if there are constraints on the parent. In the
above example, the parent table would have constraints CHECK ((a >= 0 AND
a < 250) OR (a >= 250 and a < 500) OR (a >= 500 or a < 600)). It will
probably get excluded, if constraint exclusion is smart enough to
understand ORing.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/09/02 15:22, Ashutosh Bapat wrote:
>>
>>
>>> 2. A combination of constraints on the partitions should be applicable to
>>> the parent. We aren't doing that.
>>
>> How about on seeing that a RELOPT_OTHER_MEMBER_REL is partitioned parent
>> table, we can have get_relation_constraints() include a constant false
>> clause in the list of constraints returned for
>> relation_excluded_by_constraints() to process so that it is not included
>> in the append result by way of constraint exclusion.  One more option is
>> to mark such rels dummy in set_rel_size().
>>
>>
> I am not complaining about having parent relation there. For the people who
> are used to seeing the parent relation in the list of append relations, it
> may be awkward. But +1 if we can do that. If we can't do that, we should at
> least mark with an OR of all constraints on the partitions, so that
> constraint exclusion can exclude it if there are conditions incompatible
> with constraints. This is what would happen in inheritance case as well, if
> there are constraints on the parent. In the above example, the parent table
> would have constraints CHECK ((a >= 0 AND a < 250) OR (a >= 250 and a <
> 500) OR (a >= 500 or a < 600)). It will probably get excluded, if
> constraint exclusion is smart enough to understand ORing.

I guess constraint exclusion would be (is) smart enough to handle that
correctly but why make it unnecessarily spend a *significant* amount of
time on doing the proof (when we *know* we can just skip it).  Imagine how
long the OR list could get.  By the way, my suggestion of just returning a
constant false clause also does not work - neither in case of a query with
restrictions (predicate has to be an OpExpr to go ahead with the proof
which constant false clause is not) nor in case of a query without
restrictions (proof is not performed at all).  So, that one is useless.

Getting rid of the parent table in the append list by other means may be a
way to go.  We know that the table is empty and safe to just drop.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:


On Fri, Sep 2, 2016 at 12:23 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/09/02 15:22, Ashutosh Bapat wrote:
>>
>>
>>> 2. A combination of constraints on the partitions should be applicable to
>>> the parent. We aren't doing that.
>>
>> How about on seeing that a RELOPT_OTHER_MEMBER_REL is partitioned parent
>> table, we can have get_relation_constraints() include a constant false
>> clause in the list of constraints returned for
>> relation_excluded_by_constraints() to process so that it is not included
>> in the append result by way of constraint exclusion.  One more option is
>> to mark such rels dummy in set_rel_size().
>>
>>
> I am not complaining about having parent relation there. For the people who
> are used to seeing the parent relation in the list of append relations, it
> may be awkward. But +1 if we can do that. If we can't do that, we should at
> least mark with an OR of all constraints on the partitions, so that
> constraint exclusion can exclude it if there are conditions incompatible
> with constraints. This is what would happen in inheritance case as well, if
> there are constraints on the parent. In the above example, the parent table
> would have constraints CHECK ((a >= 0 AND a < 250) OR (a >= 250 and a <
> 500) OR (a >= 500 or a < 600)). It will probably get excluded, if
> constraint exclusion is smart enough to understand ORing.

I guess constraint exclusion would be (is) smart enough to handle that
correctly but why make it unnecessarily spend a *significant* amount of
time on doing the proof (when we *know* we can just skip it).  Imagine how
long the OR list could get.  By the way, my suggestion of just returning a
constant false clause also does not work - neither in case of a query with
restrictions (predicate has to be an OpExpr to go ahead with the proof
which constant false clause is not) nor in case of a query without
restrictions (proof is not performed at all).  So, that one is useless.

Huh!
 

Getting rid of the parent table in the append list by other means may be a
way to go.  We know that the table is empty and safe to just drop.


Ok. Does a constraint (1 = 2) or something like that which is obviously false, help?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/09/02 15:57, Ashutosh Bapat wrote:
> On Fri, Sep 2, 2016 at 12:23 PM, Amit Langote wrote:
>> Getting rid of the parent table in the append list by other means may be a
>> way to go.  We know that the table is empty and safe to just drop.
>>
> Ok. Does a constraint (1 = 2) or something like that which is obviously
> false, help?

No, anything like that would get reduced to a constant false clause by
eval_const_expressions() processing.

Thanks,
Amit






Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
I found a server crash while running make check in the regress folder with this set of patches. The problem is that partexprsrc may be used uninitialized in RelationBuildPartitionKey(); initializing it to NIL fixes the crash. Here's a patch to fix it. Came up with the fix after discussion with Amit.

On Fri, Aug 26, 2016 at 1:45 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Sorry it took me a while to reply.  Attached are updated patches
incorporating the review comments on 0001 at [1].

On 2016/08/17 3:54, Robert Haas wrote:
> On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> 0002-psql-and-pg_dump-support-for-partitioned-tables.patch
>
> +    if (pset.sversion >= 90600 && tableinfo.relkind == 'P')
>
> Version check is redundant, right?

Yep, fixed.

> +) PARTITION BY RANGE ((a+b));
> +\d describe_range_key
> +Partitioned table "public.describe_range_key"
> + Column |  Type   | Modifiers
> +--------+---------+-----------
> + a      | integer |
> + b      | integer |
> +Partition Key: PARTITION BY RANGE (((a + b)))
>
> I understand that it's probably difficult not to end up with two sets
> of parentheses here, but can we avoid ending up with three sets?

Fixed.  This code was copy-pasted from the index code which has other
considerations for adding surrounding parentheses which don't apply to the
partition key code.

> Also, I wonder if pg_get_partkeydef() should omit "PARTITION BY" and
> pg_dump can add that part back.  Then this could say:
>
> Partition Key: RANGE ((a + b))
>
> ...which seems a good deal more natural than what you have now.

Agreed, so done that way.

>> 0003-Catalog-and-DDL-for-partition-bounds.patch
>>
>> Partition DDL includes both a way to create new partition and "attach" an
>> existing table as a partition of parent partitioned table.  Attempt to
>> drop a partition using DROP TABLE causes an error. Instead a partition
>> needs first to be "detached" from parent partitioned table.  On the other
>> hand, dropping the parent drops all the partitions if CASCADE is specified.
>
> Hmm, I don't think I like this.  Why should it be necessary to detach
> a partition before dropping it?  That seems like an unnecessary step.

I thought we had better lock the parent table when removing one of its
partitions, and it seemed a bit odd to lock the parent table when dropping
a partition using DROP TABLE.  OTOH, with ALTER TABLE parent DETACH
PARTITION, the parent table is locked anyway.

>      [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY
> IMMEDIATE ]
> +
>  </synopsis>
>
> Unnecessary hunk.
>
> +     <para>
> +      If this table is a partition, one cannot perform <literal>DROP
> NOT NULL</>
> +      on a column if it is marked not null in the parent table.
> +      not null.
> +     </para>
>
> Sentence fragment.

Fixed.

>
> +     <para>
> +      Note that unlike the <literal>ATTACH PARTITION</> command, a partition
> +      being detached can be itself partitioned.  In that case, it continues
> +      to exist as such.
> +     </para>
>
> This is another restriction I don't understand.  Why can't I attach a
> partitioned table?

I removed this restriction.

ATExecAttachPartition() adds an AT work queue entry for the table being
attached to perform a heap scan in the rewrite phase.  The scan is done
to check that no row violates the partition boundary condition.  When
attaching a partitioned table as a partition, multiple AT work queue
entries are now added - one for each leaf partition of the table being
attached.

> +        indicate that descendant tables are included.  Note that whether
> +        <literal>ONLY</> or <literal>*</> is specified has no effect in case
> +        of a partitioned table; descendant tables (in this case, partitions)
> +        are always included.
>
> Ugh, why?  I think this should work exactly the same way for
> partitioned tables that it does for any other inheritance hierarchy.
> Sure, you'll get no rows, but who cares?

Agreed, done that way.

> +CREATE FOREIGN TABLE measurement_y2016m07
> +    PARTITION OF measurement FOR VALUES START ('2016-07-01') END
> ('2016-08-01');
> +    SERVER server_07;
>
> Extra semicolon?

Fixed.

>
> +      A partition cannot have columns other than those inherited from the
> +      parent.  That includes the <structfield>oid</> column, which can be
>
> I think experience suggests that this is a good restriction, but then
> why does the syntax synopsis indicate that PARTITION BY can be
> specified along with column definitions?  Similarly for CREATE FOREIGN
> TABLE.

The syntax synopsis of CREATE TABLE ... PARTITION OF indicates that a list
of column WITH OPTION and/or table_constraint can be specified.  It does
not allow column definitions.

In this case, inherited columns will be listed in the PARTITION BY clause.

Do you mean that the CREATE TABLE ... PARTITION OF syntax should allow
column definitions just like INHERITS does and unlike regular inheritance,
throw an error if columns other than those to be merged are found?

>
> +      When specifying for a table being created as partition, one needs to
> +      use column names from the parent table as part of the key.
>
> This is not very clear.

Sentence removed because it may be clear from the context that inherited
columns are to be used for the partition key.

> -       /* Remove NO INHERIT flag if rel is a partitioned table */
> -       if (relid_is_partitioned(relid))
> +       /* Discard NO INHERIT, if relation is a partitioned table or a
> partition */
> +       if (relid_is_partitioned(relid) || relid_is_partition(relid))
>                 is_no_inherit = false;
>
> It might be right to disallow NO INHERIT in this case, but I don't
> think it can be right to just silently ignore it.

OK, I changed this to instead throw an error if a NO INHERIT check
constraint is added to a partitioned table or a partition.

> + * Not flushed from the cache by RelationClearRelation() unless changed because
> + * of addition or removal of partitions.
>
> This seems unlikely to be safe, unless I'm missing something.

Like TupleDesc, a table's PartitionDesc is preserved across relcache
rebuilds.  PartitionDesc consists of arrays of OIDs and partition bounds
for a table's immediate partitions.  It can only change by adding/removing
partitions to/from the table which requires an exclusive lock on it.
Since this data can grow arbitrarily big it seemed better to not have to
copy it around, so a direct pointer to the relcache field (rd_partdesc) is
given to callers.  relcache rebuilds that do not logically change
PartitionDesc leave it intact so that some user of it is not left with a
dangling pointer.  Am I missing something?

>
> +       form = (Form_pg_inherits) GETSTRUCT(tuple);
> +
> +       systable_endscan(scan);
> +       heap_close(catalogRelation, AccessShareLock);
> +
> +       return form->inhparent;
>
> This is unsafe.  After systable_endscan, it is no longer OK to access
> form->inhparent.
>
> Try building with CLOBBER_CACHE_ALWAYS to find other cache flush hazards.
>
> There should probably be a note in the function header comment that it
> is unsafe to use this for an inheritance child that is not a
> partition, because there could be more than one parent in that case.
> Or maybe the whole idea of this function just isn't very sound...

Fixed unsafe coding and added a comment to the function saying it should
be called on tables known to be partitions.

>
> +static List *
> +get_partitions(Oid relid, int lockmode)
> +{
> +       return find_inheritance_children(relid, lockmode);
> +}
>
> What's the point?  If we're going to have a wrapper here at all, then
> shouldn't it have a name that matches the existing convention - e.g.
> find_partitions() or find_child_partitions()?  But I think you might
> as well just use find_inheritance_children() directly.

OK, changed to just use find_inheritance_children() directly.

>
> +                * Happens when we have created the pg_inherits entry
> but not the
> +                * pg_partition entry yet.
>
> Why do we ever allow the flow of control to reach this point while we
> are in such an intermediate state?

Fixed to prevent this from happening.

>
> +free_partition_info(PartitionInfoData **p, int num)
>
> Seems very error-prone.  Isn't this why MemoryContextReset was invented?

Got rid of this function.

>
> +relid_is_partition(Oid relid)
> +{
> +       return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
> +}
>
> This is used in a lot of places, and the overhead of checking it in
> all of those places is not necessarily nil.  Syscache lookups aren't
> free.  What if we didn't create a new catalog for this and instead
> just added a relpartitionbound attribute to pg_class?  It seems a bit
> silly to have a whole extra catalog to store one extra column...

OK, I got rid of the pg_partition catalog and added a pg_class attribute
for storing partition bound.  Because the newly added attribute is a
pg_node_tree and hence not readily accessible without a heap_getattr()
call, I added a boolean relispartition as well. Now all the
relid_is_partition() calls have been replaced by checks using
relation->rd_rel->relispartition.
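With that change, testing partition-hood becomes a cheap flag check rather than a syscache probe; from SQL, the flag could be inspected like this (a sketch assuming the patched catalog column):

```sql
-- relispartition is a plain boolean column, so no pg_node_tree
-- decoding is needed just to test whether a relation is a partition:
SELECT relname, relispartition
FROM pg_class
WHERE relname LIKE 'prt1%';
```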

>
>         /*
> +        * If this foreign table is a partition, check that the FDW supports
> +        * insert.
> +        */
> +       if (stmt->base.partbound != NULL)
> +       {
> +               FdwRoutine *fdw_routine;
> +
> +               fdw_routine = GetFdwRoutine(fdw->fdwhandler);
> +               if (fdw_routine->ExecForeignInsert == NULL)
> +                       ereport(ERROR,
> +                                       (errcode(ERRCODE_FDW_NO_SCHEMAS),
> +                                        errmsg("cannot create foreign
> table as partition"),
> +                                        errdetail("foreign-data
> wrapper \"%s\" does not support insert",
> +                                                       fdw->fdwname)));
> +       }
>
> Why?  This seems like an entirely arbitrary prohibition.  If inserts
> aren't supported, then they'll fail at runtime.  Same for updates or
> deletes, for which you have not added checks.  I think you should just
> remove this.

OK, removed this check.

Instead, ExecInitModifyTable()/BeginCopy() call CheckValidResultRel() for
every leaf partition to check that they are ready for CMD_INSERT.

>
> +               /* Force inheritance recursion, if partitioned table. */
> +               if (recurse || relid_is_partitioned(myrelid))
>
> Disagree with this, too.  There's no reason for partitioned tables to
> be special in this way.  Similarly, disagree with all of the places
> that do something similar.

Removed the forced recursion bit here and a few other places.

>
> -                               errmsg("column \"%s\" in child table
> must be marked NOT NULL",
> -                                          attributeName)));
> +                               errmsg("column \"%s\" in %s table must
> be marked NOT NULL",
> +                                          attributeName,
> +                                          is_attach_partition ?
> "source" : "child")));
>
> You have a few of these; they cause problems for translators, because
> different languages have different word ordering.  Repeat the entire
> message instead: is_attach_partition ? "column \"%s\" in source table
> must be marked NOT NULL" : "column \"%s\" in child table must be
> marked NOT NULL".

Changed so that the entire message is repeated.

>
> +-- XXX add ownership tests
>
> So do that.  :-)

Done, sorry about that.

>
> +ERROR:  column "b" is not null in parent
> +HINT:  Please drop not null in the parent instead
>
> Hmm.   That hint doesn't seem like project style, and I'm not sure
> that it really makes sense to issue such a hint anyway.  Who knows
> whether that is the right thing to do?  I think you should somehow be
> complaining about the fact that this is a partition, rather than
> complaining about the fact that the column is NOT NULL in the parent.
> Are we insisting that the flags match exactly, or only that the child
> may not allow nulls unless the parent does?

I think the latter.  It is assumed that all the parent's constraints are
present in a child table, because in ExecInsert()/CopyFrom() we perform
ExecConstraints() using the child relation even if the actual insert was
on the parent.  Also, if an inherited NOT NULL constraint on a child's
column is dropped irrespective of parent's, selecting a parent's NOT NULL
column might return nulls from the child table that no longer has the
constraint.

I recently came across a related proposal whereby dropping an *inherited*
NOT NULL from child tables would be prevented.  Problems in letting it be
dropped are mentioned here:

https://www.postgresql.org/message-id/21633.1448383428@sss.pgh.pa.us

That proposal is probably being reworked such that NOT NULL constraints
get a pg_constraint entry with proper accounting of inheritance count.

> +ERROR:  new partition's list of values overlaps with partition
> "lpart1" of "list_parted"
>
> Maybe just:
>
> ERROR: partitions must not overlap
> -or-
> ERROR: partition "%s" would overlap partition "%s"

OK, I changed to the second message.

>
> As before, this is just an initial read-through, so apologies for
> whatever I may have missed.

Thanks a lot for the review.

By the way, I am still working on the following items, which will be
included in the next version of the patch.

* Fix internal representation of list partition bounds to be more efficient
* Add PartitionOptInfo

As mentioned in [2], I have refactored the inheritance expansion code
within optimizer so that a partitioned table's inheritance hierarchy is
preserved in resulting AppendRelInfos (patch 0005).  One immediate benefit
of that is that if constraint exclusion determines that an intermediate
partition is to be excluded then all partitions underneath it are excluded
automatically.  With the current flattened model, all partitions in that
subtree would have been processed and excluded one-by-one.

Regards,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZ008qTgd_Qg6_oZb3i0mOYrS6MdhncwgcqPKahixjarg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/f2a9592a-17e9-4c6a-e021-03b802195ce7%40lab.ntt.co.jp


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers




--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment

Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi,

On 2016/09/06 16:13, Ashutosh Bapat wrote:
> I found a server crash while running make check in regress folder. with
> this set of patches. Problem is RelationBuildPartitionKey() partexprsrc may
> be used uninitialized. Initializing it with NIL fixes the crash. Here's
> patch to fix it. Came up with the fix after discussion with Amit.

Thanks for the report.  Here is a rebased version of the patches including
your fix (no significant changes from those posted on Aug 26).

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:
Hi,

I have applied the updated patches given by you, and observed the following.

In the given example, the t6_p3 partition is not allowed to contain null, but I am able to insert one, causing two nulls in the table.

--create a partition table
create table t6 (a int, b varchar) partition by list(a);
create table t6_p1 partition of t6 for values in (1,2,null);
create table t6_p2 partition of t6 for values in (4,5);
create table t6_p3 partition of t6 for values in (3,6);

--insert some values
insert into t6 select i,i::varchar from generate_series(1,6) i;
insert into t6 values (null,'A');

--try inserting null to t6_p3 partition table
insert into t6_p3 values (null,'A');

select tableoid::regclass,* from t6;
 tableoid | a | b
----------+---+---
 t6_p1    | 1 | 1
 t6_p1    | 2 | 2
 t6_p1    |   | A
 t6_p2    | 4 | 4
 t6_p2    | 5 | 5
 t6_p3    | 3 | 3
 t6_p3    | 6 | 6
 t6_p3    |   | A
(8 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Aug 31, 2016 at 1:05 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> However, it seems a lot better to make it a property of the parent
>> from a performance point of view.  Suppose there are 1000 partitions.
>> Reading one toasted value for pg_class and running stringToNode() on
>> it is probably a lot faster than scanning pg_inherits to find all of
>> the child partitions and then doing an index scan to find the pg_class
>> tuple for each and then decoding all of those tuples and assembling
>> them into some data structure.
>
> Seems worth trying.  One point that bothers me a bit is how we enforce
> the partition bound condition on a per-partition basis.  For example, when
> a row is inserted into a partition directly, we had better check that it
> does not fall outside the bounds and issue an error otherwise.  With the
> current approach, we just look up a partition's bound from the catalog and
> gin up a check constraint expression (and cache it in the relcache) to be
> enforced in ExecConstraints().  With the new approach, I guess we would
> need to look up the parent's partition descriptor.  Note that the checking
> in ExecConstraints() is turned off when routing a tuple from the parent.

[ Sorry for the slow response. ]

Yeah, that's a problem.  Maybe it's best to associate this data with
the childrels after all - or halfway in between, e.g. augment
pg_inherits with this information.  After all, the performance problem
I was worried about above isn't really much of an issue: each backend
will build a relcache entry for the parent just once and then use it
for the lifetime of the session unless some invalidation occurs.  So
if that takes a small amount of extra time, it's probably not really a
big deal.  On the other hand, if we can't build the implicit
constraint for the child table without opening the parent, that's
probably going to cause us some serious inconvenience.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
This patch uses a RangeBound structure.  There's also a structure with the same name defined in rangetypes.h, with some slight differences.  Should we rename the one in partition.c to PartRangeBound or something like that to avoid confusion?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi,

On Tue, Sep 6, 2016 at 8:15 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> Hi,
>
> I have applied updated patches given by you, and observe below.
>
> here in the given example, t6_p3 partition is not allowed to have null, but
> I am able to insert it, causing two nulls in the table.
>
> --create a partition table
> create table t6 (a int, b varchar) partition by list(a);
> create table t6_p1 partition of t6 for values in (1,2,null);
> create table t6_p2 partition of t6 for values in (4,5);
> create table t6_p3 partition of t6 for values in (3,6);
>
> --insert some values
> insert into t6 select i,i::varchar from generate_series(1,6) i;
> insert into t6 values (null,'A');
>
> --try inserting null to t6_p3 partition table
> insert into t6_p3 values (null,'A');
>
> select tableoid::regclass,* from t6;
>  tableoid | a | b
> ----------+---+---
>  t6_p1    | 1 | 1
>  t6_p1    | 2 | 2
>  t6_p1    |   | A
>  t6_p2    | 4 | 4
>  t6_p2    | 5 | 5
>  t6_p3    | 3 | 3
>  t6_p3    | 6 | 6
>  t6_p3    |   | A
> (8 rows)

Thanks for testing!

That looks like a bug.  The same won't occur if you insert through the
parent (the row would be correctly routed to the partition that has been
defined to accept nulls, that is, t6_p1 in your example).  The bug seems
to be in the handling of nulls when generating implicit constraints from
the list of values of a given list partition.  The direct insert into
t6_p3 should have failed because the partition key has a value (null)
violating the partition boundary condition.  Will fix.
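
To spell out the intended behavior: the implicit constraint for a list
partition must handle NULL explicitly, because a bare IN list evaluates
to NULL (and hence passes a CHECK) for a NULL key.  Roughly (a sketch of
the intended constraints, not what the current patch produces):

-- for t6_p1, whose value list includes NULL:
--   CHECK (a IS NULL OR a IN (1, 2))
-- for t6_p3, whose value list does not:
--   CHECK (a IS NOT NULL AND a IN (3, 6))

With the latter, the direct insert of (null, 'A') into t6_p3 would fail
as expected.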

Thanks,
Amit



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On Tue, Sep 6, 2016 at 9:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 31, 2016 at 1:05 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> However, it seems a lot better to make it a property of the parent
>>> from a performance point of view.  Suppose there are 1000 partitions.
>>> Reading one toasted value for pg_class and running stringToNode() on
>>> it is probably a lot faster than scanning pg_inherits to find all of
>>> the child partitions and then doing an index scan to find the pg_class
>>> tuple for each and then decoding all of those tuples and assembling
>>> them into some data structure.
>>
>> Seems worth trying.  One point that bothers me a bit is how we enforce
>> the partition bound condition on a per-partition basis.  For example, when
>> a row is inserted into a partition directly, we had better check that it
>> does not fall outside the bounds and issue an error otherwise.  With the
>> current approach, we just look up a partition's bound from the catalog and
>> gin up a check constraint expression (and cache it in the relcache) to be
>> enforced in ExecConstraints().  With the new approach, I guess we would
>> need to look up the parent's partition descriptor.  Note that the checking
>> in ExecConstraints() is turned off when routing a tuple from the parent.
>
> [ Sorry for the slow response. ]
>
> Yeah, that's a problem.  Maybe it's best to associate this data with
> the childrels after all - or halfway in between, e.g. augment
> pg_inherits with this information.  After all, the performance problem
> I was worried about above isn't really much of an issue: each backend
> will build a relcache entry for the parent just once and then use it
> for the lifetime of the session unless some invalidation occurs.  So
> if that takes a small amount of extra time, it's probably not really a
> big deal.  On the other hand, if we can't build the implicit
> constraint for the child table without opening the parent, that's
> probably going to cause us some serious inconvenience.

Agreed.  So I will stick with the existing approach.

Thanks,
Amit



Re: Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:
Hi,

I have a question regarding list partitioning.

For example, suppose I want to store employee data in a table, with "IT" dept employees in the emp_p1 partition, "HR" dept employees in the emp_p2 partition, and employees belonging to neither of these two in the emp_p3 partition.

In this case I am not sure how to create the partitions.  Do we have something like UNBOUNDED for range partitioning, or like Oracle's DEFAULT for list partitioning?

create table employee (empid int, dept varchar) partition by list(dept);
create table emp_p1 partition of employee for values in ('IT');
create table emp_p2 partition of employee for values in ('HR');
create table emp_p3 partition of employee for values in (??);

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi,

On 2016/09/07 17:56, Rajkumar Raghuwanshi wrote:
> Hi,
> 
> I have a query regarding list partitioning,
> 
> For example if I want to store employee data in a table, with "IT" dept
> employee in emp_p1 partition, "HR" dept employee in emp_p2 partition and if
> employee belongs to other than these two, should come in emp_p3 partition.
> 
> In this case not sure how to create partition table. Do we have something
> like we have UNBOUNDED for range partition or oracle have "DEFAULT" for
> list partition.
> 
> create table employee (empid int, dept varchar) partition by list(dept);
> create table emp_p1 partition of employee for values in ('IT');
> create table emp_p2 partition of employee for values in ('HR');
> create table emp_p3 partition of employee for values in (??);

Sorry, no such feature is currently offered.  It might be possible to
offer something like a "default" list partition which accepts values other
than those specified for the other existing partitions.  However, that
means that if we add a non-default list partition after a default one has
been created, the implementation must make sure to move any rows from the
default partition whose keys now belong to the newly created partition.
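
For illustration only, such a feature might look like this (hypothetical
syntax, not implemented by the posted patches):

create table emp_p3 partition of employee for values in (default);
create table emp_p4 partition of employee for values in ('ACCT');
-- creating emp_p4 would require the implementation to do the
-- equivalent of:
--   insert into emp_p4 select * from emp_p3 where dept = 'ACCT';
--   delete from emp_p3 where dept = 'ACCT';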

Thanks,
Amit





Re: Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:

Thanks for clarifying.  But I can see the same problem of moving data when adding a new non-default partition alongside an unbounded range partition.

In the example given here, I initially create a partitioned table test with test_p3 having an unbounded end.
Later I try to change test_p3 to contain the values 7-9 only, and to add a new partition test_p4 containing 10 to unbounded.

--create partition table and some leafs
CREATE TABLE test (a int, b int) PARTITION BY RANGE(a);
CREATE TABLE test_p1 PARTITION OF test FOR VALUES START (1) END (4);
CREATE TABLE test_p2 PARTITION OF test FOR VALUES START (4) END (7);
CREATE TABLE test_p3 PARTITION OF test FOR VALUES START (7) END UNBOUNDED;

--insert some data
INSERT INTO test SELECT i, i*10 FROM generate_series(1,3) i;
INSERT INTO test SELECT i, i*10 FROM generate_series(4,6) i;
INSERT INTO test SELECT i, i*10 FROM generate_series(7,13) i;

--cannot attach test_p4 directly because of an overlap error, hence detach test_p3 and then attach test_p4
SELECT tableoid::regclass,* FROM test;
 tableoid | a  |  b 
----------+----+-----
 test_p1  |  1 |  10
 test_p1  |  2 |  20
 test_p1  |  3 |  30
 test_p2  |  4 |  40
 test_p2  |  5 |  50
 test_p2  |  6 |  60
 test_p3  |  7 |  70
 test_p3  |  8 |  80
 test_p3  |  9 |  90
 test_p3  | 10 | 100
 test_p3  | 11 | 110
 test_p3  | 12 | 120
 test_p3  | 13 | 130
(13 rows)

ALTER TABLE test DETACH PARTITION test_p3;
CREATE TABLE test_p4 (like test);
ALTER TABLE test ATTACH PARTITION test_p4 FOR VALUES start (10) end UNBOUNDED;

--now test_p3 cannot be attached back because of overlap with test_p4, leaving the main test table without its rows
ALTER TABLE test ATTACH PARTITION test_p3 FOR VALUES start (7) end (10);
ERROR:  source table contains a row violating partition bound specification
ALTER TABLE test ATTACH PARTITION test_p3 FOR VALUES start (7) end (13);
ERROR:  partition "test_p3" would overlap partition "test_p4"

SELECT tableoid::regclass,* FROM test;
 tableoid | a | b 
----------+---+----
 test_p1  | 1 | 10
 test_p1  | 2 | 20
 test_p1  | 3 | 30
 test_p2  | 4 | 40
 test_p2  | 5 | 50
 test_p2  | 6 | 60
(6 rows)
 

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/09/08 21:38, Rajkumar Raghuwanshi wrote:
> On Wed, Sep 7, 2016 at 3:58 PM, Amit Langote wrote:
>> On 2016/09/07 17:56, Rajkumar Raghuwanshi wrote:
>>>
>>> In this case not sure how to create partition table. Do we have something
>>> like we have UNBOUNDED for range partition or oracle have "DEFAULT" for
>>> list partition.
>>>
>>> create table employee (empid int, dept varchar) partition by list(dept);
>>> create table emp_p1 partition of employee for values in ('IT');
>>> create table emp_p2 partition of employee for values in ('HR');
>>> create table emp_p3 partition of employee for values in (??);
>>
>> Sorry, no such feature is currently offered.  It might be possible to
>> offer something like a "default" list partition which accepts values other
>> than those specified for other existing partitions.  However, that means
>> if we add a non-default list partition after a default one has been
>> created, the implementation must make sure that it moves any values from
>> the default partition that now belong to the newly created partition.
> 
> Thanks for clarifying, But I could see same problem of moving data when
> adding a new non-default partition with unbounded range partition.
> 
> For example give here, Initially I have create a partition table test with
> test_p3 as unbounded end,
> Later tried to change test_p3 to contain 7-9 values only, and adding a new
> partition test_p4 contain 10-unbound.
> 
> --create partition table and some leafs
> CREATE TABLE test (a int, b int) PARTITION BY RANGE(a);
> CREATE TABLE test_p1 PARTITION OF test FOR VALUES START (1) END (4);
> CREATE TABLE test_p2 PARTITION OF test FOR VALUES START (4) END (7);
> CREATE TABLE test_p3 PARTITION OF test FOR VALUES START (7) END UNBOUNDED;
> 
> --insert some data
> INSERT INTO test SELECT i, i*10 FROM generate_series(1,3) i;
> INSERT INTO test SELECT i, i*10 FROM generate_series(4,6) i;
> INSERT INTO test SELECT i, i*10 FROM generate_series(7,13) i;
> 
> --directly not able to attach test_p4 because of overlap error, hence
> detached test_p3 and than attaching test_p4
> SELECT tableoid::regclass,* FROM test;
>  tableoid | a  |  b
> ----------+----+-----
>  test_p1  |  1 |  10
>  test_p1  |  2 |  20
>  test_p1  |  3 |  30
>  test_p2  |  4 |  40
>  test_p2  |  5 |  50
>  test_p2  |  6 |  60
>  test_p3  |  7 |  70
>  test_p3  |  8 |  80
>  test_p3  |  9 |  90
>  test_p3  | 10 | 100
>  test_p3  | 11 | 110
>  test_p3  | 12 | 120
>  test_p3  | 13 | 130
> (13 rows)
> 
> ALTER TABLE test DETACH PARTITION test_p3;
> CREATE TABLE test_p4 (like test);
> ALTER TABLE test ATTACH PARTITION test_p4 FOR VALUES start (10) end
> UNBOUNDED;
> 
> --now can not attach test_p3 because of overlap with test_p4, causing data
> loss from main test table.
> ALTER TABLE test ATTACH PARTITION test_p3 FOR VALUES start (7) end (10);
> ERROR:  source table contains a row violating partition bound specification
> ALTER TABLE test ATTACH PARTITION test_p3 FOR VALUES start (7) end (13);
> ERROR:  partition "test_p3" would overlap partition "test_p4"

In this particular case, you will have to move any rows in test_p3 with
key >= 10 into the new partition test_p4 using DML (so as not to lose any
data).  Then attach test_p3 as the partition for values start (7) end
(10); you won't get either of the above errors.
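
Spelled out against your example (the detached test_p3 still holds the
rows with a >= 10):

insert into test_p4 select * from test_p3 where a >= 10;
delete from test_p3 where a >= 10;
alter table test attach partition test_p3 for values start (7) end (10);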

Looking ahead, what we need, I think, is a split partition command.
Adding a new partition that overlaps the default list partition or an
unbounded range partition could then be done by splitting the latter.
Perhaps something like:

alter table test split partition test_p3 at (10) [inclusive | exclusive] with test_p4;

The above command would turn test_p3 into 2 partitions:

 test_p3 start (7) end (10) [inclusive | exclusive]
 test_p4 start (10) [exclusive | inclusive] end unbounded

Any rows in test_p3 with key > or >= 10 would be moved into the newly
created test_p4 as part of the execution of this command.

For your list partitioning example:

create table employee (empid int, dept varchar) partition by list(dept);
create table emp_p1 partition of employee for values in ('IT');
create table emp_p2 partition of employee for values in ('HR');
create table emp_p3 partition of employee for values in (default);

alter table employee split partition emp_p3 with emp_p3 ('ACCT') emp_p4 (default);

Any rows in emp_p3 with key != 'ACCT' will be moved into the newly created
default partition emp_p4.


But for the time being, I think we could provide the syntax and mechanism
for a default list partition, seeing as we have the same for range
partitioned tables (namely, a range partition with an unbounded start or
end), although with the limitations discussed above.

Thoughts?

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/09/06 22:04, Amit Langote wrote:
> Will fix.

Here is an updated set of patches.

In addition to fixing a couple of bugs reported by Ashutosh and Rajkumar,
there are a few major changes:

* changed the way individual partition bounds are represented internally
  and the way the collection of partition bounds associated with a
  partitioned table is exposed to other modules; especially list
  partition bounds, which are now manipulated more efficiently, as
  discussed at [1].

* \d partitioned_table now shows partition count and \d+ lists partition
  names and their bounds as follows:

\d t6
           Table "public.t6"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 a      | integer           |
 b      | character varying |
Partition Key: LIST (a)
Number of partitions: 3 (Use \d+ to list them.)

\d+ t6
                               Table "public.t6"
 Column |       Type        | Modifiers | Storage  | Stats target | Description
--------+-------------------+-----------+----------+--------------+-------------
 a      | integer           |           | plain    |              |
 b      | character varying |           | extended |              |
Partition Key: LIST (a)
Partitions: t6_p1 FOR VALUES IN (1, 2, NULL),
            t6_p2 FOR VALUES IN (4, 5),
            t6_p3 FOR VALUES IN (3, 6)

\d+ p
                             Table "public.p"
 Column |     Type     | Modifiers | Storage  | Stats target | Description
--------+--------------+-----------+----------+--------------+-------------
 a      | integer      |           | plain    |              |
 b      | character(1) |           | extended |              |
Partition Key: RANGE (a)
Partitions: p1 FOR VALUES START (1) END (10),
            p2 FOR VALUES START (10) END (20),
            p3 FOR VALUES START (20) END (30),
            p4 FOR VALUES START (30) EXCLUSIVE END (40) INCLUSIVE,
            p5 FOR VALUES START (40) EXCLUSIVE END (50),
            p6 FOR VALUES START (50) END UNBOUNDED

* Some more regression tests

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZCr0-t93KgJA3T1uy9yWxfYaSYL3X35ObyHg%2BZUfERqQ%40mail.gmail.com

Attachment

Re: Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:
I have continued testing declarative partitioning with the latest patch.  I got some more observations, given below.

-- Observation 1 : Getting an overlap error with an EXCLUSIVE START bound in range partitioning.

create table test_range_bound ( a int) partition by range(a);
--creating a partition to contain the records {1,2,3,4}; by default the start (1) is inclusive and the end (5) is exclusive
create table test_range_bound_p1 partition of test_range_bound for values start (1) end (5);
--now trying to create a partition with an explicitly EXCLUSIVE start to contain the records {5,6,7}: START (4) EXCLUSIVE END (8) should cover 5 to 7, but we get a partition overlap error
create table test_range_bound_p2 partition of test_range_bound for values start (4) EXCLUSIVE end (8);
ERROR:  partition "test_range_bound_p2" would overlap partition "test_range_bound_p1"

-- Observation 2 : A sub-partition can be created with a range outside the one set for the parent, making it impossible to insert data that satisfies both levels of partitioning.

create table test_subpart (c1 int) partition by range (c1);
create table test_subpart_p1 partition of test_subpart for values start (1) end (100) inclusive partition by range (c1);
create table test_subpart_p1_sub1 partition of test_subpart_p1 for values start (101) end (200);

\d+ test_subpart
                     Table "public.test_subpart"
 Column |  Type   | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
 c1     | integer |           | plain   |              |
Partition Key: RANGE (c1)
Partitions: test_subpart_p1 FOR VALUES START (1) END (100) INCLUSIVE

\d+ test_subpart_p1
                   Table "public.test_subpart_p1"
 Column |  Type   | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
 c1     | integer |           | plain   |              |
Partition Of: test_subpart FOR VALUES START (1) END (100) INCLUSIVE
Partition Key: RANGE (c1)
Partitions: test_subpart_p1_sub1 FOR VALUES START (101) END (200)

insert into test_subpart values (50);
ERROR:  no partition of relation "test_subpart_p1" found for row
DETAIL:  Failing row contains (50).
insert into test_subpart values (150);
ERROR:  no partition of relation "test_subpart" found for row
DETAIL:  Failing row contains (150).

-- Observation 3 : Getting "cache lookup failed" when selecting from a list-partitioned table whose partition key is an array column.

CREATE TABLE test_array ( i int,j int[],k text[]) PARTITION BY LIST (j);
CREATE TABLE test_array_p1 PARTITION OF test_array FOR VALUES IN ('{1}');
CREATE TABLE test_array_p2 PARTITION OF test_array FOR VALUES IN ('{2,2}');

INSERT INTO test_array (i,j[1],k[1]) VALUES (1,1,1);
INSERT INTO test_array (i,j[1],j[2],k[1]) VALUES (2,2,2,2);

postgres=# SELECT tableoid::regclass,* FROM test_array_p1;
   tableoid    | i |  j  |  k 
---------------+---+-----+-----
 test_array_p1 | 1 | {1} | {1}
(1 row)

postgres=# SELECT tableoid::regclass,* FROM test_array_p2;
   tableoid    | i |   j   |  k 
---------------+---+-------+-----
 test_array_p2 | 2 | {2,2} | {2}
(1 row)

postgres=# SELECT tableoid::regclass,* FROM test_array;
ERROR:  cache lookup failed for type 0

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi,

On 2016/09/09 17:55, Amit Langote wrote:
> On 2016/09/06 22:04, Amit Langote wrote:
>> Will fix.
>
> Here is an updated set of patches.

An email from Rajkumar somehow managed to break out of this thread.
Quoting his message below so that I don't end up replying with patches on
two different threads.

On 2016/09/14 16:58, Rajkumar Raghuwanshi wrote:
> I have Continued with testing declarative partitioning with the latest
> patch. Got some more observation, given below

Thanks a lot for testing.

> -- Observation 1 : Getting overlap error with START with EXCLUSIVE in range
> partition.
>
> create table test_range_bound ( a int) partition by range(a);
> --creating a partition to contain records {1,2,3,4}, by default 1 is
> inclusive and 5 is exclusive
> create table test_range_bound_p1 partition of test_range_bound for values
> start (1) end (5);
> --now trying to create a partition by explicitly mentioning start is
> exclusive to contain records {5,6,7}, here trying to create with START with
> 4 as exclusive so range should be 5 to 8, but getting partition overlap
> error.
> create table test_range_bound_p2 partition of test_range_bound for values
> start (4) EXCLUSIVE end (8);
> ERROR:  partition "test_range_bound_p2" would overlap partition
> "test_range_bound_p1"

Wow, this is bad.  What is needed in this case is "canonicalization" of
the range partition bounds specified in the command.  Range types do this
and hence an equivalent test done with range type values would disagree
with the result given by the patch for range partition bounds.

select '[1,5)'::int4range && '(4,8]'::int4range as cmp;
 cmp
-----
 f
(1 row)

In this case, the second range is converted into its equivalent canonical
form viz. '[5, 9)'.  Then comparison of bounds 5) and [5 can tell that the
ranges do not overlap after all.  Range type operators can do this because
their code can rely on the availability of a canonicalization function for
a given range type.  From the range types documentation:

"""
A discrete range type should have a canonicalization function that is
aware of the desired step size for the element type. The canonicalization
function is charged with converting equivalent values of the range type to
have identical representations, in particular consistently inclusive or
exclusive bounds. If a canonicalization function is not specified, then
ranges with different formatting will always be treated as unequal, even
though they might represent the same set of values in reality.
"""

to extend the last sentence:

"... or consider two ranges overlapping when in reality they are not
(maybe they are really just adjacent)."

Within the code handling range partition bounds, no such canonicalization
happens, so comparing the bounds 5) and (4 ends up concluding that
upper1 > lower2, and hence that the ranges overlap.

To mitigate this, how about we restrict the range partition key to contain
columns of only those types for which we know we can safely canonicalize a
range bound (ie, discrete range types)?  I don't think we can use, say,
the existing int4range_canonical, but will have to write a version of it
for partitioning usage (range bounds of partitions are different from what
int4range_canonical is ready to handle).  This approach would be very
limiting, as range partitioning would then be restricted to columns of
only the int, bigint and date types.

One more option is to let the user specify a canonicalization function
next to the column name when defining the partition key.  If not
specified, we hard-code one for the types for which we will be
implementing a canonicalization function (ie, the above-mentioned types).
In other cases, we just don't have one, and hence if an unexpected result
occurs when creating a new partition, it's up to the user to realize what
happened.  Of course, we will mention in the documentation why a
canonicalization function is necessary and how to write one.  Note that
this canonicalization function comes into play only when defining new
partitions; it has no role beyond that point.

> -- Observation 2 : able to create sub-partition out of the range set for
> main table, causing not able to insert data satisfying any of the partition.
>
> create table test_subpart (c1 int) partition by range (c1);
> create table test_subpart_p1 partition of test_subpart for values start (1)
> end (100) inclusive partition by range (c1);
> create table test_subpart_p1_sub1 partition of test_subpart_p1 for values
> start (101) end (200);
>
> \d+ test_subpart
>                      Table "public.test_subpart"
>  Column |  Type   | Modifiers | Storage | Stats target | Description
> --------+---------+-----------+---------+--------------+-------------
>  c1     | integer |           | plain   |              |
> Partition Key: RANGE (c1)
> Partitions: test_subpart_p1 FOR VALUES START (1) END (100) INCLUSIVE
>
> \d+ test_subpart_p1
>                    Table "public.test_subpart_p1"
>  Column |  Type   | Modifiers | Storage | Stats target | Description
> --------+---------+-----------+---------+--------------+-------------
>  c1     | integer |           | plain   |              |
> Partition Of: test_subpart FOR VALUES START (1) END (100) INCLUSIVE
> Partition Key: RANGE (c1)
> Partitions: test_subpart_p1_sub1 FOR VALUES START (101) END (200)
>
> insert into test_subpart values (50);
> ERROR:  no partition of relation "test_subpart_p1" found for row
> DETAIL:  Failing row contains (50).
> insert into test_subpart values (150);
> ERROR:  no partition of relation "test_subpart" found for row
> DETAIL:  Failing row contains (150).

It seems that the DDL should prevent the same column from being used in
the partition key of lower-level partitions.  Being able to reuse a column
as part of a lower-level partition key might be useful to users who know
what they are doing, but the failure mode shown above does not sound like
a good thing.  I have modified the patch so that lower-level partitions
cannot use columns already used by ancestor tables.

> -- Observation 3 : Getting cache lookup failed, when selecting list
> partition table containing array.
>
> CREATE TABLE test_array ( i int,j int[],k text[]) PARTITION BY LIST (j);
> CREATE TABLE test_array_p1 PARTITION OF test_array FOR VALUES IN ('{1}');
> CREATE TABLE test_array_p2 PARTITION OF test_array FOR VALUES IN ('{2,2}');
>
> INSERT INTO test_array (i,j[1],k[1]) VALUES (1,1,1);
> INSERT INTO test_array (i,j[1],j[2],k[1]) VALUES (2,2,2,2);
>
> postgres=# SELECT tableoid::regclass,* FROM test_array_p1;
>    tableoid    | i |  j  |  k
> ---------------+---+-----+-----
>  test_array_p1 | 1 | {1} | {1}
> (1 row)
>
> postgres=# SELECT tableoid::regclass,* FROM test_array_p2;
>    tableoid    | i |   j   |  k
> ---------------+---+-------+-----
>  test_array_p2 | 2 | {2,2} | {2}
> (1 row)
>
> postgres=# SELECT tableoid::regclass,* FROM test_array;
> ERROR:  cache lookup failed for type 0

That's a bug.  Fixed in the attached patch.

PS: I'm going to have limited Internet access during this weekend and over
the next week, so responses could be slow.  Sorry about that.

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
Hi Amit,

It looks like there is some problem while creating parameterized paths
for multi-level partitioned tables. Here's a longish test case:

CREATE TABLE prt1_l (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_l_p1 PARTITION OF prt1_l FOR VALUES START (0) END
(250) PARTITION BY RANGE (b);
CREATE TABLE prt1_l_p1_p1 PARTITION OF prt1_l_p1 FOR VALUES START (0) END (100);
CREATE TABLE prt1_l_p1_p2 PARTITION OF prt1_l_p1 FOR VALUES START
(100) END (250);
CREATE TABLE prt1_l_p2 PARTITION OF prt1_l FOR VALUES START (250) END
(500) PARTITION BY RANGE (c);
CREATE TABLE prt1_l_p2_p1 PARTITION OF prt1_l_p2 FOR VALUES START
('0250') END ('0400');
CREATE TABLE prt1_l_p2_p2 PARTITION OF prt1_l_p2 FOR VALUES START
('0400') END ('0500');
CREATE TABLE prt1_l_p3 PARTITION OF prt1_l FOR VALUES START (500) END
(600) PARTITION BY RANGE ((b + a));
CREATE TABLE prt1_l_p3_p1 PARTITION OF prt1_l_p3 FOR VALUES START
(1000) END (1100);
CREATE TABLE prt1_l_p3_p2 PARTITION OF prt1_l_p3 FOR VALUES START
(1100) END (1200);
INSERT INTO prt1_l SELECT i, i, to_char(i, 'FM0000') FROM
generate_series(0, 599, 2) i;
CREATE TABLE uprt1_l AS SELECT * FROM prt1_l;

CREATE TABLE prt2_l (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_l_p1 PARTITION OF prt2_l FOR VALUES START (0) END
(250) PARTITION BY RANGE (a);
CREATE TABLE prt2_l_p1_p1 PARTITION OF prt2_l_p1 FOR VALUES START (0) END (100);
CREATE TABLE prt2_l_p1_p2 PARTITION OF prt2_l_p1 FOR VALUES START
(100) END (250);
CREATE TABLE prt2_l_p2 PARTITION OF prt2_l FOR VALUES START (250) END
(500) PARTITION BY RANGE (c);
CREATE TABLE prt2_l_p2_p1 PARTITION OF prt2_l_p2 FOR VALUES START
('0250') END ('0400');
CREATE TABLE prt2_l_p2_p2 PARTITION OF prt2_l_p2 FOR VALUES START
('0400') END ('0500');
CREATE TABLE prt2_l_p3 PARTITION OF prt2_l FOR VALUES START (500) END
(600) PARTITION BY RANGE ((a + b));
CREATE TABLE prt2_l_p3_p1 PARTITION OF prt2_l_p3 FOR VALUES START
(1000) END (1100);
CREATE TABLE prt2_l_p3_p2 PARTITION OF prt2_l_p3 FOR VALUES START
(1100) END (1200);
INSERT INTO prt2_l SELECT i, i, to_char(i, 'FM0000') FROM
generate_series(0, 599, 3) i;
CREATE TABLE uprt2_l AS SELECT * FROM prt2_l;

EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM prt1_l t1 LEFT JOIN LATERAL
  (SELECT t2.a AS t2a, t2.c AS t2c, t2.b AS t2b, t3.a AS t3a,
          least(t1.a, t2.a, t3.a)
   FROM prt1_l t2 JOIN prt2_l t3
     ON (t2.a = t3.b AND t2.b = t3.a AND t2.c = t3.c AND
         t2.b + t2.a = t3.a + t3.b)) ss
  ON t1.a = ss.t2a AND t1.b = ss.t2a AND t1.c = ss.t2c AND
     t1.b + t1.a = ss.t2a + ss.t2b
WHERE t1.a % 25 = 0 ORDER BY t1.a;
ERROR:  could not devise a query plan for the given query

Let's replace the laterally referenced relation by an unpartitioned table.

EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM uprt1_l t1 LEFT JOIN LATERAL
  (SELECT t2.a AS t2a, t2.c AS t2c, t2.b AS t2b, t3.a AS t3a,
          least(t1.a, t2.a, t3.a)
   FROM prt1_l t2 JOIN prt2_l t3
     ON (t2.a = t3.b AND t2.b = t3.a AND t2.c = t3.c AND
         t2.b + t2.a = t3.a + t3.b)) ss
  ON t1.a = ss.t2a AND t1.b = ss.t2a AND t1.c = ss.t2c AND
     t1.b + t1.a = ss.t2a + ss.t2b
WHERE t1.a % 25 = 0 ORDER BY t1.a;
ERROR:  could not devise a query plan for the given query

Let's replace another partitioned table in the inner query with an
unpartitioned table.

EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM uprt1_l t1 LEFT JOIN LATERAL
  (SELECT t2.a AS t2a, t2.c AS t2c, t2.b AS t2b, t3.a AS t3a,
          least(t1.a, t2.a, t3.a)
   FROM prt1_l t2 JOIN uprt2_l t3
     ON (t2.a = t3.b AND t2.b = t3.a AND t2.c = t3.c AND
         t2.b + t2.a = t3.a + t3.b)) ss
  ON t1.a = ss.t2a AND t1.b = ss.t2a AND t1.c = ss.t2c AND
     t1.b + t1.a = ss.t2a + ss.t2b
WHERE t1.a % 25 = 0 ORDER BY t1.a;
ERROR:  could not devise a query plan for the given query

Please check if you are able to reproduce these errors in your
repository. I made sure that I cleaned up all partition-wise join code
before testing this, but ... .

I tried to debug the problem somewhat. In set_append_rel_pathlist(),
it finds that at least one child has a parameterized path as the
cheapest path, so it doesn't create an unparameterized path for append
rel. At the same time there is a parameterization common to all the
children, so it doesn't create any path.  There seem to be two problems
here:
1. The children from second level onwards may not be getting
parameterized for lateral references. That seems unlikely but
possible.
2. Reparameterization should have corrected this, but
reparameterize_path() does not support AppendPaths.

On Thu, Sep 15, 2016 at 2:23 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Hi
>
> On 2016/09/09 17:55, Amit Langote wrote:
>> On 2016/09/06 22:04, Amit Langote wrote:
>>> Will fix.
>>
>> Here is an updated set of patches.
>
> An email from Rajkumar somehow managed to break out of this thread.
> Quoting his message below so that I don't end up replying with patches on
> two different threads.
>
> On 2016/09/14 16:58, Rajkumar Raghuwanshi wrote:
>> I have Continued with testing declarative partitioning with the latest
>> patch. Got some more observation, given below
>
> Thanks a lot for testing.
>
>> -- Observation 1 : Getting overlap error with START with EXCLUSIVE in range
>> partition.
>>
>> create table test_range_bound ( a int) partition by range(a);
>> --creating a partition to contain records {1,2,3,4}, by default 1 is
>> inclusive and 5 is exclusive
>> create table test_range_bound_p1 partition of test_range_bound for values
>> start (1) end (5);
>> --now trying to create a partition by explicitly mentioning start is
>> exclusive to contain records {5,6,7}, here trying to create with START with
>> 4 as exclusive so range should be 5 to 8, but getting partition overlap
>> error.
>> create table test_range_bound_p2 partition of test_range_bound for values
>> start (4) EXCLUSIVE end (8);
>> ERROR:  partition "test_range_bound_p2" would overlap partition
>> "test_range_bound_p1"
>
> Wow, this is bad.  What is needed in this case is "canonicalization" of
> the range partition bounds specified in the command.  Range types do this
> and hence an equivalent test done with range type values would disagree
> with the result given by the patch for range partition bounds.
>
> select '[1,5)'::int4range && '(4,8]'::int4range as cmp;
>  cmp
> -----
>  f
> (1 row)
>
> In this case, the second range is converted into its equivalent canonical
> form viz. '[5, 9)'.  Then comparison of bounds 5) and [5 can tell that the
> ranges do not overlap after all.  Range type operators can do this because
> their code can rely on the availability of a canonicalization function for
> a given range type.  From the range types documentation:
>
> """
> A discrete range type should have a canonicalization function that is
> aware of the desired step size for the element type. The canonicalization
> function is charged with converting equivalent values of the range type to
> have identical representations, in particular consistently inclusive or
> exclusive bounds. If a canonicalization function is not specified, then
> ranges with different formatting will always be treated as unequal, even
> though they might represent the same set of values in reality.
> """
>
> to extend the last sentence:
>
> "... or consider two ranges overlapping when in reality they are not
> (maybe they are really just adjacent)."
>
> Within the code handling range partition bound, no such canonicalization
> happens, so comparison 5) and (4 ends up concluding that upper1 > lower2,
> hence ranges overlap.
>
> To mitigate this, how about we restrict range partition key to contain
> columns of only those types for which we know we can safely canonicalize a
> range bound (ie, discrete range types)?  I don't think we can use, say,
> existing int4range_canonical but will have to write a version of it for
> partitioning usage (range bounds of partitions are different from what
> int4range_canonical is ready to handle).  This approach will be very
> limiting as then range partitions will be limited to columns of int,
> bigint and date type only.
>
> One more option is we let the user specify the canonicalize function next
> to the column name when defining the partition key.  If not specified, we
> hard-code one for the types for which we will be implementing a
> canonicalize function (ie, above mentioned types).  In other cases, we
> just don't have one and hence if an unexpected result occurs when creating
> a new partition, it's up to the user to realize what happened.  Of course,
> we will be mentioning in the documentation why a canonicalize function is
> necessary and how to write one.  Note that this canonicalize function
> comes into play only when defining new partitions, it has no role beyond
> that point.
>
>> -- Observation 2 : able to create sub-partition out of the range set for
>> main table, causing not able to insert data satisfying any of the partition.
>>
>> create table test_subpart (c1 int) partition by range (c1);
>> create table test_subpart_p1 partition of test_subpart for values start (1)
>> end (100) inclusive partition by range (c1);
>> create table test_subpart_p1_sub1 partition of test_subpart_p1 for values
>> start (101) end (200);
>>
>> \d+ test_subpart
>>                      Table "public.test_subpart"
>>  Column |  Type   | Modifiers | Storage | Stats target | Description
>> --------+---------+-----------+---------+--------------+-------------
>>  c1     | integer |           | plain   |              |
>> Partition Key: RANGE (c1)
>> Partitions: test_subpart_p1 FOR VALUES START (1) END (100) INCLUSIVE
>>
>> \d+ test_subpart_p1
>>                    Table "public.test_subpart_p1"
>>  Column |  Type   | Modifiers | Storage | Stats target | Description
>> --------+---------+-----------+---------+--------------+-------------
>>  c1     | integer |           | plain   |              |
>> Partition Of: test_subpart FOR VALUES START (1) END (100) INCLUSIVE
>> Partition Key: RANGE (c1)
>> Partitions: test_subpart_p1_sub1 FOR VALUES START (101) END (200)
>>
>> insert into test_subpart values (50);
>> ERROR:  no partition of relation "test_subpart_p1" found for row
>> DETAIL:  Failing row contains (50).
>> insert into test_subpart values (150);
>> ERROR:  no partition of relation "test_subpart" found for row
>> DETAIL:  Failing row contains (150).
>
> It seems that DDL should prevent the same column being used in partition
> key of lower level partitions.  I don't know how much sense it would make,
> but being able to use the same column as partition key of lower level
> partitions may be a feature useful to some users if they know what they
> are doing.  But this last part doesn't sound like a good thing.  I
> modified the patch such that lower level partitions cannot use columns
> used by ancestor tables.
>
>> -- Observation 3 : Getting cache lookup failed, when selecting list
>> partition table containing array.
>>
>> CREATE TABLE test_array ( i int,j int[],k text[]) PARTITION BY LIST (j);
>> CREATE TABLE test_array_p1 PARTITION OF test_array FOR VALUES IN ('{1}');
>> CREATE TABLE test_array_p2 PARTITION OF test_array FOR VALUES IN ('{2,2}');
>>
>> INSERT INTO test_array (i,j[1],k[1]) VALUES (1,1,1);
>> INSERT INTO test_array (i,j[1],j[2],k[1]) VALUES (2,2,2,2);
>>
>> postgres=# SELECT tableoid::regclass,* FROM test_array_p1;
>>    tableoid    | i |  j  |  k
>> ---------------+---+-----+-----
>>  test_array_p1 | 1 | {1} | {1}
>> (1 row)
>>
>> postgres=# SELECT tableoid::regclass,* FROM test_array_p2;
>>    tableoid    | i |   j   |  k
>> ---------------+---+-------+-----
>>  test_array_p2 | 2 | {2,2} | {2}
>> (1 row)
>>
>> postgres=# SELECT tableoid::regclass,* FROM test_array;
>> ERROR:  cache lookup failed for type 0
>
> That's a bug.  Fixed in the attached patch.
>
> PS: I'm going to have limited Internet access during this weekend and over
> the next week, so responses could be slow.  Sorry about that.
>
> Thanks,
> Amit



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Sep 15, 2016 at 4:53 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Wow, this is bad.  What is needed in this case is "canonicalization" of
> the range partition bounds specified in the command.

I think we shouldn't worry about this.  It seems like unnecessary
scope creep.  All human beings capable of using PostgreSQL will
understand that there are no integers between 4 and 5, so any
practical impact on this would be on someone creating partitions
automatically.  But if someone is creating partitions automatically
they are highly likely to be using the same EXCLUSIVE/INCLUSIVE
settings for all of the partitions, in which case this won't arise.
And if they aren't, then I think we should just make them deal with
this limitation in their code instead of dealing with it in our code.
This patch is plenty complicated enough already; introducing a whole
new canonicalization concept that will help practically nobody seems
to me to be going in the wrong direction.  If somebody really cares
enough to want to try to fix this, they can submit a followup patch
someday.

> To mitigate this, how about we restrict range partition key to contain
> columns of only those types for which we know we can safely canonicalize a
> range bound (ie, discrete range types)?  I don't think we can use, say,
> existing int4range_canonical but will have to write a version of it for
> partitioning usage (range bounds of partitions are different from what
> int4range_canonical is ready to handle).  This approach will be very
> limiting as then range partitions will be limited to columns of int,
> bigint and date type only.

-1.  That is letting the tail wag the dog.  Let's leave it the way you
had it and be happy.

>> -- Observation 2 : able to create sub-partition out of the range set for
>> main table, causing not able to insert data satisfying any of the partition.
>>
>> create table test_subpart (c1 int) partition by range (c1);
>> create table test_subpart_p1 partition of test_subpart for values start (1)
>> end (100) inclusive partition by range (c1);
>> create table test_subpart_p1_sub1 partition of test_subpart_p1 for values
>> start (101) end (200);
>
> It seems that DDL should prevent the same column being used in partition
> key of lower level partitions.  I don't know how much sense it would make,
> but being able to use the same column as partition key of lower level
> partitions may be a feature useful to some users if they know what they
> are doing.  But this last part doesn't sound like a good thing.  I
> modified the patch such that lower level partitions cannot use columns
> used by ancestor tables.

I again disagree.  If somebody defines partition bounds that make it
impossible to insert the data that they care about, that's operator
error.  The fact that, across multiple levels, you can manage to make
it impossible to insert any data at all does not make it anything
other than operator error.  If we take the contrary position that it's
the system's job to prevent this sort of thing, we may disallow some
useful cases, like partitioning by the year portion of a date and then
subpartitioning by the month portion of that same date.
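
Under the syntax proposed in this patch series, such a scheme might look
like the following sketch (the table and partition names, and the use of
extract() expressions as partition keys, are invented for illustration):

```sql
CREATE TABLE measurements (logdate date, reading int)
    PARTITION BY RANGE (extract(year from logdate));

-- subpartition the 2016 partition by the month portion of the same column
CREATE TABLE measurements_y2016 PARTITION OF measurements
    FOR VALUES START (2016) END (2017)
    PARTITION BY RANGE (extract(month from logdate));

CREATE TABLE measurements_y2016_m09 PARTITION OF measurements_y2016
    FOR VALUES START (9) END (10);
```

A blanket "the same column cannot reappear in a lower-level key" rule
would forbid this, even though it is perfectly sensible.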

I think we'll probably also find that we're making the code
complicated to no purpose.  For example, now you have to check when
attaching a partition that it doesn't violate the rule; otherwise you
end up with a table that can't be created directly (and thus can't
survive dump-and-restore) but can be created indirectly by exploiting
loopholes in the checks.  It's tempting to think that we can check
simple cases - e.g. if the parent and the child are partitioning on
the same exact column, the child's range should be contained within
the parent's range - but more complicated cases are tricky.  Suppose
the table is range-partitioned on (a, b) and range-subpartitioned on
b.  It's not trivial to figure out whether the set of values that the
user can insert into that partition is non-empty.  If we allow
partitioning on expressions, then it quickly becomes altogether
impossible to deduce anything useful - unless you can solve the
halting problem.
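
To make that trickier case concrete, here is a sketch in the patch's
proposed syntax (the names and the composite START/END bounds are invented
for illustration):

```sql
CREATE TABLE t (a int, b int) PARTITION BY RANGE (a, b);

-- rows here satisfy (1, 100) <= (a, b) < (2, 0) in composite-key order,
-- so b alone is NOT restricted to [100, ...): e.g. (1, 500) lands here
CREATE TABLE t_p1 PARTITION OF t
    FOR VALUES START (1, 100) END (2, 0)
    PARTITION BY RANGE (b);

-- deciding whether this subpartition can ever hold a row requires
-- reasoning about the composite bounds of its parent (here it can,
-- e.g. (1, 500), but that is not obvious from the bounds on b alone)
CREATE TABLE t_p1_s1 PARTITION OF t_p1
    FOR VALUES START (500) END (600);
```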

And, really, why do we care?  If the user creates a partitioning
scheme that permits no rows in some or all of the partitions, then
they will have an empty table that can be correctly dumped and
restored but which cannot be used for anything useful unless it is
modified first.  They probably don't want that, but it's not any more
broken than a table inheritance setup with mutually exclusive CHECK
constraints, or for that matter a plain table with mutually exclusive
CHECK constraints - and we don't try to prohibit those things.  This
patch is supposed to be implementing partitioning, not artificial
intelligence.

>> -- Observation 3 : Getting cache lookup failed, when selecting list
>> partition table containing array.
>>
>> postgres=# SELECT tableoid::regclass,* FROM test_array;
>> ERROR:  cache lookup failed for type 0
>
> That's a bug.  Fixed in the attached patch.

Now on this one I'm not going to argue with your analysis.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On Thu, Sep 15, 2016 at 10:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 4:53 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Wow, this is bad.  What is needed in this case is "canonicalization" of
>> the range partition bounds specified in the command.
>
> I think we shouldn't worry about this.  It seems like unnecessary
> scope creep.  All human beings capable of using PostgreSQL will
> understand that there are no integers between 4 and 5, so any
> practical impact on this would be on someone creating partitions
> automatically.  But if someone is creating partitions automatically
> they are highly likely to be using the same EXCLUSIVE/INCLUSIVE
> settings for all of the partitions, in which case this won't arise.
> And if they aren't, then I think we should just make them deal with
> this limitation in their code instead of dealing with it in our code.
> This patch is plenty complicated enough already; introducing a whole
> new canonicalization concept that will help practically nobody seems
> to me to be going in the wrong direction.  If somebody really cares
> enough to want to try to fix this, they can submit a followup patch
> someday.
>
>> To mitigate this, how about we restrict range partition key to contain
>> columns of only those types for which we know we can safely canonicalize a
>> range bound (ie, discrete range types)?  I don't think we can use, say,
>> existing int4range_canonical but will have to write a version of it for
>> partitioning usage (range bounds of partitions are different from what
>> int4range_canonical is ready to handle).  This approach will be very
>> limiting as then range partitions will be limited to columns of int,
>> bigint and date type only.
>
> -1.  That is letting the tail wag the dog.  Let's leave it the way you
> had it and be happy.

Alright, let's leave this as something to work out later.  We will
have to document the fact that such a limitation exists, though, I'd
think.

>>> -- Observation 2 : able to create sub-partition out of the range set for
>>> main table, causing not able to insert data satisfying any of the partition.
>>>
>>> create table test_subpart (c1 int) partition by range (c1);
>>> create table test_subpart_p1 partition of test_subpart for values start (1)
>>> end (100) inclusive partition by range (c1);
>>> create table test_subpart_p1_sub1 partition of test_subpart_p1 for values
>>> start (101) end (200);
>>
>> It seems that DDL should prevent the same column being used in partition
>> key of lower level partitions.  I don't know how much sense it would make,
>> but being able to use the same column as partition key of lower level
>> partitions may be a feature useful to some users if they know what they
>> are doing.  But this last part doesn't sound like a good thing.  I
>> modified the patch such that lower level partitions cannot use columns
>> used by ancestor tables.
>
> I again disagree.  If somebody defines partition bounds that make it
> impossible to insert the data that they care about, that's operator
> error.  The fact that, across multiple levels, you can manage to make
> it impossible to insert any data at all does not make it anything
> other than operator error.  If we take the contrary position that it's
> the system's job to prevent this sort of thing, we may disallow some
> useful cases, like partitioning by the year portion of a date and then
> subpartitioning by the month portion of that same date.
>
> I think we'll probably also find that we're making the code
> complicated to no purpose.  For example, now you have to check when
> attaching a partition that it doesn't violate the rule; otherwise you
> end up with a table that can't be created directly (and thus can't
> survive dump-and-restore) but can be created indirectly by exploiting
> loopholes in the checks.  It's tempting to think that we can check
> simple cases - e.g. if the parent and the child are partitioning on
> the same exact column, the child's range should be contained within
> the parent's range - but more complicated cases are tricky.  Suppose
> the table is range-partitioned on (a, b) and range-subpartitioned on
> b.  It's not trivial to figure out whether the set of values that the
> user can insert into that partition is non-empty.  If we allow
> partitioning on expressions, then it quickly becomes altogether
> impossible to deduce anything useful - unless you can solve the
> halting problem.
>
> And, really, why do we care?  If the user creates a partitioning
> scheme that permits no rows in some or all of the partitions, then
> they will have an empty table that can be correctly dumped and
> restored but which cannot be used for anything useful unless it is
> modified first.  They probably don't want that, but it's not any more
> broken than a table inheritance setup with mutually exclusive CHECK
> constraints, or for that matter a plain table with mutually exclusive
> CHECK constraints - and we don't try to prohibit those things.  This
> patch is supposed to be implementing partitioning, not artificial
> intelligence.

Agree with your arguments.  Certainly, I overlooked some use cases
that my proposed solution would outright prevent.  I withdraw it.

Thanks,
Amit



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Mon, Aug 15, 2016 at 7:21 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> +    if (partexprs)
>> +        recordDependencyOnSingleRelExpr(&myself,
>> +                                        (Node *) partexprs,
>> +                                        RelationGetRelid(rel),
>> +                                        DEPENDENCY_NORMAL,
>> +                                        DEPENDENCY_IGNORE);
>>
>> I don't think introducing a new DEPENDENCY_IGNORE type is a good idea
>> here.  Instead, you could just add an additional Boolean argument to
>> recordDependencyOnSingleRelExpr.  That seems less likely to create
>> bugs in unrelated portions of the code.
>
> I did consider a Boolean argument instead of a new DependencyType value,
> however it felt a bit strange to pass a valid value for the fifth argument
> (self_behavior) and then ask using a separate parameter that it (a
> self-dependency) is to be ignored.  By the way, no pg_depend entry is
> created on such a call, so the effect of the new type's usage seems
> localized to me. Thoughts?

I think that's not a very plausible argument.  If you add a fifth
argument to that function, then only that function needs to know about
the possibility of ignoring self-dependencies.  If you add a new
dependency type, then everything that knows about DependencyType needs
to know about them.  That's got to be a much larger surface area for
bugs.  Also, if you look around a bit, I believe you will find other
examples of cases where one argument is used only for certain values
of some other argument.  That's not a novel design pattern.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Sep 15, 2016 at 4:53 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> [ new patches ]

Re-reviewing 0001.

+      <entry><structfield>partexprs</structfield></entry>
+      <entry><type>pg_node_tree</type></entry>

This documentation doesn't match pg_partition_table.h, which has
partexprsrc and partexprbin.  I don't understand why it's a good idea
to have both, and there seem to be no comments or documentation
supporting that choice anywhere.

+      The optional <literal>PARTITION BY</> clause specifies a method of
+      partitioning the table and the corresponding partition key.  Table
+      thus created is called <firstterm>partitioned</firstterm> table.  Key
+      consists of an ordered list of column names and/or expressions when
+      using the <literal>RANGE</> method, whereas only a single column or
+      expression can be specified when using the <literal>LIST</> method.
+      The type of a key column or an expression must have an associated
+      btree operator class or one must be specified along with the column
+      or the expression.

Both of the sentences in this paragraph that do not begin with "the"
need to begin with "the".  (In my experience, it's generally a feature
of English as spoken in India that connecting words like "the" and "a"
are sometimes left out where non-Indian speakers of English would
include them, so it would be good to watch out for this issue in
general.)

Also, I think this should be rephrased a bit to be more clear about
how the partitioning key works, like this: The optional
<literal>PARTITION BY</literal> clause specifies a method of
partitioning the table.  The table thus created is called a
<firstterm>partitioned</firstterm> table.  The parenthesized list of
expressions forms the <firstterm>partitioning key</firstterm> for the
table.  When using range partitioning, the partitioning key can include
multiple columns or expressions, but for list partitioning, the
partitioning key must consist of a single column or expression.  If no
btree operator class is specified when creating a partitioned table,
the default btree operator class for the datatype will be used.  If
there is none, an error will be reported.

+        case RELKIND_PARTITIONED_TABLE:
+            options = heap_reloptions(classForm->relkind, datum, false);

Why?  None of the reloptions that pertain to heap seem relevant to a
relkind without storage.

But, ah, do partitioned tables have storage?  I mean, do we end up
with an empty file, or no relfilenode at all?  Can I CLUSTER, VACUUM,
etc. a partitioned table?  It would seem cleaner for the parent to
have no relfilenode at all, but I'm guessing we might need some more
changes for that to work out.

+    pg_collation.h pg_range.h pg_transform.h pg_partitioned_table.h\

Whitespace.  Also, here and elsewhere, how about using alphabetical
order, or anyway preserving it insofar as the existing list is
alphabetized?

+    /* Remove NO INHERIT flag if rel is a partitioned table */
+    if (is_no_inherit &&
+        rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+                 errmsg("cannot add NO INHERIT constraint to partitioned table \"%s\"",
+                         RelationGetRelationName(rel))));

The code and the comment disagree.  I think the code is right and the
comment should be adjusted to say something like /* Partitioned tables
do not have storage, so a NO INHERIT constraint makes no sense. */

+ * IDENTIFICATION
+ *        src/backend/utils/misc/partition.c

Wrong.

+} KeyTypeCollInfo;

I don't like this type name very much.  Can we get "Part" in there someplace?

It doesn't seem to be very well-designed, either.  The number of
entries in each array is determined by the partnatts flag in
PartitionKeyData, which has also got various other arrays whose
lengths are determined by partnatts.  Why do we have some arrays in
one structure and some arrays in another structure?  Would it hurt
anything to merge everything into one structure?  Or could
PartitionKeyData include a field of type KeyTypeCollInfo rather than
KeyTypeCollInfo *, saving one pointer reference every place we access
this data?

+    /* Allocate in the supposedly short-lived working context */

Why supposedly?

+    datum = fastgetattr(tuple, Anum_pg_partitioned_table_partattrs,
+                        RelationGetDescr(catalog),
+                        &isnull);

Isn't the point of putting the fixed-length fields first that we can
use GETSTRUCT() here?  And even for partattrs as the first
variable-length thing?

+        /*
+         * Run the expressions through eval_const_expressions. This is
+         * not just an optimization, but is necessary, because eventually
+         * the planner will be comparing them to similarly-processed qual
+         * clauses, and may fail to detect valid matches without this.
+         * We don't bother with canonicalize_qual, however.
+         */

I'm a bit confused by this, because I would think this processing
ought to have been done before storing anything in the system
catalogs.  I don't see why it should be necessary to do it again after
pulling data back out of the system catalogs.

+            Value *str = lfirst(partexprsrc_item);
+            key->partcolnames[i] = pstrdup(str->val.str);

Should have a blank line in between.

+/*
+ * Partition key information inquiry functions
+ */
+int
+get_partition_key_strategy(PartitionKey key)
+{
+    return key->strategy;
+}
+
+int
+get_partition_key_natts(PartitionKey key)
+{
+    return key->partnatts;
+}
+
+List *
+get_partition_key_exprs(PartitionKey key)
+{
+    return key->partexprs;
+}
+
+/*
+ * Partition key information inquiry functions - one column
+ */
+int16
+get_partition_col_attnum(PartitionKey key, int col)
+{
+    return key->partattrs[col];
+}
+
+Oid
+get_partition_col_typid(PartitionKey key, int col)
+{
+    return key->tcinfo->typid[col];
+}
+
+int32
+get_partition_col_typmod(PartitionKey key, int col)
+{
+    return key->tcinfo->typmod[col];
+}

If we're going to add notation for this, I think we should use macros
(or static inline functions defined in the header file).  Doing it
this way adds more cycles for no benefit.
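To illustrate the static-inline alternative, here is a minimal standalone sketch (field and type names loosely follow the patch's PartitionKeyData, but treat them as illustrative stand-ins, not the actual definitions):

```c
/* Sketch: accessors as static inline functions, as they might appear in a
 * header, so each call compiles down to a direct field load with no
 * function-call overhead.  Struct layout here is illustrative only. */
#include <stdint.h>

typedef struct PartitionKeyData
{
    char     strategy;      /* 'l' = list, 'r' = range */
    int16_t  partnatts;     /* number of key columns */
    int16_t *partattrs;     /* attribute numbers, one per key column */
} PartitionKeyData;

typedef PartitionKeyData *PartitionKey;

static inline int
get_partition_key_strategy(PartitionKey key)
{
    return key->strategy;
}

static inline int
get_partition_key_natts(PartitionKey key)
{
    return key->partnatts;
}

static inline int16_t
get_partition_col_attnum(PartitionKey key, int col)
{
    return key->partattrs[col];
}
```

Plain macros would work equally well; static inline functions just keep the argument type checking.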

+    newkey->partattrs = (AttrNumber *)
+                            palloc0(newkey->partnatts * sizeof(AttrNumber));
+    memcpy(newkey->partattrs, fromkey->partattrs,
+                            newkey->partnatts * sizeof(AttrNumber));

It's wasteful to use palloc0 if you're immediately going to overwrite
every byte in the array.  Use regular palloc instead.

+     * Copy the partition key, opclass info into arrays (should we
+     * make the caller pass them like this to start with?)

Only if it happens to be convenient for the caller, which doesn't seem
to be the case here.

+    /* Only this can ever be NULL */
+    if (!partexprbinDatum)
+    {
+        nulls[Anum_pg_partitioned_table_partexprbin - 1] = true;
+        nulls[Anum_pg_partitioned_table_partexprsrc - 1] = true;
+    }

How can it be valid to have no partitioning expressions?

+    /* Tell world about the key */
+    CacheInvalidateRelcache(rel);

Is this really needed?  Isn't the caller going to do something similar
pretty soon?

+    heap_freetuple(tuple);

Probably useless - might as well let the context reset clean it up.

+    simple_heap_delete(rel, &tuple->t_self);
+
+    /* Update the indexes on pg_partitioned_table */
+    CatalogUpdateIndexes(rel, tuple);

You don't need CatalogUpdateIndexes() after a delete, only after an
insert or update.

+    if (classform->relkind != relkind &&
+                (relkind == RELKIND_RELATION &&
+                    classform->relkind != RELKIND_PARTITIONED_TABLE))

That's broken.  Note that all of the conditions are joined using &&,
so if any one of them fails then we won't throw an error.  In
particular, it's no longer possible to throw an error when relkind is
not RELKIND_RELATION.
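To make the breakage concrete, here is a standalone sketch of the two predicates, with single characters standing in for the relkind constants ('r' for RELKIND_RELATION, 'P' for RELKIND_PARTITIONED_TABLE, 'S' for any other relkind); the function names are made up for illustration:

```c
#include <stdbool.h>

/* The patch's condition, same shape: everything joined with && */
static bool
mismatch_as_written(char actual, char expected)
{
    return actual != expected &&
           (expected == 'r' && actual != 'P');
}

/* The presumably intended condition: a mismatch unless the one allowed
 * exception applies (expected a plain relation, found a partitioned table) */
static bool
mismatch_intended(char actual, char expected)
{
    return actual != expected &&
           !(expected == 'r' && actual == 'P');
}
```

With the as-written form, asking for a sequence ('S') but finding a plain relation ('r') makes the second conjunct false, so no error is ever raised.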

+/* Checks if a Var node is for a given attnum */
+static bool
+find_attr_reference_walker(Node *node, find_attr_reference_context *context)
+{
+    if (node == NULL)
+        return false;
+
+    if (IsA(node, Var))
+    {
+        Var       *var = (Var *) node;
+        char   *varattname = get_attname(context->relid, var->varattno);
+
+        if (!strcmp(varattname, context->attname))
+            return true;
+    }
+
+    return expression_tree_walker(node, find_attr_reference_walker, context);
+}

Hrm.  The comment says we're matching on attnum, but the code says
we're matching on attname.  is_partition_attr() has the same confusion
between comments and code.  Maybe instead of this whole approach it
would be better to use pull_varattnos(), then get_attnum() to find the
attribute number for the one you want, then bms_is_member().
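The shape of that suggestion, sketched standalone with a plain unsigned bitmask standing in for a Bitmapset (the real code would use pull_varattnos(), get_attnum() and bms_is_member(); everything below is an illustrative stand-in, including the "expression" being just an array of attnums):

```c
#include <stdbool.h>

/* pull_varattnos() analog: fold all attnums referenced by an expression
 * into a set */
static unsigned
collect_attnos(const int *attnos, int n)
{
    unsigned set = 0;

    for (int i = 0; i < n; i++)
        set |= 1u << attnos[i];
    return set;
}

/* bms_is_member() analog */
static bool
attno_is_member(unsigned set, int attno)
{
    return (set & (1u << attno)) != 0;
}
```

The point being that the walker's string comparison on attribute names is replaced by one set-membership test on attribute numbers.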

+static PartitionBy *
+transformPartitionBy(Relation rel, PartitionBy *partitionby)
+{
+    PartitionBy       *partby;
+    ParseState       *pstate;
+    RangeTblEntry  *rte;
+    ListCell       *l;
+
+    partby = (PartitionBy *) makeNode(PartitionBy);
+
+    partby->strategy = partitionby->strategy;
+    partby->location = partitionby->location;
+    partby->partParams = NIL;
+
+    /*
+     * Create a dummy ParseState and insert the target relation as its sole
+     * rangetable entry.  We need a ParseState for transformExpr.
+     */
+    pstate = make_parsestate(NULL);

Why isn't this logic being invoked from transformCreateStmt()?  Then
we could use the actual parseState for the query instead of a fake
one.

+            if (IsA(expr, CollateExpr))
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+                         errmsg("cannot use COLLATE in partition key expression")));

I assume there is a good reason for this seemingly-arbitrary
restriction, but there's no comment saying what it is.  One thing
that's odd is that this will only prohibit a CollateExpr at the top
level, not in some more-deeply nested position.  That seems
suspicious.

+                /*
+                 * User wrote "(column)" or "(column COLLATE something)".
+                 * Treat it like simple attribute anyway.
+                 */

Evidently, the user did not do that, because you just prohibited the
second one of those.

+                if (IsA(expr, Const))
+                    ereport(ERROR,
+                            (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+                             errmsg("cannot use a constant expression as partition key")));
+
+                if (contain_mutable_functions(expr))
+                    ereport(ERROR,
+                            (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+                             errmsg("functions in partition key expression must be marked IMMUTABLE")));

Do these checks parallel what we do for CHECK constraints?  It might
be good to apply about the same level of rigor in both cases.

+                exprsrc = deparse_expression(expr,
+                            deparse_context_for(RelationGetRelationName(rel),
+                                                RelationGetRelid(rel)),
+                                       false, false);

Why bother?  The output of this doesn't seem like a useful thing to
store.  The fact that we've done similar things elsewhere doesn't make
it a good idea.  I think we did it in other cases because we used to
be dumber than we are now.

+                        (errcode(ERRCODE_UNDEFINED_OBJECT),
+                         errmsg("data type %s has no default btree operator class",
+                                format_type_be(atttype)),
+                         errhint("You must specify an existing btree operator class or define one for the type.")));

The hint is not really accurate, because the type may well have a
btree operator class.  Just not a default one.

+                            char    relkind = ((CreateStmt *) stmt)->partby != NULL
+                                                    ? RELKIND_PARTITIONED_TABLE
+                                                    : RELKIND_RELATION;

Let's push this down into DefineRelation().  i.e. if (stmt->partby !=
NULL) { if (relkind != RELKIND_RELATION) ereport(...); relkind =
RELKIND_PARTITION_TABLE; }

+        RelationBuildPartitionKey(relation);

I wonder if RelationBuildPartitionKey should really be in relcache.c.
What do we do in similar cases?

+} PartitionBy;

Maybe PartitionSpec?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
Hi Amit,
Following sequence of DDLs gets an error
--
-- multi-leveled partitions
--
CREATE TABLE prt1_l (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_l_p1 PARTITION OF prt1_l FOR VALUES START (0) END (250) PARTITION BY RANGE (b);
CREATE TABLE prt1_l_p1_p1 PARTITION OF prt1_l_p1 FOR VALUES START (0) END (100);
CREATE TABLE prt1_l_p1_p2 PARTITION OF prt1_l_p1 FOR VALUES START (100) END (250);
CREATE TABLE prt1_l_p2 PARTITION OF prt1_l FOR VALUES START (250) END (500) PARTITION BY RANGE (c);
CREATE TABLE prt1_l_p2_p1 PARTITION OF prt1_l_p2 FOR VALUES START ('0250') END ('0400');
CREATE TABLE prt1_l_p2_p2 PARTITION OF prt1_l_p2 FOR VALUES START ('0400') END ('0500');
CREATE TABLE prt1_l_p3 PARTITION OF prt1_l FOR VALUES START (500) END (600) PARTITION BY RANGE ((b + a));
ERROR:  cannot use column or expression from ancestor partition key

The last statement is trying to create subpartitions by range (b + a),
whose expression contains a column from an ancestor's partition key but
is not exactly the same as that key.  In fact, it contains an extra
column beyond the ancestor's partition key columns.  Why do we want to
prohibit such cases?

On Thu, Sep 15, 2016 at 2:23 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Hi
>
> On 2016/09/09 17:55, Amit Langote wrote:
>> On 2016/09/06 22:04, Amit Langote wrote:
>>> Will fix.
>>
>> Here is an updated set of patches.
>
> An email from Rajkumar somehow managed to break out of this thread.
> Quoting his message below so that I don't end up replying with patches on
> two different threads.
>
> On 2016/09/14 16:58, Rajkumar Raghuwanshi wrote:
>> I have Continued with testing declarative partitioning with the latest
>> patch. Got some more observation, given below
>
> Thanks a lot for testing.
>
>> -- Observation 1 : Getting overlap error with START with EXCLUSIVE in range
>> partition.
>>
>> create table test_range_bound ( a int) partition by range(a);
>> --creating a partition to contain records {1,2,3,4}, by default 1 is
>> inclusive and 5 is exclusive
>> create table test_range_bound_p1 partition of test_range_bound for values
>> start (1) end (5);
>> --now trying to create a partition by explicitly mentioning start is
>> exclusive to contain records {5,6,7}, here trying to create with START with
>> 4 as exclusive so range should be 5 to 8, but getting partition overlap
>> error.
>> create table test_range_bound_p2 partition of test_range_bound for values
>> start (4) EXCLUSIVE end (8);
>> ERROR:  partition "test_range_bound_p2" would overlap partition
>> "test_range_bound_p1"
>
> Wow, this is bad.  What is needed in this case is "canonicalization" of
> the range partition bounds specified in the command.  Range types do this
> and hence an equivalent test done with range type values would disagree
> with the result given by the patch for range partition bounds.
>
> select '[1,5)'::int4range && '(4,8]'::int4range as cmp;
>  cmp
> -----
>  f
> (1 row)
>
> In this case, the second range is converted into its equivalent canonical
> form viz. '[5, 9)'.  Then comparison of bounds 5) and [5 can tell that the
> ranges do not overlap after all.  Range type operators can do this because
> their code can rely on the availability of a canonicalization function for
> a given range type.  From the range types documentation:
>
> """
> A discrete range type should have a canonicalization function that is
> aware of the desired step size for the element type. The canonicalization
> function is charged with converting equivalent values of the range type to
> have identical representations, in particular consistently inclusive or
> exclusive bounds. If a canonicalization function is not specified, then
> ranges with different formatting will always be treated as unequal, even
> though they might represent the same set of values in reality.
> """
>
> to extend the last sentence:
>
> "... or consider two ranges overlapping when in reality they are not
> (maybe they are really just adjacent)."
>
> Within the code handling range partition bound, no such canonicalization
> happens, so comparison 5) and (4 ends up concluding that upper1 > lower2,
> hence ranges overlap.
>
> To mitigate this, how about we restrict range partition key to contain
> columns of only those types for which we know we can safely canonicalize a
> range bound (ie, discrete range types)?  I don't think we can use, say,
> existing int4range_canonical but will have to write a version of it for
> partitioning usage (range bounds of partitions are different from what
> int4range_canonical is ready to handle).  This approach will be very
> limiting as then range partitions will be limited to columns of int,
> bigint and date type only.
>
> One more option is we let the user specify the canonicalize function next
> to the column name when defining the partition key.  If not specified, we
> hard-code one for the types for which we will be implementing a
> canonicalize function (ie, above mentioned types).  In other cases, we
> just don't have one and hence if an unexpected result occurs when creating
> a new partition, it's up to the user to realize what happened.  Of course,
> we will be mentioning in the documentation why a canonicalize function is
> necessary and how to write one.  Note that this canonicalize function
> comes into play only when defining new partitions, it has no role beyond
> that point.
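To make the arithmetic above concrete, here is a standalone sketch of canonicalizing discrete integer bounds into the [inclusive, exclusive) form that int4range uses (plain C with made-up names; the real code would of course go through the partition bound data structures and per-type step sizes):

```c
#include <stdbool.h>

typedef struct
{
    int lo;     /* canonical: inclusive lower bound */
    int hi;     /* canonical: exclusive upper bound */
} IntRange;

/* Convert bounds of arbitrary inclusivity into the canonical
 * [inclusive, exclusive) form, assuming a step size of 1. */
static IntRange
canonicalize(int start, bool start_incl, int end, bool end_incl)
{
    IntRange r;

    r.lo = start_incl ? start : start + 1;
    r.hi = end_incl ? end + 1 : end;
    return r;
}

static bool
ranges_overlap(IntRange a, IntRange b)
{
    return a.lo < b.hi && b.lo < a.hi;
}
```

Comparing the raw bounds 5) and (4 without this conversion is exactly what produces the spurious overlap.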
>
>> -- Observation 2 : able to create sub-partition out of the range set for
>> main table, causing not able to insert data satisfying any of the partition.
>>
>> create table test_subpart (c1 int) partition by range (c1);
>> create table test_subpart_p1 partition of test_subpart for values start (1)
>> end (100) inclusive partition by range (c1);
>> create table test_subpart_p1_sub1 partition of test_subpart_p1 for values
>> start (101) end (200);
>>
>> \d+ test_subpart
>>                      Table "public.test_subpart"
>>  Column |  Type   | Modifiers | Storage | Stats target | Description
>> --------+---------+-----------+---------+--------------+-------------
>>  c1     | integer |           | plain   |              |
>> Partition Key: RANGE (c1)
>> Partitions: test_subpart_p1 FOR VALUES START (1) END (100) INCLUSIVE
>>
>> \d+ test_subpart_p1
>>                    Table "public.test_subpart_p1"
>>  Column |  Type   | Modifiers | Storage | Stats target | Description
>> --------+---------+-----------+---------+--------------+-------------
>>  c1     | integer |           | plain   |              |
>> Partition Of: test_subpart FOR VALUES START (1) END (100) INCLUSIVE
>> Partition Key: RANGE (c1)
>> Partitions: test_subpart_p1_sub1 FOR VALUES START (101) END (200)
>>
>> insert into test_subpart values (50);
>> ERROR:  no partition of relation "test_subpart_p1" found for row
>> DETAIL:  Failing row contains (50).
>> insert into test_subpart values (150);
>> ERROR:  no partition of relation "test_subpart" found for row
>> DETAIL:  Failing row contains (150).
>
> It seems that DDL should prevent the same column being used in partition
> key of lower level partitions.  I don't know how much sense it would make,
> but being able to use the same column as partition key of lower level
> partitions may be a feature useful to some users if they know what they
> are doing.  But this last part doesn't sound like a good thing.  I
> modified the patch such that lower level partitions cannot use columns
> used by ancestor tables.
>
>> -- Observation 3 : Getting cache lookup failed, when selecting list
>> partition table containing array.
>>
>> CREATE TABLE test_array ( i int,j int[],k text[]) PARTITION BY LIST (j);
>> CREATE TABLE test_array_p1 PARTITION OF test_array FOR VALUES IN ('{1}');
>> CREATE TABLE test_array_p2 PARTITION OF test_array FOR VALUES IN ('{2,2}');
>>
>> INSERT INTO test_array (i,j[1],k[1]) VALUES (1,1,1);
>> INSERT INTO test_array (i,j[1],j[2],k[1]) VALUES (2,2,2,2);
>>
>> postgres=# SELECT tableoid::regclass,* FROM test_array_p1;
>>    tableoid    | i |  j  |  k
>> ---------------+---+-----+-----
>>  test_array_p1 | 1 | {1} | {1}
>> (1 row)
>>
>> postgres=# SELECT tableoid::regclass,* FROM test_array_p2;
>>    tableoid    | i |   j   |  k
>> ---------------+---+-------+-----
>>  test_array_p2 | 2 | {2,2} | {2}
>> (1 row)
>>
>> postgres=# SELECT tableoid::regclass,* FROM test_array;
>> ERROR:  cache lookup failed for type 0
>
> That's a bug.  Fixed in the attached patch.
>
> PS: I'm going to have limited Internet access during this weekend and over
> the next week, so responses could be slow.  Sorry about that.
>
> Thanks,
> Amit



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
For list partitions, the ListInfo stores the index maps for values,
i.e. the index of the partition to which each value belongs.  Those
indexes are the same as the indexes into the partition OIDs array and
come from the catalogs.  If a user creates two partitioned tables with
exactly the same lists for partitions but specifies them in a different
order, the OIDs are stored in the order specified.  This means that the
index arrays for these tables come out different.  equal_list_info()
works around that by creating an array of mappings and checking whether
that mapping is consistent for all values.  This means we will create
the mapping as many times as equal_list_info() is called, which is
expected to be more than the number of times
RelationBuildPartitionDescriptor() is called.  Instead, if we
"canonicalize" the indexes so that they come out exactly the same for
similarly partitioned tables, we build the mapping only once and
arrange the OIDs accordingly.
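The scheme can be sketched standalone like this (the struct layout and names are illustrative stand-ins, not the patch's actual ListInfo): scan the values in their sorted order, give each distinct catalog-order index a canonical index in order of first appearance, and move the OIDs to their mapped positions.

```c
#define MAXPARTS 8

typedef struct
{
    int      nparts;
    unsigned oids[MAXPARTS];      /* partition OIDs, catalog order on input */
    int      indexes[MAXPARTS];   /* per-value partition index */
    int      nvalues;
} ListInfo;

static void
canonicalize_list_indexes(ListInfo *li)
{
    int      mapping[MAXPARTS];
    unsigned newoids[MAXPARTS];
    int      next = 0;

    for (int i = 0; i < li->nparts; i++)
        mapping[i] = -1;

    /* Assign canonical indexes by first appearance in value order */
    for (int i = 0; i < li->nvalues; i++)
    {
        int old = li->indexes[i];

        if (mapping[old] == -1)
            mapping[old] = next++;
        li->indexes[i] = mapping[old];
    }

    /* Scatter each OID to its mapped position */
    for (int i = 0; i < li->nparts; i++)
        newoids[mapping[i]] = li->oids[i];
    for (int i = 0; i < li->nparts; i++)
        li->oids[i] = newoids[i];
}
```

After this, two tables with the same value lists get identical index maps regardless of the order in which their partitions were declared, so comparing them is a plain array comparison.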

Here's a patch to do that.  I have run make check with it and it didn't
show any failures.  Please consider it for inclusion in your next set
of patches.

That helps partition-wise join as well.  For partition-wise join (and
further optimizations for partitioned tables), we create a list of
canonical partition schemes.  In this list, two similarly partitioned
tables share a partition scheme pointer.  A join between relations with
the same partition scheme pointer can be executed partition-wise.  It's
important that the indexes in the partition scheme match the OIDs array
so that we can find the matching RelOptInfos for partition-wise join.

On Thu, Sep 22, 2016 at 11:12 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Hi Amit,
> Following sequence of DDLs gets an error
> --
> -- multi-leveled partitions
> --
> CREATE TABLE prt1_l (a int, b int, c varchar) PARTITION BY RANGE(a);
> CREATE TABLE prt1_l_p1 PARTITION OF prt1_l FOR VALUES START (0) END
> (250) PARTITION BY RANGE (b);
> CREATE TABLE prt1_l_p1_p1 PARTITION OF prt1_l_p1 FOR VALUES START (0) END (100);
> CREATE TABLE prt1_l_p1_p2 PARTITION OF prt1_l_p1 FOR VALUES START
> (100) END (250);
> CREATE TABLE prt1_l_p2 PARTITION OF prt1_l FOR VALUES START (250) END
> (500) PARTITION BY RANGE (c);
> CREATE TABLE prt1_l_p2_p1 PARTITION OF prt1_l_p2 FOR VALUES START
> ('0250') END ('0400');
> CREATE TABLE prt1_l_p2_p2 PARTITION OF prt1_l_p2 FOR VALUES START
> ('0400') END ('0500');
> CREATE TABLE prt1_l_p3 PARTITION OF prt1_l FOR VALUES START (500) END
> (600) PARTITION BY RANGE ((b + a));
> ERROR:  cannot use column or expression from ancestor partition key
>
> The last statement is trying create subpartitions by range (b + a),
> which contains a partition key from ancestor partition key but is not
> exactly same as that. In fact it contains some extra columns other
> than the ancestor partition key columns. Why do we want to prohibit
> such cases?
>
> On Thu, Sep 15, 2016 at 2:23 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> Hi
>>
>> On 2016/09/09 17:55, Amit Langote wrote:
>>> On 2016/09/06 22:04, Amit Langote wrote:
>>>> Will fix.
>>>
>>> Here is an updated set of patches.
>>
>> An email from Rajkumar somehow managed to break out of this thread.
>> Quoting his message below so that I don't end up replying with patches on
>> two different threads.
>>
>> On 2016/09/14 16:58, Rajkumar Raghuwanshi wrote:
>>> I have Continued with testing declarative partitioning with the latest
>>> patch. Got some more observation, given below
>>
>> Thanks a lot for testing.
>>
>>> -- Observation 1 : Getting overlap error with START with EXCLUSIVE in range
>>> partition.
>>>
>>> create table test_range_bound ( a int) partition by range(a);
>>> --creating a partition to contain records {1,2,3,4}, by default 1 is
>>> inclusive and 5 is exclusive
>>> create table test_range_bound_p1 partition of test_range_bound for values
>>> start (1) end (5);
>>> --now trying to create a partition by explicitly mentioning start is
>>> exclusive to contain records {5,6,7}, here trying to create with START with
>>> 4 as exclusive so range should be 5 to 8, but getting partition overlap
>>> error.
>>> create table test_range_bound_p2 partition of test_range_bound for values
>>> start (4) EXCLUSIVE end (8);
>>> ERROR:  partition "test_range_bound_p2" would overlap partition
>>> "test_range_bound_p1"
>>
>> Wow, this is bad.  What is needed in this case is "canonicalization" of
>> the range partition bounds specified in the command.  Range types do this
>> and hence an equivalent test done with range type values would disagree
>> with the result given by the patch for range partition bounds.
>>
>> select '[1,5)'::int4range && '(4,8]'::int4range as cmp;
>>  cmp
>> -----
>>  f
>> (1 row)
>>
>> In this case, the second range is converted into its equivalent canonical
>> form viz. '[5, 9)'.  Then comparison of bounds 5) and [5 can tell that the
>> ranges do not overlap after all.  Range type operators can do this because
>> their code can rely on the availability of a canonicalization function for
>> a given range type.  From the range types documentation:
>>
>> """
>> A discrete range type should have a canonicalization function that is
>> aware of the desired step size for the element type. The canonicalization
>> function is charged with converting equivalent values of the range type to
>> have identical representations, in particular consistently inclusive or
>> exclusive bounds. If a canonicalization function is not specified, then
>> ranges with different formatting will always be treated as unequal, even
>> though they might represent the same set of values in reality.
>> """
>>
>> to extend the last sentence:
>>
>> "... or consider two ranges overlapping when in reality they are not
>> (maybe they are really just adjacent)."
>>
>> Within the code handling range partition bound, no such canonicalization
>> happens, so comparison 5) and (4 ends up concluding that upper1 > lower2,
>> hence ranges overlap.
>>
>> To mitigate this, how about we restrict range partition key to contain
>> columns of only those types for which we know we can safely canonicalize a
>> range bound (ie, discrete range types)?  I don't think we can use, say,
>> existing int4range_canonical but will have to write a version of it for
>> partitioning usage (range bounds of partitions are different from what
>> int4range_canonical is ready to handle).  This approach will be very
>> limiting as then range partitions will be limited to columns of int,
>> bigint and date type only.
>>
>> One more option is we let the user specify the canonicalize function next
>> to the column name when defining the partition key.  If not specified, we
>> hard-code one for the types for which we will be implementing a
>> canonicalize function (ie, above mentioned types).  In other cases, we
>> just don't have one and hence if an unexpected result occurs when creating
>> a new partition, it's up to the user to realize what happened.  Of course,
>> we will be mentioning in the documentation why a canonicalize function is
>> necessary and how to write one.  Note that this canonicalize function
>> comes into play only when defining new partitions, it has no role beyond
>> that point.
>>
>>> -- Observation 2 : able to create sub-partition out of the range set for
>>> main table, causing not able to insert data satisfying any of the partition.
>>>
>>> create table test_subpart (c1 int) partition by range (c1);
>>> create table test_subpart_p1 partition of test_subpart for values start (1)
>>> end (100) inclusive partition by range (c1);
>>> create table test_subpart_p1_sub1 partition of test_subpart_p1 for values
>>> start (101) end (200);
>>>
>>> \d+ test_subpart
>>>                      Table "public.test_subpart"
>>>  Column |  Type   | Modifiers | Storage | Stats target | Description
>>> --------+---------+-----------+---------+--------------+-------------
>>>  c1     | integer |           | plain   |              |
>>> Partition Key: RANGE (c1)
>>> Partitions: test_subpart_p1 FOR VALUES START (1) END (100) INCLUSIVE
>>>
>>> \d+ test_subpart_p1
>>>                    Table "public.test_subpart_p1"
>>>  Column |  Type   | Modifiers | Storage | Stats target | Description
>>> --------+---------+-----------+---------+--------------+-------------
>>>  c1     | integer |           | plain   |              |
>>> Partition Of: test_subpart FOR VALUES START (1) END (100) INCLUSIVE
>>> Partition Key: RANGE (c1)
>>> Partitions: test_subpart_p1_sub1 FOR VALUES START (101) END (200)
>>>
>>> insert into test_subpart values (50);
>>> ERROR:  no partition of relation "test_subpart_p1" found for row
>>> DETAIL:  Failing row contains (50).
>>> insert into test_subpart values (150);
>>> ERROR:  no partition of relation "test_subpart" found for row
>>> DETAIL:  Failing row contains (150).
>>
>> It seems that DDL should prevent the same column being used in partition
>> key of lower level partitions.  I don't know how much sense it would make,
>> but being able to use the same column as partition key of lower level
>> partitions may be a feature useful to some users if they know what they
>> are doing.  But this last part doesn't sound like a good thing.  I
>> modified the patch such that lower level partitions cannot use columns
>> used by ancestor tables.
>>
>>> -- Observation 3 : Getting cache lookup failed, when selecting list
>>> partition table containing array.
>>>
>>> CREATE TABLE test_array ( i int,j int[],k text[]) PARTITION BY LIST (j);
>>> CREATE TABLE test_array_p1 PARTITION OF test_array FOR VALUES IN ('{1}');
>>> CREATE TABLE test_array_p2 PARTITION OF test_array FOR VALUES IN ('{2,2}');
>>>
>>> INSERT INTO test_array (i,j[1],k[1]) VALUES (1,1,1);
>>> INSERT INTO test_array (i,j[1],j[2],k[1]) VALUES (2,2,2,2);
>>>
>>> postgres=# SELECT tableoid::regclass,* FROM test_array_p1;
>>>    tableoid    | i |  j  |  k
>>> ---------------+---+-----+-----
>>>  test_array_p1 | 1 | {1} | {1}
>>> (1 row)
>>>
>>> postgres=# SELECT tableoid::regclass,* FROM test_array_p2;
>>>    tableoid    | i |   j   |  k
>>> ---------------+---+-------+-----
>>>  test_array_p2 | 2 | {2,2} | {2}
>>> (1 row)
>>>
>>> postgres=# SELECT tableoid::regclass,* FROM test_array;
>>> ERROR:  cache lookup failed for type 0
>>
>> That's a bug.  Fixed in the attached patch.
>>
>> PS: I'm going to have limited Internet access during this weekend and over
>> the next week, so responses could be slow.  Sorry about that.
>>
>> Thanks,
>> Amit
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
On Thu, Sep 22, 2016 at 1:02 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> For list partitions, the ListInfo stores the index maps for values
> i.e. the index of the partition to which the value belongs. Those
> indexes are same as the indexes in partition OIDs array and come from
> the catalogs. In case a user creates two partitioned tables with
> exactly same lists for partitions but specifies them in a different
> order, the OIDs are stored in the order specified. This means that
> index array for these tables come out different. equal_list_info()
> works around that by creating an array of mappings and checks whether
> that mapping is consistent for all values. This means we will create
> the mapping as many times as equal_list_info() is called, which is
> expected to be more than the number of time
> RelationBuildPartitionDescriptor() is called. Instead, if we
> "canonicalise" the indexes so that they come out exactly same for
> similarly partitioned tables, we build the mapping only once and
> arrange OIDs accordingly.
>
> Here's patch to do that. I have ran make check with this and it didn't
> show any failure. Please consider this to be included in your next set
> of patches.

The patch has an if condition as a statement by itself:
+    if (l1->null_index != l2->null_index);
+        return false;

There shouldn't be a ';' at the end of the if line.  It looks like, in
the tests you have added, the function always bails out before it
reaches this statement.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi Ashutosh,

On 2016/09/22 14:42, Ashutosh Bapat wrote:
> Hi Amit,
> Following sequence of DDLs gets an error
> --
> -- multi-leveled partitions
> --
> CREATE TABLE prt1_l (a int, b int, c varchar) PARTITION BY RANGE(a);
> CREATE TABLE prt1_l_p1 PARTITION OF prt1_l FOR VALUES START (0) END
> (250) PARTITION BY RANGE (b);
> CREATE TABLE prt1_l_p1_p1 PARTITION OF prt1_l_p1 FOR VALUES START (0) END (100);
> CREATE TABLE prt1_l_p1_p2 PARTITION OF prt1_l_p1 FOR VALUES START
> (100) END (250);
> CREATE TABLE prt1_l_p2 PARTITION OF prt1_l FOR VALUES START (250) END
> (500) PARTITION BY RANGE (c);
> CREATE TABLE prt1_l_p2_p1 PARTITION OF prt1_l_p2 FOR VALUES START
> ('0250') END ('0400');
> CREATE TABLE prt1_l_p2_p2 PARTITION OF prt1_l_p2 FOR VALUES START
> ('0400') END ('0500');
> CREATE TABLE prt1_l_p3 PARTITION OF prt1_l FOR VALUES START (500) END
> (600) PARTITION BY RANGE ((b + a));
> ERROR:  cannot use column or expression from ancestor partition key
> 
> The last statement is trying create subpartitions by range (b + a),
> which contains a partition key from ancestor partition key but is not
> exactly same as that. In fact it contains some extra columns other
> than the ancestor partition key columns. Why do we want to prohibit
> such cases?

Per discussion [1], I am going to remove this ill-considered restriction.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BHiwqEXAU_m%2BV%3Db-VGmsDNjoqc-Z_9KQdyPuOGbiQGzNObmVg%40mail.gmail.com





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/09/22 19:10, Ashutosh Bapat wrote:
> On Thu, Sep 22, 2016 at 1:02 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> For list partitions, the ListInfo stores the index maps for values
>> i.e. the index of the partition to which the value belongs. Those
>> indexes are same as the indexes in partition OIDs array and come from
>> the catalogs. In case a user creates two partitioned tables with
>> exactly same lists for partitions but specifies them in a different
>> order, the OIDs are stored in the order specified. This means that
>> index array for these tables come out different. equal_list_info()
>> works around that by creating an array of mappings and checks whether
>> that mapping is consistent for all values. This means we will create
>> the mapping as many times as equal_list_info() is called, which is
>> expected to be more than the number of time
>> RelationBuildPartitionDescriptor() is called. Instead, if we
>> "canonicalise" the indexes so that they come out exactly same for
>> similarly partitioned tables, we build the mapping only once and
>> arrange OIDs accordingly.
>>
>> Here's patch to do that. I have ran make check with this and it didn't
>> show any failure. Please consider this to be included in your next set
>> of patches.

Thanks.  It seems like this will save quite a few cycles for future users
of equal_list_info().  I will incorporate it into my patch.

With this patch, the mapping is created *only once* during
RelationBuildPartitionDesc() to assign canonical indexes to individual
list values.  The partition OID array will also be rearranged such that
using the new (canonical) index instead of the old
catalog-scan-order-based index will retrieve the correct partition for
that value.

By the way, I fixed one thinko in your patch as follows:

-        result->oids[i] = oids[mapping[i]];
+        result->oids[mapping[i]] = oids[i];

That is, copy an OID at a given position in the original array to a
*mapped position* in the result array (instead of the other way around).

> The patch has an if condition as statement by itself
> +    if (l1->null_index != l2->null_index);
>          return false;
> 
> There shouldn't be ';' at the end. It looks like in the tests you have
> added the function always bails out before it reaches this statement.

There is no user of equal_list_info() through which the above bug would
have caused a regression test failure.  A segfault might have been
possible though, due to a dangling pointer into the partition descriptor's
listinfo after the above function unintentionally returned false to
equalPartitionDescs(), which is in turn called by RelationClearRelation().

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
>
> With this patch, the mapping is created *only once* during
> RelationBuildPartitionDesc() to assign canonical indexes to individual
> list values.  The partition OID array will also be rearranged such that
> using the new (canonical) index instead of the old
> catalog-scan-order-based index will retrieve the correct partition for
> that value.
>
> By the way, I fixed one thinko in your patch as follows:
>
> -        result->oids[i] = oids[mapping[i]];
> +        result->oids[mapping[i]] = oids[i];

While I cannot spot any problem with this logic, when I make that
change and run the partition_join test case in my patch, it fails because
the wrong partitions are matched for a partition-wise join of list
partitions. In that patch, the RelOptInfos of partitions are saved in the
RelOptInfo of the parent by matching their OIDs. They are saved in the
same order as the corresponding OIDs. Partition-wise join simply joins the
RelOptInfos at the same positions from both parent RelOptInfos. I
cannot spot an error in this logic either.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
Sorry about the delay in replying.

On 2016/09/15 21:58, Ashutosh Bapat wrote:
> Hi Amit,
>
> It looks like there is some problem while creating paramterized paths
> for multi-level partitioned tables. Here's a longish testcase
>
> [ ... ]
>
> Please check if you are able to reproduce these errors in your
> repository. I made sure that I cleaned up all partition-wise join code
> before testing this, but ... .

Thanks for the test case.  I can reproduce the same.

> I tried to debug the problem somewhat. In set_append_rel_pathlist(),
> it finds that at least one child has a parameterized path as the
> cheapest path, so it doesn't create an unparameterized path for append
> rel. At the same time there is a parameterization common to all the
> children, so it doesn't create any path. There seem to be two problems
> here
> 1. The children from second level onwards may not be getting
> parameterized for lateral references. That seems unlikely but
> possible.
> 2. Reparameterization should have corrected this, but
> reparameterize_path() does not support AppendPaths.

Hmm, 0005-Refactor-optimizer-s-inheritance-set-expansion-code-5.patch is
certainly to be blamed here; if I revert the patch, the problem goes away.

Based on 2 above, I attempted to add logic for AppendPath in
reparameterize_path() as in the attached.  It fixes the reported problem
and does not break any regression tests. If it's sane to do things this
way, I will incorporate the attached into patch 0005 mentioned above.
Thoughts?

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
>>
>> Please check if you are able to reproduce these errors in your
>> repository. I made sure that I cleaned up all partition-wise join code
>> before testing this, but ... .
>
> Thanks for the test case.  I can reproduce the same.
>
>> I tried to debug the problem somewhat. In set_append_rel_pathlist(),
>> it finds that at least one child has a parameterized path as the
>> cheapest path, so it doesn't create an unparameterized path for append
>> rel. At the same time there is a parameterization common to all the
>> children, so it doesn't create any path. There seem to be two problems
>> here
>> 1. The children from second level onwards may not be getting
>> parameterized for lateral references. That seems unlikely but
>> possible.

Did you check this? We may be missing creating index scan paths
with the necessary parameterization. If we fix this, we don't need to
re-parameterize the Append.

>> 2. Reparameterization should have corrected this, but
>> reparameterize_path() does not support AppendPaths.
>
> Hmm, 0005-Refactor-optimizer-s-inheritance-set-expansion-code-5.patch is
> certainly to be blamed here; if I revert the patch, the problem goes away.
>
> Based on 2 above, I attempted to add logic for AppendPath in
> reparameterize_path() as in the attached.  It fixes the reported problem
> and does not break any regression tests. If it's sane to do things this
> way, I will incorporate the attached into patch 0005 mentioned above.
> Thoughts?


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/09/27 15:44, Ashutosh Bapat wrote:
>> By the way, I fixed one thinko in your patch as follows:
>>
>> -        result->oids[i] = oids[mapping[i]];
>> +        result->oids[mapping[i]] = oids[i];
> 
> While I can not spot any problem with this logic, when I make that
> change and run partition_join testcase in my patch, it fails because
> wrong partitions are matched for partition-wise join of list
> partitions. In that patch, RelOptInfo of partitions are saved in
> RelOptInfo of the parent by matching their OIDs. They are saved in the
> same order as corresponding OIDs. Partition-wise join simply joins the
> RelOptInfos at the same positions from both the parent RelOptInfos. I
> can not spot an error in this logic too.

OTOH, using the original logic makes tuple routing put tuples into the
wrong partitions.  When debugging why that was happening I discovered this
and hence the proposed change.

You mean that partition RelOptInfos are placed using the canonical
ordering of OIDs instead of the catalog-scan-driven order, right?  If that's
true, then there is no possibility of wrong pairing happening, even with
the new ordering of OIDs in the partition descriptor (ie, the ordering
that would be produced by my proposed method above).

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
On Tue, Sep 27, 2016 at 2:46 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/09/27 15:44, Ashutosh Bapat wrote:
>>> By the way, I fixed one thinko in your patch as follows:
>>>
>>> -        result->oids[i] = oids[mapping[i]];
>>> +        result->oids[mapping[i]] = oids[i];
>>
>> While I can not spot any problem with this logic, when I make that
>> change and run partition_join testcase in my patch, it fails because
>> wrong partitions are matched for partition-wise join of list
>> partitions. In that patch, RelOptInfo of partitions are saved in
>> RelOptInfo of the parent by matching their OIDs. They are saved in the
>> same order as corresponding OIDs. Partition-wise join simply joins the
>> RelOptInfos at the same positions from both the parent RelOptInfos. I
>> can not spot an error in this logic too.
>
> OTOH, using the original logic makes tuple routing put tuples into the
> wrong partitions.  When debugging why that was happening I discovered this
> and hence the proposed change.
>
> You mean that partition RelOptInfo's are placed using the canonical
> ordering of OIDs instead of catalog-scan-driven order, right?  If that's
> true, then there is no possibility of wrong pairing happening, even with
> the new ordering of OIDs in the partition descriptor (ie, the ordering
> that would be produced by my proposed method above).

right! I don't know what's wrong, will debug my changes.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/09/27 18:09, Ashutosh Bapat wrote:
>>> I tried to debug the problem somewhat. In set_append_rel_pathlist(),
>>> it finds that at least one child has a parameterized path as the
>>> cheapest path, so it doesn't create an unparameterized path for append
>>> rel. At the same time there is a parameterization common to all the
>>> children, so it doesn't create any path. There seem to be two problems
>>> here
>>> 1. The children from second level onwards may not be getting
>>> parameterized for lateral references. That seems unlikely but
>>> possible.
>
> Did you check this? We may be missing on creating index scan paths
> with parameterization. If we fix this, we don't need to
> re-parameterize Append.

You're right.  How about the attached patch that fixes the problem along
these lines?  The problem seems to be that multi-level inheritance sets
(partitioned tables) are not handled in create_lateral_join_info(), which
means that lateral_relids and lateral_referencers of the root relation are
not being propagated to the partitions below level 1.

I'm getting concerned about one thing though - for a given *regular*
inheritance set, the root->append_rel_list would be scanned only once, but
for a *partitioned table* inheritance set, it would be scanned once for
every partitioned table in the set (ie, the root table and internal
partitions).

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Amit Langote
Date:
Thanks a lot for the review and sorry about the delay in replying.
Combining responses to two emails.

On 2016/09/20 5:06, Robert Haas wrote:
> On Mon, Aug 15, 2016 at 7:21 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> +    if (partexprs)
>>> +        recordDependencyOnSingleRelExpr(&myself,
>>> +                                        (Node *) partexprs,
>>> +                                        RelationGetRelid(rel),
>>> +                                        DEPENDENCY_NORMAL,
>>> +                                        DEPENDENCY_IGNORE);
>>>
>>> I don't think introducing a new DEPENDENCY_IGNORE type is a good idea
>>> here.  Instead, you could just add an additional Boolean argument to
>>> recordDependencyOnSingleRelExpr.  That seems less likely to create
>>> bugs in unrelated portions of the code.
>>
>> I did consider a Boolean argument instead of a new DependencyType value,
>> however it felt a bit strange to pass a valid value for the fifth argument
>> (self_behavior) and then ask using a separate parameter that it (a
>> self-dependency) is to be ignored.  By the way, no pg_depend entry is
>> created on such a call, so the effect of the new type's usage seems
>> localized to me. Thoughts?
>
> I think that's not a very plausible argument.  If you add a fifth
> argument to that function, then only that function needs to know about
> the possibility of ignoring self-dependencies.  If you add a new
> dependency type, then everything that knows about DependencyType needs
> to know about them.  That's got to be a much larger surface area for
> bugs.  Also, if you look around a bit, I believe you will find other
> examples of cases where one argument is used only for certain values
> of some other argument.  That's not a novel design pattern.

I removed DEPENDENCY_IGNORE.  Does the following look good or am I still
missing something?

@@ -83,7 +78,6 @@ typedef enum DependencyType
     DEPENDENCY_EXTENSION = 'e',
     DEPENDENCY_AUTO_EXTENSION = 'x',
     DEPENDENCY_PIN = 'p',
-    DEPENDENCY_IGNORE = 'g'

@@ -194,7 +188,8 @@ extern void recordDependencyOnExpr(const ObjectAddress
*depender,
 extern void recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
                                 Node *expr, Oid relId,
                                 DependencyType behavior,
-                                DependencyType self_behavior);
+                                DependencyType self_behavior,
+                                bool ignore_self);

@@ -1450,9 +1449,10 @@ recordDependencyOnSingleRelExpr(const ObjectAddress
*depender,
         context.addrs->numrefs = outrefs;

         /* Record the self-dependencies */
-        recordMultipleDependencies(depender,
-                                   self_addrs->refs, self_addrs->numrefs,
-                                   self_behavior);
+        if (!ignore_self)
+            recordMultipleDependencies(depender,
+                                       self_addrs->refs, self_addrs->numrefs,
+                                       self_behavior);

@@ -138,7 +138,7 @@ StorePartitionKey(Relation rel,
                                         (Node *) partexprbin,
                                         RelationGetRelid(rel),
                                         DEPENDENCY_NORMAL,
-                                        DEPENDENCY_IGNORE);
+                                        DEPENDENCY_NORMAL, true);

> Re-reviewing 0001.
>
> +      <entry><structfield>partexprs</structfield></entry>
> +      <entry><type>pg_node_tree</type></entry>
>
> This documentation doesn't match pg_partition_table.h, which has
> partexprsrc and partexprbin.  I don't understand why it's a good idea
> to have both, and there seem to be no comments or documentation
> supporting that choice anywhere.

Removed partexprsrc, I realized the design whereby it was required to be
stored in the catalog was a dubious one after all.

> +      The optional <literal>PARTITION BY</> clause specifies a method of
> +      partitioning the table and the corresponding partition key.  Table
> +      thus created is called <firstterm>partitioned</firstterm> table.  Key
> +      consists of an ordered list of column names and/or expressions when
> +      using the <literal>RANGE</> method, whereas only a single column or
> +      expression can be specified when using the <literal>LIST</> method.
> +      The type of a key column or an expression must have an associated
> +      btree operator class or one must be specified along with the column
> +      or the expression.
>
> Both of the sentences in this paragraph that do not begin with "the"
> need to begin with "the".  (In my experience, it's generally a feature
> of English as spoken in India that connecting words like "the" and "a"
> are sometimes left out where non-Indian speakers of English would
> include them, so it would be good to watch out for this issue in
> general.)

Thanks for the tip, will try to be careful (rules about a/an/the's can be
too subtle for me sometimes, so any help is much appreciated).

> Also, I think this should be rephrased a bit to be more clear about
> how the partitioning key works, like this: The optional
> <literal>PARTITION BY</literal> clause specifies a method of
> partitioning the table.  The table thus created is called a
> <firstterm>partitioned</firstterm> table.  The parenthesized list of
> expressions forms the <firsttem>partitioning key</firstterm> for the
> table.  When using range partitioning, the partioning key can include
> multiple columns or expressions, but for list partitioning, the
> partitioning key must consist of a single column or expression.  If no
> btree operator class is specified when creating a partitioned table,
> the default btree operator class for the datatype will be used.  If
> there is none, an error will be reported.

Revised text along these lines.

> +        case RELKIND_PARTITIONED_TABLE:
>              options = heap_reloptions(classForm->relkind, datum, false);
>
> Why?  None of the reloptions that pertain to heap seem relevant to a
> relkind without storage.
>
> But, ah, do partitioned tables have storage?  I mean, do we end up
> with an empty file, or no relfilenode at all?  Can I CLUSTER, VACUUM,
> etc. a partitioned table?  It would seem cleaner for the parent to
> have no relfilenode at all, but I'm guessing we might need some more
> changes for that to work out.

Things were that way initially, that is, the parent relations had no
relfilenode.  I abandoned that project however; the storage-less parent
seemed pretty daunting to handle right away.  For example, the optimizer
and executor code needed to be taught that the parent rel's appendrel
member, which used to be included as part of the result of scanning the
inheritance set, is no longer there.

That said, there are still diffs in the patch resulting from simply
informing various sites in the code that there is a new relkind
RELKIND_PARTITIONED_TABLE, with most of the sites giving it the same
treatment as RELKIND_RELATION.  Beyond that, the new relkind serves as a
convenient shorthand for telling various modules that the relation is a
partitioned table, which is in some regards to be treated differently
from ordinary tables.

> +    pg_collation.h pg_range.h pg_transform.h pg_partitioned_table.h\
>
> Whitespace.  Also, here and elsewhere, how about using alphabetical
> order, or anyway preserving it insofar as the existing list is
> alphabetized?

Fixed.

> +    /* Remove NO INHERIT flag if rel is a partitioned table */
> +    if (is_no_inherit &&
> +        rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> +                 errmsg("cannot add NO INHERIT constraint to
> partitioned table \"%s\"",
> +                         RelationGetRelationName(rel))));
>
> The code and the comment disagree.  I think the code is right and the
> comment should be adjusted to say something like /* Partitioned tables
> do not have storage, so a NO INHERIT constraint makes no sense. */

Rewrote the comment.

>
> + * IDENTIFICATION
> + *        src/backend/utils/misc/partition.c
>
> Wrong.

This file is no longer part of the 0001 patch since I moved the file's
only function to relcache.c as mentioned below.  Fixed in the later patch
nonetheless.

>
> +} KeyTypeCollInfo;
>
> I don't like this type name very much.  Can we get "Part" in there someplace?
>
> It doesn't seem to be very well-designed, either.  The number of
> entries in each array is determined by the partnatts flag in
> PartitionKeyData, which has also got various other arrays whose
> lengths are determined by partnatts.  Why do we have some arrays in
> one structure and some arrays in another structure?  Would it hurt
> anything to merge everything into one structure?  Or could
> PartitionKeyData include a field of type KeyTypeCollInfo rather than
> KeyTypeCollInfo *, saving one pointer reference every place we access
> this data?

I followed your advice to just move the typ* arrays into the
PartitionKeyData struct.

> +    /* Allocate in the supposedly short-lived working context */
>
> Why supposedly?

When writing the code, I was thinking of the following header comment that
I copied from somewhere:

 * Note that the partition key data attached to a relcache entry must be
 * stored CacheMemoryContext to ensure it survives as long as the relcache
 * entry. But we should be running in a less long-lived working context.
 * To avoid leaking cache memory if this routine fails partway through,
 * we build in working memory and then copy the completed structure into
 * cache memory.

Removed the comment as it is not very illuminating anymore.

> +    datum = fastgetattr(tuple, Anum_pg_partitioned_table_partattrs,
> +                        RelationGetDescr(catalog),
> +                        &isnull);
>
> Isn't the point of putting the fixed-length fields first that we can
> use GETSTRUCT() here?  And even for partattrs as the first
> variable-length thing?

Right, fixed.

>
> +        /*
> +         * Run the expressions through eval_const_expressions. This is
> +         * not just an optimization, but is necessary, because eventually
> +         * the planner will be comparing them to similarly-processed qual
> +         * clauses, and may fail to detect valid matches without this.
> +         * We don't bother with canonicalize_qual, however.
> +         */
>
> I'm a bit confused by this, because I would think this processing
> ought to have been done before storing anything in the system
> catalogs.  I don't see why it should be necessary to do it again after
> pulling data back out of the system catalogs.

The pattern matches what's done for other expressions that the optimizer deals
with, such as CHECK, index key, and index predicate expressions.

> +            Value *str = lfirst(partexprsrc_item);
> +            key->partcolnames[i] = pstrdup(str->val.str);
>
> Should have a blank line in between.

Fixed.

> +/*
> + * Partition key information inquiry functions
> + */
> +int
> +get_partition_key_strategy(PartitionKey key)
> +{
> +    return key->strategy;
> +}
> +
> [ ... ]
> +
> +int32
> +get_partition_col_typmod(PartitionKey key, int col)
> +{
> +    return key->tcinfo->typmod[col];
> +}
>
> If we're going to add notation for this, I think we should use macros
> (or static inline functions defined in the header file).  Doing it
> this way adds more cycles for no benefit.

OK, done using static inline functions defined in the header file.

> +    newkey->partattrs = (AttrNumber *)
> +                            palloc0(newkey->partnatts * sizeof(AttrNumber));
> +    memcpy(newkey->partattrs, fromkey->partattrs,
> +                            newkey->partnatts * sizeof(AttrNumber));
>
> It's wasteful to use palloc0 if you're immediately going to overwrite
> every byte in the array.  Use regular palloc instead.

Done.

> +    /* Only this can ever be NULL */
> +    if (!partexprbinDatum)
> +    {
> +        nulls[Anum_pg_partitioned_table_partexprbin - 1] = true;
> +        nulls[Anum_pg_partitioned_table_partexprsrc - 1] = true;
> +    }
>
> How can it be valid to have no partitioning expressions?

Keys that are simply column names are resolved to attnums and stored as
such.  If some key is an expression, then the corresponding attnum is 0
and the expression itself is added to the list that gets stored into
partexprbin.  This is the same thing that is done for index expressions.

> +    /* Tell world about the key */
> +    CacheInvalidateRelcache(rel);
>
> Is this really needed?  Isn't the caller going to do something similar
> pretty soon?

Oops, I removed that in one of the later patches but shouldn't be there in
the first place.  Fixed.

> +    heap_freetuple(tuple);
>
> Probably useless - might as well let the context reset clean it up.

Removed.

> +    simple_heap_delete(rel, &tuple->t_self);
> +
> +    /* Update the indexes on pg_partitioned_table */
> +    CatalogUpdateIndexes(rel, tuple);
>
> You don't need CatalogUpdateIndexes() after a delete, only after an
> insert or update.

Ah right, removed.

>
> +    if (classform->relkind != relkind &&
> +                (relkind == RELKIND_RELATION &&
> +                    classform->relkind != RELKIND_PARTITIONED_TABLE))
>
> That's broken.  Note that all of the conditions are joined using &&,
> so if any one of them fails then we won't throw an error.  In
> particular, it's no longer possible to throw an error when relkind is
> not RELKIND_RELATION.

You are right.  I guess it would have to be the following:

+    if ((classform->relkind != relkind &&
+         classform->relkind != RELKIND_PARTITIONED_TABLE) ||
+        (classform->relkind == RELKIND_PARTITIONED_TABLE &&
+         relkind != RELKIND_RELATION))

Such hackishness could not be helped because we have a separate DROP
command for every distinct relkind, except we overload DROP TABLE for both
regular and partitioned tables.

> +/* Checks if a Var node is for a given attnum */
> +static bool
> +find_attr_reference_walker(Node *node, find_attr_reference_context *context)
> +{
> +    if (node == NULL)
> +        return false;
> +
> +    if (IsA(node, Var))
> +    {
> +        Var       *var = (Var *) node;
> +        char   *varattname = get_attname(context->relid, var->varattno);
> +
> +        if (!strcmp(varattname, context->attname))
> +            return true;
> +    }
> +
> +    return expression_tree_walker(node, find_attr_reference_walker, context);
> +}
>
> Hrm.  The comment says we're matching on attnum, but the code says
> we're matching on attname.  is_partition_attr() has the same confusion
> between comments and code.  Maybe instead of this whole approach it
> would be better to use pull_varattnos(), then get_attnum() to find the
> attribute number for the one you want, then bms_is_member().

I like the idea to use pull_varattnos(), so done that way.

> +static PartitionBy *
> +transformPartitionBy(Relation rel, PartitionBy *partitionby)
> +{
> +    PartitionBy       *partby;
> +    ParseState       *pstate;
> +    RangeTblEntry  *rte;
> +    ListCell       *l;
> +
> +    partby = (PartitionBy *) makeNode(PartitionBy);
> +
> +    partby->strategy = partitionby->strategy;
> +    partby->location = partitionby->location;
> +    partby->partParams = NIL;
> +
> +    /*
> +     * Create a dummy ParseState and insert the target relation as its sole
> +     * rangetable entry.  We need a ParseState for transformExpr.
> +     */
> +    pstate = make_parsestate(NULL);
>
> Why isn't this logic being invoked from transformCreateStmt()?  Then
> we could use the actual parseState for the query instead of a fake
> one.

Because we need an open relation for it to work, and in this case there
won't be one until after we have performed heap_create_with_catalog() in
DefineRelation().  Mainly, we need the open relation to perform
transformExpr() on the expressions.  That's similar to how cookConstraint()
on the new CHECK constraints cannot be performed earlier.  Am I missing
something?

> +            if (IsA(expr, CollateExpr))
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> +                         errmsg("cannot use COLLATE in partition key
> expression")));
>
> I assume there is a good reason for this seemingly-arbitrary
> restriction, but there's no comment saying what it is.  One thing
> that's odd is that this will only prohibit a CollateExpr at the top
> level, not in some more-deeply nested position.  That seems
> suspicious.

Hmm, I guess it wouldn't hurt to just leave any COLLATE clauses as they are -
just remove the above code.  Of course, we must then record the collation
in the catalog alongside other user-specified information such as operator
class.  Currently, if the key is a simple column we use its attcollation
and if it's an expression then we use its exprCollation().

When I first wrote it, I wasn't sure what the implications of explicit
collations would be for partitioning.  There are two places where it comes
into play: a) when comparing partition key values using the btree compare
function b) embedded as varcollid and inputcollid in implicitly generated
check constraints for partitions.  In case of the latter, any mismatch
with query-specified collation causes constraint exclusion proof to be
canceled.  When it's default collations everywhere, the chances of that
sort of thing happening are less.  If we support user-specified collations
on keys, then things will get a little bit more involved.

Thoughts?

> +                /*
> +                 * User wrote "(column)" or "(column COLLATE something)".
> +                 * Treat it like simple attribute anyway.
> +                 */
>
> Evidently, the user did not do that, because you just prohibited the
> second one of those.

Holds true now that I have removed the prohibition.

> +                if (IsA(expr, Const))
> +                    ereport(ERROR,
> +                            (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> +                             errmsg("cannot use a constant expression
> as partition key")));
> +
> +                if (contain_mutable_functions(expr))
> +                    ereport(ERROR,
> +                            (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> +                             errmsg("functions in partition key
> expression must be marked IMMUTABLE")));
>
> Do these checks parallel what we do for CHECK constraints?  It might
> be good to apply about the same level of rigor in both cases.

Neither of these checks is done for CHECK constraints.  The only check
performed on them in cookConstraint() is whether the expression is of
boolean type.

However, in the present case, this is just one side of a whole partition
constraint (the other piece being an individual partition's bound value),
so it should be treated a bit differently from CHECK constraints.  I
modeled this on the checks in ComputeIndexAttrs(), viz. the following:

/*
 * An expression using mutable functions is probably wrong,
 * since if you aren't going to get the same result for the
 * same data every time, it's not clear what the index entries
 * mean at all.
 */
 if (CheckMutability((Expr *) expr))
     ereport(ERROR,
         (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
          errmsg("functions in index expression must be marked IMMUTABLE")));

Likewise, if a partition key expression contained mutable functions, the
same input row could be mapped to different partitions based on the result
of the expression computed from the input values.  So it seems prudent to
enforce immutability here, unlike for CHECK constraints.  Am I missing
something?

>
> +                exprsrc = deparse_expression(expr,
> +                            deparse_context_for(RelationGetRelationName(rel),
> +                                                RelationGetRelid(rel)),
> +                                       false, false);
>
> Why bother?  The output of this doesn't seem like a useful thing to
> store.  The fact that we've done similar things elsewhere doesn't make
> it a good idea.  I think we did it in other cases because we used to
> be dumber than we are now.

Deparsed expressions are no longer stored in the catalog.

> +                        (errcode(ERRCODE_UNDEFINED_OBJECT),
> +                         errmsg("data type %s has no default btree
> operator class",
> +                                format_type_be(atttype)),
> +                         errhint("You must specify an existing btree
> operator class or define one for the type.")));
>
> The hint is not really accurate, because the type may well have a
> btree operator class.  Just not a default one.

Changed to: "You must specify a btree operator class or define a default
btree operator class for the data type."

> +                            char    relkind = ((CreateStmt *)
> stmt)->partby != NULL
> +                                                    ? RELKIND_PARTITIONED_TABLE
> +                                                    : RELKIND_RELATION;
>
> Let's push this down into DefineRelation().  i.e. if (stmt->partby !=
> NULL) { if (relkind != RELKIND_RELATION) ereport(...); relkind =
> RELKIND_PARTITION_TABLE; }

Done.  By the way, I made the ereport say the following: "unexpected
relkind value passed to DefineRelation"; did you intend it to say
something else?

>
> +        RelationBuildPartitionKey(relation);
>
> I wonder if RelationBuildPartitionKey should really be in relcache.c.
> What do we do in similar cases?

There are a number of RelationBuild* functions called by RelationBuildDesc
that reside outside relcache.c in their respective modules - trigger.c
(RelationBuildTriggers), policy.c (RelationBuildRowSecurity) - whereas
some others such as RelationInitIndexAccessInfo() are housed in
relcache.c.

RelationBuildPartitionKey used to be in relcache.c and can be moved back
there, so I did that.

>
> +} PartitionBy;
>
> Maybe PartitionSpec?

A later patch uses PartitionListSpec and PartitionRangeSpec as parse nodes
for partition bounds.  How about calling PartitionBy PartitionKeySpec, and
the bound nodes just mentioned PartitionBoundList and PartitionBoundRange,
respectively?


Attached revised patches.  In addition to addressing the comments in this
email (which covered only patch 0001), a number of other fixes [1] and
miscellaneous improvements [2] have been included.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/2896956a-8af0-3f91-a3bc-1e5225ae7ed2@lab.ntt.co.jp

[2]
https://www.postgresql.org/message-id/98d6c7e7-4919-4970-a158-49f740101812%40lab.ntt.co.jp

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Sep 29, 2016 at 8:09 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I removed DEPENDENCY_IGNORE.  Does the following look good or am I still
> missing something?

You missed your commit message, but otherwise looks fine.

>> Also, I think this should be rephrased a bit to be more clear about
>> how the partitioning key works, like this: The optional
>> <literal>PARTITION BY</literal> clause specifies a method of
>> partitioning the table.  The table thus created is called a
>> <firstterm>partitioned</firstterm> table.  The parenthesized list of
>> expressions forms the <firsttem>partitioning key</firstterm> for the
>> table.  When using range partitioning, the partioning key can include
>> multiple columns or expressions, but for list partitioning, the
>> partitioning key must consist of a single column or expression.  If no
>> btree operator class is specified when creating a partitioned table,
>> the default btree operator class for the datatype will be used.  If
>> there is none, an error will be reported.
>
> Revised text along these lines.

It seems you copied my typo: my text has "partioning" for
"partitioning" and your patch now has that, too.

> Things were that way initially, that is, the parent relations had no
> relfilenode.  I abandoned that project however.  The storage-less parent
> thing seemed pretty daunting to me to handle right away.  For example,
> optimizer and the executor code needed to be taught about the parent rel
> appendrel member that used to be included as part of the result of
> scanning the inheritance set but was no longer.

Even if we leave the empty relfilenode around for now -- in the long
run I think it should die -- I think we should prohibit the creation
of subsidiary objects on the parent which are only sensible if it has
rows - e.g. indexes.  It makes no sense to disallow non-inheritable
constraints while allowing indexes, and it could box us into a corner
later.

>> +        /*
>> +         * Run the expressions through eval_const_expressions. This is
>> +         * not just an optimization, but is necessary, because eventually
>> +         * the planner will be comparing them to similarly-processed qual
>> +         * clauses, and may fail to detect valid matches without this.
>> +         * We don't bother with canonicalize_qual, however.
>> +         */
>>
>> I'm a bit confused by this, because I would think this processing
>> ought to have been done before storing anything in the system
>> catalogs.  I don't see why it should be necessary to do it again after
>> pulling data back out of the system catalogs.
>
> The pattern matches what's done for other expressions that optimizer deals
> with, such as CHECK, index key, and index predicate expressions.

That's kind of a non-answer answer, but OK.  Still, I think you
shouldn't just copy somebody else's comment blindly into a new place.
Reference the original comment, or write your own.

>> How can it be valid to have no partitioning expressions?
>
> Keys that are simply column names are resolved to attnums and stored
> likewise.  If some key is an expression, then corresponding attnum is 0
> and the expression itself is added to the list that gets stored into
> partexprbin.  It is doing the same thing as index expressions.

Oh, right.  Oops.

>> +    if (classform->relkind != relkind &&
>> +                (relkind == RELKIND_RELATION &&
>> +                    classform->relkind != RELKIND_PARTITIONED_TABLE))
>>
>> That's broken.  Note that all of the conditions are joined using &&,
>> so if any one of them fails then we won't throw an error.  In
>> particular, it's no longer possible to throw an error when relkind is
>> not RELKIND_RELATION.
>
> You are right.  I guess it would have to be the following:
>
> +    if ((classform->relkind != relkind &&
> +         classform->relkind != RELKIND_PARTITIONED_TABLE) ||
> +        (classform->relkind == RELKIND_PARTITIONED_TABLE &&
> +         relkind != RELKIND_RELATION))
>
> Such hackishness could not be helped because we have a separate DROP
> command for every distinct relkind, except we overload DROP TABLE for both
> regular and partitioned tables.

Maybe this would be better:

if (relkind == RELKIND_PARTITIONED_TABLE)
    syntax_relkind = RELKIND_RELATION;
else
    syntax_relkind = relkind;
if (classform->relkind != syntax_relkind)
    DropErrorMsgWrongType(rel->relname, classform->relkind, relkind);

>> Why isn't this logic being invoked from transformCreateStmt()?  Then
>> we could use the actual parseState for the query instead of a fake
>> one.
>
> Because we need an open relation for it to work, which in this case there
> won't be until after we have performed heap_create_with_catalog() in
> DefineRelation().  Mainly because we need to perform transformExpr() on
> expressions.  That's similar to how cookConstraint() on the new CHECK
> constraints cannot be performed earlier.  Am I missing something?

Hmm, yeah, I guess that's the same thing.  I guess I got on this line
of thinking because the function name started with "transform" which
is usually something happening during parse analysis only.  Maybe add
a code comment explaining why the work can't be done sooner, just like
what you've written here.

> Hmm, I guess it wouldn't hurt to just leave any COLLATE clauses as they are -
> just remove the above code.  Of course, we must then record the collation
> in the catalog alongside other user-specified information such as operator
> class.  Currently, if the key is a simple column we use its attcollation
> and if it's an expression then we use its exprCollation().
>
> When I first wrote it, I wasn't sure what the implications of explicit
> collations would be for partitioning.  There are two places where it comes
> into play: a) when comparing partition key values using the btree compare
> function b) embedded as varcollid and inputcollid in implicitly generated
> check constraints for partitions.  In case of the latter, any mismatch
> with query-specified collation causes constraint exclusion proof to be
> canceled.  When it's default collations everywhere, the chances of that
> sort of thing happening are less.  If we support user-specified collations
> on keys, then things will get a little bit more involved.

I think it's worth rewinding to general principles here: the only time
a collation has real meaning is in the context of a comparison
operation.  If we ask whether A < B, the answer may depend on the
collation that is used to perform the comparison.  Any other place
that mentions a collation is only trying to influence the collation
that gets used for some comparison that will happen later - so, for
example, for a table column, attcollation is setting the default
collation for comparisons involving that column.  The column is not
itself collated; that doesn't really make sense.

In the case of partitioning, there is really exactly one thing that
matters: everybody's got to agree on the collation to be used to
compare actual or hypothetical values of the partitioning columns
against the partition bounds.  If we've got a set of partition bounds
b_1, b_2, ..., b_n and a set of partitions p_1, p_2, ..., p_{n-1} such
that a row in p_i must have a key k such that b_i < k and k < b_{i+1},
and if the meaning of that "<" operator is collation-dependent, then
we had better agree on which collation is in use every single time we
do the test.
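A toy model of that point (plain Python, not PostgreSQL code): two
comparison rules stand in for two collations, and the same key falls on
different sides of a partition bound under each, so every comparison
against the bounds must agree on a single collation.

```python
# Whether a key routes below or above a bound can flip with the
# comparison rule (the "collation") used for "<".

def route(key, bounds, sort_key):
    """Return the first partition whose upper bound compares above key."""
    for i, upper in enumerate(bounds):
        if sort_key(key) < sort_key(upper):
            return i
    raise ValueError("no partition")

bounds = ["M", "z"]            # partition 0: keys < 'M'; partition 1: rest
binary = lambda s: s           # "C"-like rule: uppercase sorts first
case_insensitive = str.lower   # a rule that folds case before comparing

print(route("b", bounds, binary))            # 'b' > 'M' bytewise -> 1
print(route("b", bounds, case_insensitive))  # 'b' < 'm' folded   -> 0
```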

Indexes, of course, have this exact same problem, at least if, like
btree, they rely on ordering.  We search the index by applying the "<"
operator to compare various values taken from the index tuples with
the value we're trying to find, and it is absolutely vital that the
collation used for lookups remain constant and that it is the same as
the collation used for inserts, which had also better remain constant.
That is why pg_index has an indcollation column of type oidvector: it
tells us which collation to use for each index column.  It pairs with
indclass, which tells us which operator class to use for each index
column.  I think that partitioning will need exact analogues -
partclass and partcollation - because it has exactly the same problem.
Currently, you've got partclass but not partcollation.

I'd in general recommend that you try to follow the index precedent
here as closely as practical.

> However in the present case, this is just one side of a whole partition
> constraint (the other piece being individual partition's bound value), so
> should be treated a bit differently from the CHECK constraints.  I modeled
> this on ComputeIndexAttrs() checks viz. the following:
>
> /*
>  * An expression using mutable functions is probably wrong,
>  * since if you aren't going to get the same result for the
>  * same data every time, it's not clear what the index entries
>  * mean at all.
>  */
>  if (CheckMutability((Expr *) expr))
>      ereport(ERROR,
>          (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
>           errmsg("functions in index expression must be marked IMMUTABLE")));
>
> Likewise if a partition key expression contained mutable functions, same
> input row could be mapped to different partitions based on the result of
> expression computed using the input values.  So, it seems prudent to
> enforce immutability unlike CHECK constraints.  Am I missing something?

OK, seems reasonable.

>> +                            char    relkind = ((CreateStmt *)
>> stmt)->partby != NULL
>> +                                                    ? RELKIND_PARTITIONED_TABLE
>> +                                                    : RELKIND_RELATION;
>>
>> Let's push this down into DefineRelation().  i.e. if (stmt->partby !=
>> NULL) { if (relkind != RELKIND_RELATION) ereport(...); relkind =
>> RELKIND_PARTITION_TABLE; }
>
> Done.  By the way, I made the ereport say the following: "unexpected
> relkind value passed to DefineRelation", did you intend it to say
> something else?

You don't need it to be translatable because it should only happen if
there's a bug in the code; users shouldn't actually be able to hit it.
So use elog() rather than ereport().  You don't need to include the
function name, either; the user can use \errverbose or \set VERBOSITY
verbose to get the file and line number if they need it.  So just
elog(ERROR, "unexpected relkind") should be fine; or maybe elog(ERROR,
"unexpected relkind: %d", (int) relkind), following similar precedent
elsewhere.

Reading through the latest 0001:

+   The catalog <structname>pg_partitioned_table</structname> stores information
+   about the partition key of tables.

Maybe "stores information about how tables are partitioned".

+       Partitioning strategy (or method); <literal>l</> = list
partitioned table,

I don't think we need "(or method)".

+      <entry><structfield>partexprbin</structfield></entry>

Is there any good reason not to do partexprbin -> partexpr throughout the patch?

+      A partitioned table is divided into sub-tables (called partitions), which
+      in turn, are created using separate <literal>CREATE TABLE</> commands.

I would delete "in turn" and the subsequent comma, so that it says
"which are created using separate".

+      The table itself is empty.  A data row inserted into the table is mapped
+      to and stored in one of the partitions (if one exists) based on the
+      values of columns or expressions in the partition key and partition
+      bound constraints associated with the individual partitions.

How about: "A data row inserted into the table is routed to a
partition based on the value of columns or expressions in the
partition key.  If no existing partition matches the values in the new
row, an error will occur."
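The routing contract in that wording can be sketched as follows
(illustrative only; the partition and column names are made up):

```python
# A new row is routed by its partition key value; if no partition
# accepts the value, an error is raised instead of storing the row in
# the (empty) parent table.

partitions = {                 # list partitioning: partition -> accepted values
    "p_na": frozenset({"US", "CA"}),
    "p_eu": frozenset({"DE", "FR"}),
}

def route_row(row):
    for name, accepted in partitions.items():
        if row["country"] in accepted:
            return name
    raise LookupError("no partition found for %r" % row["country"])

print(route_row({"country": "DE"}))   # p_eu
# route_row({"country": "JP"}) raises LookupError
```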

+      Partitioned tables do not support UNIQUE, PRIMARY, EXCLUDE, or FOREIGN
+      KEY constraints; however, you can define these constraints on individual
+      data partitions.

Delete "data".   Add <literal> tags around the keywords.

+    tuple = SearchSysCache1(PARTEDRELID,
+                            ObjectIdGetDatum(RelationGetRelid(rel)));
+    /* Cannot already exist */
+    Assert(!HeapTupleIsValid(tuple));

This seems pointless.  It's hard to see how the tuple could already
exist, but if it does, this will fail a little later on anyway when we
do simple_heap_insert() and CatalogUpdateIndexes() for the new tuple.
In general, if you're doing work only to support an Assert(), you
should put #ifdef USE_ASSERT_CHECKING around it, but in this case I'd
just rip this out.

+    myself.objectId = RelationGetRelid(rel);;

Extra semicolon.

+    /*
+     * Store dependencies on anything mentioned in the key expressions.
+     * However, ignore the column references which causes self-dependencies
+     * to be created that are undesirable.  That is done by asking the
+     * dependency-tracking sub-system to ignore any such dependencies.
+     */

I think this comment is spending a lot of time explaining the
mechanism when what it should be doing is explaining why this case
arises here and not elsewhere.

+    if (relkind != RELKIND_RELATION  && relkind != RELKIND_PARTITIONED_TABLE)

Extra space before &&.

+        if(partattno != 0)

Missing space.

+         * Identify a btree opclass to use. Currently, we use only btree
+         * operators which seems enough for list and range partitioning.

Probably best to add a comma before "which".
                break;
            case T_ForeignKeyCacheInfo:
                _outForeignKeyCacheInfo(str, obj);
+            case T_PartitionSpec:
+                _outPartitionSpec(str, obj);
+                break;
+            case T_PartitionElem:
+                _outPartitionElem(str, obj);
                break;

            default:

Missing break.

+                    n->strategy = PARTITION_STRAT_RANGE;

Let's not abbreviate STRATEGY to STRAT in the names of these constants.
        case EXPR_KIND_TRIGGER_WHEN:
            return "WHEN";
+        case EXPR_KIND_PARTITION_KEY:
+            return "partition key expression";

I think you should say "PARTITION BY" here.  See the function header
comment for ParseExprKindName.

+                 errmsg("cannot use more than one column in partition key"),
+                 errdetail("Only one column allowed with list partitioning.")));

How about combining these two: cannot list partition using more than one column

+ * Note that the partition key data attached to a relcache entry must be
+ * stored in CacheMemoryContext to ensure it survives as long as the relcache
+ * entry. But we should be running in a less long-lived working context.
+ * To avoid leaking cache memory if this routine fails partway through,
+ * we build in working memory and then copy the completed structure into
+ * cache memory.

Again, don't just copy existing comments.  Refer to them.

+     * To retrieve further variable-length attributes, we'd need the catlog's

Typo.

+ * pg_partitioned_table_fn.h
+ *      prototypes for functions in catalog/pg_partitioned_table.c

Is it really worth having a separate header file for ONE function?

+PG_KEYWORD("list", LIST, UNRESERVED_KEYWORD)

I bet you can avoid making this a keyword.  PARTITION BY IDENT in the
grammar, or something like that.
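The suggestion, roughly: accept any identifier after PARTITION BY and
validate the strategy name in the action code, instead of reserving LIST
as a keyword.  A toy sketch of the idea (Python, not bison):

```python
# The "grammar" accepts any identifier after PARTITION BY; the
# strategy name is checked semantically rather than lexically.

STRATEGIES = {"list": "l", "range": "r"}   # name -> strategy code

def partition_strategy(ident):
    """Map the identifier following PARTITION BY to a strategy code."""
    try:
        return STRATEGIES[ident.lower()]
    except KeyError:
        raise ValueError('unrecognized partitioning strategy "%s"' % ident)

print(partition_strategy("LIST"), partition_strategy("range"))
```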

+CREATE TABLE pkrel(
+    a int PRIMARY KEY
+);

Add a space.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Michael Paquier
Date:
On Fri, Sep 30, 2016 at 9:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 29, 2016 at 8:09 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I removed DEPENDENCY_IGNORE.  Does the following look good or am I still
>> missing something?
>
> You missed your commit message, but otherwise looks fine.

I have moved this patch to next CF because that's fresh, but switched
the patch as "waiting on author". Be careful, the patch was in "needs
review".
-- 
Michael



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/03 13:26, Michael Paquier wrote:
> On Fri, Sep 30, 2016 at 9:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 29, 2016 at 8:09 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I removed DEPENDENCY_IGNORE.  Does the following look good or am I still
>>> missing something?
>>
>> You missed your commit message, but otherwise looks fine.
> 
> I have moved this patch to next CF because that's fresh, but switched
> the patch as "waiting on author". Be careful, the patch was in "needs
> review".

Thanks Michael.  I was going to post the patch addressing Robert's
comments today, which I will anyway.  Then I will switch the status back
to "Needs Review", albeit as part of the next CF.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
Thanks for the comments.

On 2016/09/30 9:10, Robert Haas wrote:
> On Thu, Sep 29, 2016 at 8:09 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Things were that way initially, that is, the parent relations had no
>> relfilenode.  I abandoned that project however.  The storage-less parent
>> thing seemed pretty daunting to me to handle right away.  For example,
>> optimizer and the executor code needed to be taught about the parent rel
>> appendrel member that used to be included as part of the result of
>> scanning the inheritance set but was no longer.
>
> Even if we leave the empty relfilenode around for now -- in the long
> run I think it should die -- I think we should prohibit the creation
> of subsidiary objects on the parent which are only sensible if it has
> rows - e.g. indexes.  It makes no sense to disallow non-inheritable
> constraints while allowing indexes, and it could box us into a corner
> later.

I agree.  So we must prevent, from the get-go, the creation of the
following objects on parent tables (aka RELKIND_PARTITIONED_TABLE
relations):

* Indexes
* Row triggers (?)

In addition to preventing creation of these objects, we must also teach
commands that directly invoke heapam.c to skip such relations.

I have not implemented that in the patch yet though.

>>> +        /*
>>> +         * Run the expressions through eval_const_expressions. This is
>>> +         * not just an optimization, but is necessary, because eventually
>>> +         * the planner will be comparing them to similarly-processed qual
>>> +         * clauses, and may fail to detect valid matches without this.
>>> +         * We don't bother with canonicalize_qual, however.
>>> +         */
>>>
>>> I'm a bit confused by this, because I would think this processing
>>> ought to have been done before storing anything in the system
>>> catalogs.  I don't see why it should be necessary to do it again after
>>> pulling data back out of the system catalogs.
>>
>> The pattern matches what's done for other expressions that optimizer deals
>> with, such as CHECK, index key, and index predicate expressions.
>
> That's kind of a non-answer answer, but OK.  Still, I think you
> shouldn't just copy somebody else's comment blindly into a new place.
> Reference the original comment, or write your own.

Sorry, I could have explained it a bit more in my previous message.  I
rewrote the comment as follows:

/*
 * Run the expressions through const-simplification since the planner
 * will be comparing them to similarly-processed qual clause operands,
 * and may fail to detect valid matches without this step.  We don't
 * need to bother with canonicalize_qual() though, because partition
 * expressions are not full-fledged qualification clauses.
 */

>>> +    if (classform->relkind != relkind &&
>>> +                (relkind == RELKIND_RELATION &&
>>> +                    classform->relkind != RELKIND_PARTITIONED_TABLE))
>>>
>>> That's broken.  Note that all of the conditions are joined using &&,
>>> so if any one of them fails then we won't throw an error.  In
>>> particular, it's no longer possible to throw an error when relkind is
>>> not RELKIND_RELATION.
>>
>> You are right.  I guess it would have to be the following:
>>
>> +    if ((classform->relkind != relkind &&
>> +         classform->relkind != RELKIND_PARTITIONED_TABLE) ||
>> +        (classform->relkind == RELKIND_PARTITIONED_TABLE &&
>> +         relkind != RELKIND_RELATION))
>>
>> Such hackishness could not be helped because we have a separate DROP
>> command for every distinct relkind, except we overload DROP TABLE for both
>> regular and partitioned tables.
>
> Maybe this would be better:
>
> if (relkind == RELKIND_PARTITIONED_TABLE)
>     syntax_relkind = RELKIND_RELATION;
> else
>     syntax_relkind = relkind;
> if (classform->relkind != syntax_relkind)
>     DropErrorMsgWrongType(rel->relname, classform->relkind, relkind);

In this case, relkind refers to the command, viz. DROP <ObjectType/relkind>
<relname>.  It can never be RELKIND_PARTITIONED_TABLE, so the above would
be equivalent to leaving things unchanged.  Anyway, I agree the suggested
style is better, but I assume you meant:

    if (classform->relkind == RELKIND_PARTITIONED_TABLE)
        expected_relkind = RELKIND_RELATION;
    else
        expected_relkind = classform->relkind;

    if (relkind != expected_relkind)
        DropErrorMsgWrongType(rel->relname, classform->relkind, relkind);


>>> Why isn't this logic being invoked from transformCreateStmt()?  Then
>>> we could use the actual parseState for the query instead of a fake
>>> one.
>>
>> Because we need an open relation for it to work, which in this case there
>> won't be until after we have performed heap_create_with_catalog() in
>> DefineRelation().  Mainly because we need to perform transformExpr() on
>> expressions.  That's similar to how cookConstraint() on the new CHECK
>> constraints cannot be performed earlier.  Am I missing something?
>
> Hmm, yeah, I guess that's the same thing.  I guess I got on this line
> of thinking because the function name started with "transform" which
> is usually something happening during parse analysis only.  Maybe add
> a code comment explaining why the work can't be done sooner, just like
> what you've written here.

Added the comment.

>> Hmm, I guess it wouldn't hurt to just leave any COLLATE clauses as they are -
>> just remove the above code.  Of course, we must then record the collation
>> in the catalog alongside other user-specified information such as operator
>> class.  Currently, if the key is a simple column we use its attcollation
>> and if it's an expression then we use its exprCollation().
>>
>> When I first wrote it, I wasn't sure what the implications of explicit
>> collations would be for partitioning.  There are two places where it comes
>> into play: a) when comparing partition key values using the btree compare
>> function b) embedded as varcollid and inputcollid in implicitly generated
>> check constraints for partitions.  In case of the latter, any mismatch
>> with query-specified collation causes constraint exclusion proof to be
>> canceled.  When it's default collations everywhere, the chances of that
>> sort of thing happening are less.  If we support user-specified collations
>> on keys, then things will get a little bit more involved.
>
> I think it's worth rewinding to general principles here: the only time
> a collation has real meaning is in the context of a comparison
> operation.  If we ask whether A < B, the answer may depend on the
> collation that is used to perform the comparison.  Any other place
> that mentions a collation is only trying to influence the collation
> that gets used for some comparison that will happen later - so, for
> example, for a table column, attcollation is setting the default
> collation for comparisons involving that column.  The column is not
> itself collated; that doesn't really make sense.
>
> In the case of partitioning, there is really exactly one thing that
> matters: everybody's got to agree on the collation to be used to
> compare actual or hypothetical values of the partitioning columns
> against the partition bounds.  If we've got a set of partition bounds
> b_1, b_2, ..., b_n and a set of partitions p_1, p_2, ..., p_{n-1} such
> that a row in p_i must have a key k such that b_i < k and k < b_{i+1},
> and if the meaning of that "<" operator is collation-dependent, then
> we had better agree on which collation is in use every single time we
> do the test.
>
> Indexes, of course, have this exact same problem, at least if, like
> btree, they rely on ordering.  We search the index by applying the "<"
> operator to compare various values taken from the index tuples with
> the value we're trying to find, and it is absolutely vital that the
> collation used for lookups remain constant and that it is the same as
> the collation used for inserts, which had also better remain constant.
> That is why pg_index has an indcollation column of type oidvector: it
> tells us which collation to use for each index column.  It pairs with
> indclass, which tells us which operator class to use for each index
> column.  I think that partitioning will need exact analogues -
> partclass and partcollation - because it has exactly the same problem.
> Currently, you've got partclass but not partcollation.
>
> I'd in general recommend that you try to follow the index precedent
> here as closely as practical.

Thanks for the explanation. I added a new field to the catalog called
partcollation.

That means a couple of things - when comparing input key values (consider
a new tuple being routed) to partition bounds using the btree compare
function, we use the corresponding key column's partcollation.  The same
is also used as inputcollid of the OpExpr (or ScalarArrayOpExpr) in the
implicitly generated CHECK constraints.

However, the varcollid of any Var nodes in the above expressions is still
the corresponding attribute's attcollation.

Needless to say, a query-specified qualification clause must now include
the same collate clause as used for individual partition columns (or
expressions) for the constraint exclusion to work as intended.

>>> +                            char    relkind = ((CreateStmt *)
>>> stmt)->partby != NULL
>>> +                                                    ? RELKIND_PARTITIONED_TABLE
>>> +                                                    : RELKIND_RELATION;
>>>
>>> Let's push this down into DefineRelation().  i.e. if (stmt->partby !=
>>> NULL) { if (relkind != RELKIND_RELATION) ereport(...); relkind =
>>> RELKIND_PARTITION_TABLE; }
>>
>> Done.  By the way, I made the ereport say the following: "unexpected
>> relkind value passed to DefineRelation", did you intend it to say
>> something else?
>
> You don't need it to be translatable because it should only happen if
> there's a bug in the code; users shouldn't actually be able to hit it.
> So use elog() rather than ereport().  You don't need to include the
> function name, either; the user can use \errverbose or \set VERBOSITY
> verbose to get the file and line number if they need it.  So just
> elog(ERROR, "unexpected relkind") should be fine; or maybe elog(ERROR,
> "unexpected relkind: %d", (int) relkind), following similar precedent
> elsewhere.

Done as: elog(ERROR, "unexpected relkind: %d", (int) relkind)

> Reading through the latest 0001:
>
> +   The catalog <structname>pg_partitioned_table</structname> stores information
> +   about the partition key of tables.
>
> Maybe "stores information about how tables are partitioned".

Fixed.

> +       Partitioning strategy (or method); <literal>l</> = list
> partitioned table,
>
> I don't think we need "(or method)".

Fixed.

>
> +      <entry><structfield>partexprbin</structfield></entry>
>
> Is there any good reason not to do partexprbin -> partexpr throughout the patch?

Agreed. Though I used partexprs (like indexprs).

> +      A partitioned table is divided into sub-tables (called partitions), which
> +      in turn, are created using separate <literal>CREATE TABLE</> commands.
>
> I would delete "in turn" and the subsequent comma, so that it says
> "which are created using separate".

Done.

>
> +      The table itself is empty.  A data row inserted into the table is mapped
> +      to and stored in one of the partitions (if one exists) based on the
> +      values of columns or expressions in the partition key and partition
> +      bound constraints associated with the individual partitions.
>
> How about: "A data row inserted into the table is routed to a
> partition based on the value of columns or expressions in the
> partition key.  If no existing partition matches the values in the new
> row, an error will occur."

Done.

>
> +      Partitioned tables do not support UNIQUE, PRIMARY, EXCLUDE, or FOREIGN
> +      KEY constraints; however, you can define these constraints on individual
> +      data partitions.
>
> Delete "data".   Add <literal> tags around the keywords.

Done.

> +    tuple = SearchSysCache1(PARTEDRELID,
> +                            ObjectIdGetDatum(RelationGetRelid(rel)));
> +    /* Cannot already exist */
> +    Assert(!HeapTupleIsValid(tuple));
>
> This seems pointless.  It's hard to see how the tuple could already
> exist, but if it does, this will fail a little later on anyway when we
> do simple_heap_insert() and CatalogUpdateIndexes() for the new tuple.
> In general, if you're doing work only to support an Assert(), you
> should put #ifdef USE_ASSERT_CHECKING around it, but in this case I'd
> just rip this out.

OK, removed the Assert.

> +    myself.objectId = RelationGetRelid(rel);;
>
> Extra semicolon.

Fixed.

> +    /*
> +     * Store dependencies on anything mentioned in the key expressions.
> +     * However, ignore the column references which causes self-dependencies
> +     * to be created that are undesirable.  That is done by asking the
> +     * dependency-tracking sub-system to ignore any such dependencies.
> +     */
>
> I think this comment is spending a lot of time explaining the
> mechanism when what it should be doing is explaining why this case
> arises here and not elsewhere.

Rewrote the comment as follows:

/*
 * Anything mentioned in the expressions.  We must ignore the column
 * references which will count as self-dependency items; in this case,
 * the depender is the table itself (there is no such thing as a partition
 * key object).
 */

> +    if (relkind != RELKIND_RELATION  && relkind != RELKIND_PARTITIONED_TABLE)
>
> Extra space before &&.

Fixed.

>
> +        if(partattno != 0)
>
> Missing space.

Fixed.

>
> +         * Identify a btree opclass to use. Currently, we use only btree
> +         * operators which seems enough for list and range partitioning.
>
> Probably best to add a comma before "which".

Done.

>
>                  break;
>              case T_ForeignKeyCacheInfo:
>                  _outForeignKeyCacheInfo(str, obj);
> +            case T_PartitionSpec:
> +                _outPartitionSpec(str, obj);
> +                break;
> +            case T_PartitionElem:
> +                _outPartitionElem(str, obj);
>                  break;
>
>              default:
>
> Missing break.

Oops, fixed.

> +                    n->strategy = PARTITION_STRAT_RANGE;
>
> Let's not abbreviate STRATEGY to STRAT in the names of these constants.

Done.

>
>          case EXPR_KIND_TRIGGER_WHEN:
>              return "WHEN";
> +        case EXPR_KIND_PARTITION_KEY:
> +            return "partition key expression";
>
> I think you should say "PARTITION BY" here.  See the function header
> comment for ParseExprKindName.

Used "PARTITION BY".  I was trying to mimic EXPR_KIND_INDEX_EXPRESSION,
but this seems to work better.  Also, I renamed EXPR_KIND_PARTITION_KEY to
EXPR_KIND_PARTITION_EXPRESSION.

By the way, a bunch of error messages say "partition key expression".  I'm
assuming it's better to leave them alone, that is, not rewrite them as
"PARTITION BY expressions".

> +                 errmsg("cannot use more than one column in partition key"),
> +                 errdetail("Only one column allowed with list
> partitioning.")));
>
> How about combining these two: cannot list partition using more than one column

Done.

> + * Note that the partition key data attached to a relcache entry must be
> + * stored CacheMemoryContext to ensure it survives as long as the relcache
> + * entry. But we should be running in a less long-lived working context.
> + * To avoid leaking cache memory if this routine fails partway through,
> + * we build in working memory and then copy the completed structure into
> + * cache memory.
>
> Again, don't just copy existing comments.  Refer to them.

Rewrote the comment to explain two points: why build in the working
context instead of directly in the CacheMemoryContext and why use a
private child context of CacheMemoryContext to store the information.

> +     * To retrieve further variable-length attributes, we'd need the catlog's
>
> Typo.

Fixed.

> + * pg_partitioned_table_fn.h
> + *      prototypes for functions in catalog/pg_partitioned_table.c
>
> Is it really worth having a separate header file for ONE function?

Previously the two functions in this file were defined in catalog/heap.c
and I think they could be moved back there alongside such folks as
AddRelationNewConstraints, StoreAttrDefault, RemoveAttributeById,
RemoveStatistics, etc.  So I did.

There are no longer pg_partitioned_table.c and pg_partitioned_table_fn.h
files.

> +PG_KEYWORD("list", LIST, UNRESERVED_KEYWORD)
>
> I bet you can avoid making this a keyword.  PARTITION BY IDENT in the
> grammar, or something like that.

Managed to do that with one more production as follows:

+part_strategy:  IDENT                   { $$ = $1; }
+                | unreserved_keyword    { $$ = pstrdup($1); }
+        ;

And then PARTITION BY part_strategy

That's because "range" is already a keyword.

>
> +CREATE TABLE pkrel(
> +    a int PRIMARY KEY
> +);
>
> Add a space.

Fixed.


Attached updated patches.

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Even if we leave the empty relfilenode around for now -- in the long
>> run I think it should die -- I think we should prohibit the creation
>> of subsidiary object on the parent which is only sensible if it has
>> rows - e.g. indexes.  It makes no sense to disallow non-inheritable
>> constraints while allowing indexes, and it could box us into a corner
>> later.
>
> I agree.  So we must prevent, from the get-go, the creation of the
> following objects on parent tables (aka RELKIND_PARTITIONED_TABLE
> relations):
>
> * Indexes
> * Row triggers (?)

Hmm, do we ever fire triggers on the parent for operations on a child
table?  Note this thread, which seems possibly relevant:

https://www.postgresql.org/message-id/flat/cd282adde5b70b20c57f53bb9ab75e27%40biglumber.com

We certainly won't fire the parent's per-row triggers ever, so those
should be prohibited.  But I'm not sure about per-statement triggers.

> In addition to preventing creation of these objects, we must also teach
> commands that directly invoke heapam.c to skip such relations.

I don't think that's required if we're not actually getting rid of
the relfilenode.

> In this case, relkind refers to the command viz. DROP <ObjectType/relkind>
> <relname>.  It can never be RELKIND_PARTITIONED_TABLE, so the above will
> be equivalent to leaving things unchanged.  Anyway I agree the suggested
> style is better, but I assume you meant:
>
>     if (classform->relkind == RELKIND_PARTITIONED_TABLE)
>         expected_relkind = RELKIND_RELATION;
>     else
>         expected_relkind = classform->relkind;
>
>     if (relkind != expected_relkind)
>         DropErrorMsgWrongType(rel->relname, classform->relkind, relkind);

Uh, yeah, probably that's what I meant. :-)

> Thanks for the explanation. I added a new field to the catalog called
> partcollation.
>
> That means a couple of things - when comparing input key values (consider
> a new tuple being routed) to partition bounds using the btree compare
> function, we use the corresponding key column's partcollation.  The same
> is also used as inputcollid of the OpExpr (or ScalarArrayOpExpr) in the
> implicitly generated CHECK constraints.

Check.

> However, varcollid of any Var nodes in the above expressions is still the
> corresponding attributes' attcollation.

I think that's OK.

> Needless to say, a query-specified qualification clause must now include
> the same collate clause as used for individual partition columns (or
> expressions) for the constraint exclusion to work as intended.

Hopefully this is only required if the default choice of collation
wouldn't be correct anyway.

>> +      <entry><structfield>partexprbin</structfield></entry>
>>
>> Is there any good reason not to do partexprbin -> partexpr throughout the patch?
>
> Agreed. Though I used partexprs (like indexprs).

Oh, good idea.

>>          case EXPR_KIND_TRIGGER_WHEN:
>>              return "WHEN";
>> +        case EXPR_KIND_PARTITION_KEY:
>> +            return "partition key expression";
>>
>> I think you should say "PARTITION BY" here.  See the function header
>> comment for ParseExprKindName.
>
> Used "PARTITION BY".  I was trying to mimic EXPR_KIND_INDEX_EXPRESSION,
> but this seems to work better.  Also, I renamed EXPR_KIND_PARTITION_KEY to
> EXPR_KIND_PARTITION_EXPRESSION.
>
> By the way, a bunch of error messages say "partition key expression".  I'm
> assuming it's better to leave them alone, that is, not rewrite them as
> "PARTITION BY expressions".

I think that is fine.  ParseExprKindName is something of a special case.

Reviewing 0002:

+    prettyFlags = PRETTYFLAG_INDENT;
+    PG_RETURN_TEXT_P(string_to_text(pg_get_partkeydef_worker(relid,
+                                    prettyFlags)));

Why bother with the variable?

+    /* Must get partclass, and partexprs the hard way */
+    datum = SysCacheGetAttr(PARTEDRELID, tuple,
+                            Anum_pg_partitioned_table_partclass, &isnull);
+    Assert(!isnull);
+    partclass = (oidvector *) DatumGetPointer(datum);

Comment mentions getting two things, but code only gets one.

+        partexprs = (List *) stringToNode(exprsString);

if (!IsA(partexprs, List)) elog(ERROR, ...); so as to guard against
corrupt catalog contents.

+    switch (form->partstrat)
+    {
+        case 'l':
+            appendStringInfo(&buf, "LIST");
+            break;
+        case 'r':
+            appendStringInfo(&buf, "RANGE");
+            break;
+    }

default: elog(ERROR, "unexpected partition strategy: %d", (int)
form->partstrat);

Also, why not case PARTITION_STRATEGY_LIST: instead of case 'l': and
similarly for 'r'?

@@ -5296,7 +5301,8 @@ getTables(Archive *fout, int *numTables)
                          "OR %s IS NOT NULL "
                          "OR %s IS NOT NULL"
                          "))"
-                          "AS changed_acl "
+                          "AS changed_acl, "
+                          "CASE WHEN c.relkind = 'P' THEN pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef "
                          "FROM pg_class c "
                          "LEFT JOIN pg_depend d ON "
                          "(c.relkind = '%c' AND "
 

I think if you test it you'll find this breaks dumps from older server
versions.  You've got to add a new version of the SQL query for 10+,
and then update the 9.6, 9.5, 9.4, 9.3, 9.1-9.2, 9.0, 8.4, 8.2-8.3,
8.0-8.1, 7.3-7.4, 7.2, 7.1, and pre-7.1 queries to return a dummy NULL
value for that column. Alternatively, you can add the new version for
10.0, leave the older queries alone, and then adjust the code further
down to cope with i_partkeydef == -1 (see i_checkoption for an
example).

                      gettext_noop("table"),
+                      gettext_noop("table"),

Add a comment like /* partitioned table */ on the same line.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> [ latest patch set ]

Reviewing 0003:

+      This form attaches an existing table (partitioned or otherwise) as

(which might itself be partitioned)

+      partition of the target table.  Partition bound specification must
+      correspond with the partition method and the key of the target table.

The partition bound specification must correspond to the partitioning
method and partitioning key of the target table.

+      The table being attached must have all the columns of the target table
+      with matching types and no more. Also, it must have all the matching

The table to be attached must have all of the same columns as the
target table and no more; moreover, the column types must also match.

+      with matching types and no more. Also, it must have all the matching
+      constraints as the target table.  That includes both <literal>NOT NULL</>
+      and <literal>CHECK</> constraints.  If some <literal>CHECK</literal>
+      constraint of the table being attached is marked <literal>NO
INHERIT</literal>,
+      the command will fail; such constraints must either be dropped or
+      recreated without the <literal>NO INHERIT</literal> mark.

Why all of these requirements?  We could instead perform a scan to
validate that the constraints are met.  I think the way this should
work is:

1. ATTACH PARTITION works whether matching NOT NULL and CHECK
constraints are present or not.

2. If all of the constraints are present, and a validated constraint
matching the implicit partitioning constraint is also present, then
ATTACH PARTITION does not scan the table to validate constraints;
otherwise, it does.

3. NO VALIDATE is not an option.

+      Currently <literal>UNIQUE</literal>, <literal>PRIMARY KEY</literal>, and
+      <literal>FOREIGN KEY</literal> constraints are not considered, but that
+      might change in the future.

Really?  Changing that sounds impractical to me.

+      This form detaches specified partition of the target table.  The
+      detached partition continues to exist as a standalone table with no ties
+      remaining with the target table.

continues to exist as a standalone table, but no longer has any ties
to the table from which it was detached.

+      Note that if a partition being detached is itself a partitioned table,
+      it continues to exist as such.

You don't really need to say this, I think.  All of the properties of
the detached table are retained, not only its partitioning status.
You wouldn't like it if I told you to document "note that if a
partition being detached is unlogged, it will still be unlogged".
   To add the table as a new child of a parent table, you must own the
-   parent table as well.
+   parent table as well.  That applies to both adding the table as a
+   inheritance child of a parent table and attaching a table as partition to
+   the table.

To add the table as a new child of a parent table, or as a new
partition of a partitioned table, you must own the parent table as
well.

+        The name of the table to attach as a new partition to or
detach from this table.

s/to or/or to/

+    <literal>NO VALIDATE</> option is spcified.

Typo, but see comments above about nuking this altogether.

    A recursive <literal>DROP COLUMN</literal> operation will remove a
    descendant table's column only if the descendant does not inherit
    that column from any other parents and never had an independent
-    definition of the column.  A nonrecursive <literal>DROP
+    definition of the column (which always holds if the descendant table
+    is a partition).  A nonrecursive <literal>DROP
     COLUMN</literal> (i.e., <command>ALTER TABLE ONLY ... DROP
     COLUMN</command>) never removes any descendant columns, but
-    instead marks them as independently defined rather than inherited.
+    instead marks them as independently defined rather than inherited,
+    unless the descendant table is a partition.

This is a hairy explanation.  I suggest that the documentation say
this (and that the code implement it): A nonrecursive DROP COLUMN
command will fail for a partitioned table, because all partitions of a
table must have the same columns as the partitioning root.

-    that are not marked <literal>NO INHERIT</>.
+    that are not marked <literal>NO INHERIT</> which are unsupported if
+    the table is a partitioned table.

I think you can omit this hunk.

+   If <literal>PARTITION OF</literal> clause is specified then the table is
+   created as a partition of <literal>parent_table</literal> with specified
+   bounds.  However, unlike regular tables, one cannot specify
+   <literal>PARTITION BY</literal> clause which means foreign tables can
+   only be created as leaf partitions.

I'd delete the sentence beginning with "However".

+   Create foreign table <structname>measurement_y2016m07</>, which will be
+   accessed through the server <structname>server_07</>, that is partition
+   of the range partitioned table <structname>measurement</>:

s/, that is/ as a/

+<phrase>and <replaceable
class="PARAMETER">partition_bound_spec</replaceable> is:</phrase>
+
+FOR VALUES { <replaceable class="PARAMETER">list_spec</replaceable> |
<replaceable class="PARAMETER">range_spec</replaceable> }

I think you can inline the definitions of list_spec and range_spec
here instead of making them separate productions, and I think that
would be preferable.

FOR VALUES { IN ( <replaceable
class="PARAMETER">expression</replaceable> [, ...] ) |
START <replaceable class="PARAMETER">lower-bound</replaceable> [
INCLUSIVE | EXCLUSIVE ] END <replaceable
class="PARAMETER">upper-bound</replaceable> [ INCLUSIVE | EXCLUSIVE ]
}

+      parent table (name optionally schema-qualified).

Parenthetical phrase isn't needed.

+      A partition bound specification must be present and must correspond with
+      partition method and key of the parent table.  It is checked using the
+      specification that the new partition does not overlap with any existing
+      partitions of the parent.

The partition bound specification must correspond to the partitioning
method and partitioning key of the parent table, and must not overlap
with any existing partition of that parent.

+      clause, if any.  Defaults and constraints can optionally be specified
+      for each of the inherited columns, which override those in the parent.

Surely not.  You can't "override" an inherited constraint or an
inherited default.  The child may have extra constraints not present
in the parent, and may have different defaults when it is directly
targeted by an insertion, but it can't possibly override the parent
defaults.

+      One can also specify table constraints, in addition to those inherited
+      from the parent.  Note that all subsequent schema modifications to the
+      parent propagate to partition.

The first part of this seems right, but then what's going on with the
reference to constraints in the previous sentence?  (See previous
review comment.) The second sentence I would delete (but see below).

+     <para>
+      Any data row subsequently inserted into the parent table is mapped to
+      and stored in the partition, provided partition key of the row falls
+      within the partition bounds.
+     </para>

How about: Rows inserted into a partitioned table will be
automatically routed to the correct partition.  If no suitable
partition exists, an error will occur.

+     <para>
+      A partition is dropped or truncated when the parent table is dropped or
+      truncated.  Dropping it directly using <literal>DROP TABLE</literal>
+      will fail; it must first be <firstterm>detached</> from the parent.
+      However, truncating a partition directly works.
+     </para>

How about: A partition must have the same column names and types as
the table of which it is a partition.  Therefore, modifications to the
column names or types of the partitioned table will automatically
propagate to all children, as will operations such as TRUNCATE which
normally affect a table and all of its inheritance children.  It is
also possible to TRUNCATE a partition individually, just as for an
inheritance child.

Insisting that you can't drop a child without detaching it first seems
wrong to me.  If I already made this comment and you responded to it,
please point me back to whatever you said.  However, my feeling is
this is flat wrong and absolutely must be changed.

-            if (is_no_inherit)
+
+            /* Discard the NO INHERIT flag if the relation is a partition */
+            if (is_no_inherit && !rel->rd_rel->relispartition)

Something about this seems fishy.  In StoreRelCheck(), you disallow
the addition of a NO INHERIT constraint on a partition, but here you
seem happy to accept it and ignore the NO INHERIT property.  That
doesn't seem consistent.  We should adopt a consistent policy about
what to do about such constraints, and I submit that throwing an error
is better than silently changing things, unless you have some reason
for doing that which I'm not seeing.  Anyway, we should have the same
behavior in both cases.

We should also standardize on what value of conislocal and coninhcount
children end up with; presumably the latter should be 1, but I'm not
sure if the former should be true or false.  In either case, anything
that can vary between children probably needs to be dumped, so let's
enforce that it doesn't so we don't have to dump it.  I'm not sure
whether the code here achieves those objectives, though, and note the
comment in the function header about making sure the logic here
matches MergeConstraintsIntoExisting.

I think the overriding principle here should be: If you attach a table
as a partition, it must not be part of a standard inheritance
hierarchy, and it must not be a partition of any other table.  It can,
however, be partitioned itself.  If you later detach a partition, it
ends up as a standalone table with a copy of each constraint it had as
a partition - probably including the implicit partition constraint.
The DBA can drop those constraints if they're not wanted.

I wonder if it's really a good idea for the partition constraints to
be implicit; what is the benefit of leaving those uncatalogued?

+ * Depending on whether the relation in question is list or range
+ * partitioned, one of the fields is set.
+ */
+typedef struct BoundCollectionData
+{
+    struct ListInfo       *listinfo;
+    struct RangeInfo   *rangeinfo;
+} BoundCollectionData;

This seems like an odd design.  First, when you have a pointer to
either of two things, the normal tool for that in C would be a union,
not a struct. Second, in PostgreSQL we typically handle that by making
both of the things nodes and then you can use IsA() or switch on
nodeTag() to figure out what you've got.  Third, the only place this
is used at least in 0003 is as part of PartitionDescData, which only
has 3 members, so if you were going to do it with 2 members, you could
just include these two members directly.  Considering all of the
foregoing, I'd suggest making this a union and including partstrategy
in PartitionDescData.

I think that the names ListInfo and RangeInfo are far too generic for
something that's specific to partitioning.

+/*
+ * Range bound collection - sorted array of ranges of partitions of a range
+ * partitioned table
+ */
+typedef struct RangeInfo
+{
+    struct PartitionRange    **ranges;
+} RangeInfo;
+
+/* One partition's range */
+typedef struct PartitionRange
+{
+    struct PartitionRangeBound    *lower;
+    struct PartitionRangeBound    *upper;
+} PartitionRange;

This representation doesn't seem optimal, because in most cases the
lower bound of one partition will be the upper bound of the next.  I
suggest that you flatten this representation into a single list of
relevant bounds, each flagged as to whether it is exclusive and
whether it is finite, plus a second list, with one more element than
the first, of partition mappings.  For
example, suppose the partition bounds are [10, 20), [20, 30), (30,
40), and [50, 60).  You first construct a list of all of the distinct
bounds, flipping inclusive/exclusive for the lefthand bound in each
case.  So you get:

10 EXCLUSIVE
20 EXCLUSIVE
30 EXCLUSIVE
30 INCLUSIVE
40 EXCLUSIVE
50 EXCLUSIVE
60 EXCLUSIVE

When ordering items for this list, if the same item appears twice, the
EXCLUSIVE copy always appears before INCLUSIVE.  When comparing
against an EXCLUSIVE item, we move to the first half of the array if
we are searching for a value strictly less than that value; when
comparing against an INCLUSIVE item, we move to the first half of the
array if we are searching for a value less than or equal to that
value.

This is a list of seven items, so a binary search will return a
position between 0 (less than all items) and 7 (greater than all
items).  So we need a list of 8 partition mappings, which in this case
will look like this: -1, 0, 1, -1, 2, -1, 3, -1.

In this particular example, there are only two adjacent partitions, so
we end up with 7 bounds with this representation vs. 8 with yours, but
in general I think the gains will be bigger.  If you've got 1000
partitions and they're all adjacent, which is likely, you store 1000
bounds instead of 2000 bounds by doing it this way.

+ * Note: This function should be called only when it is known that 'relid'
+ * is a partition.

Why?  How about "Because this function assumes that the relation whose
OID is passed as an argument will have precisely one parent, it should
only be called when it is known that the relation is a partition."

+    /*
+     * Translate vars in the generated expression to have correct attnos.
+     * Note that the vars in my_qual bear attnos dictated by key which carries
+     * physical attnos of the parent.  We must allow for a case where physical
+     * attnos of a partition can be different from the parent.
+     */
+    partexprs_item = list_head(key->partexprs);
+    for (i = 0; i < key->partnatts; i++)
+    {
+        AttrNumber    attno = key->partattrs[i],
+                    new_attno;
+        char       *attname;
+
+        if (attno != 0)
+        {
+            /* Simple column reference */
+            attname = get_attname(RelationGetRelid(parent), attno);
+            new_attno = get_attnum(RelationGetRelid(rel), attname);
+
+            if (new_attno != attno)
+                my_qual = (List *) translate_var_attno((Node *) my_qual,
+                                                       attno,
+                                                       new_attno);

It can't really be safe to do this one attribute number at a time, or
even if by some good fortune it can't be broken, it at least it seems
extremely fragile.  Suppose that you translate 0 -> 3 and then 3 -> 5;
now the result is garbage.  It's not very efficient to do this one
attno at a time, either.

+    if (classform->relispartition)
+        ereport(ERROR,
+                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                 errmsg("\"%s\" is a partition of \"%s\"", rel->relname,
+                        get_rel_name(get_partition_parent(relOid))),
+                 errhint("Use ALTER TABLE DETACH PARTITION to be able
to drop it.")));
+

RangeVarCallbackForDropRelation should do only minimal sanity checks;
defer this until after we have a relation lock.

I didn't get all the way through this patch, so this is a pretty
incomplete review, but it's late here and I'm out of steam for
tonight.  Some general comments:

1. I think that this patch seems to introduce an awful lot of new
structures with confusingly similar names and purposes:
PartitionBoundList, PartitionBoundRange, ListInfo, RangeInfo,
PartitionRange, PartitionList, ListValue, RangePartition.  You've got
4 different structures here that all have "Partition" and "Range" in
the name someplace, including both PartitionRange and RangePartition.
Maybe there's no way around that kind of duplication; after all there
are quite a few moving parts here.  But it seems like it would be good
to try to simplify it.

2. I'm also a bit concerned about the fairly large amount of
apparently-boilerplate code in partition.c, all having to do with how
we create all of these data structures and translate between different
forms of them.  I haven't understood that stuff thoroughly enough to
have a specific idea about how we might be able to get rid of any of
it, and maybe there's no way.  But that too seems like a topic for
further investigation.  One idea is that maybe some of these things
should be nodes that piggyback on the existing infrastructure in
src/backend/nodes instead of inventing a new way to do something
similar.

3. There are a lot of places where you have separate code for the
range and list partitioning cases, and I'm suspicious that there are
ways that code could be unified.  For example, with list partitioning,
you have a bunch of Datums which represent the specific values that
can appear in the various partitions, and with range partitioning, you
have a bunch of Datums that represent the edges of partitions.  Well,
if you used the same array for both purposes, you could use the same
code to copy it.  That would involve flattening away a lot of the
subsidiary structure under PartitionDescData and pulling up stuff that
is buried lower down into the main structure, but I think that's
likely a good idea anyway - see also point #1.

4. I'm somewhat wondering if we ought to just legislate that the lower
bound is always inclusive and the upper bound is always exclusive.
The decision to support both inclusive and exclusive partition bounds
is responsible for an enormous truckload of complexity in this patch,
and I have a feeling it is also going to be a not-infrequent cause of
user error.

5. I wonder how well this handles multi-column partition keys.  You've
just got one Datum and one is-finite flag per partition, but I
wonder if you don't need to track is-finite on a per-column basis, so
that you could partition on (a, b) and have the first partition go up
to (10, 10), the second to (10, infinity), the third to (20, 10), the
fourth to (20, infinity), and the last to (infinity, infinity).  FWIW,
Oracle supports this sort of thing, so perhaps we should, too.  On a
related note, I'm not sure it's going to work to treat a composite
partition key as a record type.  The user may want to specify a
separate opfamily and collation for each column, not just inherit
whatever the record behavior is.  I'm not sure if that's what you are
doing, but the relcache structures don't seem adapted to storing one
Datum per partitioning column per partition, but rather just one Datum
per partition, and I'm not sure that's going to work very well.

Thanks for working on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Petr Jelinek
Date:
On 05/10/16 03:50, Robert Haas wrote:
> On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> [ latest patch set ]
> 
> Reviewing 0003:
> 
> +      with matching types and no more. Also, it must have all the matching
> +      constraints as the target table.  That includes both <literal>NOT NULL</>
> +      and <literal>CHECK</> constraints.  If some <literal>CHECK</literal>
> +      constraint of the table being attached is marked <literal>NO
> INHERIT</literal>,
> +      the command will fail; such constraints must either be dropped or
> +      recreated without the <literal>NO INHERIT</literal> mark.
> 
> Why all of these requirements?  We could instead perform a scan to
> validate that the constraints are met.  I think the way this should
> work is:

I think it's a survivable limitation in the initial version; nothing
seems to prevent improving this in a follow-up patch.

> 
> +      Currently <literal>UNIQUE</literal>, <literal>PRIMARY KEY</literal>, and
> +      <literal>FOREIGN KEY</literal> constraints are not considered, but that
> +      might change in the future.
> 
> Really?  Changing that sounds impractical to me.
> 

Which part of that statement?

> 
> +      Note that if a partition being detached is itself a partitioned table,
> +      it continues to exist as such.
> 
> You don't really need to say this, I think.  All of the properties of
> the detached table are retained, not only its partitioning status.
> You wouldn't like it if I told you to document "note that if a
> partition being detached is unlogged, it will still be unlogged".
> 

I think this is a bit different though: it means you are detaching a
whole branch from the rest of the partitioning tree, not just that one
partition.  To me that's worth mentioning to avoid potential
confusion.

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:
On Tue, Oct 4, 2016 at 1:32 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

> Attached updated patches.
>
> Thanks,
> Amit

Hi,

I observed, when creating foreign table with range partition, data is not
inserting into specified partition range. below are steps to reproduce.

CREATE EXTENSION postgres_fdw;
CREATE SERVER pwj_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'postgres', port '5432', use_remote_estimate 'true');
CREATE USER MAPPING FOR PUBLIC SERVER pwj_server;

CREATE TABLE test_range (a int) PARTITION BY RANGE(a);

CREATE TABLE test_range_p1 (a int);
CREATE FOREIGN TABLE ft_test_range_p1 PARTITION OF test_range FOR VALUES START (1) END (10) SERVER pwj_server OPTIONS (TABLE_NAME 'test_range_p1');

CREATE TABLE test_range_p2 (a int);
CREATE FOREIGN TABLE ft_test_range_p2 PARTITION OF test_range FOR VALUES START (20) END (30) SERVER pwj_server OPTIONS (TABLE_NAME 'test_range_p2');

CREATE TABLE test_range_p3 (a int);
CREATE FOREIGN TABLE ft_test_range_p3 PARTITION OF test_range FOR VALUES START (10) END (20) SERVER pwj_server OPTIONS (TABLE_NAME 'test_range_p3');

postgres=# INSERT INTO test_range (a) values (5),(25),(15);
INSERT 0 3

postgres=# select tableoid::regclass, * from test_range;
     tableoid     | a
------------------+----
 ft_test_range_p1 |  5
 ft_test_range_p2 | 15
 ft_test_range_p3 | 25
(3 rows)

--Here ft_test_range_p2 is created for range 20-30 having value 15.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
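[Editor's note: the misrouting reported above is the symptom you would see if tuples were matched to partitions in creation order rather than bound order. As an illustration only (not the patch's C code; names come from the reproduction above), correct range routing consults bounds sorted by lower bound:

```python
import bisect

# Partitions with their declared [lower, upper) ranges, in *creation* order,
# mirroring the reproduction (p2 for 20-30 was created before p3 for 10-20).
partitions = [
    ("ft_test_range_p1", 1, 10),
    ("ft_test_range_p2", 20, 30),
    ("ft_test_range_p3", 10, 20),
]

# Correct routing must consult the bounds sorted by lower bound,
# not the order in which the partitions happen to have been created.
by_lower = sorted(partitions, key=lambda p: p[1])
lowers = [p[1] for p in by_lower]

def route(value):
    """Return the name of the partition whose range contains value."""
    i = bisect.bisect_right(lowers, value) - 1
    if i < 0 or value >= by_lower[i][2]:
        raise ValueError("no partition for %r" % value)
    return by_lower[i][0]

for v in (5, 25, 15):
    print(v, "->", route(v))
```

With this, 15 lands in ft_test_range_p3 (range 10-20), matching the declared bounds rather than the output shown in the report.]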

Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi,

On 2016/10/05 16:57, Rajkumar Raghuwanshi wrote:
> I observed, when creating foreign table with range partition, data is not
> inserting into specified partition range. below are steps to reproduce.
> 
> [ ... ]
> 
> postgres=# INSERT INTO test_range (a) values (5),(25),(15);
> INSERT 0 3
> 
> postgres=# select tableoid::regclass, * from test_range;
>      tableoid     | a 
> ------------------+----
>  ft_test_range_p1 |  5
>  ft_test_range_p2 | 15
>  ft_test_range_p3 | 25
> (3 rows)
> 
> --Here ft_test_range_p2 is created for range 20-30 having value 15.

Thanks a lot for testing.

That's a bug.  I found that it is caused by the FDW plans getting paired
with the wrong result relations during ExecInitModifyTable() initialization.

I will include a fix for the same in the patch set that I will be sending
soon in reply to Robert's review comments on patch 0002 [1].
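[Editor's note: a hypothetical sketch of the pairing mistake described above (names are made up; this is not the executor code). ModifyTable keeps parallel per-partition lists, and pairing them purely by list position goes wrong when the two lists are in different orders:

```python
# Suppose the per-partition plans were built in partition-bound order while
# the result relations were opened in creation (OID) order.
plans_in_bound_order = ["plan_p1", "plan_p3", "plan_p2"]
rels_in_creation_order = ["rel_p1", "rel_p2", "rel_p3"]

# Pairing by position mismatches them: plan_p3 gets rel_p2.
wrong_pairs = list(zip(plans_in_bound_order, rels_in_creation_order))

# Pairing by a shared key restores the correct association.
def suffix(name):
    return name.rsplit("_", 1)[1]

fixed_pairs = [(p, r) for p in plans_in_bound_order
               for r in rels_in_creation_order if suffix(p) == suffix(r)]
```
]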

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoY1aQ5iPz0S2GBJw4YUR1Z2Qg5iKUf8YJSo2Ctya4ZmNg%40mail.gmail.com





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/05 2:12, Robert Haas wrote:
> On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Even if we leave the empty relfilenode around for now -- in the long
>>> run I think it should die -- I think we should prohibit the creation
>>> of subsidiary object on the parent which is only sensible if it has
>>> rows - e.g. indexes.  It makes no sense to disallow non-inheritable
>>> constraints while allowing indexes, and it could box us into a corner
>>> later.
>>
>> I agree.  So we must prevent from the get-go the creation of following
>> objects on parent tables (aka RELKIND_PARTITIONED_TABLE relations):
>>
>> * Indexes
>> * Row triggers (?)
>
> Hmm, do we ever fire triggers on the parent for operations on a child
> table?  Note this thread, which seems possibly relevant:
>
> https://www.postgresql.org/message-id/flat/cd282adde5b70b20c57f53bb9ab75e27%40biglumber.com

The answer to your question is no.

The thread you quoted discusses statement-level triggers and the
conclusion is that they don't work as desired for UPDATE and DELETE on
inheritance tables.  As things stand, only UPDATE or DELETE on the parent
affects the child tables and it's proposed there that the statement-level
triggers on the parent and also on any child tables affected should be
fired in that case.

Currently, INSERT doesn't have that problem because it only ever affects
the parent table.  That changes with tuple routing though.  I am not sure
if the aforementioned proposed behavior should be applied in this case.
That is to say, if we have INSERT INTO parent, only the parent's
statement-level triggers should be fired.  The partition into which the
tuple is routed will have only its row-level triggers fired.
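[Editor's note: the proposed semantics in the preceding paragraph can be modeled as follows (a simplification with illustrative table names, not PostgreSQL code): the statement-level trigger fires once for the named target table, while row-level triggers fire on whichever partition each tuple is routed to:

```python
# Simplified model of trigger firing under tuple routing.
fired = []

def insert_with_routing(parent, rows, route):
    fired.append((parent, "statement"))      # only the named target table
    for row in rows:
        fired.append((route(row), "row"))    # the routed partition's row trigger

route = lambda a: "p1" if a < 10 else "p2"
insert_with_routing("parent", [5, 15], route)
```
]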

The proposed TODO item has been left untouched for many years now, so it
is perhaps not such a big pain point after all (or perhaps it amounts only
to some minor standards noncompliance).  I may be wrong though.

> We certainly won't fire the parent's per-row triggers ever, so those
> should be prohibited.  But I'm not sure about per-statement triggers.

Agreed about the row-level triggers.  I think per-statement triggers are
alright to allow on parent tables from the POV of the original topic we
were discussing - that is, per-statement triggers do not depend on the
physical addressing capabilities of the relation on which they are
defined, whereas row-level triggers do in some cases.

So I changed 0001 so that indexes and row triggers are disallowed on
partitioned tables.

>> In addition to preventing creation of these objects, we must also teach
>> commands that directly invoke heapam.c to skip such relations.
>
> I'm don't think that's required if we're not actually getting rid of
> the relfilenode.

Oops that line got sent after all, I had convinced myself to remove that
line from the email. :)  Agreed, no need to do that just yet.

>> Needless to say, a query-specified qualification clause must now include
>> the same collate clause as used for individual partition columns (or
>> expressions) for the constraint exclusion to work as intended.
>
> Hopefully this is only required if the default choice of collation
> wouldn't be correct anyway.

Sorry, I'm not sure I understand what you said - specifically, what would
"wrong" mean in this case?  A parser error?  A run-time error like in
varstr_cmp()?

What I meant to say is that the predicate OpExprs generated by the
partitioning code will use the corresponding partition column's
partcollation as their inputcollid (and the Var will be wrapped in a
RelabelType).  Here partcollation is a known valid alternative collation
that is applied when partitioning the input data, overriding any default
collation of the column's type or the collation specified for the column
when creating the table.  During assign_query_collations(), a qual clause
OpExpr's variable will get assigned a collation that is either the
column's/expression's attcollation/exprCollation or a collation explicitly
specified using a COLLATE clause.  If the variable's collation assigned
thus doesn't match the partitioning collation, then the predicate OpExpr's
and the query clause OpExpr's variable expressions will fail to match
under equal(), so constraint exclusion fails (specifically,
operator_predicate_proof() bails out).  Of course, going ahead with
constraint exclusion while ignoring the said collation mismatch would be
wrong anyway.
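[Editor's note: the equal()-based matching described above can be sketched as follows. equal() is structural equality over every field of the expression nodes, so a differing inputcollid alone is enough for operator_predicate_proof() to bail out. Field names echo the C structs, but this is a simplified Python model, not the planner code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OpExpr:
    opno: int          # comparison operator
    inputcollid: int   # collation the comparison actually uses
    var: str           # partition column reference

predicate = OpExpr(opno=66, inputcollid=100, var="a")      # uses partcollation
qual_default = OpExpr(opno=66, inputcollid=950, var="a")   # column's default collation
qual_collated = OpExpr(opno=66, inputcollid=100, var="a")  # explicit matching COLLATE

# Frozen dataclasses compare field-by-field, like equal():
collation_mismatch = predicate != qual_default   # exclusion bails out
collation_match = predicate == qual_collated     # exclusion can proceed
```
]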

> Reviewing 0002:
>
> +    prettyFlags = PRETTYFLAG_INDENT;
> +    PG_RETURN_TEXT_P(string_to_text(pg_get_partkeydef_worker(relid,
> +                                    prettyFlags)));
>
> Why bother with the variable?

Leftover, removed.

> +    /* Must get partclass, and partexprs the hard way */
> +    datum = SysCacheGetAttr(PARTEDRELID, tuple,
> +                            Anum_pg_partitioned_table_partclass, &isnull);
> +    Assert(!isnull);
> +    partclass = (oidvector *) DatumGetPointer(datum);
>
> Comment mentions getting two things, but code only gets one.

Fixed. Further code uses the "hard way" to get partexprs but it's got a
separate comment anyway.

>
> +        partexprs = (List *) stringToNode(exprsString);
>
> if (!IsA(partexprs, List)) elog(ERROR, ...); so as to guard against
> corrupt catalog contents.

Done as follows:

    partexprs = (List *) stringToNode(exprsString);

    if (!IsA(partexprs, List))
        elog(ERROR, "unexpected node type found in partexprs: %d",
                    (int) nodeTag(partexprs));

> +    switch (form->partstrat)
> +    {
> +        case 'l':
> +            appendStringInfo(&buf, "LIST");
> +            break;
> +        case 'r':
> +            appendStringInfo(&buf, "RANGE");
> +            break;
> +    }
>
> default: elog(ERROR, "unexpected partition strategy: %d", (int)
> form->partstrat);
>
> Also, why not case PARTITION_STRATEGY_LIST: instead of case 'l': and
> similarly for 'r'?

Done and done.

> @@ -5296,7 +5301,8 @@ getTables(Archive *fout, int *numTables)
>                            "OR %s IS NOT NULL "
>                            "OR %s IS NOT NULL"
>                            "))"
> -                          "AS changed_acl "
> +                          "AS changed_acl, "
> +                          "CASE WHEN c.relkind = 'P' THEN
> pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef "
>                            "FROM pg_class c "
>                            "LEFT JOIN pg_depend d ON "
>                            "(c.relkind = '%c' AND "
>
> I think if you test it you'll find this breaks dumps from older server
> versions.  You've got to add a new version of the SQL query for 10+,
> and then update the 9.6, 9.5, 9.4, 9.3, 9.1-9.2, 9.0, 8.4, 8.2-8.3,
> 8.0-8.1, 7.3-7.4, 7.2, 7.1, and pre-7.1 queries to return a dummy NULL
> value for that column. Alternatively, you can add the new version for
> 10.0, leave the older queries alone, and then adjust the code further
> down to cope with i_partkeydef == -1 (see i_checkoption for an
> example).

You're quite right.  Fixed using this last method.

>                        gettext_noop("table"),
> +                      gettext_noop("table"),
>
> Add a comment like /* partitioned table */ on the same line.

Done.

Attached revised patches.  Also, includes a fix for an issue reported by
Rajkumar Raghuwanshi [1] which turned out to be a bug in one of the later
patches.  I will now move on to addressing the comments on patch 0003.

Thanks a lot for the review!

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/5dded2f1-c7f6-e7fc-56b5-23ab59495e4b@lab.ntt.co.jp

Attachment

Re: Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:
On Thu, Oct 6, 2016 at 12:44 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Attached revised patches.  Also, includes a fix for an issue reported by
Rajkumar Raghuwanshi [1] which turned out to be a bug in one of the later
patches.  I will now move on to addressing the comments on patch 0003.

Thanks a lot for the review!

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/5dded2f1-c7f6-e7fc-56b5-23ab59495e4b@lab.ntt.co.jp

Hi,

I have applied the latest patches and am getting some errors and a crash; please check if you are also able to reproduce the same.

Observation1 : Not able to create index on partition table.
--------------
CREATE TABLE rp (c1 int, c2 int) PARTITION BY RANGE(c1);
CREATE TABLE rp_p1 PARTITION OF rp FOR VALUES START (1) END (10);
CREATE TABLE rp_p2 PARTITION OF rp FOR VALUES START (10) END (20);

CREATE INDEX idx_rp_c1 on rp(c1);
ERROR:  cannot create index on partitioned table "rp"

Observation2 : Getting cache lookup failed error for multiple column range partition
--------------
CREATE TABLE rp1_m (c1 int, c2 int) PARTITION BY RANGE(c1, ((c1 + c2)/2));

CREATE TABLE rp1_m_p1 PARTITION OF rp1_m FOR VALUES START (1, 1) END (10, 10);
ERROR:  cache lookup failed for attribute 0 of relation 16429

Observation3 : Getting server crash with multiple column range partition
--------------
CREATE TABLE rp2_m (c1 int, c2 int) PARTITION BY RANGE(((c2 + c1)/2), c2);
CREATE TABLE rp2_m_p1 PARTITION OF rp2_m FOR VALUES START (1, 1) END (10, 10);
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/07 17:33, Rajkumar Raghuwanshi wrote:
> On Thu, Oct 6, 2016 at 12:44 PM, Amit Langote wrote:
> I have applied latest patches, getting some error and crash, please check
> if you are also able to reproduce the same.
>
> Observation1 : Not able to create index on partition table.
> --------------
> CREATE TABLE rp (c1 int, c2 int) PARTITION BY RANGE(c1);
> CREATE TABLE rp_p1 PARTITION OF rp FOR VALUES START (1) END (10);
> CREATE TABLE rp_p2 PARTITION OF rp FOR VALUES START (10) END (20);
>
> CREATE INDEX idx_rp_c1 on rp(c1);
> ERROR:  cannot create index on partitioned table "rp"

This one is a new behavior as I mentioned earlier in my previous email.

> Observation2 : Getting cache lookup failed error for multiple column range
> partition
> --------------
> CREATE TABLE rp1_m (c1 int, c2 int) PARTITION BY RANGE(c1, ((c1 + c2)/2));
>
> CREATE TABLE rp1_m_p1 PARTITION OF rp1_m FOR VALUES START (1, 1) END (10,
> 10);
> ERROR:  cache lookup failed for attribute 0 of relation 16429
>
> Observation3 : Getting server crash with multiple column range partition
> --------------
> CREATE TABLE rp2_m (c1 int, c2 int) PARTITION BY RANGE(((c2 + c1)/2), c2);
> CREATE TABLE rp2_m_p1 PARTITION OF rp2_m FOR VALUES START (1, 1) END (10,
> 10);
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>

Fixed.  Should have been caught when running the regression tests after I
made changes to 0001 to address some review comments (apparently there was
no test in 0003 that would've caught this, so added a new one, thanks!).


Attached updated patches.

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
It's allowed to specify a non-default opclass in the PARTITION BY clause,
but I do not see any testcase testing the same.  Can you please add
one.

On Fri, Oct 7, 2016 at 2:50 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/10/07 17:33, Rajkumar Raghuwanshi wrote:
>> On Thu, Oct 6, 2016 at 12:44 PM, Amit Langote wrote:
>> I have applied latest patches, getting some error and crash, please check
>> if you are also able to reproduce the same.
>>
>> Observation1 : Not able to create index on partition table.
>> --------------
>> CREATE TABLE rp (c1 int, c2 int) PARTITION BY RANGE(c1);
>> CREATE TABLE rp_p1 PARTITION OF rp FOR VALUES START (1) END (10);
>> CREATE TABLE rp_p2 PARTITION OF rp FOR VALUES START (10) END (20);
>>
>> CREATE INDEX idx_rp_c1 on rp(c1);
>> ERROR:  cannot create index on partitioned table "rp"
>
> This one is a new behavior as I mentioned earlier in my previous email.
>
>> Observation2 : Getting cache lookup failed error for multiple column range
>> partition
>> --------------
>> CREATE TABLE rp1_m (c1 int, c2 int) PARTITION BY RANGE(c1, ((c1 + c2)/2));
>>
>> CREATE TABLE rp1_m_p1 PARTITION OF rp1_m FOR VALUES START (1, 1) END (10,
>> 10);
>> ERROR:  cache lookup failed for attribute 0 of relation 16429
>>
>> Observation3 : Getting server crash with multiple column range partition
>> --------------
>> CREATE TABLE rp2_m (c1 int, c2 int) PARTITION BY RANGE(((c2 + c1)/2), c2);
>> CREATE TABLE rp2_m_p1 PARTITION OF rp2_m FOR VALUES START (1, 1) END (10,
>> 10);
>> server closed the connection unexpectedly
>>     This probably means the server terminated abnormally
>>     before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>> !>
>
> Fixed.  Should have been caught when running the regression tests after I
> made changes to 0001 to address some review comments (apparently there was
> no test in 0003 that would've caught this, so added a new one, thanks!).
>
>
> Attached updated patches.
>
> Thanks,
> Amit



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/07 18:27, Ashutosh Bapat wrote:
> It's allowed to specify an non-default opclass in partition by clause,
> but I do not see any testcase testing the same. Can you please add
> one.

OK, I will add some tests related to that.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Kapila
Date:
On Thu, Oct 6, 2016 at 12:44 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/10/05 2:12, Robert Haas wrote:
>> On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> Even if we leave the empty relfilenode around for now -- in the long
>>>> run I think it should die -- I think we should prohibit the creation
>>>> of subsidiary object on the parent which is only sensible if it has
>>>> rows - e.g. indexes.  It makes no sense to disallow non-inheritable
>>>> constraints while allowing indexes, and it could box us into a corner
>>>> later.
>>>
>>> I agree.  So we must prevent from the get-go the creation of following
>>> objects on parent tables (aka RELKIND_PARTITIONED_TABLE relations):
>>>
>>> * Indexes
>>> * Row triggers (?)
>>
>> Hmm, do we ever fire triggers on the parent for operations on a child
>> table?  Note this thread, which seems possibly relevant:
>>
>> https://www.postgresql.org/message-id/flat/cd282adde5b70b20c57f53bb9ab75e27%40biglumber.com
>
> The answer to your question is no.
>
> The thread you quoted discusses statement-level triggers and the
> conclusion is that they don't work as desired for UPDATE and DELETE on
> inheritance tables.  As things stand, only UPDATE or DELETE on the parent
> affects the child tables and it's proposed there that the statement-level
> triggers on the parent and also on any child tables affected should be
> fired in that case.
>

Doesn't that imply that the statement level triggers should be fired
for all the tables that get changed by the statement?  If so, then in
your case it should never fire for the parent table, which means we could
disallow statement level triggers as well on parent tables?

Some of the other things that we might want to consider disallowing on
parent table could be:
a. Policy on table_name
b. Alter table has many clauses, are all of those allowed and will it
make sense to allow them?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Declarative partitioning - another take

From
Amit Kapila
Date:
On Thu, Oct 6, 2016 at 12:44 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/10/05 2:12, Robert Haas wrote:

> Attached revised patches.

Few assorted review comments for 0001-Catalog*:


1.
@@ -1775,6 +1775,12 @@ BeginCopyTo(ParseState *pstate,
{
..
+ else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot copy from partitioned table \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Try the COPY (SELECT ...) TO variant.")));
..
}

Why is this restriction?  Won't it be useful to allow it for the cases
when user wants to copy the data of all the partitions?


2.
+ if (!pg_strcasecmp(stmt->partspec->strategy, "list") &&
+ partnatts > 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("cannot list partition using more than one column")));

/cannot list/cannot use list

3.
@@ -77,7 +77,7 @@ typedef enum DependencyType
  DEPENDENCY_INTERNAL = 'i',
  DEPENDENCY_EXTENSION = 'e',
  DEPENDENCY_AUTO_EXTENSION = 'x',
- DEPENDENCY_PIN = 'p'
+ DEPENDENCY_PIN = 'p',
 } DependencyType;

Why is this change required?

4.
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_partitioned_table.h
+ *  definition of the system "partitioned table" relation
+ *  along with the relation's initial contents.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group

Copyright year should be 2016.

5.
+/*
+ * PartitionSpec - partition key definition including the strategy
+ *
+ * 'strategy' partition strategy name ('list', 'range', etc.)

etc. in above comment seems to be unnecessary.

6.
+ {PartitionedRelationId, /* PARTEDRELID */

Here PARTEDRELID sounds inconvenient, how about PARTRELID?



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Declarative partitioning - another take

From
Amit Langote
Date:
Thanks for the review!

On 2016/10/25 20:32, Amit Kapila wrote:
> On Thu, Oct 6, 2016 at 12:44 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2016/10/05 2:12, Robert Haas wrote:
> 
>> Attached revised patches.
> 
> Few assorted review comments for 0001-Catalog*:
> 
> 
> 1.
> @@ -1775,6 +1775,12 @@ BeginCopyTo(ParseState *pstate,
> {
> ..
> + else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> + ereport(ERROR,
> + (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> + errmsg("cannot copy from partitioned table \"%s\"",
> + RelationGetRelationName(rel)),
> + errhint("Try the COPY (SELECT ...) TO variant.")));
> ..
> }
> 
> Why is this restriction?  Won't it be useful to allow it for the cases
> when user wants to copy the data of all the partitions?

Sure, CopyTo() can be taught to scan leaf partitions when a partitioned
table is specified, but I thought this may be fine initially.

> 2.
> + if (!pg_strcasecmp(stmt->partspec->strategy, "list") &&
> + partnatts > 1)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> + errmsg("cannot list partition using more than one column")));
> 
> /cannot list/cannot use list

Actually "list partition" works here as a verb, as in "to list partition".

> 3.
> @@ -77,7 +77,7 @@ typedef enum DependencyType
>   DEPENDENCY_INTERNAL = 'i',
>   DEPENDENCY_EXTENSION = 'e',
>   DEPENDENCY_AUTO_EXTENSION = 'x',
> - DEPENDENCY_PIN = 'p'
> + DEPENDENCY_PIN = 'p',
>  } DependencyType;
> Why is this change required?

Looks like a leftover hunk from previous revision.  Will fix.

> 4.
> @@ -0,0 +1,69 @@
> +/*-------------------------------------------------------------------------
> + *
> + * pg_partitioned_table.h
> + *  definition of the system "partitioned table" relation
> + *  along with the relation's initial contents.
> + *
> + *
> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
> 
> Copyright year should be 2016.

Oops, will fix.

> 5.
> +/*
> + * PartitionSpec - partition key definition including the strategy
> + *
> + * 'strategy' partition strategy name ('list', 'range', etc.)
> 
> etc. in above comment seems to be unnecessary.

Will fix, although I thought that the list is as yet incomplete.

> 6.
> + {PartitionedRelationId, /* PARTEDRELID */
> 
> Here PARTEDRELID sounds inconvenient, how about PARTRELID?

Agreed.  There used to be another catalog which had used up PARTRELID, but
that's no longer an issue.

I will include these changes in the next version of patches I will post
soon in reply to [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoYJcUTcN7vVgg54GHtffH11JJWYZnfF4KiRxjV-iaACQg%40mail.gmail.com





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/25 15:58, Amit Kapila wrote:
> On Thu, Oct 6, 2016 at 12:44 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2016/10/05 2:12, Robert Haas wrote:
>>> Hmm, do we ever fire triggers on the parent for operations on a child
>>> table?  Note this thread, which seems possibly relevant:
>>>
>>> https://www.postgresql.org/message-id/flat/cd282adde5b70b20c57f53bb9ab75e27%40biglumber.com
>>
>> The answer to your question is no.
>>
>> The thread you quoted discusses statement-level triggers and the
>> conclusion is that they don't work as desired for UPDATE and DELETE on
>> inheritance tables.  As things stand, only UPDATE or DELETE on the parent
>> affects the child tables and it's proposed there that the statement-level
>> triggers on the parent and also on any child tables affected should be
>> fired in that case.
>>
> 
> Doesn't that imply that the statement level triggers should be fired
> for all the tables that get changed for statement?  If so, then in
> your case it should never fire for parent table, which means we could
> disallow statement level triggers as well on parent tables?

I may have misunderstood statement-level triggers, but don't they apply to
the tables *specified* as the target table in the statement, rather than
those *changed* by the resulting actions?

Now in case of inheritance, unless ONLY is specified, all tables in the
hierarchy including the parent are *implicitly* specified to be affected
by an UPDATE or DELETE operation.  So, if some or all of those tables have
any statement-level triggers defined, they should get fired.  That was the
conclusion of that thread, but that TODO item still remains [1].

I am not (or no longer) sure how that argument affects INSERT on
partitioned tables with tuple-routing though.  Are partitions at all
levels *implicitly specified to be affected* when we say INSERT INTO
root_partitioned_table?

> Some of the other things that we might want to consider disallowing on
> parent table could be:
> a. Policy on table_name

Perhaps.  Since the parent tables of a partition hierarchy contain no rows
themselves, it might not make sense to continue allowing row-level
security policies to be created on them.

> b. Alter table has many clauses, are all of those allowed and will it
> make sense to allow them?

Currently, we only disallow the following with partitioned parent tables
as far as alter table is concerned.

- cannot change inheritance by ALTER TABLE partitioned_table INHERIT ...

- cannot let them be regular inheritance parents either - that is, the
following is disallowed: ALTER TABLE some_table INHERIT partitioned_table


- cannot create UNIQUE, PRIMARY KEY, FOREIGN KEY, EXCLUDE constraints

- cannot drop column involved in the partitioning key

Most other forms that affect attributes and constraints follow the regular
inheritance behavior (recursion) with certain exceptions such as:

- cannot add/drop an attribute or check constraint *only* to/from the parent

- cannot add/drop NOT NULL constraint to/from *only* the parent

Thoughts?

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Kapila
Date:
On Wed, Oct 26, 2016 at 6:36 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> 1.
>> @@ -1775,6 +1775,12 @@ BeginCopyTo(ParseState *pstate,
>> {
>> ..
>> + else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_WRONG_OBJECT_TYPE),
>> + errmsg("cannot copy from partitioned table \"%s\"",
>> + RelationGetRelationName(rel)),
>> + errhint("Try the COPY (SELECT ...) TO variant.")));
>> ..
>> }
>>
>> Why is this restriction?  Won't it be useful to allow it for the cases
>> when user wants to copy the data of all the partitions?
>
> Sure, CopyTo() can be be taught to scan leaf partitions when a partitioned
> table is specified, but I thought this may be fine initially.
>

Okay, I don't want to add anything to your existing work unless it is
important.  However, I think there should be some agreement on which
of the restrictions are okay for first version of patch.  This can
avoid such questions in future from other reviewers.


>> 2.
>> + if (!pg_strcasecmp(stmt->partspec->strategy, "list") &&
>> + partnatts > 1)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
>> + errmsg("cannot list partition using more than one column")));
>>
>> /cannot list/cannot use list
>
> Actually "list partition" works here as a verb, as in "to list partition".
>

I am not an expert of this matter, so probably some one having better
grip can comment.  Are we using something similar in any other error
message?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/26 11:41, Amit Kapila wrote:
> On Wed, Oct 26, 2016 at 6:36 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> 1.
>>> @@ -1775,6 +1775,12 @@ BeginCopyTo(ParseState *pstate,
>>> {
>>> ..
>>> + else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_WRONG_OBJECT_TYPE),
>>> + errmsg("cannot copy from partitioned table \"%s\"",
>>> + RelationGetRelationName(rel)),
>>> + errhint("Try the COPY (SELECT ...) TO variant.")));
>>> ..
>>> }
>>>
>>> Why is this restriction?  Won't it be useful to allow it for the cases
>>> when user wants to copy the data of all the partitions?
>>
>> Sure, CopyTo() can be be taught to scan leaf partitions when a partitioned
>> table is specified, but I thought this may be fine initially.
>>
> 
> Okay, I don't want to add anything to your existing work unless it is
> important.  However, I think there should be some agreement on which
> of the restrictions are okay for first version of patch.  This can
> avoid such questions in future from other reviewers.

OK, so I assume you don't find this particular restriction problematic in
the long term.

>>> 2.
>>> + if (!pg_strcasecmp(stmt->partspec->strategy, "list") &&
>>> + partnatts > 1)
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
>>> + errmsg("cannot list partition using more than one column")));
>>>
>>> /cannot list/cannot use list
>>
>> Actually "list partition" works here as a verb, as in "to list partition".
>>
> 
> I am not an expert of this matter, so probably some one having better
> grip can comment.  Are we using something similar in any other error
> message?

In fact, I changed to the current text after Robert suggested the same [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoaPxXJ14eDVia514UiuQAXyZGqfbz8Qg3G4a8Rz2gKF7w%40mail.gmail.com





Re: Declarative partitioning - another take

From
Amit Kapila
Date:
On Wed, Oct 26, 2016 at 8:27 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/10/26 11:41, Amit Kapila wrote:
>> On Wed, Oct 26, 2016 at 6:36 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> 1.
>>>> @@ -1775,6 +1775,12 @@ BeginCopyTo(ParseState *pstate,
>>>> {
>>>> ..
>>>> + else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>>>> + ereport(ERROR,
>>>> + (errcode(ERRCODE_WRONG_OBJECT_TYPE),
>>>> + errmsg("cannot copy from partitioned table \"%s\"",
>>>> + RelationGetRelationName(rel)),
>>>> + errhint("Try the COPY (SELECT ...) TO variant.")));
>>>> ..
>>>> }
>>>>
>>>> Why is this restriction?  Won't it be useful to allow it for the cases
>>>> when user wants to copy the data of all the partitions?
>>>
>>> Sure, CopyTo() can be be taught to scan leaf partitions when a partitioned
>>> table is specified, but I thought this may be fine initially.
>>>
>>
>> Okay, I don't want to add anything to your existing work unless it is
>> important.  However, I think there should be some agreement on which
>> of the restrictions are okay for first version of patch.  This can
>> avoid such questions in future from other reviewers.
>
> OK, so I assume you don't find this particular restriction problematic in
> long term.
>

I think you can keep it as you have in patch.  After posting your
updated patches, please do send a list of restrictions which this
patch is imposing based on the argument that for first version they
are not essential.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/26 12:09, Amit Kapila wrote:
> On Wed, Oct 26, 2016 at 8:27 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2016/10/26 11:41, Amit Kapila wrote:
>>> On Wed, Oct 26, 2016 at 6:36 AM, Amit Langote
>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>>
>>>> Sure, CopyTo() can be be taught to scan leaf partitions when a partitioned
>>>> table is specified, but I thought this may be fine initially.
>>>
>>> Okay, I don't want to add anything to your existing work unless it is
>>> important.  However, I think there should be some agreement on which
>>> of the restrictions are okay for first version of patch.  This can
>>> avoid such questions in future from other reviewers.
>>
>> OK, so I assume you don't find this particular restriction problematic in
>> long term.
> 
> I think you can keep it as you have in patch.  After posting your
> updated patches, please do send a list of restrictions which this
> patch is imposing based on the argument that for first version they
> are not essential.

OK, agreed that it will be better to have all such restrictions and
limitations of the first version listed in one place, rather than being
scattered across different emails where they might have been mentioned and
discussed.

I will try to include such a list when posting the latest set of patches.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Kapila
Date:
On Wed, Oct 5, 2016 at 7:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> [ latest patch set ]
>
> Reviewing 0003:
>
>
> 5. I wonder how well this handles multi-column partition keys.  You've
> just got one Datum flag and one is-finite flag per partition, but I
> wonder if you don't need to track is-finite on a per-column basis, so
> that you could partition on (a, b) and have the first partition go up
> to (10, 10), the second to (10, infinity), the third to (20, 10), the
> fourth to (20, infinity), and the last to (infinity, infinity).  FWIW,
> Oracle supports this sort of thing, so perhaps we should, too.  On a
> related note, I'm not sure it's going to work to treat a composite
> partition key as a record type.  The user may want to specify a
> separate opfamily and collation for each column, not just inherit
> whatever the record behavior is.  I'm not sure if that's what you are
> doing, but the relcache structures don't seem adapted to storing one
> Datum per partitioning column per partition, but rather just one Datum
> per partition, and I'm not sure that's going to work very well.
>

@@ -123,6 +123,9 @@ typedef struct RelationData
{
.. MemoryContext rd_partkeycxt; /* private memory cxt for the below */ struct PartitionKeyData *rd_partkey; /*
partitionkey, or NULL */
 
+ MemoryContext rd_pdcxt; /* private context for partdesc */
+ struct PartitionDescData *rd_partdesc; /* partitions, or NULL */
+ List   *rd_partcheck; /* partition CHECK quals */
..
}

I think one thing to consider here is the increase in the size of the
relcache due to PartitionDescData.  I think it will be quite useful in
some cases, like tuple routing.  Isn't it feasible to get it some other
way, maybe by using relpartbound from the pg_class tuple?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/26 17:57, Amit Kapila wrote:
> @@ -123,6 +123,9 @@ typedef struct RelationData
> {
> ..
>   MemoryContext rd_partkeycxt; /* private memory cxt for the below */
>   struct PartitionKeyData *rd_partkey; /* partition key, or NULL */
> + MemoryContext rd_pdcxt; /* private context for partdesc */
> + struct PartitionDescData *rd_partdesc; /* partitions, or NULL */
> + List   *rd_partcheck; /* partition CHECK quals */
> ..
> }
> 
> I think one thing to consider here is the increase in size of relcache
> due to PartitionDescData.  I think it will be quite useful in some of
> the cases like tuple routing.  Isn't it feasible to get it in some
> other way, may be by using relpartbound from pg_class tuple?

Whereas pg_class.relpartbound stores the partition bound of the *individual
partitions* in Node form, the above relcache struct is associated with
parent tables; it contains an efficient-to-use (and fairly compact)
representation of the bounds of *all* the partitions of the parent.
Consider, for example, an array of sorted range bounds for range
partitioned tables.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Kapila
Date:
On Wed, Oct 26, 2016 at 3:04 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/10/26 17:57, Amit Kapila wrote:
>> @@ -123,6 +123,9 @@ typedef struct RelationData
>> {
>> ..
>>   MemoryContext rd_partkeycxt; /* private memory cxt for the below */
>>   struct PartitionKeyData *rd_partkey; /* partition key, or NULL */
>> + MemoryContext rd_pdcxt; /* private context for partdesc */
>> + struct PartitionDescData *rd_partdesc; /* partitions, or NULL */
>> + List   *rd_partcheck; /* partition CHECK quals */
>> ..
>> }
>>
>> I think one thing to consider here is the increase in size of relcache
>> due to PartitionDescData.  I think it will be quite useful in some of
>> the cases like tuple routing.  Isn't it feasible to get it in some
>> other way, may be by using relpartbound from pg_class tuple?
>
> Whereas pg_class.relpartbound stores partition bound of the *individual
> partitions* in Node form, the above relcache struct is associated with
> parent tables; it contains some efficient to use (and fairly compact)
> representation of bounds of *all* the partitions of the parent.
>

Okay, but it will still be proportional to the number of partitions and
partition keys.  Is it feasible to store ranges only for partitions that
are actively accessed?  For example, consider a table with 100 partitions
where the first access to the table touches the 5th partition; then we
store ranges for the first five partitions, or something like that.  This
could be helpful if we consider cases where the active partitions are far
fewer than the total partitions of a table.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Oct 26, 2016 at 6:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Oct 26, 2016 at 3:04 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2016/10/26 17:57, Amit Kapila wrote:
>>> @@ -123,6 +123,9 @@ typedef struct RelationData
>>> {
>>> ..
>>>   MemoryContext rd_partkeycxt; /* private memory cxt for the below */
>>>   struct PartitionKeyData *rd_partkey; /* partition key, or NULL */
>>> + MemoryContext rd_pdcxt; /* private context for partdesc */
>>> + struct PartitionDescData *rd_partdesc; /* partitions, or NULL */
>>> + List   *rd_partcheck; /* partition CHECK quals */
>>> ..
>>> }
>>>
>>> I think one thing to consider here is the increase in size of relcache
>>> due to PartitionDescData.  I think it will be quite useful in some of
>>> the cases like tuple routing.  Isn't it feasible to get it in some
>>> other way, may be by using relpartbound from pg_class tuple?
>>
>> Whereas pg_class.relpartbound stores partition bound of the *individual
>> partitions* in Node form, the above relcache struct is associated with
>> parent tables; it contains some efficient to use (and fairly compact)
>> representation of bounds of *all* the partitions of the parent.
>>
>
> Okay, but it will still be proportional to the number of partitions and
> partition keys.  Is it feasible to store ranges only for partitions that
> are actively accessed?  For example, consider a table with 100 partitions
> where the first access to the table touches the 5th partition; then we
> store ranges for the first five partitions, or something like that.  This
> could be helpful if we consider cases where the active partitions are far
> fewer than the total partitions of a table.

I have serious doubt about whether it's a good idea to do that EVER,
but it certainly doesn't need to be in the first version of this
patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Oct 25, 2016 at 2:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> a. Policy on table_name

No, because queries against the parent will apply the policy to
children.  See today's commit
162477a63d3c0fd1c31197717140a88077a8d0aa.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Oct 25, 2016 at 9:06 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I will include these changes in the next version of patches I will post
> soon in reply to [1].
> [1]
> https://www.postgresql.org/message-id/CA%2BTgmoYJcUTcN7vVgg54GHtffH11JJWYZnfF4KiRxjV-iaACQg%40mail.gmail.com

How soon?  Tempus fugit, and tempus in this release cycle fugit
particularly quickly.  It looks like the last revision of these
patches was on October 7th, and that's now more than 2 weeks ago.  We
need to get moving here if this is going to happen this cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Oct 25, 2016 at 10:00 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I am not (or no longer) sure how that argument affects INSERT on
> partitioned tables with tuple-routing though.  Are partitions at all
> levels *implicitly specified to be affected* when we say INSERT INTO
> root_partitioned_table?

I'd say yes.

>> Some of the other things that we might want to consider disallowing on
>> parent table could be:
>> a. Policy on table_name
>
> Perhaps.  Since there are no rows in the parent table(s) itself of a
> partition hierarchy, it might not make sense to continue to allow creating
> row-level security policies on them.

No, per my previous email.  Those policies are emphatically not without effect.

>> b. Alter table has many clauses, are all of those allowed and will it
>> make sense to allow them?
>
> Currently, we only disallow the following with partitioned parent tables
> as far as alter table is concerned.
>
> - cannot change inheritance by ALTER TABLE partitioned_table INHERIT ...
>
> - cannot let them be regular inheritance parents either - that is, the
>   following is disallowed: ALTER TABLE some_table INHERIT partitioned_table
>
> - cannot create UNIQUE, PRIMARY KEY, FOREIGN KEY, EXCLUDE constraints
>
> - cannot drop column involved in the partitioning key
>
> Most other forms that affect attributes and constraints follow the regular
> inheritance behavior (recursion) with certain exceptions such as:
>
> - cannot add/drop an attribute or check constraint to *only* to/from
>   the parent
>
> - cannot add/drop NOT NULL constraint to/from *only* the parent
>
> Thoughts?

Seems sensible to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/27 3:13, Robert Haas wrote:
> On Tue, Oct 25, 2016 at 9:06 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I will include these changes in the next version of patches I will post
>> soon in reply to [1].
>> [1]
>> https://www.postgresql.org/message-id/CA%2BTgmoYJcUTcN7vVgg54GHtffH11JJWYZnfF4KiRxjV-iaACQg%40mail.gmail.com
> 
> How soon?  Tempus fugit, and tempus in this release cycle fugit
> particularly quickly.  It looks like the last revision of these
> patches was on October 7th, and that's now more than 2 weeks ago.  We
> need to get moving here if this is going to happen this cycle.

Sorry about the delay, I'll post no later than tomorrow.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/10/05 10:50, Robert Haas wrote:
> On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> [ latest patch set ]
>
> Reviewing 0003:

Thanks a lot for the review and sorry about the delay in replying.

> +      This form attaches an existing table (partitioned or otherwise) as
>
> (which might itself be partitioned)
>
> +      partition of the target table.  Partition bound specification must
> +      correspond with the partition method and the key of the target table.
>
> The partition bound specification must correspond to the partitioning
> method and partitioning key of the target table.
>
> +      The table being attached must have all the columns of the target table
> +      with matching types and no more. Also, it must have all the matching
>
> The table to be attached must have all of the same columns as the
> target table and no more; moreover, the column types must also match.

Done.

> +      with matching types and no more. Also, it must have all the matching
> +      constraints as the target table.  That includes both <literal>NOT NULL</>
> +      and <literal>CHECK</> constraints.  If some <literal>CHECK</literal>
> +      constraint of the table being attached is marked <literal>NO
> INHERIT</literal>,
> +      the command will fail; such constraints must either be dropped or
> +      recreated without the <literal>NO INHERIT</literal> mark.
>
> Why all of these requirements?

Except for the last one, regular inheritance currently has the same
requirements; from the docs:

"""
To be added as a child, the target table must already contain all the same
columns as the parent (it could have additional columns, too). The columns
must have matching data types, and if they have NOT NULL constraints in
the parent then they must also have NOT NULL constraints in the child.

There must also be matching child-table constraints for all CHECK
constraints of the parent, except those marked non-inheritable (that is,
created with ALTER TABLE ... ADD CONSTRAINT ... NO INHERIT) in the parent,
which are ignored; all child-table constraints matched must not be marked
non-inheritable.
"""

Note that the "(it could have additional columns, too)" part does not
apply to partitions.  Also, the last point does not hold for partitioned
tables because they never have NO INHERIT constraints.

> We could instead perform a scan to
> validate that the constraints are met.  I think the way this should
> work is:
>
> 1. ATTACH PARTITION works whether matching NOT NULL and CHECK
> constraints are present or not.

Wouldn't this break the invariant that if the parent has a NOT NULL or a
CHECK constraint, all of its children must have it too?  Remember that we
allow directly inserting/updating data into partitions, so these
constraints better be persistent.

> 2. If all of the constraints are present, and a validated constraint
> matching the implicit partitioning constraint is also present, then
> ATTACH PARTITION does not scan the table to validate constraints;
> otherwise, it does.

I wonder what "matching" in "a validated constraint matching the implicit
partitioning constraint" means.  By that do you mean an exact equal()
match, provided both have been suitably canonicalized?  Or an existing
check constraint that *implies* the partitioning constraint?

> 3. NO VALIDATE is not an option.

I had my doubts about this option initially, but 2 above seemed hard to
do.  Or perhaps we can just remove this option and we will always scan
when attaching a partition.

> +      Currently <literal>UNIQUE</literal>, <literal>PRIMARY KEY</literal>, and
> +      <literal>FOREIGN KEY</literal> constraints are not considered, but that
> +      might change in the future.
>
> Really?  Changing that sounds impractical to me.

Assuming you are questioning the "but that might change in the future"
part, I took that line from the description for INHERIT.  Maybe it was
written in that section with a view that someone might implement a
globally enforced variant of each of those constraints.  I thought the
path to that would be slightly easier with declarative partitioning, but
that may be too optimistic.  Removed that part.

> +      This form detaches specified partition of the target table.  The
> +      detached partition continues to exist as a standalone table with no ties
> +      remaining with the target table.
>
> continues to exist as a standalone table, but no longer has any ties
> to the table from which it was detached.

Done.

>
> +      Note that if a partition being detached is itself a partitioned table,
> +      it continues to exist as such.
>
> You don't really need to say this, I think.  All of the properties of
> the detached table are retained, not only its partitioning status.
> You wouldn't like it if I told you to document "note that if a
> partition being detached is unlogged, it will still be unlogged".

I thought the same as Petr, who replied to this part of your email - that
it might help avoid confusion.  But it might be redundant, so I removed
that line.

>
>     To add the table as a new child of a parent table, you must own the
> -   parent table as well.
> +   parent table as well.  That applies to both adding the table as a
> +   inheritance child of a parent table and attaching a table as partition to
> +   the table.
>
> To add the table as a new child of a parent table, or as a new
> partition of a partitioned table, you must own the parent table as
> well.

Done.

> +        The name of the table to attach as a new partition to or
> detach from this table.
>
> s/to or/or to/

Oops, fixed.

> +    <literal>NO VALIDATE</> option is spcified.
>
> Typo, but see comments above about nuking this altogether.

Fixed the typo for now.  As I said above, it's not clear to me how the
alternative method of skipping the scan is supposed to work.

>      A recursive <literal>DROP COLUMN</literal> operation will remove a
>      descendant table's column only if the descendant does not inherit
>      that column from any other parents and never had an independent
> -    definition of the column.  A nonrecursive <literal>DROP
> +    definition of the column (which always holds if the descendant table
> +    is a partition).  A nonrecursive <literal>DROP
>      COLUMN</literal> (i.e., <command>ALTER TABLE ONLY ... DROP
>      COLUMN</command>) never removes any descendant columns, but
> -    instead marks them as independently defined rather than inherited.
> +    instead marks them as independently defined rather than inherited,
> +    unless the descendant table is a partition.
>
> This is a hairy explanation.  I suggest that the documentation say
> this (and that the code implement it): A nonrecursive DROP TABLE
> command will fail for a partitioned table, because all partitions of a
> table must have the same columns as the partitioning root.

Agreed, done (including the code).

> -    that are not marked <literal>NO INHERIT</>.
> +    that are not marked <literal>NO INHERIT</> which are unsupported if
> +    the table is a partitioned table.
>
> I think you can omit this hunk.

Done.

>
> +   If <literal>PARTITION OF</literal> clause is specified then the table is
> +   created as a partition of <literal>parent_table</literal> with specified
> +   bounds.  However, unlike regular tables, one cannot specify
> +   <literal>PARTITION BY</literal> clause which means foreign tables can
> +   only be created as leaf partitions.
>
> I'd delete the sentence beginning with "However".

Done.

> +   Create foreign table <structname>measurement_y2016m07</>, which will be
> +   accessed through the server <structname>server_07</>, that is partition
> +   of the range partitioned table <structname>measurement</>:
>
> s/, that is/ as a/

Fixed.

> +<phrase>and <replaceable
> class="PARAMETER">partition_bound_spec</replaceable> is:</phrase>
> +
> +FOR VALUES { <replaceable class="PARAMETER">list_spec</replaceable> |
> <replaceable class="PARAMETER">range_spec</replaceable> }
>
> I think you can inline the definitions of list_spec and range_spec
> here instead of making them separate productions, and I think that
> would be preferable.
>
> FOR VALUES { IN ( <replaceable
> class="PARAMETER">expression</replaceable> [, ...] ) |
> START <replaceable class="PARAMETER">lower-bound</replaceable> [
> INCLUSIVE | EXCLUSIVE ] END <replaceable
> class="PARAMETER">upper-bound</replaceable> [ INCLUSIVE | EXCLUSIVE ]
> }

Ah, done that way.

>
> +      parent table (name optionally schema-qualified).
>
> Parenthetical phrase isn't needed.

Removed.

>
> +      A partition bound specification must be present and must correspond with
> +      partition method and key of the parent table.  It is checked using the
> +      specification that the new partition does not overlap with any existing
> +      partitions of the parent.
>
> The partition bound specification must correspond to the partitioning
> method and partitioning key of the parent table, and must not overlap
> with any existing partition of that parent.

Done.

> +      clause, if any.  Defaults and constraints can optionally be specified
> +      for each of the inherited columns, which override those in the parent.
>
> Surely not.  You can't "override" an inherited constraint or an
> inherited default.  The child may have extra constraints not present
> in the parent, and may have different defaults when it is directly
> targeted by an insertion, but it can't possibly override the parent
> defaults.

Removed ", which override those in the parent.".

>
> +      One can also specify table constraints, in addition to those inherited
> +      from the parent.  Note that all subsequent schema modifications to the
> +      parent propagate to partition.
>
> The first part of this seems right, but then what's going on with the
> reference to constraints in the previous sentence?  (See previous
> review comment.) The second sentence I would delete (but see below).

In the previous sentence, I meant the constraints that one can specify
using WITH OPTIONS [ column_constraint [...] ], but I removed
the "overrides those in the parent" part as mentioned.

Removed the second sentence and instead adopted your text below.

>
> +     <para>
> +      Any data row subsequently inserted into the parent table is mapped to
> +      and stored in the partition, provided partition key of the row falls
> +      within the partition bounds.
> +     </para>
>
> How about: Rows inserted into a partitioned table will be
> automatically routed to the correct partition.  If no suitable
> partition exists, an error will occur.

Much better, done.

> +     <para>
> +      A partition is dropped or truncated when the parent table is dropped or
> +      truncated.  Dropping it directly using <literal>DROP TABLE</literal>
> +      will fail; it must first be <firstterm>detached</> from the parent.
> +      However, truncating a partition directly works.
> +     </para>
>
> How about: A partition must have the same column names and types as
> the table of which it is a partition.  Therefore, modifications the
> column names or types of the partitioned table will automatically
> propagate to all children, as will operations such as TRUNCATE which
> normally affect a table and all of its inheritance children.  It is
> also possible to TRUNCATE a partition individually, just as for an
> inheritance child.

Done.

> Insisting that you can't drop a child without detaching it first seems
> wrong to me.  If I already made this comment and you responded to it,
> please point me back to whatever you said.  However, my feeling is
> this is flat wrong and absolutely must be changed.

I said the following [1]:

| Hmm, I don't think I like this.  Why should it be necessary to detach
| a partition before dropping it?  That seems like an unnecessary step.

I thought we had better lock the parent table when removing one of its
partitions, and it seemed a bit odd to lock the parent table when dropping
a partition using DROP TABLE.  OTOH, with ALTER TABLE parent DETACH
PARTITION, the parent table is locked anyway.

> -            if (is_no_inherit)
> +
> +            /* Discard the NO INHERIT flag if the relation is a partition */
> +            if (is_no_inherit && !rel->rd_rel->relispartition)
>
> Something about this seems fishy.  In StoreRelCheck(), you disallow
> the addition of a NO INHERIT constraint on a partition, but here you
> seem happy to accept it and ignore the NO INHERIT property.  That
> doesn't seem consistent.  We should adopt a consistent policy about
> what to do about such constraints, and I submit that throwing an error
> is better than silently changing things, unless you have some reason
> for doing that which I'm not seeing.  Anyway, we should have the same
> behavior in both cases.

Allowing NO INHERIT constraints to be added to (leaf) partitions does not
seem like it will cause any problems, not that the flag will have any
meaning there.  It will come into play in only one case - causing an error
if a constraint with the same name is added to the parent.

With that in mind, I removed the check in StoreRelCheck() that you mention
here and also removed the above hunk to which your comment was addressed.

> We should also standardize on what value of conislocal and coninhcount
> children end up with; presumably the latter should be 1, but I'm not
> sure if the former should be true or false.  In either case, anything
> that can vary between children probably needs to be dumped, so let's
> enforce that it doesn't so we don't have to dump it.  I'm not sure
> whether the code here achieves those objectives, though, and note the
> comment in the function header about making sure the logic here
> matches MergeConstraintsIntoExisting.

To summarize:

If a table is a partition, it doesn't have any local attributes.  Also, it
doesn't have any check constraint that is both local *and* inherited;
only the non-inherited constraints are ever local.

So, all parent attributes will be present in all the partitions at any
given time with attislocal = false and attinhcount = 1.  Likewise for
inherited constraints with conislocal = false and coninhcount = 1.

With above rules in place, pg_dump no longer dumps attributes or inherited
constraints in CREATE TABLE commands of individual partitions.

> I think the overriding principle here should be: If you attach a table
> as a partition, it must not be part of a standard inheritance
> hierarchy, and it must not be a partition of any other table.  It can,
> however, be partitioned itself.  If you later detach a partition, it
> ends up as a standalone table with a copy of each constraint it had as
> a partition - probably including the implicit partition constraint.
> The DBA can drop those constraints if they're not wanted.

Check.  About implicit constraints, see below.

> I wonder if it's really a good idea for the partition constraints to
> be implicit; what is the benefit of leaving those uncatalogued?

I did start out that way - ie, cataloguing the implicit constraints - but
later thought it might not be good to end up with multiple copies of
essentially the same information.  Cataloguing them would also bring in
dependencies and all the places that know about pg_constraint.

In the long term, I think we're only going to need them because we want to
enforce them when directly inserting data into partitions.

So, detaching a partition does not emit a check constraint matching the
implicit partition constraint.

> + * Depending on whether the relation in question is list or range
> + * partitioned, one of the fields is set.
> + */
> +typedef struct BoundCollectionData
> +{
> +    struct ListInfo       *listinfo;
> +    struct RangeInfo   *rangeinfo;
> +} BoundCollectionData;
>
> This seems like an odd design.  First, when you have a pointer to
> either of two things, the normal tool for that in C would be a union,
> not a struct. Second, in PostgreSQL we typically handle that by making
> both of the things nodes and then you can use IsA() or switch on
> nodeTag() to figure out what you've got.  Third, the only place this
> is used at least in 0003 is as part of PartitionDescData, which only
> has 3 members, so if you were going to do it with 2 members, you could
> just include these two members directly.  Considering all of the
> foregoing, I'd suggest making this a union and including partstrategy
> in PartitionDescData.
>
> I think that the names ListInfo and RangeInfo are far too generic for
> something that's specific to partitioning.

How about this:

/*
 * Collection of bounds of a partitioned relation (either physical or
 * logical)
 */
typedef struct BoundCollectionData
{
    char    strategy;   /* list or range bounds? */

    union
    {
        struct PartitionListInfo    lists;
        struct PartitionRangeInfo   ranges;
    } bounds;
} BoundCollectionData;

I added the strategy field here instead of in PartitionDescData, because
this structure is supposed to represent even transient partitioned
relations.  If the relation is a physical relation, we can determine the
strategy from its PartitionKey.

I renamed ListInfo/RangeInfo to PartitionListInfo/PartitionRangeInfo.

> +/*
> + * Range bound collection - sorted array of ranges of partitions of a range
> + * partitioned table
> + */
> +typedef struct RangeInfo
> +{
> +    struct PartitionRange    **ranges;
> +} RangeInfo;
> +
> +/* One partition's range */
> +typedef struct PartitionRange
> +{
> +    struct PartitionRangeBound    *lower;
> +    struct PartitionRangeBound    *upper;
> +} PartitionRange;
>
> This representation doesn't seem optimal, because in most cases the
> lower bound of one partition will be the upper bound of the next.  I
> suggest that you flatten this representation into a single list of
> relevant bounds, each flagged as to whether it is exclusive and
> whether it is finite; and a list with one more element of bounds.  For
> example, suppose the partition bounds are [10, 20), [20, 30), (30,
> 40), and [50, 60).  You first construct a list of all of the distinct
> bounds, flipping inclusive/exclusive for the lefthand bound in each
> case.  So you get:
>
> 10 EXCLUSIVE
> 20 EXCLUSIVE
> 30 EXCLUSIVE
> 30 INCLUSIVE
> 40 EXCLUSIVE
> 50 EXCLUSIVE
> 60 EXCLUSIVE
>
> When ordering items for this list, if the same item appears twice, the
> EXCLUSIVE copy always appears before INCLUSIVE.  When comparing
> against an EXCLUSIVE item, we move to the first half of the array if
> we are searching for a value strictly less than that value; when
> comparing against an INCLUSIVE item, we move to the first half of the
> array if we are searching for a value less than or equal to that
> value.
>
> This is a list of seven items, so a binary search will return a
> position between 0 (less than all items) and 7 (greater than all
> items).  So we need a list of 8 partition mappings, which in this case
> will look like this: -1, 0, 1, -1, 2, -1, 3, -1.
>
> In this particular example, there are only two adjacent partitions, so
> we end up with 7 bounds with this representation vs. 8 with yours, but
> in general I think the gains will be bigger.  If you've got 1000
> partitions and they're all adjacent, which is likely, you store 1000
> bounds instead of 2000 bounds by doing it this way.

Thanks for the idea.  I have implemented the same.

>
> + * Note: This function should be called only when it is known that 'relid'
> + * is a partition.
>
> Why?  How about "Because this function assumes that the relation whose
> OID is passed as an argument will have precisely one parent, it should
> only been called when it is known that the relation is a partition."

Rewrote the note.

> +    /*
> +     * Translate vars in the generated expression to have correct attnos.
> +     * Note that the vars in my_qual bear attnos dictated by key which carries
> +     * physical attnos of the parent.  We must allow for a case where physical
> +     * attnos of a partition can be different from the parent.
> +     */
> +    partexprs_item = list_head(key->partexprs);
> +    for (i = 0; i < key->partnatts; i++)
> +    {
> +        AttrNumber    attno = key->partattrs[i],
> +                    new_attno;
> +        char       *attname;
> +
> +        if (attno != 0)
> +        {
> +            /* Simple column reference */
> +            attname = get_attname(RelationGetRelid(parent), attno);
> +            new_attno = get_attnum(RelationGetRelid(rel), attname);
> +
> +            if (new_attno != attno)
> +                my_qual = (List *) translate_var_attno((Node *) my_qual,
> +                                                       attno,
> +                                                       new_attno);
>
> It can't really be safe to do this one attribute number at a time, or
> even if by some good fortune it can't be broken, it at least it seems
> extremely fragile.  Suppose that you translate 0 -> 3 and then 3 -> 5;
> now the result is garbage.  It's not very efficient to do this one
> attno at a time, either.

You are quite right.  I changed this to use map_variable_attnos().

When doing that, I noticed that the above function does not map whole-row
references but instead just reports whether one was encountered, so that
the caller can raise an error that expressions containing whole-row vars
are not allowed in the corresponding context (examples of such callers
include transformTableLikeClause()).  In the context of this function,
there is no way we could output such an error.

Thinking more about that, even if we did map whole-row vars in check
expressions, the constraint exclusion code would not be able to handle
them, because it does not work with anything but a Const node as the
predicate's or query clause's right operand.

I went ahead and prohibited whole-row references from being used in the
partition key expressions in the first place (ie, in patch 0001).

>
> +    if (classform->relispartition)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +                 errmsg("\"%s\" is a partition of \"%s\"", rel->relname,
> +                        get_rel_name(get_partition_parent(relOid))),
> +                 errhint("Use ALTER TABLE DETACH PARTITION to be able
> to drop it.")));
> +
>
> RangeVarCallbackForDropRelation should do only minimal sanity checks;
> defer this until after we have a relation lock.

Moved this check to RemoveRelations(), wherein after
RangeVarGetRelidExtended() returns, the relation is guaranteed to be locked.

> I didn't get all the way through this patch, so this is a pretty
> incomplete review, but it's late here and I'm out of steam for
> tonight.  Some general comments:
>
> 1. I think that this patch seems to introduce an awful lot of new
> structures with confusingly similar names and purposes:
> PartitionBoundList, PartitionBoundRange, ListInfo, RangeInfo,
> PartitionRange, PartitionList, ListValue, RangePartition.  You've got
> 4 different structures here that all have "Partition" and "Range" in
> the name someplace, including both PartitionRange and RangePartition.
> Maybe there's no way around that kind of duplication; after all there
> are quite a few moving parts here.  But it seems like it would be good
> to try to simplify it.

OK, I got rid of most of those structs and now there are only the
following: PartitionListInfo, PartitionRangeInfo, PartitionRangeBound and
PartitionListValue.

> 2. I'm also a bit concerned about the fairly large amount of
> apparently-boilerplate code in partition.c, all having to do with how
> we create all of these data structures and translate between different
> forms of them.  I haven't understood that stuff thoroughly enough to
> have a specific idea about how we might be able to get rid of any of
> it, and maybe there's no way.  But that too seems like a topic for
> futher investigation.  One idea is that maybe some of these things
> should be nodes that piggyback on the existing infrastructure in
> src/backend/nodes instead of inventing a new way to do something
> similar.

I thought about the idea of using the src/backend/nodes infrastructure for
the partitioning data structures.  However, none of those structures are
part of any existing Node, nor do I imagine they will be in the future, so
making them Nodes seems unnecessary.  That said, there is a subset of
NodeTags called "TAGS FOR RANDOM OTHER STUFF" for Node objects that are
not part of parse/plan/execute node trees, such as T_TriggerData and
T_ForeignKeyCacheInfo.  I wonder if you meant making the partitioning
structures nodes of that category.

That said, I have tried in the current patch to reduce the amount of
unnecessary boilerplate code.

> 3. There are a lot of places where you have separate code for the
> range and list partitioning cases, and I'm suspicious that there are
> ways that code could be unified.  For example, with list partitioning,
> you have a bunch of Datums which represent the specific values that
> can appear in the various partitions, and with range partitioning, you
> have a bunch of Datums that represent the edges of partitions.  Well,
> if you used the same array for both purposes, you could use the same
> code to copy it.  That would involve flattening away a lot of the
> subsidiary structure under PartitionDescData and pulling up stuff that
> is buried lower down into the main structure, but I think that's
> likely a good idea anyway - see also point #1.

Hmm, I think it might be somewhat clearer to keep them separate in the
form of the PartitionListInfo and PartitionRangeInfo structs.  In the case
of range partitioning, each individual range bound is actually a
PartitionRangeBound instance, not just a Datum.  That's because we have to
store the following information for each range bound: an array of Datums
to represent the composite partition key, an is-finite flag for each
partitioning column, a flag representing whether the bound is inclusive,
and a flag representing whether it is a lower or an upper bound.
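For illustration only, the information listed above could be pictured as follows; this is a simplified standalone sketch, and the field names do not necessarily match the patch's actual definitions:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_KEY_COLS 4

/* Illustrative sketch of the per-bound information described above;
 * not the patch's actual struct. */
typedef struct RangeBoundSketch
{
    int   ncols;                  /* number of partitioning columns */
    long  datums[MAX_KEY_COLS];   /* one value per partitioning column */
    bool  finite[MAX_KEY_COLS];   /* false => UNBOUNDED in that column */
    bool  inclusive;              /* is the bound value itself included? */
    bool  lower;                  /* lower (start) or upper (end) bound? */
} RangeBoundSketch;

/* Does a one-column key value fall on the inside of an upper bound? */
static bool within_upper(const RangeBoundSketch *b, long val)
{
    if (!b->finite[0])
        return true;              /* unbounded above admits everything */
    return b->inclusive ? (val <= b->datums[0]) : (val < b->datums[0]);
}
```

The inclusive flag is exactly what later messages in this thread debate dropping.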

> 4. I'm somewhat wondering if we ought to just legislate that the lower
> bound is always inclusive and the upper bound is always exclusive.
> The decision to support both inclusive and exclusive partition bounds
> is responsible for an enormous truckload of complexity in this patch,
> and I have a feeling it is also going to be a not-infrequent cause of
> user error.

I thought we decided at some point to go with range-type-like notation to
specify range partition bounds because of its flexibility.  I agree though
that with that flexibility, there will be more input combinations that can
cause errors.  As for the internal complexity, it's not clear to me whether
it would be reduced by always-inclusive lower and always-exclusive upper
bounds.  We would still need to store the inclusive flag with each
individual PartitionRangeBound and consider it when comparing them with
each other and with the partition keys of tuples.

> 5. I wonder how well this handles multi-column partition keys.  You've
> just got one Datum flag and one is-finite flag per partition, but I
> wonder if you don't need to track is-finite on a per-column basis, so
> that you could partition on (a, b) and have the first partition go up
> to (10, 10), the second to (10, infinity), the third to (20, 10), the
> fourth to (20, infinity), and the last to (infinity, infinity).  FWIW,
> Oracle supports this sort of thing, so perhaps we should, too.  On a

I have implemented the feature that allows specifying UNBOUNDED per column.

> related note, I'm not sure it's going to work to treat a composite
> partition key as a record type.  The user may want to specify a
> separate opfamily and collation for each column, not just inherit
> whatever the record behavior is.  I'm not sure if that's what you are
> doing, but the relcache structures don't seem adapted to storing one
> Datum per partitioning column per partition, but rather just one Datum
> per partition, and I'm not sure that's going to work very well.

Actually, there *is* one Datum per partitioning column.  That is, the
composite partition key is not treated as a record, as it may have seemed.


Please find attached the latest version of the patches taking care of the
above review comments and some other issues I found.  As Amit Kapila
requested [2], here is a list of restrictions on partitioned tables, some
of which we might be able to overcome in the future:

- cannot create indexes
- cannot create row triggers
- cannot specify UNIQUE, PRIMARY KEY, FOREIGN KEY, EXCLUDE constraints
- cannot become inheritance parent or child
- cannot use in COPY TO command

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/169708f6-6e5a-18d1-707b-1b323e4a6baf%40lab.ntt.co.jp

[2]
https://www.postgresql.org/message-id/CAA4eK1LqTqZkPSoonF5_cOz94OUZG9j0PNfLdhi_nPtW82fFVA%40mail.gmail.com


Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Oct 28, 2016 at 3:53 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> 4. I'm somewhat wondering if we ought to just legislate that the lower
>> bound is always inclusive and the upper bound is always exclusive.
>> The decision to support both inclusive and exclusive partition bounds
>> is responsible for an enormous truckload of complexity in this patch,
>> and I have a feeling it is also going to be a not-infrequent cause of
>> user error.
>
> I thought we decided at some point to go with range type like notation to
> specify range partition bound because of its flexibility.  I agree though
> that with that flexibility, there will more input combinations that will
> cause error.  As for the internal complexity, it's not clear to me whether
> it will be reduced by always-inclusive lower and always-exclusive upper
> bounds.  We would still need to store the inclusive flag with individual
> PartitionRangeBound and consider it when comparing them with each other
> and with partition key of tuples.

We did.  But I now think that was kind of silly.  I mean, we also
talked about not having a hard distinction between list partitioning
and range partitioning, but you didn't implement it that way and I
think that's probably a good thing.  The question is - what's the
benefit of allowing this to be configurable?

For integers, there's absolutely no difference in expressive power.
If you want to allow 1000-2000 with both bounds inclusive, you can
just say START (1000) END (2001) instead of START (1000) END (2000)
INCLUSIVE.  This is also true for any other datatype where it makes
sense to talk about "the next value" and "the previous value".
Instead of making the upper bound inclusive, you can just end at the
next value instead.  If you were tempted to make the lower bound
exclusive, same thing.
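Robert's integer argument can be checked mechanically; the following standalone sketch (illustrative names, not patch code) shows that for integer keys a closed upper bound is exactly equivalent to a half-open bound one step higher:

```c
#include <assert.h>
#include <stdbool.h>

/* Membership under the two formulations.  For integer keys they admit
 * exactly the same values whenever end_excl == end_incl + 1, which is
 * why START (1000) END (2001) can replace
 * START (1000) END (2000) INCLUSIVE. */
static bool in_closed(long v, long start, long end_incl)
{
    return v >= start && v <= end_incl;
}

static bool in_half_open(long v, long start, long end_excl)
{
    return v >= start && v < end_excl;
}
```

The equivalence holds for any datatype with a well-defined successor; it is exactly the non-discrete types (strings, numerics) where it breaks down, as discussed next.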

For strings and numeric types that are not integers, there is in
theory a loss of power.  If you want a partition that allows every
value starting with 'a' plus the string 'b' but not anything after
that, you are out of luck.  START ('a') END ('b') INCLUSIVE would have
done exactly what you want, but now you need to store the first string
that you *don't* want to include in that partition, and what's that?
Dunno.  Or similarly if you want to store everything from 1.0 up to
and including 2.0 but nothing higher, you can't, really.

But who wants that?  People who are doing prefix-based partitioning of
their text keys are going to want all of the 'a' things together, and
all of the 'b' things in another category.  Same for ranges of
floating-point numbers, which are also probably an unlikely candidate
for a partitioning key anyway.

So let's look at the other side.  What do we gain by excluding this
functionality?  Well, we save a parser keyword: INCLUSIVE no longer
needs to be a keyword.  We also can save some code in
make_one_range_bound(), RelationBuildPartitionDesc(),
copy_range_bound(), partition_rbound_cmp(), and partition_rbound_eq().

Also, if we exclude this now as I'm proposing, we can always add it
back later if it turns out that people need it.  On the other hand, if
we include it in the first version, it's going to be very hard to get
rid of it if it turns out we don't want it.  Once we release support
for a syntax, we're kinda stuck with it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Oct 28, 2016 at 3:53 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> [ new patches ]

Reviewing 0006:

This patch seems scary.  I sort of assumed from the title -- "Teach a
few places to use partition check quals." -- that this was an optional
thing, some kind of optimization from which we could reap further
advantage once the basic infrastructure was in place.  But it's not
that at all.  It's absolutely necessary that we do this, or data
integrity is fundamentally compromised.  How do we know that we've
found all of the places that need to be taught about these new,
uncatalogued constraints?

I'm feeling fairly strongly like you should rewind and make the
partitioning constraints normal catalogued constraints.  That's got a
number of advantages, most notably that we can be sure they will be
properly enforced by the entire system (modulo existing bugs, of
course).  Also, they'll show up automatically in tools like psql's \d
output, pgAdmin, and anything else that is accustomed to being able to
find constraints in the catalog.  We do need to make sure that those
constraints can't be dropped (or altered?) inappropriately, but that's
a relatively small problem.  If we stick with the design you've got
here, every client tool in the world needs to be updated, and I'm not
seeing nearly enough advantage in this system to justify that kind of
upheaval.

In fact, as far as I can see, the only advantage of this approach is
that when the insert arrives through the parent and is routed to the
child by whatever tuple-routing code we end up with (I guess that's
what 0008 does), we get to skip checking the constraint, saving CPU
cycles.  That's probably an important optimization, but I don't think
that putting the partitioning constraint in the catalog in any way
rules out the possibility of performing that optimization.  It's just
that instead of having the partitioning constraint excluded by default
and then sometimes choosing to include it, you'll have it included by
default and then sometimes choose to exclude it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Oct 28, 2016 at 3:53 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Insisting that you can't drop a child without detaching it first seems
>> wrong to me.  If I already made this comment and you responded to it,
>> please point me back to whatever you said.  However, my feeling is
>> this is flat wrong and absolutely must be changed.
>
> I said the following [1]:
>
> | Hmm, I don't think I like this.  Why should it be necessary to detach
> | a partition before dropping it?  That seems like an unnecessary step.
>
> I thought we had better lock the parent table when removing one of its
> partitions and it seemed a bit odd to lock the parent table when dropping
> a partition using DROP TABLE?  OTOH, with ALTER TABLE parent DETACH
> PARTITION, the parent table is locked anyway.

That "OTOH" part seems like a pretty relevant point.

Basically, I think people expect to be able to say "DROP THINGTYPE
thingname" or at most "DROP THINGTYPE thingname CASCADE" and have that
thing go away.  I'm opposed to anything which requires some other
series of steps without a very good reason, and possible surprise
about the precise locks that the command requires isn't a good enough
reason from my point of view.

>> I wonder if it's really a good idea for the partition constraints to
>> be implicit; what is the benefit of leaving those uncatalogued?
>
> I did start out that way - ie, catalogued implicit constraints, but later
> thought it might not be good to end up with multiple copies of essentially
> the same information.  With cataloguing them will come dependencies and
> all places that know about pg_constraint.
>
> In the long term, I think we're only going to need them because we want to
> enforce them when directly inserting data into partitions.

See my other email on this topic.  I agree there are some complexities
here, including making sure that pg_dump does the right thing.  But I
think it's the right way to go.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Corey Huinker
Date:

On Tue, Nov 1, 2016 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
For strings and numeric types that are not integers, there is in
theory a loss of power.  If you want a partition that allows every
value starting with 'a' plus the string 'b' but not anything after
that, you are out of luck.  START ('a') END ('b') INCLUSIVE would have
done exactly what you want, but now you need to store the first string
that you *don't* want to include in that partition, and what's that?
Dunno.  Or similarly if you want to store everything from 1.0 up to
and including 2.0 but nothing higher, you can't, really.


Exactly.  This is especially true for date ranges.  There's a lot of
cognitive dissonance in defining the "2014" partition as < '2015-01-01',
as was the case in Oracle waterfall-style partitioning.  That was my
reasoning for pushing for range-ish syntax as well as form.
 
But who wants that?  People who are doing prefix-based partitioning of
their text keys are going to want all of the 'a' things together, and
all of the 'b' things in another category.  Same for ranges of
floating-point numbers, which are also probably an unlikely candidate
for a partitioning key anyway.

/me raises hand.  We have tables with a taxonomy in them where the even
data splits don't fall on single letter boundaries, and often the single
string values have more rows than entire letters.  In those situations,
being able to express ['XYZ','XYZ'] is important.  ['XYZ','XZ') would let
'XYZ1' bleed into the partition and ['XYZ','XYZ1') lets in other values,
and so I go chasing down the non-discrete-set rabbit hole.

If we're worried about keywords, maybe a BOUNDED '[]' clause?

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Oct 28, 2016 at 3:53 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>  [ new patches ]

Reviewing 0005:

Your proposed commit messages says this:

> If relation is the target table (UPDATE and DELETE), flattening is
> done regardless (scared to modify inheritance_planner() yet).

In the immortal words of Frank Herbert: “I must not fear. Fear is the
mind-killer. Fear is the little-death that brings total obliteration.
I will face my fear. I will permit it to pass over me and through me.
And when it has gone past I will turn the inner eye to see its path.
Where the fear has gone there will be nothing. Only I will remain.”

In other words, I'm not going to accept fear as a justification for
randomly excluding the target-table case from this code.  If there's
an actual reason not to do this in that case or some other case, then
let's document that reason.  But weird warts in the code that are
justified only by nameless anxiety are not good.

Of course, the prior question is whether we should EVER be doing this.
I realize that something like this MAY be needed for partition-wise
join, but the mission of this patch is not to implement partition-wise
join.  Does anything in this patch series really require this?  If so,
what?  If not, how about we leave it out and refactor it when that
change is really needed for something?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Nov 1, 2016 at 1:49 PM, Corey Huinker <corey.huinker@gmail.com> wrote:
> Exactly. This is especially true for date ranges. There's a lot of cognitive
> dissonance in defining the "2014" partition as < '2015-01-01', as was the
> case in Oracle waterfall-style partitioning. That was my reasoning for
> pushing for range-ish syntax as well as form.

Yeah.  That syntax has some big advantages, though.  If we define that
partition as START ('2014-01-01') INCLUSIVE END ('2014-12-31')
INCLUSIVE, there's no way for the system to tell that there's no gap
between that ending bound and the starting bound of the 2015 partition,
because the system has no domain-specific knowledge that there is no
daylight between 2014-12-31 and 2015-01-01.  So if we allow things to be
specified that way, then people will use that syntax and then complain
when it doesn't perform quite as well as START ('2014-01-01') END
('2015-01-01').  Maybe the difference isn't material and maybe we don't
care; what do you think?
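The gap-detection point can be made concrete with a small standalone sketch (plain C, illustrative only): when every partition is a half-open range [start, end), contiguity of a sorted set of partitions reduces to an equality test on adjacent bounds, with no datatype-specific notion of "the next value" required:

```c
#include <assert.h>
#include <stdbool.h>

/* Each partition i covers [starts[i], ends[i]).  With half-open
 * bounds, "no gaps and no overlaps" between sorted neighbors is just
 * ends[i-1] == starts[i]; with inclusive upper bounds the system
 * would instead need a successor function for the key's datatype. */
static bool ranges_contiguous(const long *starts, const long *ends, int n)
{
    for (int i = 1; i < n; i++)
        if (ends[i - 1] != starts[i])
            return false;
    return true;
}
```

This is the check that START ('2014-12-31') INCLUSIVE would defeat for date keys, since 2014-12-31 and 2015-01-01 compare unequal even though no date lies between them.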

(I really don't want to get tied up adding a system for adding and
subtracting one to and from arbitrary data types.  Life is too short.
If that requires that users cope with a bit of cognitive dissonance,
well, it's not the first time something like that will have happened.
I have some cognitive dissonance about the fact that creat(2) has no
trailing "e" but truncate(2) does, and moreover the latter can be used
to make a file longer rather than shorter.  But, hey, that's what you
get for choosing a career in computer science.)

>> But who wants that?  People who are doing prefix-based partitioning of
>> their text keys are going to want all of the 'a' things together, and
>> all of the 'b' things in another category.  Same for ranges of
>> floating-point numbers, which are also probably an unlikely candidate
>> for a partitioning key anyway.
>
> /me raises hand.  We have tables with a taxonomy in them where the even data
> splits don't fall on single letter boundaries, and often the single string
> values have more rows than entire letters. In those situations, being able
> to express ['XYZ','XYZ'] is important.  ['XYZ','XZ') would let 'XYZ1' bleed
> into the partition and ['XYZ','XYZ1') lets in other values, and so I go
> chasing down the non-discrete set rabbit hole.

Hmm.  I have to admit that I hadn't considered the case where you have
a range partitioning scheme but one of the ranges includes only a
single string.  If that's an important use case, that might be a fatal
problem with my proposal.  :-(

> If we're worried about keywords, maybe a BOUNDED '[]' clause?

In the end, keywords are not the defining issue here; the issue is
whether all of this complexity around inclusive and exclusive bounds
carries its weight, and whether we want to be committed to that.

Any other opinions out there?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Francisco Olarte
Date:
On Tue, Nov 1, 2016 at 6:49 PM, Corey Huinker <corey.huinker@gmail.com> wrote:
>
> On Tue, Nov 1, 2016 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> For strings and numeric types that are not integers, there is in
>> theory a loss of power.  If you want a partition that allows every
>> value starting with 'a' plus the string 'b' but not anything after
>> that, you are out of luck.  START ('a') END ('b') INCLUSIVE would have
>> done exactly what you want, but now you need to store the first string
>> that you *don't* want to include in that partition, and what's that?
>> Dunno.  Or similarly if you want to store everything from 1.0 up to
>> and including 2.0 but nothing higher, you can't, really.
> Exactly. This is especially true for date ranges. There's a lot of cognitive
> dissonance in defining the "2014" partition as < '2015-01-01', as was the
> case in Oracle waterfall-style partitioning. That was my reasoning for
> pushing for range-ish syntax as well as form.

OTOH I've seen a lot of people bitten by [2014-01-01,2014-12-31] on
TIMESTAMP intervals.

Everybody remembers December has 31 days, but when we have to do
MONTHLY partitions with closed intervals someone always miskeys the
number of days, or forgets whether a particular year is a leap year or
not, so when doing it automatically I always have to code it as
start + 1 month - 1 day.  In my experience, having the non-significant
part of the dates ( days in the monthly case, months too in the yearly
case ) both 1 and equal in start and end makes it easier to check and
identify, and less error prone.
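Francisco's "start + 1 month - 1 day" rule can be sketched in standard C; mktime() normalizes out-of-range fields, so leap years and 30/31-day months come out right without keying in day counts by hand (a standalone illustration, not anything from the patch):

```c
#include <assert.h>
#include <time.h>

/* Closed end of a monthly partition, computed as "first day of the
 * next month minus one day".  Setting tm_mday to 0 makes mktime()
 * normalize back to the last day of the preceding month, so leap
 * years and month lengths are handled by the library. */
static int last_day_of_month(int year, int month /* 1..12 */)
{
    struct tm tm = {0};
    tm.tm_year  = year - 1900;
    tm.tm_mon   = month;   /* 0-based, so this is the *next* month */
    tm.tm_mday  = 0;       /* day 0 normalizes to the prior month's end */
    tm.tm_hour  = 12;      /* midday avoids DST edge effects */
    tm.tm_isdst = -1;
    mktime(&tm);
    return tm.tm_mday;
}
```

With half-open bounds none of this is needed: the end of the January partition is simply '2014-02-01', which is the point being argued upthread.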

>> But who wants that?  People who are doing prefix-based partitioning of
>> their text keys are going to want all of the 'a' things together, and
>> all of the 'b' things in another category.  Same for ranges of
>> floating-point numbers, which are also probably an unlikely candidate
>> for a partitioning key anyway.
> /me raises hand.  We have tables with a taxonomy in them where the even data
> splits don't fall on single letter boundaries, and often the single string
> values have more rows than entire letters. In those situations, being able
> to express ['XYZ','XYZ'] is important.  ['XYZ,'XZ') would let 'XYZ1' bleed
> into the partition and ['XYZ','XYZ1') lets in other values, and so I go
> chasing down the non-discrete set rabbit hole.

You just do the classical trick ( I've had to do it ) of closed end ||
minimum char ( "XYZ","XYZ\0" in this case ).  It is not that difficult,
as strings have a global order; the next string after any given one is
always that plus the \0, or whatever your minimum is.

The problem is with anything similar to a real number, but then there
I've always opted for half-open interval, as they can cover the line
without overlapping, unlike closed ones.

Anyway, as long as anyone makes sure HALF-OPEN intervals are allowed,
I'm fine ( I do not remember the name, but I once had to work with a
system that only allowed closed or open intervals and it was a real PITA ).

Francisco Olarte.



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Nov 1, 2016 at 2:05 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
>> /me raises hand.  We have tables with a taxonomy in them where the even data
>> splits don't fall on single letter boundaries, and often the single string
>> values have more rows than entire letters. In those situations, being able
>> to express ['XYZ','XYZ'] is important.  ['XYZ','XZ') would let 'XYZ1' bleed
>> into the partition and ['XYZ','XYZ1') lets in other values, and so I go
>> chasing down the non-discrete set rabbit hole.
>
> You just do the classical ( I've had to do it ) closed end || minimum
> char ( "XYZ","XYZ\0" in this case ). It is not that difficult as
> strings have a global order, the next string to any one is always that
> plus the \0, or whatever your minimum is.

In defense of Corey's position, that's not so easy.  First, \0 doesn't
work; our strings can't include null bytes.  Second, the minimum legal
character depends on the collation in use.  It's not so easy to figure
out what the "next" string is, even though there necessarily must be
one.

> The problem is with anything similar to a real number, but then there
> I've always opted for half-open interval, as they can cover the line
> without overlapping, unlike closed ones.
>
> Anyway, as long as anyone makes sure HALF-OPEN intervals are allowed,
> I'm fine ( I do not remember the name, but once had to work with a
> system that only allowed closed or open and it was a real PITA ).

I think we're all in agreement that half-open intervals should not
only be allowed, but the default.  The question is whether it's a good
idea to also allow other possibilities.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Corey Huinker
Date:
On Tue, Nov 1, 2016 at 2:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Yeah.  That syntax has some big advantages, though.  If we define that
partition as START ('2014-01-01') INCLUSIVE END ('2014-12-31')
INCLUSIVE, there's no way for the system to tell that there's no
gap between that ending bound and the starting bound of the 2015
partition, because the system has no domain-specific knowledge that
there is no daylight between 2014-12-31 and 2015-01-01.  So if we
allow things to be specified that way, then people will use that
syntax and then complain when it doesn't perform quite as well as
START ('2014-01-01') END ('2015-01-01').  Maybe the difference isn't
material and maybe we don't care; what do you think?

It was a fight I didn't expect to win, and even if we don't get
[x,x]-expressible partitions, at least we're not in the Oracle
context-waterfall, where the lower bound of your partition is determined
by the upper bound of the NEXT partition.

(I really don't want to get tied up adding a system for adding and
subtracting one to and from arbitrary data types.  Life is too short.
If that requires that users cope with a bit of cognitive dissonance,
well, it's not the first time something like that will have happened.
I have some cognitive dissonance about the fact that creat(2) has no
trailing "e" but truncate(2) does, and moreover the latter can be used
to make a file longer rather than shorter.  But, hey, that's what you
get for choosing a career in computer science.)

That noise you heard was the sound of my dream dying.

Re: Declarative partitioning - another take

From
Francisco Olarte
Date:
On Tue, Nov 1, 2016 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> In the end, keywords are not the defining issue here; the issue is
> whether all of this complexity around inclusive and exclusive bounds
> carries its weight, and whether we want to be committed to that.
>
> Any other opinions out there?

If it were for me I would opt for just half-open intervals.  The only
problem I've ever had with them is when working with FINITE ranges,
i.e., there is no way of expressing the full range of an 8-bit integer
with half-open intervals of 8-bit integers, but I would happily pay that
cost for the benefit of not having people unintentionally make
non-contiguous date/timestamp intervals, which I periodically suffer.
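Francisco's finite-range caveat is easy to demonstrate with a standalone sketch: if the bounds must have the key's own 8-bit type, no exclusive end can admit the maximum value, because 255 + 1 wraps back to 0:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open membership with bounds of the key's own 8-bit type.
 * There is no uint8_t end_excl that makes the range include 255:
 * the needed end, 256, wraps to 0, which excludes everything. */
static bool in_range_u8(uint8_t v, uint8_t start, uint8_t end_excl)
{
    return v >= start && v < end_excl;
}
```

In practice this is solved by allowing an UNBOUNDED upper bound rather than by widening the bound's type, which is exactly what the patch's per-column UNBOUNDED support provides.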

Francisco Olarte.



Re: Declarative partitioning - another take

From
Corey Huinker
Date:
OTOH I've seen a lot of people bitten by [2014-01-01,2014-12-31] on
TIMESTAMP intervals.

No argument there.
 
Everybody remembers December has 31 days, but when we have to do
MONTHLY partitions if you use closed intervals someone always miskeys
the number of days, or forgets whether a particular year is leap or
not, and when doing it automatically I always have to code it as start
+ 1 month - 1 day. In my experience having the non-significant part of
the dates ( days in monthly case, months too in yearly cases ) both 1
and equal in start and end makes it easier to check and identify, and
less error prone.

Being able to define partitions by expressions based on the values I want would be a good thing.
 
You just do the classical ( I've had to do it ) closed end || minimum
char ( "XYZ","XYZ\0" in this case ). It is not that difficult as
strings have a global order, the next string to any one is always that
plus the \0, or whatever your minimum is.

I've thought about doing that in the past, but wasn't sure it would
actually work correctly.  If you have experience with it doing so, that
would be encouraging.  How does that *look* when you print your
partition layout, though?
 

Re: Declarative partitioning - another take

From
Francisco Olarte
Date:
Robert:

On Tue, Nov 1, 2016 at 7:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> In defense of Corey's position, that's not so easy.  First, \0 doesn't
> work; our strings can't include null bytes.  Second, the minimum legal
> character depends on the collation in use.  It's not so easy to figure
> out what the "next" string is, even though there necessarily must be
> one.

I'm aware of that, just wanted to point out that it can be done on strings.

> I think we're all in agreement that half-open intervals should not
> only be allowed, but the default.  The question is whether it's a good
> idea to also allow other possibilities.

In my experience, people continuously misuse them.  I would especially
like to have them disallowed on timestamp columns ( and other
real-like data, including numeric ).  But knowing they cannot do a few
things, and some others are easier with them, is enough for allowing
them as an explicit non-default for me.

Francisco Olarte.



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Nov 1, 2016 at 2:11 PM, Corey Huinker <corey.huinker@gmail.com> wrote:
> On Tue, Nov 1, 2016 at 2:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Yeah.  That syntax has some big advantages, though.  If we define that
>> partition as START ('2014-01-01') INCLUSIVE END ('2014-12-31')
>> INCLUSIVE, there's no way for the system to tell that there's no
>> gap between that ending bound and the starting bound of the 2015
>> partition, because the system has no domain-specific knowledge that
>> there is no daylight between 2014-12-31 and 2015-01-01.  So if we
>> allow things to be specified that way, then people will use that
>> syntax and then complain when it doesn't perform quite as well as
>> START ('2014-01-01') END ('2015-01-01').  Maybe the difference isn't
>> material and maybe we don't care; what do you think?
>
> It was a fight I didn't expect to win, and even if we don't get
> [x,x]-expressible partitions, at least we're not in the Oracle
> context-waterfall, where the lower bound of your partition is determined by
> the upper bound of the NEXT partition.

I certainly agree that was a horrible design decision on Oracle's
part.  It's really messy.  If you drop a partition, it changes the
partition constraint for the adjacent partition.  Then you want to add
the partition back, say, but you have to first check whether the
adjacent partition, whose legal range has been expanded, has any
values that are out of bounds.  Which it can't, but you don't know
that, so you have to scan the whole partition.  Aargh!  Maybe this
somehow works out in their system - I'm not an Oracle expert - but I
think it's absolutely vital that we don't replicate it into
PostgreSQL.  (I have some, ahem, first-hand knowledge of how that
works out.)

>> (I really don't want to get tied up adding a system for adding and
>> subtracting one to and from arbitrary data types.  Life is too short.
>> If that requires that users cope with a bit of cognitive dissonance,
>> well, it's not the first time something like that will have happened.
>> I have some cognitive dissonance about the fact that creat(2) has no
>> trailing "e" but truncate(2) does, and moreover the latter can be used
>> to make a file longer rather than shorter.  But, hey, that's what you
>> get for choosing a career in computer science.)
>
> That noise you heard was the sound of my dream dying.

Well, I'm not sure we've exactly reached consensus here, and you're
making me feel like I just kicked a puppy.  However, my goal here is
not to kill your dream but to converge on a committable patch as
quickly as possible.  Adding increment/decrement operators to every
discrete(-ish) type we have is not the way to accomplish that.  To the
contrary, that's just about guaranteed to make this patch take an
extra release cycle to finish.  Now, that does not necessarily mean we
can't keep the INCLUSIVE/EXCLUSIVE stuff -- after all, Amit has
already written it -- but then we have to live with the fact that
+1/-1 based optimizations and matching are not going to be there.
Whether it's still worth having fully-open and fully-closed ranges on
general principle -- and whether the lack of such optimizations
changes the calculus -- is what we are now debating.

More votes welcome.  Right now I count 2 votes for keeping the
inclusive-exclusive stuff (Amit, Corey) and two for nuking it from
orbit (Robert, Francisco).  I'm not going to put my foot down on this
point against a contrary consensus, so please chime in.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Corey Huinker
Date:
On Tue, Nov 1, 2016 at 2:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, I'm not sure we've exactly reached consensus here, and you're
> making me feel like I just kicked a puppy.

It was hyperbole. I hope you found it as funny to read as I did to write.
This is a great feature and I'm not going to make "perfect" the enemy of
"good".

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/02 2:53, Robert Haas wrote:
> On Fri, Oct 28, 2016 at 3:53 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>  [ new patches ]
> 
> Reviewing 0005:
> 
> Your proposed commit message says this:
> 
>> If relation is the target table (UPDATE and DELETE), flattening is
>> done regardless (scared to modify inheritance_planner() yet).

I should have explained my thinking behind why I ended up not handling the
result relation case:

While set_append_rel_size() can be safely invoked recursively on the roots
of partition sub-trees, I was not quite sure if inheritance_planner() is
amenable to such recursive invocation.  While the former is relatively
straightforward baserel processing, the latter is not-so-straightforward
transformation of the whole query for every target child relation of the
target parent.

> In the immortal words of Frank Herbert: “I must not fear. Fear is the
> mind-killer. Fear is the little-death that brings total obliteration.
> I will face my fear. I will permit it to pass over me and through me.
> And when it has gone past I will turn the inner eye to see its path.
> Where the fear has gone there will be nothing. Only I will remain.”
> 
> In other words, I'm not going to accept fear as a justification for
> randomly excluding the target-table case from this code.  If there's
> an actual reason not to do this in that case or some other case, then
> let's document that reason.  But weird warts in the code that are
> justified only by nameless anxiety are not good.

Perhaps the above comment expands a little on the actual reason.  Though I
guess it still amounts to giving up on doing a full analysis of whether
recursive processing within inheritance_planner() is feasible.

I think we could just skip this patch as long as a full investigation into
inheritance_planner() issues is not done.  It's possible that there might
be other places in the planner code which are not amenable to the proposed
multi-level AppendRelInfos.  Ashutosh had reported one [1], wherein
lateral joins wouldn't work with a multi-level partitioned table owing to
the multi-level AppendRelInfos (the patch contains a hunk to address that).

> Of course, the prior question is whether we should EVER be doing this.
> I realize that something like this MAY be needed for partition-wise
> join, but the mission of this patch is not to implement partition-wise
> join.  Does anything in this patch series really require this?  If so,
> what?  If not, how about we leave it out and refactor it when that
> change is really needed for something?

Nothing *requires* it as such.  One benefit I see is that exclusion of the
root of some partition sub-tree means the whole sub-tree is excluded in
one go.  Currently, because of the flattening, every relation in that
sub-tree would be excluded separately, needlessly repeating the expensive
constraint exclusion proof. But I guess that's just an optimization.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpReZF34MDbY95xoATi0xVj2mAry4-LHBWVBayOc8gj%3Diqg%40mail.gmail.com





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/02 2:34, Robert Haas wrote:
> On Fri, Oct 28, 2016 at 3:53 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> [ new patches ]
> 
> Reviewing 0006:

Thanks for the review!

> This patch seems scary.  I sort of assumed from the title -- "Teach a
> few places to use partition check quals." -- that this was an optional
> thing, some kind of optimization from which we could reap further
> advantage once the basic infrastructure was in place.  But it's not
> that at all.  It's absolutely necessary that we do this, or data
> integrity is fundamentally compromised.  How do we know that we've
> found all of the places that need to be taught about these new,
> uncatalogued constraints?

Making this a separate commit from 0003 was essentially to keep it from
getting lost among that patch's other changes.  The intent was to bring
under closer scrutiny the question of whether all the sites in the backend
code that are critical for data integrity in the face of the implicit
partition constraints are being informed about those constraints.

> I'm feeling fairly strongly like you should rewind and make the
> partitioning constraints normal catalogued constraints.  That's got a
> number of advantages, most notably that we can be sure they will be
> properly enforced by the entire system (modulo existing bugs, of
> course).  Also, they'll show up automatically in tools like psql's \d
> output, pgAdmin, and anything else that is accustomed to being able to
> find constraints in the catalog.  We do need to make sure that those
> constraints can't be dropped (or altered?) inappropriately, but that's
> a relatively small problem.  If we stick with the design you've got
> here, every client tool in the world needs to be updated, and I'm not
> seeing nearly enough advantage in this system to justify that kind of
> upheaval.

As for which parts of the system need to know about these implicit
partition constraints to *enforce* them for data integrity, we could say
that it's really just one site - ExecConstraints() called from
ExecInsert()/ExecUpdate().

Admittedly, the current error message style as in this patch exposes the
implicit constraint approach to a certain criticism: "ERROR:  new row
violates the partition boundary specification of \"%s\"".  It would say
the following if it were a named constraint: "ERROR: new row for relation
\"%s\" violates check constraint \"%s\""

For constraint exclusion optimization, we teach get_relation_constraints()
to look at these constraints.  Although greatly useful, it's not the case
of being absolutely critical.

Besides the above two cases, there is a bunch of code (relcache, DDL) that
looks at regular constraints, but IMHO, we need not let any of that code
concern itself with the implicit partition constraints.  In particular, why
would client tools want the implicit partitioning constraint to be shown
as a CHECK constraint?  As the proposed patch 0004 (psql) currently does,
isn't it better to instead show the partition bounds as follows?

+CREATE TABLE part_b PARTITION OF parted (
+    b WITH OPTIONS NOT NULL DEFAULT 1 CHECK (b >= 0),
+    CONSTRAINT check_a CHECK (length(a) > 0)
+) FOR VALUES IN ('b');

+\d part_b
+         Table "public.part_b"
+ Column |  Type   |     Modifiers
+--------+---------+--------------------
+ a      | text    |
+ b      | integer | not null default 1
+Partition of: parted FOR VALUES IN ('b')
+Check constraints:
+    "check_a" CHECK (length(a) > 0)
+    "part_b_b_check" CHECK (b >= 0)

Needless to say, that could save a lot of trouble thinking about
generating collision-free names of these constraints, their dependency
handling, unintended altering of these constraints, pg_dump, etc.

> In fact, as far as I can see, the only advantage of this approach is
> that when the insert arrives through the parent and is routed to the
> child by whatever tuple-routing code we end up with (I guess that's
> what 0008 does), we get to skip checking the constraint, saving CPU
> cycles.  That's probably an important optimization, but I don't think
> that putting the partitioning constraint in the catalog in any way
> rules out the possibility of performing that optimization.  It's just
> that instead of having the partitioning excluded-by-default and then
> sometimes choosing to include it, you'll have it included-by-default
> and then sometimes choose to exclude it.

Hmm, doesn't it seem like we would be making *more* modifications to the
existing code (backend or otherwise) to teach it about excluding the
implicit partition constraints than the other way around?  The other way
around being to modify the existing code to include the implicit
constraints which is what this patch is about.

Having said all that, I am open to switching to the catalogued partition
constraints if the arguments I make above in favor of this patch are not
all that sound.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/02 16:41, Amit Langote wrote:
> Having said all that, I am open to switching to the catalogued partition
> constraints if the arguments I make above in favor of this patch are not
> all that sound.

One problem I didn't immediately see a solution for if we go with the
catalogued partition constraints is how do we retrieve a relation's
partition constraint?  There are a couple of cases where that becomes
necessary.  Consider that we are adding a partition to a partitioned table
that is itself a partition.  In this case, the new partition's constraint
consists of whatever we generate from its FOR VALUES specification *ANDed*
with the parent's constraint.  We must somehow get hold of the latter.
Which of the parent's named check constraints in the pg_constraint catalog
is supposed to be the partition constraint?  With the uncatalogued
partition constraints approach, we simply generate it from the parent's
pg_class.relpartbound (or we might get the relcache copy of the same
stored in rd_partcheck).  Hmm.
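
For what it's worth, with the uncatalogued approach the bound stays
retrievable by deparsing pg_class.relpartbound (a sketch; the deparsing
call is the same one the proposed psql patch uses for \d):

```sql
-- Deparse a partition's FOR VALUES specification from pg_class:
SELECT pg_get_expr(relpartbound, oid) AS partition_bound
FROM pg_class
WHERE relname = 'part_b';
-- With catalogued constraints, by contrast, no pg_constraint column
-- marks which row is *the* partition constraint.
```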

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/02 2:44, Robert Haas wrote:
> On Fri, Oct 28, 2016 at 3:53 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Insisting that you can't drop a child without detaching it first seems
>>> wrong to me.  If I already made this comment and you responded to it,
>>> please point me back to whatever you said.  However, my feeling is
>>> this is flat wrong and absolutely must be changed.
>>
>> I said the following [1]:
>>
>> | Hmm, I don't think I like this.  Why should it be necessary to detach
>> | a partition before dropping it?  That seems like an unnecessary step.
>>
>> I thought we had better lock the parent table when removing one of its
>> partitions and it seemed a bit odd to lock the parent table when dropping
>> a partition using DROP TABLE?  OTOH, with ALTER TABLE parent DETACH
>> PARTITION, the parent table is locked anyway.
> 
> That "OTOH" part seems like a pretty relevant point.
> 
> Basically, I think people expect to be able to say "DROP THINGTYPE
> thingname" or at most "DROP THINGTYPE thingname CASCADE" and have that
> thing go away.  I'm opposed to anything which requires some other
> series of steps without a very good reason, and possible surprise
> about the precise locks that the command requires isn't a good enough
> reason from my point of view.

OK, I changed things so that DROP TABLE a_partition no longer complains
about requiring to detach first.  Much like how index_drop() locks the
parent table ('parent' in a different sense, of course) and later
invalidates its relcache, heap_drop_with_catalog(), if the passed in relid
is a partition, locks the parent table using AccessExclusiveLock, performs
its usual business, and finally invalidates the parent's relcache before
closing it without relinquishing the lock.  Does that sound sane?  One
downside is that if the specified command is DROP TABLE parent CASCADE,
the above described invalidation is a waste of cycles because the parent
will be dropped anyway after all the partitions are dropped.

Thanks,
Amit





Re: Declarative partitioning - another take

From
alvherre@alvh.no-ip.org
Date:
On 2016-10-28 07:53, Amit Langote wrote:

> @@ -6267,6 +6416,12 @@ ATAddForeignKeyConstraint(AlteredTableInfo *tab, 
> Relation rel,
>       * Validity checks (permission checks wait till we have the column
>       * numbers)
>       */
> +    if (pkrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +                 errmsg("cannot reference relation \"%s\"", 
> RelationGetRelationName(pkrel)),
> +                 errdetail("Referencing partitioned tables in foreign key 
> constraints is not supported.")));

Is there a plan for fixing this particular limitation?  It's a pretty
serious problem for users, and the suggested workaround (to create a
separate non-partitioned table which carries only the PK columns, which
is updated by triggers, and direct the FKs to it instead of to the
partitioned table) is not only a very ugly one, but also very slow.



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Nov 2, 2016 at 6:47 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> OK, I changed things so that DROP TABLE a_partition no longer complains
> about requiring to detach first.  Much like how index_drop() locks the
> parent table ('parent' in a different sense, of course) and later
> invalidates its relcache, heap_drop_with_catalog(), if the passed in relid
> is a partition, locks the parent table using AccessExclusiveLock, performs
> its usual business, and finally invalidates the parent's relcache before
> closing it without relinquishing the lock.  Does that sound sane?

Yes.

> One
> downside is that if the specified command is DROP TABLE parent CASCADE,
> the above described invalidation is a waste of cycles because the parent
> will be dropped anyway after all the partitions are dropped.

I don't think that's even slightly worth worrying about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Nov 3, 2016 at 7:46 AM,  <alvherre@alvh.no-ip.org> wrote:
> On 2016-10-28 07:53, Amit Langote wrote:
>> @@ -6267,6 +6416,12 @@ ATAddForeignKeyConstraint(AlteredTableInfo *tab,
>> Relation rel,
>>          * Validity checks (permission checks wait till we have the column
>>          * numbers)
>>          */
>> +       if (pkrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>> +               ereport(ERROR,
>> +                               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
>> +                                errmsg("cannot reference relation
>> \"%s\"", RelationGetRelationName(pkrel)),
>> +                                errdetail("Referencing partitioned tables
>> in foreign key constraints is not supported.")));
>
> Is there a plan for fixing this particular limitation?  It's a pretty
> serious problem for users,
> and the suggested workaround (to create a separate non-partitioned table
> which carries only the PK
> columns which is updated by triggers, and direct the FKs to it instead of to
> the partitioned table)
> is not only a very ugly one, but also very slow.

If you have two compatibly partitioned tables, and the foreign key
matches the partitioning keys, you could implement a foreign key
between the two tables as a foreign key between each pair of matching
partitions.  Otherwise, isn't the only way to handle this a global
index?
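
A sketch of that first idea (table and column names hypothetical): with
both tables list-partitioned on region and the FK including the partition
key, ordinary FKs between matching leaf partitions add up to the
table-level FK.

```sql
-- Each order_items partition can only hold rows for its own region,
-- so referencing the matching orders partition is sufficient:
ALTER TABLE order_items_us ADD FOREIGN KEY (region, order_id)
    REFERENCES orders_us (region, order_id);
ALTER TABLE order_items_eu ADD FOREIGN KEY (region, order_id)
    REFERENCES orders_eu (region, order_id);
```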

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Nov 1, 2016 at 2:36 PM, Corey Huinker <corey.huinker@gmail.com> wrote:
> On Tue, Nov 1, 2016 at 2:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, I'm not sure we've exactly reached consensus here, and you're
>> making me feel like I just kicked a puppy.
>
> It was hyperbole. I hope you found it as funny to read as I did to write.
> This is a great feature and I'm not going to make "perfect" the enemy of
> "good".

Oh, OK.  Sorry, the tone did not transfer to my brain in the way you
intended it - I thought you were actually upset.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Nov 2, 2016 at 3:41 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> As for which parts of the system need to know about these implicit
> partition constraints to *enforce* them for data integrity, we could say
> that it's really just one site - ExecConstraints() called from
> ExecInsert()/ExecUpdate().

Well, that's a slightly optimistic view of the situation, because the
issue is whether ExecConstraints() is going to get called in the first
place.  But now that I look at it, there are only a handful of
callers, so maybe it's not so bad.

> Admittedly, the current error message style as in this patch exposes the
> implicit constraint approach to a certain criticism: "ERROR:  new row
> violates the partition boundary specification of \"%s\"".  It would say
> the following if it were a named constraint: "ERROR: new row for relation
> \"%s\" violates check constraint \"%s\""
>
> For constraint exclusion optimization, we teach get_relation_constraints()
> to look at these constraints.  Although greatly useful, it's not the case
> of being absolutely critical.
>
> Beside the above two cases, there is bunch of code (relcache, DDL) that
> looks at regular constraints, but IMHO, we need not let any of that code
> concern itself with the implicit partition constraints.  Especially, I
> wonder why the client tools should want the implicit partitioning
> constraint to be shown as a CHECK constraint?  As the proposed patch 0004
> (psql) currently does, isn't it better to instead show the partition
> bounds as follows?
>
> +CREATE TABLE part_b PARTITION OF parted (
> +       b WITH OPTIONS NOT NULL DEFAULT 1 CHECK (b >= 0),
> +       CONSTRAINT check_a CHECK (length(a) > 0)
> +) FOR VALUES IN ('b');

Good point.

> +\d part_b
> +         Table "public.part_b"
> + Column |  Type   |     Modifiers
> +--------+---------+--------------------
> + a      | text    |
> + b      | integer | not null default 1
> +Partition of: parted FOR VALUES IN ('b')
> +Check constraints:
> +    "check_a" CHECK (length(a) > 0)
> +    "part_b_b_check" CHECK (b >= 0)
>
> Needless to say, that could save a lot of trouble thinking about
> generating collision-free names of these constraints, their dependency
> handling, unintended altering of these constraints, pg_dump, etc.

Well, that's certainly true, but those problems don't strike me as so
serious that they couldn't be solved with a reasonable amount of
labor.

>> In fact, as far as I can see, the only advantage of this approach is
>> that when the insert arrives through the parent and is routed to the
>> child by whatever tuple-routing code we end up with (I guess that's
>> what 0008 does), we get to skip checking the constraint, saving CPU
>> cycles.  That's probably an important optimization, but I don't think
>> that putting the partitioning constraint in the catalog in any way
>> rules out the possibility of performing that optimization.  It's just
>> that instead of having the partitioning excluded-by-default and then
>> sometimes choosing to include it, you'll have it included-by-default
>> and then sometimes choose to exclude it.
>
> Hmm, doesn't it seem like we would be making *more* modifications to the
> existing code (backend or otherwise) to teach it about excluding the
> implicit partition constraints than the other way around?  The other way
> around being to modify the existing code to include the implicit
> constraints which is what this patch is about.

I'm not sure which way is actually going to be more code modification,
but I guess you've persuaded me that the way you have it is
reasonable, so let's stick with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
Apologies if I've made some of these comments before and/or missed
comments you've made on these topics.  The size of this patch set is
so large that it's hard to keep track of everything.

Re-reviewing 0001:

+       indicate which table columns are used as partition key.  For example,

s/are used as/are part of the/

+       third table columns make up the partition key.  A zero in this array
+       indicates that the corresponding partition key column is an expression
+       over the table columns, rather than a simple column reference.

I think you can leave out "over the table columns".

+      columns or expressions forms the <firstterm>partitioning key</firstterm>

s/forms/form/

+      The table itself is empty.  A data row inserted into the table is routed

s/The table/The partitioned table/

+     * Anything mentioned in the expressions.  We must ignore the column
+     * references which will count as self-dependency items; in this case,
+     * the depender is the table itself (there is no such thing as partition
+     * key object).

"depender" is not really a word, and the parenthetical note isn't very
clear.  Maybe: We must ignore the column references, which will depend
on the table itself; there is no separate partition key object.

+    heap_close(pg_partitioned_table, RowExclusiveLock);

It seems like it would be safe to do this right after
CatalogUpdateIndexes(pg_partitioned_table, tuple), and I'd recommend
that you do.  Not for performance or anything, but just to keep
related code together.

 /*
  * Resolve possibly-defaulted operator class specification
  */
-static Oid
+Oid
 GetIndexOpClass(List *opclass, Oid attrType,

Perhaps we should rename this function to ResolveOpClass, since it's
now going to be used for both indexes and partitioning keys.

+ * Sets *is_expr if attnum is found to be referenced in some partition key
+ * expression.

is_expr doesn't seem quite as clear as, say, used_by_expr or used_in_expr.

Also, the specification for this function doesn't seem to be very
clear about what this is supposed to do if the same column is both an
explicit partitioning column and also used in an expression, and the
code looks like it'll return with *is_expr set based on whichever use
it encounters first.  If that's intended behavior, maybe add a comment
like: It's possible for a column to be used both directly and as part
of a partition key expression; if that happens, *is_expr may end up as
either true or false.  That's OK for current uses of this function,
because *is_expr is only used to tailor the error message text.

+            if (is_expr)
+                *is_expr = false;
+            if (attnum == partattno)
+                return true;

I think you should adjust this (and the other bit in the same
function) so that you don't set *is_expr until you're committed to
returning.

+            index = -1;
+            while ((index = bms_next_member(expr_attrs, index)) > 0)
+            {
+                AttrNumber attno = index + FirstLowInvalidHeapAttributeNumber;
+
+                if (attno == attnum)
+                    return true;
+            }

How about bms_is_member(expr_attrs, attnum -
FirstLowInvalidHeapAttributeNumber), instead of looping?

+                 errmsg("cannot reference relation \"%s\"",
RelationGetRelationName(pkrel)),
+                 errdetail("Referencing partitioned tables in foreign
key constraints is not supported.")));

I think you could get rid of the errdetail and just have the error
message be "cannot reference partitioned table \"%s\"".

+                         errmsg("column \"%s\" appears twice in
partition key", pelem->name),

It could be there three times!  How about column \"%s\" appears more
than once in partition key?  (I see that you seem to have adapted this
from some code in parse_utilcmd.c, which perhaps should also be
adjusted, but that's a job for another patch.)

+            /*
+             * Strip any top-level COLLATE clause.  This ensures that we treat
+             * "x COLLATE y" and "(x COLLATE y)" alike.
+             */

But you don't, right?  Unless I am confused, which is possible, the
latter COLLATE will be ignored, while the former one will set the
collation to be used in the context of partitioning comparisons.

Re-reviewing 0002:

+       if (fout->remoteVersion >= 100000)
+       {
+               PQExpBuffer acl_subquery = createPQExpBuffer();
+               PQExpBuffer racl_subquery = createPQExpBuffer();
+               PQExpBuffer initacl_subquery = createPQExpBuffer();
+               PQExpBuffer initracl_subquery = createPQExpBuffer();
+
+               PQExpBuffer attacl_subquery = createPQExpBuffer();
+               PQExpBuffer attracl_subquery = createPQExpBuffer();
+               PQExpBuffer attinitacl_subquery = createPQExpBuffer();
+               PQExpBuffer attinitracl_subquery = createPQExpBuffer();

It seems unnecessary to repeat all of this.  The only differences
between the 10000 version and the 9600 version are:

60,61c60
<                           "AS changed_acl, "
<                           "CASE WHEN c.relkind = 'P' THEN
pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef "
---
>                           "AS changed_acl "
73c72
<                    "WHERE c.relkind in ('%c', '%c', '%c', '%c',
'%c', '%c', '%c') "
---
>                    "WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c') "
87,88c86
<                           RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE,
<                           RELKIND_PARTITIONED_TABLE);
---
>                           RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE);

But none of that is really a problem.  Sure, the 'P' case will never
arise in 9.6, but so what?  I'd really like to not keep duplicating
these increasingly-complex hunks of code if we can find some way to
avoid that.
       /* find all the inheritance information */

-       appendPQExpBufferStr(query, "SELECT inhrelid, inhparent FROM
pg_inherits");
+       appendPQExpBufferStr(query,
+                                                "SELECT inhrelid, inhparent "
+                                                "FROM pg_inherits "
+                                                "WHERE inhparent NOT
IN (SELECT oid FROM pg_class WHERE relkind = 'P')");

I think you need to update the comment.  "Find inheritance
information, excluding implicit inheritance via partitioning.  We're
not interested in that case because $REASON."

Re-reviewing 0003:

+     <para>
+      If this table is a partition, one cannot perform <literal>DROP
NOT NULL</>
+      on a column if it is marked not null in the parent table.
+     </para>

This would, presumably, also be true for inheritance.  I think we
could just leave this out.

+      as partition of the target table.  The partition bound specification must

s/as partition/as a partition/

+      correspond to the partitioning method and partitioning key of the target

I think that in most places we are referring to the "partitioning
method" (with ing) but the "partition key" (without ing). Let's try to
be consistent.

+      table.  The table to be attached must have all the columns as the target
+      table and no more; moreover, the column types must also match.  Also, it
+      must have all the matching constraints as the target table.

s/all the columns/all of the same columns/

The second sentence doesn't seem quite grammatical.  And why would
that be true anyway?  Partitions can have extra constraints, and if
they lack constraints that are present on the partitioned table, those
constraints will be added and validated, right?

+      A full table scan is performed on the table being attached to check that
+      existing rows in the table are within the specified partition bound.
+      If it is known in advance that no partition bound violating rows are
+      present in the table, the above scan can be skipped by specifying the
+      <literal>NO VALIDATE</> option.
+      Default behavior is to perform the scan, as if the
<literal>VALIDATE</literal>
+      option were specified.

I don't think it's OK to allow the partition to be added if it
contains rows that might not be valid.  We are generally very wary
about adding options that can cause data integrity failures and I
think we shouldn't start here, either.  On the other hand, it's also
not desirable for adding a partition to take O(n) time in all cases.
So what would be nice is if the new partition could self-certify that
it contains no problematic rows by having a constraint that matches the
new partitioning constraint.  Then we can skip the scan if that
constraint is already present.
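
A sketch of that self-certification (syntax as proposed in this thread,
table and constraint names hypothetical): the table being attached
already carries an equivalent CHECK constraint, from which the server
could prove the partition constraint and skip the scan.

```sql
CREATE TABLE measurements_2016 (LIKE measurements);
ALTER TABLE measurements_2016 ADD CONSTRAINT ck_2016
    CHECK (logdate >= '2016-01-01' AND logdate < '2017-01-01');

-- ck_2016 implies the partition constraint for this bound, so the
-- O(n) validation scan could be skipped:
ALTER TABLE measurements ATTACH PARTITION measurements_2016
    FOR VALUES START ('2016-01-01') END ('2017-01-01');
```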

+      inherited columns.  One can also specify table constraints, in addition

Delete comma.

+      to those inherited from the parent.  If a check constraint with the name
+      matching one of the parent's constraint is specified, it is merged with
+      the latter, provided the specified condition is same.

Doesn't that effectively delete the merged constraint?

Suggest: "If the parent already has a check constraint with the same
name as a constraint specified for the child, the conditions must be
the same."

+) FOR VALUES IN ('los angeles', 'san fransisco');

That's not how you spell San Francisco.

+   Create partition of a list partitioned table that itself is further
+   partitioned and then create its partition:

s/itself is/is itself/
s/then create its partition/then add a partition to it/

+                               if (!is_local || con->coninhcount == 0)
+                                       con->coninhcount++;

I would think that you could get rid of the "if" and just say
con->coninhcount = 1.  It seems to me (and perhaps the comment could
say this) that for a partitioned table, we can simplify the handling
of coninhcount and conislocal.  Since partitioning hierarchies don't
allow multiple inheritance, any inherited constraint must be inherited
exactly once.  Since a partition cannot have inheritance children --
except by being partitioned itself -- there is no point in tracking
conislocal, so we always store it as false.

+
+void
+StorePartitionBound(Relation rel, Node *bound)

Header comment, please!

+       (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound,
+                                                  &isnull);
+       Assert(isnull);

We try not to do unnecessary work in non-assert-enabled builds solely
for the benefit of assert-enabled builds.  We also try not to end up
with variables that are only used in assert-enabled builds but not
marked PG_USED_FOR_ASSERTS_ONLY, because those tend to cause compiler
warnings.  I'm not sure a compiler would be smart enough to warn
about this, but I suggest adding an #ifdef USE_ASSERT_CHECKING with a
block inside where the offending variable is declared.  Actually, I
think you need to move a bit more code.  Hmm.  Something like:

#ifdef USE_ASSERT_CHECKING
    {
        Form_pg_class classForm;
        bool        isnull;

        classForm = (Form_pg_class) GETSTRUCT(tuple);
        Assert(!classForm->relispartition);
        (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound,
                               &isnull);
        Assert(isnull);
    }
#endif

+ * are same in common cases, of which we only store one.

"and we only store one of them."

+                                                       /*
+                                                        * Never put a null into the values array, flag
+                                                        * instead for the code further down below where
+                                                        * we construct the actual relcache struct.
+                                                        */
+                                                       found_null_partition = true;
+                                                       null_partition_index = i;

How about we elog(ERROR, ...) if found_null_partition is already set?

+                               foreach(cell, non_null_values)
+                               {
+                                       PartitionListValue *src = lfirst(cell);
+
+                                       all_values[i] = (PartitionListValue *)
+                                               palloc(sizeof(PartitionListValue));
+                                       all_values[i]->value = datumCopy(src->value,
+                                                                        key->parttypbyval[0],
+                                                                        key->parttyplen[0]);
+                                       all_values[i]->index = src->index;
+                                       i++;
+                               }

Why do we need to datumCopy() here when we just did that in the
previous loop?  Why did the previous loop need to datumCopy(), anyway?
Can't we just pass the same datums around?  I understand that you
need to copy the data into the correct context at the end, but doing
two copies prior to that seems excessive.

+get_leaf_partition_oids(Oid relid, int lockmode)

Instead of implementing this, can you just use find_all_inheritors()?

+               /*
+                * Is lower_val = upper_val?
+                */
+               if (lower_val && upper_val)

So, that comment does not actually match that code.  That if-test is
just checking whether both bounds are finite.  What I think you should
be explaining here is that if lower_val and upper_val are both
non-infinite, and if they happen to be equal, we want to emit an
equality test rather than a >= test plus a <= test because $REASON.
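
For instance (a hypothetical single-column example; the exact inclusive/exclusive spelling depends on the lb_inc/ub_inc grammar in the patch):

```sql
-- A single-point range partition whose lower and upper bounds are equal:
CREATE TABLE p10 PARTITION OF p
    FOR VALUES START (10) END (10);
-- Here the implied partition constraint could be emitted as "a = 10"
-- rather than "a >= 10 AND a <= 10", which is both cheaper to evaluate
-- and easier for constraint exclusion to reason about.
```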

+       operoid = get_opfamily_member(key->partopfamily[col],
+                                     key->parttypid[col],
+                                     key->parttypid[col],
+                                     strategy);
+       if (!OidIsValid(operoid))
+       {
+               operoid = get_opfamily_member(key->partopfamily[col],
+                                             key->partopcintype[col],
+                                             key->partopcintype[col],
+                                             strategy);
+               *need_relabel = true;
+       }

There's a comment explaining THAT you do this ("Use either the column
type as the operator datatype or opclass's declared input type.") but,
to repeat a complaint that I've made often, nothing explaining why.
In this particular case, what's not clear - in my opinion - is why you
need to try two different possibilities, and why at least one of those
possibilities is guaranteed to work.  I gather that if the opfamily
doesn't contain an operator for the actual type of the partitioning
column, you think it will certainly contain one for the input type of
the operator class (which seems right), and that the input type of the
operator class will be binary-compatible with the type of the
partitioning column (which is less-obviously true, and needs some
explanation).

I also think that this function should elog(ERROR, ...) if by any
chance the second get_opfamily_member() call also fails. Otherwise the
error might occur quite a bit downstream and be hard to understand.

+       ReleaseSysCache(tuple);
+       heap_close(parent, AccessShareLock);

I think you had better not release the lock here - i.e. pass NoLock.
We don't release locks on tables until transaction commit, except for
catalog tables.  This also comes up in, at least,
transformPartitionOf().

+    PartitionRangeBound *b1 = (*(PartitionRangeBound *const *) a);
+    PartitionRangeBound *b2 = (*(PartitionRangeBound *const *) b);
+       PartitionKey key = (PartitionKey) arg;
+
+       return partition_rbound_cmp(key, b1, b2);

Whitespace.

+        * as partition and schema consists of columns definitions corresponding

the schema consists

-               if (recurse)
+               /* Force inheritance recursion, if partitioned table. */
+               if (recurse || rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)

I would instead error out if ONLY is specified for a partitioned table.

+               /*
+                * If the table is source table of ATTACH PARTITION command,
+                * following check is unnecessary.
+                */

As usual, comment should say WHY.

+                       if (partqualstate && !ExecQual(partqualstate, econtext, true))
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_CHECK_VIOLATION),
+                                                errmsg("child table contains a row violating partition bound specification")));

Why not mimic the existing phrasing?  "partition constraint is
violated by some row"

What happens if you try to attach a table as a partition of itself or
one of its ancestors?
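
For reference, this is the kind of thing being asked about (hypothetical commands; presumably both should be rejected with an error rather than creating a cycle):

```sql
CREATE TABLE t (a int) PARTITION BY LIST (a);
-- attaching a table as a partition of itself:
ALTER TABLE t ATTACH PARTITION t FOR VALUES IN (1);
-- attaching an ancestor as a partition of its own descendant:
CREATE TABLE t1 PARTITION OF t FOR VALUES IN (2) PARTITION BY LIST (a);
ALTER TABLE t1 ATTACH PARTITION t FOR VALUES IN (3);
```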

-                               errmsg("column \"%s\" in child table must be marked NOT NULL",
-                                          attributeName)));
+                                                errmsg("column \"%s\" in child table must be marked NOT NULL",
+                                                               attributeName)));

Whitespace-only hunk; revert.  You cannot fight the power of pgindent.

+                                errmsg("cannot attach table that is a inheritance child as partition")));

an inheritance child

+               errmsg("cannot attach a temporary relation of another session as partition ")));

Extra space.

+                                        errdetail("Table being
attached should contain only the columns"
+                                                          " present
in parent.")));

Suggest: "New partition should contain only..."

Also, don't break error messages into multiple strings.  Make it one
long string and let pgindent deal.

+                       | FOR VALUES START '(' range_datum_list ')' lb_inc
+                                    END_P '(' range_datum_list ')' ub_inc

Just a random idea.  Would FOR VALUES FROM ( ... ) TO ( ... ) be more
idiomatic than START and END?

+static void
+transformPartitionOf(CreateStmtContext *cxt, Node *bound)
+{
+       TupleDesc       tupdesc;
+       int                     i;
+       RangeVar   *part = cxt->relation;
+       RangeVar   *partof = linitial(cxt->inhRelations);
+       Relation        parentRel;
+
+       parentRel = heap_openrv(partof, AccessShareLock);
+

If there's any way that we might do heap_openrv() on the same RangeVar
at multiple places in the code, it presents security and integrity
hazards because the referent of that RangeVar might change in the
meantime.  I suspect there is such a risk here - won't we need to open
the relation again when we actually want to do the operation?

Also, we should never acquire a lower-level lock on a relation and
then, further downstream, acquire a higher-level lock on the same
object.  To do so creates a deadlock hazard.  That seems like it might
be a problem here, too.

+                               /*if (ldatum->infinite && rdatum->infinite)
+                                       ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+                                                errmsg("both START and END datum for column \"%s\" cannot be UNBOUNDED",
+                                                               colname),
+                                                parser_errposition(cxt->pstate, rdatum->location)));*/

Commented-out code is bad.

+                       if (list_length(spec->lowerdatums) > partnatts)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+                                                errmsg("START has more values than the length of the partition key"),
+                                                parser_errposition(cxt->pstate,
+                                                               exprLocation(list_nth(spec->lowerdatums,
+                                                                       list_length(spec->lowerdatums) - 1)))));
+                       else if (list_length(spec->lowerdatums) < partnatts)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+                                                errmsg("START has fewer values than the length of the partition key"),
+                                                parser_errposition(cxt->pstate,
+                                                               exprLocation(list_nth(spec->lowerdatums,
+                                                                       list_length(spec->lowerdatums) - 1)))));

It would be worth looking for a way to unify these cases.  Like "START
must specify exactly one value per partitioning column".

+                * Same oids? No mind order - in the list case, it
matches the order
+                * in which partition oids are returned by a
pg_inherits scan, whereas
+                * in the range case, they are in order of ranges of individual
+                * partitions.  XXX - is the former unsafe?

Probably.  It likely depends on the physical ordering of tuples in the
table, which can change.

+ * BoundCollection encapsulates a set of partition bounds of either physical
+ * or logical relations.  It is associated with a partitioned relation of
+ * which the aforementioned relations are partitions.

"physical or logical relations" is unfamiliar terminology.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Jaime Casanova
Date:
On 28 October 2016 at 02:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Please find attached the latest version of the patches

Hi,

I started to review the functionality of this patch, so i applied all
9 patches. After that i found this warning, which i guess is because
it needs a cast.

After that, i tried a case that i think is important: to partition an
already existing table. Because there is no ALTER TABLE SET PARTITION
or something similar (which i think makes sense because such a command
would need to create the partitions and move the rows to follow the
rule that there are no rows in the parent table).

So, what i tried was:

1) rename original table
2) create a new partitioned table with old name
3) attach old table as a partition with bounds outside normal bounds
and no validate
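
In SQL, the three steps above would look something like this (the column definitions are a guess; the bound values match the ATTACH command quoted later in the thread):

```sql
BEGIN;
-- 1) rename original table
ALTER TABLE prueba RENAME TO prueba_old;
-- 2) create a new partitioned table with the old name
CREATE TABLE prueba (anio int, mes int, valor text)
    PARTITION BY RANGE (anio, mes);
-- 3) attach the old table with out-of-range bounds, skipping validation
ALTER TABLE prueba ATTACH PARTITION prueba_old
    FOR VALUES START (UNBOUNDED, UNBOUNDED) END (2008, 1) NO VALIDATE;
COMMIT;
```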

the idea is to start attaching valid partitions and delete and insert
rows from the invalid one (is there a better way of doing that?), that
will allow partitioning a table easily.
So far so good, until i decided points 1 to 3 should happen inside a
transaction to make things transparent to the user.

Attached is a script that shows the failure when trying it:

script 1 (failing_test_1.sql) fails the assert
"Assert(RelationGetPartitionKey(parentRel) != NULL);" in
transformAttachPartition() at src/backend/parser/parse_utilcmd.c:3164

After that i tried the same but with an already partitioned (via
inheritance) table and got this (i did this first without a
transaction, doing it with a transaction will show the same failure as
before):

script 2 (failing_test_2.sql) fails the assert
"Assert(childrte->relkind == RELKIND_PARTITIONED_TABLE);" in
expand_inherited_rte_internal() at
src/backend/optimizer/prep/prepunion.c:1551

PS: i don't like the START END syntax but i don't have any ideas
there... except, there was a reason not to use expressions (like a
CHECK constraint?)

--
Jaime Casanova                      www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Declarative partitioning - another take

From
Jaime Casanova
Date:
On 7 November 2016 at 12:15, Jaime Casanova
<jaime.casanova@2ndquadrant.com> wrote:
> On 28 October 2016 at 02:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> Please find attached the latest version of the patches
>
> Hi,
>
> I started to review the functionality of this patch, so i applied all
> 9 patches. After that i found this warning, which i guess is because
> it needs a cast.
>

oh! i forgot the warning
"""
partition.c: In function ‘get_qual_for_list’:
partition.c:1159:6: warning: assignment from incompatible pointer type
   or = makeBoolExpr(OR_EXPR, list_make2(nulltest2, opexpr), -1);
      ^
"""

attached a list of the warnings that my compiler give me (basically
most are just variables that could be used uninitialized)

--
Jaime Casanova                      www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi Jaime,

On 2016/11/08 2:24, Jaime Casanova wrote:
> On 7 November 2016 at 12:15, Jaime Casanova
> <jaime.casanova@2ndquadrant.com> wrote:
>> On 28 October 2016 at 02:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>
>>> Please find attached the latest version of the patches
>>
>> Hi,
>>
>> I started to review the functionality of this patch, so i applied all
>> 9 patches. After that i found this warning, which i guess is because
>> it needs a cast.
>>
> 
> oh! i forgot the warning
> """
> partition.c: In function ‘get_qual_for_list’:
> partition.c:1159:6: warning: assignment from incompatible pointer type
>    or = makeBoolExpr(OR_EXPR, list_make2(nulltest2, opexpr), -1);
>       ^
> """

This one I noticed too and have fixed.

> 
> attached a list of the warnings that my compiler give me (basically
> most are just variables that could be used uninitialized)

Thanks a lot for spotting and reporting these.  Will fix as appropriate.

Regards,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi Jaime,

On 2016/11/08 2:15, Jaime Casanova wrote:
> On 28 October 2016 at 02:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 
> I started to review the functionality of this patch, so i applied all
> 9 patches. After that i found this warning, which i guess is because
> it needs a cast.

Thanks a ton for reviewing!

> After that, i tried a case that i think is important: to partition an
> already existing table. Because there is no ALTER TABLE SET PARTITION
> or something similar (which i think makes sense because such a command
> would need to create the partitions and move the rows to follow the
> rule that there are no rows in the parent table).
> 
> So, what i tried was:
> 
> 1) rename original table
> 2) create a new partitioned table with old name
> 3) attach old table as a partition with bounds outside normal bounds
> and no validate
> 
> the idea is to start attaching valid partitions and delete and insert
> rows from the invalid one (is there a better way of doing that?), that
> will allow partitioning a table easily.

So, step 3 creates a partition that is basically unbounded.  From there,
it seems you want to create partitions with proper bounds, moving data
into them as they are created.  It seems like a job for some redistribution
command (split partition?), which is currently unsupported.

> So far so good, until i decided points 1 to 3 should happen inside a
> transaction to make things transparent to the user.
> 
> Attached is a script that shows the failure when trying it:
> 
> script 1 (failing_test_1.sql) fails the assert
> "Assert(RelationGetPartitionKey(parentRel) != NULL);" in
> transformAttachPartition() at src/backend/parser/parse_utilcmd.c:3164

Thanks for the test.  This test uncovered a bug which I have fixed in my
local repository.  Given the fix the above will work, although I see that
it's not the best way to do what you want to do.

> After that i tried the same but with an already partitioned (via
> inheritance) table and got this (i did this first without a
> transaction, doing it with a transaction will show the same failure as
> before):
> 
> script 2 (failing_test_2.sql) fails the assert
> "Assert(childrte->relkind == RELKIND_PARTITIONED_TABLE);" in
> expand_inherited_rte_internal() at
> src/backend/optimizer/prep/prepunion.c:1551

This again was an oversight/bug in the patch.  It's not supported to
combine old-style inheritance partitioning with the new partitioned
tables.  In fact, ATTACH PARTITION prevents adding a regular inheritance
parent as a partition.  After fixing the bug, you would instead get this error:

alter table prueba attach partition prueba_old for values start
(unbounded, unbounded) end (2008,1) no validate;
ERROR:  cannot attach regular inheritance parent as partition

In this case, you should instead create prueba_old partition hierarchy
using the new partitioning commands and then attach the same.

> PS: i don't like the START END syntax but i don't have any ideas
> there... except, there was a reason not to use expressions (like a
> CHECK constraint?)

Expression syntax is too unrestricted.  What's to prevent users from using
completely different check constraint expressions from partition to
partition (not that any users deliberately try to do that)?  With the new
partitioning syntax, you specify the partitioning columns once when
creating the partitioned parent and then specify, using partitioning
method specific syntax, the bounds for every partition of the parent.
That allows capturing the partition metadata in a form that is more
suitable for implementing other features related to partitioning.

That said, I think it's always a challenge to come up with a syntax that
is universally acceptable.  For example, the recent discussion about
whether to allow inclusive/exclusive to be specified for START and END
bounds of range partitions or always assume inclusive start and exclusive
end [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoaKOycHcVoed%2BF3fk-z6xUOeysQFG6HT%3Doucw76bSMHCQ%40mail.gmail.com





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/04 9:16, Robert Haas wrote:
> Apologies if I've made some of these comments before and/or missed
> comments you've made on these topics.  The size of this patch set is
> so large that it's hard to keep track of everything.

Thanks for the reviews!

> Re-reviewing 0001:
>
> +       indicate which table columns are used as partition key.  For example,
>
> s/are used as/are part of the/

Fixed.

>
> +       third table columns make up the partition key.  A zero in this array
> +       indicates that the corresponding partition key column is an expression
> +       over the table columns, rather than a simple column reference.
>
> I think you can leave out "over the table columns".

Done.

> +      columns or expressions forms the <firstterm>partitioning key</firstterm>
>
> s/forms/form/

Actually the sentence there is: The parenthesized list of columns or
expressions forms the <firstterm>partitioning key</firstterm> for the
table

>
> +      The table itself is empty.  A data row inserted into the table is routed
>
> s/The table/The partitioned table/

Done.

>
> +     * Anything mentioned in the expressions.  We must ignore the column
> +     * references which will count as self-dependency items; in this case,
> +     * the depender is the table itself (there is no such thing as partition
> +     * key object).
>
> "depender" is not really a word, and the parenthetical note isn't very
> clear.  Maybe: We must ignore the column references, which will depend
> on the table itself; there is no separate partition key object.

I see "depender" being used as a variable name, but I guess it's not
appropriate to use the same in comments.  In any case, I adopted your text.

>
> +    heap_close(pg_partitioned_table, RowExclusiveLock);
>
> It seems like it would be safe to do this right after
> CatalogUpdateIndexes(pg_partitioned_table, tuple), and I'd recommend
> that you do.  Not for performance or anything, but just to keep
> related code together.

I see, done.

>
>  /*
>   * Resolve possibly-defaulted operator class specification
>   */
> -static Oid
> +Oid
>  GetIndexOpClass(List *opclass, Oid attrType,
>
> Perhaps we should rename this function to ResolveOpClass, since it's
> now going to be used for both indexes and partitioning keys.

Aha, good idea!  Done.

> + * Sets *is_expr if attnum is found to be referenced in some partition key
> + * expression.
>
> is_expr doesn't seem quite as clear as, say, used_by_expr or used_in_expr.
>
> Also, the specification for this function doesn't seem to be very
> clear about what this is supposed to do if the same column is both an
> explicit partitioning column and also used in an expression, and the
> code looks like it'll return with *is_expr set based on whichever use
> it encounters first.  If that's intended behavior, maybe add a comment
> like: It's possible for a column to be used both directly and as part
> of a partition key expression; if that happens, *is_expr may end up as
> either true or false.  That's OK for current uses of this function,
> because *is_expr is only used to tailor the error message text.

OK, I added the note as you suggest.

>
> +            if (is_expr)
> +                *is_expr = false;
> +            if (attnum == partattno)
> +                return true;
>
> I think you should adjust this (and the other bit in the same
> function) so that you don't set *is_expr until you're committed to
> returning.

Done.

>
> +            index = -1;
> +            while ((index = bms_next_member(expr_attrs, index)) > 0)
> +            {
> +                AttrNumber attno = index + FirstLowInvalidHeapAttributeNumber;
> +
> +                if (attno == attnum)
> +                    return true;
> +            }
>
> How about bms_is_member(expr_attrs, attnum -
> FirstLowInvalidHeapAttributeNumber), instead of looping?

Done that way, thanks.

> +                 errmsg("cannot reference relation \"%s\"",
> RelationGetRelationName(pkrel)),
> +                 errdetail("Referencing partitioned tables in foreign
> key constraints is not supported.")));
>
> I think you could get rid of the errdetail and just have the error
> message be "cannot reference partitioned table \"%s\"".

OK, done.

>
> +                         errmsg("column \"%s\" appears twice in
> partition key", pelem->name),
>
> It could be there three times!  How about column \"%s\" appears more
> than once in partition key?  (I see that you seem to have adapted this
> from some code in parse_utilcmd.c, which perhaps should also be
> adjusted, but that's a job for another patch.)

Done.

>
> +            /*
> +             * Strip any top-level COLLATE clause.  This ensures that we treat
> +             * "x COLLATE y" and "(x COLLATE y)" alike.
> +             */
>
> But you don't, right?  Unless I am confused, which is possible, the
> latter COLLATE will be ignored, while the former one will set the
> collation to be used in the context of partitioning comparisons.

The code immediately following the comment does in fact strip the
top-level COLLATE clause.

            while (IsA(expr, CollateExpr))
                expr = (Node *) ((CollateExpr *) expr)->arg;

So that the following two specifications are equivalent which is the intent:

create table p (a text) partition by range (a collate "en_US");
vs.
create table p (a text) partition by range ((a collate "en_US"));

> Re-reviewing 0002:
>
> +       if (fout->remoteVersion >= 100000)
> +       {
> +               PQExpBuffer acl_subquery = createPQExpBuffer();
> +               PQExpBuffer racl_subquery = createPQExpBuffer();
> +               PQExpBuffer initacl_subquery = createPQExpBuffer();
> +               PQExpBuffer initracl_subquery = createPQExpBuffer();
> +
> +               PQExpBuffer attacl_subquery = createPQExpBuffer();
> +               PQExpBuffer attracl_subquery = createPQExpBuffer();
> +               PQExpBuffer attinitacl_subquery = createPQExpBuffer();
> +               PQExpBuffer attinitracl_subquery = createPQExpBuffer();
>
> It seems unnecessary to repeat all of this.  The only differences
> between the 10000 version and the 9600 version are:
>
> 60,61c60
> <                           "AS changed_acl, "
> <                           "CASE WHEN c.relkind = 'P' THEN
> pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef "
> ---
>>                           "AS changed_acl "
> 73c72
> <                    "WHERE c.relkind in ('%c', '%c', '%c', '%c',
> '%c', '%c', '%c') "
> ---
>>                    "WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c') "
> 87,88c86
> <                           RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE,
> <                           RELKIND_PARTITIONED_TABLE);
> ---
>>                           RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE);
>
> But none of that is really a problem.  Sure, the 'P' case will never
> arise in 9.6, but so what?  I'd really like to not keep duplicating
> these increasingly-complex hunks of code if we can find some way to
> avoid that.

We cannot reference pg_catalog.pg_get_partkeydef() in the SQL query that
getTables() sends to pre-10 servers, right?  But I suppose we need not
call it in that particular SQL query in the first place.

How about we do it in the following manner in getSchemaData():

if (g_verbose)
    write_msg(NULL, "reading partition key information for interesting
tables\n");
getTablePartitionKeyInfo(fout, tblinfo, numTables);

I have implemented the same.

>
>         /* find all the inheritance information */
>
> -       appendPQExpBufferStr(query, "SELECT inhrelid, inhparent FROM
> pg_inherits");
> +       appendPQExpBufferStr(query,
> +                                                "SELECT inhrelid, inhparent "
> +                                                "FROM pg_inherits "
> +                                                "WHERE inhparent NOT
> IN (SELECT oid FROM pg_class WHERE relkind = 'P')");
>
> I think you need to update the comment.  "Find inheritance
> information, excluding implicit inheritance via partitioning.  We're
> not interested in that case because $REASON."

I just realized that this hunk really belongs in patch 0004.  In any case,
I added the explanatory comment like you suggest.

> Re-reviewing 0003:
>
> +     <para>
> +      If this table is a partition, one cannot perform <literal>DROP
> NOT NULL</>
> +      on a column if it is marked not null in the parent table.
> +     </para>
>
> This would, presumably, also be true for inheritance.  I think we
> could just leave this out.

Actually, it isn't true for the regular inheritance situation.

create table parent (a int not null);
create table child () inherits (parent);
alter table child alter a drop not null;  -- this works (bug?)

vs.

create table p (a int, b int, c int) partition by range (a);
create table p1 partition of p for values start (1) end (10);
alter table p1 alter a drop not null;  -- this causes error


Once NOT NULL constraints start using pg_constraint [1] to keep track of
inheritance (which currently is not kept track of), the above illustrated
bug will be fixed.  If that patch gets committed, it will have taken care
of the partitioning case as well.

But in the meantime, we can proceed with enforcing inheritance on NOT NULL
constraint for *partitions*, because they only ever have one parent and
hence do not need an elaborate coninhcount-based scheme, I think.  Thoughts?

>
> +      as partition of the target table.  The partition bound specification must
>
> s/as partition/as a partition/

Fixed.

>
> +      correspond to the partitioning method and partitioning key of the target
>
> I think that in most places we are referring to the "partitioning
> method" (with ing) but the "partition key" (without ing). Let's try to
> be consistent.

I'm inclined to switch to "partitioning method" and "partitioning key",
but do you mean just the documentation or throughout?  Beside
documentation, I mean source code comments, error messages, etc.  I have
assumed throughout.

>
> +      table.  The table to be attached must have all the columns as the target
> +      table and no more; moreover, the column types must also match.  Also, it
> +      must have all the matching constraints as the target table.
>
> s/all the columns/all of the same columns/

Fixed.

> The second sentence doesn't seem quite grammatical.  And why would
> that be true anyway?  Partitions can have extra constraints, and if
> they lack constraints that are present on the partitioned table, those
> constraints will be added and validated, right?

It's true that partitions can have extra constraints which don't affect
the discussion.  What I meant to say with the second sentence is that any
check constraints defined on the parent table (being referred to as the
target table) must be present in the table being attached.  It's the same
rule as for regular inheritance.  I didn't change how that works in the
partitioning case.  So no, check constraints of the parent table that are
missing in the table being attached are not added and validated as part of
the attach partition command processing.

I did change the second sentence to: Also, it must have all the
<literal>NOT NULL</literal> and <literal>CHECK</literal> constraints
present in the target table.

>
> +      A full table scan is performed on the table being attached to check that
> +      existing rows in the table are within the specified partition bound.
> +      If it is known in advance that no partition bound violating rows are
> +      present in the table, the above scan can be skipped by specifying the
> +      <literal>NO VALIDATE</> option.
> +      Default behavior is to perform the scan, as if the
> +      <literal>VALIDATE</literal> option were specified.
>
> I don't think it's OK to allow the partition to be added if it
> contains rows that might not be valid.  We are generally very wary
> about adding options that can cause data integrity failures and I
> think we shouldn't start here, either.  On the other hand, it's also
> not desirable for adding a partition to take O(n) time in all cases.
> So what would be nice is if the new partition could self-certify that it
> contains no problematic rows by having a constraint that matches the
> new partitioning constraint.  Then we can skip the scan if that
> constraint is already present.

I agree that NO VALIDATE is undesirable and it would be better if we could
get rid of the same.  I want to clarify one thing though: what does it
mean for the new partition to have a constraint that *matches* the new
partitioning constraint?  Does it mean the new partition's constraint
*implies* the partitioning constraint?  Or as certified by equal()?
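To make the question concrete, here is a sketch (table and constraint names invented for illustration) of the scenario under discussion, where the table being attached already carries a constraint equivalent to the partition constraint:

```sql
-- Hypothetical example; syntax follows the patch as currently posted.
CREATE TABLE measurements (logdate date NOT NULL)
    PARTITION BY RANGE (logdate);

-- The candidate table already has a CHECK constraint covering the
-- range below.
CREATE TABLE measurements_2016 (
    logdate date NOT NULL,
    CHECK (logdate >= '2016-01-01' AND logdate < '2017-01-01')
);

-- If "matches" means implication, the validation scan could be skipped
-- here; if it means equal(), the constraint would have to be written in
-- exactly the same form as the generated partition constraint.
ALTER TABLE measurements ATTACH PARTITION measurements_2016
    FOR VALUES START ('2016-01-01') END ('2017-01-01');
```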

>
> +      inherited columns.  One can also specify table constraints, in addition
>
> Delete comma.

Done.

>
> +      to those inherited from the parent.  If a check constraint with the name
> +      matching one of the parent's constraint is specified, it is merged with
> +      the latter, provided the specified condition is same.
>
> Doesn't that effectively delete the merged constraint?
>
> Suggest: "If the parent already has a check constraint with the same
> name as a constraint specified for the child, the conditions must be
> the same."
>
> +) FOR VALUES IN ('los angeles', 'san fransisco');
>
> That's not how you spell San Francisco.

Oops.  I changed the example such that specified values are properly
capitalized.

<programlisting>
CREATE TABLE cities (
    name         text not null,
    population   int
) PARTITION BY LIST (initcap(name));
</programlisting></para>

<programlisting>
CREATE TABLE cities_west
    PARTITION OF cities (
    CONSTRAINT city_id_nonzero CHECK (city_id != 0)
) FOR VALUES IN ('Los Angeles', 'San Francisco');
</programlisting></para>

Previously, the key was: PARTITION BY LIST (lower(name))

> +   Create partition of a list partitioned table that itself is further
> +   partitioned and then create its partition:
>
> s/itself is/is itself/
> s/then create its partition/then add a partition to it/

Fixed.

>
> +                               if (!is_local || con->coninhcount == 0)
> +                                       con->coninhcount++;
>
> I would think that you could get rid of the "if" and just say
> con->coninhcount = 1.

That's better.  So:

/*
 * In case of partitions, an inherited constraint must be
 * inherited only once since it cannot have multiple parents and
 * it is never considered local.
 */
if (rel->rd_rel->relispartition)
{
    con->coninhcount = 1;
    con->conislocal = false;
}

The above is enforced in both MergeWithExistingConstraint() and
MergeConstraintsIntoExisting().

> It seems to me (and perhaps the comment could
> say this) that for a partitioned table, we can simplify the handling
> of coninhcount and conislocal.  Since partitioning hierarchies don't
> allow multiple inheritance, any inherited constraint must be inherited
> exactly once.  Since a partition cannot have inheritance children --
> except by being partitioned itself -- there is no point in tracking
> conislocal, so we always store it as false.

Agreed.

> +
> +void
> +StorePartitionBound(Relation rel, Node *bound)
>
> Header comment, please!

Sorry about that, added.

>
> +       (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound,
> +                                                  &isnull);
> +       Assert(isnull);
>
> We try not to do unnecessary work in non-assert-enabled builds solely
> for the benefit of assert-enabled builds.  We also try not to end up
> with variables that are only used in assert-enabled builds but not
> marked PG_USED_FOR_ASSERTS_ONLY, because those tend to cause compiler
> warnings.  I'm not sure a compiler would be smart enough to warn
> about this, but I suggest adding an #ifdef USE_ASSERT_CHECKING with a
> block inside where the offending variable is declared.  Actually, I
> think you need to move a bit more code.  Hmm.  Something like:
>
> #ifdef USE_ASSERT_CHECKING
>     {
>         Form_pg_class classForm;
>         bool isnull;
>
>        classForm = (Form_pg_class) GETSTRUCT(tuple);
>        Assert(!classForm->relispartition);
>
>        (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound,
>                                                   &isnull);
>        Assert(isnull);
>     }
> #endif

Done this way, thanks.

>
> + * are same in common cases, of which we only store one.
>
> "and we only store one of them."

Done.

>
> +    /*
> +     * Never put a null into the values array, flag instead for the
> +     * code further down below where we construct the actual relcache
> +     * struct.
> +     */
> +    found_null_partition = true;
> +    null_partition_index = i;
>
> How about we elog(ERROR, ...) if found_null_partition is already set?

Makes sense.  However, let me mention here that duplicates either within
one partition's list or across the partitions are not possible.  That's
because in case of the former, we de-duplicate before storing the list
into the catalog and the latter would simply be an overlap error.  Could
this be made an Assert() then?
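As an illustration of the two cases described above (hypothetical table names; behavior as stated, not verified here):

```sql
-- Hypothetical sketch of the described behavior.
CREATE TABLE p (a text) PARTITION BY LIST (a);

-- Duplicate within one partition's list: de-duplicated before the
-- list is stored in the catalog, so effectively FOR VALUES IN ('x').
CREATE TABLE p1 PARTITION OF p FOR VALUES IN ('x', 'x');

-- Duplicate across partitions: rejected with an overlap error.
CREATE TABLE p2 PARTITION OF p FOR VALUES IN ('x');
```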

>
> +    foreach(cell, non_null_values)
> +    {
> +        PartitionListValue *src = lfirst(cell);
> +
> +        all_values[i] = (PartitionListValue *)
> +            palloc(sizeof(PartitionListValue));
> +        all_values[i]->value = datumCopy(src->value,
> +                                         key->parttypbyval[0],
> +                                         key->parttyplen[0]);
> +        all_values[i]->index = src->index;
> +        i++;
> +    }
>
> Why do we need to datumCopy() here when we just did that in the
> previous loop?  Why did the previous loop need to datumCopy(), anyway?
>  Can't we just pass the same datums around?  I understand that you
> need to copy the data into the correct context at the end, but doing
> two copies prior to that seems excessive.

Agreed.  Now the datumCopy() happens only once, when copying into the
relcache context.

> +get_leaf_partition_oids(Oid relid, int lockmode)
>
> Instead of implementing this, can you just use find_all_inheritors()?

OK, got rid of get_leaf_partition_oids().

>
> +               /*
> +                * Is lower_val = upper_val?
> +                */
> +               if (lower_val && upper_val)
>
> So, that comment does not actually match that code.  That if-test is
> just checking whether both bounds are finite.  What I think you should
> be explaining here is that if lower_val and upper_val are both
> non-infinite, and if they happen to be equal, we want to emit an
> equality test rather than a >= test plus a <= test because $REASON.

OK, I have added some explanatory comments around this code.

>
> +    operoid = get_opfamily_member(key->partopfamily[col],
> +                                  key->parttypid[col],
> +                                  key->parttypid[col],
> +                                  strategy);
> +    if (!OidIsValid(operoid))
> +    {
> +        operoid = get_opfamily_member(key->partopfamily[col],
> +                                      key->partopcintype[col],
> +                                      key->partopcintype[col],
> +                                      strategy);
> +        *need_relabel = true;
> +    }
>
> There's a comment explaining THAT you do this ("Use either the column
> type as the operator datatype or opclass's declared input type.") but,
> to repeat a complaint that I've made often, nothing explaining why.
> In this particular case, what's not clear - in my opinion - is why you
> need to try two different possibilities, and why at least one of those
> possibilities is guaranteed to work.  I gather that if the opfamily
> doesn't contain an operator for the actual type of the partitioning
> column, you think it will certainly contain one for the input type of
> the operator class (which seems right), and that the input type of the
> operator class will be binary-compatible with the type of the
> partitioning column (which is less-obviously true, and needs some
> explanation).

Sorry, it is indeed not clear why the code is the way it is.  I expanded
the comment explaining why (borrowing from what you wrote above).

>
> I also think that this function should elog(ERROR, ...) if by any
> chance the second get_opfamily_member() call also fails. Otherwise the
> error might occur quite a bit downstream and be hard to understand.

OK, added an elog(ERROR, ...) after the second get_opfamily_member() call.

>
> +       ReleaseSysCache(tuple);
> +       heap_close(parent, AccessShareLock);
>
> I think you had better not release the lock here - i.e. pass NoLock.
> We don't release locks on tables until transaction commit, except for
> catalog tables.  This also comes up in, at least,
> transformPartitionOf().

Ah, fixed.

> +    PartitionRangeBound *b1 = (*(PartitionRangeBound *const *) a);
> +    PartitionRangeBound *b2 = (*(PartitionRangeBound *const *) b);
> +       PartitionKey key = (PartitionKey) arg;
> +
> +       return partition_rbound_cmp(key, b1, b2);
>
> Whitespace.

Fixed.

>
> +        * as partition and schema consists of columns definitions corresponding
>
> the schema consists
>
> -               if (recurse)
> +               /* Force inheritance recursion, if partitioned table. */
> +               if (recurse || rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>
> I would instead error out if ONLY is specified for a partitioned table.

I forgot this instance of forced-recursion-if-partitioned, fixed.  I was
not quite sure what the error message would say; how about:

ereport(ERROR,
        (errcode(ERRCODE_WRONG_OBJECT_TYPE),
         errmsg("must truncate child tables too")));

> +        /*
> +         * If the table is source table of ATTACH PARTITION command,
> +         * following check is unnecessary.
> +         */
>
> As usual, comment should say WHY.

OK, expanded the comment explaining why.

>
> +    if (partqualstate && !ExecQual(partqualstate, econtext, true))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_CHECK_VIOLATION),
> +                 errmsg("child table contains a row violating partition bound specification")));
>
> Why not mimic the existing phrasing?  "partition constraint is
> violated by some row"

Agreed, done.

>
> What happens if you try to attach a table as a partition of itself or
> one of its ancestors?

The latter fails with "already a partition" error.

The former case was not being handled at all which has now been fixed.
ATExecAddInherit() prevents that case as an instance of preventing the
circularity of inheritance.  It says: ERROR: circular inheritance not
allowed.  And then: DETAIL: "rel" is already a child of "rel".

Should ATExecAttachPartition() use the same trick and keep using the same
message(s)?  The above detail message doesn't quite sound appropriate when
one tries to attach a table as partition of itself.

>
> -                errmsg("column \"%s\" in child table must be marked NOT NULL",
> -                       attributeName)));
> +                         errmsg("column \"%s\" in child table must be marked NOT NULL",
> +                                attributeName)));
>
> Whitespace-only hunk; revert.  You cannot fight the power of pgindent.

Oops, fixed.

>
> +                errmsg("cannot attach table that is a inheritance child as partition")));
>
> an inheritance child
>
> +                errmsg("cannot attach a temporary relation of another session as partition ")));
>
> Extra space.

Fixed.

>
> +                errdetail("Table being attached should contain only the columns"
> +                          " present in parent.")));
>
> Suggest: "New partition should contain only..."
>
> Also, don't break error messages into multiple strings.  Make it one
> long string and let pgindent deal.

Done.

> +                       | FOR VALUES START '(' range_datum_list ')' lb_inc
> +                                    END_P '(' range_datum_list ')' ub_inc
>
> Just a random idea.  Would for VALUES FROM ( ... ) TO ( ... ) be more
> idiomatic than START and END?

It would actually.  Should I go ahead and change to FROM (...) TO (...)?

Related to range partitioning, should we finalize on inclusive START/FROM
and exclusive END/TO preventing explicit specification of the inclusivity?
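For comparison, a sketch of the two spellings (hypothetical table names), assuming inclusive lower and exclusive upper bounds:

```sql
-- Syntax in the patch as posted:
CREATE TABLE q1 PARTITION OF q
    FOR VALUES START (1) END (10);

-- The proposed, more idiomatic spelling:
CREATE TABLE q1 PARTITION OF q
    FOR VALUES FROM (1) TO (10);
```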

> +static void
> +transformPartitionOf(CreateStmtContext *cxt, Node *bound)
> +{
> +       TupleDesc       tupdesc;
> +       int                     i;
> +       RangeVar   *part = cxt->relation;
> +       RangeVar   *partof = linitial(cxt->inhRelations);
> +       Relation        parentRel;
> +
> +       parentRel = heap_openrv(partof, AccessShareLock);
> +
>
> If there's anyway that we might do heap_openrv() on the same RangeVar
> at multiple places in the code, it presents security and integrity
> hazards because the referent of that RangeVar might change in the
> meantime.  I suspect there is such a risk here - won't we need to open
> the relation again when we actually want to do the operation?
>
> Also, we should never acquire a lower-level lock on a relation and
> then, further downstream, acquire a higher-level lock on the same
> object.  To do so creates a deadlock hazard.  That seems like it might
> be a problem here, too.

I think I have modified things so that the concerns you express are taken
care of.  In particular, I got rid of transformPartitionOf() altogether,
moving part of its responsibilities to MergeAttributes() where the parent
table is locked anyway and calling transformPartitionBound from
DefineRelation() (we needed access to the parent's PartitionKey for that).

>
> +    /*if (ldatum->infinite && rdatum->infinite)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> +                 errmsg("both START and END datum for column \"%s\" cannot be UNBOUNDED",
> +                        colname),
> +                 parser_errposition(cxt->pstate, rdatum->location)));*/
>
> Commented-out code is bad.

Oops, removed.

>
> +    if (list_length(spec->lowerdatums) > partnatts)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> +                 errmsg("START has more values than the length of the partition key"),
> +                 parser_errposition(cxt->pstate,
> +                     exprLocation(list_nth(spec->lowerdatums,
> +                                           list_length(spec->lowerdatums) - 1)))));
> +    else if (list_length(spec->lowerdatums) < partnatts)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> +                 errmsg("START has fewer values than the length of the partition key"),
> +                 parser_errposition(cxt->pstate,
> +                     exprLocation(list_nth(spec->lowerdatums,
> +                                           list_length(spec->lowerdatums) - 1)))));
>
> It would be worth looking for a way to unify these cases.  Like "START
> must specify exactly one value per partitioning column".

Agreed.  The code is trying to generate too specific error messages.

> +     * Same oids? No mind order - in the list case, it matches the order
> +     * in which partition oids are returned by a pg_inherits scan, whereas
> +     * in the range case, they are in order of ranges of individual
> +     * partitions.  XXX - is the former unsafe?
>
> Probably.  It likely depends on the physical ordering of tuples in the
> table, which can change.

This comment is outdated.  RelationBuildPartitionDesc() now always puts
the OIDs into the array in a canonical order (in both the list and range
cases).  I have updated the comment.

>
> + * BoundCollection encapsulates a set of partition bounds of either physical
> + * or logical relations.  It is associated with a partitioned relation of
> + * which the aforementioned relations are partitions.
>
> "physical or logical relations" is unfamiliar terminology.

What I am trying to say there is that one can associate a BoundCollection
with either an actual partitioned table or with a transient partitioned
relation, say, a partitioned joinrel.  In case of the former, the
BoundCollection is part of the table's partition descriptor.  I can see
though that "logical/physical relations" terminology is misleading at
best.  I rewrote the comment:

/*
 * BoundCollection encapsulates a set of partition bounds.  It is usually
 * associated with partitioned tables as part of its partition descriptor.
 *
 * The internal structure is opaque outside partition.c.
 */
typedef struct BoundCollectionData *BoundCollection;


Attached updated patches take care of the above comments and few other
fixes.  There are still a few items I have not addressed right away:

- Remove NO VALIDATE clause in ATTACH PARTITION and instead rely on the
  new partition's constraints to skip the validation scan
- Remove the syntax to specify inclusivity of each of START and END bounds
  of range partitions and instead assume inclusive START and exclusive END

Also, what used to be "[PATCH 5/9] Refactor optimizer's inheritance set
expansion code" is no longer included in the set.  I think we can go back
to that topic later as I mentioned last week [2].

Thanks,
Amit

[1] https://www.postgresql.org/message-id/16589.1465997664%40sss.pgh.pa.us
[2]
https://www.postgresql.org/message-id/0a299fbf-e5a1-dc3d-eb5e-11d3601eac16%40lab.ntt.co.jp

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
In this latest patch set:

src/backend/parser/parse_utilcmd.c:3194: indent with spaces.
+                                    *rdatum;

With all patches applied, "make check" fails with a bunch of diffs
that look like this:
 Check constraints:
-     "pt1chk2" CHECK (c2 <> ''::text)
      "pt1chk3" CHECK (c2 <> ''::text)


On Wed, Nov 9, 2016 at 6:14 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Actually the sentence there is: The parenthesized list of columns or
> expressions forms the <firstterm>partitioning key</firstterm> for the
> table

OK.

>> +            /*
>> +             * Strip any top-level COLLATE clause.  This ensures that we treat
>> +             * "x COLLATE y" and "(x COLLATE y)" alike.
>> +             */
>>
>> But you don't, right?  Unless I am confused, which is possible, the
>> latter COLLATE will be ignored, while the former one will set the
>> collation to be used in the context of partitioning comparisons.
>
> The code immediately following the comment does in fact strip the
> top-level COLLATE clause.
>
>             while (IsA(expr, CollateExpr))
>                 expr = (Node *) ((CollateExpr *) expr)->arg;
>
> So that the following two specifications are equivalent which is the intent:
>
> create table p (a text) partition by range (a collate "en_US");
> vs.
> create table p (a text) partition by range ((a collate "en_US"));

I see.  You're right.

>> Re-reviewing 0002:
>>
>> +       if (fout->remoteVersion >= 100000)
>> +       {
>> +               PQExpBuffer acl_subquery = createPQExpBuffer();
>> +               PQExpBuffer racl_subquery = createPQExpBuffer();
>> +               PQExpBuffer initacl_subquery = createPQExpBuffer();
>> +               PQExpBuffer initracl_subquery = createPQExpBuffer();
>> +
>> +               PQExpBuffer attacl_subquery = createPQExpBuffer();
>> +               PQExpBuffer attracl_subquery = createPQExpBuffer();
>> +               PQExpBuffer attinitacl_subquery = createPQExpBuffer();
>> +               PQExpBuffer attinitracl_subquery = createPQExpBuffer();
>>
>> It seems unnecessary to repeat all of this.  The only differences
>> between the 10000 version and the 9600 version are:
>>
>> 60,61c60
>> <                           "AS changed_acl, "
>> <                           "CASE WHEN c.relkind = 'P' THEN
>> pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef "
>> ---
>>>                           "AS changed_acl "
>> 73c72
>> <                    "WHERE c.relkind in ('%c', '%c', '%c', '%c',
>> '%c', '%c', '%c') "
>> ---
>>>                    "WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c') "
>> 87,88c86
>> <                           RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE,
>> <                           RELKIND_PARTITIONED_TABLE);
>> ---
>>>                           RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE);
>>
>> But none of that is really a problem.  Sure, the 'P' case will never
>> arise in 9.6, but so what?  I'd really like to not keep duplicating
>> these increasingly-complex hunks of code if we can find some way to
>> avoid that.
>
> We cannot reference pg_catalog.pg_get_partkeydef() in the SQL query that
> getTables() sends to pre-10 servers, right?  But I suppose we need not
> call it in that particular SQL query in the first place.

Oh, yeah, that's a problem; the query will error out against older
servers.  You could do something like:

char *partkeydef;

if (version <= 90600)
    partkeydef = "NULL";
else
    partkeydef = "CASE WHEN c.relkind = 'P' THEN "
        "pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END";

...and then use %s to interpolate that into the query string.

> How about we do it in the following manner in getSchemaData():
>
> if (g_verbose)
>     write_msg(NULL, "reading partition key information for interesting tables\n");
> getTablePartitionKeyInfo(fout, tblinfo, numTables);
>
> I have implemented the same.

That might be OK too; I'll have to read it through carefully.

>> Re-reviewing 0003:
>>
>> +     <para>
>> +      If this table is a partition, one cannot perform <literal>DROP
>> NOT NULL</>
>> +      on a column if it is marked not null in the parent table.
>> +     </para>
>>
>> This would, presumably, also be true for inheritance.  I think we
>> could just leave this out.
>
> Actually, it isn't true for the regular inheritance situation.
>
> create table parent (a int not null);
> create table child () inherits (parent);
> alter table child alter a drop not null;  -- this works (bug?)

Hrm, OK.  That doesn't satisfy MY idea of the principle of least
astonishment, but hey...

> But in the meantime, we can proceed with enforcing inheritance on NOT NULL
> constraint for *partitions*, because they only ever have one parent and
> hence do not need elaborate coninhcount based scheme, I think.  Thoughts?

I agree.

>> +      correspond to the partitioning method and partitioning key of the target
>>
>> I think that in most places were are referring to the "partitioning
>> method" (with ing) but the "partition key" (without ing). Let's try to
>> be consistent.
>
> I'm inclined to switch to "partitioning method" and "partitioning key",
> but do you mean just the documentation or throughout?  Beside
> documentation, I mean source code comments, error messages, etc.  I have
> assumed throughout.

I think "partitioning key" is a bit awkward and actually prefer
"partition key".  But "partition method" sounds funny so I would go
with "partitioning method".

>> +      A full table scan is performed on the table being attached to check that
>> +      existing rows in the table are within the specified partition bound.
>> +      If it is known in advance that no partition bound violating rows are
>> +      present in the table, the above scan can be skipped by specifying the
>> +      <literal>NO VALIDATE</> option.
>> +      Default behavior is to perform the scan, as if the
>> +      <literal>VALIDATE</literal> option were specified.
>>
>> I don't think it's OK to allow the partition to be added if it
>> contains rows that might not be valid.  We are generally very wary
>> about adding options that can cause data integrity failures and I
>> think we shouldn't start here, either.  On the other hand, it's also
>> not desirable for adding a partition to take O(n) time in all cases.
>> So what would be nice is if the new partition could self-certify that it
>> contains no problematic rows by having a constraint that matches the
>> new partitioning constraint.  Then we can skip the scan if that
>> constraint is already present.
>
> I agree that NO VALIDATE is undesirable and it would be better if we could
> get rid of the same.  I want to clarify one thing though: what does it
> mean for the new partition to have a constraint that *matches* the new
> partitioning constraint?  Does it mean the new partition's constraint
> *implies* the partitioning constraint?  Or as certified by equal()?

Implies would be better, but equal() might be tolerable.

>> +    /*
>> +     * Never put a null into the values array, flag instead for the
>> +     * code further down below where we construct the actual relcache
>> +     * struct.
>> +     */
>> +    found_null_partition = true;
>> +    null_partition_index = i;
>>
>> How about we elog(ERROR, ...) if found_null_partition is already set?
>
> Makes sense.  However, let me mention here that duplicates either within
> one partition's list or across the partitions are not possible.  That's
> because in case of the former, we de-duplicate before storing the list
> into the catalog and the latter would simply be an overlap error.  Could
> this be made an Assert() then?

Well, I think that wouldn't be as good, because for example some
misguided person might do a manual update of the catalog.  It seems to
me that if the system catalog contents are questionable, it's better
to error out than to have completely undefined behavior.  In other
words, we probably want it checked in non-Assert builds.  See similar
cases where we have things like:

elog(ERROR, "conpfeqop is not a 1-D Oid array");

>> +               /*
>> +                * Is lower_val = upper_val?
>> +                */
>> +               if (lower_val && upper_val)
>>
>> So, that comment does not actually match that code.  That if-test is
>> just checking whether both bounds are finite.  What I think you should
>> be explaining here is that if lower_val and upper_val are both
>> non-infinite, and if they happen to be equal, we want to emit an
>> equality test rather than a >= test plus a <= test because $REASON.
>
> OK, I have added some explanatory comments around this code.

So, I was kind of assuming that the answer to "why are we doing this"
was "for efficiency".  It's true that val >= const AND val <= const
can be simplified to val = const, but it wouldn't be necessary to do
so for correctness.  It just looks nicer and runs faster.
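A sketch of the simplification being discussed (hypothetical two-column range partition; names invented):

```sql
-- Hypothetical: a partition whose first-column bounds happen to be
-- equal and finite.
CREATE TABLE r1 PARTITION OF r
    FOR VALUES START (10, 1) END (10, 100);

-- The implicit partition constraint can collapse the equal bounds:
--   CHECK (a = 10 AND b >= 1 AND b < 100)
-- rather than the equivalent but clunkier:
--   CHECK (a >= 10 AND a <= 10 AND b >= 1 AND b < 100)
```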

By the way, if you want to keep the inclusive stuff, I think you
probably need to consider this logic in light of the possibility that
one bound might be exclusive.  Maybe you're thinking that can't happen
anyway because it would have been rejected earlier, but an elog(ERROR,
...) might still be appropriate just to guard against messed-up
catalogs.

> I forgot this instance of forced-recursion-if-partitioned, fixed.  I was
> not quite sure what the error message would say; how about:
>
> ereport(ERROR,
>         (errcode(ERRCODE_WRONG_OBJECT_TYPE),
>          errmsg("must truncate child tables too")));

Works for me.

>> What happens if you try to attach a table as a partition of itself or
>> one of its ancestors?
>
> The latter fails with "already a partition" error.
>
> The former case was not being handled at all which has now been fixed.
> ATExecAddInherit() prevents that case as an instance of preventing the
> circularity of inheritance.  It says: ERROR: circular inheritance not
> allowed.  And then: DETAIL: "rel" is already a child of "rel".
>
> Should ATExecAttachPartition() use the same trick and keep using the same
> message(s)?  The above detail message doesn't quite sound appropriate when
> one tries to attach a table as partition of itself.

Whichever way is easier to code seems fine.  It's a pretty obscure
case, so I don't think we need to add code just to get a very slightly
better error message.

>> +                       | FOR VALUES START '(' range_datum_list ')' lb_inc
>> +                                                END_P '('
>> range_datum_list ')' ub_inc
>>
>> Just a random idea.  Would for VALUES FROM ( ... ) TO ( ... ) be more
>> idiomatic than START and END?
>
> It would actually.  Should I go ahead and change to FROM (...) TO (...)?

+1!

> Related to range partitioning, should we finalize on inclusive START/FROM
> and exclusive END/TO preventing explicit specification of the inclusivity?

I would be in favor of committing the initial patch set without that,
and then considering the possibility of adding it later.  If we
include it in the initial patch set we are stuck with it.

> Attached updated patches take care of the above comments and few other
> fixes.  There are still a few items I have not addressed right away:
>
> - Remove NO VALIDATE clause in ATTACH PARTITION and instead rely on the
>   new partition's constraints to skip the validation scan
> - Remove the syntax to specify inclusivity of each of START and END bounds
>   of range partitions and instead assume inclusive START and exclusive END

OK - what is the time frame for these changes?

> Also, what used to be "[PATCH 5/9] Refactor optimizer's inheritance set
> expansion code" is no longer included in the set.  I think we can go back
> to that topic later as I mentioned last week [2].

Great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Nov 9, 2016 at 12:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> In this latest patch set:
>
> src/backend/parser/parse_utilcmd.c:3194: indent with spaces.
> +                                    *rdatum;
>
> With all patches applied, "make check" fails with a bunch of diffs
> that look like this:
>
>   Check constraints:
> -     "pt1chk2" CHECK (c2 <> ''::text)
>       "pt1chk3" CHECK (c2 <> ''::text)

And the pg_upgrade test also fails:

Done
+ pg_dumpall -f /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql
+ pg_ctl -m fast stop
waiting for server to shut down.... done
server stopped
+ set +x

Files /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump1.sql and
/Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql differ
dumps were not identical
make[2]: *** [check] Error 1
make[1]: *** [check-pg_upgrade-recurse] Error 2
make: *** [check-world-src/bin-recurse] Error 2
[rhaas pgsql]$ diff
/Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump1.sql
/Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql
6403d6402
<     c text
8736,8737c8735
<     CONSTRAINT blocal CHECK (((b)::double precision < (1000)::double precision)),
<     CONSTRAINT bmerged CHECK (((b)::double precision > (1)::double precision))
---
>     CONSTRAINT blocal CHECK (((b)::double precision < (1000)::double precision))

For future revisions, please make sure "make check-world" passes before posting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/10 2:00, Robert Haas wrote:
> In this latest patch set:
> 
> src/backend/parser/parse_utilcmd.c:3194: indent with spaces.
> +                                    *rdatum;

This one I will fix.

> 
> With all patches applied, "make check" fails with a bunch of diffs
> that look like this:
> 
>   Check constraints:
> -     "pt1chk2" CHECK (c2 <> ''::text)
>       "pt1chk3" CHECK (c2 <> ''::text)

Hm, I can't seem to reproduce this one.  Is it perhaps possible that you
applied the patches on top of some other WIP patches or something?

> And the pg_upgrade test also fails:
>
> Done
> + pg_dumpall -f /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql
> + pg_ctl -m fast stop
> waiting for server to shut down.... done
> server stopped
> + set +x
>
> Files /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump1.sql and
> /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql differ
> dumps were not identical
> make[2]: *** [check] Error 1
> make[1]: *** [check-pg_upgrade-recurse] Error 2
> make: *** [check-world-src/bin-recurse] Error 2
> [rhaas pgsql]$ diff
> /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump1.sql
> /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql
> 6403d6402
> <     c text
> 8736,8737c8735
> <     CONSTRAINT blocal CHECK (((b)::double precision < (1000)::double precision)),
> <     CONSTRAINT bmerged CHECK (((b)::double precision > (1)::double precision))
> ---
> >     CONSTRAINT blocal CHECK (((b)::double precision < (1000)::double precision))

This one too I can't seem to reproduce.

> For future revisions, please make sure "make check-world" passes before posting.

OK, I will make sure.  FWIW, make check-world passes here after applying
the patches posted yesterday.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
Thanks again for the prompt review!

On 2016/11/10 2:00, Robert Haas wrote:
> On Wed, Nov 9, 2016 at 6:14 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> But none of that is really a problem.  Sure, the 'P' case will never
>>> arise in 9.6, but so what?  I'd really like to not keep duplicating
>>> these increasingly-complex hunks of code if we can find some way to
>>> avoid that.
>>
>> We cannot reference pg_catalog.pg_get_partkeydef() in the SQL query that
>> getTables() sends to pre-10 servers, right?  But I suppose we need not
>> call it in that particular SQL query in the first place.
>
> Oh, yeah, that's a problem; the query will error out against older
> servers.  You could do something like:
>
> char *partkeydef;
> if (version <= 90600)
>     partkeydef = "NULL";
> else
>     partkeydef = "CASE WHEN c.relkind = 'P' THEN
> pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END";
>
> ...and the use %s to interpolate that into the query string.

Yeah, that's a way.
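For concreteness, the interpolation Robert sketches could look something like the standalone program below.  This is not the actual pg_dump code; the buffer handling and the function name are illustrative only, and the `'P'` relkind and 90600 version cutoff are taken directly from the snippet above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of version-dependent query construction: pre-10 servers have no
 * pg_catalog.pg_get_partkeydef(), so substitute a plain NULL for that
 * column and interpolate the chosen fragment with %s.
 */
static void
build_gettables_query(char *buf, size_t buflen, int server_version)
{
    const char *partkeydef;

    if (server_version <= 90600)
        partkeydef = "NULL";
    else
        partkeydef = "CASE WHEN c.relkind = 'P' THEN "
                     "pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END";

    snprintf(buf, buflen,
             "SELECT c.oid, c.relname, %s AS partkeydef FROM pg_class c",
             partkeydef);
}
```

Either way, the older server never sees the name of a function it does not have, so the query no longer errors out.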

>>> I think that in most places were are referring to the "partitioning
>>> method" (with ing) but the "partition key" (without ing). Let's try to
>>> be consistent.
>>
>> I'm inclined to switch to "partitioning method" and "partitioning key",
>> but do you mean just the documentation or throughout?  Beside
>> documentation, I mean source code comments, error messages, etc.  I have
>> assumed throughout.
>
> I think "partitioning key" is a bit awkward and actually prefer
> "partition key".  But "partition method" sounds funny so I would go
> with "partitioning method".

OK, "partition key" and "partitioning method" it is then.  Source code
comments, error messages, and variables call the latter (partitioning) a
"strategy", though, which hopefully is fine.

>>> I don't think it's OK to allow the partition to be added if it
>>> contains rows that might not be valid.  We are generally very wary
>>> about adding options that can cause data integrity failures and I
>>> think we shouldn't start here, either.  On the other hand, it's also
>>> not desirable for adding a partition to take O(n) time in all cases.
>>> So what would be nice is if the new partition could self-certify that
>>> it contains no problematic rows by having a constraint that matches the
>>> new partitioning constraint.  Then we can skip the scan if that
>>> constraint is already present.
>>
>> I agree that NO VALIDATE is undesirable and it would be better if we could
>> get rid of the same.  I want to clarify one thing though: what does it
>> mean for the new partition to have a constraint that *matches* the new
>> partitioning constraint?  Does it mean the new partition's constraint
>> *implies* the partitioning constraint?  Or as certified by equal()?
>
> Implies would be better, but equal() might be tolerable.

I think an equal()-based method would fail to help in most cases.
Consider the following example:

create table foo1 (a int);
alter table foo1 add constraint check_a check (a < 10 and a >= 1);

create table foo (a int) partition by range (a);
alter table foo attach partition foo1 for values from (1) to (10);

The last command will internally generate the partition constraint, which
is basically a list of implicitly AND'd expressions viz. a is not null,
a >= 1 and a < 10, which we wrap into a BoolExpr.  It would not be equal()
with foo1's constraint even though they are basically the same constraint.
So, simple structural equality may prove to be less productive, IMHO.

It seems an *implies*-based solution would work much better, although a
bit slower for obvious reasons.  I reckon slowness is not a big issue in
this case.  So I prototyped the same using predicate_implied_by(), which
seems to work reasonably well.  It looks something like this:

    skip_validate = false;
    if (predicate_implied_by(partConstraint, existConstraint))
        skip_validate = true;

Where partConstraint is 1-member list with the new partition constraint
and existConstraint is a list of the existing constraints of the table
being attached, derived from its TupleConstr.
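To illustrate why implication helps where equal() does not, here is a toy standalone model.  This is not backend code: predicate_implied_by() handles arbitrary expression trees, while this sketch only covers a single integer column with constraints of the form lo <= a < hi.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a single-column check constraint: lo <= a < hi. */
typedef struct
{
    int lo;                     /* inclusive lower bound */
    int hi;                     /* exclusive upper bound */
} RangeCheck;

/* Structural equality, roughly what equal() would test. */
static bool
range_equal(RangeCheck a, RangeCheck b)
{
    return a.lo == b.lo && a.hi == b.hi;
}

/* Implication: every value satisfying 'existing' also satisfies 'part'. */
static bool
range_implies(RangeCheck existing, RangeCheck part)
{
    return existing.lo >= part.lo && existing.hi <= part.hi;
}
```

With part = {1, 10} (FOR VALUES FROM (1) TO (10)), a narrower existing constraint such as CHECK (a >= 2 AND a < 10) still implies the partition bound, so the scan could be skipped, even though equal() would reject the pair outright.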

>>> +                       /*
>>> +                        * Never put a null into the values array, flag
>>> +                        * instead for the code further down below where
>>> +                        * we construct the actual relcache struct.
>>> +                        */
>>> +                       found_null_partition = true;
>>> +                       null_partition_index = i;
>>>
>>> How about we elog(ERROR, ...) if found_null_partition is already set?
>>
>> Makes sense.  However, let me mention here that duplicates either within
>> one partition's list or across the partitions are not possible.  That's
>> because in case of the former, we de-duplicate before storing the list
>> into the catalog and the latter would simply be an overlap error.  Could
>> this be made an Assert() then?
>
> Well, I think that wouldn't be as good, because for example some
> misguided person might do a manual update of the catalog.  It seems to
> me that if the system catalog contents are questionable, it's better
> to error out than to have completely undefined behavior.  In other
> words, we probably want it checked in non-Assert builds.  See similar
> cases where we have things like:
>
> elog(ERROR, "conpfeqop is not a 1-D Oid array");

Good point, agreed.

>>> +               /*
>>> +                * Is lower_val = upper_val?
>>> +                */
>>> +               if (lower_val && upper_val)
>>>
>>> So, that comment does not actually match that code.  That if-test is
>>> just checking whether both bounds are finite.  What I think you should
>>> be explaining here is that if lower_val and upper_val are both
>>> non-infinite, and if they happen to be equal, we want to emit an
>>> equality test rather than a >= test plus a <= test because $REASON.
>>
>> OK, I have added some explanatory comments around this code.
>
> So, I was kind of assuming that the answer to "why are we doing this"
> was "for efficiency".  It's true that val >= const AND val <= const
> can be simplified to val = const, but it wouldn't be necessary to do
> so for correctness.  It just looks nicer and runs faster.

I guess I ended up with this code following a somewhat different way of
thinking about it, but in the end what you're saying is true.

> By the way, if you want to keep the inclusive stuff, I think you
> probably need to consider this logic in light of the possibility that
> one bound might be exclusive.  Maybe you're thinking that can't happen
> anyway because it would have been rejected earlier, but an elog(ERROR,
> ...) might still be appropriate just to guard against messed-up
> catalogs.

Yes, that makes sense.  I guess you mean a case like the one shown below:

create table foo5 partition of foo for values start (3, 10) end (3, 10);
ERROR:  cannot create range partition with empty range

I think the following would suffice as a guard (checked only if it turns
out that lower_val and upper_val are indeed equal):

    if (i == key->partnatts - 1 && spec->lowerinc != spec->upperinc)
        elog(ERROR, "invalid range bound specification");


>>> What happens if you try to attach a table as a partition of itself or
>>> one of its ancestors?
>>
>> The latter fails with "already a partition" error.
>>
>> The former case was not being handled at all which has now been fixed.
>> ATExecAddInherit() prevents that case as an instance of preventing the
>> circularity of inheritance.  It says: ERROR: circular inheritance not
>> allowed.  And then: DETAIL: "rel" is already a child of "rel".
>>
>> Should ATExecAttachPartition() use the same trick and keep using the same
>> message(s)?  The above detail message doesn't quite sound appropriate when
>> one tries to attach a table as partition of itself.
>
> Whichever way is easier to code seems fine.  It's a pretty obscure
> case, so I don't think we need to add code just to get a very slightly
> better error message.

OK, so let's keep it the same as with regular inheritance.

>>> +                       | FOR VALUES START '(' range_datum_list ')' lb_inc
>>> +                                                END_P '('
>>> range_datum_list ')' ub_inc
>>>
>>> Just a random idea.  Would for VALUES FROM ( ... ) TO ( ... ) be more
>>> idiomatic than START and END?
>>
>> It would actually.  Should I go ahead and change to FROM (...) TO (...)?
>
> +1!

Done!

>> Related to range partitioning, should we finalize on inclusive START/FROM
>> and exclusive END/TO preventing explicit specification of the inclusivity?
>
> I would be in favor of committing the initial patch set without that,
> and then considering the possibility of adding it later.  If we
> include it in the initial patch set we are stuck with it.

OK, I have removed the syntactic ability to specify INCLUSIVE/EXCLUSIVE
with each of the range bounds.

I haven't changed any code (such as comparison functions) that manipulates
instances of PartitionRangeBound, which has a flag called inclusive.  I
didn't remove the flag; it is instead just set to (is_lower ? true :
false) when initializing from the parse node.  Perhaps there is some scope
for further simplifying that code, which you probably alluded to when you
proposed that we do this.

>> Attached updated patches take care of the above comments and few other
>> fixes.  There are still a few items I have not addressed right away:
>>
>> - Remove NO VALIDATE clause in ATTACH PARTITION and instead rely on the
>>   new partition's constraints to skip the validation scan
>> - Remove the syntax to specify inclusivity of each of START and END bounds
>>   of range partitions and instead assume inclusive START and exclusive END
>
> OK - what is the time frame for these changes?

I have implemented both of these in the attached patch.  As mentioned
above, the logic to skip the validation scan using the new partition's
constraints is still kind of a prototype solution, but it seems to work so
far.  Comments on the same would be very helpful.

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Nov 10, 2016 at 7:40 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I think "partitioning key" is a bit awkward and actually prefer
>> "partiton key".  But "partition method" sounds funny so I would go
>> with "partitioning method".
>
> OK, "partition key" and "partitioning method" it is then.  Source code
> comments, error messages, variables call the latter (partitioning)
> "strategy" though which hopefully is fine.

Oh, I like "partitioning strategy".  Can we standardize on that?

>>> Related to range partitioning, should we finalize on inclusive START/FROM
>>> and exclusive END/TO preventing explicit specification of the inclusivity?
>>
>> I would be in favor of committing the initial patch set without that,
>> and then considering the possibility of adding it later.  If we
>> include it in the initial patch set we are stuck with it.
>
> OK, I have removed the syntactic ability to specify INCLUSIVE/EXCLUSIVE
> with each of the range bounds.
>
> I haven't changed any code (such as comparison functions) that manipulates
> instances of PartitionRangeBound which has a flag called inclusive.  I
> didn't remove the flag, but is instead just set to (is_lower ? true :
> false) when initializing from the parse node. Perhaps, there is some scope
> for further simplifying that code, which you probably alluded to when you
> proposed that we do this.

Yes, you need to rip out all of the logic that supports it.  Having
the logic to support it but not the syntax is bad because then that
code can't be properly tested.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Nov 9, 2016 at 9:58 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> With all patches applied, "make check" fails with a bunch of diffs
>> that look like this:
>>
>>   Check constraints:
>> -     "pt1chk2" CHECK (c2 <> ''::text)
>>       "pt1chk3" CHECK (c2 <> ''::text)
>
> Hm, I can't seem to reproduce this one.  Is it perhaps possible that you
> applied the patches on top of some other WIP patches or something?

Nope.  I just checked and this passes with only 0001 and 0002 applied,
but when I add 0003 and 0004 then it starts failing.  It appears that
the problem starts at this point in the foreign_data test:

ALTER TABLE pt1 DROP CONSTRAINT pt1chk2 CASCADE;

After that command, in the expected output, pt1chk2 stops showing up
in the output of \d+ pt1, but continues to appear in the output of \d+
ft2.  With your patch, however, it stops showing up for ft2 also.  If
that's not also happening for you, it might be due to an uninitialized
variable someplace.

+        /* Force inheritance recursion, if partitioned table. */

Doesn't match code (any more).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/11 6:51, Robert Haas wrote:
> On Wed, Nov 9, 2016 at 9:58 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> With all patches applied, "make check" fails with a bunch of diffs
>>> that look like this:
>>>
>>>   Check constraints:
>>> -     "pt1chk2" CHECK (c2 <> ''::text)
>>>       "pt1chk3" CHECK (c2 <> ''::text)
>>
>> Hm, I can't seem to reproduce this one.  Is it perhaps possible that you
>> applied the patches on top of some other WIP patches or something?
>
> Nope.  I just checked and this passes with only 0001 and 0002 applied,
> but when I add 0003 and 0004 then it starts failing.

Sorry, it definitely wasn't an error on your part.

> It appears that
> the problem starts at this point in the foreign_data test:
>
> ALTER TABLE pt1 DROP CONSTRAINT pt1chk2 CASCADE;
>
> After that command, in the expected output, pt1chk2 stops showing up
> in the output of \d+ pt1, but continues to appear in the output of \d+
> ft2.  With your patch, however, it stops showing up for ft2 also.  If
> that's not also happening for you, it might be due to an uninitialized
> variable someplace.

Thanks for the context.  I think I found the culprit variable in
MergeConstraintsIntoExisting() and fixed it.  As you correctly guessed,
the uninitialized variable caused (in your environment) even non-partition
child relations to be treated as partitions and hence forced any merged
constraints to be non-local in all cases, not just in the case of
partitions.  That meant the command you quoted would drop even ft2's (a
child's) constraint, because its conislocal was wrongly false.

>
> +        /* Force inheritance recursion, if partitioned table. */
>
> Doesn't match code (any more).

Fixed.

>>> I think "partitioning key" is a bit awkward and actually prefer
>>> "partition key".  But "partition method" sounds funny so I would go
>>> with "partitioning method".
>>
>> OK, "partition key" and "partitioning method" it is then.  Source code
>> comments, error messages, variables call the latter (partitioning)
>> "strategy" though which hopefully is fine.
>
> Oh, I like "partitioning strategy".  Can we standardize on that?

OK, done.

>>> I would be in favor of committing the initial patch set without that,
>>> and then considering the possibility of adding it later.  If we
>>> include it in the initial patch set we are stuck with it.
>>
>> OK, I have removed the syntactic ability to specify INCLUSIVE/EXCLUSIVE
>> with each of the range bounds.
>>
>> I haven't changed any code (such as comparison functions) that manipulates
>> instances of PartitionRangeBound which has a flag called inclusive.  I
>> didn't remove the flag, but is instead just set to (is_lower ? true :
>> false) when initializing from the parse node. Perhaps, there is some scope
>> for further simplifying that code, which you probably alluded to when you
>> proposed that we do this.
>
> Yes, you need to rip out all of the logic that supports it.  Having
> the logic to support it but not the syntax is bad because then that
> code can't be properly tested.

Agreed, done.


Attached updated patches.

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
I have not looked at the latest set of patches, but in the version
that I have we create one composite type for every partition. This
means that if there are thousand partitions, there will be thousand
identical entries in pg_type. Since all the partitions share the same
definition (by syntax), it doesn't make sense to add so many identical
entries. Moreover, in set_append_rel_size(), while translating the
expressions from parent to child, we add a ConvertRowtypeExpr instead
of whole-row reference if reltype of the parent and child do not match
(adjust_appendrel_attrs_mutator()):

    if (appinfo->parent_reltype != appinfo->child_reltype)
    {
        ConvertRowtypeExpr *r = makeNode(ConvertRowtypeExpr);

I guess, we should set reltype of child to that of parent for
declarative partitions.

On Fri, Nov 11, 2016 at 4:00 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/11/11 6:51, Robert Haas wrote:
>> On Wed, Nov 9, 2016 at 9:58 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> With all patches applied, "make check" fails with a bunch of diffs
>>>> that look like this:
>>>>
>>>>   Check constraints:
>>>> -     "pt1chk2" CHECK (c2 <> ''::text)
>>>>       "pt1chk3" CHECK (c2 <> ''::text)
>>>
>>> Hm, I can't seem to reproduce this one.  Is it perhaps possible that you
>>> applied the patches on top of some other WIP patches or something?
>>
>> Nope.  I just checked and this passes with only 0001 and 0002 applied,
>> but when I add 0003 and 0004 then it starts failing.
>
> Sorry, it definitely wasn't an error on your part.
>
>> It appears that
>> the problem starts at this point in the foreign_data test:
>>
>> ALTER TABLE pt1 DROP CONSTRAINT pt1chk2 CASCADE;
>>
>> After that command, in the expected output, pt1chk2 stops showing up
>> in the output of \d+ pt1, but continues to appear in the output of \d+
>> ft2.  With your patch, however, it stops showing up for ft2 also.  If
>> that's not also happening for you, it might be due to an uninitialized
>> variable someplace.
>
> Thanks for the context.  I think I found the culprit variable in
> MergeConstraintsIntoExisting() and fixed it.  As you correctly guessed,
> the uninitialized variable caused (in your environment) even non-partition
> child relations to be treated partitions and hence forced any merged
> constraints to be non-local in all cases, not just in case of partitions.
> Which meant the command you quoted would even drop the ft2's (a child)
> constraint because its conislocal is wrongly false.
>
>>
>> +        /* Force inheritance recursion, if partitioned table. */
>>
>> Doesn't match code (any more).
>
> Fixed.
>
>>>> I think "partitioning key" is a bit awkward and actually prefer
>>>> "partition key".  But "partition method" sounds funny so I would go
>>>> with "partitioning method".
>>>
>>> OK, "partition key" and "partitioning method" it is then.  Source code
>>> comments, error messages, variables call the latter (partitioning)
>>> "strategy" though which hopefully is fine.
>>
>> Oh, I like "partitioning strategy".  Can we standardize on that?
>
> OK, done.
>
>>>> I would be in favor of committing the initial patch set without that,
>>>> and then considering the possibility of adding it later.  If we
>>>> include it in the initial patch set we are stuck with it.
>>>
>>> OK, I have removed the syntactic ability to specify INCLUSIVE/EXCLUSIVE
>>> with each of the range bounds.
>>>
>>> I haven't changed any code (such as comparison functions) that manipulates
>>> instances of PartitionRangeBound which has a flag called inclusive.  I
>>> didn't remove the flag, but is instead just set to (is_lower ? true :
>>> false) when initializing from the parse node. Perhaps, there is some scope
>>> for further simplifying that code, which you probably alluded to when you
>>> proposed that we do this.
>>
>> Yes, you need to rip out all of the logic that supports it.  Having
>> the logic to support it but not the syntax is bad because then that
>> code can't be properly tested.
>
> Agreed, done.
>
>
> Attached updated patches.
>
> Thanks,
> Amit



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
I forgot to quote your comments in the email I sent on Friday [1], with
new patches that do take care of the following comments.

On 2016/11/11 4:04, Robert Haas wrote:
> On Thu, Nov 10, 2016 at 7:40 AM, Amit Langote
>>
>> OK, "partition key" and "partitioning method" it is then.  Source code
>> comments, error messages, variables call the latter (partitioning)
>> "strategy" though which hopefully is fine.
> 
> Oh, I like "partitioning strategy".  Can we standardize on that?

Done.

>> OK, I have removed the syntactic ability to specify INCLUSIVE/EXCLUSIVE
>> with each of the range bounds.
>>
>> I haven't changed any code (such as comparison functions) that manipulates
>> instances of PartitionRangeBound which has a flag called inclusive.  I
>> didn't remove the flag, but is instead just set to (is_lower ? true :
>> false) when initializing from the parse node. Perhaps, there is some scope
>> for further simplifying that code, which you probably alluded to when you
>> proposed that we do this.
> 
> Yes, you need to rip out all of the logic that supports it.  Having
> the logic to support it but not the syntax is bad because then that
> code can't be properly tested.

Agreed and done.  Now the only thing that dictates the inclusivity of
individual range bounds during comparisons (with other range bounds or
with the partition key of a tuple) is whether the bound is a lower bound:
inclusive if so, exclusive if not.
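That convention can be pictured with a minimal standalone model (illustrative only; the real code compares multi-column bounds through the partitioning opclass comparators):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-column range bound: inclusivity is implied by is_lower. */
typedef struct
{
    int  val;
    bool is_lower;
} RangeBound;

/*
 * Does 'key' fall on the admitted side of the bound?  Lower bounds are
 * inclusive, upper bounds exclusive.
 */
static bool
bound_admits(RangeBound b, int key)
{
    return b.is_lower ? key >= b.val : key < b.val;
}

/*
 * A partition FOR VALUES FROM (lo) TO (hi) accepts 'key' iff both the
 * lower and the upper bound admit it.
 */
static bool
range_partition_accepts(int lo, int hi, int key)
{
    RangeBound lower = { lo, true };
    RangeBound upper = { hi, false };

    return bound_admits(lower, key) && bound_admits(upper, key);
}
```

For FROM (1) TO (10), the values 1 through 9 are accepted while 10 belongs to the next partition, so adjacent partitions tile the key space without gaps or overlaps.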

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/8d7c35e3-1c85-33d0-4440-0a75bf9d31cd%40lab.ntt.co.jp





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/04 0:49, Robert Haas wrote:
> On Thu, Nov 3, 2016 at 7:46 AM,  <alvherre@alvh.no-ip.org> wrote:
>> El 2016-10-28 07:53, Amit Langote escribió:
>>> @@ -6267,6 +6416,12 @@ ATAddForeignKeyConstraint(AlteredTableInfo *tab,
>>> Relation rel,
>>>          * Validity checks (permission checks wait till we have the column
>>>          * numbers)
>>>          */
>>> +       if (pkrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>>> +               ereport(ERROR,
>>> +                               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
>>> +                                errmsg("cannot reference relation
>>> \"%s\"", RelationGetRelationName(pkrel)),
>>> +                                errdetail("Referencing partitioned tables
>>> in foreign key constraints is not supported.")));
>>
>> Is there a plan for fixing this particular limitation?  It's a pretty
>> serious problem for users,
>> and the suggested workaround (to create a separate non-partitioned table
>> which carries only the PK
>> columns which is updated by triggers, and direct the FKs to it instead of to
>> the partitioned table)
>> is not only a very ugly one, but also very slow.
> 
> If you have two compatibly partitioned tables, and the foreign key
> matches the partitioning keys, you could implement a foreign key
> between the two tables as a foreign key between each pair of matching
> partitions.  Otherwise, isn't the only way to handle this a global
> index?

I am assuming you don't mean a global index (on partitioned tables) as in
some new kind of monolithic physical structure that implements the
constraint across tables (partitions), right?  I'm thinking you mean a
collection of btree indexes on individual partitions with the key of each
index matching the partition key of the parent, created internally as part
of the creation of the same index on the parent.  In fact, the said
indexes would be created and maintained much like inherited attributes and
constraints are.  That would require quite a bit of new infrastructure.
We did discuss the possibility of such a feature being implemented
on top of declarative partitioning, but not in version 1 [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZZMfcf16YaHuhP1Vk%3Dj8PDFeHCvfj%2BFJQd%2BeFhs%2B7P8A%40mail.gmail.com





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/11 20:49, Ashutosh Bapat wrote:
> I have not looked at the latest set of patches, but in the version
> that I have we create one composite type for every partition. This
> means that if there are thousand partitions, there will be thousand
> identical entries in pg_type. Since all the partitions share the same
> definition (by syntax), it doesn't make sense to add so many identical
> entries. Moreover, in set_append_rel_size(), while translating the
> expressions from parent to child, we add a ConvertRowtypeExpr instead
> of whole-row reference if reltype of the parent and child do not match
> (adjust_appendrel_attrs_mutator())
>                    if (appinfo->parent_reltype != appinfo->child_reltype)
>                     {
>                         ConvertRowtypeExpr *r = makeNode(ConvertRowtypeExpr);
> 
> I guess, we should set reltype of child to that of parent for
> declarative partitions.

Thanks for the suggestion.  I agree that partitions having the same
reltype as the parent will help a number of cases including the one you
mentioned, but I haven't yet considered how invasive such a change will be.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/11 19:30, Amit Langote wrote:
>
> Attached updated patches.

Here is the latest version of the patches with some fixes along with those
mentioned below (mostly in 0003):

- Fixed the logic to skip the attach partition validation scan such that
  it won't skip scanning a list partition *that doesn't accept NULLs* if
  the partition key column is not set NOT NULL (it similarly doesn't skip
  scanning a range partition if either of the partition key columns is not
  set NOT NULL, because a range partition key cannot contain NULLs at all)

- Added some more regression tests for ATTACH PARTITION command

- Some fixes to documentation and source code comments
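The NULL-handling rule in the first item above can be summarized as a standalone predicate.  The names here are hypothetical and this is not the patch's actual code; it only restates the decision: skip the scan only if the existing constraints imply the partition constraint and, when the partition would reject NULLs, every partition key column is marked NOT NULL.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative decision for ATTACH PARTITION: can the validation scan
 * be skipped?  key_not_null[i] says whether the i-th partition key
 * column carries a NOT NULL constraint.
 */
static bool
can_skip_validation_scan(bool constraints_imply_bound,
                         bool partition_accepts_nulls,
                         const bool *key_not_null, int nkeycols)
{
    int i;

    if (!constraints_imply_bound)
        return false;

    /*
     * If the partition rejects NULLs (a range partition, or a list
     * partition without NULL in its list), a NULL key could otherwise
     * slip through unvalidated, so demand NOT NULL on every key column.
     */
    if (!partition_accepts_nulls)
    {
        for (i = 0; i < nkeycols; i++)
            if (!key_not_null[i])
                return false;
    }
    return true;
}
```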

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Nov 15, 2016 at 5:30 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/11/11 19:30, Amit Langote wrote:
>>
>> Attached updated patches.
>
> Here is the latest version of the patches with some fixes along with those
> mentioned below (mostly in 0003):
>
> - Fixed the logic to skip the attach partition validation scan such that
>   it won't skip scanning a list partition *that doesn't accept NULLs* if
>   the partition key column is not set NOT NULL (it similarly doesn't skip
>   scanning a range partition if either of the partition key columns is not
>   set NOT NULL, because a range partition key cannot contain NULLs at all)
>
> - Added some more regression tests for ATTACH PARTITION command
>
> - Some fixes to documentation and source code comments

Have you done any performance testing on the tuple routing code?
Suppose we insert a million (or 10 million) tuples into an
unpartitioned table, a table with 10 partitions, a table with 100
partitions, a table with 1000 partitions, and a table that is
partitioned into 10 partitions each of which has 10 subpartitions.
Ideally, the partitioned cases would run almost as fast as the
unpartitioned case, but probably there will be some overhead.
However, it would be useful to know how much.  Also, it would be
useful to set up the same cases with inheritance using a PL/pgsql ON
INSERT trigger for tuple routing and compare.  Hopefully the tuple
routing code is far faster than a trigger, but we should make sure
that's the case and look for optimizations if not.  Also, it would be
useful to know how much slower the tuple-mapping-required case is than
the no-tuple-mapping-required case.
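One way to set up such a test is to generate the DDL mechanically.  The helper below emits CREATE TABLE statements for equal-width range partitions of a hypothetical table t, partitioned by range (a); the table names and ranges are made up for illustration and use the FROM/TO syntax adopted earlier in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the DDL for the i-th of several equal-width range partitions of
 * a hypothetical table "t" partitioned by range (a).
 */
static int
partition_ddl(char *buf, size_t buflen, int i, int width)
{
    return snprintf(buf, buflen,
                    "CREATE TABLE t_%d PARTITION OF t"
                    " FOR VALUES FROM (%d) TO (%d);",
                    i, i * width, (i + 1) * width);
}
```

Looping i from 0 to 9 with a width of 100000 yields the 10-partition case covering a million rows; changing the loop count and width produces the 100- and 1000-partition variants with the same script.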

I think the comments in some of the later patches could use some work
yet.  For example, in 0007, FormPartitionKeyDatum()'s header comment
is largely uninformative, get_partition_for_tuple()'s header comment
doesn't explain what the return value means in the non-failure case,
and ExecFindPartition() doesn't have a header comment at all.

The number of places in these patches where you end up reopening a
hopefully-already-locked relation with NoLock (or sometimes with
AccessShareLock) is worrying to me.  I think that's a coding pattern
we should be seeking to avoid; every one of those is not only a hazard
(what if we reach that point in the code without a lock?) but a
possible performance issue (we have to look up the OID in the
backend-private hash table; and if you are passing AccessShareLock
then you might also hit the lock manager which could be slow or create
deadlock hazards).  It would be much better to pass the Relation
around rather than the OID whenever possible.

Also, in 0006:

- I doubt that PartitionTreeNodeData's header comment will survive
contact with pgindent.

In 0007:

- oid_is_foreign_table could/should do a syscache lookup instead of
opening the heap relation.  But actually I think you could drop it
altogether and use get_rel_relkind().

- The XXX in EndCopy will need to be fixed somehow.

- I suspect that many of the pfree's in this patch are pointless
because the contexts in which those allocations are performed will be
reset or deleted shortly afterwards anyway.  Only pfree data that
might otherwise live substantially longer than we want, because pfree
consumes some cycles.

- The code in make_modifytable() to swap out the rte_array for a fake
one looks like an unacceptable kludge.  I don't know offhand what a
better design would look like, but what you've got is really ugly.

- I don't understand how it can be right for get_partition_for_tuple()
to be emitting an error that says "range partition key contains null".
A defective partition key should be detected at partition creation
time, not in the middle of tuple routing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/16 4:21, Robert Haas wrote:
> On Tue, Nov 15, 2016 at 5:30 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2016/11/11 19:30, Amit Langote wrote:
>>>
>>> Attached updated patches.
>
> Have you done any performance testing on the tuple routing code?
> Suppose we insert a million (or 10 million) tuples into an
> unpartitioned table, a table with 10 partitions, a table with 100
> partitions, a table with 1000 partitions, and a table that is
> partitioned into 10 partitions each of which has 10 subpartitions.
> Ideally, the partitioned cases would run almost as fast as the
> unpartitioned case, but probably there will be some overhead.
> However, it would be useful to know how much.  Also, it would be
> useful to set up the same cases with inheritance using a PL/pgsql ON
> INSERT trigger for tuple routing and compare.  Hopefully the tuple
> routing code is far faster than a trigger, but we should make sure
> that's the case and look for optimizations if not.  Also, it would be
> useful to know how much slower the tuple-mapping-required case is than
> the no-tuple-mapping-required case.

OK, I will share the performance results soon.

Meanwhile, here are updated patches that address most of the following comments.

> I think the comments in some of the later patches could use some work
> yet.  For example, in 0007, FormPartitionKeyDatum()'s header comment
> is largely uninformative, get_partition_for_tuple()'s header comment
> doesn't explain what the return value means in the non-failure case,
> and ExecFindPartition() doesn't have a header comment at all.

Sorry, I have tried to make these comments more informative in the attached.

> The number of places in these patches where you end up reopening a
> hopefully-already-locked relation with NoLock (or sometimes with
> AccessShareLock) is worrying to me.  I think that's a coding pattern
> we should be seeking to avoid; every one of those is not only a hazard
> (what if we reach that point in the code without a lock?) but a
> possible performance issue (we have to look up the OID in the
> backend-private hash table; and if you are passing AccessShareLock
> then you might also hit the lock manager which could be slow or create
> deadlock hazards).  It would be much better to pass the Relation
> around rather than the OID whenever possible.

I have removed most instances of heap_open() calls after realizing they
were in fact unnecessary (and as you warn possibly hazardous and
performance hogs).  Where it's inevitable that we perform heap_open(), I
have changed it so that the correct lockmode is passed.

>
> Also, in 0006:
>
> - I doubt that PartitionTreeNodeData's header comment will survive
> contact with pgindent.

Fixed by adding "/* ----" at the top of the comment.

> In 0007:
>
> - oid_is_foreign_table could/should do a syscache lookup instead of
> opening the heap relation.  But actually I think you could drop it
> altogether and use get_rel_relkind().

Done with get_rel_relkind().

> - The XXX in EndCopy will need to be fixed somehow.

Fixed.  I realized that doing that in EndCopy() is kind of pointless, so I
moved that code to the end of CopyFrom(), right before the call to
FreeExecutorState().

> - I suspect that many of the pfree's in this patch are pointless
> because the contexts in which those allocations are performed will be
> reset or deleted shortly afterwards anyway.  Only pfree data that
> might otherwise live substantially longer than we want, because pfree
> consumes some cycles.

They did indeed seem to be freeing unnecessarily; removed.

>
> - The code in make_modifytable() to swap out the rte_array for a fake
> one looks like an unacceptable kludge.  I don't know offhand what a
> better design would look like, but what you've got is really ugly.

Agree that it looks horrible.  The problem is we don't add partition
(child table) RTEs when planning an insert on the parent and FDW
partitions can't do without some planner handling - PlanForeignModify()
expects a valid PlannerInfo for deparsing target lists (basically, to be
able to use planner_rt_fetch()).

Any design beside this kludge would perhaps mean that we will be adding
valid partition RTEs at some earlier planning stage much like what
expand_inherited_rtentry() does during non-insert queries.  Perhaps, we
could have expand_inherited_rtentry() make an exception in partitioned
tables' case and create root->append_rel_list members even in the insert
case.  We could then go over the append_rel_list in make_modifytable() to
get the valid RT index of foreign child tables to pass to
PlanForeignModify().  Using append_rel_list for insert planning is perhaps
equally ugly though. Thoughts?

> - I don't understand how it can be right for get_partition_for_tuple()
> to be emitting an error that says "range partition key contains null".
> A defective partition key should be detected at partition creation
> time, not in the middle of tuple routing.

That error is emitted when the partition key of the *input tuple* is found
to contain NULLs.  Maybe we need to consider something like how btree
indexes treat NULLs - order NULLs either before or after all the non-NULL
values based on NULLS FIRST/LAST config.  Thoughts?

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> OK, I will share the performance results soon.

Thanks.

>> Also, in 0006:
>>
>> - I doubt that PartitionTreeNodeData's header comment will survive
>> contact with pgindent.
>
> Fixed by adding "/* ----" at the top of the comment.

OK.  Hopefully you also tested that.  In general, with a patch set
this large, the more you can do to make it pgindent-clean, the better.
Ideally none of your code would get changed by pgindent, or at least
not the new files.  But note that you will probably have to
update typedefs.list if you actually want to be able to run it without
having it mangle things that involve your new structs.

>> - The code in make_modifytable() to swap out the rte_array for a fake
>> one looks like an unacceptable kludge.  I don't know offhand what a
>> better design would look like, but what you've got is really ugly.
>
> Agree that it looks horrible.  The problem is we don't add partition
> (child table) RTEs when planning an insert on the parent and FDW
> partitions can't do without some planner handling - PlanForeignModify()
> expects a valid PlannerInfo for deparsing target lists (basically, to be
> able to use planner_rt_fetch()).
>
> Any design beside this kludge would perhaps mean that we will be adding
> valid partition RTEs at some earlier planning stage much like what
> expand_inherited_rtentry() does during non-insert queries.  Perhaps, we
> could have expand_inherited_rtentry() make an exception in partitioned
> tables' case and create root->append_rel_list members even in the insert
> case.  We could then go over the append_rel_list in make_modifytable() to
> get the valid RT index of foreign child tables to pass to
> PlanForeignModify().  Using append_rel_list for insert planning is perhaps
> equally ugly though. Thoughts?

If it's only needed for foreign tables, how about for v1 we just throw
an error and say errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot route inserted tuples to a foreign table") for now.  We
can come back and fix it later.  Doing more inheritance expansion
early is probably not a good idea because it likely sucks for
performance, and that's especially unfortunate if it's only needed for
foreign tables.  Coming up with some new FDW API or some modification
to the existing one is probably better, but I don't really want to get
hung up on that right now.

>> - I don't understand how it can be right for get_partition_for_tuple()
>> to be emitting an error that says "range partition key contains null".
>> A defective partition key should be detected at partition creation
>> time, not in the middle of tuple routing.
>
> That error is emitted when the partition key of the *input tuple* is found
> to contain NULLs.  Maybe we need to consider something like how btree
> indexes treat NULLs - order NULLs either before or after all the non-NULL
> values based on NULLS FIRST/LAST config.  Thoughts?

Well, I'd say my confusion suggests that the error message needs to be
clearer about what the problem is.  I think this is basically a case
of there being no partition for the given row, so maybe just arrange
to use that same message here ("no partition of relation \"%s\" found
for row").

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Meanwhile, here are updated patches that address most of the following comments.

OK, I have re-reviewed 0005 and it looks basically fine to me, modulo
a few minor nitpicks. "This is called, *iff*" shouldn't have a comma
there, and I think the entire comment block that starts with "NOTE:
SQL specifies that a NULL" and ends with "it's unlikely that NULL
would result." should be changed to say something like /* As for
catalogued constraints, we treat a NULL result as a success, not a
failure. */ rather than duplicating an existing comment that doesn't
quite apply here.  Finally, ExecConstraints() contains a new if-block
whose sole content is another if-block.  Perhaps if (this && that)
would be better.

Regarding 0006 and 0007, I think the PartitionTreeNodeData structure
you've chosen is awfully convoluted and probably not that efficient.
For example, get_partition_for_tuple() contains this loop:

+       prev = parent;
+       node = parent->downlink;
+       while (node != NULL)
+       {
+               if (node->index >= cur_idx)
+                       break;
+
+               prev = node;
+               node = node->next;
+       }

Well, it looks to me like that's an O(n) way to find the n'th
partition, which seems like a pretty bad idea in performance-critical
code, which this is.  I think this whole structure needs a pretty
heavy overhaul.  Here's a proposal:

1. Forget the idea of a tree.  Instead, let the total number of tables
in the partitioning hierarchy be N and let the number of those that
are partitioned be K.  Assign each partitioned table in the hierarchy
an index between 0 and K-1.  Make your top level data structure (in
lieu of PartitionTreeNodeData) be an array of K PartitionDispatch
objects, with the partitioning root in entry 0 and the rest in the
remaining entries.

2. Within each PartitionDispatch object, store (a) a pointer to a
PartitionDesc and (b) an array of integers of length equal to the
PartitionDesc's nparts value.  Each integer i, if non-negative, is the
final return value for get_partition_for_tuple.  If i == -1, tuple
routing fails.  If i < -1, we must next route using the subpartition
whose PartitionDesc is at index -(i+1).  Arrange for the array to be
in the same order the PartitionDesc's OID list.

3. Now get_partition_for_tuple looks something like this:

K = 0
loop:
    pd = PartitionDispatch[K]
    idx = list/range_partition_for_tuple(pd->partdesc, ...)
    if (idx >= -1)
        return idx
    K = -(idx + 1)

No recursion, minimal pointer chasing, no linked lists.  The whole
thing is basically trivial aside from the cost of
list/range_partition_for_tuple itself; optimizing that is a different
project.  I might have some details slightly off here, but hopefully
you can see what I'm going for: you want to keep the computation that
happens in get_partition_for_tuple() to an absolute minimum, and
instead set things up in advance so that getting the partition for a
tuple is FAST.  And you want the data structures that you are using in
that process to be very compact, hence arrays instead of linked lists.
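For concreteness, here is a toy, self-contained sketch of the flat dispatch-array idea above.  The type and field names (PartitionDispatchData, indexes) echo the proposal but are simplified stand-ins using int keys rather than PostgreSQL's actual definitions, and the linear bound scan stands in for list/range_partition_for_tuple():

```c
#include <assert.h>

/*
 * Toy stand-in for the proposed design: dispatch[] is the flat array of
 * K dispatch entries, slot 0 being the partitioning root.  indexes[off]
 * is either a leaf partition (>= 0), a routing failure (-1), or, if
 * i < -1, the next dispatch slot to use, decoded as -(i+1).
 */
typedef struct PartitionDispatchData
{
    int nbounds;
    int bounds[4];      /* ascending upper bounds (toy range key)  */
    int indexes[4];     /* one entry per bound; see decoding above */
} PartitionDispatchData;

static int
route_tuple(const PartitionDispatchData *dispatch, int key)
{
    int k = 0;                          /* start at the root, slot 0 */

    for (;;)
    {
        const PartitionDispatchData *pd = &dispatch[k];
        int off = 0;
        int idx;

        /* find first bound > key; real code would bsearch here */
        while (off < pd->nbounds && key >= pd->bounds[off])
            off++;
        idx = (off < pd->nbounds) ? pd->indexes[off] : -1;

        if (idx >= -1)
            return idx;                 /* leaf found, or -1 = no match */
        k = -(idx + 1);                 /* descend to subpartitioning   */
    }
}
```

With a root split at 10 and 20 whose middle range is subpartitioned at 15, routing a key of 17 first hits the negative index pointing at slot 1 and then resolves to the leaf for [15, 20), with no recursion and no pointer chasing.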

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/18 1:43, Robert Haas wrote:
> On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote wrote:
>> OK, I will share the performance results soon.
> 
> Thanks.
> 
>>> Also, in 0006:
>>>
>>> - I doubt that PartitionTreeNodeData's header comment will survive
>>> contact with pgindent.
>>
>> Fixed by adding "/* ----" at the top of the comment.
> 
> OK.  Hopefully you also tested that.  In general, with a patch set
> this large, the more you can do to make it pgindent-clean, the better.
> Ideally none of your code would get changed by pgindent, or at least
> not the new files.  But note that you will probably have to
> update typedefs.list if you actually want to be able to run it without
> having it mangle things that involve your new structs.

OK, I was afraid that running pgindent might end up introducing unrelated
diffs in the patches, but since any code committed to HEAD will have been
through pgindent anyway, that worry is probably pointless.  Also, thanks
for the tip about typedefs.list.  I will go try pgindent'ing the patches.

>>> - The code in make_modifytable() to swap out the rte_array for a fake
>>> one looks like an unacceptable kludge.  I don't know offhand what a
>>> better design would look like, but what you've got is really ugly.
>>
>> Agree that it looks horrible.  The problem is we don't add partition
>> (child table) RTEs when planning an insert on the parent and FDW
>> partitions can't do without some planner handling - planForeignModify()
>> expects a valid PlannerInfo for deparsing target lists (basically, to be
>> able to use planner_rt_fetch()).
>>
>> Any design beside this kludge would perhaps mean that we will be adding
>> valid partition RTEs at some earlier planning stage much like what
>> expand_inherited_rtentry() does during non-insert queries.  Perhaps, we
>> could have expand_inherited_rtentry() make an exception in partitioned
>> tables' case and create root->append_rel_list members even in the insert
>> case.  We could then go over the append_rel_list in make_modifytable() to
>> get the valid RT index of foreign child tables to pass to
>> PlanForeignModify().  Using append_rel_list for insert planning is perhaps
>> equally ugly though. Thoughts?
> 
> If it's only needed for foreign tables, how about for v1 we just throw
> an error and say errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> errmsg("cannot route inserted tuples to a foreign table") for now.  We
> can come back and fix it later.  Doing more inheritance expansion
> early is probably not a good idea because it likely sucks for
> performance, and that's especially unfortunate if it's only needed for
> foreign tables.  Coming up with some new FDW API or some modification
> to the existing one is probably better, but I don't really want to get
> hung up on that right now.

OK, I agree with the decision to make tuple routing an unsupported feature
in the first cut as far as foreign partitions are concerned.  One can
still insert data into foreign partitions directly.

I am assuming that the error should be thrown all the way below
ExecInsert(), not in the planner, which means I should revert any changes
I've made to the planner in this patch.

>>> - I don't understand how it can be right for get_partition_for_tuple()
>>> to be emitting an error that says "range partition key contains null".
>>> A defective partition key should be detected at partition creation
>>> time, not in the middle of tuple routing.
>>
>> That error is emitted when the partition key of the *input tuple* is found
>> to contain NULLs.  Maybe we need to consider something like how btree
>> indexes treat NULLs - order NULLs either before or after all the non-NULL
>> values based on NULLS FIRST/LAST config.  Thoughts?
> 
> Well, I'd say my confusion suggests that the error message needs to be
> clearer about what the problem is.  I think this is basically a case
> of there being no partition for the given row, so maybe just arrange
> to use that same message here ("no partition of relation \"%s\" found
> for row").

The reason NULLs in an input row are caught and rejected (with the current
message) before control reaches range_partition_for_tuple() is because
it's not clear to me whether the range bound comparison logic in
partition_rbound_datum_cmp() should be prepared to handle NULLs and what
the results of comparisons should look like.  Currently, all it ever
expects to see in the input tuple's partition key is non-NULL datums.
Comparison proceeds as follows: if a range bound datum is a finite value,
we invoke the comparison proc or if it is infinite, we conclude that the
input tuple is > or < the bound in question based on whether the bound is
a lower or upper bound, respectively.
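The comparison rule just described can be sketched in isolation.  This is a hedged toy model: plain int datums and a throwaway struct stand in for the real partition key datums and partition_rbound_datum_cmp(); only the infinite-bound handling is the point.

```c
#include <assert.h>

/*
 * Toy model of the rule above: a finite bound datum is compared with
 * the key's ordinary comparison function (plain int comparison here);
 * an infinite datum sorts below every tuple value when it belongs to a
 * lower bound and above every tuple value when it belongs to an upper
 * bound.
 */
typedef struct RangeBoundDatum
{
    int value;          /* ignored when infinite is set           */
    int infinite;       /* nonzero: this bound datum is unbounded */
    int lower;          /* nonzero: lower bound, else upper bound */
} RangeBoundDatum;

/* Returns <0, 0, >0 as the bound sorts before, equal to, after tupval. */
static int
rbound_datum_cmp(const RangeBoundDatum *bound, int tupval)
{
    if (bound->infinite)
        return bound->lower ? -1 : 1;
    return (bound->value > tupval) - (bound->value < tupval);
}
```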

Or are you saying that get_partition_for_tuple() should simply return -1
(partition not found) in case of encountering a NULL in range partition
key to the caller instead of throwing error as is now?  If the user sees
the message and decides to create a new range partition that *will* accept
such a row, how do they decide what its boundaries should be?

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/18 4:14, Robert Haas wrote:
> On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Meanwhile, here are updated patches that address most of the following comments.
>
> OK, I have re-reviewed 0005 and it looks basically fine to me, modulo
> a few minor nitpicks. "This is called, *iff*" shouldn't have a comma
> there,

Fixed.

> and I think the entire comment block that starts with "NOTE:
> SQL specifies that a NULL" and ends with "it's unlikely that NULL
> would result." should be changed to say something like /* As for
> catalogued constraints, we treat a NULL result as a success, not a
> failure. */ rather than duplicating an existing comment that doesn't
> quite apply here.

Ah, you're right that the comment does not apply as it is.  I rewrote that
comment.

Oh but wait, that means I can insert rows with NULLs in the range
partition key if I choose to insert it directly into the partition,
whereas I have been thinking all this while that there could never be
NULLs in the partition key of a range partition.  What's more,
get_qual_for_partbound() (patch 0003) emits an IS NOT NULL constraint for
every partition key column in case of a range partition.  Is that
wrongheaded altogether?  (also see my reply to your earlier message about
NULLs in the range partition key)

>  Finally, ExecConstraints() contains a new if-block
> whose sole contents are another if-block.  Perhaps if (this && that)
> would be better.

Agreed, should have noticed that.

> Regarding 0006 and 0007, I think the PartitionTreeNodeData structure
> you've chosen is awfully convoluted and probably not that efficient.
> For example, get_partition_for_tuple() contains this loop:
>
> +       prev = parent;
> +       node = parent->downlink;
> +       while (node != NULL)
> +       {
> +               if (node->index >= cur_idx)
> +                       break;
> +
> +               prev = node;
> +               node = node->next;
> +       }
>
> Well, it looks to me like that's an O(n) way to find the n'th
> partition, which seems like a pretty bad idea in performance-critical
> code, which this is.  I think this whole structure needs a pretty
> heavy overhaul.  Here's a proposal:

Thanks for the idea below!

> 1. Forget the idea of a tree.  Instead, let the total number of tables
> in the partitioning hierarchy be N and let the number of those that
> are partitioned be K.  Assign each partitioned table in the hierarchy
> an index between 0 and K-1.  Make your top level data structure (in
> lieu of PartitionTreeNodeData) be an array of K PartitionDispatch
> objects, with the partitioning root in entry 0 and the rest in the
> remaining entries.
>
> 2. Within each PartitionDispatch object, store (a) a pointer to a
> PartitionDesc and (b) an array of integers of length equal to the
> PartitionDesc's nparts value.  Each integer i, if non-negative, is the
> final return value for get_partition_for_tuple.  If i == -1, tuple
> routing fails.  If i < -1, we must next route using the subpartition
> whose PartitionDesc is at index -(i+1).  Arrange for the array to be
> in the same order the PartitionDesc's OID list.
>
> 3. Now get_partition_for_tuple looks something like this:
>
> K = 0
> loop:
>     pd = PartitionDispatch[K]
>     idx = list/range_partition_for_tuple(pd->partdesc, ...)
>     if (idx >= -1)
>         return idx
>     K = -(idx + 1)
>
> No recursion, minimal pointer chasing, no linked lists.  The whole
> thing is basically trivial aside from the cost of
> list/range_partition_for_tuple itself; optimizing that is a different
> project.  I might have some details slightly off here, but hopefully
> you can see what I'm going for: you want to keep the computation that
> happens in get_partition_for_tuple() to an absolute minimum, and
> instead set things up in advance so that getting the partition for a
> tuple is FAST.  And you want the data structures that you are using in
> that process to be very compact, hence arrays instead of linked lists.

This sounds *much* better.  Here is a quick attempt at coding the design
you have outlined above in the attached latest set of patches.

PS: I haven't run the patches through pgindent yet.

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Nov 17, 2016 at 8:18 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The reason NULLs in an input row are caught and rejected (with the current
> message) before control reaches range_partition_for_tuple() is because
> it's not clear to me whether the range bound comparison logic in
> partition_rbound_datum_cmp() should be prepared to handle NULLs and what
> the results of comparisons should look like.  Currently, all it ever
> expects to see in the input tuple's partition key is non-NULL datums.
> Comparison proceeds as follows: if a range bound datum is a finite value,
> we invoke the comparison proc or if it is infinite, we conclude that the
> input tuple is > or < the bound in question based on whether the bound is
> a lower or upper bound, respectively.
>
> Or are you saying that get_partition_for_tuple() should simply return -1
> (partition not found) in case of encountering a NULL in range partition
> key to the caller instead of throwing error as is now?  If the user sees
> the message and decides to create a new range partition that *will* accept
> such a row, how do they decide what its boundaries should be?

Well, I think the first thing you have to decide is whether range
partitioning is going to support NULL values in the partition key
columns at all.  If you want it to support that, then you've got to
decide how it's going to be specified in the SQL syntax.  I had the
impression that you were planning not to support that, in which case
you need to reject all such rows.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Oh but wait, that means I can insert rows with NULLs in the range
> partition key if I choose to insert it directly into the partition,
> whereas I have been thinking all this while that there could never be
> NULLs in the partition key of a range partition.  What's more,
> get_qual_for_partbound() (patch 0003) emits an IS NOT NULL constraint for
> every partition key column in case of a range partition.  Is that
> wrongheaded altogether?  (also see my reply to your earlier message about
> NULLs in the range partition key)

The easiest thing to do might be to just enforce that all of the
partition key columns have to be not-null when the range-partitioned
table is defined, and reject any attempt to DROP NOT NULL on them
later.  That's probably better than shoehorning it into the table
constraint.

> Thanks for the idea below!
>
>> 1. Forget the idea of a tree.  Instead, let the total number of tables
>> in the partitioning hierarchy be N and let the number of those that
>> are partitioned be K.  Assign each partitioned table in the hierarchy
>> an index between 0 and K-1.  Make your top level data structure (in
>> lieu of PartitionTreeNodeData) be an array of K PartitionDispatch
>> objects, with the partitioning root in entry 0 and the rest in the
>> remaining entries.
>>
>> 2. Within each PartitionDispatch object, store (a) a pointer to a
>> PartitionDesc and (b) an array of integers of length equal to the
>> PartitionDesc's nparts value.  Each integer i, if non-negative, is the
>> final return value for get_partition_for_tuple.  If i == -1, tuple
>> routing fails.  If i < -1, we must next route using the subpartition
>> whose PartitionDesc is at index -(i+1).  Arrange for the array to be
>> in the same order the PartitionDesc's OID list.
>>
>> 3. Now get_partition_for_tuple looks something like this:
>>
>> K = 0
>> loop:
>>     pd = PartitionDispatch[K]
>>     idx = list/range_partition_for_tuple(pd->partdesc, ...)
>>     if (idx >= -1)
>>         return idx
>>     K = -(idx + 1)
>>
>> No recursion, minimal pointer chasing, no linked lists.  The whole
>> thing is basically trivial aside from the cost of
>> list/range_partition_for_tuple itself; optimizing that is a different
>> project.  I might have some details slightly off here, but hopefully
>> you can see what I'm going for: you want to keep the computation that
>> happens in get_partition_for_tuple() to an absolute minimum, and
>> instead set things up in advance so that getting the partition for a
>> tuple is FAST.  And you want the data structures that you are using in
>> that process to be very compact, hence arrays instead of linked lists.
>
> This sounds *much* better.  Here is a quick attempt at coding the design
> you have outlined above in the attached latest set of patches.

That shrank both 0006 and 0007 substantially, and it should be faster,
too.   I bet you can shrink them further:

- Why is PartitionKeyExecInfo a separate structure and why does it
have a NodeTag?  I bet you can dump the node tag, merge it into
PartitionDispatch, and save some more code and some more
pointer-chasing.

- I still think it's a seriously bad idea for list partitioning and
range partitioning to need different code-paths all over the place
here. List partitions support nulls but not multi-column partitioning
keys and range partitions support multi-column partitioning keys but
not nulls, but you could use an internal structure that supports both.
Then you wouldn't need partition_list_values_bsearch and also
partition_rbound_bsearch; you could have one kind of bound structure
that can be bsearch'd for either list or range.  You might even be
able to unify list_partition_for_tuple and range_partition_for_tuple
although that looks a little harder.  In either case, you bsearch for
the greatest value <= the value you have.  The only difference is that
for list partitioning, you have to enforce at the end that it is an
equal value, whereas for range partitioning less-than-or-equal-to is
enough.  But you should still be able to arrange for more code
sharing.

- I don't see why you need the bound->lower stuff any more.  If
rangeinfo.bounds[offset] is a lower bound for a partition, then
rangeinfo.bounds[offset+1] is either (a) the upper bound for that
partition and the partition is followed by a "gap" or (b) both the
upper bound for that partition and the lower bound for the next
partition.  With the inclusive/exclusive bound stuff gone, every range
bound has the same sense: if the probed value is <= the bound then
we're supposed to be a lower-numbered partition, but if > then we're
supposed to be in this partition or a higher-numbered one.
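As a toy illustration of that layout (int datums and a gap marker standing in for the real bound datums; this is not PostgreSQL code), a single sorted array can serve as both lower and upper bounds, with bounds[i] and bounds[i+1] delimiting the partition recorded at indexes[i] and -1 marking a gap:

```c
#include <assert.h>

/* Shared-bound layout: bounds[i] is the lower bound of the partition in
 * indexes[i], bounds[i+1] its upper bound; indexes[i] == -1 is a gap. */
typedef struct RangeInfo
{
    int nbounds;
    int bounds[4];      /* nbounds sorted boundary values           */
    int indexes[3];     /* nbounds - 1 slots: partition index or -1 */
} RangeInfo;

static int
range_partition_for_value(const RangeInfo *ri, int val)
{
    int i;

    /* linear scan for brevity; the real structure is binary-searched */
    for (i = 0; i < ri->nbounds - 1; i++)
        if (ri->bounds[i] <= val && val < ri->bounds[i + 1])
            return ri->indexes[i];
    return -1;          /* below the first bound or above the last */
}
```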

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
Updated patches attached.  I merged what used to be 0006 and 0007 into one.

On 2016/11/19 2:23, Robert Haas wrote:
> On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Oh but wait, that means I can insert rows with NULLs in the range
>> partition key if I choose to insert it directly into the partition,
>> whereas I have been thinking all this while that there could never be
>> NULLs in the partition key of a range partition.  What's more,
>> get_qual_for_partbound() (patch 0003) emits an IS NOT NULL constraint for
>> every partition key column in case of a range partition.  Is that
>> wrongheaded altogether?  (also see my reply to your earlier message about
>> NULLs in the range partition key)
>
> The easiest thing to do might be to just enforce that all of the
> partition key columns have to be not-null when the range-partitioned
> table is defined, and reject any attempt to DROP NOT NULL on them
> later.  That's probably better than shoehorning it into the table
> constraint.

Agreed that forcing range partitioning columns to be NOT NULL during table
creation would be a better approach.  But then we would have to reject
using expressions in the range partition key, right?

>> Thanks for the idea below!
>>
>>> 1. Forget the idea of a tree.  Instead, let the total number of tables
>>> in the partitioning hierarchy be N and let the number of those that

[ ... ]

>>>
>>> No recursion, minimal pointer chasing, no linked lists.  The whole
>>> thing is basically trivial aside from the cost of
>>> list/range_partition_for_tuple itself; optimizing that is a different
>>> project.  I might have some details slightly off here, but hopefully
>>> you can see what I'm going for: you want to keep the computation that
>>> happens in get_partition_for_tuple() to an absolute minimum, and
>>> instead set things up in advance so that getting the partition for a
>>> tuple is FAST.  And you want the data structures that you are using in
>>> that process to be very compact, hence arrays instead of linked lists.
>>
>> This sounds *much* better.  Here is a quick attempt at coding the design
>> you have outlined above in the attached latest set of patches.
>
> That shrank both 0006 and 0007 substantially, and it should be faster,
> too.   I bet you can shrink them further:

Some changes described below have reduced the size to a certain degree.

>
> - Why is PartitionKeyExecInfo a separate structure and why does it
> have a NodeTag?  I bet you can dump the node tag, merge it into
> PartitionDispatch, and save some more code and some more
> pointer-chasing.

OK, I merged the fields of what used to be PartitionKeyExecInfo into
PartitionDispatchData as the latter's new fields key and keystate.

> - I still think it's a seriously bad idea for list partitioning and
> range partitioning to need different code-paths all over the place
> here. List partitions support nulls but not multi-column partitioning
> keys and range partitions support multi-column partitioning keys but
> not nulls, but you could use an internal structure that supports both.
> Then you wouldn't need partition_list_values_bsearch and also
> partition_rbound_bsearch; you could have one kind of bound structure
> that can be bsearch'd for either list or range.  You might even be
> able to unify list_partition_for_tuple and range_partition_for_tuple
> although that looks a little harder.  In either case, you bsearch for
> the greatest value <= the value you have.  The only difference is that
> for list partitioning, you have to enforce at the end that it is an
> equal value, whereas for range partitioning less-than-or-equal-to is
> enough.  But you should still be able to arrange for more code
> sharing.

I have considered these suggestions in the latest patch.  Now instead of
PartitionListInfo, PartitionRangeInfo, and BoundCollectionData structs,
there is only one PartitionBoundInfo which consolidates the partition
bound information of a partitioned table.  Some of the fields are
applicable only to one of list or range case; for example, null-accepting
list partition index, infinite status of individual range datums.

Also, there is now only one binary search function named
partition_bound_bsearch() which invokes a comparison function named
partition_bound_cmp().  The former searches a probe (a partition bound or
tuple) within a PartitionBoundInfo, which is passed all the way down to
the comparison function.

Also, we no longer have list_partition_for_tuple() and
range_partition_for_tuple().  Instead, in get_partition_for_tuple()
itself, there is a bsearch followed by list and range partitioning
specific steps based on the returned offset.

> - I don't see why you need the bound->lower stuff any more.  If
> rangeinfo.bounds[offset] is a lower bound for a partition, then
> rangeinfo.bounds[offset+1] is either (a) the upper bound for that
> partition and the partition is followed by a "gap" or (b) both the
> upper bound for that partition and the lower bound for the next
> partition.  With the inclusive/exclusive bound stuff gone, every range
> bound has the same sense: if the probed value is <= the bound then
> we're supposed to be a lower-numbered partition, but if > then we're
> supposed to be in this partition or a higher-numbered one.

OK, I've managed to get rid of lower.  At least it is no longer kept in
the new relcache struct PartitionBoundInfo.  It is still kept in
PartitionRangeBound which is used to hold individual range bounds when
sorting them (during relcache build).  Comparisons invoked during the
aforementioned sorting step still need to distinguish between lower and
upper bounds (such that '1)' < '[1').

Tuple-routing no longer needs to look at lower.  In that case, what you
described above applies.

As a result, one change became necessary: to how we flag individual range
bound datum as infinite or not.  Previously, it was a regular Boolean
value (either infinite or not) and to distinguish +infinity from
-infinity, we looked at whether the bound is lower or upper (the lower
flag).  Now, instead, the variable holding the status of individual range
bound datum is set to a ternary value: RANGE_DATUM_FINITE (0),
RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in
a bool.  Upon encountering an infinite range bound datum, whether it's
negative or positive infinity determines the comparison result.  Consider the
following example:

partition p1 from (1, unbounded) to (1, 1);
partition p2 from (1, 1) to (1, 10);
partition p3 from (1, 10) to (1, unbounded);
partition p4 from (2, unbounded) to (2, 1);
... so on

In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) <
(1, +inf), so that tuple (1, 15) is assigned to the proper partition.

Does this last thing sound reasonable?

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Rushabh Lathia
Date:
Hi Amit,

I was just reading through your patches, and here are some quick review
comments for 0001-Catalog-and-DDL-for-partitioned-tables-17.patch:

1)
@@ -1102,9 +1104,10 @@ heap_create_with_catalog(const char *relname,
     {
         /* Use binary-upgrade override for pg_class.oid/relfilenode? */
         if (IsBinaryUpgrade &&
-            (relkind == RELKIND_RELATION || relkind == RELKIND_SEQUENCE ||
-             relkind == RELKIND_VIEW || relkind == RELKIND_MATVIEW ||
-             relkind == RELKIND_COMPOSITE_TYPE || relkind == RELKIND_FOREIGN_TABLE))
+            (relkind == RELKIND_RELATION || relkind == RELKIND_PARTITIONED_TABLE ||
+             relkind == RELKIND_SEQUENCE || relkind == RELKIND_VIEW ||
+             relkind == RELKIND_MATVIEW || relkind == RELKIND_COMPOSITE_TYPE ||
+             relkind == RELKIND_FOREIGN_TABLE))

You should add RELKIND_PARTITIONED_TABLE at the end of the if condition; that
will keep the diff minimal. While reading through the patch I noticed an
inconsistency - in some places it's been added at the end and in a few places
at the start. I think you should add it at the end of the condition and be
consistent in each place.

2)

+        /*
+         * We need to transform the raw parsetrees corresponding to partition
+         * expressions into executable expression trees.  Like column defaults
+         * and CHECK constraints, we could not have done the transformation
+         * earlier.
+         */


There is an additional space before "Like column defaults".

3)
-    char        relkind;
+    char        relkind,
+                expected_relkind;

The newly added variable should be defined separately with its type. Something like:

    char        relkind;
+    char        expected_relkind;

4)

a)
+    /* Prevent partitioned tables from becoming inheritance parents */
+    if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+        ereport(ERROR,
+                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                 errmsg("cannot inherit from partitioned table \"%s\"",
+                         parent->relname)));
+

The last line needs alignment.

b)
+            atttuple = SearchSysCacheAttName(RelationGetRelid(rel), pelem->name);
+            if (!HeapTupleIsValid(atttuple))
+                ereport(ERROR,
+                        (errcode(ERRCODE_UNDEFINED_COLUMN),
+                         errmsg("column \"%s\" named in partition key does not exist",
+                         pelem->name)));
+            attform = (Form_pg_attribute) GETSTRUCT(atttuple);
+
+            if (attform->attnum <= 0)
+                ereport(ERROR,
+                        (errcode(ERRCODE_UNDEFINED_COLUMN),
+                         errmsg("cannot use system column \"%s\" in partition key",
+                         pelem->name)));

The last line of the ereport needs alignment.

c)
+        /* Disallow ROW triggers on partitioned tables */
+        if (stmt->row && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+            ereport(ERROR,
+                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                    errmsg("\"%s\" is a partitioned table",
+                            RelationGetRelationName(rel)),
+              errdetail("Partitioned tables cannot have ROW triggers.")));

Needs alignment.

5)

@@ -2512,6 +2579,7 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt,
     cxt.blist = NIL;
     cxt.alist = NIL;
     cxt.pkey = NULL;
+    cxt.ispartitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;

I think adding brackets will make the code clearer.

+    cxt.ispartitioned = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);

6)

+ * RelationBuildPartitionKey
+ *        Build and attach to relcache partition key data of relation
+ *
+ * Partitioning key data is stored in CacheMemoryContext to ensure it survives
+ * as long as the relcache.  To avoid leaking memory in that context in case
+ * of an error partway through this function, we build the structure in the
+ * working context (which must be short-lived) and copy the completed
+ * structure into the cache memory.

There is an extra space before "To avoid leaking memory".

7)
+    /* variable-length fields start here, but we allow direct access to partattrs */
+    int2vector        partattrs;        /* attribute numbers of columns in the

Why partattrs allows direct access is not really clear from the comment.

I will continue reading the rest of the patch and testing the functionality,
and will share comments as I have them.

Thanks,

On Tue, Nov 22, 2016 at 2:45 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Updated patches attached.  I merged what used to be 0006 and 0007 into one.

On 2016/11/19 2:23, Robert Haas wrote:
> On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Oh but wait, that means I can insert rows with NULLs in the range
>> partition key if I choose to insert it directly into the partition,
>> whereas I have been thinking all this while that there could never be
>> NULLs in the partition key of a range partition.  What's more,
>> get_qual_for_partbound() (patch 0003) emits an IS NOT NULL constraint for
>> every partition key column in case of a range partition.  Is that
>> wrongheaded altogether?  (also see my reply to your earlier message about
>> NULLs in the range partition key)
>
> The easiest thing to do might be to just enforce that all of the
> partition key columns have to be not-null when the range-partitioned
> table is defined, and reject any attempt to DROP NOT NULL on them
> later.  That's probably better than shoehorning it into the table
> constraint.

Agreed that forcing range partitioning columns to be NOT NULL during table
creation would be a better approach.  But then we would have to reject
using expressions in the range partition key, right?

>> Thanks for the idea below!
>>
[ ... ]


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers




--
Rushabh Lathia

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Nov 22, 2016 at 4:15 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The easiest thing to do might be to just enforce that all of the
>> partition key columns have to be not-null when the range-partitioned
>> table is defined, and reject any attempt to DROP NOT NULL on them
>> later.  That's probably better than shoehorning it into the table
>> constraint.
>
> Agreed that forcing range partitioning columns to be NOT NULL during table
> creation would be a better approach.  But then we would have to reject
> using expressions in the range partition key, right?

Why?

> As a result, one change became necessary: to how we flag individual range
> bound datum as infinite or not.  Previously, it was a regular Boolean
> value (either infinite or not) and to distinguish +infinity from
> -infinity, we looked at whether the bound is lower or upper (the lower
> flag).  Now, instead, the variable holding the status of individual range
> bound datum is set to a ternary value: RANGE_DATUM_FINITE (0),
> RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in
> a bool.

You better not be using a bool to represent a ternary value!  Use an
enum for that -- or if in the system catalogs, a char.

> Upon encountering an infinite range bound datum, whether it's
> negative or positive infinity determines the comparison result.  Consider the
> following example:
>
> partition p1 from (1, unbounded) to (1, 1);
> partition p2 from (1, 1) to (1, 10);
> partition p3 from (1, 10) to (1, unbounded);
> partition p4 from (2, unbounded) to (2, 1);
> ... so on
>
> In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) <
> (1, +inf), so that tuple (1, 15) is assigned to the proper partition.

Right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
I am trying to create a partitioned table with primary keys on the
partitions. Here's the corresponding syntax as per documentation
CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [
IF NOT EXISTS ] table_name
    PARTITION OF parent_table [ (
  { column_name WITH OPTIONS [ column_constraint [ ... ] ]
    | table_constraint }
    [, ... ]
) ] partition_bound_spec

IIUC, it should allow "create table t1_p1 partition of t1 (a primary
key) ...", (a primary key) is nothing but "column_name
column_constraint", but here's what happens
create table t1_p1 partition of t1 (a primary key) for values from (0) to (100);
ERROR:  syntax error at or near "primary"
LINE 1: create table t1_p1 partition of t1 (a primary key) for value...

The same syntax also suggests using table_constraints but that too doesn't work
create table t1_p1 partition of t1 (primary key (a)) for values from (0) to (100);
ERROR:  inherited relation "t1" is not a table or foreign table

of course t1 is a table; why does it say it isn't?

Am I missing something? How do I define constraints on the partitions?


On Tue, Nov 22, 2016 at 2:45 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> [ ... ]



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
Unsurprisingly ALTER TABLE worked
alter table t1_p1 add primary key(a);
ALTER TABLE
postgres=# \d+ t1_p1
                                  Table "public.t1_p1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           | not null |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
Indexes:
    "t1_p1_pkey" PRIMARY KEY, btree (a)
Inherits: t1

On Thu, Nov 24, 2016 at 11:05 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I am trying to create a partitioned table with primary keys on the
> partitions. Here's the corresponding syntax as per documentation
> CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [
> IF NOT EXISTS ] table_name
>     PARTITION OF parent_table [ (
>   { column_name WITH OPTIONS [ column_constraint [ ... ] ]
>     | table_constraint }
>     [, ... ]
> ) ] partition_bound_spec
>
> IIUC, it should allow "create table t1_p1 partition of t1 (a primary
> key) ...", (a primary key) is nothing but "column_name
> column_constraint", but here's what happens
> create table t1_p1 partition of t1 (a primary key) for values from (0) to (100);
> ERROR:  syntax error at or near "primary"
> LINE 1: create table t1_p1 partition of t1 (a primary key) for value...
>
> The same syntax also suggests using table_constraints but that too doesn't work
>  create table t1_p1 partition of t1 (primary key (a) )  for values
> from (0) to (100);
> ERROR:  inherited relation "t1" is not a table or foreign table
>
> of course t1 is a table, what it isn't?
>
> Am I missing something? How do I define constraints on the partitions?
>
>
> On Tue, Nov 22, 2016 at 2:45 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> Updated patches attached.  I merged what used to be 0006 and 0007 into one.
>>
>> On 2016/11/19 2:23, Robert Haas wrote:
>>> On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> Oh but wait, that means I can insert rows with NULLs in the range
>>>> partition key if I choose to insert it directly into the partition,
>>>> whereas I have been thinking all this while that there could never be
>>>> NULLs in the partition key of a range partition.  What's more,
>>>> get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for
>>>> every partition key column in case of a range partition.  Is that
>>>> wrongheaded altogether?  (also see my reply to your earlier message about
>>>> NULLs in the range partition key)
>>>
>>> The easiest thing to do might be to just enforce that all of the
>>> partition key columns have to be not-null when the range-partitioned
>>> table is defined, and reject any attempt to DROP NOT NULL on them
>>> later.  That's probably better that shoehorning it into the table
>>> constraint.
>>
>> Agreed that forcing range partitioning columns to be NOT NULL during table
>> creation would be a better approach.  But then we would have to reject
>> using expressions in the range partition key, right?
>>
>>>> Thanks for the idea below!
>>>>
>>>>> 1. Forget the idea of a tree.  Instead, let the total number of tables
>>>>> in the partitioning hierarchy be N and let the number of those that
>>
>> [ ... ]
>>
>>>>>
>>>>> No recursion, minimal pointer chasing, no linked lists.  The whole
>>>>> thing is basically trivial aside from the cost of
>>>>> list/range_partition_for_tuple itself; optimizing that is a different
>>>>> project.  I might have some details slightly off here, but hopefully
>>>>> you can see what I'm going for: you want to keep the computation that
>>>>> happens in get_partition_for_tuple() to an absolute minimum, and
>>>>> instead set things up in advance so that getting the partition for a
>>>>> tuple is FAST.  And you want the data structures that you are using in
>>>>> that process to be very compact, hence arrays instead of linked lists.
>>>>
>>>> This sounds *much* better.  Here is a quick attempt at coding the design
>>>> you have outlined above in the attached latest set of patches.
>>>
>>> That shrank both 0006 and 0007 substantially, and it should be faster,
>>> too.   I bet you can shrink them further:
>>
>> Some changes described below have reduced the size to a certain degree.
>>
>>>
>>> - Why is PartitionKeyExecInfo a separate structure and why does it
>>> have a NodeTag?  I bet you can dump the node tag, merge it into
>>> PartitionDispatch, and save some more code and some more
>>> pointer-chasing.
>>
>> OK, I merged the fields of what used to be PartitionKeyExecInfo into
>> PartitionDispatchData as the latter's new fields key and keystate.
>>
>>> - I still think it's a seriously bad idea for list partitioning and
>>> range partitioning to need different code-paths all over the place
>>> here. List partitions support nulls but not multi-column partitioning
>>> keys and range partitions support multi-column partitioning keys but
>>> not nulls, but you could use an internal structure that supports both.
>>> Then you wouldn't need partition_list_values_bsearch and also
>>> partition_rbound_bsearch; you could have one kind of bound structure
>>> that can be bsearch'd for either list or range.  You might even be
>>> able to unify list_partition_for_tuple and range_partition_for_tuple
>>> although that looks a little harder.  In either case, you bsearch for
>>> the greatest value <= the value you have.  The only difference is that
>>> for list partitioning, you have to enforce at the end that it is an
>>> equal value, whereas for range partitioning less-than-or-equal-to is
>>> enough.  But you should still be able to arrange for more code
>>> sharing.
>>
>> I have considered these suggestions in the latest patch.  Now instead of
>> PartitionListInfo, PartitionRangeInfo, and BoundCollectionData structs,
>> there is only one PartitionBoundInfo which consolidates the partition
>> bound information of a partitioned table.  Some of the fields are
>> applicable only to one of list or range case; for example, null-accepting
>> list partition index, infinite status of individual range datums.
>>
>> Also, there is now only one binary search function named
>> partition_bound_bsearch() which invokes a comparison function named
>> partition_bound_cmp().  The former searches a probe (a partition bound or
>> tuple) within a PartitionBoundInfo, which is passed all the way down to
>> the comparison function.
>>
>> Also, we no longer have list_partition_for_tuple() and
>> range_partition_for_tuple().  Instead, in get_partition_for_tuple()
>> itself, there is a bsearch followed by list and range partitioning
>> specific steps based on the returned offset.
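
A minimal sketch of that unified search, using plain integers in place of real partition bound datums (function and type names here are illustrative only, not the patch's actual code, which compares datums through the partitioning opclass):

```c
#include <assert.h>

/*
 * Illustrative sketch only: one binary search shared by list and range
 * partitioning.  Returns the offset of the greatest bound <= probe,
 * or -1 if the probe precedes all bounds.
 */
static int
bound_bsearch(const int *bounds, int nbounds, int probe)
{
    int     lo = -1;
    int     hi = nbounds - 1;

    while (lo < hi)
    {
        int     mid = (lo + hi + 1) / 2;    /* round up so the loop terminates */

        if (bounds[mid] <= probe)
            lo = mid;
        else
            hi = mid - 1;
    }

    return lo;
}

/*
 * For list partitioning, the found bound must additionally compare
 * equal; for range partitioning, less-than-or-equal-to is enough.
 */
static int
list_partition_index(const int *values, int nvalues, int probe)
{
    int     off = bound_bsearch(values, nvalues, probe);

    return (off >= 0 && values[off] == probe) ? off : -1;
}
```

With bounds {10, 20, 30}, a probe of 25 lands at offset 1 for range (20 is the greatest bound <= 25), while the list variant rejects it because 25 is not an exact match.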
>>
>>> - I don't see why you need the bound->lower stuff any more.  If
>>> rangeinfo.bounds[offset] is a lower bound for a partition, then
>>> rangeinfo.bounds[offset+1] is either (a) the upper bound for that
>>> partition and the partition is followed by a "gap" or (b) both the
>>> upper bound for that partition and the lower bound for the next
>>> partition.  With the inclusive/exclusive bound stuff gone, every range
>>> bound has the same sense: if the probed value is <= the bound then
>>> we're supposed to be a lower-numbered partition, but if > then we're
>>> supposed to be in this partition or a higher-numbered one.
>>
>> OK, I've managed to get rid of lower.  At least it is no longer kept in
>> the new relcache struct PartitionBoundInfo.  It is still kept in
>> PartitionRangeBound which is used to hold individual range bounds when
>> sorting them (during relcache build).  Comparisons invoked during the
>> aforementioned sorting step still need to distinguish between lower and
>> upper bounds (such that '1)' < '[1').
>>
>> Tuple-routing no longer needs to look at lower.  In that case, what you
>> described above applies.
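
The '1)' < '[1' ordering used while sorting could be sketched like this (hypothetical names; the real comparator works on datums via the opclass, not plain ints):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the patch's PartitionRangeBound. */
typedef struct RangeBound
{
    int     val;        /* the bound datum, simplified to an int */
    bool    lower;      /* is this a lower bound? */
} RangeBound;

/*
 * Sort comparator sketch: when the datums compare equal, an upper
 * bound ('1)') sorts before a lower bound ('[1'), so adjacent
 * partitions such as [0, 1) and [1, 2) order correctly.
 */
static int
range_bound_cmp(const RangeBound *a, const RangeBound *b)
{
    if (a->val != b->val)
        return (a->val < b->val) ? -1 : 1;
    if (a->lower == b->lower)
        return 0;
    return a->lower ? 1 : -1;   /* upper bound first on a tie */
}
```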
>>
>> As a result, one change became necessary: to how we flag individual range
>> bound datum as infinite or not.  Previously, it was a regular Boolean
>> value (either infinite or not) and to distinguish +infinity from
>> -infinity, we looked at whether the bound is lower or upper (the lower
>> flag).  Now, instead, the variable holding the status of individual range
>> bound datum is set to a ternary value: RANGE_DATUM_FINITE (0),
>> RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in
>> a bool.  Upon encountering an infinite range bound datum, whether it's
>> negative or positive infinity determines the comparison result.  Consider the
>> following example:
>>
>> partition p1 from (1, unbounded) to (1, 1);
>> partition p2 from (1, 1) to (1, 10);
>> partition p3 from (1, 10) to (1, unbounded);
>> partition p4 from (2, unbounded) to (2, 1);
>> ... so on
>>
>> In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) <
>> (1, +inf), so that tuple (1, 15) is assigned to the proper partition.
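
The multi-column comparison with infinite datums could be sketched as follows (a simplified illustration with int keys; the enum values mirror the ternary statuses described above, but the code itself is made up):

```c
#include <assert.h>

/* Ternary status of one range bound datum, per the description above. */
typedef enum RangeDatumContent
{
    RANGE_DATUM_FINITE = 0,
    RANGE_DATUM_NEG_INF,
    RANGE_DATUM_POS_INF
} RangeDatumContent;

/*
 * An infinite bound datum decides the comparison before any real
 * datum comparison happens; only finite datums are actually compared.
 */
static int
cmp_bound_datum(RangeDatumContent content, int bound_val, int tuple_val)
{
    if (content == RANGE_DATUM_NEG_INF)
        return -1;              /* bound sorts below any tuple value */
    if (content == RANGE_DATUM_POS_INF)
        return 1;               /* bound sorts above any tuple value */
    if (bound_val != tuple_val)
        return (bound_val < tuple_val) ? -1 : 1;
    return 0;
}

/* Compare a multi-column bound against a tuple, column by column. */
static int
cmp_bound_tuple(const RangeDatumContent *content, const int *bound,
                const int *tuple, int ncols)
{
    for (int i = 0; i < ncols; i++)
    {
        int     c = cmp_bound_datum(content[i], bound[i], tuple[i]);

        if (c != 0)
            return c;
    }
    return 0;
}
```

So the bound (1, -inf) compares below tuple (1, 15) and (1, +inf) compares above it, which is exactly the (1, -inf) < (1, 15) < (1, +inf) ordering needed.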
>>
>> Does this last thing sound reasonable?
>>
>> Thanks,
>> Amit



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi Ashutosh,

On 2016/11/24 14:35, Ashutosh Bapat wrote:
> I am trying to create a partitioned table with primary keys on the
> partitions. Here's the corresponding syntax as per documentation
> CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [
> IF NOT EXISTS ] table_name
>     PARTITION OF parent_table [ (
>   { column_name WITH OPTIONS [ column_constraint [ ... ] ]
>     | table_constraint }
>     [, ... ]
> ) ] partition_bound_spec
> 
> IIUC, it should allow "create table t1_p1 partition of t1 (a primary
> key) ...", (a primary key) is nothing but "column_name
> column_constraint", but here's what happens
> create table t1_p1 partition of t1 (a primary key) for values from (0) to (100);
> ERROR:  syntax error at or near "primary"
> LINE 1: create table t1_p1 partition of t1 (a primary key) for value...

You have to specify column constraints using the keywords WITH OPTIONS,
like below:

create table p1 partition of p (
    a with options primary key
) for values in (1);

> The same syntax also suggests using table_constraints but that too doesn't work
>  create table t1_p1 partition of t1 (primary key (a) )  for values
> from (0) to (100);
> ERROR:  inherited relation "t1" is not a table or foreign table
> 
> of course t1 is a table, what it isn't?

It's a bug.  I forgot to add RELKIND_PARTITIONED_TABLE to an if block
in the code that checks the inheritance parent relation's relkind when
creating an index constraint (primary key) on a child table.  Will fix,
thanks for catching it.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
On Thu, Nov 24, 2016 at 11:34 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Hi Ashutosh,
>
> On 2016/11/24 14:35, Ashutosh Bapat wrote:
>> I am trying to create a partitioned table with primary keys on the
>> partitions. Here's the corresponding syntax as per documentation
>> CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [
>> IF NOT EXISTS ] table_name
>>     PARTITION OF parent_table [ (
>>   { column_name WITH OPTIONS [ column_constraint [ ... ] ]
>>     | table_constraint }
>>     [, ... ]
>> ) ] partition_bound_spec
>>
>> IIUC, it should allow "create table t1_p1 partition of t1 (a primary
>> key) ...", (a primary key) is nothing but "column_name
>> column_constraint", but here's what happens
>> create table t1_p1 partition of t1 (a primary key) for values from (0) to (100);
>> ERROR:  syntax error at or near "primary"
>> LINE 1: create table t1_p1 partition of t1 (a primary key) for value...
>
> You have to specify column constraints using the keywords WITH OPTIONS,
> like below:
>
> create table p1 partition of p (
>     a with options primary key
> ) for values in (1);

Oh, sorry for not noticing it. You are right. Why do we need "with
options" there? Shouldn't the user be able to specify just "a primary
key"? It's not really an "option", it's a constraint.

>
>> The same syntax also suggests using table_constraints but that too doesn't work
>>  create table t1_p1 partition of t1 (primary key (a) )  for values
>> from (0) to (100);
>> ERROR:  inherited relation "t1" is not a table or foreign table
>>
>> of course t1 is a table, what it isn't?
>
> It's a bug.  I forgot to add RELKIND_PARTITIONED_TABLE to an if block
> in the code that checks the inheritance parent relation's relkind when
> creating an index constraint (primary key) on a child table.  Will fix,
> thanks for catching it.
>
> Thanks,
> Amit
>
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/24 15:10, Ashutosh Bapat wrote:
> On Thu, Nov 24, 2016 at 11:34 AM, Amit Langote wrote:
>> On 2016/11/24 14:35, Ashutosh Bapat wrote:
>>> IIUC, it should allow "create table t1_p1 partition of t1 (a primary
>>> key) ...", (a primary key) is nothing but "column_name
>>> column_constraint", but here's what happens
>>> create table t1_p1 partition of t1 (a primary key) for values from (0) to (100);
>>> ERROR:  syntax error at or near "primary"
>>> LINE 1: create table t1_p1 partition of t1 (a primary key) for value...
>>
>> You have to specify column constraints using the keywords WITH OPTIONS,
>> like below:
>>
>> create table p1 partition of p (
>>     a with options primary key
>> ) for values in (1);
> 
> Oh, sorry for not noticing it. You are right. Why do we need "with
> options" there? Shouldn't the user be able to specify just "a primary
> key"? It's not really an "option", it's a constraint.

I just adopted the existing syntax for specifying column/table constraints
of a table created with CREATE TABLE OF type_name.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
On Thu, Nov 24, 2016 at 12:02 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/11/24 15:10, Ashutosh Bapat wrote:
>> On Thu, Nov 24, 2016 at 11:34 AM, Amit Langote wrote:
>>> On 2016/11/24 14:35, Ashutosh Bapat wrote:
>>>> IIUC, it should allow "create table t1_p1 partition of t1 (a primary
>>>> key) ...", (a primary key) is nothing but "column_name
>>>> column_constraint", but here's what happens
>>>> create table t1_p1 partition of t1 (a primary key) for values from (0) to (100);
>>>> ERROR:  syntax error at or near "primary"
>>>> LINE 1: create table t1_p1 partition of t1 (a primary key) for value...
>>>
>>> You have to specify column constraints using the keywords WITH OPTIONS,
>>> like below:
>>>
>>> create table p1 partition of p (
>>>     a with options primary key
>>> ) for values in (1);
>>
>> Oh, sorry for not noticing it. You are right. Why do we need "with
>> options" there? Shouldn't the user be able to specify just "a primary
>> key"? It's not really an "option", it's a constraint.
>
> I just adopted the existing syntax for specifying column/table constraints
> of a table created with CREATE TABLE OF type_name.

Hmm, I don't find it quite intuitive. But others might find it so.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi Rushabh,

On 2016/11/22 22:11, Rushabh Lathia wrote:
> Hi Amit,
>
> I was just reading through your patches and here are some quick review
> comments
> for 0001-Catalog-and-DDL-for-partitioned-tables-17.patch.

Thanks for the review!

>
> Review comments for  0001-Catalog-and-DDL-for-partitioned-tables-17.patch:
>
> 1)
> @@ -1102,9 +1104,10 @@ heap_create_with_catalog(const char *relname,
>      {
>          /* Use binary-upgrade override for pg_class.oid/relfilenode? */
>          if (IsBinaryUpgrade &&
> -            (relkind == RELKIND_RELATION || relkind == RELKIND_SEQUENCE ||
> -             relkind == RELKIND_VIEW || relkind == RELKIND_MATVIEW ||
> -             relkind == RELKIND_COMPOSITE_TYPE || relkind ==
> RELKIND_FOREIGN_TABLE))
> +            (relkind == RELKIND_RELATION || relkind ==
> RELKIND_PARTITIONED_TABLE ||
> +             relkind == RELKIND_SEQUENCE || relkind == RELKIND_VIEW ||
> +             relkind == RELKIND_MATVIEW || relkind ==
> RELKIND_COMPOSITE_TYPE ||
> +             relkind == RELKIND_FOREIGN_TABLE))
>
> You should add RELKIND_PARTITIONED_TABLE at the end of the if condition; that
> will make the diff minimal. While reading through the patch I noticed an
> inconsistency - in some places it's been added at the end and in a few places
> at the start. I think you can add it at the end of the condition and be
> consistent everywhere.

OK, done.

>
> 2)
>
> +        /*
> +         * We need to transform the raw parsetrees corresponding to
> partition
> +         * expressions into executable expression trees.  Like column
> defaults
> +         * and CHECK constraints, we could not have done the transformation
> +         * earlier.
> +         */
>
>
> Additional space before "Like column defaults".

I think it's a common practice to add two spaces after a sentence-ending
period [https://www.gnu.org/prep/standards/html_node/Comments.html], which,
it seems, is followed more or less regularly in the formatting of the
comments in the PostgreSQL source code.

> 3)
> -    char        relkind;
> +    char        relkind,
> +                expected_relkind;
>
> A newly added variable should be defined separately with its type. Something
> like:
>
>     char        relkind;
> +    char        expected_relkind;

OK, done.

>
> 4)
>
> a)
> +    /* Prevent partitioned tables from becoming inheritance parents */
> +    if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +                 errmsg("cannot inherit from partitioned table \"%s\"",
> +                         parent->relname)));
> +
>
> need alignment for last line.

Fixed.

> b)
> +            atttuple = SearchSysCacheAttName(RelationGetRelid(rel),
> pelem->name);
> +            if (!HeapTupleIsValid(atttuple))
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_UNDEFINED_COLUMN),
> +                         errmsg("column \"%s\" named in partition key does
> not exist",
> +                         pelem->name)));
> +            attform = (Form_pg_attribute) GETSTRUCT(atttuple);
> +
> +            if (attform->attnum <= 0)
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_UNDEFINED_COLUMN),
> +                         errmsg("cannot use system column \"%s\" in
> partition key",
> +                         pelem->name)));
>
> need alignment for last line of ereport

Fixed.

> c)
> +        /* Disallow ROW triggers on partitioned tables */
> +        if (stmt->row && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +                    errmsg("\"%s\" is a partitioned table",
> +                            RelationGetRelationName(rel)),
> +              errdetail("Partitioned tables cannot have ROW triggers.")));
>
> need alignment

Fixed.

> 5)
>
> @@ -2512,6 +2579,7 @@ transformAlterTableStmt(Oid relid, AlterTableStmt
> *stmt,
>      cxt.blist = NIL;
>      cxt.alist = NIL;
>      cxt.pkey = NULL;
> +    cxt.ispartitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
>
> I think adding bracket will look code more clear.
>
> +    cxt.ispartitioned = (rel->rd_rel->relkind ==
> RELKIND_PARTITIONED_TABLE);

Agreed, done.

> 6)
>
> + * RelationBuildPartitionKey
> + *        Build and attach to relcache partition key data of relation
> + *
> + * Partitioning key data is stored in CacheMemoryContext to ensure it
> survives
> + * as long as the relcache.  To avoid leaking memory in that context in
> case
> + * of an error partway through this function, we build the structure in the
> + * working context (which must be short-lived) and copy the completed
> + * structure into the cache memory.
>
> extra space before "To avoid leaking memory"

Same thing I said above.

> 7)
> +    /* variable-length fields start here, but we allow direct access to
> partattrs */
> +    int2vector        partattrs;        /* attribute numbers of columns in
> the
>
> Why partattrs is allowed direct access - it's not really clear from the
> comments.

I have copied the comment from another catalog header file (a number of
catalog headers have the same comment).  I updated the new file's comment
to say a little bit more; I wonder if the comment should be updated in
other files as well?  However, I noticed that there are explanatory notes
elsewhere (for example, around the code that reads such a field from the
catalog) about why the first variable-length field of a catalog's tuple
(especially of type int2vector or oidvector) is directly accessible via
its C struct offset.

> I will continue reading more patch and testing functionality.. will share
> the
> comments as I have it.

Updated patches attached.  I have also considered Robert's comments [1] as
follows and fixed a bug that Ashutosh reported [2]:

- Force partition key columns to be NOT NULL at the table creation time
  if using range partitioning
- Use enum to represent the content of individual range datums (viz.
  finite datum, -infinity, +infinity) in the relcache struct

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA+TgmoZ-feQsxc7U_JerM_AFChp3Qf6btK708SAe7M8Vdv5=jA@mail.gmail.com
[2]
https://www.postgresql.org/message-id/306c85e9-c702-3742-eeff-9b7a40498afc%40lab.ntt.co.jp

Attachment

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/23 4:50, Robert Haas wrote:
> On Tue, Nov 22, 2016 at 4:15 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> The easiest thing to do might be to just enforce that all of the
>>> partition key columns have to be not-null when the range-partitioned
>>> table is defined, and reject any attempt to DROP NOT NULL on them
>>> later.  That's probably better than shoehorning it into the table
>>> constraint.
>>
>> Agreed that forcing range partitioning columns to be NOT NULL during table
>> creation would be a better approach.  But then we would have to reject
>> using expressions in the range partition key, right?
> 
> Why?

I was thinking of it like how primary key columns cannot contain
expressions; the reason for which, I assume, is because while we can
ensure that a column contains only non-null values by defining a
constraint on the column, there is no way to force expressions to be non-null.

In any case, I have implemented in the latest patch that when creating a
range partitioned table, its partition key columns are automatically set
to be NOT NULL.  However, this leaves out the columns that are referenced
in expressions.  So even after doing so, we need to check, after computing
the range partition key of an input row, that none of the partition keys
is null, because expressions can still return null.

Also, it does nothing to help the undesirable situation that one can
insert a row with a null partition key (expression) into any of the range
partitions if targeted directly, because of how ExecQual() handles
nullable constraint expressions (treats null value as satisfying the
partition constraint).

An alternative possibly worth considering might be to somehow handle the
null range partition keys within the logic to compare against range bound
datums.  It looks like other databases will map the rows containing nulls
to the unbounded partition.  One database allows specifying NULLS
FIRST/LAST and maps a row containing a null key to the partition with
-infinity as the lower bound or +infinity as the upper bound, respectively,
with NULLS LAST being the default behavior.

In our case, if we allowed a similar way of defining a range partitioned
table:

create table p (a int, b int) partition by range nulls first (a);
create table p0 partition of p for values from (unbounded) to (1);
create table p1 partition of p for values from (1) to (10);
create table p2 partition of p for values from (10) to (unbounded);

Row (null, 1) will be mapped to p0.  If we didn't have p0, we would report
the "partition not found" error.

In case of a multi-column key:

create table p (a int, b int) partition by range (a, b);
create table p0 partition of p for values from (1, unbounded) to (1, 1);
create table p1 partition of p for values from (1, 1) to (1, 10);
create table p2 partition of p for values from (1, 10) to (1, unbounded);

Row (1, null) will be mapped to p2 (default nulls last behavior).
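
One way the NULLS FIRST/LAST idea could fold into the key comparison (this is only a sketch of the proposal, which was not implemented; the names are made up):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the proposed NULLS FIRST/LAST handling: a null key column
 * compares as -infinity under NULLS FIRST and as +infinity under
 * NULLS LAST, so with the default NULLS LAST a tuple like (1, null)
 * sorts above every finite bound on that column.
 */
static int
cmp_key_column(int bound_val, int tuple_val, bool tuple_isnull,
               bool nulls_first)
{
    if (tuple_isnull)
        return nulls_first ? 1 : -1;    /* bound vs a -inf/+inf tuple value */
    if (bound_val != tuple_val)
        return (bound_val < tuple_val) ? -1 : 1;
    return 0;
}
```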

But I guess we still end up without a solution for the problem that a row
with a null partition key (expression) could be inserted into any of the
range partitions if targeted directly.

Thoughts?

>> As a result, one change became necessary: to how we flag individual range
>> bound datum as infinite or not.  Previously, it was a regular Boolean
>> value (either infinite or not) and to distinguish +infinity from
>> -infinity, we looked at whether the bound is lower or upper (the lower
>> flag).  Now, instead, the variable holding the status of individual range
>> bound datum is set to a ternary value: RANGE_DATUM_FINITE (0),
>> RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in
>> a bool.
> 
> You better not be using a bool to represent a ternary value!  Use an
> enum for that -- or if in the system catalogs, a char.

OK, created an enum called RangeDatumContent.  In the system catalog, we
still store the boolean value; it is only after we read it into the
relcache structure that we use one of these enum values.  I'm worried
though that using an enum would consume more memory (we need to store
nparts * partnattrs instances of the enum).

Thanks,
Amit





Re: Declarative partitioning - another take

From
Alvaro Herrera
Date:
Amit Langote wrote:
> On 2016/11/24 15:10, Ashutosh Bapat wrote:
> > On Thu, Nov 24, 2016 at 11:34 AM, Amit Langote wrote:

> >> You have to specify column constraints using the keywords WITH OPTIONS,
> >> like below:
> >>
> >> create table p1 partition of p (
> >>     a with options primary key
> >> ) for values in (1);
> > 
> > Oh, sorry for not noticing it. You are right. Why do we need "with
> > option" there? Shouldn't user be able to specify just "a primary key";
> > it's not really an "option", it's a constraint.
> 
> I just adopted the existing syntax for specifying column/table constraints
> of a table created with CREATE TABLE OF type_name.

I think CREATE TABLE OF is pretty much a corner case.  I agree that
allowing the constraint right after the constraint name is more
intuitive.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/25 4:36, Alvaro Herrera wrote:
> Amit Langote wrote:
>> On 2016/11/24 15:10, Ashutosh Bapat wrote:
>>> On Thu, Nov 24, 2016 at 11:34 AM, Amit Langote wrote:
> 
>>>> You have to specify column constraints using the keywords WITH OPTIONS,
>>>> like below:
>>>>
>>>> create table p1 partition of p (
>>>>     a with options primary key
>>>> ) for values in (1);
>>>
>>> Oh, sorry for not noticing it. You are right. Why do we need "with
>>> option" there? Shouldn't user be able to specify just "a primary key";
>>> it's not really an "option", it's a constraint.
>>
>> I just adopted the existing syntax for specifying column/table constraints
>> of a table created with CREATE TABLE OF type_name.
> 
> I think CREATE TABLE OF is pretty much a corner case.  I agree that
> allowing the constraint right after the constraint name is more
> intuitive.

I assume you meant "...right after the column name"?

I will modify the grammar to allow that way then, so that the following
will work:

create table p1 partition of p (
    a primary key
) for values in (1);

Thanks,
Amit





Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Nov 24, 2016 at 6:13 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/11/23 4:50, Robert Haas wrote:
>> On Tue, Nov 22, 2016 at 4:15 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> The easiest thing to do might be to just enforce that all of the
>>>> partition key columns have to be not-null when the range-partitioned
>>>> table is defined, and reject any attempt to DROP NOT NULL on them
>>>> later.  That's probably better than shoehorning it into the table
>>>> constraint.
>>>
>>> Agreed that forcing range partitioning columns to be NOT NULL during table
>>> creation would be a better approach.  But then we would have to reject
>>> using expressions in the range partition key, right?
>>
>> Why?
>
> I was thinking of it like how primary key columns cannot contain
> expressions; the reason for which, I assume, is because while we can
> ensure that a column contains only non-null values by defining a
> constraint on the column, there is no way to force expressions to be non-null.
>
> In any case, I have implemented in the latest patch that when creating a
> range partitioned table, its partition key columns are automatically set
> to be NOT NULL.  However, this leaves out the columns that are referenced
> in expressions.  So even after doing so, we need to check, after computing
> the range partition key of an input row, that none of the partition keys
> is null, because expressions can still return null.

Right.  And ensuring that those columns were NOT NULL would be wrong,
as it wouldn't guarantee a non-null result anyway.

> Also, it does nothing to help the undesirable situation that one can
> insert a row with a null partition key (expression) into any of the range
> partitions if targeted directly, because of how ExecQual() handles
> nullable constraint expressions (treats null value as satisfying the
> partition constraint).

That's going to have to be fixed somehow.  How bad would it be if we
passed ExecQual's third argument as false for partition constraints?
Or else you could generate the actual constraint as expr IS NOT NULL
AND expr >= lb AND expr < ub.

> An alternative possibly worth considering might be to somehow handle the
> null range partition keys within the logic to compare against range bound
> datums.  It looks like other databases will map the rows containing nulls
> to the unbounded partition.  One database allows specifying NULLS
> FIRST/LAST and maps a row containing a null key to the partition with
> -infinity as the lower bound or +infinity as the upper bound, respectively,
> with NULLS LAST being the default behavior.

It seems more future-proof not to allow NULLs at all for now, and
figure out what if anything we want to do about that later.  I mean,
with the syntax we've got here, anything else is basically deciding
whether NULL is the lowest value or the highest value.  It would be
convenient for my employer if we made the same decision that Oracle
did, here, but it doesn't really seem like the PostgreSQL way - or to
put that another way, it's really ugly and unprincipled.  So I
recommend we decide for now that a partitioning column can't be null
and a partitioning expression can't evaluate to NULL.  If it does,
ERROR.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Alvaro Herrera
Date:
Amit Langote wrote:
> On 2016/11/25 4:36, Alvaro Herrera wrote:

> > I think CREATE TABLE OF is pretty much a corner case.  I agree that
> > allowing the constraint right after the constraint name is more
> > intuitive.
> 
> I assume you meant "...right after the column name"?

Eh, right.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
>
> I assume you meant "...right after the column name"?
>
> I will modify the grammar to allow that way then, so that the following
> will work:
>
> create table p1 partition of p (
>      a primary key
> ) for values in (1);
>

That seems to be non-intuitive as well. The way it's written, it looks
like "a primary key" is associated with p rather than p1.

Is there any column constraint that cannot be a table constraint? If
not, then we can think of dropping the column constraint syntax
altogether and letting the user specify column constraints through the
table constraint syntax. Or we may drop constraints altogether from the
"CREATE TABLE .. PARTITION OF" syntax and let the user handle them
through ALTER TABLE commands. In a later version, we can introduce
constraint syntax in that DDL.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/25 13:51, Ashutosh Bapat wrote:
>>
>> I assume you meant "...right after the column name"?
>>
>> I will modify the grammar to allow that way then, so that the following
>> will work:
>>
>> create table p1 partition of p (
>>      a primary key
>> ) for values in (1);
>>
> 
> That seems to be non-intuitive as well. The way it's written it looks
> like "a primary key" is associated with p rather than p1.

It kind of does, but it's still a "create table p1" statement, so it's not
that ambiguous, IMHO.  Although I'm not attached to this syntax, there may
not be other options, such as moving the parenthesized list right before
"partition of", due to resulting grammar conflicts.

> Is there any column constraint that cannot be a table constraint?

NOT NULL, DEFAULT, and COLLATE (although only the first of these is really
a constraint, albeit without a pg_constraint entry)

> If not, then we can think of dropping the column constraint syntax
> altogether and letting the user specify column constraints through the
> table constraint syntax. Or we may drop constraints altogether from the
> "CREATE TABLE .. PARTITION OF" syntax and let the user handle them
> through ALTER TABLE commands. In a later version, we can introduce
> constraint syntax in that DDL.

Hmm, it was like that in the past versions of the patch, but I thought
it'd be better to follow the CREATE TABLE OF style to allow specifying the
table and column constraints during table creation time.  If many think
that it is not required (or should have some other syntax), I will modify
the patch accordingly.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/25 11:44, Robert Haas wrote:
> On Thu, Nov 24, 2016 at 6:13 AM, Amit Langote wrote:
>> Also, it does nothing to help the undesirable situation that one can
>> insert a row with a null partition key (expression) into any of the range
>> partitions if targeted directly, because of how ExecQual() handles
>> nullable constraint expressions (treats null value as satisfying the
>> partition constraint).
> 
> That's going to have to be fixed somehow.  How bad would it be if we
> passed ExecQual's third argument as false for partition constraints?
> Or else you could generate the actual constraint as expr IS NOT NULL
> AND expr >= lb AND expr < ub.

About the former, I think that might work.  If a column is NULL, it would
be caught in ExecConstraints() even before ExecQual() is called, because
of the NOT NULL constraint.  If an expression is NULL, or if for some
reason the partitioning operator (=, >=, or <) returned NULL even for a
non-NULL column or expression, then ExecQual() would fail if we passed
false for resultForNull.  Not sure if that would violate the SQL
specification though.

The latter would work too.  But I guess we would only emit expr IS NOT
NULL, not column IS NOT NULL, because columns are covered by NOT NULL
constraints.
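
The resultForNull behavior under discussion boils down to roughly this (a deliberately trivial sketch, not the executor's real code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of ExecQual's null handling: a qual that evaluates to NULL
 * counts as satisfied when resultForNull is true (the behavior for
 * ordinary constraint checks) and as violated when false.  Passing
 * false for partition constraints would make a NULL partition key
 * expression fail the constraint instead of slipping through.
 */
static bool
qual_passes(bool value, bool isnull, bool resultForNull)
{
    if (isnull)
        return resultForNull;
    return value;
}
```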

>> An alternative possibly worth considering might be to somehow handle the
>> null range partition keys within the logic to compare against range bound
>> datums.  It looks like other databases will map the rows containing nulls
>> to the unbounded partition.  One database allows specifying NULLS
>> FIRST/LAST and maps a row containing a null key to the partition with
>> -infinity as the lower bound or +infinity as the upper bound, respectively,
>> with NULLS LAST being the default behavior.
> 
> It seems more future-proof not to allow NULLs at all for now, and
> figure out what if anything we want to do about that later.  I mean,
> with the syntax we've got here, anything else is basically deciding
> whether NULL is the lowest value or the highest value.  It would be
> convenient for my employer if we made the same decision that Oracle
> did, here, but it doesn't really seem like the PostgreSQL way - or to
> put that another way, it's really ugly and unprincipled.  So I
> recommend we decide for now that a partitioning column can't be null
> and a partitioning expression can't evaluate to NULL.  If it does,
> ERROR.

OK, we can decide later if we want to handle NULLs somehow.

Thanks,
Amit





Re: Declarative partitioning - another take

From
Ashutosh Bapat
Date:
Here are some comments on 0003 patch.
1. In the ALTER TABLE documentation we should refer to the CREATE TABLE
documentation for the partition_bound_spec syntax, similar to the ADD
table_constraint note.

2. I think, "of the target table" is missing after all the ... constraints
+      match.  Also, it must have all the <literal>NOT NULL</literal> and
+      <literal>CHECK</literal> constraints present in the target table.

3. I think, using "any of the" instead of "some" is preferred in the following
sentence. In the following sentence, we should use "such a constraint" instead
of "constraints" to avoid mixed singular and plural usage.
+      If some <literal>CHECK</literal> constraint of the table being attached

4. This construction is ambiguous. "are not considered" - considered where? for
what? Constraints on which object?
+      clause.  Currently <literal>UNIQUE</literal>, <literal>PRIMARY
KEY</literal>,
+      and <literal>FOREIGN KEY</literal> constraints are not considered.

5. What is a partition constraint? Do we define that term anywhere in the
documentation? If not we should define that term and then use it here.
+      A full table scan is performed on the table being attached to check that
+      no existing row in the table violates the partition constraint.  It is

6. The paragraph following
+      A full table scan is performed on the table being attached to check that
seems to be confusing. It says that the potentially expensive scan can be
avoided by adding a constraint equivalent to the partition constraint. That
seems to be wrong. The table will be scanned anyway when adding a valid
constraint, per the ALTER TABLE ADD table_constraint documentation. Instead
you might want to say that the table scan will not be carried out if the
existing constraints imply the partition constraints.

7. You might want to specify the fate of range or lists covered by the
partition being detached in DETACH PARTITION section.

8. Since partition bound specification is more than a few words, you might want
to refer to the actual syntax in CREATE TABLE command.
+      <term><replaceable
class="PARAMETER">partition_bound_spec</replaceable></term>
+      <listitem>
+       <para>
+        The partition bound specification for a new partition.
+       </para>
+      </listitem>
+     </varlistentry>

9. I think, we need to use "may be scanned" instead of "is scanned", given that
some other part of the documentation talks about "avoiding" the table scan. May
be refer to that documentation to clarify the uncertainty.
+    Similarly, when attaching a new partition it is scanned to verify that

10. The same paragraph talks about multiple ALTER TABLE commands being run
together to avoid multiple table rewrites. Do we want to document whether
ATTACH/DETACH partition can be run with any other command or not.

11. The ATTACH PARTITION documentation does not talk about an unlogged or
temporary table being attached to a logged or non-temporary parent. What is
the current code doing about these cases? Do we allow such a mixed partition
hierarchy?
+    existing rows meet the partition constraint.

I will continue to look at the patch.

On Thu, Nov 24, 2016 at 4:04 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Hi Rushabh,
>
> On 2016/11/22 22:11, Rushabh Lathia wrote:
>> Hi Amit,
>>
>> I was just reading through your patches and here are some quick review
>> comments
>> for 0001-Catalog-and-DDL-for-partitioned-tables-17.patch.
>
> Thanks for the review!
>
>>
>> Review comments for  0001-Catalog-and-DDL-for-partitioned-tables-17.patch:
>>
>> 1)
>> @@ -1102,9 +1104,10 @@ heap_create_with_catalog(const char *relname,
>>      {
>>          /* Use binary-upgrade override for pg_class.oid/relfilenode? */
>>          if (IsBinaryUpgrade &&
>> -            (relkind == RELKIND_RELATION || relkind == RELKIND_SEQUENCE ||
>> -             relkind == RELKIND_VIEW || relkind == RELKIND_MATVIEW ||
>> -             relkind == RELKIND_COMPOSITE_TYPE || relkind ==
>> RELKIND_FOREIGN_TABLE))
>> +            (relkind == RELKIND_RELATION || relkind ==
>> RELKIND_PARTITIONED_TABLE ||
>> +             relkind == RELKIND_SEQUENCE || relkind == RELKIND_VIEW ||
>> +             relkind == RELKIND_MATVIEW || relkind ==
>> RELKIND_COMPOSITE_TYPE ||
>> +             relkind == RELKIND_FOREIGN_TABLE))
>>
>> You should add RELKIND_PARTITIONED_TABLE at the end of the if condition;
>> that will keep the diff minimal. While reading through the patch I noticed
>> an inconsistency - in some places it's been added at the end and in a few
>> places at the start. I think you can add it at the end of the condition and
>> be consistent in each place.
>
> OK, done.
>
>>
>> 2)
>>
>> +        /*
>> +         * We need to transform the raw parsetrees corresponding to
>> partition
>> +         * expressions into executable expression trees.  Like column
>> defaults
>> +         * and CHECK constraints, we could not have done the transformation
>> +         * earlier.
>> +         */
>>
>>
>> Additional space before "Like column defaults".
>
> I think it's a common practice to add two spaces after a sentence-ending
> period [https://www.gnu.org/prep/standards/html_node/Comments.html], which
> it seems, is followed more or less regularly in the formatting of the
> comments in the PostgreSQL source code.
>
>> 3)
>> -    char        relkind;
>> +    char        relkind,
>> +                expected_relkind;
>>
>> The newly added variable should be defined separately with its type.
>> Something like:
>>
>>     char        relkind;
>> +    char        expected_relkind;
>
> OK, done.
>
>>
>> 4)
>>
>> a)
>> +    /* Prevent partitioned tables from becoming inheritance parents */
>> +    if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>> +        ereport(ERROR,
>> +                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
>> +                 errmsg("cannot inherit from partitioned table \"%s\"",
>> +                         parent->relname)));
>> +
>>
>> need alignment for last line.
>
> Fixed.
>
>> b)
>> +            atttuple = SearchSysCacheAttName(RelationGetRelid(rel),
>> pelem->name);
>> +            if (!HeapTupleIsValid(atttuple))
>> +                ereport(ERROR,
>> +                        (errcode(ERRCODE_UNDEFINED_COLUMN),
>> +                         errmsg("column \"%s\" named in partition key does
>> not exist",
>> +                         pelem->name)));
>> +            attform = (Form_pg_attribute) GETSTRUCT(atttuple);
>> +
>> +            if (attform->attnum <= 0)
>> +                ereport(ERROR,
>> +                        (errcode(ERRCODE_UNDEFINED_COLUMN),
>> +                         errmsg("cannot use system column \"%s\" in
>> partition key",
>> +                         pelem->name)));
>>
>> need alignment for last line of ereport
>
> Fixed.
>
>> c)
>> +        /* Disallow ROW triggers on partitioned tables */
>> +        if (stmt->row && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>> +            ereport(ERROR,
>> +                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
>> +                    errmsg("\"%s\" is a partitioned table",
>> +                            RelationGetRelationName(rel)),
>> +              errdetail("Partitioned tables cannot have ROW triggers.")));
>>
>> need alignment
>
> Fixed.
>
>> 5)
>>
>> @@ -2512,6 +2579,7 @@ transformAlterTableStmt(Oid relid, AlterTableStmt
>> *stmt,
>>      cxt.blist = NIL;
>>      cxt.alist = NIL;
>>      cxt.pkey = NULL;
>> +    cxt.ispartitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
>>
>> I think adding brackets will make the code clearer.
>>
>> +    cxt.ispartitioned = (rel->rd_rel->relkind ==
>> RELKIND_PARTITIONED_TABLE);
>
> Agreed, done.
>
>> 6)
>>
>> + * RelationBuildPartitionKey
>> + *        Build and attach to relcache partition key data of relation
>> + *
>> + * Partitioning key data is stored in CacheMemoryContext to ensure it
>> survives
>> + * as long as the relcache.  To avoid leaking memory in that context in
>> case
>> + * of an error partway through this function, we build the structure in the
>> + * working context (which must be short-lived) and copy the completed
>> + * structure into the cache memory.
>>
>> extra space before "To avoid leaking memory"
>
> Same thing I said above.
>
>> 7)
>> +    /* variable-length fields start here, but we allow direct access to
>> partattrs */
>> +    int2vector        partattrs;        /* attribute numbers of columns in
>> the
>>
>> Why partattrs is allowed direct access - it's not really clear from the
>> comments.
>
> I have copied the comment from another catalog header file (a number of
> catalog headers have the same comment).  I updated the new file's comment
> to say a little bit more; I wonder if the comment should be updated in
> other files as well?  However, I noticed that there are explanatory notes
> elsewhere (for example, around the code that reads such a field from the
> catalog) about why the first variable-length field of a catalog's tuple
> (especially of type int2vector or oidvector) are directly accessible via
> their C struct offsets.
>
>> I will continue reading more of the patch and testing the functionality...
>> will share the comments as I have them.
>
> Updated patches attached.  I have also considered Robert's comments [1] as
> follows and fixed a bug that Ashutosh reported [2]:
>
> - Force partition key columns to be NOT NULL at the table creation time
>   if using range partitioning
> - Use enum to represent the content of individual range datums (viz.
>   finite datum, -infinity, +infinity) in the relcache struct
>
> Thanks,
> Amit
>
> [1]
> https://www.postgresql.org/message-id/CA+TgmoZ-feQsxc7U_JerM_AFChp3Qf6btK708SAe7M8Vdv5=jA@mail.gmail.com
> [2]
> https://www.postgresql.org/message-id/306c85e9-c702-3742-eeff-9b7a40498afc%40lab.ntt.co.jp



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/17 20:27, Amit Langote wrote:
> On 2016/11/16 4:21, Robert Haas wrote:
>> Have you done any performance testing on the tuple routing code?
>> Suppose we insert a million (or 10 million) tuples into an
>> unpartitioned table, a table with 10 partitions, a table with 100
>> partitions, a table with 1000 partitions, and a table that is
>> partitioned into 10 partitions each of which has 10 subpartitions.
>> Ideally, the partitioned cases would run almost as fast as the
>> unpartitioned case, but probably there will be some overhead.
>> However, it would be useful to know how much.  Also, it would be
>> useful to set up the same cases with inheritance using a PL/pgsql ON
>> INSERT trigger for tuple routing and compare.  Hopefully the tuple
>> routing code is far faster than a trigger, but we should make sure
>> that's the case and look for optimizations if not.  Also, it would be
>> useful to know how much slower the tuple-mapping-required case is than
>> the no-tuple-mapping-required case.
> 
> OK, I will share the performance results soon.

Sorry about the delay; here are some numbers with the following
partitioning schema:

# plain table
create table plain (a date, b int, c int);

# partitioned table
create table ptab (a date, b int, c int) partition by range (a, b);

Partitions (not the full commands):

ptab_00001 for values from ('2016-11-29',    1) to ('2016-11-29', 1000);
ptab_00002 for values from ('2016-11-29', 1000) to ('2016-11-29', 2000);
...
ptab_00005 for values from ('2016-11-29', 4000) to ('2016-11-29', 5000);

ptab_00006 for values from ('2016-11-30',    1) to ('2016-11-30', 1000);
...
...
ptab_NNNNN for values from ('20XX-XX-XX', 4000) to ('20XX-XX-XX', 5000);

# inheritance partitioned table
create table itab (a date, b int, c int);
create table itab_00001 (
    constraint part_check check (a = '2016-11-29' and b >=    1 and b < 1000)
) inherits (itab);
...
create table itab_00005 (
    constraint part_check check (a = '2016-11-29' and b >= 4000 and b < 5000)
) inherits (itab);
create table itab_00006 (
    constraint part_check check (a = '2016-11-30' and b >=    1 and b < 1000)
) inherits (itab);
...
...
create table itab_NNNNN (
    constraint part_check check (a = '20XX-XX-XX' and b >= 4000 and b < 5000)
) inherits (itab);


The BR trigger (on itab) procedure is as follows:

CREATE OR REPLACE FUNCTION itab_ins_trig()
RETURNS TRIGGER AS $$
DECLARE
    partno text;
BEGIN
    SELECT to_char((NEW.a - '2016-11-29'::date) * 5 + NEW.b / 1000 + 1,
                   'fm00000') INTO partno;
    EXECUTE 'INSERT INTO itab_' || partno || ' SELECT $1.*' USING NEW;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

Note that the tuple-routing procedure above assumes a fixed-stride range
partitioning scheme (shown as tg-direct-map below).  In other cases, the
simplest approach involves defining an if-else ladder, which I tried too
(shown as tg-if-else below), but reporting times only for up to 200
partitions at most (I'm sure there might be ways to be smarter there
somehow, but I didn't; the point here may only be to compare the new
tuple-routing code's overhead vs. trigger overhead in the traditional method).
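For what it's worth, the fixed-stride arithmetic that itab_ins_trig()
performs can be checked outside the database.  Here is a hypothetical
sketch in Python (the epoch date, the 5-partitions-per-day stride, and the
1000-wide b ranges are taken from the scheme above):

```python
# Sketch of the fixed-stride routing arithmetic used by itab_ins_trig():
# each date gets 5 partitions of 1000 b-values each, numbered consecutively
# starting from the epoch date 2016-11-29.
from datetime import date

EPOCH = date(2016, 11, 29)  # first partitioned date in the scheme above

def partno(a: date, b: int) -> str:
    """Zero-padded suffix of the partition that row (a, b) routes to."""
    n = (a - EPOCH).days * 5 + b // 1000 + 1
    return f"{n:05d}"

print(partno(date(2016, 11, 29), 500))   # -> 00001 (first day, first b range)
print(partno(date(2016, 11, 30), 1500))  # -> 00007 (second day, second b range)
```

The trigger then redirects each row inserted into itab to itab_<suffix>.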

# All times in seconds (on my modestly-powerful development VM)
#
# nrows = 10,000,000 generated using:
#
# INSERT INTO $tab
# SELECT '$last'::date - ((s.id % $maxsecs + 1)::bigint || 's')::interval,
#       (random() * 5000)::int % 4999 + 1,
#        case s.id % 10
#          when 0 then 'a'
#          when 1 then 'b'
#          when 2 then 'c'
#          ...
#          when 9 then 'j'
#       end
# FROM generate_series(1, $nrows) s(id)
# ORDER BY random();
#
# The first item in the select list is basically a date that won't fall
# outside the defined partitions.

Time for a plain table = 98.1 sec

#part    parted    tg-direct-map    tg-if-else
=====    ======    =============    ==========
10       114.3     1483.3            742.4
50       112.5     1476.6           2016.8
100      117.1     1498.4           5386.1
500      125.3     1475.5             --
1000     129.9     1474.4             --
5000     137.5     1491.4             --
10000    154.7     1480.9             --


Then for a 2-level partitioned table with each of the above partitions
partitioned by list (c), with 10 sub-partitions each as follows:

ptab_NNNNN_a for values in ('a');
ptab_NNNNN_b for values in ('b');
...
ptab_NNNNN_j for values in ('j');

I didn't include the times for inheritance table with a routing trigger in
this case, as it seems that the results would look something like the above:

Time for a plain table = 98.1 sec

#part    (sub-)parted
=====    ============
10       127.0
50       152.3
100      156.6
500      191.8
1000     187.3


Regarding tuple-mapping-required vs no-tuple-mapping-required, all cases
currently require tuple-mapping, because the decision is based on the
result of comparing parent and partition TupleDesc using
equalTupleDescs(), which fails so quickly because TupleDesc.tdtypeid are
not the same.  Anyway, I simply commented out the tuple-mapping statement
in ExecInsert() to observe just slightly improved numbers as follows
(comparing with numbers in the table just above):

#part    (sub-)parted
=====    =================
10       113.9 (vs. 127.0)
100      135.7 (vs. 156.6)
500      182.1 (vs. 191.8)

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/11/28 23:42, Ashutosh Bapat wrote:
> Here are some comments on 0003 patch.

Thanks for the review!

> 1. In the ALTER TABLE documentation we should refer to the CREATE TABLE
> documentation for the partition_bound_spec syntax, similar to the ADD
> table_constraint note.

That makes sense, done.

>
> 2. I think, "of the target table" is missing after all the ... constraints
> +      match.  Also, it must have all the <literal>NOT NULL</literal> and
> +      <literal>CHECK</literal> constraints present in the target table.

The "present in the target table" part at the end of the sentence is
supposed to mean just that, but I changed it to say "of the target table"
instead, replacing "present in" by "of".

> 3. I think, using "any of the" instead of "some" is preferred in the following
> sentence. In the following sentence, we should use "such a constraint" instead
> of "constraints" to avoid mixed singular and plural usage.
> +      If some <literal>CHECK</literal> constraint of the table being attached

OK, done.

> 4. This construction is ambiguous. "are not considered" - considered where? for
> what? Constraints on which object?
> +      clause.  Currently <literal>UNIQUE</literal>, <literal>PRIMARY
> KEY</literal>,
> +      and <literal>FOREIGN KEY</literal> constraints are not considered.

That is essentially the same sentence as the last sentence in the description
of INHERITS.  But I moved it next to: "Also, it must have all the NOT NULL
and CHECK constraints present in the target table."  Hopefully, that makes
it less ambiguous.  It's supposed to mean that those constraints are not
considered for inheritance, unlike NOT NULL and CHECK constraints.

> 5. What is a partition constraint? Do we define that term anywhere in the
> documentation? If not we should define that term and then use it here.
> +      A full table scan is performed on the table being attached to check that
> +      no existing row in the table violates the partition constraint.  It is

A partition constraint is an implicit constraint, derived from the parent's
partition key and the partition's bounds, that is enforced whenever an
action is applied to a partition directly - such as inserting/updating a
row or attaching a table as a partition.  In fact, it also comes into play
during constraint exclusion when selecting from the parent table.

Do you think it's better to mention something like the above in the
description of CREATE TABLE PARTITION OF?

> 6. The paragraph following
> +      A full table scan is performed on the table being attached to check that
> seems to be confusing. It says that the potentially expensive scan can be
> avoided by adding a constraint equivalent to the partition constraint. That
> seems to be wrong. The table will be scanned anyway when adding a valid
> constraint, per the ALTER TABLE ADD table_constraint documentation. Instead
> you might want to say that the table scan will not be carried out if the
> existing constraints imply the partition constraints.

Ah, I see the confusion.  What's written is not supposed to mean that the
constraint be added after attaching the table as a partition, but before.
It's right that the table will be scanned anyway to validate the
constraint, but at that point, it's not related to the partitioned table
in any way and hence does not affect it.  As soon as the ATTACH PARTITION
command is executed, an exclusive lock is acquired on the
parent/partitioned table, during which it's better to avoid any actions
that take long - such as the validation scan.  If the constraint added
*beforehand* to the table being attached is such that it allows only a
subset of the rows that the partition constraint will allow, then it is
not necessary to *repeat* the scan.  If it's not necessarily true that the
constraints of the table being attached allow only a subset of the rows
that the partition constraint will allow, then the validation scan cannot
be avoided.  IOW, the advice is to help avoid the scan while holding an
exclusive lock on the parent.

Anyway, I have modified some sentences to be slightly clearer, see if that
works for you.
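To illustrate the idea in miniature (a hypothetical model, not the server
code, simplified to a single integer column with half-open range bounds):
the validation scan is skippable exactly when the table's pre-existing
constraint admits only rows that the partition constraint also admits.

```python
def implies(existing, partition):
    """True if every row satisfying the existing CHECK constraint,
    modeled as a half-open interval (lo, hi), necessarily satisfies
    the partition constraint, also modeled as a half-open interval."""
    lo_e, hi_e = existing
    lo_p, hi_p = partition
    return lo_p <= lo_e and hi_e <= hi_p

# check (b >= 1000 and b < 2000) added beforehand, then attached as the
# partition FOR VALUES FROM (1000) TO (2000): the scan can be skipped.
assert implies((1000, 2000), (1000, 2000))
# A wider pre-existing constraint does not imply the partition
# constraint, so the validation scan is still required.
assert not implies((500, 2000), (1000, 2000))
```

The real check works on arbitrary constraint expressions, not just
single-column intervals; this only shows the containment intuition.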

> 7. You might want to specify the fate of range or lists covered by the
> partition being detached in DETACH PARTITION section.

The partitioned table simply ceases to accept the rows falling in such a
range or list.  The detached partition turns into a standalone table
which happens to contain only rows with values of certain columns
(partition keys) within the range or the list.  However, there is no
explicit (catalogued) check constraint to that effect on the newly
standalone table, if that's what you were asking.

> 8. Since partition bound specification is more than a few words, you might want
> to refer to the actual syntax in CREATE TABLE command.
> +      <term><replaceable
> class="PARAMETER">partition_bound_spec</replaceable></term>
> +      <listitem>
> +       <para>
> +        The partition bound specification for a new partition.
> +       </para>
> +      </listitem>
> +     </varlistentry>

OK, added a link to the CREATE TABLE page.

>
> 9. I think, we need to use "may be scanned" instead of "is scanned", given that
> some other part of the documentation talks about "avoiding" the table scan. May
> be refer to that documentation to clarify the uncertainty.
> +    Similarly, when attaching a new partition it is scanned to verify that

Agreed, done.

> 10. The same paragraph talks about multiple ALTER TABLE commands being run
> together to avoid multiple table rewrites. Do we want to document whether
> ATTACH/DETACH partition can be run with any other command or not.

While currently ATTACH/DETACH can be run with any other ALTER TABLE
sub-command (much as ALTER TABLE INHERIT can), I think it's perhaps better
to prevent that.  For example, the following awkward-sounding thing happens:

alter table p attach partition p1 for values in (1, 2, 3), add b int;
ERROR:  child table is missing column "b"

I have modified the patch so that the syntax does not allow such
combination of sub-commands.  Also, moved ATTACH PARTITION and DETACH
PARTITION in the synopsis section from "where action is one of:" area to
above where various forms of ALTER TABLE are listed.  Also specified down
below that ATTACH/DETACH PARTITION cannot be run in parallel with other
alterations.

> 11. The ATTACH PARTITION documentation does not talk about an unlogged or
> temporary table being attached to a logged or non-temporary parent. What is
> the current code doing about these cases? Do we allow such a mixed partition
> hierarchy?

I tried to match the behavior of CREATE TABLE INHERITS() and ALTER TABLE
INHERIT.

In the ALTER TABLE child INHERIT parent case (as in the CREATE TABLE child()
INHERITS(parent) case), failure occurs if the child is a permanent rel and
the parent is temporary, or if both are temporary but either parent or child
belongs to another session.  The same happens in the case of ALTER TABLE
parent ATTACH PARTITION child and CREATE TABLE child PARTITION OF parent,
respectively.

LOGGED-ness is not considered here at all.


Attached updated patches.

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Nov 25, 2016 at 5:49 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/11/25 11:44, Robert Haas wrote:
>> On Thu, Nov 24, 2016 at 6:13 AM, Amit Langote wrote:
>>> Also, it does nothing to help the undesirable situation that one can
>>> insert a row with a null partition key (expression) into any of the range
>>> partitions if targeted directly, because of how ExecQual() handles
>>> nullable constraint expressions (treats null value as satisfying the
>>> partition constraint).
>>
>> That's going to have to be fixed somehow.  How bad would it be if we
>> passed ExecQual's third argument as false for partition constraints?
>> Or else you could generate the actual constraint as expr IS NOT NULL
>> AND expr >= lb AND expr < ub.
>
> About the former, I think that might work.  If a column is NULL, it would
> be caught in ExecConstraints() even before ExecQual() is called, because
> of the NOT NULL constraint.  If an expression is NULL, or for some reason,
> the partitioning operator (=, >=, or <) returned NULL even for a non-NULL
> column or expression, then ExecQual() would fail if we passed false for
> resultForNull.  Not sure if that would be violating the SQL specification
> though.

I don't think the SQL specification can have anything to say about an
implicit constraint generated as an implementation detail of our
partitioning implementation.

> The latter would work too.  But I guess we would only emit expr IS NOT
> NULL, not column IS NOT NULL, because columns are covered by NOT NULL
> constraints.

Right.
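The behavioral difference between the two options can be modeled with SQL's
three-valued logic (a hypothetical sketch, not the server code; None stands
in for SQL NULL):

```python
def and3(a, b):
    """SQL three-valued AND: FALSE dominates, then NULL (None)."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def cmp3(op, x, y):
    """Comparison that yields NULL when either operand is NULL."""
    return None if x is None or y is None else op(x, y)

def range_check(expr, lb, ub):
    # "expr >= lb AND expr < ub": a NULL expr yields NULL (unknown),
    # which ExecQual with resultForNull = true would treat as a pass.
    return and3(cmp3(lambda a, b: a >= b, expr, lb),
                cmp3(lambda a, b: a < b, expr, ub))

def range_check_notnull(expr, lb, ub):
    # "expr IS NOT NULL AND ...": a NULL expr now yields FALSE.
    return and3(expr is not None, range_check(expr, lb, ub))

print(range_check(None, 1, 10))          # -> None (row would slip through)
print(range_check_notnull(None, 1, 10))  # -> False (row rejected)
```

Either passing resultForNull = false or prefixing the IS NOT NULL test thus
closes the hole for NULL partition key expressions.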

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Nov 29, 2016 at 6:24 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> # All times in seconds (on my modestly-powerful development VM)
> #
> # nrows = 10,000,000 generated using:
> #
> # INSERT INTO $tab
> # SELECT '$last'::date - ((s.id % $maxsecs + 1)::bigint || 's')::interval,
> #       (random() * 5000)::int % 4999 + 1,
> #        case s.id % 10
> #          when 0 then 'a'
> #          when 1 then 'b'
> #          when 2 then 'c'
> #          ...
> #          when 9 then 'j'
> #       end
> # FROM generate_series(1, $nrows) s(id)
> # ORDER BY random();
> #
> # The first item in the select list is basically a date that won't fall
> # outside the defined partitions.
>
> Time for a plain table = 98.1 sec
>
> #part    parted    tg-direct-map    tg-if-else
> =====    ======    =============    ==========
> 10       114.3     1483.3            742.4
> 50       112.5     1476.6           2016.8
> 100      117.1     1498.4           5386.1
> 500      125.3     1475.5             --
> 1000     129.9     1474.4             --
> 5000     137.5     1491.4             --
> 10000    154.7     1480.9             --

Very nice!

Obviously, it would be nice if the overhead were even lower, but it's
clearly a vast improvement over what we have today.

> Regarding tuple-mapping-required vs no-tuple-mapping-required, all cases
> currently require tuple-mapping, because the decision is based on the
> result of comparing parent and partition TupleDesc using
> equalTupleDescs(), which fails so quickly because TupleDesc.tdtypeid are
> not the same.  Anyway, I simply commented out the tuple-mapping statement
> in ExecInsert() to observe just slightly improved numbers as follows
> (comparing with numbers in the table just above):
>
> #part    (sub-)parted
> =====    =================
> 10       113.9 (vs. 127.0)
> 100      135.7 (vs. 156.6)
> 500      182.1 (vs. 191.8)

I think you should definitely try to get that additional speedup when
you can.  It doesn't seem like a lot when you think of how much is
already being saved, but a healthy number of users are going to
compare it to the performance on an unpartitioned table rather than to
our historical performance.   127/98.1 = 1.29, but 113.9/98.1 = 1.16
-- and obviously a 16% overhead from partitioning is way better than a
29% overhead, even if the old overhead was a million percent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On Thu, Dec 1, 2016 at 12:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Nov 25, 2016 at 5:49 AM, Amit Langote wrote:
>> On 2016/11/25 11:44, Robert Haas wrote:
>>> On Thu, Nov 24, 2016 at 6:13 AM, Amit Langote wrote:
>>>> Also, it does nothing to help the undesirable situation that one can
>>>> insert a row with a null partition key (expression) into any of the range
>>>> partitions if targeted directly, because of how ExecQual() handles
>>>> nullable constraint expressions (treats null value as satisfying the
>>>> partition constraint).
>>>
>>> That's going to have to be fixed somehow.  How bad would it be if we
>>> passed ExecQual's third argument as false for partition constraints?
>>> Or else you could generate the actual constraint as expr IS NOT NULL
>>> AND expr >= lb AND expr < ub.
>>
>> About the former, I think that might work.  If a column is NULL, it would
>> be caught in ExecConstraints() even before ExecQual() is called, because
>> of the NOT NULL constraint.  If an expression is NULL, or for some reason,
>> the partitioning operator (=, >=, or <) returned NULL even for a non-NULL
>> column or expression, then ExecQual() would fail if we passed false for
>> resultForNull.  Not sure if that would be violating the SQL specification
>> though.
>
> I don't think the SQL specification can have anything to say about an
> implicit constraint generated as an implementation detail of our
> partitioning implementation.

Yeah, I thought so too.

>> The latter would work too.  But I guess we would only emit expr IS NOT
>> NULL, not column IS NOT NULL, because columns are covered by NOT NULL
>> constraints.
>
> Right.

The latest patch I posted earlier today has this implementation.

Thanks,
Amit



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Nov 30, 2016 at 10:56 AM, Amit Langote <amitlangote09@gmail.com> wrote:
> The latest patch I posted earlier today has this implementation.

I decided to try out these patches today with #define
CLOBBER_CACHE_ALWAYS 1 in pg_config_manual.h, which found a couple of
problems:

1. RelationClearRelation() wasn't preserving the rd_partkey, even
though there's plenty of code that relies on it not changing while we
hold a lock on the relation - in particular, transformPartitionBound.

2. partition_bounds_equal() was using the comparator and collation for
partitioning column 0 to compare the datums for all partitioning
columns.  It's amazing this passed the regression tests.

The attached incremental patch fixes those things and some cosmetic
issues I found along the way.

3. RelationGetPartitionDispatchInfo() is badly broken:

1010         pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
1011         pd[i]->relid = RelationGetRelid(partrel);
1012         pd[i]->key = RelationGetPartitionKey(partrel);
1013         pd[i]->keystate = NIL;
1014         pd[i]->partdesc = partdesc;
1015         pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
1016         heap_close(partrel, NoLock);
1017
1018         m = 0;
1019         for (j = 0; j < partdesc->nparts; j++)
1020         {
1021             Oid     partrelid = partdesc->oids[j];

This code imagines that pointers it extracted from partrel are certain
to remain valid after heap_close(partrel, NoLock), perhaps on the
strength of the fact that we still retain a lock on the relation.  But
this isn't the case.  As soon as nobody has the relation open, a call
to RelationClearRelation() will destroy the relcache entry and
everything to which it points; with CLOBBER_CACHE_ALWAYS, I see a
failure at line 1021:

#0  RelationGetPartitionDispatchInfo (rel=0x1136dddf8, lockmode=3,
leaf_part_oids=0x7fff5633b938) at partition.c:1021
1021                Oid        partrelid = partdesc->oids[j];
(gdb) bt 5
#0  RelationGetPartitionDispatchInfo (rel=0x1136dddf8, lockmode=3,
leaf_part_oids=0x7fff5633b938) at partition.c:1021
#1  0x0000000109b8d71f in ExecInitModifyTable (node=0x7fd12984d750,
estate=0x7fd12b885438, eflags=0) at nodeModifyTable.c:1730
#2  0x0000000109b5e7ac in ExecInitNode (node=0x7fd12984d750,
estate=0x7fd12b885438, eflags=0) at execProcnode.c:159
#3  0x0000000109b58548 in InitPlan (queryDesc=0x7fd12b87b638,
eflags=0) at execMain.c:961
#4  0x0000000109b57dcd in standard_ExecutorStart
(queryDesc=0x7fd12b87b638, eflags=0) at execMain.c:239
(More stack frames follow...)
Current language:  auto; currently minimal
(gdb) p debug_query_string
$1 = 0x7fd12b84c238 "insert into list_parted values (null, 1);"
(gdb) p partdesc[0]
$2 = {
  nparts = 2139062143,
  oids = 0x7f7f7f7f7f7f7f7f,
  boundinfo = 0x7f7f7f7f7f7f7f7f
}

As you can see, the partdesc is no longer valid here.  I'm not
immediately sure how to fix this; this isn't a simple thinko.  You
need to keep the relations open for the whole duration of the query,
not just long enough to build the dispatch info.  I think you should
try to revise this so that each relation is opened once and kept open;
maybe the first loop should be making a pointer-list of Relations
rather than an int-list of relation OIDs.  And it seems to me (though
I'm getting a little fuzzy here because it's late) that you need all
of the partitions open, not just the ones that are subpartitioned,
because how else are you going to know how to remap the tuple if the
column order is different?  But I don't see this code doing that,
which makes me wonder if the partitions are being opened yet again in
some other location.

I recommend that once you fix this, you run 'make check' with #define
CLOBBER_CACHE_ALWAYS 1 and look for other hazards.  Such mistakes are
easy to make with this kind of patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/07 13:38, Robert Haas wrote:
> On Wed, Nov 30, 2016 at 10:56 AM, Amit Langote <amitlangote09@gmail.com> wrote:
>> The latest patch I posted earlier today has this implementation.
>
> I decided to try out these patches today with #define
> CLOBBER_CACHE_ALWAYS 1 in pg_config_manual.h, which found a couple of
> problems:
>
> 1. RelationClearRelation() wasn't preserving the rd_partkey, even
> though there's plenty of code that relies on it not changing while we
> hold a lock on the relation - in particular, transformPartitionBound.

Oh, I thought an AccessExclusiveLock on the relation would prevent having
to worry about that, but I guess I'm wrong.  Perhaps having a lock on a
table does not preclude RelationClearRelation() from resetting the table's
relcache entry.

> 2. partition_bounds_equal() was using the comparator and collation for
> partitioning column 0 to compare the datums for all partitioning
> columns.  It's amazing this passed the regression tests.

Oops, it seems that the regression tests where the above code might be
exercised used only a range partition key with columns all of the same
type: create table test(a int, b int) partition by range (a, (a+b));

> The attached incremental patch fixes those things and some cosmetic
> issues I found along the way.

Thanks for fixing these.  Attached patches include these changes.

> 3. RelationGetPartitionDispatchInfo() is badly broken:
>
> 1010         pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
> 1011         pd[i]->relid = RelationGetRelid(partrel);
> 1012         pd[i]->key = RelationGetPartitionKey(partrel);
> 1013         pd[i]->keystate = NIL;
> 1014         pd[i]->partdesc = partdesc;
> 1015         pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
> 1016         heap_close(partrel, NoLock);
> 1017
> 1018         m = 0;
> 1019         for (j = 0; j < partdesc->nparts; j++)
> 1020         {
> 1021             Oid     partrelid = partdesc->oids[j];
>
> This code imagines that pointers it extracted from partrel are certain
> to remain valid after heap_close(partrel, NoLock), perhaps on the
> strength of the fact that we still retain a lock on the relation.  But
> this isn't the case.  As soon as nobody has the relation open, a call
> to RelationClearRelation() will destroy the relcache entry and
> everything to which it points; with CLOBBER_CACHE_ALWAYS, I see a
> failure at line 1021:
>
> #0  RelationGetPartitionDispatchInfo (rel=0x1136dddf8, lockmode=3,
> leaf_part_oids=0x7fff5633b938) at partition.c:1021
> 1021                Oid        partrelid = partdesc->oids[j];
> (gdb) bt 5
> #0  RelationGetPartitionDispatchInfo (rel=0x1136dddf8, lockmode=3,
> leaf_part_oids=0x7fff5633b938) at partition.c:1021
> #1  0x0000000109b8d71f in ExecInitModifyTable (node=0x7fd12984d750,
> estate=0x7fd12b885438, eflags=0) at nodeModifyTable.c:1730
> #2  0x0000000109b5e7ac in ExecInitNode (node=0x7fd12984d750,
> estate=0x7fd12b885438, eflags=0) at execProcnode.c:159
> #3  0x0000000109b58548 in InitPlan (queryDesc=0x7fd12b87b638,
> eflags=0) at execMain.c:961
> #4  0x0000000109b57dcd in standard_ExecutorStart
> (queryDesc=0x7fd12b87b638, eflags=0) at execMain.c:239
> (More stack frames follow...)
> Current language:  auto; currently minimal
> (gdb) p debug_query_string
> $1 = 0x7fd12b84c238 "insert into list_parted values (null, 1);"
> (gdb) p partdesc[0]
> $2 = {
>   nparts = 2139062143,
>   oids = 0x7f7f7f7f7f7f7f7f,
>   boundinfo = 0x7f7f7f7f7f7f7f7f
> }
>
> As you can see, the partdesc is no longer valid here.  I'm not
> immediately sure how to fix this; this isn't a simple thinko.  You
> need to keep the relations open for the whole duration of the query,
> not just long enough to build the dispatch info.  I think you should
> try to revise this so that each relation is opened once and kept open;
> maybe the first loop should be making a pointer-list of Relations
> rather than an int-list of relation OIDs.

Thanks for the explanation; I see the problem.  I changed
PartitionDispatchData such that its first field is now a Relation (instead
of the current Oid), which holds the relation descriptor of the
corresponding partitioned table.  Currently, the leaf relations are held
open in their respective ResultRelInfo's (ModifyTableState.mt_partitions)
and later closed in ExecEndModifyTable().  Similarly, each member of the
ModifyTableState.mt_partition_dispatch_info array holds a partitioned
relation open through the new field; we close those too, along with the
leaf partitions, as just mentioned.  The same applies in the case of COPY
FROM.

> And it seems to me (though
> I'm getting a little fuzzy here because it's late) that you need all
> of the partitions open, not just the ones that are subpartitioned,
> because how else are you going to know how to remap the tuple if the
> column order is different?  But I don't see this code doing that,
> which makes me wonder if the partitions are being opened yet again in
> some other location.

Actually leaf partitions are opened by the respective callers of
RelationGetPartitionDispatchInfo() and held open in the corresponding
ResultRelInfo's (ModifyTableState.mt_partitions).  As mentioned above,
they will be closed by either ExecEndModifyTable() or CopyFrom() after
finishing the INSERT or COPY, respectively.

> I recommend that once you fix this, you run 'make check' with #define
> CLOBBER_CACHE_ALWAYS 1 and look for other hazards.  Such mistakes are
> easy to make with this kind of patch.

With the attached latest version of the patches, I couldn't see any
failures with a CLOBBER_CACHE_ALWAYS build.

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Dec 7, 2016 at 6:42 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/12/07 13:38, Robert Haas wrote:
>> On Wed, Nov 30, 2016 at 10:56 AM, Amit Langote <amitlangote09@gmail.com> wrote:
>>> The latest patch I posted earlier today has this implementation.
>>
>> I decided to try out these patches today with #define
>> CLOBBER_CACHE_ALWAYS 1 in pg_config_manual.h, which found a couple of
>> problems:
>>
>> 1. RelationClearRelation() wasn't preserving the rd_partkey, even
>> though there's plenty of code that relies on it not changing while we
>> hold a lock on the relation - in particular, transformPartitionBound.
>
> Oh, I thought an AccessExclusiveLock on the relation would prevent having
> to worry about that, but I guess I'm wrong.  Perhaps having a lock on a
> table does not preclude RelationClearRelation() from resetting the table's
> relcache entry.

No, it sure doesn't.  The lock prevents the table from actually being
changed, so a reload will find data equivalent to what it had before,
but it doesn't prevent the backend's cache from being flushed.

>> 2. partition_bounds_equal() was using the comparator and collation for
>> partitioning column 0 to compare the datums for all partitioning
>> columns.  It's amazing this passed the regression tests.
>
> Oops, it seems that the regression tests where the above code might be
> exercised used only a range partition key with columns all of the same
> type: create table test(a int, b int) partition by range (a, (a+b));

It doesn't seem like it; you had this: create table part1 partition of
range_parted for values from ('a', 1) to ('a', 10);

>> I recommend that once you fix this, you run 'make check' with #define
>> CLOBBER_CACHE_ALWAYS 1 and look for other hazards.  Such mistakes are
>> easy to make with this kind of patch.
>
> With the attached latest version of the patches, I couldn't see any
> failures with a CLOBBER_CACHE_ALWAYS build.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Erik Rijkers
Date:
On 2016-12-07 12:42, Amit Langote wrote:

> 0001-Catalog-and-DDL-for-partitioned-tables-20.patch
> 0002-psql-and-pg_dump-support-for-partitioned-tables-20.patch
> 0003-Catalog-and-DDL-for-partitions-20.patch
> 0004-psql-and-pg_dump-support-for-partitions-20.patch
> 0005-Teach-a-few-places-to-use-partition-check-quals-20.patch
> 0006-Tuple-routing-for-partitioned-tables-20.patch
> 0007-Update-DDL-Partitioning-chapter-to-reflect-new-devel-20.patch

Patches apply, compile, check OK.


But this yields a segfault:

begin;
create schema if not exists s;
create table s.t (c text, d text, id serial) partition by list 
((ascii(substring(coalesce(c, d, ''), 1, 1))));
create table s.t_part_ascii_065 partition of s.t for values in ( 65 );




it logs as follows:

2016-12-07 17:03:45.787 CET 6125 LOG:  server process (PID 11503) was 
terminated by signal 11: Segmentation fault
2016-12-07 17:03:45.787 CET 6125 DETAIL:  Failed process was running: 
create table s.t_part_ascii_065 partition of s.t for values in ( 65 );
2016-12-07 17:03:45.787 CET 6125 LOG:  terminating any other active 
server processes
2016-12-07 17:03:45.791 CET 6125 LOG:  all server processes terminated; 
reinitializing
2016-12-07 17:03:45.999 CET 11655 LOG:  database system was interrupted; 
last known up at 2016-12-07 17:00:38 CET
2016-12-07 17:03:48.040 CET 11655 LOG:  database system was not properly 
shut down; automatic recovery in progress
2016-12-07 17:03:48.156 CET 11655 LOG:  redo starts at 0/2897988
2016-12-07 17:03:48.172 CET 11655 LOG:  invalid magic number 0000 in log 
segment 000000010000000000000002, offset 9207808
2016-12-07 17:03:48.172 CET 11655 LOG:  redo done at 0/28C72C0
2016-12-07 17:03:48.172 CET 11655 LOG:  last completed transaction was 
at log time 2016-12-07 17:01:29.580562+01
2016-12-07 17:03:49.534 CET 11655 LOG:  MultiXact member wraparound 
protections are now enabled
2016-12-07 17:03:49.622 CET 6125 LOG:  database system is ready to 
accept connections



Thanks,

Erik Rijkers



Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi Erik,

On Thu, Dec 8, 2016 at 1:19 AM, Erik Rijkers <er@xs4all.nl> wrote:
> On 2016-12-07 12:42, Amit Langote wrote:
>
>> 0001-Catalog-and-DDL-for-partitioned-tables-20.patch
>> 0002-psql-and-pg_dump-support-for-partitioned-tables-20.patch
>> 0003-Catalog-and-DDL-for-partitions-20.patch
>> 0004-psql-and-pg_dump-support-for-partitions-20.patch
>> 0005-Teach-a-few-places-to-use-partition-check-quals-20.patch
>> 0006-Tuple-routing-for-partitioned-tables-20.patch
>> 0007-Update-DDL-Partitioning-chapter-to-reflect-new-devel-20.patch
>
>
> Patches apply, compile, check OK.

Thanks!

> But this yields a segfault:
>
> begin;
> create schema if not exists s;
> create table s.t (c text, d text, id serial) partition by list
> ((ascii(substring(coalesce(c, d, ''), 1, 1))));
> create table s.t_part_ascii_065 partition of s.t for values in ( 65 );
>
> it logs as follows:
>
> 2016-12-07 17:03:45.787 CET 6125 LOG:  server process (PID 11503) was
> terminated by signal 11: Segmentation fault
> 2016-12-07 17:03:45.787 CET 6125 DETAIL:  Failed process was running: create
> table s.t_part_ascii_065 partition of s.t for values in ( 65 );

Hm, will look into this in a few hours.

Thanks,
Amit



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Dec 7, 2016 at 11:34 AM, Amit Langote <amitlangote09@gmail.com> wrote:
>> begin;
>> create schema if not exists s;
>> create table s.t (c text, d text, id serial) partition by list
>> ((ascii(substring(coalesce(c, d, ''), 1, 1))));
>> create table s.t_part_ascii_065 partition of s.t for values in ( 65 );
>>
>> it logs as follows:
>>
>> 2016-12-07 17:03:45.787 CET 6125 LOG:  server process (PID 11503) was
>> terminated by signal 11: Segmentation fault
>> 2016-12-07 17:03:45.787 CET 6125 DETAIL:  Failed process was running: create
>> table s.t_part_ascii_065 partition of s.t for values in ( 65 );
>
> Hm, will look into this in a few hours.

My bad.  The fix I sent last night for one of the cache flush issues
wasn't quite right.  The attached seems to fix it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Declarative partitioning - another take

From
Erik Rijkers
Date:
On 2016-12-07 17:38, Robert Haas wrote:
> On Wed, Dec 7, 2016 at 11:34 AM, Amit Langote <amitlangote09@gmail.com> 
> wrote:
>>> begin;
>>> create schema if not exists s;
>>> create table s.t (c text, d text, id serial) partition by list
>>> ((ascii(substring(coalesce(c, d, ''), 1, 1))));
>>> create table s.t_part_ascii_065 partition of s.t for values in ( 65 
>>> );
>>> 
>>> it logs as follows:
>>> 
>>> 2016-12-07 17:03:45.787 CET 6125 LOG:  server process (PID 11503) was
>>> terminated by signal 11: Segmentation fault
>>> 2016-12-07 17:03:45.787 CET 6125 DETAIL:  Failed process was running: 
>>> create
>>> table s.t_part_ascii_065 partition of s.t for values in ( 65 );
>> 
>> Hm, will look into this in a few hours.
> 
> My bad.  The fix I sent last night for one of the cache flush issues
> wasn't quite right.  The attached seems to fix it.

Yes, fixed here too.  Thanks.


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers <er@xs4all.nl> wrote:
>> My bad.  The fix I sent last night for one of the cache flush issues
>> wasn't quite right.  The attached seems to fix it.
> Yes, fixed here too.  Thanks.

Thanks for the report - that was a good catch.

I've committed 0001 - 0006 with that correction and a few other
adjustments.  There's plenty of room for improvement here, and almost
certainly some straight-up bugs too, but I think we're at a point
where it will be easier and less error-prone to commit follow on
changes incrementally rather than by continuously re-reviewing a very
large patch set for increasingly smaller changes.

Some notes:

* We should try to teach the executor never to scan the parent.
That's never necessary with this system, and it might add significant
overhead.  We should also try to get rid of the idea of the parent
having storage (i.e. a relfilenode).

* The fact that, in some cases, locking requirements for partitioning
are stronger than those for inheritance is not good.  We made those
decisions for good reasons -- namely, data integrity and not crashing
the server -- but it would certainly be good to revisit those things
and see whether there's any room for improvement.

* I didn't commit 0007, which updates the documentation for this new
feature.  That patch removes more lines than it adds, and I suspect
what is needed here is an expansion of the documentation rather than a
diminishing of it.

* The fact that there's no implementation of row movement should be
documented as a limitation.  We should also look at removing that
limitation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/08 1:53, Erik Rijkers wrote:
> On 2016-12-07 17:38, Robert Haas wrote:
>> On Wed, Dec 7, 2016 at 11:34 AM, Amit Langote <amitlangote09@gmail.com>
>> wrote:
>>>> begin;
>>>> create schema if not exists s;
>>>> create table s.t (c text, d text, id serial) partition by list
>>>> ((ascii(substring(coalesce(c, d, ''), 1, 1))));
>>>> create table s.t_part_ascii_065 partition of s.t for values in ( 65 );
>>>>
>>>> it logs as follows:
>>>>
>>>> 2016-12-07 17:03:45.787 CET 6125 LOG:  server process (PID 11503) was
>>>> terminated by signal 11: Segmentation fault
>>>> 2016-12-07 17:03:45.787 CET 6125 DETAIL:  Failed process was running:
>>>> create
>>>> table s.t_part_ascii_065 partition of s.t for values in ( 65 );
>>>
>>> Hm, will look into this in a few hours.
>>
>> My bad.  The fix I sent last night for one of the cache flush issues
>> wasn't quite right.  The attached seems to fix it.
> 
> Yes, fixed here too.  Thanks.

Thanks for reporting and the fix, Erik and Robert!

Thanks,
Amit





Re: Declarative partitioning - another take

From
Amit Langote
Date:
Hi Robert,

On 2016/12/08 3:20, Robert Haas wrote:
> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers <er@xs4all.nl> wrote:
>>> My bad.  The fix I sent last night for one of the cache flush issues
>>> wasn't quite right.  The attached seems to fix it.
>> Yes, fixed here too.  Thanks.
> 
> Thanks for the report - that was a good catch.
> 
> I've committed 0001 - 0006 with that correction and a few other
> adjustments.  There's plenty of room for improvement here, and almost
> certainly some straight-up bugs too, but I think we're at a point
> where it will be easier and less error-prone to commit follow on
> changes incrementally rather than by continuously re-reviewing a very
> large patch set for increasingly smaller changes.

+1 and thanks a lot for your and everyone else's very patient support in
reviewing the patches.

> Some notes:
> 
> * We should try to teach the executor never to scan the parent.
> That's never necessary with this system, and it might add significant
> overhead.  We should also try to get rid of the idea of the parent
> having storage (i.e. a relfilenode).

Agreed, I will start investigating.

> * The fact that, in some cases, locking requirements for partitioning
> are stronger than those for inheritance is not good.  We made those
> decisions for good reasons -- namely, data integrity and not crashing
> the server -- but it would certainly be good to revisit those things
> and see whether there's any room for improvement.

+1

> * I didn't commit 0007, which updates the documentation for this new
> feature. That patch removes more lines than it adds, and I suspect
> what is needed here
> is an expansion of the documentation rather than a diminishing of it.

Hmm, I had mixed feelings about what to do about that as well.  So now, we
have the description of the various new features buried in the VI.
Reference section of the documentation, which is simply meant as a command
reference.  I agree that the new partitioning warrants more expansion in
the DDL partitioning chapter.  Will see how that could be done.

> * The fact that there's no implementation of row movement should be
> documented as a limitation.  We should also look at removing that
> limitation.

Yes, something to improve.  By the way, since we currently mention INSERT
tuple-routing directly in the description of the partitioned tables in the
CREATE TABLE command reference, is that also the place to list this
particular limitation?  Or is UPDATE command reference rather the correct
place?

Thanks,
Amit





Re: Declarative partitioning - another take

From
Andres Freund
Date:
On 2016-12-07 13:20:04 -0500, Robert Haas wrote:
> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers <er@xs4all.nl> wrote:
> >> My bad.  The fix I sent last night for one of the cache flush issues
> >> wasn't quite right.  The attached seems to fix it.
> > Yes, fixed here too.  Thanks.
> 
> Thanks for the report - that was a good catch.

Congrats to everyone working on this! This is a large step forward.

- Andres



Re: Declarative partitioning - another take

From
Michael Paquier
Date:
On Thu, Dec 8, 2016 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-12-07 13:20:04 -0500, Robert Haas wrote:
>> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers <er@xs4all.nl> wrote:
>> >> My bad.  The fix I sent last night for one of the cache flush issues
>> >> wasn't quite right.  The attached seems to fix it.
>> > Yes, fixed here too.  Thanks.
>>
>> Thanks for the report - that was a good catch.
>
> Congrats to everyone working on this! This is a large step forward.

Congratulations to all! It was a long way to this result.
-- 
Michael



Re: Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/08 3:20, Robert Haas wrote:
> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers <er@xs4all.nl> wrote:
>>> My bad.  The fix I sent last night for one of the cache flush issues
>>> wasn't quite right.  The attached seems to fix it.
>> Yes, fixed here too.  Thanks.
>
> Thanks for the report - that was a good catch.
>
> I've committed 0001 - 0006 with that correction and a few other
> adjustments.  There's plenty of room for improvement here, and almost
> certainly some straight-up bugs too, but I think we're at a point
> where it will be easier and less error-prone to commit follow on
> changes incrementally rather than by continuously re-reviewing a very
> large patch set for increasingly smaller changes.

Attached is a patch to fix some stale comments in the code and a minor
correction to one of the examples on the CREATE TABLE page.

Thanks,
Amit

Attachment

Re: Declarative partitioning - another take

From
Stephen Frost
Date:
Amit,

* Amit Langote (Langote_Amit_f8@lab.ntt.co.jp) wrote:
> Hmm, I had mixed feeling about what to do about that as well.  So now, we
> have the description of various new features buried into VI. Reference
> section of the documentation, which is simply meant as a command
> reference.  I agree that the new partitioning warrants more expansion in
> the DDL partitioning chapter.  Will see how that could be done.

Definitely.

> > * The fact that there's no implementation of row movement should be
> > documented as a limitation.  We should also look at removing that
> > limitation.
>
> Yes, something to improve.  By the way, since we currently mention INSERT
> tuple-routing directly in the description of the partitioned tables in the
> CREATE TABLE command reference, is that also the place to list this
> particular limitation?  Or is UPDATE command reference rather the correct
> place?

Both.

Thanks!

Stephen

Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Dec 7, 2016 at 11:42 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> Congrats to everyone working on this! This is a large step forward.
>
> Congratulations to all! It was a long way to this result.

Yes.  The last effort in this area which I can remember was by Itagaki
Takahiro in 2010, so we've been waiting for this for more than 6
years.  It's really good that Amit was able to put in the effort to
produce a committable patch, and I think he deserves all of our thanks
for getting that done - and NTT deserves our thanks for paying him to
do it.

Even though I know he put in a lot more work than I did, let me just
say: phew, even reviewing that was a ton of work.

Of course, this is the beginning, not the end.  I've been thinking
about next steps -- here's an expanded list:

- more efficient plan-time partition pruning (constraint exclusion is too slow)
- run-time partition pruning
- partition-wise join (Ashutosh Bapat is already working on this)
- try to reduce lock levels
- hash partitioning
- the ability to create an index on the parent and have all of the
children inherit it; this should work something like constraint
inheritance.  you could argue that this doesn't add any real new
capability but it's a huge usability feature.
- teaching autovacuum enough about inheritance hierarchies for it to
update the parent statistics when they get stale despite the lack of
any actual inserts/updates/deletes to the parent.  this has been
pending for a long time, but it's only going to get more important
- row movement (aka avoiding the need for an ON UPDATE trigger on each
partition)
- insert (and eventually update) tuple routing for foreign partitions
- not scanning the parent
- fixing the insert routing so that we can skip tuple conversion where possible
- fleshing out the documentation

One thing I'm wondering is whether we can optimize away some of the
heavyweight locks.  For example, if somebody does SELECT * FROM ptab
WHERE id = 1, they really shouldn't need to lock the entire
partitioning hierarchy, but right now they do.  If the root knows
based on its own partitioning key that only one child is relevant, it
would be good to lock *only that child*.  For this feature to be
competitive, it needs to scale to at least a few thousand partitions,
and locking thousands of objects instead of one or two is bound to be
slow.  Similarly, you can imagine teaching COPY to lock partitions
only on demand; if no tuples are routed to a particular partition, we
don't need to lock it.  There's a manageability component here, too:
not locking partitions unnecessarily makes it easier to get DDL on
other partitions through.  Alternatively, maybe we could rewrite the
lock manager to be hierarchical, so that you can take a single lock
that represents an AccessShareLock on all partitions and only need to
make one entry in the lock table to do it.  That means that attempts
to lock individual partitions need to check not only for a lock on
that partition but also on anything further up in the hierarchy, but
that might be a good trade if it gives us O(1) locking on the parent.
And maybe we could also have a level of the hierarchy that represents
every-table-in-the-database, for the benefit of pg_dump.  Of course,
rewriting the lock manager is a big project not for the faint of
heart, but I think if we don't it's going to be a scaling bottleneck.

We also need to consider other parts of the system that may not scale,
like pg_dump.  For a long time, we've been sorta-kinda willing to fix
the worst of the scalability problems with pg_dump, but that's really
no longer an adequate response.  People want 1000 partitions.  Heck,
people want 1,000,000 partitions, but getting to where 1000 partitions
works well would help PostgreSQL a lot.  Our oft-repeated line that
inheritance isn't designed for large numbers of inheritance children
is basically just telling people who have the use case where they need
that to go use some other product.  Partitioning, like replication, is
not an optional feature for a world-class database.  And, from a
technical point of view, I think we've now got an infrastructure that
really should be able to be scaled up considerably higher than what
we've been able to do in the past.  When we were stuck with
inheritance + constraint exclusion, we could say "well, there's not
really any point because you'll hit these other limits anyway".  But I
think now that's not really true.  This patch eliminates one of the
core scalability problems in this area, and provides infrastructure
for attacking some of the others.  I hope that people will step up and
do that.  There's a huge opportunity here for PostgreSQL to become
relevant in use cases where it currently falters badly, and we should
try to take advantage of it.  This patch is a big step by itself, but
if we ignore the potential to do more with this as the base we will be
leaving a lot of "win" on the table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Dmitry Ivanov
Date:
Hi everyone,

> Of course, this is the beginning, not the end.  I've been thinking
> about next steps -- here's an expanded list:

> - more efficient plan-time partition pruning (constraint
> exclusion is too slow)
> - run-time partition pruning
> - try to reduce lock levels
> ...

We (PostgresPro) have been working on pg_pathman for quite a while, and
since it's obviously going to become the thing of the past, it would be a
wasted effort if we didn't try to participate.

For starters, I'd love to work on both plan-time & run-time partition
pruning. I created a custom node for run-time partition elimination, so I
think I'm capable of developing something similar.


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Dec 8, 2016 at 11:13 AM, Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
> We (PostgresPro) have been working on pg_pathman for quite a while, and
> since it's obviously going to become the thing of the past, it would be a
> wasted effort if we didn't try to participate.
>
> For starters, I'd love to work on both plan-time & run-time partition
> pruning. I created a custom node for run-time partition elimination, so I
> think I'm capable of developing something similar.

That would be fantastic.  I and my colleagues at EnterpriseDB can
surely help review; of course, maybe you and some of your colleagues
would like to help review our patches, too.  Do you think this is
likely to be something where you can get something done quickly, with
the hope of getting it into v10?  Time is growing short, but it would
be great to polish this a little more before we ship it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Alexander Korotkov
Date:
On Thu, Dec 8, 2016 at 7:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 8, 2016 at 11:13 AM, Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
>> We (PostgresPro) have been working on pg_pathman for quite a while, and
>> since it's obviously going to become the thing of the past, it would be a
>> wasted effort if we didn't try to participate.
>>
>> For starters, I'd love to work on both plan-time & run-time partition
>> pruning. I created a custom node for run-time partition elimination, so I
>> think I'm capable of developing something similar.
>
> That would be fantastic.  I and my colleagues at EnterpriseDB can
> surely help review;

Great! And it is very cool that we have basic infrastructure already
committed.  Thanks a lot to you and everybody involved.

> of course, maybe you and some of your colleagues
> would like to help review our patches, too.

We understand our reviewing performance is not sufficient.  Will try to do
better during the next commitfest.

> Do you think this is
> likely to be something where you can get something done quickly, with
> the hope of getting it into v10?

Yes, because we have a set of features already implemented in pg_pathman.
In particular, we have the following features from your list and some more:

- more efficient plan-time partition pruning (constraint exclusion is too slow)
- run-time partition pruning
- insert (and eventually update) tuple routing for foreign partitions
- hash partitioning
- not scanning the parent

> Time is growing short, but it would
> be great to polish this a little more before we ship it.

Yes. Getting at least some of these features committed to v10 would be
great and would improve partitioning usability a lot.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Declarative partitioning - another take

From
Dmitry Ivanov
Date:
> That would be fantastic.  I and my colleagues at EnterpriseDB can
> surely help review; of course, maybe you and some of your colleagues
> would like to help review our patches, too.

Certainly, I'll start reviewing as soon as I get familiar with the code.


> Do you think this is
> likely to be something where you can get something done quickly, with
> the hope of getting it into v10?

Yes, I've just cleared my schedule in order to make this possible. I'll
bring in the patches ASAP.


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Dec 8, 2016 at 11:43 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Great! And it is very cool that we have basic infrastructure already
> committed.  Thanks a lot to you and everybody involved.

Thanks.

>> of course, maybe you and some of your colleagues
>> would like to help review our patches, too.
> We understand our reviewing performance is not sufficient.  Will try to do
> better during next commitfest.

Not trying to throw stones, just want to get as much committed as
possible.  And I think our patches are good and valuable improvements
too, so I want to see them go in because they will help everybody.
Thanks for trying to increase the reviewing effort; it is sorely
needed.

>> Do you think this is
>> likely to be something where you can get something done quickly, with
>> the hope of getting it into v10?
>
> Yes, because we have a set of features already implemented in pg_pathman.  In
> particular we have the following features from your list and some more.
>
> - more efficient plan-time partition pruning (constraint exclusion is too
> slow)
> - run-time partition pruning
> - insert (and eventually update) tuple routing for foreign partitions
> - hash partitioning
> - not scanning the parent

That's a lot of stuff.  Getting even a couple of those would be a big win.

>> Time is growing short, but it would
>> be great to polish this a little more before we ship it.
>
> Yes. Getting at least some of this features committed to v10 would be great
> and improve partitioning usability a lot.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Dec 8, 2016 at 11:44 AM, Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
>> That would be fantastic.  I and my colleagues at EnterpriseDB can
>> surely help review; of course, maybe you and some of your colleagues
>> would like to help review our patches, too.
>
> Certainly, I'll start reviewing as soon as I get familiar with the code.

Thanks!

>> Do you think this is
>> likely to be something where you can get something done quickly, with
>> the hope of getting it into v10?
>
> Yes, I've just cleared my schedule in order to make this possible. I'll
> bring in the patches ASAP.

Fantastic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Alexander
> Korotkov
> Yes. Getting at least some of this features committed to v10 would be great
> and improve partitioning usability a lot.

I'm sorry for not contributing to the real partitioning feature, but I'm really looking forward to seeing the efficient
plan-time and run-time partition pruning implemented in v10.  Recently, we failed to acquire a customer because they
could not achieve their performance goal due to the slow partition pruning compared to Oracle.  The batch app prepares a
SELECT statement against a partitioned table, then executes it millions of times with different parameter values.  It
took a long time to process Bind messages.
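
To illustrate the workload, it does something roughly like the following (table and column names here are invented for the example, not the customer's actual schema):

```sql
-- Hypothetical sketch of the batch workload: one prepared statement,
-- executed many times with different parameter values.
PREPARE fetch_orders (date) AS
    SELECT * FROM orders WHERE o_orderdate = $1;

-- Each EXECUTE sends a Bind message; with constraint exclusion,
-- the partition-pruning work is redone for every execution.
EXECUTE fetch_orders('1993-05-01');
EXECUTE fetch_orders('1997-07-16');
```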
 

Another requirement was subpartitioning.  Will this be possible with the current infrastructure, or does this need
drastic change?
 

Regards
Takayuki Tsunakawa


Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/09 10:09, Tsunakawa, Takayuki wrote:
> Another requirement was subpartitioning.  Will this be possible with the current infrastructure, or does this need
> drastic change?
 

It does support sub-partitioning, although the syntax is a bit different.
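
For example, it looks roughly like this (an untested sketch with made-up table names; a partition is itself declared partitioned):

```sql
CREATE TABLE measurement (
    logdate  date NOT NULL,
    city_id  int  NOT NULL
) PARTITION BY RANGE (logdate);

-- A partition that is itself partitioned (sub-partitioning):
CREATE TABLE measurement_y2016 PARTITION OF measurement
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01')
    PARTITION BY LIST (city_id);

-- A leaf partition at the second level:
CREATE TABLE measurement_y2016_c1 PARTITION OF measurement_y2016
    FOR VALUES IN (1);
```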

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
"Tsunakawa, Takayuki"
Date:
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> On 2016/12/09 10:09, Tsunakawa, Takayuki wrote:
> > Another requirement was subpartitioning.  Will this be possible with the
> current infrastructure, or does this need drastic change?
> 
> It does support sub-partitioning, although the syntax is a bit different.

Super great!  I'm excited to try the feature when I have time.  I hope I can contribute to the quality by finding any
bugs.

Regards
Takayuki Tsunakawa


Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Stephen,

On 2016/12/08 22:35, Stephen Frost wrote:
>>> * The fact that there's no implementation of row movement should be
>>> documented as a limitation.  We should also look at removing that
>>> limitation.
>>
>> Yes, something to improve.  By the way, since we currently mention INSERT
>> tuple-routing directly in the description of the partitioned tables in the
>> CREATE TABLE command reference, is that also the place to list this
>> particular limitation?  Or is UPDATE command reference rather the correct
>> place?
> 
> Both.

Attached a documentation fix patch.

Actually, there was no mention on the INSERT reference page of
tuple-routing occurring in case of partitioned tables and also the
possibility of an error if a *partition* is directly targeted in an
INSERT. Mentioned that as well.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/09 0:25, Robert Haas wrote:
> On Wed, Dec 7, 2016 at 11:42 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>> Congrats to everyone working on this! This is a large step forward.
>>
>> Congratulations to all! It was a long way to this result.
> 
> Yes.  The last effort in this area which I can remember was by Itagaki
> Takahiro in 2010, so we've been waiting for this for more than 6
> years.  It's really good that Amit was able to put in the effort to
> produce a committable patch, and I think he deserves all of our thanks
> for getting that done - and NTT deserves our thanks for paying him to
> do it.
> 
> Even though I know he put in a lot more work than I did, let me just
> say: phew, even reviewing that was a ton of work.

Absolutely!  Your review comments and design suggestions have been
instrumental in improving (and cutting down on the size of) the patches.

> Of course, this is the beginning, not the end.

+1000!

> I've been thinking
> about next steps -- here's an expanded list:
> 
> - more efficient plan-time partition pruning (constraint exclusion is too slow)
> - run-time partition pruning
> - partition-wise join (Ashutosh Bapat is already working on this)
> - try to reduce lock levels
> - hash partitioning
> - the ability to create an index on the parent and have all of the
> children inherit it; this should work something like constraint
> inheritance.  you could argue that this doesn't add any real new
> capability but it's a huge usability feature.
> - teaching autovacuum enough about inheritance hierarchies for it to
> update the parent statistics when they get stale despite the lack of
> any actual inserts/updates/deletes to the parent.  this has been
> pending for a long time, but it's only going to get more important
> - row movement (aka avoiding the need for an ON UPDATE trigger on each
> partition)
> - insert (and eventually update) tuple routing for foreign partitions
> - not scanning the parent
> - fixing the insert routing so that we can skip tuple conversion where possible
> - fleshing out the documentation

I would definitely want to contribute to some of these items.  It's great
that many others plan to contribute toward this as well.

> One thing I'm wondering is whether we can optimize away some of the
> heavyweight locks.  For example, if somebody does SELECT * FROM ptab
> WHERE id = 1, they really shouldn't need to lock the entire
> partitioning hierarchy, but right now they do.  If the root knows
> based on its own partitioning key that only one child is relevant, it
> would be good to lock *only that child*.  For this feature to be
> competitive, it needs to scale to at least a few thousand partitions,
> and locking thousands of objects instead of one or two is bound to be
> slow.  Similarly, you can imagine teaching COPY to lock partitions
> only on demand; if no tuples are routed to a particular partition, we
> don't need to lock it.  There's a manageability component here, too:
> not locking partitions unnecessarily makes it easier to get DDL on
> other partitions through.  Alternatively, maybe we could rewrite the
> lock manager to be hierarchical, so that you can take a single lock
> that represents an AccessShareLock on all partitions and only need to
> make one entry in the lock table to do it.  That means that attempts
> to lock individual partitions need to check not only for a lock on
> that partition but also on anything further up in the hierarchy, but
> that might be a good trade if it gives us O(1) locking on the parent.
> And maybe we could also have a level of the hierarchy that represents
> every-table-in-the-database, for the benefit of pg_dump.  Of course,
> rewriting the lock manager is a big project not for the faint of
> heart, but I think if we don't it's going to be a scaling bottleneck.

Hierarchical lock manager stuff is interesting.  Are you perhaps alluding
to a new *intention* lock mode as described in the literature on multiple
granularity locking [1]?
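
Incidentally, the current behavior is easy to observe from psql (a sketch; `ptab` here is the hypothetical partitioned table from your example, with some partitions):

```sql
BEGIN;
SELECT * FROM ptab WHERE id = 1;

-- Every partition shows up with an AccessShareLock, even though
-- only one of them can contain id = 1:
SELECT relation::regclass, mode
  FROM pg_locks
 WHERE pid = pg_backend_pid() AND locktype = 'relation';
COMMIT;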

> We also need to consider other parts of the system that may not scale,
> like pg_dump.  For a long time, we've been sorta-kinda willing to fix
> the worst of the scalability problems with pg_dump, but that's really
> no longer an adequate response.  People want 1000 partitions.  Heck,
> people want 1,000,000 partitions, but getting to where 1000 partitions
> works well would help PostgreSQL a lot.  Our oft-repeated line that
> inheritance isn't designed for large numbers of inheritance children
> is basically just telling people who have the use case where they need
> that to go use some other product.  Partitioning, like replication, is
> not an optional feature for a world-class database.  And, from a
> technical point of view, I think we've now got an infrastructure that
> really should be able to be scaled up considerably higher than what
> we've been able to do in the past.  When we were stuck with
> inheritance + constraint exclusion, we could say "well, there's not
> really any point because you'll hit these other limits anyway".  But I
> think now that's not really true.  This patch eliminates one of the
> core scalability problems in this area, and provides infrastructure
> for attacking some of the others.  I hope that people will step up and
> do that.  There's a huge opportunity here for PostgreSQL to become
> relevant in use cases where it currently falters badly, and we should
> try to take advantage of it.  This patch is a big step by itself, but
> if we ignore the potential to do more with this as the base we will be
> leaving a lot of "win" on the table.

Agreed on all counts.

Thanks,
Amit

[1] https://en.wikipedia.org/wiki/Multiple_granularity_locking





Re: [HACKERS] Declarative partitioning - another take

From
Venkata B Nagothi
Date:
Hi,

I am testing the partitioning feature from the latest master and got the following error while loading the data -

db01=# create table orders_y1993 PARTITION OF orders FOR VALUES FROM ('1993-01-01') TO ('1993-12-31');
CREATE TABLE

db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 6060 in file "base/16384/16412": read only 0 of 8192 bytes
CONTEXT:  COPY orders, line 376589: "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely regular pack"

Am i doing something wrong ?

Regards,

Venkata B N
Database Consultant
 


Re: [HACKERS] Declarative partitioning - another take

From
Maksim Milyutin
Date:
Hi, everyone!


08.12.16 18:25, Robert Haas wrote:
> Of course, this is the beginning, not the end.  I've been thinking
> about next steps -- here's an expanded list:
>
> - more efficient plan-time partition pruning (constraint exclusion is too slow)
> - run-time partition pruning
> - partition-wise join (Ashutosh Bapat is already working on this)
> - try to reduce lock levels
> - hash partitioning
> - the ability to create an index on the parent and have all of the
> children inherit it; this should work something like constraint
> inheritance.  you could argue that this doesn't add any real new
> capability but it's a huge usability feature.
> - teaching autovacuum enough about inheritance hierarchies for it to
> update the parent statistics when they get stale despite the lack of
> any actual inserts/updates/deletes to the parent.  this has been
> pending for a long time, but it's only going to get more important
> - row movement (aka avoiding the need for an ON UPDATE trigger on each
> partition)
> - insert (and eventually update) tuple routing for foreign partitions
> - not scanning the parent
> - fixing the insert routing so that we can skip tuple conversion where possible
> - fleshing out the documentation
>
I would like to work on two tasks:
- insert (and eventually update) tuple routing for foreign partitions
- the ability to create an index on the parent and have all of the
children inherit it

The first one has been implemented in pg_pathman somehow, but the code 
relies on dirty hacks, so the FDW API has to be improved. As for the 
extended index support, it doesn't look like a super-hard task.

-- 
Maksim Milyutin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On Fri, Dec 9, 2016 at 3:16 PM, Venkata B Nagothi <nag1010@gmail.com> wrote:
> Hi,
>
> I am testing the partitioning feature from the latest master and got the
> following error while loading the data -
>
> db01=# create table orders_y1993 PARTITION OF orders FOR VALUES FROM
> ('1993-01-01') TO ('1993-12-31');
> CREATE TABLE
>
> db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
> ERROR:  could not read block 6060 in file "base/16384/16412": read only 0 of
> 8192 bytes
> CONTEXT:  COPY orders, line 376589:
> "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely
> regular pack"

Hmm.   Could you tell what relation the file/relfilenode 16412 belongs to?

Also, is orders_y1993 the only partition of orders?  How about \d+ orders?

Thanks,
Amit



Re: [HACKERS] Declarative partitioning - another take

From
Venkata B Nagothi
Date:



Regards,

Venkata B N
Database Consultant
 

On Fri, Dec 9, 2016 at 11:11 PM, Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, Dec 9, 2016 at 3:16 PM, Venkata B Nagothi <nag1010@gmail.com> wrote:
> Hi,
>
> I am testing the partitioning feature from the latest master and got the
> following error while loading the data -
>
> db01=# create table orders_y1993 PARTITION OF orders FOR VALUES FROM
> ('1993-01-01') TO ('1993-12-31');
> CREATE TABLE
>
> db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
> ERROR:  could not read block 6060 in file "base/16384/16412": read only 0 of
> 8192 bytes
> CONTEXT:  COPY orders, line 376589:
> "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely
> regular pack"

Hmm.   Could you tell what relation the file/relfilenode 16412 belongs to?

db01=# select relname from pg_class where relfilenode=16412 ;
   relname
--------------
 orders_y1997
(1 row)


I VACUUMED the partition and then re-ran the copy command and no luck.
 
db01=# vacuum orders_y1997;
VACUUM

db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 6060 in file "base/16384/16412": read only 0 of 8192 bytes
CONTEXT:  COPY orders, line 376589: "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely regular pack"

I do not quite understand the below behaviour as well. I VACUUMED 1997 partition and then i got an error for 1992 partition and then after 1996 and then after 1994 and so on.

postgres=# \c db01
You are now connected to database "db01" as user "dba".
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 6060 in file "base/16384/16412": read only 0 of 8192 bytes
CONTEXT:  COPY orders, line 376589: "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely regular pack"
db01=# vacuum orders_y1997;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 3942 in file "base/16384/16406": read only 0 of 8192 bytes
CONTEXT:  COPY orders, line 75445: "1993510|185287|F|42667.9|1992-08-15|2-HIGH         |Clerk#000000079|0| dugouts above the even "
db01=# select relname from pg_class where relfilenode=16406;
   relname
--------------
 orders_y1992
(1 row)

db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 3942 in file "base/16384/16406": read only 0 of 8192 bytes
CONTEXT:  COPY orders, line 75396: "1993317|260510|F|165852|1992-12-13|5-LOW          |Clerk#000003023|0|regular foxes. ironic dependenc..."
db01=# vacuum orders_y1992;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 3708 in file "base/16384/16394": read only 0 of 8192 bytes
CONTEXT:  COPY orders, line 178820: "4713957|286270|O|200492|1996-10-01|1-URGENT       |Clerk#000001993|0|uriously final packages. slyly "
db01=# select relname from pg_class where relfilenode=16394;
   relname
--------------
 orders_y1996
(1 row)

db01=# vacuum orders_y1996;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 5602 in file "base/16384/16403": read only 0 of 8192 bytes
CONTEXT:  COPY orders, line 147390: "3882662|738010|F|199365|1994-12-26|5-LOW          |Clerk#000001305|0|ar instructions above the expre..."
db01=# select relname from pg_class where relfilenode=16403;
   relname
--------------
 orders_y1994
(1 row)

db01=# vacuum orders_y1994;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 5561 in file "base/16384/16412": read only 0 of 8192 bytes
CONTEXT:  COPY orders, line 59276: "1572448|646948|O|25658.6|1997-05-02|4-NOT SPECIFIED|Clerk#000001993|0|es. ironic, regular p"

And finally the error again occurred for 1997 partition

db01=# select relname from pg_class where relfilenode=16412;
   relname
--------------
 orders_y1997
(1 row)

db01=# vacuum orders_y1997;
VACUUM
db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 6060 in file "base/16384/16412": read only 0 of 8192 bytes
CONTEXT:  COPY orders, line 376589: "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely regular pack"
db01=#

Am i not understanding anything here ?
 
Also, is orders_y1993 the only partition of orders?  How about \d+ orders?

Yes, I created multiple yearly partitions for the orders table. I wanted to load the 1993 data first and see if the data goes into the orders_y1993 partition, but it seems that the CSV contains 1997 data as well, which the COPY command found.

db01=# \d+ orders
                                              Table "public.orders"
     Column      |         Type          | Collation | Nullable | Default | Storage  | Stats target | Description
-----------------+-----------------------+-----------+----------+---------+----------+--------------+-------------
 o_orderkey      | integer               |           |          |         | plain    |              |
 o_custkey       | integer               |           |          |         | plain    |              |
 o_orderstatus   | character(1)          |           |          |         | extended |              |
 o_totalprice    | real                  |           |          |         | plain    |              |
 o_orderdate     | date                  |           | not null |         | plain    |              |
 o_orderpriority | character(15)         |           |          |         | extended |              |
 o_clerk         | character(15)         |           |          |         | extended |              |
 o_shippriority  | integer               |           |          |         | plain    |              |
 o_comment       | character varying(79) |           |          |         | extended |              |
Partition key: RANGE (o_orderdate)
Partitions: orders_y1992 FOR VALUES FROM ('1992-01-01') TO ('1992-12-31'),
            orders_y1993 FOR VALUES FROM ('1993-01-01') TO ('1993-12-31'),
            orders_y1994 FOR VALUES FROM ('1994-01-01') TO ('1994-12-31'),
            orders_y1995 FOR VALUES FROM ('1995-01-01') TO ('1995-12-31'),
            orders_y1996 FOR VALUES FROM ('1996-01-01') TO ('1996-12-31'),
            orders_y1997 FOR VALUES FROM ('1997-01-01') TO ('1997-12-31'),
            orders_y1998 FOR VALUES FROM ('1998-01-01') TO ('1998-12-31')


Re: [HACKERS] Declarative partitioning - another take

From
Tomas Vondra
Date:
Hi,

On 12/07/2016 07:20 PM, Robert Haas wrote:
> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers <er@xs4all.nl> wrote:
>>> My bad. The fix I sent last night for one of the cache flush issues
>>> wasn't quite right.  The attached seems to fix it.
>>
>> Yes, fixed here too.  Thanks.
>
> Thanks for the report - that was a good catch.
>
> I've committed 0001 - 0006 with that correction and a few other
> adjustments.  There's plenty of room for improvement here, and almost
> certainly some straight-up bugs too, but I think we're at a point
> where it will be easier and less error-prone to commit follow on
> changes incrementally rather than by continuously re-reviewing a very
> large patch set for increasingly smaller changes.

I've been working on a review / testing of the partitioning patch, but 
have been unable to submit it before the commit due to a lot of travel. 
However, at least some of the points seem to be still valid, so let me 
share it as an after-commit review. Most of the issues are fairly minor 
(possibly even nitpicking).


review
------

1) src/include/catalog/pg_partitioned_table.h contains this bit:
 * $PostgreSQL: pgsql/src/include/catalog/pg_partitioned_table.h $

2) I'm wondering whether having 'table' in the catalog name (and also in 
the new relkind) is too limiting. I assume we'll have partitioned 
indexes one day, for example - do we expect to use the same catalogs?

3) A comment within BeginCopy (in copy.c) says:
     * Initialize state for CopyFrom tuple routing.  Watch out for
     * any foreign partitions.

But the code does not seem to be doing that (at least I don't see any 
obvious checks for foreign partitions). Also, the comment should 
probably at least mention why foreign partitions need extra care.

To nitpick, the 'pd' variable in that block seems rather pointless - we 
can assign directly to cstate->partition_dispatch_info.

4) I see GetIndexOpClass() got renamed to ResolveOpClass(). I find the 
new name rather strange - all other similar functions start with "Get", 
so I guess "GetOpClass()" would be better. But I wonder if the rename 
was even necessary, considering that it still deals with index operator 
classes (although now also in context of partition keys). If the rename 
really is needed, isn't that a sign that the function does not belong to 
indexcmds.c anymore?

5) Half the error messages use 'child table' while the other half uses 
'partition'. I think we should be using 'partition' as child tables 
really have little meaning outside inheritance (which is kinda hidden 
behind the new partitioning stuff).

6) The psql tab-completion seems a bit broken, because it only offers 
the partitions, not the parent table. Which is usually exactly the 
opposite of what the user wants.


testing
-------

I've also done quite a bit of testing with different partition layouts 
(single-level list/range partitioning, multi-level partitioning etc.), 
with fairly large number (~10k) of partitions. The scripts I used are 
available here: https://bitbucket.org/tvondra/partitioning-tests

1) There seems to be an issue when a partition is created and then 
accessed within the same transaction, i.e. for example

BEGIN;
... create parent ...
... create partition ....
... insert into parent ...
COMMIT;

which simply fails with an error like this:

ERROR:  no partition of relation "range_test_single" found for row
DETAIL:  Failing row contains (99000, 99000).

Of course, the partition is there. And interestingly enough, this works 
perfectly fine when executed without the explicit transaction, so I 
assume it's some sort of cache invalidation mix-up.
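
For reference, a minimal reproduction looks like this (partition bounds invented for the example):

```sql
BEGIN;
CREATE TABLE range_test_single (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE range_test_single_p1 PARTITION OF range_test_single
    FOR VALUES FROM (0) TO (100000);
-- Fails with "no partition of relation ... found for row" inside
-- the transaction, but succeeds when run without BEGIN/COMMIT:
INSERT INTO range_test_single VALUES (99000, 99000);
COMMIT;
```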

2) When doing a SELECT COUNT(*) from the partitioned table, I get a plan 
like this:
                                  QUERY PLAN
 -------------------------------------------------------------------
  Finalize Aggregate  (cost=124523.64..124523.65 rows=1 width=8)
    ->  Gather  (cost=124523.53..124523.64 rows=1 width=8)
          Workers Planned: 1
          ->  Partial Aggregate  (cost=123523.53..123523.54 rows=1 ...)
                ->  Append  (cost=0.00..108823.53 rows=5880001 width=0)
                      ->  Parallel Seq Scan on parent ...
                      ->  Parallel Seq Scan on partition_1 ...
                      ->  Parallel Seq Scan on partition_2 ...
                      ->  Parallel Seq Scan on partition_3 ...
                      ->  Parallel Seq Scan on partition_4 ...
                      ->  Parallel Seq Scan on partition_5 ...
                      ->  Parallel Seq Scan on partition_6 ...
                      ... and the rest of the 10k partitions

So if I understand the plan correctly, we first do a parallel scan of 
the parent, then partition_1, partition_2 etc. But AFAIK we scan the 
tables in Append sequentially, and each partition only has 1000 rows,
making the parallel execution rather inefficient.  Unless we scan
the partitions concurrently.

In any case, as this happens even with plain inheritance, it's probably 
more about the parallel query than about the new partitioning patch.

3) The last issue I noticed is that while
    EXPLAIN SELECT * FROM partitioned_table WHERE id = 1292323;

works just fine (it takes a fair amount of time to plan with 10k
partitions, but that's expected), this
    EXPLAIN UPDATE partitioned_table SET x = 1 WHERE id = 1292323;

allocates a lot of memory (~8GB on my laptop, before it gets killed by 
OOM killer). Again, the same thing happens with plain inheritance-based 
partitioning, so it's probably not a bug in the partitioning patch.

I'm mentioning it here because I think the new partitioning will 
hopefully get more efficient and handle large partition counts more 
efficiently (the inheritance only really works for ~100 partitions, 
which is probably why no one complained about OOM during UPDATEs). Of 
course, 10k partitions is a bit extreme (good for testing, though).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi,

On 2016/12/11 10:02, Venkata B Nagothi wrote:
> On Fri, Dec 9, 2016 at 11:11 PM, Amit Langote <amitlangote09@gmail.com>
> wrote:
>> On Fri, Dec 9, 2016 at 3:16 PM, Venkata B Nagothi <nag1010@gmail.com>
>> wrote:
>>> I am testing the partitioning feature from the latest master and got the
>>> following error while loading the data -
>>>
>>> db01=# create table orders_y1993 PARTITION OF orders FOR VALUES FROM
>>> ('1993-01-01') TO ('1993-12-31');
>>> CREATE TABLE
>>>
>>> db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
>>> ERROR:  could not read block 6060 in file "base/16384/16412": read only 0 of
>>> 8192 bytes
>>> CONTEXT:  COPY orders, line 376589:
>>> "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely
>>> regular pack"
>>
>> Hmm.   Could you tell what relation the file/relfilenode 16412 belongs to?
>>
> 
> db01=# select relname from pg_class where relfilenode=16412 ;
>    relname
> --------------
>  orders_y1997
> (1 row)
> 
> 
> I VACUUMED the partition and then re-ran the copy command and no luck.
> 
> db01=# vacuum orders_y1997;
> VACUUM
> 
> db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
> ERROR:  could not read block 6060 in file "base/16384/16412": read only 0
> of 8192 bytes
> CONTEXT:  COPY orders, line 376589:
> "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely
> regular pack"
> 
> I do not quite understand the below behaviour as well. I VACUUMED 1997
> partition and then i got an error for 1992 partition and then after 1996
> and then after 1994 and so on.
> 

[ ... ]

> db01=# vacuum orders_y1997;
> VACUUM
> db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
> ERROR:  could not read block 6060 in file "base/16384/16412": read only 0
> of 8192 bytes
> CONTEXT:  COPY orders, line 376589:
> "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely
> regular pack"
> db01=#
> 
> Am i not understanding anything here ?

I could not reproduce this issue.  Also, I could not say what might have
gone wrong based only on the information I have seen so far.

Have you tried inserting the same data using insert?

create table orders_unpartitioned (like orders);
copy orders_unpartitioned from '/data/orders-1993.csv';
insert into orders select * from orders_unpartitioned;

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Tomas,

On 2016/12/12 10:02, Tomas Vondra wrote:
> On 12/07/2016 07:20 PM, Robert Haas wrote:
>> I've committed 0001 - 0006 with that correction and a few other
>> adjustments.  There's plenty of room for improvement here, and almost
>> certainly some straight-up bugs too, but I think we're at a point
>> where it will be easier and less error-prone to commit follow on
>> changes incrementally rather than by continuously re-reviewing a very
>> large patch set for increasingly smaller changes.
>>
> 
> I've been working on a review / testing of the partitioning patch, but
> have been unable to submit it before the commit due to a lot of travel.
> However, at least some of the points seem to be still valid, so let me
> share it as an after-commit review. Most of the issues are fairly minor
> (possibly even nitpicking).

Thanks a lot for the review comments and testing!

I attach a patch that implements some of the fixes that you suggest below.
 Also, I am merging a patch I sent earlier [1] to avoid having to look at
multiple patches.

> review
> ------
> 
> 1) src/include/catalog/pg_partitioned_table.h contains this bit:
> 
>  * $PostgreSQL: pgsql/src/include/catalog/pg_partitioned_table.h $

Fixed.

> 2) I'm wondering whether having 'table' in the catalog name (and also in
> the new relkind) is too limiting. I assume we'll have partitioned indexes
> one day, for example - do we expect to use the same catalogs?

I am not sure I understand your idea of partitioned indexes, but I doubt
it would require entries in the catalog under consideration.  Could you
perhaps elaborate more?

> 3) A comment within BeginCopy (in copy.c) says:
> 
>     * Initialize state for CopyFrom tuple routing.  Watch out for
>     * any foreign partitions.
> 
> But the code does not seem to be doing that (at least I don't see any
> obvious checks for foreign partitions). Also, the comment should probably
> at least mention why foreign partitions need extra care.

It's a stale comment after I took out the code that handled foreign
partitions in final versions of the tuple-routing patch.  You may have
noticed in the commit message that tuple-routing does not work for foreign
partitions, because the changes I'd proposed weren't that good.  Comment
fixed anyway.

> To nitpick, the 'pd' variable in that block seems rather pointless - we
> can assign directly to cstate->partition_dispatch_info.

OK, fixed.  Also, fixed similar code in ExecInitModifyTable().

> 4) I see GetIndexOpClass() got renamed to ResolveOpClass(). I find the new
> name rather strange - all other similar functions start with "Get", so I
> guess "GetOpClass()" would be better. But I wonder if the rename was even
> necessary, considering that it still deals with index operator classes
> (although now also in context of partition keys). If the rename really is
> needed, isn't that a sign that the function does not belong to indexcmds.c
> anymore?

The fact that both index keys and partition keys have very similar
specification syntax means there would be some common code handling the
same, of which this is one (maybe the only) example.  I didn't see much point
in finding a new place for the function, although I can see why it may be
a bit confusing to future readers of the code.

About the new name - seeing the top line in the old header comment, what
the function does is resolve a possibly explicitly specified operator
class name to an operator class OID.  I hadn't changed the name in my
original patch, but Robert suggested the rename, which I thought made sense.

> 5) Half the error messages use 'child table' while the other half uses
> 'partition'. I think we should be using 'partition' as child tables really
> have little meaning outside inheritance (which is kinda hidden behind the
> new partitioning stuff).

One way to go about it may be to check all sites that can possibly report
an error involving child tables (aka "partitions") whether they know from
the context which name to use.  I think it's possible, because we have
access to the parent relation in all such sites and looking at the relkind
can tell whether to call child tables "partitions".

> 6) The psql tab-completion seems a bit broken, because it only offers the
> partitions, not the parent table. Which is usually exactly the opposite of
> what the user wants.

Actually, I have not implemented any support for tab-completion for the
new DDL.  I will work on that.

> testing
> -------
> 
> I've also done quite a bit of testing with different partition layouts
> (single-level list/range partitioning, multi-level partitioning etc.),
> with fairly large number (~10k) of partitions. The scripts I used are
> available here: https://bitbucket.org/tvondra/partitioning-tests
> 
> 1) There seems to be an issue when a partition is created and then
> accessed within the same transaction, i.e. for example
> 
> BEGIN;
> ... create parent ...
> ... create partition ....
> ... insert into parent ...
> COMMIT;
> 
> which simply fails with an error like this:
> 
> ERROR:  no partition of relation "range_test_single" found for row
> DETAIL:  Failing row contains (99000, 99000).
> 
> Of course, the partition is there. And interestingly enough, this works
> perfectly fine when executed without the explicit transaction, so I assume
> it's some sort of cache invalidation mix-up.

That's clearly a bug.  A relcache invalidation on the parent was missing
in the CREATE TABLE ... PARTITION OF code path.  Fixed.

> 2) When doing a SELECT COUNT(*) from the partitioned table, I get a plan
> like this:
> 
>                                   QUERY PLAN
> -------------------------------------------------------------------
>  Finalize Aggregate  (cost=124523.64..124523.65 rows=1 width=8)
>    ->  Gather  (cost=124523.53..124523.64 rows=1 width=8)
>          Workers Planned: 1
>          ->  Partial Aggregate  (cost=123523.53..123523.54 rows=1 ...)
>                ->  Append  (cost=0.00..108823.53 rows=5880001 width=0)
>                      ->  Parallel Seq Scan on parent ...
>                      ->  Parallel Seq Scan on partition_1 ...
>                      ->  Parallel Seq Scan on partition_2 ...
>                      ->  Parallel Seq Scan on partition_3 ...
>                      ->  Parallel Seq Scan on partition_4 ...
>                      ->  Parallel Seq Scan on partition_5 ...
>                      ->  Parallel Seq Scan on partition_6 ...
>                      ... and the rest of the 10k partitions
> 
> So if I understand the plan correctly, we first do a parallel scan of the
> parent, then partition_1, partition_2 etc. But AFAIK we scan the tables in
> Append sequentially, and each partition only has 1000 rows each, making
> the parallel execution rather inefficient. Unless we scan the partitions
> concurrently.
> 
> In any case, as this happens even with plain inheritance, it's probably
> more about the parallel query than about the new partitioning patch.

Yes, I have seen some discussion [2] about a Parallel Append, which would
result in plans you're probably thinking of.

> 3) The last issue I noticed is that while
> 
>     EXPLAIN SELECT * FROM partitioned_table WHERE id = 1292323;
> 
> works just fine (it takes fair amount of time to plan with 10k partitions,
> but that's expected), this
> 
>     EXPLAIN UPDATE partitioned_table SET x = 1 WHERE id = 1292323;
> 
> allocates a lot of memory (~8GB on my laptop, before it gets killed by OOM
> killer). Again, the same thing happens with plain inheritance-based
> partitioning, so it's probably not a bug in the partitioning patch.
> 
> I'm mentioning it here because I think the new partitioning will hopefully
> get more efficient and handle large partition counts more efficiently (the
> inheritance only really works for ~100 partitions, which is probably why
> no one complained about OOM during UPDATEs). Of course, 10k partitions is
> a bit extreme (good for testing, though).

Plans with inheritance parents as target tables (update/delete) go
through inheritance_planner() in the optimizer, whose current design is
based on certain assumptions about traditional inheritance.  Hopefully,
it's possible to make it less expensive for partitioned tables.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/c9737515-36ba-6a42-f788-1b8867e2dd38%40lab.ntt.co.jp

[2]
https://www.postgresql.org/message-id/83755E07-C435-493F-9F93-F727604D66A1%40gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Peter Eisentraut
Date:
On 12/7/16 1:20 PM, Robert Haas wrote:
> I've committed 0001 - 0006 with that correction and a few other
> adjustments.  There's plenty of room for improvement here, and almost
> certainly some straight-up bugs too, but I think we're at a point
> where it will be easier and less error-prone to commit follow on
> changes incrementally rather than by continuously re-reviewing a very
> large patch set for increasingly smaller changes.

This page
<https://www.postgresql.org/docs/devel/static/ddl-partitioning.html>,
which is found via the index entry for "Partitioning", should be updated
for the new functionality.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Declarative partitioning - another take

From
Tomas Vondra
Date:
On 12/12/2016 07:37 AM, Amit Langote wrote:
>
> Hi Tomas,
>
> On 2016/12/12 10:02, Tomas Vondra wrote:
>>
>> 2) I'm wondering whether having 'table' in the catalog name (and also in
>> the new relkind) is too limiting. I assume we'll have partitioned indexes
>> one day, for example - do we expect to use the same catalogs?
>
> I am not sure I understand your idea of partitioned indexes, but I doubt
> it would require entries in the catalog under consideration.  Could you
> perhaps elaborate more?
>

OK, let me elaborate. Let's say we have a partitioned table, and I want 
to create an index. The index may be either "global" (i.e. a single 
relation for data from all the partitions) or "local" (i.e. 
partitioned the same way as the table).

Local indexes are easier to implement (it's essentially what we have 
now, except that we need to create the indexes manually for each 
partition), and don't work particularly well for some use cases (e.g. 
unique constraints). This is what I mean by "partitioned indexes".

If the index is partitioned just like the table, we probably won't need 
to copy the partition key info (so, nothing in pg_partitioned_table).
I'm not sure it makes sense to partition the index differently than the 
table - I don't see a case where that would be useful.

The global indexes would work better for the unique constraint use case, 
but it clearly contradicts our idea of TID (no information about which 
partition it references).

So maybe the catalog really only needs to track info about tables? Not 
sure. I'm just saying it'd be unfortunate to have _table in the name, 
and end up using it for indexes too.

>
>> 4) I see GetIndexOpClass() got renamed to ResolveOpClass(). I find the new
>> name rather strange - all other similar functions start with "Get", so I
>> guess "GetOpClass()" would be better. But I wonder if the rename was even
>> necessary, considering that it still deals with index operator classes
>> (although now also in context of partition keys). If the rename really is
>> needed, isn't that a sign that the function does not belong to indexcmds.c
>> anymore?
>
> The fact that both index keys and partition keys have very similar
> specification syntax means there would be some common code handling the
> same, of which this is one (maybe, only) example.  I didn't see much point
> in finding a new place for the function, although I can see why it may be
> a bit confusing to future readers of the code.
>
> About the new name - seeing the top line in the old header comment, what
> the function does is resolve a possibly explicitly specified operator
> class name to an operator class OID.  I hadn't changed the name in my
> original patch, but Robert suggested the rename, which I thought made sense.
>

OK, I don't particularly like the name but I can live with that.

>> 5) Half the error messages use 'child table' while the other half uses
>> 'partition'. I think we should be using 'partition' as child tables really
>> have little meaning outside inheritance (which is kinda hidden behind the
>> new partitioning stuff).
>
> One way to go about it may be to check all sites that can possibly report
> an error involving child tables (aka "partitions") whether they know from
> the context which name to use.  I think it's possible, because we have
> access to the parent relation in all such sites and looking at the relkind
> can tell whether to call child tables "partitions".
>

Clearly, this is a consequence of building the partitioning on top of 
inheritance (not objecting to that approach, merely stating a fact).

I'm fine with whatever makes the error messages more consistent, if it 
does not make the code significantly more complex. It's a bit confusing 
when some use 'child tables' and others 'partitions'. I suspect even a 
single DML command may return a mix of those, depending on where exactly 
it fails (old vs. new code).

>> 6) The psql tab-completion seems a bit broken, because it only offers the
>> partitions, not the parent table. Which is usually exactly the opposite of
>> what the user wants.
>
> Actually, I have not implemented any support for tab-completion for the
> new DDL.  I will work on that.
>

Aha! So the problem probably is that psql does not recognize the new 
'partitioned table' relkind, and only looks at regular tables.

>> testing
>> -------
>
>> 2) When doing a SELECT COUNT(*) from the partitioned table, I get a plan
>> like this:
>>
>>                                   QUERY PLAN
>> -------------------------------------------------------------------
>>  Finalize Aggregate  (cost=124523.64..124523.65 rows=1 width=8)
>>    ->  Gather  (cost=124523.53..124523.64 rows=1 width=8)
>>          Workers Planned: 1
>>          ->  Partial Aggregate  (cost=123523.53..123523.54 rows=1 ...)
>>                ->  Append  (cost=0.00..108823.53 rows=5880001 width=0)
>>                      ->  Parallel Seq Scan on parent ...
>>                      ->  Parallel Seq Scan on partition_1 ...
>>                      ->  Parallel Seq Scan on partition_2 ...
>>                      ->  Parallel Seq Scan on partition_3 ...
>>                      ->  Parallel Seq Scan on partition_4 ...
>>                      ->  Parallel Seq Scan on partition_5 ...
>>                      ->  Parallel Seq Scan on partition_6 ...
>>                      ... and the rest of the 10k partitions
>>
>> So if I understand the plan correctly, we first do a parallel scan of the
>> parent, then partition_1, partition_2 etc. But AFAIK we scan the tables in
>> Append sequentially, and each partition only has 1000 rows each, making
>> the parallel execution rather inefficient. Unless we scan the partitions
>> concurrently.
>>
>> In any case, as this happens even with plain inheritance, it's probably
>> more about the parallel query than about the new partitioning patch.
>
> Yes, I have seen some discussion [2] about a Parallel Append, which would
> result in plans you're probably thinking of.
>

Actually, I think that thread is about allowing partial paths even if 
only some appendrel members support it. My point is that in this case 
it's a bit silly to even build the partial paths, when the partitions 
only have 1000 rows each.

I kinda suspect we only look at the total appendrel rowcount estimate, 
but haven't checked.

>> 3) The last issue I noticed is that while
>>
>>     EXPLAIN SELECT * FROM partitioned_table WHERE id = 1292323;
>>
>> works just fine (it takes fair amount of time to plan with 10k partitions,
>> but that's expected), this
>>
>>     EXPLAIN UPDATE partitioned_table SET x = 1 WHERE id = 1292323;
>>
>> allocates a lot of memory (~8GB on my laptop, before it gets killed by OOM
>> killer). Again, the same thing happens with plain inheritance-based
>> partitioning, so it's probably not a bug in the partitioning patch.
>>
>> I'm mentioning it here because I think the new partitioning will hopefully
>> get more efficient and handle large partition counts more efficiently (the
>> inheritance only really works for ~100 partitions, which is probably why
>> no one complained about OOM during UPDATEs). Of course, 10k partitions is
>> a bit extreme (good for testing, though).
>
> Plans with the inheritance parents as target tables (update/delete)
> go through inheritance_planner() in the optimizer, which currently
> has a design that is based on certain assumptions about traditional
> inheritance. It's possible hopefully to make it less expensive for
> the partitioned  tables.
>

Yes, I know. I wasn't really reporting it as a bug in the partitioning 
patch, but more as a rather surprising difference between plain SELECT 
and UPDATE.

Am I right that one of the ambitions of the new partitioning is to 
improve behavior with a large number of partitions?

At first I thought it's somewhat related to the FDW sharding (each node 
being a partition and having local subpartitions), but I realize the 
planner will only deal with the node partitions I guess.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Declarative partitioning - another take

From
Dmitry Ivanov
Date:
Hi guys,

Looks like I've just encountered a bug. Please excuse me for the messy
email, I don't have much time at the moment.


Here's the test case:

create table test(val int) partition by range (val);
create table test_1 partition of test for values from (1) to (1000)
partition by range(val);
create table test_2 partition of test for values from (1000) to (2000)
partition by range(val);
create table test_1_1 partition of test_1 for values from (1) to (500)
partition by range(val);
create table test_1_2 partition of test_1 for values from (500) to (1000)
partition by range(val);
create table test_1_1_1 partition of test_1_1 for values from (1) to (500);
create table test_1_2_1 partition of test_1_2 for values from (500) to
(1000);


/* insert a row into "test_1_2_1" */
insert into test values(600);


/* what we EXPECT to see */
select *, tableoid::regclass from test;
 val |  tableoid
-----+------------
 600 | test_1_2_1
(1 row)


/* what we ACTUALLY see */
insert into test values(600);
ERROR:  no partition of relation "test_1_1" found for row
DETAIL:  Failing row contains (600).


How does this happen? This is what the "PartitionDispatch" array looks like:

test | test_1 | test_2 | test_1_1 | test_1_2

which means that this code (partition.c : 1025):


/*
 * We can assign indexes this way because of the way
 * parted_rels has been generated.
 */
pd[i]->indexes[j] = -(i + 1 + m);


doesn't work, since partitions are not always placed right after the parent
(implied by index "m").

We have to take into account the total number of partitions we've
encountered so far (right before index "i").

I've attached a patch with a hotfix, but the code looks so-so and has a
smell. I think it must be rewritten. This bug hunt surely took a while: I
had to recheck all of the steps several times.


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Dmitry Ivanov
Date:
Huh, this code is broken as well. We have to ignore partitions that don't
have any subpartitions. Patch is attached below (v2).


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/12 23:14, Peter Eisentraut wrote:
> On 12/7/16 1:20 PM, Robert Haas wrote:
>> I've committed 0001 - 0006 with that correction and a few other
>> adjustments.  There's plenty of room for improvement here, and almost
>> certainly some straight-up bugs too, but I think we're at a point
>> where it will be easier and less error-prone to commit follow on
>> changes incrementally rather than by continuously re-reviewing a very
>> large patch set for increasingly smaller changes.
> 
> This page
> <https://www.postgresql.org/docs/devel/static/ddl-partitioning.html>,
> which is found via the index entry for "Partitioning", should be updated
> for the new functionality.

A patch was included to do that, but it didn't do more than replace the
old DDL listing with the new partitioning commands.  As Robert said in the
email you quoted, that's not enough.  So I'm trying to come up with a
patch updating the page explaining the new functionality better.

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi,

On 2016/12/13 2:45, Dmitry Ivanov wrote:
> Huh, this code is broken as well. We have to ignore partitions that don't
> have any subpartitions. Patch is attached below (v2).

Good catch and thanks a lot for the patch!  I have revised it a bit and
added some explanatory comments to that function.

Attaching the above patch, along with some other patches posted earlier,
and one more patch fixing another bug I found.  Patch descriptions follow:

0001-Miscallaneous-code-and-comment-improvements.patch

Fixes some obsolete comments while improving others.  Also, implements
some of Tomas Vondra's review comments.

0002-Miscallaneous-documentation-improvements.patch

Fixes inconsistencies and improves some examples in the documentation.
Also, mentions the limitation regarding row movement.

0003-Invalidate-the-parent-s-relcache-after-partition-cre.patch

Fixes a bug reported by Tomas, whereby a parent's relcache was not
invalidated after creation of a new partition using CREATE TABLE PARTITION
OF.  This resulted in the tuple-routing code not being able to find a
partition that was created by the last command in a given transaction.

0004-Fix-a-bug-of-insertion-into-an-internal-partition.patch

Fixes a bug I found this morning, whereby an internal partition's
constraint would not be enforced if it is targeted directly.  See example
below:

create table p (a int, b char) partition by range (a);
create table p1 partition of p for values from (1) to (10) partition by
list (b);
create table p1a partition of p1 for values in ('a');
insert into p1 values (0, 'a');  -- should fail, but doesn't

0005-Fix-a-tuple-routing-bug-in-multi-level-partitioned-t.patch

Fixes a bug discovered by Dmitry Ivanov, whereby wrong indexes were
assigned to the partitions of lower levels (level > 1), causing spurious
"partition not found" error as demonstrated in his email [1].


Sorry about sending some of these patches repeatedly though.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/e6c56fe9-4b87-4f64-ac6f-bc99675f3f9e%40postgrespro.ru


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/13 0:17, Tomas Vondra wrote:
> On 12/12/2016 07:37 AM, Amit Langote wrote:
>>
>> Hi Tomas,
>>
>> On 2016/12/12 10:02, Tomas Vondra wrote:
>>>
>>> 2) I'm wondering whether having 'table' in the catalog name (and also in
>>> the new relkind) is too limiting. I assume we'll have partitioned indexes
>>> one day, for example - do we expect to use the same catalogs?
>>
>> I am not sure I understand your idea of partitioned indexes, but I doubt
>> it would require entries in the catalog under consideration.  Could you
>> perhaps elaborate more?
>>
> 
> OK, let me elaborate. Let's say we have a partitioned table, and I want to
> create an index. The index may be either "global" i.e. creating a single
> relation for data from all the partitions, or "local" (i.e. partitioned
> the same way as the table).
> 
> Local indexes are easier to implement (it's essentially what we have now,
> except that we need to create the indexes manually for each partition),
> and don't work particularly well for some use cases (e.g. unique
> constraints). This is what I mean by "partitioned indexes".
> 
> If the index is partitioned just like the table, we probably won't need to
> copy the partition key info (so, nothing in pg_partitioned_table).
> I'm not sure it makes sense to partition the index differently than the
> table - I don't see a case where that would be useful.
> 
> The global indexes would work better for the unique constraint use case,
> but it clearly contradicts our idea of TID (no information about which
> partition that references).
> 
> So maybe the catalog really only needs to track info about tables? Not
> sure. I'm just saying it'd be unfortunate to have _table in the name, and
> end up using it for indexes too.

Hmm, I didn't quite think of the case where the index is partitioned
differently from the table, but perhaps that's possible with some other
databases.

What you describe as "local indexes" or "locally partitioned indexes" is
something I would like to see being pursued in the near term.  In that
case, we would allow defining indexes on the parent that are recursively
defined on the partitions and marked as inherited indexes, just like we have
inherited check constraints and NOT NULL constraints.  I have not studied
whether we could implement (globally) *unique* indexes with this scheme
though, wherein the index key is a superset of the partition key.

>>> 5) Half the error messages use 'child table' while the other half uses
>>> 'partition'. I think we should be using 'partition' as child tables really
>>> have little meaning outside inheritance (which is kinda hidden behind the
>>> new partitioning stuff).
>>
>> One way to go about it may be to check all sites that can possibly report
>> an error involving child tables (aka "partitions") whether they know from
>> the context which name to use.  I think it's possible, because we have
>> access to the parent relation in all such sites and looking at the relkind
>> can tell whether to call child tables "partitions".
>>
> 
> Clearly, this is a consequence of building the partitioning on top of
> inheritance (not objecting to that approach, merely stating a fact).
> 
> I'm fine with whatever makes the error messages more consistent, if it
> does not make the code significantly more complex. It's a bit confusing
> when some use 'child tables' and others 'partitions'. I suspect even a
> single DML command may return a mix of those, depending on where exactly
> it fails (old vs. new code).

So, we have mostly some old DDL (CREATE/ALTER TABLE) and maintenance
commands that understand inheritance.  All of their error messages
apply to partitions as well, wherein they will be referred to as "child
tables" using the old terminology.  We now have some cases where the commands cause
additional error messages for only partitions because of additional
restrictions that apply to them.  We use "partitions" for them because
they are essentially new error messages.

There won't be a case where a single DML command would mix the two terms,
because we do not allow mixing partitioning and regular inheritance.
Maybe I misunderstood you though.

>>> So if I understand the plan correctly, we first do a parallel scan of the
>>> parent, then partition_1, partition_2 etc. But AFAIK we scan the tables in
>>> Append sequentially, and each partition only has 1000 rows each, making
>>> the parallel execution rather inefficient. Unless we scan the partitions
>>> concurrently.
>>>
>>> In any case, as this happens even with plain inheritance, it's probably
>>> more about the parallel query than about the new partitioning patch.
>>
>> Yes, I have seen some discussion [2] about a Parallel Append, which would
>> result in plans you're probably thinking of.
>>
> 
> Actually, I think that thread is about allowing partial paths even if only
> some appendrel members support it. My point is that in this case it's a
> bit silly to even build the partial paths, when the partitions only have
> 1000 rows each.
> 
> I kinda suspect we only look at the total appendrel rowcount estimate, but
> haven't checked.

Oh, I misread.  Anyway, Parallel Append is something to think about.

>>> 3) The last issue I noticed is that while
>>>
>>>     EXPLAIN SELECT * FROM partitioned_table WHERE id = 1292323;
>>>
>>> works just fine (it takes fair amount of time to plan with 10k partitions,
>>> but that's expected), this
>>>
>>>     EXPLAIN UPDATE partitioned_table SET x = 1 WHERE id = 1292323;
>>>
>>> allocates a lot of memory (~8GB on my laptop, before it gets killed by OOM
>>> killer). Again, the same thing happens with plain inheritance-based
>>> partitioning, so it's probably not a bug in the partitioning patch.
>>>
>>> I'm mentioning it here because I think the new partitioning will hopefully
>>> get more efficient and handle large partition counts more efficiently (the
>>> inheritance only really works for ~100 partitions, which is probably why
>>> no one complained about OOM during UPDATEs). Of course, 10k partitions is
>>> a bit extreme (good for testing, though).
>>
>> Plans with the inheritance parents as target tables (update/delete)
>> go through inheritance_planner() in the optimizer, which currently
>> has a design that is based on certain assumptions about traditional
>> inheritance. It's hopefully possible to make it less expensive for
>> partitioned tables.
>>
> 
> Yes, I know. I wasn't really reporting it as a bug in the partitioning
> patch, but more as a rather surprising difference between plain SELECT and
> UPDATE.
> 
> Am I right that one of the ambitions of the new partitioning is to improve
> behavior with large number of partitions?

Yes.  Currently, SELECT planning is O(n) with a significantly large
constant factor.  It is possible now to make it O(log n).  Also, if we can
do away with the inheritance_planner() treatment for *partitioned tables*
in the case of UPDATE/DELETE, then that would be great.  That would mean
their planning time would be almost the same as in the SELECT case.
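For concreteness, the planning cost under discussion can be seen with a setup like the following (a sketch with illustrative names; the time taken to plan, not the plan shape, is what the O(n) vs. O(log n) distinction affects):

```sql
-- Illustrative range-partitioned table; names are made up.
CREATE TABLE measurements (id int, logdate date) PARTITION BY RANGE (logdate);
CREATE TABLE measurements_2016 PARTITION OF measurements
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
CREATE TABLE measurements_2017 PARTITION OF measurements
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- Only measurements_2017 can match, but the current planner examines
-- every partition's constraint to decide that, hence O(n) planning.
EXPLAIN SELECT * FROM measurements WHERE logdate = '2017-06-01';
```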

As you might know, we have volunteers to make this happen sooner [1], :)

> At first I thought it's somewhat related to the FDW sharding (each node
> being a partition and having local subpartitions), but I realize the
> planner will only deal with the node partitions I guess.

Yeah, the planner would only have the local partitioning metadata at its
disposal.  Foreign tables can only be leaf partitions, which, if they need
to be scanned for a given query, will be scanned using a ForeignScan.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/426b2b01-61e0-43aa-bd84-c6fcf516f1c3%40postgrespro.ru





Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Dec 8, 2016 at 11:58 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hierarchical lock manager stuff is interesting.  Are you perhaps alluding
> to a new *intention* lock mode as described in the literature on multiple
> granularity locking [1]?

Yes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Dec 9, 2016 at 5:46 AM, Maksim Milyutin
<m.milyutin@postgrespro.ru> wrote:
> I would like to work on two tasks:
>  - insert (and eventually update) tuple routing for foreign partition.
>  - the ability to create an index on the parent and have all of the children
> inherit it;
>
> The first one has been implemented in pg_pathman somehow, but the code
> relies on dirty hacks, so the FDW API has to be improved. As for the
> extended index support, it doesn't look like a super-hard task.

Great!

I think that the second one will be fairly tricky.  You will need some
way of linking the child indexes back to the parent index.  And then
you will need to prohibit them from being dropped or modified
independent of the parent index.  And you will need to cascade ALTER
commands on the parent index (which will have no real storage, I hope)
down to the children.  Unfortunately, index names have to be unique on
a schema-wide basis, so we'll somehow need to generate names for the
"child" indexes.  But those names might collide with existing objects,
and will need to be preserved across a dump-and-restore.  The concept
is simple but getting all of the details right is hard.
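For concreteness, the manual steps this feature would automate look roughly like the following (a sketch; table and index names are illustrative, and choosing those schema-unique index names is exactly the name-generation problem mentioned above):

```sql
-- Status quo: one index per partition, created and named by hand.
CREATE TABLE events (id int, ts timestamptz) PARTITION BY RANGE (ts);
CREATE TABLE events_2016 PARTITION OF events
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
CREATE TABLE events_2017 PARTITION OF events
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- Each name must be unique schema-wide and must survive dump/restore.
CREATE INDEX events_2016_ts_idx ON events_2016 (ts);
CREATE INDEX events_2017_ts_idx ON events_2017 (ts);
```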

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Dec 13, 2016 at 1:58 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attaching the above patch, along with some other patches posted earlier,
> and one more patch fixing another bug I found.  Patch descriptions follow:
>
> 0001-Miscallaneous-code-and-comment-improvements.patch
>
> Fixes some obsolete comments while improving others.  Also, implements
> some of Tomas Vondra's review comments.

Committed with some pgindenting.

> 0002-Miscallaneous-documentation-improvements.patch
>
> Fixes inconsistencies and improves some examples in the documentation.
> Also, mentions the limitation regarding row movement.

Ignored because I committed what I think is the same or a similar
patch earlier.  Please resubmit any remaining changes.

> 0003-Invalidate-the-parent-s-relcache-after-partition-cre.patch
>
> Fixes a bug reported by Tomas, whereby a parent's relcache was not
> invalidated after creation of a new partition using CREATE TABLE PARTITION
> OF.  This resulted in tuple-routing code not being able to find a
> partition that was created by the last command in a given transaction.

Shouldn't StorePartitionBound() be responsible for issuing its own
invalidations, as StorePartitionKey() already is?  Maybe you'd need to
pass "parent" as another argument, but that way you know you don't
have the same bug at some other place where the function is called.

> 0004-Fix-a-bug-of-insertion-into-an-internal-partition.patch
>
> Fixes a bug I found this morning, whereby an internal partition's
> constraint would not be enforced if it is targeted directly.  See example
> below:
>
> create table p (a int, b char) partition by range (a);
> create table p1 partition of p for values from (1) to (10) partition by
> list (b);
> create table p1a partition of p1 for values in ('a');
> insert into p1 values (0, 'a');  -- should fail, but doesn't

I expect I'm missing something here, but why do we need to hoist
RelationGetPartitionQual() out of InitResultRelInfo() instead of just
having BeginCopy() pass true instead of false?

(Also needs a rebase due to the pgindent cleanup.)

> 0005-Fix-a-tuple-routing-bug-in-multi-level-partitioned-t.patch
>
> Fixes a bug discovered by Dmitry Ivanov, whereby wrong indexes were
> assigned to the partitions of lower levels (level > 1), causing spurious
> "partition not found" error as demonstrated in his email [1].

Committed.  It might have been good to include a test case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Ildar Musin
Date:
Hi hackers,

On 08.12.2016 19:44, Dmitry Ivanov wrote:
>> That would be fantastic.  I and my colleagues at EnterpriseDB can
>> surely help review; of course, maybe you and some of your colleagues
>> would like to help review our patches, too.
>
> Certainly, I'll start reviewing as soon as I get familiar with the code.
>
>
>> Do you think this is
>> likely to be something where you can get something done quickly, with
>> the hope of getting it into v10?
>
> Yes, I've just cleared my schedule in order to make this possible. I'll
> bring in the patches ASAP.
>
>

We've noticed that PartitionDispatch object is built on every INSERT 
query and that it could create unnecessary overhead. Wouldn't it be 
better to keep it in relcache?

Thanks!

-- 
Ildar Musin
i.musin@postgrespro.ru



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Dec 13, 2016 at 12:22 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:
> We've noticed that PartitionDispatch object is built on every INSERT query
> and that it could create unnecessary overhead. Wouldn't it be better to keep
> it in relcache?

You might be able to cache some of that data in the relcache, but List
*keystate is pointing to query-lifespan data, so you can't cache that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Venkata B Nagothi
Date:
 
On Mon, Dec 12, 2016 at 3:06 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Hi,

On 2016/12/11 10:02, Venkata B Nagothi wrote:
> On Fri, Dec 9, 2016 at 11:11 PM, Amit Langote <amitlangote09@gmail.com>
> wrote:
>> On Fri, Dec 9, 2016 at 3:16 PM, Venkata B Nagothi <nag1010@gmail.com>
>> wrote:
>>> I am testing the partitioning feature from the latest master and got the
>>> following error while loading the data -
>>>
>>> db01=# create table orders_y1993 PARTITION OF orders FOR VALUES FROM
>>> ('1993-01-01') TO ('1993-12-31');
>>> CREATE TABLE
>>>
>>> db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
>>> ERROR:  could not read block 6060 in file "base/16384/16412": read only
>> 0 of
>>> 8192 bytes
>>> CONTEXT:  COPY orders, line 376589:
>>> "9876391|374509|O|54847|1997-07-16|3-MEDIUM
>>  |Clerk#000001993|0|ithely
>>> regular pack"
>>
>> Hmm.   Could you tell what relation the file/relfilenode 16412 belongs to?
>>
>
> db01=# select relname from pg_class where relfilenode=16412 ;
>    relname
> --------------
>  orders_y1997
> (1 row)
>
>
> I VACUUMED the partition and then re-ran the copy command and no luck.
>
> db01=# vacuum orders_y1997;
> VACUUM
>
> db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
> ERROR:  could not read block 6060 in file "base/16384/16412": read only 0
> of 8192 bytes
> CONTEXT:  COPY orders, line 376589:
> "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely
> regular pack"
>
> I do not quite understand the below behaviour as well. I VACUUMED 1997
> partition and then i got an error for 1992 partition and then after 1996
> and then after 1994 and so on.
>

[ ... ]

> db01=# vacuum orders_y1997;
> VACUUM
> db01=# copy orders from '/data/orders-1993.csv' delimiter '|';
> ERROR:  could not read block 6060 in file "base/16384/16412": read only 0
> of 8192 bytes
> CONTEXT:  COPY orders, line 376589:
> "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely
> regular pack"
> db01=#
>
> Am i not understanding anything here ?

I could not reproduce this issue.  Also, I could not say what might have
gone wrong based only on the information I have seen so far.

Have you tried inserting the same data using insert?

I can load the data into appropriate partitions using INSERT. So, no issues there.

db01=# CREATE TABLE orders2(
        o_orderkey INTEGER,
        o_custkey INTEGER,
        o_orderstatus CHAR(1),
        o_totalprice REAL,
        o_orderdate DATE,
        o_orderpriority CHAR(15),
        o_clerk CHAR(15),
        o_shippriority INTEGER,
        o_comment VARCHAR(79)) partition by range (o_orderdate);

db01=# insert into orders2 select * from orders where o_orderdate='1995-10-11';
INSERT 0 3110
 
create table orders_unpartitioned (like orders);
copy orders_unpartitioned from '/data/orders-1993.csv';
insert into orders select * from orders_unpartitioned;

Loading the data into a normal table is not an issue (in fact the csv is generated from the table itself)

The issue occurs only when I am trying to load the data from the CSV file into a partitioned table -

db01=# CREATE TABLE orders_y1992
    PARTITION OF orders2 FOR VALUES FROM ('1992-01-01') TO ('1992-12-31');
CREATE TABLE
db01=# copy orders2 from '/data/orders-1993.csv' delimiter '|';
ERROR:  could not read block 6060 in file "base/16384/16407": read only 0 of 8192 bytes
CONTEXT:  COPY orders2, line 376589: "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely regular pack"

Not sure why COPY is failing.

Regards,

Venkata B N
Database Consultant

Re: [HACKERS] Declarative partitioning - another take

From
Etsuro Fujita
Date:
On 2016/12/09 19:46, Maksim Milyutin wrote:
> I would like to work on two tasks:
>  - insert (and eventually update) tuple routing for foreign partition.
>  - the ability to create an index on the parent and have all of the
> children inherit it;
>
> The first one has been implemented in pg_pathman somehow, but the code
> relies on dirty hacks, so the FDW API has to be improved. As for the
> extended index support, it doesn't look like a super-hard task.

That would be great!  I'd like to help review the first one.

Best regards,
Etsuro Fujita





Re: [HACKERS] Declarative partitioning - another take

From
Ildar Musin
Date:
Hi,

On 13.12.2016 21:10, Robert Haas wrote:
> On Tue, Dec 13, 2016 at 12:22 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:
>> We've noticed that PartitionDispatch object is built on every INSERT query
>> and that it could create unnecessary overhead. Wouldn't it be better to keep
>> it in relcache?
>
> You might be able to cache some of that data in the relcache, but List
> *keystate is pointing to query-lifespan data, so you can't cache that.
>

Yes, you are right. I meant mostly the 'indexes' field. I've measured 
insert performance with perf in the case where there are a thousand 
partitions, and it seems that 34% of the time is spent in 
RelationGetPartitionDispatchInfo(), which builds this indexes array. Most 
of that time goes to acquiring locks on all partitions, which is 
unnecessary if we're inserting into just a single partition. 
Probably we could avoid this by moving at least the indexes field into the cache.

-- 
Ildar Musin
i.musin@postgrespro.ru



Re: [HACKERS] Declarative partitioning - another take

From
Tomas Vondra
Date:
Hi Amit,

On 12/13/2016 09:45 AM, Amit Langote wrote:
> On 2016/12/13 0:17, Tomas Vondra wrote:
>> On 12/12/2016 07:37 AM, Amit Langote wrote:
>>>
>>> Hi Tomas,
>>>
>>> On 2016/12/12 10:02, Tomas Vondra wrote:
>>>>
>>>> 2) I'm wondering whether having 'table' in the catalog name (and also in
>>>> the new relkind) is too limiting. I assume we'll have partitioned indexes
>>>> one day, for example - do we expect to use the same catalogs?
>>>
>>> I am not sure I understand your idea of partitioned indexes, but I doubt
>>> it would require entries in the catalog under consideration.  Could you
>>> perhaps elaborate more?
>>>
>>
>> OK, let me elaborate. Let's say we have a partitioned table, and I want to
>> create an index. The index may be either "global" i.e. creating a single
>> relation for data from all the partitions, or "local" (i.e. partitioned
>> the same way as the table).
>>
>> Local indexes are easier to implement (it's essentially what we have now,
>> except that we need to create the indexes manually for each partition),
>> and don't work particularly well for some use cases (e.g. unique
>> constraints). This is what I mean by "partitioned indexes".
>>
>> If the index is partitioned just like the table, we probably won't need to
>> copy the partition key info (so, nothing in pg_partitioned_table).
>> I'm not sure it makes sense to partition the index differently than the
>> table - I don't see a case where that would be useful.
>>
>> The global indexes would work better for the unique constraint use case,
>> but it clearly contradicts our idea of TID (no information about which
>> partition that references).
>>
>> So maybe the catalog really only needs to track info about tables? Not
>> sure. I'm just saying it'd be unfortunate to have _table in the name, and
>> end up using it for indexes too.
>
> Hmm, I didn't quite think of the case where the index is partitioned
> differently from the table, but perhaps that's possible with some other
> databases.
>

I haven't thought about that very deeply either, so perhaps it's an 
entirely silly idea. Also, probably quite complex to implement I guess, 
so unlikely to be pursued soon.

> What you describe as "local indexes" or "locally partitioned indexes" is
> something I would like to see being pursued in the near term.  In that
> case, we would allow defining indexes on the parent that are recursively
> defined on the partitions and marked as inherited index, just like we have
> inherited check constraints and NOT NULL constraints.  I have not studied
> whether we could implement (globally) *unique* indexes with this scheme
> though, wherein the index key is a superset of the partition key.
>

I think implementing UNIQUE constraint with local indexes is possible 
and possibly even fairly simple, but it likely requires SHARE lock on 
all partitions, which is not particularly nice.

When the partition key is referenced in the constraint, that may allow 
locking only a subset of the partitions, possibly even a single one. But 
with multi-level partitioning schemes that may be difficult.

Also, I don't think it's very likely for the partitioning key to be part 
of the unique constraint. For example, a 'users' table is unlikely to 
be distributed by 'login', and so on.

The  global indexes make this easier, because there's just a single 
index to check. But of course, attaching/detaching partitions gets more 
expensive.
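To make that concrete, here is a minimal sketch (illustrative names) of why per-partition unique indexes alone cannot enforce a unique constraint that does not include the partition key:

```sql
CREATE TABLE users (id int, login text) PARTITION BY RANGE (id);
CREATE TABLE users_1 PARTITION OF users FOR VALUES FROM (1) TO (1000);
CREATE TABLE users_2 PARTITION OF users FOR VALUES FROM (1000) TO (2000);

-- Local unique indexes guarantee uniqueness only within one partition.
CREATE UNIQUE INDEX ON users_1 (login);
CREATE UNIQUE INDEX ON users_2 (login);

INSERT INTO users VALUES (1, 'alice');
INSERT INTO users VALUES (1001, 'alice');  -- accepted: lands in a
                                           -- different partition, so no
                                           -- single index sees both rows
```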

Anyway, starting a detailed discussion about local/global indexes was 
not really what I meant to do.

>> Clearly, this is a consequence of building the partitioning on top of
>> inheritance (not objecting to that approach, merely stating a fact).
>>
>> I'm fine with whatever makes the error messages more consistent, if it
>> does not make the code significantly more complex. It's a bit confusing
>> when some use 'child tables' and others 'partitions'. I suspect even a
>> single DML command may return a mix of those, depending on where exactly
>> it fails (old vs. new code).
>
> So, we have mostly some old DDL (CREATE/ALTER TABLE) and maintenance
> commands that understand inheritance.  All of the their error messages
> apply to partitions as well, wherein they will be referred to as "child
> tables" using old terms.  We now have some cases where the commands cause
> additional error messages for only partitions because of additional
> restrictions that apply to them.  We use "partitions" for them because
> they are essentially new error messages.
>
> There won't be a case where single DML command would mix the two terms,
> because we do not allow mixing partitioning and regular inheritance.
> Maybe I misunderstood you though.
>

Don't we call inheritance-related functions from the new DDL? In that 
case we'd fail with 'child tables' error messages in the old code, and 
'partitions' in the new code. I'd be surprised if there was no such code 
reuse, but I haven't checked.

>> Am I right that one of the ambitions of the new partitioning is to improve
>> behavior with large number of partitions?
>
> Yes.  Currently, SELECT planning is O(n) with significantly large constant
> factor.  It is possible now to make it O(log n).  Also, if we can do away
> with inheritance_planner() treatment for the *partitioned tables* in case
> of UPDATE/DELETE, then that would be great.  That would mean their
> planning time would be almost the same as in the SELECT case.
>
> As you might know, we have volunteers to make this happen sooner [1], :)
>

Yes, I know. And it's great that you've managed to make the first step, 
getting all the infrastructure in, allowing others to build on that. 
Kudos to you!

>> At first I thought it's somewhat related to the FDW sharding (each node
>> being a partition and having local subpartitions), but I realize the
>> planner will only deal with the node partitions I guess.
>
> Yeah, planner would only have the local partitioning metadata at its
> disposal.  Foreign tables can only be leaf partitions, which if need to be
> scanned for a given query, will be scanned using a ForeignScan.
>

Right, makes sense. Still, I can imagine for example having many daily 
partitions and not having to merge them regularly just to reduce the 
number of partitions.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Declarative partitioning - another take

From
Dmitry Ivanov
Date:
Hi everyone,

Looks like "sql_inheritance" GUC is affecting partitioned tables:

explain (costs off) select * from test;
          QUERY PLAN
------------------------------
 Append
   ->  Seq Scan on test
   ->  Seq Scan on test_1
   ->  Seq Scan on test_2
   ->  Seq Scan on test_1_1
   ->  Seq Scan on test_1_2
   ->  Seq Scan on test_1_1_1
   ->  Seq Scan on test_1_2_1
(8 rows)


set sql_inheritance = off;


explain (costs off) select * from test;
    QUERY PLAN
------------------
 Seq Scan on test
(1 row)


I might be wrong, but IMO this should not happen. Queries involving update,
delete etc on partitioned tables are basically broken. Moreover, there's no
point in performing such operations on a parent table that's supposed to be
empty at all times.

I've come up with a patch which fixes this behavior for UPDATE, DELETE,
TRUNCATE and also in transformTableEntry(). It might be hacky, but it gives
an idea.

I didn't touch RenameConstraint() and renameatt() since this would break
ALTER TABLE ONLY command.


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/14 1:32, Robert Haas wrote:
> On Tue, Dec 13, 2016 at 1:58 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attaching the above patch, along with some other patches posted earlier,
>> and one more patch fixing another bug I found.  Patch descriptions follow:
>>
>> 0001-Miscallaneous-code-and-comment-improvements.patch
>>
>> Fixes some obsolete comments while improving others.  Also, implements
>> some of Tomas Vondra's review comments.
> 
> Committed with some pgindenting.

Thanks!

>> 0002-Miscallaneous-documentation-improvements.patch
>>
>> Fixes inconsistencies and improves some examples in the documentation.
>> Also, mentions the limitation regarding row movement.
> 
> Ignored because I committed what I think is the same or a similar
> patch earlier.  Please resubmit any remaining changes.

It seems this patch is almost the same thing as what you committed.

>> 0003-Invalidate-the-parent-s-relcache-after-partition-cre.patch
>>
>> Fixes a bug reported by Tomas, whereby a parent's relcache was not
>> invalidated after creation of a new partition using CREATE TABLE PARTITION
> OF.  This resulted in tuple-routing code not being able to find a
>> partition that was created by the last command in a given transaction.
> 
> Shouldn't StorePartitionBound() be responsible for issuing its own
> invalidations, as StorePartitionKey() already is?  Maybe you'd need to
> pass "parent" as another argument, but that way you know you don't
> have the same bug at some other place where the function is called.

OK, done that way in PATCH 1/7 (of the attached various patches as
described below).

>> 0004-Fix-a-bug-of-insertion-into-an-internal-partition.patch
>>
>> Fixes a bug I found this morning, whereby an internal partition's
>> constraint would not be enforced if it is targeted directly.  See example
>> below:
>>
>> create table p (a int, b char) partition by range (a);
>> create table p1 partition of p for values from (1) to (10) partition by
>> list (b);
>> create table p1a partition of p1 for values in ('a');
>> insert into p1 values (0, 'a');  -- should fail, but doesn't
> 
> I expect I'm missing something here, but why do we need to hoist
> RelationGetPartitionQual() out of InitResultRelInfo() instead of just
> having BeginCopy() pass true instead of false?
> 
> (Also needs a rebase due to the pgindent cleanup.)

In this case, we want to enforce only the main target relation's partition
constraint (only needed if it happens to be an internal node partition),
not leaf partitions', because the latter is unnecessary.

We do InitResultRelInfo() for every leaf partition.  What
RelationGetPartitionQual() would return to InitResultRelInfo() is the
partition constraint of the individual leaf partitions, which, as just
mentioned, is unnecessary and would also be inefficient.  With the
proposed patch, we only retrieve the partition constraint for the targeted
table (if there is any) once.  However, when assigning to
ri_PartitionCheck of individual leaf partition's ResultRelInfo, we still
must map any Vars in the expression from the target tables's attnos to the
corresponding leaf partition's attnos (previous version of the patch
failed to do that).

>> 0005-Fix-a-tuple-routing-bug-in-multi-level-partitioned-t.patch
>>
>> Fixes a bug discovered by Dmitry Ivanov, whereby wrong indexes were
>> assigned to the partitions of lower levels (level > 1), causing spurious
>> "partition not found" error as demonstrated in his email [1].
>
> Committed.  It might have been good to include a test case.

Agreed, added tests in the attached patch PATCH 7/7.


Aside from the above, I found few other issues and fixed them in the
attached patches.  Descriptions follow:

[PATCH 1/7] Invalidate the parent's relcache after partition creation.

Invalidate parent's relcache after a partition is created using CREATE
TABLE PARTITION OF.  (Independent reports by Keith Fiske and David Fetter)

[PATCH 2/7] Change how RelationGetPartitionQual() and related code works

Since we always want to recurse, ie, include the parent's partition
constraint (if any), get rid of the argument recurse.

Refactor out the code doing the mapping of attnos of Vars in partition
constraint expressions (parent attnos -> child attnos).  Move it to a
separate function map_partition_varattnos() and call it from appropriate
places.  It previously used to be done in get_qual_from_partbound(),
which would lead to wrong results in certain multi-level partitioning
cases, as the mapping would be done for immediate parent-partition pairs.
Now in generate_partition_qual() which is the workhorse of
RelationGetPartitionQual(), we first generate the full expression
(considering all levels of partitioning) and then do the mapping from the
root parent to a leaf partition.  It is also possible to generate
partition constraint up to certain non-leaf level and then apply the
same to leaf partitions of that sub-tree after suitable substitution
of varattnos using the new map_partition_varattnos() directly.

Bug fix: ATExecAttachPartition() failed to do the mapping when attaching
a partitioned table as partition. It is possible for the partitions of
such table to have different attributes from the table being attached
and/or the target partitioned table.
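A minimal sketch of the attno-mapping problem (illustrative names): a table attached as a partition may store the same columns in a different physical order, so Vars in a constraint expression generated against the parent must be remapped before being applied to the partition:

```sql
CREATE TABLE parent (a int, b text) PARTITION BY LIST (a);

-- Same columns as parent, but physically in the opposite order.
CREATE TABLE part1 (b text, a int);
ALTER TABLE parent ATTACH PARTITION part1 FOR VALUES IN (1);

-- The implied partition constraint is roughly "a = 1"; column a is
-- attno 1 in parent but attno 2 in part1, so applying the parent's
-- expression unmapped would check the wrong column.
```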

[PATCH 3/7] Refactor tuple-routing setup code

It's unnecessarily repeated in copy.c and nodeModifyTable.c, which makes
it a burden to maintain.  Should've been like this to begin with.

I moved the common code to ExecSetupPartitionTupleRouting() in execMain.c
that also houses ExecFindPartition() currently.  Hmm, should there be a
new src/backend/executor/execPartition.c?

[PATCH 4/7] Fix a bug of insertion into an internal partition.

Since implicit partition constraints are not inherited, an internal
partition's constraint was not being enforced when targeted directly.
So, include such constraint when setting up leaf partition result
relations for tuple-routing.

InitResultRelInfo()'s API changes with this.  Instead of passing
a boolean telling whether or not to load the partition constraint,
callers now need to pass the exact constraint expression to use
as ri_PartitionCheck or NIL.

[PATCH 5/7] Fix oddities of tuple-routing and TupleTableSlots

We must use the partition's tuple descriptor *after* a tuple is routed,
not the root table's.  Partition's attributes, for example, may be
ordered differently from the root table's.

We must then switch back to the root table's for the next tuple and
so on.  A dedicated TupleTableSlot is allocated within EState called
es_partition_tuple_slot whose descriptor is set to a given leaf
partition for every row after it's routed.

[PATCH 6/7] Make ExecConstraints() emit the correct row in error msgs.

After a tuple is routed to a partition, it has been converted from the
root table's rowtype to the partition's.  If such a tuple causes an
error in ExecConstraints(), the row shown in error messages might not
match the input row due to possible differences between the root
table's rowtype and the partition's.

To convert back to the correct row format, keep root table relation
descriptor and a reverse tuple conversion map in the ResultRelInfo's
of leaf partitions.
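A sketch of the rowtype mismatch that PATCH 5 and PATCH 6 deal with (illustrative names; DDL details around inherited CHECK constraints are simplified):

```sql
CREATE TABLE t (a int, b text, CHECK (a > 0)) PARTITION BY LIST (a);

-- Partition with the columns in a different physical order; the
-- attached table must carry the same CHECK constraint.
CREATE TABLE t1 (b text, a int CHECK (a > 0));
ALTER TABLE t ATTACH PARTITION t1 FOR VALUES IN (0, 1);

-- The tuple is routed to t1 and converted to t1's rowtype before the
-- CHECK fails; without the reverse tuple conversion map, the error's
-- "Failing row contains" detail would show (x, 0) instead of (0, x).
INSERT INTO t VALUES (0, 'x');
```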

[PATCH 7/7] Add some tests for recent fixes to PartitionDispatch code in
a25665088d

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Greg Stark
Date:
Just poking around with partitioning. I notice that "\d parent"
doesn't list all the partitions, suggesting to use \d+ but a plain
"\d" does indeed list the partitions. That seems a bit strange and
also probably impractical if you have hundreds or thousands of
partitions. Has this come up in previous discussions? Unfortunately
it's proving a bit hard to search for "\d" :/



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/16 17:02, Amit Langote wrote:
> [PATCH 2/7] Change how RelationGetPartitionQual() and related code works
> 
> Since we always want to recurse, ie, include the parent's partition
> constraint (if any), get rid of the argument recurse.
> 
> Refactor out the code doing the mapping of attnos of Vars in partition
> constraint expressions (parent attnos -> child attnos).  Move it to a
> separate function map_partition_varattnos() and call it from appropriate
> places.  It previously used to be done in get_qual_from_partbound(),
> which would lead to wrong results in certain multi-level partitioning
> cases, as the mapping would be done for immediate parent-partition pairs.
> Now in generate_partition_qual() which is the workhorse of
> RelationGetPartitionQual(), we first generate the full expression
> (considering all levels of partitioning) and then do the mapping from the
> root parent to a leaf partition.  It is also possible to generate
> partition constraint up to certain non-leaf level and then apply the
> same to leaf partitions of that sub-tree after suitable substitution
> of varattnos using the new map_partition_varattnos() directly.
> 
> Bug fix: ATExecAttachPartition() failed to do the mapping when attaching
> a partitioned table as partition. It is possible for the partitions of
> such table to have different attributes from the table being attached
> and/or the target partitioned table.

Oops, PATCH 2/7 attached with the previous email had a bug in it, whereby
map_partition_varattnos() was not applied to the partition constraint
expressions returned directly from the relcache (rd_partcheck) copy.
Attaching just the updated (only) PATCH 2 which fixes that.  Patches
1,3,4,5,6,7/7 from the previous email [1] are fine.  Sorry about that.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/c820c0eb-6935-6f84-8c6a-785fdff130c1@lab.ntt.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/16 17:38, Greg Stark wrote:
> Just poking around with partitioning. I notice that "\d parent"
> doesn't list all the partitions, suggesting to use \d+ but a plain
> "\d" does indeed list the partitions. That seems a bit strange and
> also probably impractical if you have hundreds or thousands of
> partitions. Has this come up in previous discussions? Unfortunately
> it's proving a bit hard to search for "\d" :/

Do you mean a plain "\d" (without an argument) should not list tables that
are partitions?  I think that might be preferable.  That would mean we
list only the root partitioned tables with a plain "\d".

Regarding "\d parent", it does the same thing as regular inheritance, but
using the term "partition" instead of "child table".  Without specifying a
+ (\d parent), one gets just "Number of partitions: # (Use \d+ to list
them.)" and with + (\d+ parent), one gets the full list of partitions
showing the partition bound with each.

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Dmitry,

On 2016/12/16 0:40, Dmitry Ivanov wrote:
> Hi everyone,
> 
> Looks like "sql_inheritance" GUC is affecting partitioned tables:
> 
> explain (costs off) select * from test;
>          QUERY PLAN
> ------------------------------
> Append
>   ->  Seq Scan on test
>   ->  Seq Scan on test_1
>   ->  Seq Scan on test_2
>   ->  Seq Scan on test_1_1
>   ->  Seq Scan on test_1_2
>   ->  Seq Scan on test_1_1_1
>   ->  Seq Scan on test_1_2_1
> (8 rows)
> 
> 
> set sql_inheritance = off;
> 
> 
> explain (costs off) select * from test;
>    QUERY PLAN
> ------------------
> Seq Scan on test
> (1 row)
> 
> 
> I might be wrong, but IMO this should not happen. Queries involving
> update, delete etc on partitioned tables are basically broken. Moreover,
> there's no point in performing such operations on a parent table that's
> supposed to be empty at all times.
> 
> I've come up with a patch which fixes this behavior for UPDATE, DELETE,
> TRUNCATE and also in transformTableEntry(). It might be hacky, but it
> gives an idea.
> 
> I didn't touch RenameConstraint() and renameatt() since this would break
> ALTER TABLE ONLY command.

@@ -1198,6 +1198,12 @@ ExecuteTruncate(TruncateStmt *stmt)
         rels = lappend(rels, rel);
         relids = lappend_oid(relids, myrelid);
 

+        /* Use interpretInhOption() unless it's a partitioned table */
+        if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+            recurse = interpretInhOption(rv->inhOpt);
+        else
+            recurse = true;
+        if (recurse)
         {
             ListCell   *child;

If you see the else block of this if, you'll notice this:

    else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("must truncate child tables too")));
 

So that you get this behavior:

# set sql_inheritance to off;
SET

# truncate p;
ERROR:  must truncate child tables too

# reset sql_inheritance;
RESET

# truncate only p;
ERROR:  must truncate child tables too

# truncate p;
TRUNCATE TABLE

Besides that, I initially implemented the same thing as what you are
proposing here, but reverted to the existing behavior at some point during
the discussion.  I think the idea behind that was to not *silently* ignore
user-specified configuration, and instead error out with an appropriate
message.  While that seems to work reasonably for DDL and maintenance
commands (like TRUNCATE above), things look strange for
SELECT/UPDATE/DELETE, as you're saying.

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Dec 16, 2016 at 3:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Aside from the above, I found a few other issues and fixed them in the
> attached patches.  Descriptions follow:

To avoid any further mistakes on my part, can you please resubmit
these with each patch file containing a proposed commit message
including patch authorship information, who reported the issue, links
to relevant discussion if any, and any other attribution information
which I should not fail to include when committing?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On Sat, Dec 17, 2016 at 1:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 16, 2016 at 3:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Aside from the above, I found a few other issues and fixed them in the
>> attached patches.  Descriptions follow:
>
> To avoid any further mistakes on my part, can you please resubmit
> these with each patch file containing a proposed commit message
> including patch authorship information, who reported the issue, links
> to relevant discussion if any, and any other attribution information
> which I should not fail to include when committing?

I think that's good advice and will keep it in mind for any patches I
post henceforth.

In this particular case, I found all the issues myself while working
with some more esoteric test scenarios, except for the first patch (1/7);
as mentioned in that patch's description in the email, there were
independent reports of the issue by Tomas Vondra and David Fetter.

Thanks,
Amit



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/17 11:32, Amit Langote wrote:
> On Sat, Dec 17, 2016 at 1:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Dec 16, 2016 at 3:02 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Aside from the above, I found a few other issues and fixed them in the
>>> attached patches.  Descriptions follow:
>>
>> To avoid any further mistakes on my part, can you please resubmit
>> these with each patch file containing a proposed commit message
>> including patch authorship information, who reported the issue, links
>> to relevant discussion if any, and any other attribution information
>> which I should not fail to include when committing?
> 
> I think that's good advice and will keep it in mind for any patches I
> post henceforth.
> 
> In this particular case, I found all the issues myself while working
> with some more esoteric test scenarios, except for the first patch (1/7);
> as mentioned in that patch's description in the email, there were
> independent reports of the issue by Tomas Vondra and David Fetter.

Here are updated patches including the additional information.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Sun, Dec 18, 2016 at 10:00 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Here are updated patches including the additional information.

Thanks.  Committed 0001.  Will have to review the others when I'm less tired.

BTW, elog() is only supposed to be used for can't happen error
messages; when it is used, no translation is done.  So this is wrong:
    if (skip_validate)
        elog(NOTICE, "skipping scan to validate partition constraint");


-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/20 12:59, Robert Haas wrote:
> On Sun, Dec 18, 2016 at 10:00 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Here are updated patches including the additional information.
> 
> Thanks.  Committed 0001.  Will have to review the others when I'm less tired.

Thanks!

> BTW, elog() is only supposed to be used for can't happen error
> messages; when it is used, no translation is done.  So this is wrong:
> 
>     if (skip_validate)
>         elog(NOTICE, "skipping scan to validate partition constraint");

You're right.  I was using it for debugging when I first wrote that code,
and thought I would eventually turn it into ereport(INFO/NOTICE) for the
final submission, which I forgot to do.  Sorry about that.  Here's a patch.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Alvaro Herrera
Date:
Amit Langote wrote:

> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
> index 1c219b03dd..6a179596ce 100644
> --- a/src/backend/commands/tablecmds.c
> +++ b/src/backend/commands/tablecmds.c
> @@ -13297,8 +13297,10 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
>          }
>      }
>  
> +    /* It's safe to skip the validation scan after all */
>      if (skip_validate)
> -        elog(NOTICE, "skipping scan to validate partition constraint");
> +        ereport(INFO,
> +                (errmsg("skipping scan to validate partition constraint")));

Why not just remove the message altogether?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Dec 20, 2016 at 4:51 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Amit Langote wrote:
>
>> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
>> index 1c219b03dd..6a179596ce 100644
>> --- a/src/backend/commands/tablecmds.c
>> +++ b/src/backend/commands/tablecmds.c
>> @@ -13297,8 +13297,10 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
>>               }
>>       }
>>
>> +     /* It's safe to skip the validation scan after all */
>>       if (skip_validate)
>> -             elog(NOTICE, "skipping scan to validate partition constraint");
>> +             ereport(INFO,
>> +                             (errmsg("skipping scan to validate partition constraint")));
>
> Why not just remove the message altogether?

That's certainly an option.  It might be noise in some situations.  On
the other hand, it affects whether attaching the partition is O(1) or
O(n), so somebody might well want to know.  Or maybe they might be
more likely to want a message in the reverse situation, telling them
that the partition constraint DOES need to be validated.  I'm not sure
what the best user interface is here; thoughts welcome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Tue, Dec 20, 2016 at 4:51 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Amit Langote wrote:
> >
> >> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
> >> index 1c219b03dd..6a179596ce 100644
> >> --- a/src/backend/commands/tablecmds.c
> >> +++ b/src/backend/commands/tablecmds.c
> >> @@ -13297,8 +13297,10 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
> >>               }
> >>       }
> >>
> >> +     /* It's safe to skip the validation scan after all */
> >>       if (skip_validate)
> >> -             elog(NOTICE, "skipping scan to validate partition constraint");
> >> +             ereport(INFO,
> >> +                             (errmsg("skipping scan to validate partition constraint")));
> >
> > Why not just remove the message altogether?
> 
> That's certainly an option.  It might be noise in some situations.  On
> the other hand, it affects whether attaching the partition is O(1) or
> O(n), so somebody might well want to know.  Or maybe they might be
> more likely to want a message in the reverse situation, telling them
> that the partition constraint DOES need to be validated.  I'm not sure
> what the best user interface is here; thoughts welcome.

Even if we decide to keep the message, I think it's not very good
wording anyhow; as a translator I disliked it on sight.  Instead of
"skipping scan to validate" I would use "skipping validation scan",
except that it's not clear what it is we're validating.  Mentioning the
partition constraint in errcontext() doesn't look like a great solution,
but I can't think of anything better.

(We have the table_rewrite event trigger, for a very similar use case.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Even if we decide to keep the message, I think it's not very good
> wording anyhow; as a translator I disliked it on sight.  Instead of
> "skipping scan to validate" I would use "skipping validation scan",
> except that it's not clear what it is we're validating.  Mentioning
> partition constraint in errcontext() doesn't look like a great solution, but
> I can't think of anything better.

Maybe something like: partition constraint for table \"%s\" is implied
by existing constraints

> (We have the table_rewrite event trigger, for a very similar use case.)

Hmm, maybe we should see if we can safely support an event trigger here, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Even if we decide to keep the message, I think it's not very good
> > wording anyhow; as a translator I disliked it on sight.  Instead of
> > "skipping scan to validate" I would use "skipping validation scan",
> > except that it's not clear what it is we're validating.  Mentioning
> > partition constraint in errcontext() doesn't look like a great solution, but
> > I can't think of anything better.
> 
> Maybe something like: partition constraint for table \"%s\" is implied
> by existing constraints

Actually, shouldn't we emit a message if we *don't* skip the check?


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Mon, Dec 19, 2016 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Dec 18, 2016 at 10:00 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Here are updated patches including the additional information.
>
> Thanks.  Committed 0001.  Will have to review the others when I'm less tired.

0002. Can you add a test case for the bug fixed by this patch?

0003. Loses equalTupleDescs() check and various cases where
ExecOpenIndexes can be skipped.  Isn't that bad?  Also, "We locked all
the partitions above including the leaf partitions" should say "Caller
must have locked all the partitions including the leaf partitions".

0004. Unnecessary whitespace change in executor.h.  Still don't
understand why we need to hoist RelationGetPartitionQual() into the
caller.

0005. Can you add a test case for the bug fixed by this patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/21 1:53, Robert Haas wrote:
> On Mon, Dec 19, 2016 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Dec 18, 2016 at 10:00 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Here are updated patches including the additional information.
>>
>> Thanks.  Committed 0001.  Will have to review the others when I'm less tired.
> 
> 0002. Can you add a test case for the bug fixed by this patch?

Done (also see below).

> 0003. Loses equalTupleDescs() check and various cases where
> ExecOpenIndexes can be skipped.  Isn't that bad?

I realized that, as far as checking whether a tuple conversion mapping is
required, the checks performed at the beginning of convert_tuples_by_name()
(following the comment /* Verify compatibility and prepare attribute-number
map */) are enough.  If equalTupleDescs() returned false (which it always
does here, because the tdtypeid are never the same for the passed-in tuple
descriptors), we would be repeating some of the tests in
convert_tuples_by_name() anyway.

As for the checks performed for ExecOpenIndices(), it seems it would be
better to keep the following in place, so I added it back:

        if (leaf_part_rri->ri_RelationDesc->rd_rel->relhasindex &&
            leaf_part_rri->ri_IndexRelationDescs == NULL)
            ExecOpenIndices(leaf_part_rri, false);

> Also, "We locked all
> the partitions above including the leaf partitions" should say "Caller
> must have locked all the partitions including the leaf partitions".

No, the locking is done in RelationGetPartitionDispatchInfo(), which is
called by ExecSetupPartitionTupleRouting() itself.
ExecSetupPartitionTupleRouting() is where all the partitions are locked
for the first time.

> 0004. Unnecessary whitespace change in executor.h.  Still don't
> understand why we need to hoist RelationGetPartitionQual() into the
> caller.

We only need to check a result relation's (ri_RelationDesc's) partition
constraint if we are inserting into the result relation directly.  In case
of tuple-routing, we do not want to check the leaf partitions' partition
constraint, but if the direct target in that case is an internal
partition, we must check its partition constraint, which is same for all
leaf partitions in that sub-tree.  It wouldn't be wrong per se to check
each leaf partition's constraint in that case, which includes the target
partitioned table's constraint as well, but that would be inefficient due
to both having to retrieve the constraints and having ExecConstraints()
*unnecessarily* check it for every row.

If we keep doing RelationGetPartitionQual() in InitResultRelInfo()
controlled by a bool argument (load_partition_check), we cannot avoid the
above-mentioned inefficiency if we want to fix this bug.
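For illustration, a sketch of the scenario in question (hypothetical table names): when a tuple is inserted directly into an internal partition, only that partition's constraint needs checking, once; the leaf's own bound is enforced by tuple routing itself:

```sql
create table root (a int) partition by range (a);
create table mid partition of root for values from (1) to (100)
    partition by range (a);
create table leaf partition of mid for values from (1) to (50);

-- Direct insert into mid: must check mid's constraint (1 <= a < 100)
-- exactly once; the row is then routed to leaf by the bound check,
-- so re-checking mid's constraint per leaf row would be redundant.
insert into mid values (5);
```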

> 0005. Can you add a test case for the bug fixed by this patch?

Done, but...

Breaking changes into multiple commits/patches does not seem to work for
adding regression tests.  So, I've combined multiple patches into a single
patch which is now patch 0002 in the attached set of patches.  Its commit
message is very long now.  To show an example of the bugs that 0002 is
meant to fix:

create table p (a int, b int) partition by range (a, b);
create table p1 (b int, a int not null) partition by range (b);
create table p11 (like p1);
alter table p11 drop a;
alter table p11 add a int;
alter table p11 drop a;
alter table p11 add a int not null;

# select attrelid::regclass, attname, attnum
  from pg_attribute
  where attnum > 0
  and (attrelid = 'p'::regclass
    or attrelid = 'p1'::regclass
    or attrelid = 'p11'::regclass) and attname = 'a';
 attrelid | attname | attnum
----------+---------+--------
 p        | a       |      1
 p1       | a       |      2
 p11      | a       |      4
(3 rows)

alter table p1 attach partition p11 for values from (1) to (5);
alter table p attach partition p1 for values from (1, 1) to (1, 10);

-- the following is wrong
# insert into p11 (a, b) values (10, 4);
INSERT 0 1

-- wrong too (using the wrong TupleDesc after tuple routing)
# insert into p1 (a, b) values (10, 4);
ERROR:  null value in column "a" violates not-null constraint
DETAIL:  Failing row contains (4, null).

-- once we fix the wrong TupleDesc issue
# insert into p1 (a, b) values (10, 4);
INSERT 0 1

which is wrong because p1, as a partition of p, should not accept 10 for
a.  But its partition constraint is not being applied to the leaf
partition p11 into which the tuple is routed (the bug).

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Dec 21, 2016 at 5:33 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Breaking changes into multiple commits/patches does not seem to work for
> adding regression tests.  So, I've combined multiple patches into a single
> patch which is now patch 0002 in the attached set of patches.

Ugh, seriously?  It's fine to combine closely related bug fixes but
not all of these are.  I don't see why you can't add some regression
tests in one patch and then add some more in the next patch.

Meanwhile, committed the latest 0001 and the elog() patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/22 1:50, Robert Haas wrote:
> On Wed, Dec 21, 2016 at 5:33 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Breaking changes into multiple commits/patches does not seem to work for
>> adding regression tests.  So, I've combined multiple patches into a single
>> patch which is now patch 0002 in the attached set of patches.
> 
> Ugh, seriously?  It's fine to combine closely related bug fixes but
> not all of these are.  I don't see why you can't add some regression
> tests in one patch and then add some more in the next patch.

I managed to do that this time around with the attached set of patches.
Guess I gave up too easily in the previous attempt.

While working on that, I discovered yet another bug having to do with the
tuple descriptor that's used as we route a tuple down a partition tree.  If
the attnums of the given key attribute(s) differ across levels, it would be
incorrect to use the original slot's (the one passed by ExecInsert())
tuple descriptor to inspect the original slot's heap tuple as we go down
the tree.  That might cause a spurious "partition not found" error at some
level, due to looking at the wrong field of the input tuple (the root
table's attnums are not always the same as those of the other partitioned
tables in the tree).  Patch 0001 fixes that, including a test.
It also addresses the problem I mentioned previously that, once
tuple routing is done, we failed to switch to a slot with the leaf
partition's tupdesc (IOW, we continued to use the original slot with the
root table's tupdesc, causing spurious failures due to differences in
attnums between the leaf partition and the root table).

Further patches 0002, 0003 and 0004 fix bugs that I sent one-big-patch for
in my previous message.  Each patch has a test for the bug it's meant to fix.

Patch 0005 is the same old "Add some more tests for tuple-routing" per [1]:

> Meanwhile, committed the latest 0001 and the elog() patch.

Thanks!

Regards,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZ86v1G%2Bzx9etMiSQaBBvYMKfU-iitqZArSh5z0n8Q4cA%40mail.gmail.com


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Dec 22, 2016 at 3:35 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> While working on that, I discovered yet-another-bug having to do with the
> tuple descriptor that's used as we route a tuple down a partition tree. If
> attnums of given key attribute(s) are different on different levels, it
> would be incorrect to use the original slot's (one passed by ExecInsert())
> tuple descriptor to inspect the original slot's heap tuple, as we go down
> the tree.  It might cause spurious "partition not found" at some level due
> to looking at incorrect field in the input tuple because of using the
> wrong tuple descriptor (root table's attnums not always same as other
> partitioned tables in the tree).  Patch 0001 fixes that including a test.

I committed this, but I'm a bit uncomfortable with it: should the
TupleTableSlot be part of the ModifyTableState rather than the EState?

> It also addresses the problem I mentioned previously that once
> tuple-routing is done, we failed to switch to a slot with the leaf
> partition's tupdesc (IOW, continued to use the original slot with root
> table's tupdesc causing spurious failures due to differences in attnums
> between the leaf partition and the root table).
>
> Further patches 0002, 0003 and 0004 fix bugs that I sent one-big-patch for
> in my previous message.  Each patch has a test for the bug it's meant to fix.

Regarding 0002, I think that this is kind of a strange fix.  Wouldn't
it be better to get hold of the original tuple instead of reversing
the conversion?  And what of the idea of avoiding the conversion in
the (probably very common) case where we can?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Sorry about the delay in replying.

On 2016/12/23 8:08, Robert Haas wrote:
> On Thu, Dec 22, 2016 at 3:35 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> While working on that, I discovered yet-another-bug having to do with the
>> tuple descriptor that's used as we route a tuple down a partition tree. If
>> attnums of given key attribute(s) are different on different levels, it
>> would be incorrect to use the original slot's (one passed by ExecInsert())
>> tuple descriptor to inspect the original slot's heap tuple, as we go down
>> the tree.  It might cause spurious "partition not found" at some level due
>> to looking at incorrect field in the input tuple because of using the
>> wrong tuple descriptor (root table's attnums not always same as other
>> partitioned tables in the tree).  Patch 0001 fixes that including a test.
> 
> I committed this, but I'm a bit uncomfortable with it: should the
> TupleTableSlot be part of the ModifyTableState rather than the EState?

Done that way in 0001 of the attached patches.  So, instead of making the
standalone partition_tuple_slot a field of EState (with the actual
TupleTableSlot in its tupleTable), it is now allocated within
ModifyTableState and CopyState, and released when ModifyTable node or
CopyFrom finishes, respectively.

>> It also addresses the problem I mentioned previously that once
>> tuple-routing is done, we failed to switch to a slot with the leaf
>> partition's tupdesc (IOW, continued to use the original slot with root
>> table's tupdesc causing spurious failures due to differences in attnums
>> between the leaf partition and the root table).
>>
>> Further patches 0002, 0003 and 0004 fix bugs that I sent one-big-patch for
>> in my previous message.  Each patch has a test for the bug it's meant to fix.
> 
> Regarding 0002, I think that this is kind of a strange fix.  Wouldn't
> it be better to get hold of the original tuple instead of reversing
> the conversion?  And what of the idea of avoiding the conversion in
> the (probably very common) case where we can?

To get hold of the original tuple, how about adding an argument orig_slot
to ExecConstraints()?  I've implemented that approach in the new 0002.

Regarding the possibility of avoiding the conversion in very common cases,
I think that could be done considering the following: if the mapping from
the attribute numbers of the parent table to those of a child table is an
identity map, we don't need to convert tuples.  Currently however,
convert_tuples_by_name() also requires the tdtypeid of the input and output
TupleDescs to be equal.  The reason cited for that is that we may fail to
"inject the right OID into the tuple datum" if the types don't match.  In
the case of partitioning, hasoid status must match between the parent and
its partitions at all times, so the aforementioned condition is satisfied
without requiring that the tdtypeid be the same.  And the oid column (if
present) is always located at a fixed position in the HeapTuple, so it need
not be mapped.

Based on the above argument, patch 0006 teaches convert_tuples_by_name()
to *optionally* not require the tdtypeid of the input and output tuple
descriptors to be equal.  It's implemented by introducing a new argument to
convert_tuples_by_name() named 'consider_typeid'.  We pass 'false' only
for the partitioning cases.


(Perhaps, the following should be its own new thread)

I noticed that ExecProcessReturning() doesn't work properly after tuple
routing (example shows how returning tableoid currently fails but I
mention some other issues below):

create table p (a int, b int) partition by range (a);
create table p1 partition of p for values from (1) to (10);
insert into p values (1) returning tableoid::regclass, *;
 tableoid | a | b
----------+---+---
 -        | 1 |
(1 row)

INSERT 0 1

I tried to fix that in 0007 to get:

insert into p values (1) returning tableoid::regclass, *;
 tableoid | a | b
----------+---+---
 p        | 1 |
(1 row)

INSERT 0 1

But I think it *may* be wrong to return the root table OID for tuples
inserted into leaf partitions, because with select we get partition OIDs:

select tableoid::regclass, * from p;
 tableoid | a | b
----------+---+---
 p1       | 1 |
(1 row)

If so, that means we should build the projection info (corresponding to
the returning list) for each target partition somehow.  ISTM, that's going
to have to be done within the planner by appropriate inheritance
translation of the original returning targetlist.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/26 19:46, Amit Langote wrote:
> (Perhaps, the following should be its own new thread)
> 
> I noticed that ExecProcessReturning() doesn't work properly after tuple
> routing (example shows how returning tableoid currently fails but I
> mention some other issues below):
> 
> create table p (a int, b int) partition by range (a);
> create table p1 partition of p for values from (1) to (10);
> insert into p values (1) returning tableoid::regclass, *;
>  tableoid | a | b
> ----------+---+---
>  -        | 1 |
> (1 row)
> 
> INSERT 0 1
> 
> I tried to fix that in 0007 to get:
> 
> insert into p values (1) returning tableoid::regclass, *;
>  tableoid | a | b
> ----------+---+---
>  p        | 1 |
> (1 row)
> 
> INSERT 0 1
> 
> But I think it *may* be wrong to return the root table OID for tuples
> inserted into leaf partitions, because with select we get partition OIDs:
> 
> select tableoid::regclass, * from p;
>  tableoid | a | b
> ----------+---+---
>  p1       | 1 |
> (1 row)
> 
> If so, that means we should build the projection info (corresponding to
> the returning list) for each target partition somehow.  ISTM, that's going
> to have to be done within the planner by appropriate inheritance
> translation of the original returning targetlist.

Turns out getting the 2nd result may not require planner tweaks after all.
Unless I'm missing something, translation of the varattnos of the RETURNING
target list can be done as late as ExecInitModifyTable() for the insert
case, unlike update/delete (which do require the planner's attention).

I updated patch 0007 to implement that, including the test.  While doing
so, I realized that map_partition_varattnos introduced in 0003 is rather
restrictive in its applicability, because it assumes varno = 1 for the
expressions it accepts as input for the mapping.  Mapping the RETURNING
(target) list required modifying map_partition_varattnos to accept
target_varno as an additional argument.  That way, we can map arbitrary
expressions, not just partition constraints, from the parent's attribute
numbers to a partition's attribute numbers.

Patches 0001 to 0006 unchanged.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:
Hi Amit,

I have pulled the latest sources from git and tried to create a multi-level partition setup, and I am getting a server crash.  Below are steps to reproduce; please check whether it is reproducible on your machine as well.

postgres=# CREATE TABLE test_ml (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE
postgres=# CREATE TABLE test_ml_p1 PARTITION OF test_ml FOR VALUES FROM (0) TO (250) PARTITION BY RANGE (b);
CREATE TABLE
postgres=# CREATE TABLE test_ml_p1_p1 PARTITION OF test_ml_p1 FOR VALUES FROM (0) TO (100);
CREATE TABLE
postgres=# CREATE TABLE test_ml_p1_p2 PARTITION OF test_ml_p1 FOR VALUES FROM (100) TO (250);
CREATE TABLE
postgres=# CREATE TABLE test_ml_p2 PARTITION OF test_ml FOR VALUES FROM (250) TO (500) PARTITION BY RANGE (c);
CREATE TABLE
postgres=# CREATE TABLE test_ml_p2_p1 PARTITION OF test_ml_p2 FOR VALUES FROM ('0250') TO ('0400');
CREATE TABLE
postgres=# CREATE TABLE test_ml_p2_p2 PARTITION OF test_ml_p2 FOR VALUES FROM ('0400') TO ('0500');
CREATE TABLE
postgres=# CREATE TABLE test_ml_p3 PARTITION OF test_ml FOR VALUES FROM (500) TO (600) PARTITION BY RANGE ((b + a));
CREATE TABLE
postgres=# CREATE TABLE test_ml_p3_p1 PARTITION OF test_ml_p3 FOR VALUES FROM (1000) TO (1100);
CREATE TABLE
postgres=# CREATE TABLE test_ml_p3_p2 PARTITION OF test_ml_p3 FOR VALUES FROM (1100) TO (1200);
CREATE TABLE
postgres=# INSERT INTO test_ml SELECT i, i, to_char(i, 'FM0000') FROM generate_series(0, 599, 2) i;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: [HACKERS] Declarative partitioning - another take

From
高增琦
Date:
Hi ,

I tried "COPY FROM" in the git version. It inserts rows into the wrong partition.

Steps to reproduce:
create table t(a int, b int) partition by range(a);
create table t_p1 partition of t for values from (1) to (100);
create table t_p2 partition of t for values from (100) to (200);
create table t_p3 partition of t for values from (200) to (300);
insert into t values(1,1);
insert into t values(101,101);
insert into t values(201,201);
copy (select * from t) to '/tmp/test2.txt';
copy t from '/tmp/test2.txt';
select * from t_p1;

result:
postgres=# select * from t_p1;
  a  |  b
-----+-----
   1 |   1
   1 |   1
 101 | 101
 201 | 201
(4 rows)

I think the argument "BulkInsertState" used in CopyFrom/heap_insert
is related to this problem. Please check it.

Thanks.






Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/27 18:30, Rajkumar Raghuwanshi wrote:
> Hi Amit,
> 
> I have pulled latest sources from git and tried to create multi-level
> partition,  getting a server crash, below are steps to reproduce. please
> check if it is reproducible in your machine also.
> 

[ ... ]

> postgres=# INSERT INTO test_ml SELECT i, i, to_char(i, 'FM0000') FROM
> generate_series(0, 599, 2) i;
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Thanks for the example.  Looks like there was an oversight in my patch
that got committed as 2ac3ef7a01 [1].

Attached patch should fix the same.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ac3ef7a01df859c62d0a02333b646d65eaec5ff


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/27 18:48, 高增琦 wrote:
> Hi ,
> 
> I tried "COPY FROM"  in the git version. It inserts rows to wrong partition.
> 
> step to reproduce:
> create table t(a int, b int) partition by range(a);
> create table t_p1 partition of t for values from (1) to (100);
> create table t_p2 partition of t for values from (100) to (200);
> create table t_p3 partition of t for values from (200) to (300);
> insert into t values(1,1);
> insert into t values(101,101);
> insert into t values(201,201);
> copy (select * from t) to '/tmp/test2.txt';
> copy t from '/tmp/test2.txt';
> select * from t_p1;
> 
> result:
> postgres=# select * from t_p1;
>   a  |  b
> -----+-----
>    1 |   1
>    1 |   1
>  101 | 101
>  201 | 201
> (4 rows)
> 
> I think the argument "BulkInsertState" used in CopyFrom/heap_insert
> is related to this problem. Please check it.

You're quite right.  Attached should fix that.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/27 19:07, Amit Langote wrote:
> Attached should fix that.

Here are the last two patches with additional information like other
patches.  Forgot to do that yesterday.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:
On Tue, Dec 27, 2016 at 3:24 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
[ ... ]

Attached patch should fix the same.

Hi Amit,

I have applied the attached patch; the server crash for range partitioning is fixed, but I am still getting a crash for a multi-level list partitioning insert.

postgres=# CREATE TABLE test_ml_l (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE
postgres=# CREATE TABLE test_ml_l_p1 PARTITION OF test_ml_l FOR VALUES IN ('0000', '0003', '0004', '0010') PARTITION BY LIST (c);
CREATE TABLE
postgres=# CREATE TABLE test_ml_l_p1_p1 PARTITION OF test_ml_l_p1 FOR VALUES IN ('0000', '0003');
CREATE TABLE
postgres=# CREATE TABLE test_ml_l_p1_p2 PARTITION OF test_ml_l_p1 FOR VALUES IN ('0004', '0010');
CREATE TABLE
postgres=# CREATE TABLE test_ml_l_p2 PARTITION OF test_ml_l FOR VALUES IN ('0001', '0005', '0002', '0009') PARTITION BY LIST (c);
CREATE TABLE
postgres=# CREATE TABLE test_ml_l_p2_p1 PARTITION OF test_ml_l_p2 FOR VALUES IN ('0001', '0005');
CREATE TABLE
postgres=# CREATE TABLE test_ml_l_p2_p2 PARTITION OF test_ml_l_p2 FOR VALUES IN ('0002', '0009');
CREATE TABLE
postgres=# CREATE TABLE test_ml_l_p3 PARTITION OF test_ml_l FOR VALUES IN ('0006', '0007', '0008', '0011') PARTITION BY LIST (ltrim(c,'A'));
CREATE TABLE
postgres=# CREATE TABLE test_ml_l_p3_p1 PARTITION OF test_ml_l_p3 FOR VALUES IN ('0006', '0007');
CREATE TABLE
postgres=# CREATE TABLE test_ml_l_p3_p2 PARTITION OF test_ml_l_p3 FOR VALUES IN ('0008', '0011');
CREATE TABLE
postgres=# INSERT INTO test_ml_l SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(0, 599, 2) i;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/03 19:04, Rajkumar Raghuwanshi wrote:
> On Tue, Dec 27, 2016 at 3:24 PM, Amit Langote wrote:
>>
>> Attached patch should fix the same.
> 
> I have applied attached patch, server crash for range is fixed, but still
> getting crash for multi-level list partitioning insert.
> 
> postgres=# CREATE TABLE test_ml_l (a int, b int, c varchar) PARTITION BY
> LIST(c);

[ ... ]

> postgres=# INSERT INTO test_ml_l SELECT i, i, to_char(i/50, 'FM0000') FROM
> generate_series(0, 599, 2) i;
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Hm, that's odd.  I tried your new example, but didn't get the crash.

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:
On Wed, Jan 4, 2017 at 10:37 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/01/03 19:04, Rajkumar Raghuwanshi wrote:
> On Tue, Dec 27, 2016 at 3:24 PM, Amit Langote wrote:
>>
>> Attached patch should fix the same.
>
> I have applied attached patch, server crash for range is fixed, but still
> getting crash for multi-level list partitioning insert.
>
> postgres=# CREATE TABLE test_ml_l (a int, b int, c varchar) PARTITION BY
> LIST(c);

[ ... ]

> postgres=# INSERT INTO test_ml_l SELECT i, i, to_char(i/50, 'FM0000') FROM
> generate_series(0, 599, 2) i;
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Hm, that's odd.  I tried your new example, but didn't get the crash.

Thanks,
Amit

Thanks. I have pulled the latest sources from git and then applied the patch "fix-wrong-ecxt_scantuple-crash.patch". I am not getting the crash now; maybe I missed something last time.

Re: [HACKERS] Declarative partitioning - another take

From
高增琦
Date:
Server crash (failed assertion) when there are two "insert"s in one SQL statement:

Steps to reproduce:
create table t(a int, b int) partition by range(a);
create table t_p1 partition of t for values from (1) to (100);
create table t_p2 partition of t for values from (100) to (200);
create table t_p3 partition of t for values from (200) to (300);

create table b(a int, b int);
with a(a,b) as(insert into t values(3, 3) returning a, b) insert into b select * from a;

Please check it.


Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/04 16:31, 高增琦 wrote:
> Server crash(failed assertion) when two "insert" in one SQL:
> 
> Step to reproduce:
> create table t(a int, b int) partition by range(a);
> create table t_p1 partition of t for values from (1) to (100);
> create table t_p2 partition of t for values from (100) to (200);
> create table t_p3 partition of t for values from (200) to (300);
> 
> create table b(a int, b int);
> with a(a,b) as(insert into t values(3, 3) returning a, b) insert into b
> select * from a;
> 
> Please check it.

Thanks for testing!  This should be fixed by a patch I posted earlier (Try
the patch 0001 of the patches posted at [1]).  Robert did express his
concern [2] about the approach used in my patch that was committed as
2ac3ef7a01 [3]; your test demonstrates that it wasn't a good approach
after all.

Regards,
Amit

[1]
https://www.postgresql.org/message-id/f6f3a214-5bb5-aa8c-f82c-c720348cf086%40lab.ntt.co.jp

[2]
https://www.postgresql.org/message-id/CA%2BTgmoYUfs8peo-p%2BStw7afTdXqNWv_S4dx_6AWc-Y_ZrGWZbQ%40mail.gmail.com

[3]
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commit&h=2ac3ef7a01df859c62d0a02333b646d65eaec5ff





Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Mon, Dec 26, 2016 at 5:46 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/12/23 8:08, Robert Haas wrote:
>> On Thu, Dec 22, 2016 at 3:35 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> While working on that, I discovered yet-another-bug having to do with the
>>> tuple descriptor that's used as we route a tuple down a partition tree. If
>>> attnums of given key attribute(s) are different on different levels, it
>>> would be incorrect to use the original slot's (one passed by ExecInsert())
>>> tuple descriptor to inspect the original slot's heap tuple, as we go down
>>> the tree.  It might cause spurious "partition not found" at some level due
>>> to looking at incorrect field in the input tuple because of using the
>>> wrong tuple descriptor (root table's attnums not always same as other
>>> partitioned tables in the tree).  Patch 0001 fixes that including a test.
>>
>> I committed this, but I'm a bit uncomfortable with it: should the
>> TupleTableSlot be part of the ModifyTableState rather than the EState?
>
> Done that way in 0001 of the attached patches.  So, instead of making the
> standalone partition_tuple_slot a field of EState (with the actual
> TupleTableSlot in its tupleTable), it is now allocated within
> ModifyTableState and CopyState, and released when ModifyTable node or
> CopyFrom finishes, respectively.

I dropped some comments from this and committed it.  They were
formatted in a way that wouldn't survive pgindent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Dec 27, 2016 at 8:41 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/12/27 19:07, Amit Langote wrote:
>> Attached should fix that.
>
> Here are the last two patches with additional information like other
> patches.  Forgot to do that yesterday.

0001 has the disadvantage that get_partition_for_tuple() acquires a
side effect.  That seems undesirable.  At the least, it needs to be
documented in the function's header comment.

It's unclear to me why we need to do 0002.  It doesn't seem like it
should be necessary, it doesn't seem like a good idea, and the commit
message you proposed is uninformative.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Dec 27, 2016 at 3:59 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Patches 0001 to 0006 unchanged.

Committed 0001 earlier, as mentioned in a separate email.  Committed
0002 and part of 0003.  But I'm skeptical that the as-patched-by-0003
logic in generate_partition_qual() makes sense.  You do this:
        result = list_concat(generate_partition_qual(parent),
                             copyObject(rel->rd_partcheck));

        /* Mark Vars with correct attnos */
        result = map_partition_varattnos(result, rel, parent);

But that has the effect of applying map_partition_varattnos to
everything in rel->rd_partcheck in addition to applying it to
everything returned by generate_partition_qual() on the parent, which
doesn't seem right.

Also, don't we want to do map_partition_varattnos() just ONCE, rather
than on every call to this function?  I think maybe your concern is
that the parent might be changed without a relcache flush on the
child, but I don't quite see how that could happen.  If the parent's
tuple descriptor changes, surely the child's tuple descriptor would
have to be altered at the same time...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2016/12/14 12:14, Venkata B Nagothi wrote:
> Loading the data into a normal table is not an issue (infact the csv is
> generated from the table itself)
> 
> The issue is occurring only when i am trying to load the data from CSV file
> into a partitioned table -
> 
> db01=# CREATE TABLE orders_y1992
>     PARTITION OF orders2 FOR VALUES FROM ('1992-01-01') TO ('1992-12-31');
> CREATE TABLE
> db01=# copy orders2 from '/data/orders-1993.csv' delimiter '|';
> ERROR:  could not read block 6060 in file "base/16384/16407": read only 0
> of 8192 bytes
> CONTEXT:  COPY orders2, line 376589:
> "9876391|374509|O|54847|1997-07-16|3-MEDIUM       |Clerk#000001993|0|ithely
> regular pack"
> 
> Not sure why COPY is failing.

I think I've been able to reproduce this issue and suspect that it's a
bug.  I tried to solve it in response to another related report [1], where
it was apparent that the cause was related to how the bulk-insert mode in
the COPY FROM code is not handled correctly for a partitioned table.  My
proposed solution [2] was to disable bulk-insert mode completely for
partitioned tables.  But that may not be desirable performance-wise (for
example, COPY FROM on partitioned tables would have the same performance as
INSERT, whereas in the case of regular tables, COPY FROM is much faster than
INSERT due to the bulk-insert mode).

I will propose another solution for the same.  Meanwhile, could you please
try your test again with the patch posted at [2], although it will not
likely be committed as the fix for this issue.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFmBtr32FDOqofo8yG-4mjzL1HnYHxXK5S9OGFJ%3D%3DcJpgEW4vA%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/101e2c2d-45d6-fb1a-468c-d3f67572a2f3%40lab.ntt.co.jp





Re: [HACKERS] Declarative partitioning - another take

From
Keith Fiske
Date:

Could we get some clarification on the partition_bound_spec portion of the PARTITION OF clause? Just doing some testing, it seems it's inclusive of the FROM value but exclusive of the TO value. I don't see mention of this in the docs as of commit 18fc5192a631441a73e6a3b911ecb14765140389 yesterday. The docs do mention that the values aren't allowed to overlap, but looking at the schema below, without clarification of which side is inclusive/exclusive it seems confusing, because 2016-08-01 appears in both. Even the child table's display does not clarify this. Not sure if there's a way to show this in the \d+ display, which would be ideal, but it should at least be mentioned in the docs.

keith@keith=# \d+ measurement
                                 Table "public.measurement"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 logdate   | date    |           | not null |         | plain   |              |
 peaktemp  | integer |           |          | 1       | plain   |              |
 unitsales | integer |           |          |         | plain   |              |
Partition key: RANGE (logdate)
Check constraints:
    "measurement_peaktemp_check" CHECK (peaktemp > 0)
Partitions: measurement_y2016m07 FOR VALUES FROM ('2016-07-01') TO ('2016-08-01'),
            measurement_y2016m08 FOR VALUES FROM ('2016-08-01') TO ('2016-09-01')

keith@keith=# \d+ measurement_y2016m07
                             Table "public.measurement_y2016m07"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 logdate   | date    |           | not null |         | plain   |              |
 peaktemp  | integer |           |          | 1       | plain   |              |
 unitsales | integer |           |          | 0       | plain   |              |
Partition of: measurement FOR VALUES FROM ('2016-07-01') TO ('2016-08-01')
Check constraints:
    "measurement_peaktemp_check" CHECK (peaktemp > 0)


keith@keith=# insert into measurement (logdate) values ('2016-08-01');
INSERT 0 1
Time: 2.848 ms

keith@keith=# select * from measurement_y2016m07;
 logdate | peaktemp | unitsales
---------+----------+-----------
(0 rows)

Time: 0.273 ms
keith@keith=# select * from measurement_y2016m08;
  logdate   | peaktemp | unitsales
------------+----------+-----------
 2016-08-01 |        1 |    «NULL»
(1 row)

Time: 0.272 ms

keith@keith=# drop table measurement_y2016m08;
DROP TABLE
Time: 5.919 ms
keith@keith=# select * from only measurement;
 logdate | peaktemp | unitsales
---------+----------+-----------
(0 rows)

Time: 0.307 ms
keith@keith=# insert into measurement (logdate) values ('2016-08-01');
ERROR:  no partition of relation "measurement" found for row
DETAIL:  Failing row contains (2016-08-01, 1, null).
Time: 0.622 ms


Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Keith,

On 2017/01/06 2:16, Keith Fiske wrote:
> Could we get some clarification on the partition_bound_spec portion of the
> PARTITION OF clause? Just doing some testing it seems it's inclusive of the
> FROM value but exclusive of the TO value. I don't see mention of this in
> the docs as of commit 18fc5192a631441a73e6a3b911ecb14765140389 yesterday.
> It does mention that the values aren't allowed to overlap, but looking at
> the schema below, without the clarification of which side is
> inclusive/exclusive it seems confusing because 2016-08-01 is in both. Even
> the child table does not clarify this. Not sure if there's a way to do this
> in the \d+ display which would be ideal, but it should at least be
> mentioned in the docs.

I agree that needs highlighting.  I'm planning to write a doc patch for
that (among other documentation improvements).

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/05 3:26, Robert Haas wrote:
> On Tue, Dec 27, 2016 at 8:41 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2016/12/27 19:07, Amit Langote wrote:
>>> Attached should fix that.
>>
>> Here are the last two patches with additional information like other
>> patches.  Forgot to do that yesterday.
> 
> 0001 has the disadvantage that get_partition_for_tuple() acquires a
> side effect.  That seems undesirable.  At the least, it needs to be
> documented in the function's header comment.

That's true.  How about we save away the original ecxt_scantuple at entry
and restore the same just before returning from the function?  That way
there would be no side effect.  0001 implements that.

> It's unclear to me why we need to do 0002.  It doesn't seem like it
> should be necessary, it doesn't seem like a good idea, and the commit
> message you proposed is uninformative.

If a single BulkInsertState object is passed to
heap_insert()/heap_multi_insert() for different heaps corresponding to
different partitions (from one input tuple to the next), tuples might end up
going into the wrong heaps (as demonstrated in one of the reports [1]).  A
simple solution is to disable bulk-insert in the case of partitioned tables.

But my patch (or its motivations) was slightly wrongheaded, wherein I
conflated multi-insert stuff and bulk-insert considerations.  I revised
0002 to not do that.

However if we disable bulk-insert mode, COPY's purported performance
benefit compared with INSERT is naught.  Patch 0003 is a proposal to
implement bulk-insert mode even for partitioned tables.  Basically,
allocate separate BulkInsertState objects for each partition and switch to
the appropriate one just before calling heap_insert()/heap_multi_insert().
Then, to be able to use heap_multi_insert(), we must also manage buffered
tuples separately for each partition.  However, I didn't modify the limits
on the number and/or total size of buffered tuples that control when we
pause buffering and do heap_multi_insert() on the buffered tuples.  Maybe it
should work slightly differently in the partitioned-table case; for example,
the overall limits on both the number of tuples and the tuple size could be
increased (I observed that increasing them 10x or 100x helped to some
degree).  Thoughts on this?

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFmBtr32FDOqofo8yG-4mjzL1HnYHxXK5S9OGFJ%3D%3DcJpgEW4vA%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
amul sul
Date:
Hi,

I got a server crash due to an assertion failure when ATTACHing an
overlapping range partition; here is a test case to reproduce it:

CREATE TABLE test_parent(a int) PARTITION BY RANGE (a);
CREATE TABLE test_parent_part2 PARTITION OF test_parent FOR VALUES
FROM(100) TO(200);
CREATE TABLE test_parent_part1(a int NOT NULL);
ALTER TABLE test_parent ATTACH PARTITION test_parent_part1 FOR VALUES
FROM(1) TO(200);

I think this bug exists in the following code in check_new_partition_bound():

 767                         if (equal || off1 != off2)
 768                         {
 769                             overlap = true;
 770                             with = boundinfo->indexes[off2 + 1];
 771                         }

When equal is true, the array index should not be 'off2 + 1'.

While reading the code related to this, I wondered why
partition_bound_bsearch does not return immediately when cmpval == 0?

Apologies if this has already been reported.

Regards,
Amul



Re: [HACKERS] Declarative partitioning - another take

From
Keith Fiske
Date:
Is there any reason for the exclusion of parent tables from the pg_tables system catalog view? They do not show up in information_schema.tables either. I believe I found where to make the changes, and I tested to make sure it works for my simple case. Attached is my first attempt at patching anything in core. Not sure if there's anywhere else this would need to be fixed.

--
Keith Fiske
Database Administrator
OmniTI Computer Consulting, Inc.
http://www.keithf4.com

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Amul,

On 2017/01/09 17:29, amul sul wrote:
> I got server crash due to assert failure at ATTACHing overlap rang
> partition, here is test case to reproduce this:
> 
> CREATE TABLE test_parent(a int) PARTITION BY RANGE (a);
> CREATE TABLE test_parent_part2 PARTITION OF test_parent FOR VALUES
> FROM(100) TO(200);
> CREATE TABLE test_parent_part1(a int NOT NULL);
> ALTER TABLE test_parent ATTACH PARTITION test_parent_part1 FOR VALUES
> FROM(1) TO(200);
> 
> I think, this bug exists in the following code of check_new_partition_bound():
> 
>  767                         if (equal || off1 != off2)
>  768                         {
>  769                             overlap = true;
>  770                             with = boundinfo->indexes[off2 + 1];
>  771                         }
> 
> When equal is true array index should not be 'off2 + 1'.

Good catch.  The attached patch should fix that.  I observed a crash with
the following command as well:

ALTER TABLE test_parent ATTACH PARTITION test_parent_part1 FOR VALUES FROM
(1) TO (300);

That's because there is one more case when the array index shouldn't be
off2 + 1 - the case where the bound at off2 is an upper bound (I'd wrongly
assumed that it's always a lower bound).  Anyway, I rewrote the
surrounding comments to clarify the logic a bit.

> While reading code related to this, I wondered why
> partition_bound_bsearch is not immediately returns when cmpval==0?

partition_bound_bsearch() is meant to return the *greatest* index of the
bound less than or equal to the input bound ("probe").  But it seems to me
now that we would always return the first index at which we get 0 for
cmpval, albeit after wasting cycles to try to find even greater index.
Because we don't have duplicates in the datums array, once we encounter a
bound that is equal to probe, we are only going to find bounds that are
*greater than* probe if we continue looking right, only to turn back again
to return the equal index (which is wasted cycles in invoking the
partition key comparison function(s)).  So, it perhaps makes sense to do
this per your suggestion:

@@ -1988,8 +2018,11 @@ partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
         if (cmpval <= 0)
         {
             lo = mid;
             *is_equal = (cmpval == 0);
+
+            if (*is_equal)
+                break;
         }

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Keith,

On 2017/01/10 14:44, Keith Fiske wrote:
> Is there any reason for the exclusion of parent tables from the pg_tables
> system catalog view? They do not show up in information_schema.tables as
> well. I believe I found where to make the changes and I tested to make sure
> it works for my simple case. Attached is my first attempt at patching
> anything in core. Not sure if there's anywhere else this would need to be
> fixed.

That's an oversight.  The original partitioning patch didn't touch
information_schema.sql and system_views.sql at all.  I added the relkind =
'P' check in some other views as well, including what your patch considered.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/05 5:50, Robert Haas wrote:
> On Tue, Dec 27, 2016 at 3:59 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Patches 0001 to 0006 unchanged.
>
> Committed 0001 earlier, as mentioned in a separate email.  Committed
> 0002 and part of 0003.

Thanks!  I realized however that the approach I used in 0002 of passing
the original slot to ExecConstraints() fails in certain situations.  For
example, if a BR trigger changes the tuple, the original slot would not
receive those changes, so it will be wrong to use such a tuple anymore.
In attached 0001, I switched back to the approach of converting the
partition-tupdesc-based tuple back to the root partitioned table's format.
 The converted tuple contains the changes by BR triggers, if any.  Sorry
about some unnecessary work.

> But I'm skeptical that the as-patched-by-0003
> logic in generate_partition_qual() makes sense.  You do this:
>
>         result = list_concat(generate_partition_qual(parent),
>                              copyObject(rel->rd_partcheck));
>
>         /* Mark Vars with correct attnos */
>         result = map_partition_varattnos(result, rel, parent);
>
> But that has the effect of applying map_partition_varattnos to
> everything in rel->rd_partcheck in addition to applying it to
> everything returned by generate_partition_qual() on the parent, which
> doesn't seem right.

I've replaced this portion of the code with (as also mentioned below):

    /* Quick copy */
    if (rel->rd_partcheck != NIL)
        return copyObject(rel->rd_partcheck);

Down below (for the case when the partition qual is not cached), we now do
this:

    my_qual = get_qual_from_partbound(rel, parent, bound);

    /* Add the parent's quals to the list (if any) */
    if (parent->rd_rel->relispartition)
        result = list_concat(generate_partition_qual(parent), my_qual);
    else
        result = my_qual;

    /*
     * Change Vars to have partition's attnos instead of the parent's.
     * We do this after we concatenate the parent's quals, because
     * we want every Var in it to bear this relation's attnos.
     */
    result = map_partition_varattnos(result, rel, parent);

Which is then cached wholly in rd_partcheck.

As for your concern whether it's correct to do so, consider that
generate_partition_qual() on the parent returns a qual whose Vars bear
the parent's attnos (which is fine as far as the parent, as a partition
itself, is concerned).  To apply the qual to the current relation as a
partition, we must change the Vars to have this relation's attnos.

> Also, don't we want to do map_partition_varattnos() just ONCE, rather
> than on every call to this function?  I think maybe your concern is
> that the parent might be changed without a relcache flush on the
> child, but I don't quite see how that could happen.  If the parent's
> tuple descriptor changes, surely the child's tuple descriptor would
> have to be altered at the same time...

Makes sense.  I fixed so that we return copyObject(rel->rd_partcheck), if
it's non-NIL, instead of generating parent's qual and doing the mapping
again.  For some reason, I thought we couldn't save the mapped version in
the relcache.

By the way, in addition to the previously mentioned bug of RETURNING, I
found that WITH CHECK OPTION didn't work correctly as well.  In fact
automatically updatable views failed to consider partitioned tables at
all.  Patch 0007 fixes that.

Thanks,
Amit


Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/06 20:23, Amit Langote wrote:
> On 2017/01/05 3:26, Robert Haas wrote:
>> It's unclear to me why we need to do 0002.  It doesn't seem like it
>> should be necessary, it doesn't seem like a good idea, and the commit
>> message you proposed is uninformative.
>
> If a single BulkInsertState object is passed to
> heap_insert()/heap_multi_insert() for different heaps corresponding to
> different partitions (from one input tuple to next), tuples might end up
> going into wrong heaps (like demonstrated in one of the reports [1]).  A
> simple solution is to disable bulk-insert in case of partitioned tables.
>
> But my patch (or its motivations) was slightly wrongheaded, wherein I
> conflated multi-insert stuff and bulk-insert considerations.  I revised
> 0002 to not do that.

Ragnar Ouchterlony pointed out [1] on pgsql-bugs that 0002 wasn't correct.
Attaching updated 0002 along with rebased 0001 and 0003.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/732dfc84-25f5-413c-1eee-0bfa7a370093%40agama.tv


Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/01/05 5:50, Robert Haas wrote:
>> On Tue, Dec 27, 2016 at 3:59 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Patches 0001 to 0006 unchanged.
>>
>> Committed 0001 earlier, as mentioned in a separate email.  Committed
>> 0002 and part of 0003.
>
> Thanks!  I realized however that the approach I used in 0002 of passing
> the original slot to ExecConstraints() fails in certain situations.  For
> example, if a BR trigger changes the tuple, the original slot would not
> receive those changes, so it will be wrong to use such a tuple anymore.
> In attached 0001, I switched back to the approach of converting the
> partition-tupdesc-based tuple back to the root partitioned table's format.
>  The converted tuple contains the changes by BR triggers, if any.  Sorry
> about some unnecessary work.

Hmm.  Even with this patch, I wonder if this is really correct.  I
mean, isn't the root of the problem here that ExecConstraints() is
expecting that resultRelInfo matches slot, and it doesn't? And why
isn't that also a problem for the things that get passed resultRelInfo
and slot after tuple routing and before ExecConstraints?  In
particular, in copy.c, ExecBRInsertTriggers.

Committed 0002.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/14 6:24, Robert Haas wrote:
> On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote wrote:
>>
>> Thanks!  I realized however that the approach I used in 0002 of passing
>> the original slot to ExecConstraints() fails in certain situations.  For
>> example, if a BR trigger changes the tuple, the original slot would not
>> receive those changes, so it will be wrong to use such a tuple anymore.
>> In attached 0001, I switched back to the approach of converting the
>> partition-tupdesc-based tuple back to the root partitioned table's format.
>>  The converted tuple contains the changes by BR triggers, if any.  Sorry
>> about some unnecessary work.
> 
> Hmm.  Even with this patch, I wonder if this is really correct.  I
> mean, isn't the root of the problem here that ExecConstraints() is
> expecting that resultRelInfo matches slot, and it doesn't?

The problem is that whereas the SlotValueDescription that we build to show
in the error message should be based on the tuple that was passed to
ExecInsert() or whatever NextCopyFrom() returned for CopyFrom() to
process, it might fail to be the case if the tuple needed to be
converted after tuple routing.  The slot (the tuple in it and its tuple
descriptor) and resultRelInfo that ExecConstraints() receives *do*
correspond with each other, even after possible tuple conversion following
tuple-routing, and hence constraint checking itself works fine (since
commit 2ac3ef7a01 [1]).  As said, it's the val_desc built to show in the
error message being based on the converted-for-partition tuple that could
be seen as a problem - is it acceptable if we showed in the error message
whatever the converted-for-partition tuple looks like which might have
columns ordered differently from the root table?  If so, we could simply
forget the whole thing, including reverting f1b4c771 [2].

An example:

create table p (a int, b char, c int) partition by list (a);
create table p1 (b char, c int, a int);    -- note the column order
alter table p attach partition p1 for values in (1);
alter table p add constraint check_b check (b = 'x');

insert into p values (1, 'y', 1);
ERROR:  new row for relation "p1" violates check constraint "check_b"
DETAIL:  Failing row contains (y, 1, 1).

Note that "(y, 1, 1)" results from using p1's descriptor on the converted
tuple.  As long as that's clear and acceptable, I think we need not worry
about this patch and revert the previously committed patch for this "problem".

> And why
> isn't that also a problem for the things that get passed resultRelInfo
> and slot after tuple routing and before ExecConstraints?  In
> particular, in copy.c, ExecBRInsertTriggers.

If I explained the problem (as I'm seeing it) well enough above, you may
see why that's not an issue in other cases.  Other sub-routines viz.
ExecBRInsertTriggers, ExecWithCheckOptions for RLS INSERT WITH CHECK
policies, ExecARInsertTriggers, etc. do receive the correct slot and
resultRelInfo for whatever they do with a given tuple and the relation
(partition) it is being inserted into.  However, if we do consider the
above to be a problem, then we would need to fix ExecWithCheckOptions() to
do whatever we decide ExecConstraints() should do, because it can also
report WITH CHECK violation for a tuple.

> Committed 0002.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ac3ef7
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f1b4c77





Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Mon, Jan 16, 2017 at 4:09 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The problem is that whereas the SlotValueDescription that we build to show
> in the error message should be based on the tuple that was passed to
> ExecInsert() or whatever NextCopyFrom() returned for CopyFrom() to
> process, it might fail to be the case if the tuple needed to be
> converted after tuple routing.  The slot (the tuple in it and its tuple
> descriptor) and resultRelInfo that ExecConstraints() receives *do*
> correspond with each other, even after possible tuple conversion following
> tuple-routing, and hence constraint checking itself works fine (since
> commit 2ac3ef7a01 [1]).  As said, it's the val_desc built to show in the
> error message being based on the converted-for-partition tuple that could
> be seen as a problem - is it acceptable if we showed in the error message
> whatever the converted-for-partition tuple looks like which might have
> columns ordered differently from the root table?  If so, we could simply
> forget the whole thing, including reverting f1b4c771 [2].
>
> An example:
>
> create table p (a int, b char, c int) partition by list (a);
> create table p1 (b char, c int, a int);    -- note the column order
> alter table p attach partition p1 for values in (1);
> alter table p add constraint check_b check (b = 'x');
>
> insert into p values (1, 'y', 1);
> ERROR:  new row for relation "p1" violates check constraint "check_b"
> DETAIL:  Failing row contains (y, 1, 1).
>
> Note that "(y, 1, 1)" results from using p1's descriptor on the converted
> tuple.  As long as that's clear and acceptable, I think we need not worry
> about this patch and revert the previously committed patch for this "problem".

Hmm.  It would be fine, IMHO, if the detail message looked like the
one that BuildIndexValueDescription produces.  Without the column
names, the clarity is somewhat lessened.

Anybody else have an opinion on this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> [ updated patches ]

I committed 0004 and also fixed the related regression test not to
rely on DROP .. CASCADE, which isn't always stable.  The remainder of
this patch set needs a rebase, and perhaps you could also fold in
other pending partitioning fixes so I have everything to look at it in
one place.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Jan 18, 2017 at 3:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> [ updated patches ]
>
> I committed 0004 and also fixed the related regression test not to
> rely on DROP .. CASCADE, which isn't always stable.  The remainder of
> this patch set needs a rebase, and perhaps you could also fold in
> other pending partitioning fixes so I have everything to look at it in
> one place.

Just to be a little more clear, I don't mind multiple threads each
with a patch or patch set so much, but multiple patch sets on the same
thread gets hard for me to track.  Sorry for the inconvenience.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/19 5:29, Robert Haas wrote:
> On Wed, Jan 18, 2017 at 3:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jan 10, 2017 at 6:06 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> [ updated patches ]
>>
>> I committed 0004 and also fixed the related regression test not to
>> rely on DROP .. CASCADE, which isn't always stable.  The remainder of

Thanks!

>> this patch set needs a rebase, and perhaps you could also fold in
>> other pending partitioning fixes so I have everything to look at it in
>> one place.
> 
> Just to be a little more clear, I don't mind multiple threads each
> with a patch or patch set so much, but multiple patch sets on the same
> thread gets hard for me to track.  Sorry for the inconvenience.

OK, I agree that having multiple patch sets on the same thread is cumbersome.

So, here are all the patches I posted to date (and one new at the bottom)
for reported and unreported bugs, excluding the one involving
BulkInsertState for which you replied in a new thread.

I'll describe the attached patches in brief:

0001-Fix-a-bug-of-insertion-into-an-internal-partition.patch

Since implicit partition constraints are not inherited, an internal
partition's constraint was not being enforced when targeted directly.
So, include such constraint when setting up leaf partition result
relations for tuple-routing.

0002-Set-ecxt_scantuple-correctly-for-tuple-routing.patch

In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that
it's possible for a different TupleTableSlot to be used for partitioned
tables at successively lower levels.  If we do end up changing the slot
from the original, we must update ecxt_scantuple to point to the new one
for partition key of the tuple to be computed correctly.

Last posted here:
https://www.postgresql.org/message-id/99fbfad6-b89b-9427-a6ca-197aad98c48e%40lab.ntt.co.jp

0003-Fix-RETURNING-to-work-correctly-after-tuple-routing.patch

In ExecInsert(), do not switch back to the root partitioned table
ResultRelInfo until after we finish ExecProcessReturning(), so that
RETURNING projection is done using the partition's descriptor.  For
the projection to work correctly, we must initialize the same for
each leaf partition during ModifyTableState initialization.

0004-Fix-some-issues-with-views-and-partitioned-tables.patch

Automatically updatable views failed to handle partitioned tables.
Once that's fixed, WITH CHECK OPTIONS wouldn't work correctly without
the WCO expressions having been suitably converted for each partition
(think applying map_partition_varattnos to Vars in the WCO expressions
just like with partition constraint expressions).

0005-Fix-some-wrong-thinking-in-check_new_partition_bound.patch

Because a given range bound in the PartitionBoundInfo.datums array
is sometimes a range lower bound and at other times an upper bound, we
must be careful when assuming which, especially when interpreting
the result of partition_bound_bsearch which returns the index of the
greatest bound that is less than or equal to probe.  Due to an error
in thinking about the same, the relevant code in
check_new_partition_bound() caused invalid partition (index = -1)
to be chosen as the partition being overlapped.

Last posted here:
https://www.postgresql.org/message-id/603acb8b-5dec-31e8-29b0-609a68aac50f%40lab.ntt.co.jp

0006-Avoid-tuple-coversion-in-common-partitioning-cases.patch

Currently, the tuple conversion is performed after a tuple is routed,
even if the attributes of a target leaf partition map one-to-one with
those of the root table, which is wasteful.  Avoid that by making
convert_tuples_by_name() return a NULL map for such cases.

0007-Avoid-code-duplication-in-map_partition_varattnos.patch

Code to map attribute numbers in map_partition_varattnos() duplicates
what convert_tuples_by_name_map() does.  Avoid that as pointed out by
Álvaro Herrera.

Last posted here:
https://www.postgresql.org/message-id/9ce97382-54c8-deb3-9ee9-a2ec271d866b%40lab.ntt.co.jp

0008-Avoid-DROP-TABLE-.-CASCADE-in-more-partitioning-test.patch

This is the new one.  There were quite a few commits recently to fix the
breakage in regression tests due to not using ORDER BY in queries on
system catalogs and using DROP TABLE ... CASCADE.  There were still some
instances of the latter in create_table.sql and alter_table.sql.

Thanks,
Amit


Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/19 14:15, Amit Langote wrote:
> So, here are all the patches I posted to date (and one new at the bottom)
> for reported and unreported bugs, excluding the one involving
> BulkInsertState for which you replied in a new thread.
> 
> I'll describe the attached patches in brief:

Sorry, I forgot to mention that I have skipped the patch I proposed to
modify the committed approach [1] to get the correct tuple to show in the
constraint violation messages.  It might be better to continue that
discussion at [2].

And because I skipped that patch, I should have removed the related logic
in ExecWithCheckOptions() that I added in one of the later patches (patch
0004), which I forgot to do.  So, here are all the patches again with the
correct 0004 this time.  Sigh.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f1b4c771ea74f42447dccaed42ffcdcccf3aa694

[2]
https://www.postgresql.org/message-id/CA%2BTgmoZjGzSM5WwnyapFaw3GxnDLWh7pm8Xiz8_QWQnUQy%3DSCA%40mail.gmail.com


Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Jan 19, 2017 at 12:15 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 0001-Fix-a-bug-of-insertion-into-an-internal-partition.patch
>
> Since implicit partition constraints are not inherited, an internal
> partition's constraint was not being enforced when targeted directly.
> So, include such constraint when setting up leaf partition result
> relations for tuple-routing.

Committed.

> 0002-Set-ecxt_scantuple-correctly-for-tuple-routing.patch
>
> In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that
> it's possible for a different TupleTableSlot to be used for partitioned
> tables at successively lower levels.  If we do end up changing the slot
> from the original, we must update ecxt_scantuple to point to the new one
> for partition key of the tuple to be computed correctly.
>
> Last posted here:
> https://www.postgresql.org/message-id/99fbfad6-b89b-9427-a6ca-197aad98c48e%40lab.ntt.co.jp

Why does FormPartitionKeyDatum need this?  Could that be documented in
a comment in here someplace, perhaps the header comment to
FormPartitionKeyDatum?

> 0003-Fix-RETURNING-to-work-correctly-after-tuple-routing.patch
>
> In ExecInsert(), do not switch back to the root partitioned table
> ResultRelInfo until after we finish ExecProcessReturning(), so that
> RETURNING projection is done using the partition's descriptor.  For
> the projection to work correctly, we must initialize the same for
> each leaf partition during ModifyTableState initialization.

Committed.

> 0004-Fix-some-issues-with-views-and-partitioned-tables.patch
>
> Automatically updatable views failed to handle partitioned tables.
> Once that's fixed, WITH CHECK OPTIONS wouldn't work correctly without
> the WCO expressions having been suitably converted for each partition
> (think applying map_partition_varattnos to Vars in the WCO expressions
> just like with partition constraint expressions).

The changes to execMain.c contain a hunk which has essentially the
same code twice.  That looks like a mistake.  Also, the patch doesn't
compile because convert_tuples_by_name() takes 3 arguments, not 4.

> 0005-Fix-some-wrong-thinking-in-check_new_partition_bound.patch
>
> Because a given range bound in the PartitionBoundInfo.datums array
> is sometimes a range lower bound and upper bound at other times, we
> must be careful when assuming which, especially when interpreting
> the result of partition_bound_bsearch which returns the index of the
> greatest bound that is less than or equal to probe.  Due to an error
> in thinking about the same, the relevant code in
> check_new_partition_bound() caused invalid partition (index = -1)
> to be chosen as the partition being overlapped.
>
> Last posted here:
> https://www.postgresql.org/message-id/603acb8b-5dec-31e8-29b0-609a68aac50f%40lab.ntt.co.jp
                    }
+                    /*
+                     * If equal has been set to true or if there is no "gap"
+                     * between the bound at off1 and that at off1 + 1, the new
+                     * partition will overlap some partition.  In the former
+                     * case, the new lower bound is found to be equal to the
+                     * bound at off1, which could only ever be true if the
+                     * latter is the lower bound of some partition.  It's
+                     * clear in such a case that the new partition overlaps
+                     * that partition, whose index we get using its upper
+                     * bound (that is, using the bound at off1 + 1).
+                     */
                    else

Stylistically, we usually avoid this, or at least I do.  The comment
should go inside the "else" block.  But it looks OK apart from that,
so committed with a little rephrasing and reformatting of the comment.

> 0006-Avoid-tuple-coversion-in-common-partitioning-cases.patch
>
> Currently, the tuple conversion is performed after a tuple is routed,
> even if the attributes of a target leaf partition map one-to-one with
> those of the root table, which is wasteful.  Avoid that by making
> convert_tuples_by_name() return a NULL map for such cases.

+        Assert(!consider_typeid && indesc->tdhasoid == outdesc->tdhasoid);

I think you mean Assert(consider_typeid || indesc->tdhasoid ==
outdesc->tdhasoid);

But I wonder why we don't instead just change this function to
consider tdhasoid rather than tdtypeid.  I mean, if the only point of
comparing the type OIDs is to find out whether the table-has-OIDs
setting matches, we could instead test that directly and avoid needing
to pass an extra argument.  I wonder if there's some other reason this
code is there which is not documented in the comment...

> 0007-Avoid-code-duplication-in-map_partition_varattnos.patch
>
> Code to map attribute numbers in map_partition_varattnos() duplicates
> what convert_tuples_by_name_map() does.  Avoid that as pointed out by
> Álvaro Herrera.
>
> Last posted here:
> https://www.postgresql.org/message-id/9ce97382-54c8-deb3-9ee9-a2ec271d866b%40lab.ntt.co.jp

Committed.

> 0008-Avoid-DROP-TABLE-.-CASCADE-in-more-partitioning-test.patch
>
> This is the new one.  There were quite a few commits recently to fix the
> breakage in regression tests due to not using ORDER BY in queries on
> system catalogs and using DROP TABLE ... CASCADE.  There were still some
> instances of the latter in create_table.sql and alter_table.sql.

Committed.

Phew.  Thanks for all the patches, sorry I'm having trouble keeping up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/20 4:18, Robert Haas wrote:
> On Thu, Jan 19, 2017 at 12:15 AM, Amit Langote wrote:
>> 0002-Set-ecxt_scantuple-correctly-for-tuple-routing.patch
>>
>> In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that
>> it's possible for a different TupleTableSlot to be used for partitioned
>> tables at successively lower levels.  If we do end up changing the slot
>> from the original, we must update ecxt_scantuple to point to the new one
>> for partition key of the tuple to be computed correctly.
>>
>> Last posted here:
>> https://www.postgresql.org/message-id/99fbfad6-b89b-9427-a6ca-197aad98c48e%40lab.ntt.co.jp
> 
> Why does FormPartitionKeyDatum need this?  Could that be documented in
> a comment in here someplace, perhaps the header comment to
> FormPartitionKeyDatum?

FormPartitionKeyDatum() computes partition key expressions (if any) and
the expression evaluation machinery expects ecxt_scantuple to point to the
tuple to extract attribute values from.

FormPartitionKeyDatum() already has a tiny comment, which seems to be
about the only thing we could say here:

* the ecxt_scantuple slot of estate's per-tuple expr context must point to
* the heap tuple passed in.

In get_partition_for_tuple() which calls FormPartitionKeyDatum(), the
patch adds this comment (changed it a little from the last version):

+ /*
+  * Extract partition key from tuple. Expression evaluation machinery
+  * that FormPartitionKeyDatum() invokes expects ecxt_scantuple to
+  * point to the correct tuple slot.  The slot might have changed from
+  * what was used for the parent table if the table of the current
+  * partitioning level has different tuple descriptor from the parent.
+  * So update ecxt_scantuple accordingly.
+  */
+ ecxt->ecxt_scantuple = slot;
FormPartitionKeyDatum(parent, slot, estate, values, isnull);

It says why we need to change which slot ecxt_scantuple points to.

>> 0004-Fix-some-issues-with-views-and-partitioned-tables.patch
>>
>> Automatically updatable views failed to handle partitioned tables.
>> Once that's fixed, WITH CHECK OPTIONS wouldn't work correctly without
>> the WCO expressions having been suitably converted for each partition
>> (think applying map_partition_varattnos to Vars in the WCO expressions
>> just like with partition constraint expressions).
> 
> The changes to execMain.c contain a hunk which has essentially the
> same code twice.  That looks like a mistake.  Also, the patch doesn't
> compile because convert_tuples_by_name() takes 3 arguments, not 4.

Actually, I realized that and sent the updated version [1] of this patch
that fixed this issue.  In the updated version, I removed that code block
(the 2 copies of it), because we are still discussing what to do about
showing tuples in constraint violation (in this case, WITH CHECK OPTION
violation) messages.  Anyway, attached here again.

>> 0006-Avoid-tuple-coversion-in-common-partitioning-cases.patch
>>
>> Currently, the tuple conversion is performed after a tuple is routed,
>> even if the attributes of a target leaf partition map one-to-one with
>> those of the root table, which is wasteful.  Avoid that by making
>> convert_tuples_by_name() return a NULL map for such cases.
> 
> +        Assert(!consider_typeid && indesc->tdhasoid == outdesc->tdhasoid);
> 
> I think you mean Assert(consider_typeid || indesc->tdhasoid ==
> outdesc->tdhasoid);

Ah, you're right.

> But I wonder why we don't instead just change this function to
> consider tdhasoid rather than tdtypeid.  I mean, if the only point of
> comparing the type OIDs is to find out whether the table-has-OIDs
> setting matches, we could instead test that directly and avoid needing
> to pass an extra argument.  I wonder if there's some other reason this
> code is there which is not documented in the comment...

With the following patch, regression tests run fine:

  if (indesc->natts == outdesc->natts &&
-     indesc->tdtypeid == outdesc->tdtypeid)
+     indesc->tdhasoid == outdesc->tdhasoid)
     {

If checking tdtypeid (instead of tdhasoid directly) has some other
consideration, I would have seen at least some tests broken by this
change.  So, if we are to go with this, I too prefer it over my previous
proposal to add an argument to convert_tuples_by_name().  Attached 0003
implements the above approach.

> Phew.  Thanks for all the patches, sorry I'm having trouble keeping up.

Thanks a lot for taking time to look through them and committing.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/92fe2a71-5eb7-ee8d-53ef-cfd5a65dfc3d%40lab.ntt.co.jp


Re: [HACKERS] Declarative partitioning - another take

From
Keith Fiske
Date:
So, testing things out in pg_partman for native sub-partitioning, I ran into what is a bug for me that I know I have to fix, but I'm curious whether it could be prevented in the first place within the native partitioning code itself. The output below shows a sub-partitioning set where the sub-partition has a constraint range that is outside the range of its parent. If the columns were different I could see how this would be allowed, but the same column is used throughout the levels of sub-partitioning. Understandable if that may be too complex to check for, but I figured I'd bring it up as something I accidentally ran into, in case you see an easy way to prevent it.

This was encountered as of commit ba61a04bc7fefeee03416d9911eb825c4897c223.

keith@keith=# \d+ partman_test.time_taptest_table
                                  Table "partman_test.time_taptest_table"
 Column |           Type           | Collation | Nullable | Default | Storage  | Stats target | Description
--------+--------------------------+-----------+----------+---------+----------+--------------+-------------
 col1   | integer                  |           |          |         | plain    |              |
 col2   | text                     |           |          |         | extended |              |
 col3   | timestamp with time zone |           | not null | now()   | plain    |              |
Partition key: RANGE (col3)
Partitions: partman_test.time_taptest_table_p2015 FOR VALUES FROM ('2015-01-01 00:00:00-05') TO ('2016-01-01 00:00:00-05'),
            partman_test.time_taptest_table_p2016 FOR VALUES FROM ('2016-01-01 00:00:00-05') TO ('2017-01-01 00:00:00-05'),
            partman_test.time_taptest_table_p2017 FOR VALUES FROM ('2017-01-01 00:00:00-05') TO ('2018-01-01 00:00:00-05'),
            partman_test.time_taptest_table_p2018 FOR VALUES FROM ('2018-01-01 00:00:00-05') TO ('2019-01-01 00:00:00-05'),
            partman_test.time_taptest_table_p2019 FOR VALUES FROM ('2019-01-01 00:00:00-05') TO ('2020-01-01 00:00:00-05')

keith@keith=# \d+ partman_test.time_taptest_table_p2015
                               Table "partman_test.time_taptest_table_p2015"
 Column |           Type           | Collation | Nullable | Default | Storage  | Stats target | Description
--------+--------------------------+-----------+----------+---------+----------+--------------+-------------
 col1   | integer                  |           |          |         | plain    |              |
 col2   | text                     |           |          |         | extended |              |
 col3   | timestamp with time zone |           | not null | now()   | plain    |              |
Partition of: partman_test.time_taptest_table FOR VALUES FROM ('2015-01-01 00:00:00-05') TO ('2016-01-01 00:00:00-05')
Partition key: RANGE (col3)
Partitions: partman_test.time_taptest_table_p2015_p2016_11 FOR VALUES FROM ('2016-11-01 00:00:00-04') TO ('2016-12-01 00:00:00-05'),
            partman_test.time_taptest_table_p2015_p2016_12 FOR VALUES FROM ('2016-12-01 00:00:00-05') TO ('2017-01-01 00:00:00-05'),
            partman_test.time_taptest_table_p2015_p2017_01 FOR VALUES FROM ('2017-01-01 00:00:00-05') TO ('2017-02-01 00:00:00-05'),
            partman_test.time_taptest_table_p2015_p2017_02 FOR VALUES FROM ('2017-02-01 00:00:00-05') TO ('2017-03-01 00:00:00-05'),
            partman_test.time_taptest_table_p2015_p2017_03 FOR VALUES FROM ('2017-03-01 00:00:00-05') TO ('2017-04-01 00:00:00-04')

keith@keith=# \d+ partman_test.time_taptest_table_p2015_p2017_03
                           Table "partman_test.time_taptest_table_p2015_p2017_03"
 Column |           Type           | Collation | Nullable | Default | Storage  | Stats target | Description
--------+--------------------------+-----------+----------+---------+----------+--------------+-------------
 col1   | integer                  |           |          |         | plain    |              |
 col2   | text                     |           |          |         | extended |              |
 col3   | timestamp with time zone |           | not null | now()   | plain    |              |
Partition of: partman_test.time_taptest_table_p2015 FOR VALUES FROM ('2017-03-01 00:00:00-05') TO ('2017-04-01 00:00:00-04')

--
Keith Fiske
Database Administrator
OmniTI Computer Consulting, Inc.
http://www.keithf4.com

Re: [HACKERS] Declarative partitioning - another take

From
Andres Freund
Date:
On 2017-01-19 14:18:23 -0500, Robert Haas wrote:
> Committed.

One of the patches around this topic committed recently seems to cause
valgrind failures like
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-01-19%2008%3A40%3A02
:
==24969== Conditional jump or move depends on uninitialised value(s)
==24969==    at 0x4C38DA: btint4cmp (nbtcompare.c:97)
==24969==    by 0x83860C: FunctionCall2Coll (fmgr.c:1318)
==24969==    by 0x536643: partition_bounds_equal (partition.c:627)
==24969==    by 0x820864: equalPartitionDescs (relcache.c:1203)
==24969==    by 0x82423A: RelationClearRelation (relcache.c:2553)
==24969==    by 0x8248BA: RelationFlushRelation (relcache.c:2662)
==24969==    by 0x824983: RelationCacheInvalidateEntry (relcache.c:2714)
==24969==    by 0x81D9D6: LocalExecuteInvalidationMessage (inval.c:568)
==24969==    by 0x81CB0D: ProcessInvalidationMessages (inval.c:444)
==24969==    by 0x81D3CB: CommandEndInvalidationMessages (inval.c:1056)
==24969==    by 0x4F6735: AtCCI_LocalCache (xact.c:1374)
==24969==    by 0x4F8249: CommandCounterIncrement (xact.c:957)
==24969==  Uninitialised value was created by a heap allocation
==24969==    at 0x85AA83: palloc (mcxt.c:914)
==24969==    by 0x53648E: RelationBuildPartitionDesc (partition.c:528)
==24969==    by 0x823F93: RelationBuildDesc (relcache.c:1348)
==24969==    by 0x8241DB: RelationClearRelation (relcache.c:2524)
==24969==    by 0x8248BA: RelationFlushRelation (relcache.c:2662)
==24969==    by 0x824983: RelationCacheInvalidateEntry (relcache.c:2714)
==24969==    by 0x81D9D6: LocalExecuteInvalidationMessage (inval.c:568)
==24969==    by 0x81CB0D: ProcessInvalidationMessages (inval.c:444)
==24969==    by 0x81D3CB: CommandEndInvalidationMessages (inval.c:1056)
==24969==    by 0x4F6735: AtCCI_LocalCache (xact.c:1374)
==24969==    by 0x4F8249: CommandCounterIncrement (xact.c:957)
==24969==    by 0x82538B: RelationSetNewRelfilenode (relcache.c:3490)
==24969== 
==24969== VALGRINDERROR-END


Regards,

Andres



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Andres,

On 2017/01/20 15:15, Andres Freund wrote:
> On 2017-01-19 14:18:23 -0500, Robert Haas wrote:
>> Committed.
> 
> One of the patches around this topic committed recently seems to cause
> valgrind failures like
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-01-19%2008%3A40%3A02
> :
> ==24969== Conditional jump or move depends on uninitialised value(s)
> ==24969==    at 0x4C38DA: btint4cmp (nbtcompare.c:97)
> ==24969==    by 0x83860C: FunctionCall2Coll (fmgr.c:1318)
> ==24969==    by 0x536643: partition_bounds_equal (partition.c:627)
> ==24969==    by 0x820864: equalPartitionDescs (relcache.c:1203)
> ==24969==    by 0x82423A: RelationClearRelation (relcache.c:2553)
> ==24969==    by 0x8248BA: RelationFlushRelation (relcache.c:2662)
> ==24969==    by 0x824983: RelationCacheInvalidateEntry (relcache.c:2714)
> ==24969==    by 0x81D9D6: LocalExecuteInvalidationMessage (inval.c:568)
> ==24969==    by 0x81CB0D: ProcessInvalidationMessages (inval.c:444)
> ==24969==    by 0x81D3CB: CommandEndInvalidationMessages (inval.c:1056)
> ==24969==    by 0x4F6735: AtCCI_LocalCache (xact.c:1374)
> ==24969==    by 0x4F8249: CommandCounterIncrement (xact.c:957)
> ==24969==  Uninitialised value was created by a heap allocation
> ==24969==    at 0x85AA83: palloc (mcxt.c:914)
> ==24969==    by 0x53648E: RelationBuildPartitionDesc (partition.c:528)
> ==24969==    by 0x823F93: RelationBuildDesc (relcache.c:1348)
> ==24969==    by 0x8241DB: RelationClearRelation (relcache.c:2524)
> ==24969==    by 0x8248BA: RelationFlushRelation (relcache.c:2662)
> ==24969==    by 0x824983: RelationCacheInvalidateEntry (relcache.c:2714)
> ==24969==    by 0x81D9D6: LocalExecuteInvalidationMessage (inval.c:568)
> ==24969==    by 0x81CB0D: ProcessInvalidationMessages (inval.c:444)
> ==24969==    by 0x81D3CB: CommandEndInvalidationMessages (inval.c:1056)
> ==24969==    by 0x4F6735: AtCCI_LocalCache (xact.c:1374)
> ==24969==    by 0x4F8249: CommandCounterIncrement (xact.c:957)
> ==24969==    by 0x82538B: RelationSetNewRelfilenode (relcache.c:3490)
> ==24969== 
> ==24969== VALGRINDERROR-END

Thanks for the report.  This being my first time reading a valgrind report
on the buildfarm, is it correct to assume that the command immediately
preceding VALGRINDERROR-BEGIN is what triggered the failure?

... LOG:  statement: truncate list_parted;
==24969== VALGRINDERROR-BEGIN
==24969== Conditional jump or move depends on uninitialised value(s)
==24969==    at 0x4C38DA: btint4cmp (nbtcompare.c:97)

So in this case: truncate list_parted?

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Jan 20, 2017 at 1:15 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-01-19 14:18:23 -0500, Robert Haas wrote:
>> Committed.
>
> One of the patches around this topic committed recently seems to cause
> valgrind failures like
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-01-19%2008%3A40%3A02
> :

Tom Lane independently reported this same issue.  I believe I've fixed
it.  Sorry for not noticing and crediting this report also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/21 6:29, Robert Haas wrote:
> On Fri, Jan 20, 2017 at 1:15 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2017-01-19 14:18:23 -0500, Robert Haas wrote:
>>> Committed.
>>
>> One of the patches around this topic committed recently seems to cause
>> valgrind failures like
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-01-19%2008%3A40%3A02
>> :
> 
> Tom Lane independently reported this same issue.  I believe I've fixed
> it.  Sorry for not noticing and crediting this report also.

Thanks Robert for looking at this and committing the fix.

Regards,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> But I wonder why we don't instead just change this function to
>> consider tdhasoid rather than tdtypeid.  I mean, if the only point of
>> comparing the type OIDs is to find out whether the table-has-OIDs
>> setting matches, we could instead test that directly and avoid needing
>> to pass an extra argument.  I wonder if there's some other reason this
>> code is there which is not documented in the comment...
>
> With the following patch, regression tests run fine:
>
>   if (indesc->natts == outdesc->natts &&
> -     indesc->tdtypeid == outdesc->tdtypeid)
> +     indesc->tdhasoid != outdesc->tdhasoid)
>      {
>
> If checking tdtypeid (instead of tdhasoid directly) has some other
> consideration, I would have seen at least some tests broken by this
> change.  So, if we are to go with this, I too prefer it over my previous
> proposal to add an argument to convert_tuples_by_name().  Attached 0003
> implements the above approach.

I think this is not quite right.  First, the patch compares the
tdhasoid status with != rather than ==, which would have the effect of
saying that we can skip conversion if the has-OID statuses do NOT
match.  That can't be right.  Second, I believe that the comments
imply that conversion should be done if *either* tuple has OIDs.  I
believe that's because whoever wrote this comment thought that we
needed to replace the OID if the tuple already had one, which is what
do_convert_tuple would do.  I'm not sure whether that's really
necessary, but we're less likely to break anything if we preserve the
existing behavior, and I don't think we lose much from doing so
because few user tables will have OIDs.  So I would change this test
to if (indesc->natts == outdesc->natts && !indesc->tdhasoid &&
!outdesc->tdhasoid), and I'd revise the one in
convert_tuples_by_position() to match.  Then I think it's much clearer
that we're just optimizing what's there already, not changing the
behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> [ new patches ]

Committed 0001 and 0002.  See my earlier email for comments on 0003.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Keith,

On 2017/01/20 12:40, Keith Fiske wrote:
> So testing things out in pg_partman for native sub-partitioning and ran
> into what is a bug for me that I know I have to fix, but I'm curious if
> this can be prevented in the first place within the native partitioning
> code itself. The below shows a sub-partitioning set where the sub-partition
> has a constraint range that is outside of the range of its parent. If the
> columns were different I could see where this would be allowed, but the
> same column is used throughout the levels of sub-partitioning.
> Understandable if that may be too complex to check for, but figured I'd
> bring it up as something I accidentally ran into in case you see an easy
> way to prevent it.

This was discussed.  See Robert's response (2nd part of):
https://www.postgresql.org/message-id/CA%2BTgmoaQABrsLQK4ms_4NiyavyJGS-b6ZFkZBBNC%2B-P5DjJNFA%40mail.gmail.com

In short, if partitions are defined across different levels such that some of
the data the user intended to insert into the table (the part of the
sub-partition's range that doesn't overlap with its parent's) cannot actually
be inserted, that's an operator error.  It's like adding contradictory check
constraints to the table:

create table foo (a int check (a > 0 and a < 0));
insert into foo values (1);
ERROR:  new row for relation "foo" violates check constraint "foo_a_check"
DETAIL:  Failing row contains (1).

One (perhaps the only) thing that could be done is to warn users against
this kind of mistake in the documentation.  Trying to do anything else in
the core partitioning code would make it too complicated for not much
benefit (also see Robert's last line, :)).
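For concreteness, the cross-level mismatch under discussion can be reproduced
with something like the following (a hypothetical sketch; the table names are
made up and the error text is approximate):

```sql
CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p
    FOR VALUES FROM (1) TO (10) PARTITION BY RANGE (a);
-- Accepted today, even though [10, 20) lies outside p1's own bounds:
CREATE TABLE p1_1 PARTITION OF p1 FOR VALUES FROM (5) TO (20);
-- Tuple routing at the root never sends a >= 10 to p1, so that part of
-- p1_1's range is dead; inserting through p fails at the root:
INSERT INTO p VALUES (15);  -- fails: no partition of "p" accepts 15
```

The dead range is thus equivalent to a contradictory check constraint, as in
the example above.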

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/25 2:56, Robert Haas wrote:
> On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> But I wonder why we don't instead just change this function to
>>> consider tdhasoid rather than tdtypeid.  I mean, if the only point of
>>> comparing the type OIDs is to find out whether the table-has-OIDs
>>> setting matches, we could instead test that directly and avoid needing
>>> to pass an extra argument.  I wonder if there's some other reason this
>>> code is there which is not documented in the comment...
>>
>> With the following patch, regression tests run fine:
>>
>>   if (indesc->natts == outdesc->natts &&
>> -     indesc->tdtypeid == outdesc->tdtypeid)
>> +     indesc->tdhasoid != outdesc->tdhasoid)
>>      {
>>
>> If checking tdtypeid (instead of tdhasoid directly) has some other
>> consideration, I would have seen at least some tests broken by this
>> change.  So, if we are to go with this, I too prefer it over my previous
>> proposal to add an argument to convert_tuples_by_name().  Attached 0003
>> implements the above approach.
> 
> I think this is not quite right.  First, the patch compares the
> tdhasoid status with != rather than ==, which would have the effect of
> saying that we can skip conversion if the has-OID statuses do NOT
> match.  That can't be right.

You're right.

> Second, I believe that the comments
> imply that conversion should be done if *either* tuple has OIDs.  I
> believe that's because whoever wrote this comment thought that we
> needed to replace the OID if the tuple already had one, which is what
> do_convert_tuple would do.  I'm not sure whether that's really
> necessary, but we're less likely to break anything if we preserve the
> existing behavior, and I don't think we lose much from doing so
> because few user tables will have OIDs.  So I would change this test
> to if (indesc->natts == outdesc->natts && !indesc->tdhasoid &&
> !outdesc->tdhasoid), and I'd revise the one in
> convert_tuples_by_position() to match.  Then I think it's much clearer
> that we're just optimizing what's there already, not changing the
> behavior.

Agreed.  Updated patch attached.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/25 5:55, Robert Haas wrote:
> On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> [ new patches ]
> 
> Committed 0001 and 0002.  See my earlier email for comments on 0003.

It seems patches for all the issues mentioned in this thread so far,
except 0003 (I just sent an updated version in another email), have been
committed.  Thanks a lot for your time.  I will create new threads for any
more patches from here on.

Regards,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Ashutosh Bapat
Date:
The documentation available at
https://www.postgresql.org/docs/devel/static/sql-createtable.html,
does not make it clear that the lower bound of a range partition is
always inclusive and the higher one is exclusive. I think a note in
section " PARTITION OF parent_table FOR VALUES partition_bound_spec"
would be helpful.

On Wed, Jan 25, 2017 at 7:11 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/01/25 5:55, Robert Haas wrote:
>> On Thu, Jan 19, 2017 at 9:58 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> [ new patches ]
>>
>> Committed 0001 and 0002.  See my earlier email for comments on 0003.
>
> It seems patches for all the issues mentioned in this thread so far,
> except 0003 (I just sent an updated version in another email), have been
> committed.  Thanks a lot for your time.  I will create new threads for any
> more patches from here on.
>
> Regards,
> Amit
>
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Ashutosh,

On 2017/01/25 14:54, Ashutosh Bapat wrote:
> The documentation available at
> https://www.postgresql.org/docs/devel/static/sql-createtable.html,
> does not make it clear that the lower bound of a range partition is
> always inclusive and the higher one is exclusive. I think a note in
> section " PARTITION OF parent_table FOR VALUES partition_bound_spec"
> would be helpful.

Yes, I'm working on a documentation patch for that and a few other things
such as revising the Partitioning chapter.

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Peter Eisentraut
Date:
On 1/25/17 12:54 AM, Ashutosh Bapat wrote:
> The documentation available at
> https://www.postgresql.org/docs/devel/static/sql-createtable.html,
> does not make it clear that the lower bound of a range partition is
> always inclusive and the higher one is exclusive. I think a note in
> section " PARTITION OF parent_table FOR VALUES partition_bound_spec"
> would be helpful.

Hmm.  I see the practical use of that, but I think this is going to be a
source of endless confusion.  Can we make that a bit clearer in the
syntax, for example by using additional keywords (INCLUSIVE/EXCLUSIVE)?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/01/31 6:42, Peter Eisentraut wrote:
> On 1/25/17 12:54 AM, Ashutosh Bapat wrote:
>> The documentation available at
>> https://www.postgresql.org/docs/devel/static/sql-createtable.html,
>> does not make it clear that the lower bound of a range partition is
>> always inclusive and the higher one is exclusive. I think a note in
>> section " PARTITION OF parent_table FOR VALUES partition_bound_spec"
>> would be helpful.
> 
> Hmm.  I see the practical use of that, but I think this is going to be a
> source of endless confusion.  Can we make that a bit clearer in the
> syntax, for example by using additional keywords (INCLUSIVE/EXCLUSIVE)?

The decision not to make that configurable with INCLUSIVE/EXCLUSIVE syntax
was deliberate [1].  To summarize, we can start with a default configuration
catering to most practical cases (that is, inclusive lower and exclusive
upper bounds) and documenting so (not done yet, which I will post a doc
patch today for).  If it turns out that there is some demand for making
that configurable, we can later add the code to handle that internally
plus the syntax.  But *starting* with that syntax means we have to
potentially needlessly carry the code to handle seldom used cases that
could not be made as efficient as it is now with all lower bounds being
inclusive and upper bounds exclusive.
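Concretely, the fixed bound semantics look like this (a minimal sketch with
made-up table names):

```sql
CREATE TABLE measurement (ts timestamptz) PARTITION BY RANGE (ts);
CREATE TABLE measurement_2017_01 PARTITION OF measurement
    FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');
-- '2017-01-01 00:00:00' is routed here: the lower bound is inclusive.
-- '2017-02-01 00:00:00' is not: the upper bound is exclusive, so it
-- belongs to the next partition, FROM ('2017-02-01') TO ('2017-03-01').
```

Because the upper bound of one partition can serve verbatim as the lower
bound of the next, adjacent partitions tile a range with no gaps or overlaps.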

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZou4ApEvC_nfhOxsi5G4SoD_evwNaiYn60ZcJ4XB_-QQ%40mail.gmail.com





Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Mon, Jan 30, 2017 at 4:42 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 1/25/17 12:54 AM, Ashutosh Bapat wrote:
>> The documentation available at
>> https://www.postgresql.org/docs/devel/static/sql-createtable.html,
>> does not make it clear that the lower bound of a range partition is
>> always inclusive and the higher one is exclusive. I think a note in
>> section " PARTITION OF parent_table FOR VALUES partition_bound_spec"
>> would be helpful.
>
> Hmm.  I see the practical use of that, but I think this is going to be a
> source of endless confusion.  Can we make that a bit clearer in the
> syntax, for example by using additional keywords (INCLUSIVE/EXCLUSIVE)?

I am not in favor of adding mandatory noise words to the syntax; it's
fairly verbose already.  I think it would be reasonable to eventually
consider supporting optional keywords to allow multiple behaviors, but
I don't think that the usefulness of that has been so firmly
established that we should do it right this minute.  I think there are
a heck of a lot of other things about this partitioning implementation
that are more urgently in need of improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
amul sul
Date:
Hi Amit,

Regarding the following code in the ATExecDropNotNull function, I don't see
any check for the RANGE partitioning strategy; is it intended that the same
restriction apply to LIST partitioned tables too?  I guess not.
    /*
     * If the table is a range partitioned table, check that the column is not
     * in the partition key.
     */
    if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
        PartitionKey key = RelationGetPartitionKey(rel);
        int         partnatts = get_partition_natts(key),
                    i;

        for (i = 0; i < partnatts; i++)
        {
            AttrNumber  partattnum = get_partition_col_attnum(key, i);

            if (partattnum == attnum)
                ereport(ERROR,
                        (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
                         errmsg("column \"%s\" is in range partition key",
                                colName)));
        }
    }

Regards,
Amul



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/02/08 21:20, amul sul wrote:
> Regarding following code in ATExecDropNotNull function, I don't see
> any special check for RANGE partitioned, is it intended to have same
> restriction for LIST partitioned too, I guess not?
> 
>   /*
>      * If the table is a range partitioned table, check that the column is not
>      * in the partition key.
>      */
>     if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>     {
>         PartitionKey key = RelationGetPartitionKey(rel);
>         int         partnatts = get_partition_natts(key),
>                     i;
> 
>         for (i = 0; i < partnatts; i++)
>         {
>             AttrNumber  partattnum = get_partition_col_attnum(key, i);
> 
>             if (partattnum == attnum)
>                 ereport(ERROR,
>                         (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
>                          errmsg("column \"%s\" is in range partition key",
>                                 colName)));
>         }
>     }

Good catch!  Seems like an oversight, attached fixes it.
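For reference, the gap is easy to see with a repro along these lines (a
hypothetical sketch, not taken from the patch; the exact messages depend on
the fix):

```sql
CREATE TABLE range_parted (a int NOT NULL) PARTITION BY RANGE (a);
ALTER TABLE range_parted ALTER a DROP NOT NULL;  -- correctly rejected
CREATE TABLE list_parted (a int NOT NULL) PARTITION BY LIST (a);
ALTER TABLE list_parted ALTER a DROP NOT NULL;
-- Before the fix this was also rejected, with the misleading message
-- 'column "a" is in range partition key'; checking the partition
-- strategy lets the LIST case proceed.
```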

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
amul sul
Date:
About 0001-Check-partition-strategy-in-ATExecDropNotNull.patch: the
following test is already covered in alter_table.sql at line 1918;
instead of this, kindly add a test for list partitioning:
 77 +-- cannot drop NOT NULL constraint of a range partition key column
 78 +ALTER TABLE range_parted ALTER a DROP NOT NULL;
 79 +
 

Regards,
Amul



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/02/09 15:22, amul sul wrote:
> About 0001-Check-partition-strategy-in-ATExecDropNotNull.patch,
> following test is already covered in alter_table.sql @ Line # 1918,
> instead of this kindly add test for list_partition:
> 
>  77 +-- cannot drop NOT NULL constraint of a range partition key column
>  78 +ALTER TABLE range_parted ALTER a DROP NOT NULL;
>  79 +

Thanks for the review!  You're right.  Updated patch attached.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Feb 10, 2017 at 12:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/02/09 15:22, amul sul wrote:
>> About 0001-Check-partition-strategy-in-ATExecDropNotNull.patch,
>> following test is already covered in alter_table.sql @ Line # 1918,
>> instead of this kindly add test for list_partition:
>>
>>  77 +-- cannot drop NOT NULL constraint of a range partition key column
>>  78 +ALTER TABLE range_parted ALTER a DROP NOT NULL;
>>  79 +
>
> Thanks for the review!  You're right.  Updated patch attached.

Committed, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Yugo Nagata
Date:
Hi,

I found that a comment for PartitionRoot in ResultRelInfo is missing.
Although this is trivial, since all other members have comments, I
think it is needed. Attached is the patch to fix it.

Regards,
Yugo Nagata

On Tue, 27 Dec 2016 17:59:05 +0900
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

> On 2016/12/26 19:46, Amit Langote wrote:
> > (Perhaps, the following should be its own new thread)
> > 
> > I noticed that ExecProcessReturning() doesn't work properly after tuple
> > routing (example shows how returning tableoid currently fails but I
> > mention some other issues below):
> > 
> > create table p (a int, b int) partition by range (a);
> > create table p1 partition of p for values from (1) to (10);
> > insert into p values (1) returning tableoid::regclass, *;
> >  tableoid | a | b
> > ----------+---+---
> >  -        | 1 |
> > (1 row)
> > 
> > INSERT 0 1
> > 
> > I tried to fix that in 0007 to get:
> > 
> > insert into p values (1) returning tableoid::regclass, *;
> >  tableoid | a | b
> > ----------+---+---
> >  p        | 1 |
> > (1 row)
> > 
> > INSERT 0 1
> > 
> > But I think it *may* be wrong to return the root table OID for tuples
> > inserted into leaf partitions, because with select we get partition OIDs:
> > 
> > select tableoid::regclass, * from p;
> >  tableoid | a | b
> > ----------+---+---
> >  p1       | 1 |
> > (1 row)
> > 
> > If so, that means we should build the projection info (corresponding to
> > the returning list) for each target partition somehow.  ISTM, that's going
> > to have to be done within the planner by appropriate inheritance
> > translation of the original returning targetlist.
> 
> Turns out getting the 2nd result may not require planner tweaks after all.
> Unless I'm missing something, translation of varattnos of the RETURNING
> target list can be done as late as ExecInitModifyTable() for the insert
> case, unlike update/delete (which do require planner's attention).
> 
> I updated the patch 0007 to implement the same, including the test. While
> doing that, I realized map_partition_varattnos introduced in 0003 is
> rather restrictive in its applicability, because it assumes varno = 1 for
> the expressions it accepts as input for the mapping.  Mapping returning
> (target) list required modifying map_partition_varattnos to accept
> target_varno as additional argument.  That way, we can map arbitrary
> expressions from the parent attributes numbers to partition attribute
> numbers for expressions not limited to partition constraints.
> 
> Patches 0001 to 0006 unchanged.
> 
> Thanks,
> Amit


-- 
Yugo Nagata <nagata@sraoss.co.jp>


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Nagata-san,

On 2017/02/23 16:17, Yugo Nagata wrote:
> Hi,
> 
> I found that a comment for PartitionRoot in ResultRelInfo is missing.
> Although this is trivial, since all other members have comments, I
> think it is needed. Attached is the patch to fix it.

Thanks for taking care of that.

+ *        PartitionRoot            relation descriptor for parent relation

Maybe: relation descriptor for root parent relation

Thanks,
Amit





Re: [HACKERS] Declarative partitioning - another take

From
Yugo Nagata
Date:
Hi Amit,

On Thu, 23 Feb 2017 16:29:32 +0900
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

> Thanks for taking care of that.
> 
> + *        PartitionRoot            relation descriptor for parent relation
> 
> Maybe: relation descriptor for root parent relation

This seems better. Patch is updated.

Thanks,
-- 
Yugo Nagata <nagata@sraoss.co.jp>


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Etsuro Fujita
Date:
On 2016/12/14 16:20, Etsuro Fujita wrote:
> On 2016/12/09 19:46, Maksim Milyutin wrote:
>> I would like to work on two tasks:
>>  - insert (and eventually update) tuple routing for foreign partition.
>>  - the ability to create an index on the parent and have all of the
>> children inherit it;
>>
>> The first one has been implemented in pg_pathman somehow, but the code
>> relies on dirty hacks, so the FDW API has to be improved. As for the
>> extended index support, it doesn't look like a super-hard task.

> That would be great!  I'd like to help review the first one.

There seems to be no work on the first one, so I'd like to work on that.

Best regards,
Etsuro Fujita





Re: [HACKERS] Declarative partitioning - another take

From
Maksim Milyutin
Date:
On 07.04.2017 13:05, Etsuro Fujita wrote:
> On 2016/12/14 16:20, Etsuro Fujita wrote:
>> On 2016/12/09 19:46, Maksim Milyutin wrote:
>>> I would like to work on two tasks:
>>>  - insert (and eventually update) tuple routing for foreign partition.
>>>  - the ability to create an index on the parent and have all of the
>>> children inherit it;
>>>
>>> The first one has been implemented in pg_pathman somehow, but the code
>>> relies on dirty hacks, so the FDW API has to be improved. As for the
>>> extended index support, it doesn't look like a super-hard task.
>
>> That would be great!  I'd like to help review the first one.
>
> There seems to be no work on the first one, so I'd like to work on that.

Yes, you can start to work on this, I'll join later as a reviewer.



-- 
Maksim Milyutin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] Declarative partitioning - another take

From
Rajkumar Raghuwanshi
Date:
Hi,

I have observed below with the statement triggers.

I am able to create statement triggers at the root partition, but these triggers are not getting fired on updating a partition.

CREATE TABLE pt (a INT, b INT) PARTITION BY RANGE(a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (1) to (7);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES FROM (7) to (11);
INSERT INTO pt (a,b) SELECT i,i FROM generate_series(1,10)i;
CREATE TABLE pt_trigger(TG_NAME varchar,TG_TABLE_NAME varchar,TG_LEVEL varchar,TG_WHEN varchar);
CREATE OR REPLACE FUNCTION process_pt_trigger() RETURNS TRIGGER AS $pttg$
    BEGIN
        IF (TG_OP = 'UPDATE') THEN
            INSERT INTO pt_trigger SELECT TG_NAME,TG_TABLE_NAME,TG_LEVEL,TG_WHEN;
            RETURN NEW;
        END IF;
        RETURN NULL;
    END;
$pttg$ LANGUAGE plpgsql;
CREATE TRIGGER pt_trigger_after_p0 AFTER UPDATE ON pt FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger();
CREATE TRIGGER pt_trigger_before_p0 BEFORE UPDATE ON pt FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger();

postgres=# UPDATE pt SET a = 2 WHERE a = 1;
UPDATE 1
postgres=# SELECT * FROM pt_trigger ORDER BY 1;
 tg_name | tg_table_name | tg_level | tg_when
---------+---------------+----------+---------
(0 rows)

No statement-level trigger fired in this case; is this expected behaviour?

But if I create the triggers on a leaf partition, the trigger does get fired.

CREATE TRIGGER pt_trigger_after_p1 AFTER UPDATE ON pt1 FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger();
CREATE TRIGGER pt_trigger_before_p1 BEFORE UPDATE ON pt1 FOR EACH STATEMENT EXECUTE PROCEDURE process_pt_trigger();

postgres=# UPDATE pt SET a = 5 WHERE a = 4;
UPDATE 1
postgres=# SELECT * FROM pt_trigger ORDER BY 1;
       tg_name        | tg_table_name | tg_level  | tg_when
----------------------+---------------+-----------+---------
 pt_trigger_after_p1  | pt1           | STATEMENT | AFTER
 pt_trigger_before_p1 | pt1           | STATEMENT | BEFORE
(2 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Hi Rajkumar,

Thanks for testing and the report.

On 2017/04/21 17:00, Rajkumar Raghuwanshi wrote:
> Hi,
> 
> I have observed below with the statement triggers.
> 
> I am able to create statement triggers at root partition, but these
> triggers, not getting fired on updating partition.
> 
> CREATE TABLE pt (a INT, b INT) PARTITION BY RANGE(a);
> CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (1) to (7);
> CREATE TABLE pt2 PARTITION OF pt FOR VALUES FROM (7) to (11);
> INSERT INTO pt (a,b) SELECT i,i FROM generate_series(1,10)i;
> CREATE TABLE pt_trigger(TG_NAME varchar,TG_TABLE_NAME varchar,TG_LEVEL
> varchar,TG_WHEN varchar);
> CREATE OR REPLACE FUNCTION process_pt_trigger() RETURNS TRIGGER AS $pttg$
>     BEGIN
>         IF (TG_OP = 'UPDATE') THEN
>             INSERT INTO pt_trigger SELECT
> TG_NAME,TG_TABLE_NAME,TG_LEVEL,TG_WHEN;
>             RETURN NEW;
>         END IF;
>         RETURN NULL;
>     END;
> $pttg$ LANGUAGE plpgsql;
> CREATE TRIGGER pt_trigger_after_p0 AFTER UPDATE ON pt FOR EACH STATEMENT
> EXECUTE PROCEDURE process_pt_trigger();
> CREATE TRIGGER pt_trigger_before_p0 BEFORE UPDATE ON pt FOR EACH STATEMENT
> EXECUTE PROCEDURE process_pt_trigger();
> 
> postgres=# UPDATE pt SET a = 2 WHERE a = 1;
> UPDATE 1
> postgres=# SELECT * FROM pt_trigger ORDER BY 1;
>  tg_name | tg_table_name | tg_level | tg_when
> ---------+---------------+----------+---------
> (0 rows)
> 
> no statement level trigger fired in this case, is this expected behaviour??

I think we discussed this during the development and decided to allow
per-statement triggers on partitioned tables [1], but it seems it doesn't
quite work for update/delete as your example shows.  You can however see
that per-statement triggers of the root partitioned table *are* fired in
the case of insert.
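
For illustration, the insert case can be checked like this (a sketch reusing pt and pt_trigger from the quoted test case; log_pt_trigger is a hypothetical variant of the quoted function that logs all operations, since process_pt_trigger only logs UPDATEs):

```sql
-- Hypothetical variant of the quoted trigger function that logs every operation
CREATE OR REPLACE FUNCTION log_pt_trigger() RETURNS TRIGGER AS $pttg$
    BEGIN
        INSERT INTO pt_trigger SELECT TG_NAME, TG_TABLE_NAME, TG_LEVEL, TG_WHEN;
        RETURN NULL;
    END;
$pttg$ LANGUAGE plpgsql;

CREATE TRIGGER pt_trigger_insert_p0 AFTER INSERT ON pt
    FOR EACH STATEMENT EXECUTE PROCEDURE log_pt_trigger();

-- Unlike the UPDATE case above, this statement-level trigger on the
-- root partitioned table is fired:
INSERT INTO pt (a, b) VALUES (3, 3);
SELECT tg_name, tg_table_name FROM pt_trigger;
```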

The reason it doesn't work is that we do not allocate ResultRelInfos for
partitioned tables (not even for the root partitioned table in the
update/delete cases), because the current implementation assumes they are
not required.  That's fine only so long as we consider that no rows are
inserted into them, no indexes need to be updated, and that no row-level
triggers are to be fired.  But it misses the fact that we do allow
statement-level triggers on partitioned tables.  So, we should fix things
such that ResultRelInfos are allocated and any statement-level triggers
get fired.  But there are the following questions to consider:

1. Should we consider only the root partitioned table or all partitioned
   tables in a given hierarchy, including the child partitioned tables?
   Note that this is related to a current limitation of updating/deleting
   inheritance hierarchies that we do not currently fire per-statement
   triggers of the child tables.  See the TODO item in wiki:
   https://wiki.postgresql.org/wiki/Todo#Triggers, which says: "When
   statement-level triggers are defined on a parent table, have them fire
   only on the parent table, and fire child table triggers only where
   appropriate"

2. Do we apply the same to inserts on the partitioned tables?  Since
   insert on a partitioned table implicitly affects its partitions, one
   might say that we would need to fire per-statement insert triggers of
   all the partitions.

Meanwhile, attached is a set of patches to fix this.  Descriptions follow:

0001: fire per-statement triggers of the partitioned table mentioned in a
      given update or delete statement

0002: fire per-statement triggers of the child tables during update/delete
      of inheritance hierarchies (including the partitioned table case)

Depending on the answer to 2 above, we can arrange for the per-statement
triggers of all the partitions to be fired upon insert into the
partitioned table.
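
With patch 0001 applied, the quoted test case would be expected to behave like this (a sketch of the intended behavior, not output from an actual run of the patch):

```sql
-- 0001 fires statement-level triggers of the partitioned table named
-- in the command, so the earlier UPDATE should now log both triggers:
UPDATE pt SET a = 2 WHERE a = 1;
SELECT tg_name, tg_table_name, tg_level, tg_when FROM pt_trigger ORDER BY 1;
-- expected: pt_trigger_after_p0 and pt_trigger_before_p0, both on "pt",
-- both at STATEMENT level
```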

> but if i am creating triggers on leaf partition, trigger is getting fired.
> 
> CREATE TRIGGER pt_trigger_after_p1 AFTER UPDATE ON pt1 FOR EACH STATEMENT
> EXECUTE PROCEDURE process_pt_trigger();
> CREATE TRIGGER pt_trigger_before_p1 BEFORE UPDATE ON pt1 FOR EACH STATEMENT
> EXECUTE PROCEDURE process_pt_trigger();
> 
> postgres=# UPDATE pt SET a = 5 WHERE a = 4;
> UPDATE 1
> postgres=# SELECT * FROM pt_trigger ORDER BY 1;
>        tg_name        | tg_table_name | tg_level  | tg_when
> ----------------------+---------------+-----------+---------
>  pt_trigger_after_p1  | pt1           | STATEMENT | AFTER
>  pt_trigger_before_p1 | pt1           | STATEMENT | BEFORE
> (2 rows)

Actually, this works only by accident (with the current implementation,
the *first* partition's triggers will get fired).  If you create another
partition, its per-statement triggers won't get fired.  The patches
described above fix that.

It would be great if you could check if the patches fix the issues.

Also, adding this to the PostgreSQL 10 open items list.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/8e05817e-14c8-13e4-502b-e440adb24258%40lab.ntt.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Mon, Apr 24, 2017 at 6:43 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The reason it doesn't work is that we do not allocate ResultRelInfos for
> partitioned tables (not even for the root partitioned table in the
> update/delete cases), because the current implementation assumes they are
> not required.  That's fine only so long as we consider that no rows are
> inserted into them, no indexes need to be updated, and that no row-level
> triggers are to be fired.  But it misses the fact that we do allow
> statement-level triggers on partitioned tables.  So, we should fix things
> such that ResultRelInfos are allocated so that any statement-level
> triggers are fired. But there are following questions to consider:
>
> 1. Should we consider only the root partitioned table or all partitioned
>    tables in a given hierarchy, including the child partitioned tables?
>    Note that this is related to a current limitation of updating/deleting
>    inheritance hierarchies that we do not currently fire per-statement
>    triggers of the child tables.  See the TODO item in wiki:
>    https://wiki.postgresql.org/wiki/Todo#Triggers, which says: "When
>    statement-level triggers are defined on a parent table, have them fire
>    only on the parent table, and fire child table triggers only where
>    appropriate"
>
> 2. Do we apply the same to inserts on the partitioned tables?  Since
>    insert on a partitioned table implicitly affects its partitions, one
>    might say that we would need to fire per-statement insert triggers of
>    all the partitions.

It seems to me that it doesn't make a lot of sense to fire the
triggers for some tables involved in the hierarchy and not others.  I
suppose the consistent thing to do here is to make sure that we fire
the statement triggers for all tables in the partitioning hierarchy
for all operations (insert, update, delete, etc.).

TBH, I don't like that very much.  I'd rather fire the triggers only
for the table actually named in the query and skip all the others,
mostly because it seems like it would be faster and less likely to
block future optimizations; eventually, I'd like to consider not even
locking the children we're not touching, but that's not going to work
if we have to check them all for triggers.  So what I'd prefer -- on
the totally unprincipled basis that it would let us improve
performance in the future -- if you operate on a partition directly,
you fire the partition's triggers, but if you operate on the parent,
only the parent's triggers fire.

How heretical is that idea?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/04/26 3:58, Robert Haas wrote:
> On Mon, Apr 24, 2017 at 6:43 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The reason it doesn't work is that we do not allocate ResultRelInfos for
>> partitioned tables (not even for the root partitioned table in the
>> update/delete cases), because the current implementation assumes they are
>> not required.  That's fine only so long as we consider that no rows are
>> inserted into them, no indexes need to be updated, and that no row-level
>> triggers are to be fired.  But it misses the fact that we do allow
>> statement-level triggers on partitioned tables.  So, we should fix things
>> such that ResultRelInfos are allocated so that any statement-level
>> triggers are fired. But there are following questions to consider:
>>
>> 1. Should we consider only the root partitioned table or all partitioned
>>    tables in a given hierarchy, including the child partitioned tables?
>>    Note that this is related to a current limitation of updating/deleting
>>    inheritance hierarchies that we do not currently fire per-statement
>>    triggers of the child tables.  See the TODO item in wiki:
>>    https://wiki.postgresql.org/wiki/Todo#Triggers, which says: "When
>>    statement-level triggers are defined on a parent table, have them fire
>>    only on the parent table, and fire child table triggers only where
>>    appropriate"
>>
>> 2. Do we apply the same to inserts on the partitioned tables?  Since
>>    insert on a partitioned table implicitly affects its partitions, one
>>    might say that we would need to fire per-statement insert triggers of
>>    all the partitions.
> 
> It seems to me that it doesn't make a lot of sense to fire the
> triggers for some tables involved in the hierarchy and not others.  I
> suppose the consistent thing to do here is to make sure that we fire
> the statement triggers for all tables in the partitioning hierarchy
> for all operations (insert, update, delete, etc.).
> 
> TBH, I don't like that very much.  I'd rather fire the triggers only
> for the table actually named in the query and skip all the others,
> mostly because it seems like it would be faster and less likely to
> block future optimizations; eventually, I'd like to consider not even
> locking the children we're not touching, but that's not going to work
> if we have to check them all for triggers.  So what I'd prefer -- on
> the totally unprincipled basis that it would let us improve
> performance in the future -- if you operate on a partition directly,
> you fire the partition's triggers, but if you operate on the parent,
> only the parent's triggers fire.

FWIW, I too prefer the latter, that is, fire only the parent's triggers.
In that case, applying only the patch 0001 will do.

Thanks,
Amit




Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
> In that case, applying only the patch 0001 will do.

Do we need to update the documentation?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/04/27 1:52, Robert Haas wrote:
> On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
>> In that case, applying only the patch 0001 will do.
> 
> Do we need to update the documentation?

Yes, I think we should.  How about as in the attached?

By the way, code changes I made in the attached are such that a subsequent
patch could implement firing statement-level triggers of all the tables in
a partition hierarchy, which it seems we don't want to do.  Should then
the code be changed to not create ResultRelInfos of all the tables but
only the root table (the one mentioned in the command)?  You will see that
the patch adds fields named es_nonleaf_result_relations and
es_num_nonleaf_result_relations, whereas just es_root_result_relation
would perhaps do, for example.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, Apr 26, 2017 at 9:30 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Do we need to update the documentation?
>
> Yes, I think we should.  How about as in the attached?

Looks reasonable, but I was thinking you might also update the section
which contrasts inheritance-based partitioning with declarative
partitioning.

> By the way, code changes I made in the attached are such that a subsequent
> patch could implement firing statement-level triggers of all the tables in
> a partition hierarchy, which it seems we don't want to do.  Should then
> the code be changed to not create ResultRelInfos of all the tables but
> only the root table (the one mentioned in the command)?  You will see that
> the patch adds fields named es_nonleaf_result_relations and
> es_num_nonleaf_result_relations, whereas just es_root_result_relation
> would perhaps do, for example.

It seems better not to create any ResultRelInfos that we don't
actually need, so +1 for such a revision to the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
David Fetter
Date:
On Thu, Apr 27, 2017 at 10:30:54AM +0900, Amit Langote wrote:
> On 2017/04/27 1:52, Robert Haas wrote:
> > On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
> >> In that case, applying only the patch 0001 will do.
> > 
> > Do we need to update the documentation?
> 
> Yes, I think we should.  How about as in the attached?
> 
> By the way, code changes I made in the attached are such that a subsequent
> patch could implement firing statement-level triggers of all the tables in
> a partition hierarchy, which it seems we don't want to do.  Should then
> the code be changed to not create ResultRelInfos of all the tables but
> only the root table (the one mentioned in the command)?  You will see that
> the patch adds fields named es_nonleaf_result_relations and
> es_num_nonleaf_result_relations, whereas just es_root_result_relation
> would perhaps do, for example.

Did I notice correctly that there's no way to handle transition tables
for partitions in AFTER STATEMENT triggers?

If not, I'm not suggesting that this be added at this late date, but
we might want to document that.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Thanks for taking a look.

On 2017/04/28 5:24, Robert Haas wrote:
> On Wed, Apr 26, 2017 at 9:30 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Do we need to update the documentation?
>>
>> Yes, I think we should.  How about as in the attached?
> 
> Looks reasonable, but I was thinking you might also update the section
> which contrasts inheritance-based partitioning with declarative
> partitioning.

It seems to me that there is no difference in behavior between
inheritance-based and declarative partitioning as far as statement-level
triggers are concerned (at least currently).  In both the cases, we fire
statement-level triggers only for the table specified in the command.

Maybe we will fix things someday so that statement-level triggers will be
fired for all the tables in an inheritance hierarchy when the root parent
is updated or deleted, but that's currently not true.  We may never
implement that behavior for declarative partitioned tables though, so
there will be a difference if and when we implement the former.

Am I missing something?

>> By the way, code changes I made in the attached are such that a subsequent
>> patch could implement firing statement-level triggers of all the tables in
>> a partition hierarchy, which it seems we don't want to do.  Should then
>> the code be changed to not create ResultRelInfos of all the tables but
>> only the root table (the one mentioned in the command)?  You will see that
>> the patch adds fields named es_nonleaf_result_relations and
>> es_num_nonleaf_result_relations, whereas just es_root_result_relation
>> would perhaps do, for example.
> 
> It seems better not to create any ResultRelInfos that we don't
> actually need, so +1 for such a revision to the patch.

OK, done.  It took a bit more work than I thought.

Updated patch attached.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/04/28 7:36, David Fetter wrote:
> On Thu, Apr 27, 2017 at 10:30:54AM +0900, Amit Langote wrote:
>> On 2017/04/27 1:52, Robert Haas wrote:
>>> On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote
>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
>>>> In that case, applying only the patch 0001 will do.
>>>
>>> Do we need to update the documentation?
>>
>> Yes, I think we should.  How about as in the attached?
>>
>> By the way, code changes I made in the attached are such that a subsequent
>> patch could implement firing statement-level triggers of all the tables in
>> a partition hierarchy, which it seems we don't want to do.  Should then
>> the code be changed to not create ResultRelInfos of all the tables but
>> only the root table (the one mentioned in the command)?  You will see that
>> the patch adds fields named es_nonleaf_result_relations and
>> es_num_nonleaf_result_relations, whereas just es_root_result_relation
>> would perhaps do, for example.
> 
> Did I notice correctly that there's no way to handle transition tables
> for partitions in AFTER STATEMENT triggers?

Did you mean to ask about AFTER STATEMENT triggers defined on
"partitioned" tables?  Specifying a transition table for them is
disallowed altogether.

ERROR:  "p" is a partitioned table
DETAIL:  Triggers on partitioned tables cannot have transition tables.

Triggers created on (leaf) partitions *do* allow specifying a transition table.
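
For example (a sketch using the pt/pt1 tables from earlier in the thread; report_update is a hypothetical trigger function):

```sql
-- Fails: transition tables are not allowed on a partitioned table
CREATE TRIGGER pt_transition AFTER UPDATE ON pt
    REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE report_update();
-- ERROR:  "pt" is a partitioned table
-- DETAIL:  Triggers on partitioned tables cannot have transition tables.

-- Works: the same trigger on a leaf partition
CREATE TRIGGER pt1_transition AFTER UPDATE ON pt1
    REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE report_update();
```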

Or are you asking something else altogether?

> If not, I'm not suggesting that this be added at this late date, but
> we might want to document that.

I don't see it mentioned in the documentation that such triggers cannot
be defined on partitioned tables.  Is that what you are saying should be
documented?

Thanks,
Amit




Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Fri, Apr 28, 2017 at 2:13 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It seems to me that there is no difference in behavior between
> inheritance-based and declarative partitioning as far as statement-level
> triggers are concerned (at least currently).  In both the cases, we fire
> statement-level triggers only for the table specified in the command.

OK.

>>> By the way, code changes I made in the attached are such that a subsequent
>>> patch could implement firing statement-level triggers of all the tables in
>>> a partition hierarchy, which it seems we don't want to do.  Should then
>>> the code be changed to not create ResultRelInfos of all the tables but
>>> only the root table (the one mentioned in the command)?  You will see that
>>> the patch adds fields named es_nonleaf_result_relations and
>>> es_num_nonleaf_result_relations, whereas just es_root_result_relation
>>> would perhaps do, for example.
>>
>> It seems better not to create any ResultRelInfos that we don't
>> actually need, so +1 for such a revision to the patch.
>
> OK, done.  It took a bit more work than I thought.

So, this seems weird, because rootResultRelIndex is initialized even
when splan->partitioned_rels == NIL, but isn't actually valid in that
case.  ExecInitModifyTable seems to think it's always valid, though.

I think the way that you've refactored fireBSTriggers and
fireASTriggers is a bit confusing.  Instead of splitting out a
separate function, how about just having the existing function begin
with if (node->rootResultRelInfo) resultRelInfo =
node->rootResultRelInfo; else resultRelInfo = node->resultRelInfo; ?
I think the way you've coded it is a holdover from the earlier design
where you were going to call it multiple times, but now that's not
needed.

Looks OK, otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
David Fetter
Date:
On Fri, Apr 28, 2017 at 06:29:48PM +0900, Amit Langote wrote:
> On 2017/04/28 7:36, David Fetter wrote:
> > On Thu, Apr 27, 2017 at 10:30:54AM +0900, Amit Langote wrote:
> >> On 2017/04/27 1:52, Robert Haas wrote:
> >>> On Tue, Apr 25, 2017 at 10:34 PM, Amit Langote
> >>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >>>> FWIW, I too prefer the latter, that is, fire only the parent's triggers.
> >>>> In that case, applying only the patch 0001 will do.
> >>>
> >>> Do we need to update the documentation?
> >>
> >> Yes, I think we should.  How about as in the attached?
> >>
> >> By the way, code changes I made in the attached are such that a subsequent
> >> patch could implement firing statement-level triggers of all the tables in
> >> a partition hierarchy, which it seems we don't want to do.  Should then
> >> the code be changed to not create ResultRelInfos of all the tables but
> >> only the root table (the one mentioned in the command)?  You will see that
> >> the patch adds fields named es_nonleaf_result_relations and
> >> es_num_nonleaf_result_relations, whereas just es_root_result_relation
> >> would perhaps do, for example.
> > 
> > Did I notice correctly that there's no way to handle transition tables
> > for partitions in AFTER STATEMENT triggers?
> 
> Did you mean to ask about AFTER STATEMENT triggers defined on
> "partitioned" tables?  Specifying transition table for them is disallowed
> at all.
> 
> ERROR:  "p" is a partitioned table
> DETAIL:  Triggers on partitioned tables cannot have transition tables.

OK, I suppose.  It wasn't clear from the documentation.

> Triggers created on (leaf) partitions *do* allow specifying transition table.

That includes the upcoming "default" tables, I presume.

> Or are you asking something else altogether?

I was just fuzzy on the interactions among these features.

> > If not, I'm not suggesting that this be added at this late date, but
> > we might want to document that.
> 
> I don't see mentioned in the documentation that such triggers cannot be
> defined on partitioned tables.  Is that what you are saying should be
> documented?

Yes, but I'm biased toward documenting a lot, and this restriction could
go away in some future version, which would make things more confusing
in the long run.  I'm picturing a conversation in 2020 that goes
something like this:

"On 10, you could have AFTER STATEMENT triggers on tables, foreign
tables, and leaf partition tables which referenced transition tables,
but not on DEFAULT partitions.  On 11, you could on DEFAULT partition
tables.  From 12 onward, you can have transition tables on any
relation."

Kevin?  Thomas?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Declarative partitioning - another take

From
Noah Misch
Date:
On Mon, Apr 24, 2017 at 07:43:29PM +0900, Amit Langote wrote:
> On 2017/04/21 17:00, Rajkumar Raghuwanshi wrote:
> > I am able to create statement triggers at root partition, but these
> > triggers, not getting fired on updating partition.

> It would be great if you could check if the patches fix the issues.
> 
> Also, adding this to the PostgreSQL 10 open items list.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
Thanks for the review.

On 2017/04/29 1:25, Robert Haas wrote:
>  On Fri, Apr 28, 2017 at 2:13 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> By the way, code changes I made in the attached are such that a subsequent
>>>> patch could implement firing statement-level triggers of all the tables in
>>>> a partition hierarchy, which it seems we don't want to do.  Should then
>>>> the code be changed to not create ResultRelInfos of all the tables but
>>>> only the root table (the one mentioned in the command)?  You will see that
>>>> the patch adds fields named es_nonleaf_result_relations and
>>>> es_num_nonleaf_result_relations, whereas just es_root_result_relation
>>>> would perhaps do, for example.
>>>
>>> It seems better not to create any ResultRelInfos that we don't
>>> actually need, so +1 for such a revision to the patch.
>>
>> OK, done.  It took a bit more work than I thought.
> 
> So, this seems weird, because rootResultRelIndex is initialized even
> when splan->partitioned_rels == NIL, but isn't actually valid in that
> case.  ExecInitModifyTable seems to think it's always valid, though.

OK, rootResultRelIndex is now set to a value >= 0 only when the node
modifies a partitioned table.  And then ExecInitModifyTable() checks if
the index is valid before initializing the root table info.

> I think the way that you've refactored fireBSTriggers and
> fireASTriggers is a bit confusing.  Instead of splitting out a
> separate function, how about just having the existing function begin
> with if (node->rootResultRelInfo) resultRelInfo =
> node->rootResultRelInfo; else resultRelInfo = node->resultRelInfo; ?
> I think the way you've coded it is a holdover from the earlier design
> where you were going to call it multiple times, but now that's not
> needed.

OK, done that way.

> Looks OK, otherwise.

Attached updated patch.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/04/29 6:56, David Fetter wrote:
> On Fri, Apr 28, 2017 at 06:29:48PM +0900, Amit Langote wrote:
>> On 2017/04/28 7:36, David Fetter wrote:
>>> Did I notice correctly that there's no way to handle transition tables
>>> for partitions in AFTER STATEMENT triggers?
>>
>> Did you mean to ask about AFTER STATEMENT triggers defined on
>> "partitioned" tables?  Specifying transition table for them is disallowed
>> at all.
>>
>> ERROR:  "p" is a partitioned table
>> DETAIL:  Triggers on partitioned tables cannot have transition tables.
> 
> OK, I suppose.  It wasn't clear from the documentation.
> 
>> Triggers created on (leaf) partitions *do* allow specifying transition table.
> 
> That includes the upcoming "default" tables, I presume.

If a "default" table is also a "leaf" table, then yes.  A default
table/partition can itself be a partitioned table, in which case it
won't allow triggers that require transition tables.  AFAICS, it's the
table's being partitioned that stops it from supporting transition
tables, not whether it is a "default" partition.

>> Or are you asking something else altogether?
> 
> I was just fuzzy on the interactions among these features.
> 
>>> If not, I'm not suggesting that this be added at this late date, but
>>> we might want to document that.
>>
>> I don't see mentioned in the documentation that such triggers cannot be
>> defined on partitioned tables.  Is that what you are saying should be
>> documented?
> 
> Yes, but I bias toward documenting a lot, and this restriction could
> go away in some future version, which would make things more confusing
> in the long run.

Yeah, it would be a good idea to document this.

> I'm picturing a conversation in 2020 that goes
> something like this:
> 
> "On 10, you could have AFTER STATEMENT triggers on tables, foreigh
> tables, and leaf partition tables which referenced transition tables,
> but not on DEFAULT partitions.  On 11, you could on DEFAULT partition
> tables.  From 12 onward, you can have transition tables on any
> relation."

What we could document now is that partitioned tables don't allow
specifying triggers that reference transition tables.  Although, I am
wondering where this note really belongs - the partitioning chapter, the
triggers chapter or the CREATE TRIGGER reference page?  Maybe, Kevin and
Thomas have something to say about that.  If it turns out that the
partitioning chapter is a good place, here is a documentation patch.

Thanks,
Amit


Attachment

Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Mon, May 1, 2017 at 12:18 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patch.

Committed, except for this bit:

+    A statement-level trigger defined on partitioned tables is fired only
+    once for the table itself, not once for every table in the partitioning
+    hierarchy.  However, row-level triggers of any affected leaf partitions
+    will be fired.

The first sentence there has a number of issues.  Grammatically, there
is an agreement problem: trigger is singular, but partitioned table is
plural, and one trigger isn't defined across multiple tables.  It
would have to say something like "A statement-level trigger defined on
a partitioned table".  But even with that correction, it's not really
saying what you want it to say.  Nobody would expect that the same
statement-level trigger would be fired multiple times.  The issue is
whether statement-level triggers on the partitions themselves will be
fired, not the statement-level trigger on the partitioned table.
Also, if this applies to inheritance as well as partitioning, then why
mention only partitioning?  I think we might want to document
something more here, but not like this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/05/01 21:30, Robert Haas wrote:
> On Mon, May 1, 2017 at 12:18 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated patch.
> 
> Committed, except for this bit:

Thanks.

> +    A statement-level trigger defined on partitioned tables is fired only
> +    once for the table itself, not once for every table in the partitioning
> +    hierarchy.  However, row-level triggers of any affected leaf partitions
> +    will be fired.
> 
> The first sentence there has a number of issues.  Grammatically, there
> is an agreement problem: trigger is singular, but partitioned table is
> plural, and one trigger isn't defined across multiple tables.  It
> would have to say something like "A statement-level trigger defined on
> a partitioned table".  But even with that correction, it's not really
> saying what you want it to say.  Nobody would expect that the same
> statement-level trigger would be fired multiple times.  The issue is
> whether statement-level triggers on the partitions themselves will be
> fired, not the statement-level trigger on the partitioned table.
> Also, if this applies to inheritance as well as partitioning, then why
> mention only partitioning?  I think we might want to document
> something more here, but not like this.

You're right.  I agree that whatever text we add here should be pointing
out that statement-level triggers of affected child tables are not fired
when the root parent is specified in the command.

Since there was at least some talk of changing that behavior for regular
inheritance so that statement triggers of any affected children are fired
[1], I thought we shouldn't say something general that applies to both
inheritance and partitioning.  But since nothing has happened in that
regard, we might as well.

How about the attached?

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/cd282adde5b70b20c57f53bb9ab75e27@biglumber.com


Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, May 2, 2017 at 3:30 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> You're right.  I agree that whatever text we add here should be pointing
> out that statement-level triggers of affected child tables are not fired,
> when root parent is specified in the command.
>
> Since there was least some talk of changing that behavior for regular
> inheritance so that statement triggers of any affected children are fired
> [1], I thought we shouldn't say something general that applies to both
> inheritance and partitioning.  But since nothing has happened in that
> regard, we might as well.
>
> How about the attached?

Looks better, but I think we should say "statement" instead of
"operation" for consistency with the previous paragraph, and it
certainly shouldn't be capitalized.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/05/03 2:48, Robert Haas wrote:
> On Tue, May 2, 2017 at 3:30 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> You're right.  I agree that whatever text we add here should be pointing
>> out that statement-level triggers of affected child tables are not fired,
>> when root parent is specified in the command.
>>
>> Since there was least some talk of changing that behavior for regular
>> inheritance so that statement triggers of any affected children are fired
>> [1], I thought we shouldn't say something general that applies to both
>> inheritance and partitioning.  But since nothing has happened in that
>> regard, we might as well.
>>
>> How about the attached?
> 
> Looks better, but I think we should say "statement" instead of
> "operation" for consistency with the previous paragraph, and it
> certainly shouldn't be capitalized.

Agreed, done.  Attached updated patch.

Thanks,
Amit


Re: [HACKERS] Declarative partitioning - another take

From
Thomas Munro
Date:
On Mon, May 8, 2017 at 12:47 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/05/03 2:48, Robert Haas wrote:
>> On Tue, May 2, 2017 at 3:30 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> You're right.  I agree that whatever text we add here should be pointing
>>> out that statement-level triggers of affected child tables are not fired,
>>> when root parent is specified in the command.
>>>
>>> Since there was least some talk of changing that behavior for regular
>>> inheritance so that statement triggers of any affected children are fired
>>> [1], I thought we shouldn't say something general that applies to both
>>> inheritance and partitioning.  But since nothing has happened in that
>>> regard, we might as well.
>>>
>>> How about the attached?
>>
>> Looks better, but I think we should say "statement" instead of
>> "operation" for consistency with the previous paragraph, and it
>> certainly shouldn't be capitalized.
>
> Agreed, done.  Attached updated patch.
   <para>
+    A statement that targets the root table in a inheritance or partitioning
+    hierarchy does not cause the statement-level triggers of affected child
+    tables to be fired; only the root table's statement-level triggers are
+    fired.  However, row-level triggers of any affected child tables will be
+    fired.
+   </para>
+
+   <para>

Why talk specifically about the "root" table?  Wouldn't we describe
the situation more generally if we said [a,the] "parent"?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/05/08 10:22, Thomas Munro wrote:
> On Mon, May 8, 2017 at 12:47 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/05/03 2:48, Robert Haas wrote:
>>> On Tue, May 2, 2017 at 3:30 AM, Amit Langote
>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> You're right.  I agree that whatever text we add here should be pointing
>>>> out that statement-level triggers of affected child tables are not fired,
>>>> when root parent is specified in the command.
>>>>
>>>> Since there was least some talk of changing that behavior for regular
>>>> inheritance so that statement triggers of any affected children are fired
>>>> [1], I thought we shouldn't say something general that applies to both
>>>> inheritance and partitioning.  But since nothing has happened in that
>>>> regard, we might as well.
>>>>
>>>> How about the attached?
>>>
>>> Looks better, but I think we should say "statement" instead of
>>> "operation" for consistency with the previous paragraph, and it
>>> certainly shouldn't be capitalized.
>>
>> Agreed, done.  Attached updated patch.
> 
>     <para>
> +    A statement that targets the root table in a inheritance or partitioning
> +    hierarchy does not cause the statement-level triggers of affected child
> +    tables to be fired; only the root table's statement-level triggers are
> +    fired.  However, row-level triggers of any affected child tables will be
> +    fired.
> +   </para>
> +
> +   <para>
> 
> Why talk specifically about the "root" table?  Wouldn't we describe
> the situation more generally if we said [a,the] "parent"?

I think that makes sense.  Modified it to read: "A statement that targets
a parent table in a inheritance or partitioning hierarchy..." in the
attached updated patch.

Thanks,
Amit


Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Sun, May 7, 2017 at 9:44 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I think that makes sense.  Modified it to read: "A statement that targets
> a parent table in a inheritance or partitioning hierarchy..." in the
> attached updated patch.

LGTM.  Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Thomas Munro
Date:
On Wed, May 10, 2017 at 3:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, May 7, 2017 at 9:44 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I think that makes sense.  Modified it to read: "A statement that targets
>> a parent table in a inheritance or partitioning hierarchy..." in the
>> attached updated patch.
>
> LGTM.  Committed.

+    A statement that targets a parent table in a inheritance or partitioning

A tiny typo: s/a inheritance/an inheritance/

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Tue, May 9, 2017 at 11:54 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> +    A statement that targets a parent table in a inheritance or partitioning
>
> A tiny typo: s/a inheritance/an inheritance/

Now he tells me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Amit Langote
Date:
On 2017/05/10 12:59, Robert Haas wrote:
> On Tue, May 9, 2017 at 11:54 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> +    A statement that targets a parent table in a inheritance or partitioning
>>
>> A tiny typo: s/a inheritance/an inheritance/
> 
> Now he tells me.

Thanks both.

Regards,
Amit




Re: [HACKERS] Declarative partitioning - another take

From
Etsuro Fujita
Date:
On 2016/11/18 1:43, Robert Haas wrote:
> On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:

>>> - The code in make_modifytable() to swap out the rte_array for a fake
>>> one looks like an unacceptable kludge.  I don't know offhand what a
>>> better design would look like, but what you've got is really ugly.

>> Agree that it looks horrible.  The problem is we don't add partition
>> (child table) RTEs when planning an insert on the parent and FDW
>> partitions can't do without some planner handling - planForeignModify()
>> expects a valid PlannerInfo for deparsing target lists (basically, to be
>> able to use planner_rt_fetch()).

> If it's only needed for foreign tables, how about for v1 we just throw
> an error and say errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> errmsg("cannot route inserted tuples to a foreign table") for now.  We
> can come back and fix it later.  Doing more inheritance expansion

> Coming up with some new FDW API or some modification
> to the existing one is probably better, but I don't really want to get
> hung up on that right now.

I started working on this.  I agree that the changes made in 
make_modifytable would be unacceptable, but I'd vote for Amit's idea of 
passing a modified PlannerInfo to PlanForeignModify so that the FDW can 
do query planning for INSERT into a foreign partition in the same way as 
for INSERT into a non-partition foreign table.  (Though, I think we 
should generate a more-valid-looking working-copy of the PlannerInfo 
which has Query with the foreign partition as target.)  I'm not sure 
it's a good idea to add a new FDW API or modify the existing one such as 
PlanForeignModify for this purpose.

Best regards,
Etsuro Fujita




Re: [HACKERS] Declarative partitioning - another take

From
Robert Haas
Date:
On Wed, May 10, 2017 at 6:50 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
> I started working on this.  I agree that the changes made in
> make_modifytable would be unacceptable, but I'd vote for Amit's idea of
> passing a modified PlannerInfo to PlanForeignModify so that the FDW can do
> query planning for INSERT into a foreign partition in the same way as for
> INSERT into a non-partition foreign table.  (Though, I think we should
> generate a more-valid-looking working-copy of the PlannerInfo which has
> Query with the foreign partition as target.)  I'm not sure it's a good idea
> to add a new FDW API or modify the existing one such as PlanForeignModify
> for this purpose.

Thanks for working on this, but I think it would be better to start a
new thread for this discussion.

And probably also for any other issues that come up.  This thread has
gotten so long that, between the original discussion, the commit of the
patch, and discussion of multiple follow-on commits and patches,
it's very hard to follow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Declarative partitioning - another take

From
Etsuro Fujita
Date:
Hi Maksim,

On 2017/04/07 19:52, Maksim Milyutin wrote:
> On 07.04.2017 13:05, Etsuro Fujita wrote:

>>> On 2016/12/09 19:46, Maksim Milyutin wrote:
>>>> I would like to work on two tasks:
>>>>  - insert (and eventually update) tuple routing for foreign partition.
>>>>  - the ability to create an index on the parent and have all of the
>>>> children inherit it;

>> There seems to be no work on the first one, so I'd like to work on that.

> Yes, you can start to work on this, I'll join later as a reviewer.

Great!  I added the patch to the next commitfest:

https://commitfest.postgresql.org/14/1184/

Sorry for the delay.

Best regards,
Etsuro Fujita