Thread: [HACKERS] Partitioned tables and relfilenode

[HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
The new partitioned tables do not contain any data by themselves.  Any
data inserted into a partitioned table is routed to and stored in one of
its partitions.  In fact, it is impossible to insert *any* data before a
partition (to be precise, a leaf partition) is created.  It seems wasteful
then to allocate physical storage (files) for partitioned tables.  If we
do not allocate the storage, then we must make sure that the right thing
happens when a command that is intended to manipulate a table's storage
encounters a partitioned table, the "right thing" here being that the
command's code either throws an error or warning (in some cases) if the
specified table is a partitioned table or ignores any partitioned tables
when it reads the list of relations to process from pg_class.  Commands
that need to be taught about this are vacuum, analyze, truncate, and alter
table.  Specifically:

- In case of vacuum, specifying a partitioned table causes a warning

- In case of analyze, we do not throw an error or warning but simply
  avoid calling do_analyze_rel() *non-recursively*.  Further in
  acquire_inherited_sample_rows(), any partitioned tables in the list
  returned by find_all_inheritors() are skipped.

- In case of truncate, only the part which manipulates table's physical
  storage is skipped for partitioned tables.

- ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
  tables, because there is nothing to be done.

- Since we cannot create indexes on partitioned tables anyway, there is
  no need to handle cluster and reindex (they throw a meaningful error
  already due to the lack of indexes.)

Patches 0001 and 0002 of the attached implement the above part.  0001
teaches the above mentioned commands to do the "right thing" as described
above and 0002 teaches heap_create() and heap_create_with_catalog() to not
create any physical storage (none of the forks) for partitioned tables.

Then comes 0003, which concerns inheritance planning.  In a regular
inheritance set (ie, the inheritance set corresponding to an inheritance
hierarchy whose root is a regular table), the inheritance parents are
included in their role as child members, because they might contribute
rows to the final result.  So AppendRelInfo's are created for each such
table by the planner prep phase, which the later planning steps use to
create a scan plan for those tables as the Append's child plans.
Currently, the partitioned tables are also processed by the optimizer as
inheritance sets.  Partitioned table inheritance parents do not own any
storage, so we *must* not create scan plans for them.  So we do not need
to process them as child members of the inheritance set.  0003 teaches
expand_inherited_rtentry() to not add partitioned tables as child members.
 Also, since the root partitioned table RTE is no longer added to the
Append list as the 1st child member, inheritance_planner() cannot assume
that it can install the 1st child RTE as the nominalRelation of a given
ModifyTable node, instead the original root parent table RTE is installed
as the nominalRelation.


Together the above patches implement the first item listed in "Some
notes:" part of an email [1] on the original declarative partitioning
thread, which says:

"We should try to teach the executor never to scan the parent. That's
never necessary with this system, and it might add significant overhead.
We should also try to get rid of the idea of the parent having storage
(i.e. a relfilenode)."

Thoughts, comments?

Thanks,
Amit

[1]
https://postgr.es/m/CA%2BTgmob6DJEQBd%3DqH2wZG1onYZ9R1Sji3q%2BGqCRUqQi%2Bug28Rw%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Simon Riggs
Date:
On 10 February 2017 at 06:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The new partitioned tables do not contain any data by themselves.  Any
> data inserted into a partitioned table is routed to and stored in one of
> its partitions.  In fact, it is impossible to insert *any* data before a
> partition (to be precise, a leaf partition) is created.

Where is all of this documented? All I see at the moment is old docs.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/10 15:58, Simon Riggs wrote:
> On 10 February 2017 at 06:19, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The new partitioned tables do not contain any data by themselves.  Any
>> data inserted into a partitioned table is routed to and stored in one of
>> its partitions.  In fact, it is impossible to insert *any* data before a
>> partition (to be precise, a leaf partition) is created.
> 
> Where is all of this documented? All I see at the moment is old docs.

Last week I sent documentation patches in a separate message titled
"Documentation improvements for partitioning" [1].  One of the patches
adds a separate section in the DDL chapter describing partitioned tables.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/bf9ebbb3-fb1e-3206-b67c-e7a803a747d9%40lab.ntt.co.jp





Re: [HACKERS] Partitioned tables and relfilenode

From
Michael Paquier
Date:
On Fri, Feb 10, 2017 at 3:19 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The new partitioned tables do not contain any data by themselves.  Any
> data inserted into a partitioned table is routed to and stored in one of
> its partitions.  In fact, it is impossible to insert *any* data before a
> partition (to be precise, a leaf partition) is created.  It seems wasteful
> then to allocate physical storage (files) for partitioned tables.  If we
> do not allocate the storage, then we must make sure that the right thing
> happens when a command that is intended to manipulate a table's storage
> encounters a partitioned table, the "right thing" here being that the
> command's code either throws an error or warning (in some cases) if the
> specified table is a partitioned table or ignores any partitioned tables
> when it reads the list of relations to process from pg_class.  Commands
> that need to be taught about this are vacuum, analyze, truncate, and alter
> table.  Specifically:

Thanks. I have been looking a bit at this set of patches.

> - In case of vacuum, specifying a partitioned table causes a warning
> - In case of analyze, we do not throw an error or warning but simply
>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>   acquire_inherited_sample_rows(), any partitioned tables in the list
>   returned by find_all_inheritors() are skipped.
> - In case of truncate, only the part which manipulates table's physical
>   storage is skipped for partitioned tables.

I am wondering if instead those should not just be errors saying that
such operations are just not support. This could be relaxed in the
future to mean that a vacuum, truncate or analyze on the partitioned
table triggers an operator in cascade.

> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>   tables, because there is nothing to be done.

Perhaps that makes sense, foreign tables do that.

> - Since we cannot create indexes on partitioned tables anyway, there is
>   no need to handle cluster and reindex (they throw a meaningful error
>   already due to the lack of indexes.)

Yep.

> Patches 0001 and 0002 of the attached implement the above part.  0001
> teaches the above mentioned commands to do the "right thing" as described
> above and 0002 teaches heap_create() and heap_create_with_catalog() to not
> create any physical storage (none of the forks) for partitioned tables.

-   else
+   /*
+    * Although we cannot analyze partitioned tables themselves, we are
+    * still be able to do the recursive ANALYZE.
+    */
+   else if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)   {       /* No need for a WARNING if we already
complainedduring VACUUM */       if (!(options & VACOPT_VACUUM))
 
It seems to me that it is better style to use an empty else if with
only the comment. Comments related to a condition that are outside
this condition should be conditional in their formulations. Comments
within a condition can be affirmations when they refer to a decided
state of things.

From patch 2:
@@ -1351,7 +1352,12 @@ heap_create_with_catalog(const char *relname,              relkind == RELKIND_TOASTVALUE ||
       relkind == RELKIND_PARTITIONED_TABLE);
 

-       heap_create_init_fork(new_rel_desc);
+       /*
+        * We do not want to create any storage objects for a partitioned
+        * table.
+        */
+       if (relkind != RELKIND_PARTITIONED_TABLE)
+           heap_create_init_fork(new_rel_desc);   }
This does not make much sense with an assertion telling exactly the
contrary a couple of lines before. I think that it would make more
sense to move the assertion on relkind directly in
heap_create_init_fork().

> Then comes 0003, which concerns inheritance planning.  In a regular
> inheritance set (ie, the inheritance set corresponding to an inheritance
> hierarchy whose root is a regular table), the inheritance parents are
> included in their role as child members, because they might contribute
> rows to the final result.  So AppendRelInfo's are created for each such
> table by the planner prep phase, which the later planning steps use to
> create a scan plan for those tables as the Append's child plans.
> Currently, the partitioned tables are also processed by the optimizer as
> inheritance sets.  Partitioned table inheritance parents do not own any
> storage, so we *must* not create scan plans for them.  So we do not need
> to process them as child members of the inheritance set.  0003 teaches
> expand_inherited_rtentry() to not add partitioned tables as child members.
>  Also, since the root partitioned table RTE is no longer added to the
> Append list as the 1st child member, inheritance_planner() cannot assume
> that it can install the 1st child RTE as the nominalRelation of a given
> ModifyTable node, instead the original root parent table RTE is installed
> as the nominalRelation.

This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE
to avoid interactions with partition tables. Did you consider doing
something in the executor instead?
-- 
Michael



Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Wed, Feb 15, 2017 at 2:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE
> to avoid interactions with partition tables. Did you consider doing
> something in the executor instead?

That seems inferior, because the planner really ought to know that the
partitioning root doesn't need to be scanned.  That way, it can do a
better job getting the cost and selectivity estimates right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Simon Riggs
Date:
On 10 February 2017 at 06:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

>  the "right thing" here being that the
> command's code either throws an error or warning (in some cases) if the
> specified table is a partitioned table or ignores any partitioned tables
> when it reads the list of relations to process from pg_class.

This is a massive assumption and deserves major discussion.

My expectation is that "partitioned tables" are "tables". Anything
else seems to fly in the face of both the SQL Standard and the POLA
principle for users coming from other database systems.

IMHO all the main actions should all "just work" not throw errors.

> Commands
> that need to be taught about this are vacuum, analyze, truncate, and alter
> table.  Specifically:
>
> - In case of vacuum, specifying a partitioned table causes a warning

Why not vacuum all partitions?

> - In case of analyze, we do not throw an error or warning but simply
>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>   acquire_inherited_sample_rows(), any partitioned tables in the list
>   returned by find_all_inheritors() are skipped.

Why not analyze all partitions?

> - In case of truncate, only the part which manipulates table's physical
>   storage is skipped for partitioned tables.

Truncate all partitions

> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>   tables, because there is nothing to be done.
>
> - Since we cannot create indexes on partitioned tables anyway, there is
>   no need to handle cluster and reindex (they throw a meaningful error
>   already due to the lack of indexes.)

Create index on all partitions

> Thoughts, comments?

We already have Inheritance. Anybody that likes that behaviour can use it.

Most people don't like that behaviour and wish to see change. They
want the same behaviour as they get from other database products where
a partitioned table is a table and you can do stuff to it just like
other tables. We have the opportunity to change things here and we
should do so.

(It also seems like wasted effort to try to remove the overhead caused
by a parent table for partitioning. Why introduce a special case to
save a few bytes? Premature optimization, surely?)

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Thu, Feb 16, 2017 at 6:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Why not vacuum all partitions?
> Why not analyze all partitions?
> Truncate all partitions

I agree.  But, we need to be careful that a database-wide VACUUM or
ANALYZE doesn't hit the partitions multiple times, once for the parent
and again for each child.  Actually, a database-wide VACUUM should hit
each partition individually and do nothing for the parents, but a
database-wide ANALYZE should process the parents and do nothing for
the children, so that the inheritance statistics get updated.

>> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>>   tables, because there is nothing to be done.
>>
>> - Since we cannot create indexes on partitioned tables anyway, there is
>>   no need to handle cluster and reindex (they throw a meaningful error
>>   already due to the lack of indexes.)
>
> Create index on all partitions

That one's more complicated, per what I wrote in
https://www.postgresql.org/message-id/CA+TgmoZUwj=QYnaK+F7xEf4w_e2g3XxdMnSNZMZjuinHRcOB8A@mail.gmail.com

> (It also seems like wasted effort to try to remove the overhead caused
> by a parent table for partitioning. Why introduce a special case to
> save a few bytes? Premature optimization, surely?)

I don't think it's wasted effort, actually.  My concern isn't so much
the empty file on disk (which is stupid, but maybe harmless) as
eliminating the dummy scan from the query plan.  I believe the
do-nothing scan can actually be a noticeable drag on performance in
some cases - e.g. if the scan of the partitioned table is on the
inside of a nested loop, so that instead of repeatedly doing an index
scan on each of 4 partitions, you repeatedly do an index scan on each
of 4 partitions and a sequential scan of an empty table.  A zero-page
sequential scan is pretty fast, but not free.  An even bigger problem
is that the planner may think that always-empty parent can contain
some rows, throwing planner estimates off and messing up the whole
plan.  We've been living with that problem for a long time, but now
that we have an opportunity to fix it, it would be good to do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/15 16:14, Michael Paquier wrote:
> On Fri, Feb 10, 2017 at 3:19 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The new partitioned tables do not contain any data by themselves.  Any
>> data inserted into a partitioned table is routed to and stored in one of
>> its partitions.  In fact, it is impossible to insert *any* data before a
>> partition (to be precise, a leaf partition) is created.  It seems wasteful
>> then to allocate physical storage (files) for partitioned tables.  If we
>> do not allocate the storage, then we must make sure that the right thing
>> happens when a command that is intended to manipulate a table's storage
>> encounters a partitioned table, the "right thing" here being that the
>> command's code either throws an error or warning (in some cases) if the
>> specified table is a partitioned table or ignores any partitioned tables
>> when it reads the list of relations to process from pg_class.  Commands
>> that need to be taught about this are vacuum, analyze, truncate, and alter
>> table.  Specifically:
> 
> Thanks. I have been looking a bit at this set of patches.

Thanks for reviewing!

>> - In case of vacuum, specifying a partitioned table causes a warning
>> - In case of analyze, we do not throw an error or warning but simply
>>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>>   acquire_inherited_sample_rows(), any partitioned tables in the list
>>   returned by find_all_inheritors() are skipped.
>> - In case of truncate, only the part which manipulates table's physical
>>   storage is skipped for partitioned tables.
> 
> I am wondering if instead those should not just be errors saying that
> such operations are just not support. This could be relaxed in the
> future to mean that a vacuum, truncate or analyze on the partitioned
> table triggers an operator in cascade.

As the discussion down-thread suggests, that might be a path worth
considering.  That is, vacuum or analyze on a partitioned table causes all
the (leaf) partitions to be vacuumed or analyzed.

Truncate already does that (that is, recurse to leaf partitions).  This
patch simply prevents ExecuteTruncate() from trying to manipulate the
partitioned table's relfilenode.

>> - Since we cannot create indexes on partitioned tables anyway, there is
>>   no need to handle cluster and reindex (they throw a meaningful error
>>   already due to the lack of indexes.)
> 
> Yep.

This one is still under discussion.

>> Patches 0001 and 0002 of the attached implement the above part.  0001
>> teaches the above mentioned commands to do the "right thing" as described
>> above and 0002 teaches heap_create() and heap_create_with_catalog() to not
>> create any physical storage (none of the forks) for partitioned tables.
> 
> -   else
> +   /*
> +    * Although we cannot analyze partitioned tables themselves, we are
> +    * still be able to do the recursive ANALYZE.
> +    */
> +   else if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
>     {
>         /* No need for a WARNING if we already complained during VACUUM */
>         if (!(options & VACOPT_VACUUM))
> It seems to me that it is better style to use an empty else if with
> only the comment. Comments related to a condition that are outside
> this condition should be conditional in their formulations. Comments
> within a condition can be affirmations when they refer to a decided
> state of things.

Not sure if this will survive based on the decisions on what to do about
vacuum and analyze, but how about the following rewrite of the above:

else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{   /*    * Although we cannot analyze partitioned tables themselves, we are    * still be able to do the recursive
ANALYZE,which we do below.    */
 
}
else
{

>>From patch 2:
> @@ -1351,7 +1352,12 @@ heap_create_with_catalog(const char *relname,
>                relkind == RELKIND_TOASTVALUE ||
>                relkind == RELKIND_PARTITIONED_TABLE);
> 
> -       heap_create_init_fork(new_rel_desc);
> +       /*
> +        * We do not want to create any storage objects for a partitioned
> +        * table.
> +        */
> +       if (relkind != RELKIND_PARTITIONED_TABLE)
> +           heap_create_init_fork(new_rel_desc);
>     }
> This does not make much sense with an assertion telling exactly the
> contrary a couple of lines before. I think that it would make more
> sense to move the assertion on relkind directly in
> heap_create_init_fork().

Hmm, perhaps the below is saner, with the Assert moved to the start of
heap_create_init_fork() as shown below:

/** We do not want to create any storage objects for a partitioned* table, including the init fork.*/
if (relpersistence == RELPERSISTENCE_UNLOGGED &&   relkind != RELKIND_PARTITIONED_TABLE)
heap_create_init_fork(new_rel_desc);


@@ -1382,6 +1376,8 @@ heap_create_with_catalog(const char *relname,voidheap_create_init_fork(Relation rel){
+    Assert(relkind == RELKIND_RELATION || relkind == RELKIND_MATVIEW ||
+           relkind == RELKIND_TOASTVALUE);

>> Then comes 0003, which concerns inheritance planning.  In a regular
>> inheritance set (ie, the inheritance set corresponding to an inheritance
>> hierarchy whose root is a regular table), the inheritance parents are
>> included in their role as child members, because they might contribute
>> rows to the final result.  So AppendRelInfo's are created for each such
>> table by the planner prep phase, which the later planning steps use to
>> create a scan plan for those tables as the Append's child plans.
>> Currently, the partitioned tables are also processed by the optimizer as
>> inheritance sets.  Partitioned table inheritance parents do not own any
>> storage, so we *must* not create scan plans for them.  So we do not need
>> to process them as child members of the inheritance set.  0003 teaches
>> expand_inherited_rtentry() to not add partitioned tables as child members.
>>  Also, since the root partitioned table RTE is no longer added to the
>> Append list as the 1st child member, inheritance_planner() cannot assume
>> that it can install the 1st child RTE as the nominalRelation of a given
>> ModifyTable node, instead the original root parent table RTE is installed
>> as the nominalRelation.
> 
> This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE
> to avoid interactions with partition tables. Did you consider doing
> something in the executor instead?

Getting rid of the parent relation (of which there could be many if a
given partition tree is multi-level) well within the optimizer means that
the optimizer makes a plan that does not have redundant Scan nodes for
partitioned tables (and if the 2nd patch is worthy of consideration, we
absolutely cannot create Scan nodes for them at all).  And Robert's point
about optimizer's cost and selectivity estimations.

I will post the updated patches after addressing other comments in the thread.

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/16 23:40, Robert Haas wrote:
> On Thu, Feb 16, 2017 at 6:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Why not vacuum all partitions?
>> Why not analyze all partitions?
>> Truncate all partitions
> 
> I agree.  But, we need to be careful that a database-wide VACUUM or
> ANALYZE doesn't hit the partitions multiple times, once for the parent
> and again for each child.  Actually, a database-wide VACUUM should hit
> each partition individually and do nothing for the parents, but a

This is what would happen even without the patch.  Patch only modifies
what happens when a partitioned table is specified in the vacuum command.
It emits a warning:

WARNING: skipping "%s" --- cannot vacuum partitioned tables

It seems both you and Simon agree that instead of this warning, we should
recurse to process the leaf partitions (ignoring any partitioned tables in
the hierarchy for which there is nothing to do).  If that's right, I will
modify the patch to do that.

> database-wide ANALYZE should process the parents and do nothing for
> the children, so that the inheritance statistics get updated.

Currently vacuum() processes the following relations:
       /*        * Process all plain relations and materialized views listed in        * pg_class        */
       while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)       {           Form_pg_class classForm =
(Form_pg_class)GETSTRUCT(tuple);
 
           if (classForm->relkind != RELKIND_RELATION &&               classForm->relkind != RELKIND_MATVIEW)
   continue;
 

Do you mean that if database-wide analyze is to be run, we should also
exclude those RELKIND_RELATION relations that are partitions?

So the only way to update a partition's statistics is to directly specify
it in the command or by autovacuum.

Truncate already recurses to partitions by way of inheritance recursion
that's already in place.  The patch simply teaches ExecuteTruncate() to
ignore partitioned tables when we get to the part where relfilenodes are
manipulated.

>>> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>>>   tables, because there is nothing to be done.
>>>
>>> - Since we cannot create indexes on partitioned tables anyway, there is
>>>   no need to handle cluster and reindex (they throw a meaningful error
>>>   already due to the lack of indexes.)
>>
>> Create index on all partitions
> 
> That one's more complicated, per what I wrote in
> https://www.postgresql.org/message-id/CA+TgmoZUwj=QYnaK+F7xEf4w_e2g3XxdMnSNZMZjuinHRcOB8A@mail.gmail.com

I agree that it would be nice to fix this usability issue.  But, it's also
important as you say in the quoted email that we should not underestimate
the amount of work that might be required to properly implement it.

>> (It also seems like wasted effort to try to remove the overhead caused
>> by a parent table for partitioning. Why introduce a special case to
>> save a few bytes? Premature optimization, surely?)
> 
> I don't think it's wasted effort, actually.  My concern isn't so much
> the empty file on disk (which is stupid, but maybe harmless) as
> eliminating the dummy scan from the query plan.  I believe the
> do-nothing scan can actually be a noticeable drag on performance in
> some cases - e.g. if the scan of the partitioned table is on the
> inside of a nested loop, so that instead of repeatedly doing an index
> scan on each of 4 partitions, you repeatedly do an index scan on each
> of 4 partitions and a sequential scan of an empty table.  A zero-page
> sequential scan is pretty fast, but not free.  An even bigger problem
> is that the planner may think that always-empty parent can contain
> some rows, throwing planner estimates off and messing up the whole
> plan.  We've been living with that problem for a long time, but now
> that we have an opportunity to fix it, it would be good to do so.

Agreed.  I would have reconsidered if it were a more invasive patch.

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Fri, Feb 17, 2017 at 1:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I agree.  But, we need to be careful that a database-wide VACUUM or
>> ANALYZE doesn't hit the partitions multiple times, once for the parent
>> and again for each child.  Actually, a database-wide VACUUM should hit
>> each partition individually and do nothing for the parents, but a
>
> This is what would happen even without the patch.  Patch only modifies
> what happens when a partitioned table is specified in the vacuum command.
> It emits a warning:
>
> WARNING: skipping "%s" --- cannot vacuum partitioned tables
>
> It seems both you and Simon agree that instead of this warning, we should
> recurse to process the leaf partitions (ignoring any partitioned tables in
> the hierarchy for which there is nothing to do).  If that's right, I will
> modify the patch to do that.

Yeah, that sounds fine.

>> database-wide ANALYZE should process the parents and do nothing for
>> the children, so that the inheritance statistics get updated.
>
> Currently vacuum() processes the following relations:
>
>         /*
>          * Process all plain relations and materialized views listed in
>          * pg_class
>          */
>
>         while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
>         {
>             Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
>
>             if (classForm->relkind != RELKIND_RELATION &&
>                 classForm->relkind != RELKIND_MATVIEW)
>                 continue;
>
> Do you mean that if database-wide analyze is to be run, we should also
> exclude those RELKIND_RELATION relations that are partitions?
>
> So the only way to update a partition's statistics is to directly specify
> it in the command or by autovacuum.

I think if you type:

ANALYZE;

...that should process all partitioned tables and all tables that are
not themselves partitions.  If you type:

ANALYZE name;

...that should ANALYZE that relation, whatever it is.  If it's a
partitioned table, it should recurse.

> Truncate already recurses to partitions by way of inheritance recursion
> that's already in place.  The patch simply teaches ExecuteTruncate() to
> ignore partitioned tables when we get to the part where relfilenodes are
> manipulated.

Oh, OK.  That seems fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Simon Riggs
Date:
On 16 February 2017 at 11:32, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 10 February 2017 at 06:19, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>>  the "right thing" here being that the
>> command's code either throws an error or warning (in some cases) if the
>> specified table is a partitioned table or ignores any partitioned tables
>> when it reads the list of relations to process from pg_class.
>
> This is a massive assumption and deserves major discussion.
>
> My expectation is that "partitioned tables" are "tables". Anything
> else seems to fly in the face of both the SQL Standard and the POLA
> principle for users coming from other database systems.
>
> IMHO all the main actions should all "just work" not throw errors.

This included DROP TABLE, which I commented on before.

CASCADE should not be required.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/20 5:31, Simon Riggs wrote:
> On 16 February 2017 at 11:32, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 10 February 2017 at 06:19, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>>>  the "right thing" here being that the
>>> command's code either throws an error or warning (in some cases) if the
>>> specified table is a partitioned table or ignores any partitioned tables
>>> when it reads the list of relations to process from pg_class.
>>
>> This is a massive assumption and deserves major discussion.
>>
>> My expectation is that "partitioned tables" are "tables". Anything
>> else seems to fly in the face of both the SQL Standard and the POLA
>> principle for users coming from other database systems.
>>
>> IMHO all the main actions should all "just work" not throw errors.
> 
> This included DROP TABLE, which I commented on before.
> 
> CASCADE should not be required.

Yeah, it seemed like that is the consensus so I posted a patch [0], which
re-posted in a new thread titled "dropping partitioned tables without
CASCADE" [1].

Thanks,
Amit

[0] https://postgr.es/m/ca132b99-0d18-439a-fe65-024085449259%40lab.ntt.co.jp
[1] https://postgr.es/m/6c420206-45d7-3f56-8325-4bd7b76483ba%40lab.ntt.co.jp





Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/19 18:53, Robert Haas wrote:
> On Fri, Feb 17, 2017 at 1:12 PM, Amit Langote wrote:
>> Do you mean that if database-wide analyze is to be run, we should also
>> exclude those RELKIND_RELATION relations that are partitions?
>>
>> So the only way to update a partition's statistics is to directly specify
>> it in the command or by autovacuum.
> 
> I think if you type:
> 
> ANALYZE;
> 
> ...that should process all partitioned tables and all tables that are
> not themselves partitions.

OK.

> If you type:
> 
> ANALYZE name;
> 
> ...that should ANALYZE that relation, whatever it is.  If it's a
> partitioned table, it should recurse.

To be clear, by "recurse" I assume you mean to perform ANALYZE on
individual partitions, not just collect the inheritance statistics.  So
ANALYZE partitioned_table would both a) collect the inheritance statistics
for the specified table and other partitioned tables in the hierarchy, b)
ANALYZE every leaf partitions updating their statistics in pg_class.

While working on this, I noticed that autovacuum.c does not collect
RELKIND_PARTITIONED_TABLE relations, which I think is not right.  It
should match what get_rel_oids() does, which in the database-wide
VACUUM/ANALYZE case collects them.

Attached updated patches:

Updated 0001 addresses the following comments:

 - should recurse when vacuum/analyze is performed on a partitioned table
 - database-wide vacuum should ignore partitioned tables
 - database-wide analyze should ignore partitions; only the inheritance
   statistics of the partitioned tables must be collected in this case

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
Some comments about 0003 patch.

@@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
    Index       rti;
+   RangeTblEntry *parent_rte;
There's already another variable declared in that function within a loop
    foreach(lc, root->append_rel_list)
    {
        ...
        RangeTblEntry *parent_rte;
        RangeTblEntry *child_rte;

You might want to choose a different name or delete the one within the loop.

I am wondering whether we should deal with inh flat reset in a
slightly different way. Let expand_inherited_rtentry() mark inh =
false for the partitioned tables without any partitions and deal with
those at the time of estimating size by marking those as dummy. That
might be better than multiple changes here. I will try to cook up a
patch soon for that.

Also we should add tests to make sure the scans on partitioned tables
without any partitions do not get into problems. PFA patch which adds
those tests.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/21 22:21, Ashutosh Bapat wrote:
> Some comments about 0003 patch.
> 
> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
>     Index       rti;
> +   RangeTblEntry *parent_rte;
> There's already another variable declared in that function within a loop
>     foreach(lc, root->append_rel_list)
>     {
>         ...
>         RangeTblEntry *parent_rte;
>         RangeTblEntry *child_rte;
> 
> You might want to choose a different name or delete the one within the loop.

Deleted the one within the loop.

> I am wondering whether we should deal with inh flat reset in a
> slightly different way. Let expand_inherited_rtentry() mark inh =
> false for the partitioned tables without any partitions and deal with
> those at the time of estimating size by marking those as dummy. That
> might be better than multiple changes here. I will try to cook up a
> patch soon for that.

Are thinking something along the lines of the attached rewritten patch
0003?  I also tend to think that's probably a cleaner patch.  Thanks for
the idea.

> Also we should add tests to make sure the scans on partitioned tables
> without any partitions do not get into problems. PFA patch which adds
> those tests.

I added the test case you suggest, but kept just the first one.

I am including the patches 0001 and 0002 to keep all patches being
discussed on this thread together.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
>
>> I am wondering whether we should deal with inh flat reset in a
>> slightly different way. Let expand_inherited_rtentry() mark inh =
>> false for the partitioned tables without any partitions and deal with
>> those at the time of estimating size by marking those as dummy. That
>> might be better than multiple changes here. I will try to cook up a
>> patch soon for that.
>
> Are thinking something along the lines of the attached rewritten patch
> 0003?  I also tend to think that's probably a cleaner patch.  Thanks for
> the idea.

Yes. Thanks for working on it.

>
>> Also we should add tests to make sure the scans on partitioned tables
>> without any partitions do not get into problems. PFA patch which adds
>> those tests.
>
> I added the test case you suggest, but kept just the first one.

The second one was testing multi-level partitioning case, which
deserves a testcase by its own.

Some more comments

The comment below seems too verbose. All it can say is "A partitioned table
without any partitions results in a dummy relation." I don't think we need to
be explicit about rte->inh. But it's more of personal preference. We will leave
it to the committer, if you don't agree.
+                   /*
+                    * An empty partitioned table, i.e., one without any leaf
+                    * partitions, as signaled by rte->inh being set to false
+                    * by the prep phase (see expand_inherited_rtentry).
+                    */

We don't need this comment as well. Rather than repeating what's been said at
the top of the function, it should just refer to it like "nominal relation has
been set above for partitioned tables. For other parent relations, we'll use
the first child ...".
+        *
+        * If the parent is a partitioned table, we do not have a separate RTE
+        * representing it as a member of the inheritance set, because we do
+        * not create a scan plan for it.  As mentioned at the top of this
+        * function, we make the parent RTE itself the nominal relation.        */

Following condition is not very readable. It's not evident that it's of the
form (A && B) || C, at a glance it looks like it's A && (B || C).
+   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
+        list_length(appinfos) < 2) || list_length(appinfos) < 1)

Instead you may rearrage it as
min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
if (list_length(appinfos) < min_child_rels)

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
Thanks for the review.

On 2017/02/22 21:58, Ashutosh Bapat wrote:
>>> Also we should add tests to make sure the scans on partitioned tables
>>> without any partitions do not get into problems. PFA patch which adds
>>> those tests.
>>
>> I added the test case you suggest, but kept just the first one.
> 
> The second one was testing multi-level partitioning case, which
> deserves a testcase by its own.

Ah, okay.  Added it back.

> Some more comments
> 
> The comment below seems too verbose. All it can say is "A partitioned table
> without any partitions results in a dummy relation." I don't think we need to
> be explicit about rte->inh. But it's more of personal preference. We will leave
> it to the committer, if you don't agree.
> +                   /*
> +                    * An empty partitioned table, i.e., one without any leaf
> +                    * partitions, as signaled by rte->inh being set to false
> +                    * by the prep phase (see expand_inherited_rtentry).
> +                    */

It does seem poorly worded.  How about:

                /*
                 * A partitioned table without leaf partitions is marked
                 * as a dummy rel.
                 */

> We don't need this comment as well. Rather than repeating what's been said at
> the top of the function, it should just refer to it like "nominal relation has
> been set above for partitioned tables. For other parent relations, we'll use
> the first child ...".
> +        *
> +        * If the parent is a partitioned table, we do not have a separate RTE
> +        * representing it as a member of the inheritance set, because we do
> +        * not create a scan plan for it.  As mentioned at the top of this
> +        * function, we make the parent RTE itself the nominal relation.
>          */

Rewrote that comment block as:

     *
     * If the parent is a partitioned table, we already set the nominal
     * relation.
     */


> Following condition is not very readable. It's not evident that it's of the
> form (A && B) || C, at a glance it looks like it's A && (B || C).
> +   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
> +        list_length(appinfos) < 2) || list_length(appinfos) < 1)
> 
> Instead you may rearrage it as
> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
> if (list_length(appinfos) < min_child_rels)

OK, done that way.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for the review.
>
> On 2017/02/22 21:58, Ashutosh Bapat wrote:
>>>> Also we should add tests to make sure the scans on partitioned tables
>>>> without any partitions do not get into problems. PFA patch which adds
>>>> those tests.
>>>
>>> I added the test case you suggest, but kept just the first one.
>>
>> The second one was testing multi-level partitioning case, which
>> deserves a testcase by its own.
>
> Ah, okay.  Added it back.

Thanks.

>
>> Some more comments
>>
>> The comment below seems too verbose. All it can say is "A partitioned table
>> without any partitions results in a dummy relation." I don't think we need to
>> be explicit about rte->inh. But it's more of personal preference. We will leave
>> it to the committer, if you don't agree.
>> +                   /*
>> +                    * An empty partitioned table, i.e., one without any leaf
>> +                    * partitions, as signaled by rte->inh being set to false
>> +                    * by the prep phase (see expand_inherited_rtentry).
>> +                    */
>
> It does seem poorly worded.  How about:
>
>                 /*
>                  * A partitioned table without leaf partitions is marked
>                  * as a dummy rel.
>                  */
>
>> We don't need this comment as well. Rather than repeating what's been said at
>> the top of the function, it should just refer to it like "nominal relation has
>> been set above for partitioned tables. For other parent relations, we'll use
>> the first child ...".
>> +        *
>> +        * If the parent is a partitioned table, we do not have a separate RTE
>> +        * representing it as a member of the inheritance set, because we do
>> +        * not create a scan plan for it.  As mentioned at the top of this
>> +        * function, we make the parent RTE itself the nominal relation.
>>          */
>
> Rewrote that comment block as:
>
>      *
>      * If the parent is a partitioned table, we already set the nominal
>      * relation.
>      */
>

I reworded those comments a bit and corrected grammar. Please check in
the attached patch.

>
>> Following condition is not very readable. It's not evident that it's of the
>> form (A && B) || C, at a glance it looks like it's A && (B || C).
>> +   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
>> +        list_length(appinfos) < 2) || list_length(appinfos) < 1)
>>
>> Instead you may rearrage it as
>> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
>> if (list_length(appinfos) < min_child_rels)
>
> OK, done that way.

On a second thought, I added a boolean to check if there were any
children created and then reset rte->inh based on that value. That's
better than relying on appinfos length now.

Let me know what you think.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
Thanks for the review.

On 2017/02/23 15:44, Ashutosh Bapat wrote:
> On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote wrote:
>> Rewrote that comment block as:
>>
>>      *
>>      * If the parent is a partitioned table, we already set the nominal
>>      * relation.
>>      */
>>
> 
> I reworded those comments a bit and corrected grammar. Please check in
> the attached patch.

What was there sounds grammatically correct to me, but fine.

>>> Following condition is not very readable. It's not evident that it's of the
>>> form (A && B) || C, at a glance it looks like it's A && (B || C).
>>> +   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
>>> +        list_length(appinfos) < 2) || list_length(appinfos) < 1)
>>>
>>> Instead you may rearrage it as
>>> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
>>> if (list_length(appinfos) < min_child_rels)
>>
>> OK, done that way.
> 
> On a second thought, I added a boolean to check if there were any
> children created and then reset rte->inh based on that value. That's
> better than relying on appinfos length now.

@@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)    /*
+     * Partitioned tables do not have storage for themselves and should not be
+     * scanned.

@@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root,
RangeTblEntry *rte, Index rti)        /*
+         * Partitioned tables themselves do not have any storage and should not
+         * be scanned. So, do not create child relations for those.
+         */

I guess we should not have to repeat "partitioned tables do not have
storage" in all these places.

+ * a partitioned relation as dummy. The duplicate RTE we added for the
+ * parent table is harmless, so we don't bother to get rid of it; ditto for
+ * the useless PlanRowMark node.

There is no duplicate RTE in the partitioned table case, which even my
original comment failed to consider.  Can you, maybe?

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for the review.
>
> On 2017/02/23 15:44, Ashutosh Bapat wrote:
>> On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote wrote:
>>> Rewrote that comment block as:
>>>
>>>      *
>>>      * If the parent is a partitioned table, we already set the nominal
>>>      * relation.
>>>      */
>>>
>>
>> I reworded those comments a bit and corrected grammar. Please check in
>> the attached patch.
>
> What was there sounds grammatically correct to me, but fine.
>
>>>> Following condition is not very readable. It's not evident that it's of the
>>>> form (A && B) || C, at a glance it looks like it's A && (B || C).
>>>> +   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
>>>> +        list_length(appinfos) < 2) || list_length(appinfos) < 1)
>>>>
>>>> Instead you may rearrage it as
>>>> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
>>>> if (list_length(appinfos) < min_child_rels)
>>>
>>> OK, done that way.
>>
>> On a second thought, I added a boolean to check if there were any
>> children created and then reset rte->inh based on that value. That's
>> better than relying on appinfos length now.
>
> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
>         /*
> +        * Partitioned tables do not have storage for themselves and should not be
> +        * scanned.
>
> @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root,
> RangeTblEntry *rte, Index rti)
>                 /*
> +                * Partitioned tables themselves do not have any storage and should not
> +                * be scanned. So, do not create child relations for those.
> +                */
>
> I guess we should not have to repeat "partitioned tables do not have
> storage" in all these places.

Hmm, you are right. But they are two different functions and the code
blocks are located away from each other. So, I thought it would be
better to have complete comment there, instead of pointing to other
places. I am fine, if we can reword it without compromising
readability.

>
> + * a partitioned relation as dummy. The duplicate RTE we added for the
> + * parent table is harmless, so we don't bother to get rid of it; ditto for
> + * the useless PlanRowMark node.
>
> There is no duplicate RTE in the partitioned table case, which even my
> original comment failed to consider.  Can you, maybe?

May be we could says "If we have added duplicate RTE for the parent
table, it is harmless ...". Does that sound good?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/23 16:48, Ashutosh Bapat wrote:
> On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote wrote:
>> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
>>         /*
>> +        * Partitioned tables do not have storage for themselves and should not be
>> +        * scanned.
>>
>> @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root,
>> RangeTblEntry *rte, Index rti)
>>                 /*
>> +                * Partitioned tables themselves do not have any storage and should not
>> +                * be scanned. So, do not create child relations for those.
>> +                */
>>
>> I guess we should not have to repeat "partitioned tables do not have
>> storage" in all these places.
> 
> Hmm, you are right. But they are two different functions and the code
> blocks are located away from each other. So, I thought it would be
> better to have complete comment there, instead of pointing to other
> places. I am fine, if we can reword it without compromising
> readability.

I was saying in general.  I agree that different sites in the optimizer
may have different considerations for why partitioned tables are to be
handled specially, common theme being that we do not have to scan the
parent relation.  If the implementation changes someday such that we don't
manipulate child tables (especially, partitions) through most planning
phases anymore, then maybe we will start using some different terminology
where we don't have to stress this fact too much.  We're not there yet though.

>> + * a partitioned relation as dummy. The duplicate RTE we added for the
>> + * parent table is harmless, so we don't bother to get rid of it; ditto for
>> + * the useless PlanRowMark node.
>>
>> There is no duplicate RTE in the partitioned table case, which even my
>> original comment failed to consider.  Can you, maybe?
> 
> May be we could says "If we have added duplicate RTE for the parent
> table, it is harmless ...". Does that sound good?

Duplicate RTE added in the non-partitioned table case is harmless, so we
don't bother to get rid of it; ditto for the useless PlanRowMark node.

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
On Thu, Feb 23, 2017 at 1:38 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/02/23 16:48, Ashutosh Bapat wrote:
>> On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote wrote:
>>> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
>>>         /*
>>> +        * Partitioned tables do not have storage for themselves and should not be
>>> +        * scanned.
>>>
>>> @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root,
>>> RangeTblEntry *rte, Index rti)
>>>                 /*
>>> +                * Partitioned tables themselves do not have any storage and should not
>>> +                * be scanned. So, do not create child relations for those.
>>> +                */
>>>
>>> I guess we should not have to repeat "partitioned tables do not have
>>> storage" in all these places.
>>
>> Hmm, you are right. But they are two different functions and the code
>> blocks are located away from each other. So, I thought it would be
>> better to have complete comment there, instead of pointing to other
>> places. I am fine, if we can reword it without compromising
>> readability.
>
> I was saying in general.  I agree that different sites in the optimizer
> may have different considerations for why partitioned tables are to be
> handled specially, common theme being that we do not have to scan the
> parent relation.  If the implementation changes someday such that we don't
> manipulate child tables (especially, partitions) through most planning
> phases anymore, then maybe we will start using some different terminology
> where we don't have to stress this fact too much.  We're not there yet though.

I agree.

>
>>> + * a partitioned relation as dummy. The duplicate RTE we added for the
>>> + * parent table is harmless, so we don't bother to get rid of it; ditto for
>>> + * the useless PlanRowMark node.
>>>
>>> There is no duplicate RTE in the partitioned table case, which even my
>>> original comment failed to consider.  Can you, maybe?
>>
>> May be we could says "If we have added duplicate RTE for the parent
>> table, it is harmless ...". Does that sound good?
>
> Duplicate RTE added in the non-partitioned table case is harmless, so we
> don't bother to get rid of it; ditto for the useless PlanRowMark node.

Fine with me.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Bruce Momjian
Date:
On Fri, Feb 10, 2017 at 03:19:47PM +0900, Amit Langote wrote:
> The new partitioned tables do not contain any data by themselves.  Any
> data inserted into a partitioned table is routed to and stored in one of
> its partitions.  In fact, it is impossible to insert *any* data before a
> partition (to be precise, a leaf partition) is created.  It seems wasteful
> then to allocate physical storage (files) for partitioned tables.  If we
> do not allocate the storage, then we must make sure that the right thing
> happens when a command that is intended to manipulate a table's storage
> encounters a partitioned table, the "right thing" here being that the
> command's code either throws an error or warning (in some cases) if the
> specified table is a partitioned table or ignores any partitioned tables
> when it reads the list of relations to process from pg_class.  Commands
> that need to be taught about this are vacuum, analyze, truncate, and alter
> table.  Specifically:
> 
> - In case of vacuum, specifying a partitioned table causes a warning
> 
> - In case of analyze, we do not throw an error or warning but simply
>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>   acquire_inherited_sample_rows(), any partitioned tables in the list
>   returned by find_all_inheritors() are skipped.
> 
> - In case of truncate, only the part which manipulates table's physical
>   storage is skipped for partitioned tables.
> 
> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>   tables, because there is nothing to be done.
> 
> - Since we cannot create indexes on partitioned tables anyway, there is
>   no need to handle cluster and reindex (they throw a meaningful error
>   already due to the lack of indexes.)

I don't think we are doing this, but if the parent table doesn't have a
physical file pg_upgrade will need to be taught that.  We have that case
now for unlogged tables on standby servers that we need to address.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/28 3:54, Bruce Momjian wrote:
> On Fri, Feb 10, 2017 at 03:19:47PM +0900, Amit Langote wrote:
>> The new partitioned tables do not contain any data by themselves.  Any
>> data inserted into a partitioned table is routed to and stored in one of
>> its partitions.  In fact, it is impossible to insert *any* data before a
>> partition (to be precise, a leaf partition) is created.  It seems wasteful
>> then to allocate physical storage (files) for partitioned tables.  If we
>> do not allocate the storage, then we must make sure that the right thing
>> happens when a command that is intended to manipulate a table's storage
>> encounters a partitioned table, the "right thing" here being that the
>> command's code either throws an error or warning (in some cases) if the
>> specified table is a partitioned table or ignores any partitioned tables
>> when it reads the list of relations to process from pg_class.  Commands
>> that need to be taught about this are vacuum, analyze, truncate, and alter
>> table.  Specifically:
>>
>> - In case of vacuum, specifying a partitioned table causes a warning
>>
>> - In case of analyze, we do not throw an error or warning but simply
>>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>>   acquire_inherited_sample_rows(), any partitioned tables in the list
>>   returned by find_all_inheritors() are skipped.
>>
>> - In case of truncate, only the part which manipulates table's physical
>>   storage is skipped for partitioned tables.
>>
>> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>>   tables, because there is nothing to be done.
>>
>> - Since we cannot create indexes on partitioned tables anyway, there is
>>   no need to handle cluster and reindex (they throw a meaningful error
>>   already due to the lack of indexes.)
> 
> I don't think we are doing this, but if the parent table doesn't have a
> physical file pg_upgrade will need to be taught that.  We have that case
> now for unlogged tables on standby servers that we need to address.

Partitioned tables do have physical files as of now.  This thread is to
discuss the proposal to get rid of the physical file for partitioned tables.

By the way, I just tried pg_upgrade on top of this patch, and it seems
partitioned tables without the physical file migrate just fine to the new
cluster.  To test I did: created a partitioned table and few partitions,
inserted some data into it, pg_upgraded the cluster to find the
partitioned table intact with its data in the new cluster (to be clear,
the data is contained in partitions).  Is there something that wouldn't
work that I should instead be testing?

Also, it seems that the partitioned tables (without physical files) won't
have the same issue on the standby servers as unlogged tables.  It's just
that we route inserts into a partitioned table to its partitions and hence
the partitioned table itself does not contain because all the incoming
data is routed.  Am I missing something?

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for the review.

In 0001, the documentation which you are patching has a section for
limitations that apply only to both partitioning and constraint
exclusion, and another for limitations that apply only to constraint
exclusion.  Surely the patch should be moving a limitation that will
no longer apply to partitioning from the first section to the second
section, rather than leaving it in the section for limitations that
apply to both systems and just adding a note that say "this doesn't
apply to partitioning any more".

In acquire_inherited_sample_rows(), instead of inserting a whole
stanza of logic just above the existing dispatch on relkind, I think
we can get by with a very slightly update to what's already there.

You can't use the result of a & b as a bool.  You need to write (a &
b) != 0, because the bool should always use 1 for true and 0 for
false; it should not set some higher-numbered bit.

The changes to autovacuum.c look useless, because
relation_needs_vacanalyze() will presumably never fire for a table
with no tuples of its own.

Updated patch with those changes and a few cosmetic tweaks attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Bruce Momjian
Date:
On Tue, Feb 28, 2017 at 11:53:16AM +0900, Amit Langote wrote:
> > I don't think we are doing this, but if the parent table doesn't have a
> > physical file pg_upgrade will need to be taught that.  We have that case
> > now for unlogged tables on standby servers that we need to address.
> 
> Partitioned tables do have physical files as of now.  This thread is to
> discuss the proposal to get rid of the physical file for partitioned tables.
> 
> By the way, I just tried pg_upgrade on top of this patch, and it seems
> partitioned tables without the physical file migrate just fine to the new
> cluster.  To test I did: created a partitioned table and few partitions,
> inserted some data into it, pg_upgraded the cluster to find the
> partitioned table intact with its data in the new cluster (to be clear,
> the data is contained in partitions).  Is there something that wouldn't
> work that I should instead be testing?
> 
> Also, it seems that the partitioned tables (without physical files) won't
> have the same issue on the standby servers as unlogged tables.  It's just
> that we route inserts into a partitioned table to its partitions and hence
> the partitioned table itself does not contain because all the incoming
> data is routed.  Am I missing something?

I see why it works now.  pg_upgrade only upgrades relations and
materialized views:
       "  WHERE relkind IN ('r', 'm') AND "

but partitions are defined as 'P':
#define       RELKIND_PARTITIONED_TABLE 'P'     /* partitioned table */

I am a little confused by the above.  Is a partitioned table the parent
or the children?  Reading the code it seems it is the parent, which
explains why it works.  Can I clarify that?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Partitioned tables and relfilenode

From
Michael Paquier
Date:
On Tue, Feb 28, 2017 at 12:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
> I am a little confused by the above.  Is a partitioned table the parent
> or the children?  Reading the code it seems it is the parent, which
> explains why it works.  Can I clarify that?

As I understand things, a partitioned table is a parent relation that
exists only to link to its child portions, holding as well the
definitions linking to each partition.
-- 
Michael



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/28 12:29, Michael Paquier wrote:
> On Tue, Feb 28, 2017 at 12:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> I am a little confused by the above.  Is a partitioned table the parent
>> or the children?  Reading the code it seems it is the parent, which
>> explains why it works.  Can I clarify that?
> 
> As I understand things, a partitioned table is a parent relation that
> exists only to link to its child portions, holding as well the
> definitions linking to each partition.

Yes.  As I mentioned in my previous email, data inserted into the
partitioned table is routed to its partitions.  A partition may be itself
a partitioned table, so the routing continues until we find a leaf
partition.  All the partitioned tables in this chain leading to the leaf
partition are RELKIND_PARTITIONED_TABLE ('P') relations and only the leaf
tables are RELKIND_RELATION tables.

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
Amit's original patch had an assertion failure in
extract_autovac_opts(). His patch adds partitioned tables in the list
of tables to be auto-analyzed. But there's an assertion in
extract_autovac_opts(), which did not consider partitioned tables.
When a partitioned table is created, the assertion trips in autovacuum
worker and resets the cluster, restarting all the backends. I have
updated Robert's patch with the assertion fixed.

On Tue, Feb 28, 2017 at 8:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Thanks for the review.
>
> In 0001, the documentation which you are patching has a section for
> limitations that apply only to both partitioning and constraint
> exclusion, and another for limitations that apply only to constraint
> exclusion.  Surely the patch should be moving a limitation that will
> no longer apply to partitioning from the first section to the second
> section, rather than leaving it in the section for limitations that
> apply to both systems and just adding a note that say "this doesn't
> apply to partitioning any more".
>
> In acquire_inherited_sample_rows(), instead of inserting a whole
> stanza of logic just above the existing dispatch on relkind, I think
> we can get by with a very slightly update to what's already there.
>
> You can't use the result of a & b as a bool.  You need to write (a &
> b) != 0, because the bool should always use 1 for true and 0 for
> false; it should not set some higher-numbered bit.
>
> The changes to autovacuum.c look useless, because
> relation_needs_vacanalyze() will presumably never fire for a table
> with no tuples of its own.
>
> Updated patch with those changes and a few cosmetic tweaks attached.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/02/28 13:52, Ashutosh Bapat wrote:
> Amit's original patch had an assertion failure in
> extract_autovac_opts(). His patch adds partitioned tables in the list
> of tables to be auto-analyzed. But there's an assertion in
> extract_autovac_opts(), which did not consider partitioned tables.
> When a partitioned table is created, the assertion trips in autovacuum
> worker and resets the cluster, restarting all the backends. I have
> updated Robert's patch with the assertion fixed.

Since Robert's patch backed out all autovacuum.c changes, partitioned
tables case would no longer be able to reach that Assert.

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
Thanks for the review.  I was just about to send a new version of the patches.

On 2017/02/28 12:20, Robert Haas wrote:
> On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Thanks for the review.
> 
> In 0001, the documentation which you are patching has a section for
> limitations that apply only to both partitioning and constraint
> exclusion, and another for limitations that apply only to constraint
> exclusion.  Surely the patch should be moving a limitation that will
> no longer apply to partitioning from the first section to the second
> section, rather than leaving it in the section for limitations that
> apply to both systems and just adding a note that say "this doesn't
> apply to partitioning any more".

So we have the following headings under 5.11.6 Caveats:

"The following caveats apply to partitioned tables implemented using
either method (unless noted otherwise)"

and,

"The following caveats apply to constraint exclusion"

The latter lists only the caveats about the planner capability (constraint
exclusion) for partitioning - that it cannot be performed against volatile
WHERE conditions, only works with comparisons using btree operators, does
not scale beyond hundred partitions, etc.  These limitations are common to
both systems (inheritance and declarative partitioned tables), because
both rely on constraint exclusion.

How about separating the limitations listed under the first heading into:

"Limitations of using inheritance for partitioning"

and,

"Limitations of using partitioned tables"

This particular patch gets rid of the restriction that vacuum/analyze do
not recurse to child tables (partitions) only for the second of the above.

How about the documentation changes in the attached updated 0001?  I know
that this page needs a much larger rewrite as we are discussing in the
other thread.

> In acquire_inherited_sample_rows(), instead of inserting a whole
> stanza of logic just above the existing dispatch on relkind, I think
> we can get by with a very slightly update to what's already there.
> 
> You can't use the result of a & b as a bool.  You need to write (a &
> b) != 0, because the bool should always use 1 for true and 0 for
> false; it should not set some higher-numbered bit.

Oops, thanks for fixing that.  I suppose you are referring to this hunk in
the original patch:

-    relations = get_rel_oids(relid, relation);
+    relations = get_rel_oids(relid, relation, options & VACOPT_VACUUM);

And we need to do it this way in *this* case, because we're passing it as
a bool argument.  I see that it's OK to do this:

    stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE";

Or this:

    if (options & VACOPT_VACUUM)
    {
        PreventTransactionChain(isTopLevel, stmttype);

> The changes to autovacuum.c look useless, because
> relation_needs_vacanalyze() will presumably never fire for a table
> with no tuples of its own.

Hmm, I guess that makes sense.  Inheritance statistics are never collected
by autovacuum (at least in the partitioning setups and especially with the
partitioned tables now).  I recall that you mentioned [1] this as one of
the limitations of the system currently.

> Updated patch with those changes and a few cosmetic tweaks attached.

I have incorporated the changes in your patch in the attached updated
0001, with documentation tweaks as mentioned above.

Other than that, in acquire_inherited_sample_rows() I've used a method
suggested by Ashutosh Bapat to track if we encountered a child table,
which is much less ugly than below:

-    if (nrels < 2)
+    if ((onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE && nrels <
2) ||
+        nrels < 1)


I keep updated 0002 and 0003 here, but I've changed their order.  Don't
scan partitioned tables thing is now 0002 and don't allocate file for
partitioned tables is 0003.  Sorry about the confusion.

Thanks,
Amit

[1]
https://postgr.es/m/CA%2BTgmobTxn2%2B0x96h5Le%2BGOK5kw3J37SRveNfzEdx9s5-Yd8vA%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Tue, Feb 28, 2017 at 11:35 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> How about the documentation changes in the attached updated 0001?  I know
> that this page needs a much larger rewrite as we are discussing in the
> other thread.

Looks good.

>> In acquire_inherited_sample_rows(), instead of inserting a whole
>> stanza of logic just above the existing dispatch on relkind, I think
>> we can get by with a very slightly update to what's already there.
>>
>> You can't use the result of a & b as a bool.  You need to write (a &
>> b) != 0, because the bool should always use 1 for true and 0 for
>> false; it should not set some higher-numbered bit.
>
> Oops, thanks for fixing that.  I suppose you are referring to this hunk in
> the original patch:
>
> -    relations = get_rel_oids(relid, relation);
> +    relations = get_rel_oids(relid, relation, options & VACOPT_VACUUM);
>
> And we need to do it this way in *this* case, because we're passing it as
> a bool argument.  I see that it's OK to do this:
>
>     stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE";
>
> Or this:
>
>     if (options & VACOPT_VACUUM)
>     {
>         PreventTransactionChain(isTopLevel, stmttype);

In those cases it's still clearer, IMHO, to use != 0, but it's not
necessary.  However, when you're explicitly creating a value of type
"bool", then it's necessary.

Actually, looking at this again, I now think this part is wrong:

+            /*
+             * If only ANALYZE is to be performed, there is no need to include
+             * partitions in the list.  In a database-wide ANALYZE, we only
+             * update the inheritance statistics of partitioned tables, not
+             * the statistics of individual partitions.
+             */
+            if (!is_vacuum && classForm->relispartition)                continue;

I was thinking earlier that an ANALYZE on the parent would also update
the statistics for each child, but now I see that's not so.  So now I
think we should omit this logic (and change the documentation to
match).  That is, a database-wide ANALYZE should update the statistics
for each child as well as for the parent.  Otherwise direct queries
against the children (and partitionwise joins, once we have that) are
going to go haywire.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/02 18:36, Robert Haas wrote:
> On Tue, Feb 28, 2017 at 11:35 AM, Amit Langote wrote:
>>> In acquire_inherited_sample_rows(), instead of inserting a whole
>>> stanza of logic just above the existing dispatch on relkind, I think
>>> we can get by with a very slightly update to what's already there.
>>>
>>> You can't use the result of a & b as a bool.  You need to write (a &
>>> b) != 0, because the bool should always use 1 for true and 0 for
>>> false; it should not set some higher-numbered bit.
>>
>> Oops, thanks for fixing that.  I suppose you are referring to this hunk in
>> the original patch:
>>
>> -    relations = get_rel_oids(relid, relation);
>> +    relations = get_rel_oids(relid, relation, options & VACOPT_VACUUM);
>>
>> And we need to do it this way in *this* case, because we're passing it as
>> a bool argument.  I see that it's OK to do this:
>>
>>     stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE";
>>
>> Or this:
>>
>>     if (options & VACOPT_VACUUM)
>>     {
>>         PreventTransactionChain(isTopLevel, stmttype);
> 
> In those cases it's still clearer, IMHO, to use != 0, but it's not
> necessary.  However, when you're explicitly creating a value of type
> "bool", then it's necessary.

Agreed.

> Actually, looking at this again, I now think this part is wrong:
> 
> +            /*
> +             * If only ANALYZE is to be performed, there is no need to include
> +             * partitions in the list.  In a database-wide ANALYZE, we only
> +             * update the inheritance statistics of partitioned tables, not
> +             * the statistics of individual partitions.
> +             */
> +            if (!is_vacuum && classForm->relispartition)
>                  continue;
> 
> I was thinking earlier that an ANALYZE on the parent would also update
> the statistics for each child, but now I see that's not so.  So now I

Yep, the patch enables ANALYZE to be propagated to partitions when the
parent table is specified in the command.  The above logic in the patch
made the database-wide ANALYZE to ignore partitions, in which case, only
the inheritance statistics would be updated.  I can also see why that'd be
undesirable.

> think we should omit this logic (and change the documentation to
> match).  That is, a database-wide ANALYZE should update the statistics
> for each child as well as for the parent.  Otherwise direct queries
> against the children (and partitionwise joins, once we have that) are
> going to go haywire.

OK, done.  I updated both analyze.sgml and vacuum.sgml to be more up to
date.  Both pages previously omitted materialized views.

Attached updated patches.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Thu, Mar 2, 2017 at 3:52 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> think we should omit this logic (and change the documentation to
>> match).  That is, a database-wide ANALYZE should update the statistics
>> for each child as well as for the parent.  Otherwise direct queries
>> against the children (and partitionwise joins, once we have that) are
>> going to go haywire.
>
> OK, done.  I updated both analyze.sgml and vacuum.sgml to be more up to
> date.  Both pages previously omitted materialized views.
>
> Attached updated patches.

Thanks, committed 0001 with a slight change to the wording.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/02 21:48, Robert Haas wrote:
> On Thu, Mar 2, 2017 at 3:52 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> think we should omit this logic (and change the documentation to
>>> match).  That is, a database-wide ANALYZE should update the statistics
>>> for each child as well as for the parent.  Otherwise direct queries
>>> against the children (and partitionwise joins, once we have that) are
>>> going to go haywire.
>>
>> OK, done.  I updated both analyze.sgml and vacuum.sgml to be more up to
>> date.  Both pages previously omitted materialized views.
>>
>> Attached updated patches.
> 
> Thanks, committed 0001 with a slight change to the wording.

Thanks.  I noticed that 'and' is duplicated in a line added by the commit
to analyze.sgml.  Attached 0001 fixes that.  0002 and 0003 same as the
last version.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Michael Paquier
Date:
On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks.  I noticed that 'and' is duplicated in a line added by the commit
> to analyze.sgml.  Attached 0001 fixes that.  0002 and 0003 same as the
> last version.
   /*
-    * If all the children were temp tables, pretend it's a non-inheritance
-    * situation.  The duplicate RTE we added for the parent table is
-    * harmless, so we don't bother to get rid of it; ditto for the useless
-    * PlanRowMark node.
+    * If all the children were temp tables or if the parent is a partitioned
+    * table without any leaf partitions, pretend it's a non-inheritance
+    * situation.  The duplicate RTE for the parent table we added in the
+    * non-partitioned table case is harmless, so we don't bother to get rid
+    * of it; ditto for the useless PlanRowMark node.    */
-   if (list_length(appinfos) < 2)
+   if (!has_child)
This comment is not completely correct. Children can be temp tables,
they just cannot be temp tables of other backends. It seems to me that
you could still keep this code simple and remove has_child..

@@ -932,7 +932,6 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,       case RELKIND_RELATION:       case
RELKIND_TOASTVALUE:      case RELKIND_MATVIEW:
 
-       case RELKIND_PARTITIONED_TABLE:           options = heap_reloptions(classForm->relkind, datum, false);
break;
 
Partitioned tables cannot have reloptions? What about all the
autovacuum_* parameters then? Those are mainly not storage-related.
-- 
Michael



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
Thanks for the review.

On 2017/03/06 15:41, Michael Paquier wrote:
> On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Thanks.  I noticed that 'and' is duplicated in a line added by the commit
>> to analyze.sgml.  Attached 0001 fixes that.  0002 and 0003 same as the
>> last version.
> 
>     /*
> -    * If all the children were temp tables, pretend it's a non-inheritance
> -    * situation.  The duplicate RTE we added for the parent table is
> -    * harmless, so we don't bother to get rid of it; ditto for the useless
> -    * PlanRowMark node.
> +    * If all the children were temp tables or if the parent is a partitioned
> +    * table without any leaf partitions, pretend it's a non-inheritance
> +    * situation.  The duplicate RTE for the parent table we added in the
> +    * non-partitioned table case is harmless, so we don't bother to get rid
> +    * of it; ditto for the useless PlanRowMark node.
>      */
> -   if (list_length(appinfos) < 2)
> +   if (!has_child)
> This comment is not completely correct. Children can be temp tables,
> they just cannot be temp tables of other backends. It seems to me that
> you could still keep this code simple and remove has_child..

I updated the comment.  I recall having posted a patch for that once, but
perhaps went unnoticed.

About has_child, the other option is to make the minimum length of
appinfos list relkind-based, but  the condition quickly becomes ugly. Do
you have a suggestion?

> @@ -932,7 +932,6 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
>         case RELKIND_RELATION:
>         case RELKIND_TOASTVALUE:
>         case RELKIND_MATVIEW:
> -       case RELKIND_PARTITIONED_TABLE:
>             options = heap_reloptions(classForm->relkind, datum, false);
>             break;
> Partitioned tables cannot have reloptions? What about all the
> autovacuum_* parameters then? Those are mainly not storage-related.

AFAIK, none of the heap reloptions will be applicable to partitioned table
relations once we eliminate storage.

About autovacuum_* parameters - we currently don't handle partitioned
tables in autovacuum.c, because no statistics are reported for them. That
is, relation_needs_vacanalyze() will never return true for dovacuum,
doanalyze and wraparound if it is passed a RELKIND_PARTITIONED_TABLE
relation.  That's something to be fixed separately though.  When we add
autovacuum support for partitioned tables, we may want to add a new set of
reloptions (new because partitioned tables still won't support all options
returned by heap_reloptions()).  Am I missing something?

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for the review.
>
> On 2017/03/06 15:41, Michael Paquier wrote:
>> On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Thanks.  I noticed that 'and' is duplicated in a line added by the commit
>>> to analyze.sgml.  Attached 0001 fixes that.  0002 and 0003 same as the
>>> last version.
>>
>>     /*
>> -    * If all the children were temp tables, pretend it's a non-inheritance
>> -    * situation.  The duplicate RTE we added for the parent table is
>> -    * harmless, so we don't bother to get rid of it; ditto for the useless
>> -    * PlanRowMark node.
>> +    * If all the children were temp tables or if the parent is a partitioned
>> +    * table without any leaf partitions, pretend it's a non-inheritance
>> +    * situation.  The duplicate RTE for the parent table we added in the
>> +    * non-partitioned table case is harmless, so we don't bother to get rid
>> +    * of it; ditto for the useless PlanRowMark node.
>>      */
>> -   if (list_length(appinfos) < 2)
>> +   if (!has_child)
>> This comment is not completely correct. Children can be temp tables,
>> they just cannot be temp tables of other backends. It seems to me that
>> you could still keep this code simple and remove has_child..
>
> I updated the comment.  I recall having posted a patch for that once, but
> perhaps went unnoticed.

The existing comment only specifies "temp tables" and not "temp table
of other backends". The new comment keeps that part same and adds
partitioned table case. So, I don't see any reason to change the "temp
tables" to "temp table of other backends" in this patch.

>
> About has_child, the other option is to make the minimum length of
> appinfos list relkind-based, but  the condition quickly becomes ugly. Do
> you have a suggestion?
>
+1.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/06 16:49, Ashutosh Bapat wrote:
> On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote wrote:
>> On 2017/03/06 15:41, Michael Paquier wrote:
>>> This comment is not completely correct. Children can be temp tables,
>>> they just cannot be temp tables of other backends. It seems to me that
>>> you could still keep this code simple and remove has_child..
>>
>> I updated the comment.  I recall having posted a patch for that once, but
>> perhaps went unnoticed.
> 
> The existing comment only specifies "temp tables" and not "temp table
> of other backends". The new comment keeps that part same and adds
> partitioned table case. So, I don't see any reason to change the "temp
> tables" to "temp table of other backends" in this patch.

Hmm.  A separate patch might be fine but why not fix the incorrect part
while we are updating the whole comment anyway.

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
On Mon, Mar 6, 2017 at 1:26 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/03/06 16:49, Ashutosh Bapat wrote:
>> On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote wrote:
>>> On 2017/03/06 15:41, Michael Paquier wrote:
>>>> This comment is not completely correct. Children can be temp tables,
>>>> they just cannot be temp tables of other backends. It seems to me that
>>>> you could still keep this code simple and remove has_child..
>>>
>>> I updated the comment.  I recall having posted a patch for that once, but
>>> perhaps went unnoticed.
>>
>> The existing comment only specifies "temp tables" and not "temp table
>> of other backends". The new comment keeps that part same and adds
>> partitioned table case. So, I don't see any reason to change the "temp
>> tables" to "temp table of other backends" in this patch.
>
> Hmm.  A separate patch might be fine but why not fix the incorrect part
> while we are updating the whole comment anyway.

There must be a reason why that comment is written the way it's
written. I guess, "from other backends" part of "temp tables" story
has been explained just few lines above and the author/s didn't want
to repeat it again. There's no point in changing it while we are not
touching temp tables handling.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Michael Paquier
Date:
On Mon, Mar 6, 2017 at 4:18 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> About autovacuum_* parameters - we currently don't handle partitioned
> tables in autovacuum.c, because no statistics are reported for them. That
> is, relation_needs_vacanalyze() will never return true for dovacuum,
> doanalyze and wraparound if it is passed a RELKIND_PARTITIONED_TABLE
> relation.  That's something to be fixed separately though.  When we add
> autovacuum support for partitioned tables, we may want to add a new set of
> reloptions (new because partitioned tables still won't support all options
> returned by heap_reloptions()).  Am I missing something?

OK. I got confused by the fact that settings on parents should
super-seed the settings of the children. Or if you want if a value is
set on the parent by default it would apply to the child if it has no
value set, which is where autovacuum_enabled makes sense even for
partitioned tables. Leading to the point that parents could have
reloptions, with a new category of the type RELOPT_KIND_PARTITION.
Still, it is sensible as well to bypass the parents in autovacuum as
well, now that I read it. And the handling is more simple.
-- 
Michael



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/06 17:01, Ashutosh Bapat wrote:
> On Mon, Mar 6, 2017 at 1:26 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/03/06 16:49, Ashutosh Bapat wrote:
>>> On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote wrote:
>>>> On 2017/03/06 15:41, Michael Paquier wrote:
>>>>> This comment is not completely correct. Children can be temp tables,
>>>>> they just cannot be temp tables of other backends. It seems to me that
>>>>> you could still keep this code simple and remove has_child..
>>>>
>>>> I updated the comment.  I recall having posted a patch for that once, but
>>>> perhaps went unnoticed.
>>>
>>> The existing comment only specifies "temp tables" and not "temp table
>>> of other backends". The new comment keeps that part same and adds
>>> partitioned table case. So, I don't see any reason to change the "temp
>>> tables" to "temp table of other backends" in this patch.
>>
>> Hmm.  A separate patch might be fine but why not fix the incorrect part
>> while we are updating the whole comment anyway.
> 
> There must be a reason why that comment is written the way it's
> written. I guess, "from other backends" part of "temp tables" story
> has been explained just few lines above and the author/s didn't want
> to repeat it again.

That's perhaps true.  I guess it depends on who reads it.  Someone might
think the comment is incorrect because *not all* temporary child tables
are excluded.

> There's no point in changing it while we are not
> touching temp tables handling.

We can leave it for the committer to decide, maybe.  Committers often
rewrite surrounding comments to improve wording, correcting factual
errors, etc.

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
>
> We can leave it for the committer to decide, maybe.  Committers often
> rewrite surrounding comments to improve wording, correcting factual
> errors, etc.
>

Sure.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/06 17:22, Michael Paquier wrote:
> On Mon, Mar 6, 2017 at 4:18 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> About autovacuum_* parameters - we currently don't handle partitioned
>> tables in autovacuum.c, because no statistics are reported for them. That
>> is, relation_needs_vacanalyze() will never return true for dovacuum,
>> doanalyze and wraparound if it is passed a RELKIND_PARTITIONED_TABLE
>> relation.  That's something to be fixed separately though.  When we add
>> autovacuum support for partitioned tables, we may want to add a new set of
>> reloptions (new because partitioned tables still won't support all options
>> returned by heap_reloptions()).  Am I missing something?
> 
> OK. I got confused by the fact that settings on parents should
> super-seed the settings of the children. Or if you want if a value is
> set on the parent by default it would apply to the child if it has no
> value set, which is where autovacuum_enabled makes sense even for
> partitioned tables.

Hmm, setting autovacuum_enabled on partitioned parent should be made to
work after we have fixed autovacuum support for partitioned tables.  Using
the parent's value as a default for partitions may not be what we'd want
eventually.

> Leading to the point that parents could have
> reloptions, with a new category of the type RELOPT_KIND_PARTITION.
> Still, it is sensible as well to bypass the parents in autovacuum as
> well, now that I read it. And the handling is more simple.

We will need it though, because lack of automatically updated
"inheritance" (or whole tree) statistics on partitioned tables is a problem.

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
I see that all the changes by Amit and myself to what was earlier 0003
patch are now part of 0002 patch. The patch looks ready for committer.

Some comments about 0003 patch.

CREATE TABLE syntax seems to allow specifying reloptions for a
partitioned table. But extractRelOptions() and heap_reloptions() seem
to be ignoring those. Is that intentional? There are two callers of
extractRelOptions(), extract_autovac_opts() and
RelationParseRelOptions(). There's an assert in extract_autovac_opts()
preventing that function from getting called for partitioned table.
RelationParseRelOptions() seems to filter the relations before calling
extractRelOptions(). I am wondering whether it's better to leave
extractRelOptions() and filter partitioned relations in
RelationParseRelOptions().

Do we need similar condition modification in the other caller of i.e.
heap_create_init_fork(), ExecuteTruncate()?

On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for the review.
>
> On 2017/03/06 15:41, Michael Paquier wrote:
>> On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Thanks.  I noticed that 'and' is duplicated in a line added by the commit
>>> to analyze.sgml.  Attached 0001 fixes that.  0002 and 0003 same as the
>>> last version.
>>
>>     /*
>> -    * If all the children were temp tables, pretend it's a non-inheritance
>> -    * situation.  The duplicate RTE we added for the parent table is
>> -    * harmless, so we don't bother to get rid of it; ditto for the useless
>> -    * PlanRowMark node.
>> +    * If all the children were temp tables or if the parent is a partitioned
>> +    * table without any leaf partitions, pretend it's a non-inheritance
>> +    * situation.  The duplicate RTE for the parent table we added in the
>> +    * non-partitioned table case is harmless, so we don't bother to get rid
>> +    * of it; ditto for the useless PlanRowMark node.
>>      */
>> -   if (list_length(appinfos) < 2)
>> +   if (!has_child)
>> This comment is not completely correct. Children can be temp tables,
>> they just cannot be temp tables of other backends. It seems to me that
>> you could still keep this code simple and remove has_child..
>
> I updated the comment.  I recall having posted a patch for that once, but
> perhaps went unnoticed.
>
> About has_child, the other option is to make the minimum length of
> appinfos list relkind-based, but  the condition quickly becomes ugly. Do
> you have a suggestion?
>
>> @@ -932,7 +932,6 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
>>         case RELKIND_RELATION:
>>         case RELKIND_TOASTVALUE:
>>         case RELKIND_MATVIEW:
>> -       case RELKIND_PARTITIONED_TABLE:
>>             options = heap_reloptions(classForm->relkind, datum, false);
>>             break;
>> Partitioned tables cannot have reloptions? What about all the
>> autovacuum_* parameters then? Those are mainly not storage-related.
>
> AFAIK, none of the heap reloptions will be applicable to partitioned table
> relations once we eliminate storage.
>
> About autovacuum_* parameters - we currently don't handle partitioned
> tables in autovacuum.c, because no statistics are reported for them. That
> is, relation_needs_vacanalyze() will never return true for dovacuum,
> doanalyze and wraparound if it is passed a RELKIND_PARTITIONED_TABLE
> relation.  That's something to be fixed separately though.  When we add
> autovacuum support for partitioned tables, we may want to add a new set of
> reloptions (new because partitioned tables still won't support all options
> returned by heap_reloptions()).  Am I missing something?
>
> Thanks,
> Amit



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Thu, Mar 2, 2017 at 8:02 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks.  I noticed that 'and' is duplicated in a line added by the commit
> to analyze.sgml.  Attached 0001 fixes that.  0002 and 0003 same as the
> last version.

Oh, rats.  Thanks for noticing.  Committed 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Tue, Mar 7, 2017 at 12:11 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I see that all the changes by Amit and myself to what was earlier 0003
> patch are now part of 0002 patch. The patch looks ready for committer.

Reviewing 0002:

This patch seems to have falsified the header comments for
expand_inherited_rtentry.

I don't quite understand the need for the change to set_rel_size().
The rte->inh case is handled before reaching the new code, and IIUC
the !rte->inh case should never happen given the changes to
expand_inherited_rtentry.  Or am I confused?  If not, I think we
should just add an Assert() to the "plain relation" case that this is
not RELKIND_PARTITIONED_TABLE (with a comment explaining why it can't
happen).

In inheritance_planner, this comment addition does not seem adequate:

+         * If the parent is a partitioned table, we already set the nominal
+         * relation.

That basically contradicts the previous paragraph.  I think you need
to adjust the previous paragraph instead of adding a new one, probably
in multiple places.  For example, the parenthesized bit now only
applies in the non-partitioned case.

-    rel = mtstate->resultRelInfo->ri_RelationDesc;
+    nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table);
+    nominalRel = heap_open(nominalRTE->relid, NoLock);

No lock?

Another thing that bothers me about this is that, apparently, the use
of nominalRelation is no longer strictly for EXPLAIN, as the comments
in plannodes.h/relation.h still claim that it is.  I'm not sure how to
adjust that exactly; there's not a lot of room in those comments to
give all the details.  Maybe they should simply say something like /*
Parent RT index */ instead of /* Parent RT index for use of EXPLAIN
*/.  But we can't just go around changing the meaning of things
without updating the comments accordingly.  A related question is
whether we should really be using nominalRelation for this or whether
there's some other way we should be getting at the parent -- I don't
have another idea, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/08 2:27, Robert Haas wrote:
> On Tue, Mar 7, 2017 at 12:11 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I see that all the changes by Amit and myself to what was earlier 0003
>> patch are now part of 0002 patch. The patch looks ready for committer.
> 
> Reviewing 0002:

Thanks for the review.

> 
> This patch seems to have falsified the header comments for
> expand_inherited_rtentry.

Oops, updated.

> I don't quite understand the need for the change to set_rel_size().
> The rte->inh case is handled before reaching the new code, and IIUC
> the !rte->inh case should never happen given the changes to
> expand_inherited_rtentry.  Or am I confused?

In expand_inherited_rtentry(), patch only changes the rule about the
minimum number of child RTEs required to keep rte->inh true.  In the
traditional case, it is 2 (the table itself and at least one child).  For
a partitioned table, since we don't want to scan the table itself, that
becomes 1 (at least one leaf partition).

So, it's still possible for rte->inh to become false if the required
minimum is not met, even for partitioned tables.

> If not, I think we
> should just add an Assert() to the "plain relation" case that this is
> not RELKIND_PARTITIONED_TABLE (with a comment explaining why it can't
> happen).

How about adding Assert(rte->relkind != RELKIND_PARTITIONED_TABLE) at the
beginning of the following functions as the updated patch does:

set_plain_rel_size()
set_tablesample_rel_size()
set_plain_rel_pathlist()
set_tablesample_rel_pathlist()

> In inheritance_planner, this comment addition does not seem adequate:
> 
> +         * If the parent is a partitioned table, we already set the nominal
> +         * relation.
> 
> That basically contradicts the previous paragraph.  I think you need
> to adjust the previous paragraph instead of adding a new one, probably
> in multiple places.  For example, the parenthesized bit now only
> applies in the non-partitioned case.

Hmm, that's true.  I rewrote the entire comment.

> 
> -    rel = mtstate->resultRelInfo->ri_RelationDesc;
> +    nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table);
> +    nominalRel = heap_open(nominalRTE->relid, NoLock);
> 
> No lock?

Hmm, I think I missed that a partitioned parent table would not be locked
by the *executor* at all after applying this patch.  As of now, InitPlan()
takes care of locking all the result relations in the
PlannedStmt.resultRelations list.  This patch removes partitioned
parent(s) from appearing in this list.  But that makes me wonder - aren't
the locks taken by expand_inherited_rtentry() kept long enough?  Why does
InitPlan need to acquire them again?  I see this comment in
expand_inherited_rtentry:

   * If the parent relation is the query's result relation, then we need
   * RowExclusiveLock.  Otherwise, if it's accessed FOR UPDATE/SHARE, we
   * need RowShareLock; otherwise AccessShareLock.  We can't just grab
   * AccessShareLock because then the executor would be trying to upgrade
   * the lock, leading to possible deadlocks.  (This code should match the
   * parser and rewriter.)

So, I assumed all the necessary locking is being taken care of here.

Anyway, I changed NoLock to RowExclusiveLock here for now, but maybe it's
either not necessary or a way should be found to do that in InitPlan along
with other result relations.

> Another thing that bothers me about this is that, apparently, the use
> of nominalRelation is no longer strictly for EXPLAIN, as the comments
> in plannodes.h/relation.h still claim that it is.  I'm not sure how to
> adjust that exactly; there's not a lot of room in those comments to
> give all the details.  Maybe they should simply say something like /*
> Parent RT index */ instead of /* Parent RT index for use of EXPLAIN
> */.  But we can't just go around changing the meaning of things
> without updating the comments accordingly.

It seems that this comment and one more need to be updated.  The other
comment change is in explain.c as follows:

  * Adds the relid of each referenced RTE to *rels_used.  The result controls
  * which RTEs are assigned aliases by select_rtable_names_for_explain.
  * This ensures that we don't confusingly assign un-suffixed aliases to RTEs
- * that never appear in the EXPLAIN output (such as inheritance parents).
+ * that never appear in the EXPLAIN output (such as non-partitioned
+ * inheritance parents).
  */
 static bool
 ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)

I studied the EXPLAIN code a bit to see whether the problem cited for
using inheritance parent RTE as nominalRelation (for example, in comments
in inheritance_planner) apply to the partitioned parent RTE case, which it
doesn't.  The partitioned parent RTE won't appear anywhere else in the
plan.  So, different aliases for what's the same table/RTE won't happen in
the partitioned parent case.

> A related question is
> whether we should really be using nominalRelation for this or whether
> there's some other way we should be getting at the parent -- I don't
> have another idea, though.

This is just for the ModifyTable node.  Instead of adding any new member,
I thought nominalRelation would work fine.  But as I said above regarding
locking, we *may* need to find a different place for the partitioned
parent RTE (other than ModifyTable.nominalRelation).

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I don't quite understand the need for the change to set_rel_size().
>> The rte->inh case is handled before reaching the new code, and IIUC
>> the !rte->inh case should never happen given the changes to
>> expand_inherited_rtentry.  Or am I confused?
>
> In expand_inherited_rtentry(), patch only changes the rule about the
> minimum number of child RTEs required to keep rte->inh true.  In the
> traditional case, it is 2 (the table itself and at least one child).  For
> a partitioned table, since we don't want to scan the table itself, that
> becomes 1 (at least one leaf partition).
>
> So, it's still possible for rte->inh to become false if the required
> minimum is not met, even for partitioned tables.

OK.

>> -    rel = mtstate->resultRelInfo->ri_RelationDesc;
>> +    nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table);
>> +    nominalRel = heap_open(nominalRTE->relid, NoLock);
>>
>> No lock?
>
> Hmm, I think I missed that a partitioned parent table would not be locked
> by the *executor* at all after applying this patch.  As of now, InitPlan()
> takes care of locking all the result relations in the
> PlannedStmt.resultRelations list.  This patch removes partitioned
> parent(s) from appearing in this list.  But that makes me wonder - aren't
> the locks taken by expand_inherited_rtentry() kept long enough?  Why does
> InitPlan need to acquire them again?  I see this comment in
> expand_inherited_rtentry:

Parse-time locks, plan-time locks, and execution-time locks are all
separate.  See the header comments for AcquireRewriteLocks in
rewriteHandler.c; think about a view, where the parsed query has been
stored and is later rewritten and planned.  Similarly, a plan can be
reused even after the transaction that created that plan has
committed; see AcquireExecutorLocks in plancache.c.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/08 22:36, Robert Haas wrote:
> On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote wrote:
>>> -    rel = mtstate->resultRelInfo->ri_RelationDesc;
>>> +    nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table);
>>> +    nominalRel = heap_open(nominalRTE->relid, NoLock);
>>>
>>> No lock?
>>
>> Hmm, I think I missed that a partitioned parent table would not be locked
>> by the *executor* at all after applying this patch.  As of now, InitPlan()
>> takes care of locking all the result relations in the
>> PlannedStmt.resultRelations list.  This patch removes partitioned
>> parent(s) from appearing in this list.  But that makes me wonder - aren't
>> the locks taken by expand_inherited_rtentry() kept long enough?  Why does
>> InitPlan need to acquire them again?  I see this comment in
>> expand_inherited_rtentry:
> 
> Parse-time locks, plan-time locks, and execution-time locks are all
> separate.  See the header comments for AcquireRewriteLocks in
> rewriteHandler.c; think about a view, where the parsed query has been
> stored and is later rewritten and planned.  Similarly, a plan can be
> reused even after the transaction that created that plan has
> committed; see AcquireExecutorLocks in plancache.c.

Thanks for those references.

I took a step back here and thought a bit more about the implications this
patch.  It occurred to me that the complete absence of partitioned table
RT entries in the plan-tree has more consequences than I originally
imagined.  I will post an updated patch by Monday latest.

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/10 17:57, Amit Langote wrote:
> On 2017/03/08 22:36, Robert Haas wrote:
>> On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote wrote:
>>>> -    rel = mtstate->resultRelInfo->ri_RelationDesc;
>>>> +    nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table);
>>>> +    nominalRel = heap_open(nominalRTE->relid, NoLock);
>>>>
>>>> No lock?
>>>
>>> Hmm, I think I missed that a partitioned parent table would not be locked
>>> by the *executor* at all after applying this patch.  As of now, InitPlan()
>>> takes care of locking all the result relations in the
>>> PlannedStmt.resultRelations list.  This patch removes partitioned
>>> parent(s) from appearing in this list.  But that makes me wonder - aren't
>>> the locks taken by expand_inherited_rtentry() kept long enough?  Why does
>>> InitPlan need to acquire them again?  I see this comment in
>>> expand_inherited_rtentry:
>>
>> Parse-time locks, plan-time locks, and execution-time locks are all
>> separate.  See the header comments for AcquireRewriteLocks in
>> rewriteHandler.c; think about a view, where the parsed query has been
>> stored and is later rewritten and planned.  Similarly, a plan can be
>> reused even after the transaction that created that plan has
>> committed; see AcquireExecutorLocks in plancache.c.
> 
> Thanks for those references.
> 
> I took a step back here and thought a bit more about the implications this
> patch.  It occurred to me that the complete absence of partitioned table
> RT entries in the plan-tree has more consequences than I originally
> imagined.  I will post an updated patch by Monday latest.

Here is the updated patch.

Since this patch proposes to avoid creating scan nodes for non-leaf tables
in a partition tree, they won't be referenced anywhere in the resulting
plan tree.  So the executor will not lock those tables in the
select/update/delete cases. Insert is different since we lock all tables
in the partition tree when setting up tuple-routing in
ExecInitModifyTable.  Not taking executor locks on the tables means that
the cached plans that should be invalidated upon adding/removing a
partition somewhere in the partition tree won't be.

First I thought that we could remember just the root table RT index using
a new Index field partitionRoot in Append, MergeAppend, and ModifyTable
nodes and use it to locate and lock the root table during executor node
initialization.  But soon realized that that's not enough, because it
ignores the fact that adding/removing partitions at lower levels does not
require taking a lock on the root table; only the immediate parent.  So
any cached select/update/delete plans won't be invalidated when a new
lower-level partition is added/removed, because the immediate parent would
not have been added to the query's range table and hence the
PlannedStmt.relationOids list.  Remember that the latter is used by
plancache.c to remember the relations that a given cached plan depends on
remaining unchanged.  So the patch now adds a list member called
partitioned_rels to Append, MergeAppend, and ModifyTable nodes and stores
the RT indexes of all the non-leaf tables in a partition tree with root
table RT index at the head (note that these RT indexes are of the RTEs
added by expand_inherited_rtenrty; also see below).  Since the
partitioned_rels list is constructed when building paths and must be
propagated to the plan nodes, the same field is also present in the
corresponding Path nodes.  ExecInit* routines for the aforementioned nodes
now locate and lock the non-leaf tables using the RT indexes in
partitioned_rels.  Leaf tables are locked, as before, either by InitPlan
(update/delete result relations case) or by ExecInitAppend or
ExecInitMergeAppend when initializing the appendplans/mergeplans (select
case).

The previous proposal was for expand_inherited_rtentry to not create RT
entries and AppendRelInfo's for the non-leaf tables, but I think that
doesn't work, as I tried to explain above.  We need RTEs because that
seems to be the only way currently for informing the executor of the
non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
RTEs for the latter planning steps to collect them in partitioned_rels
mentioned above. So with the latest patch, we do create the RT entry and
AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
a minimal one; only parent_relid and child_relid are valid.  To make the
latter planning steps ignore these minimal AppendRelInfo's, every
AppendRelInfo is now marked with child_relkind.  Only
set_append_rel_pathlist() and inheritance_planner() process them to
collect the child_relid into the partitioned_rels list to be stored in
AppendPath/MergeAppendPath and ModifyTablePath, respectively.

Thoughts?

Thanks,
Amit





Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/13 19:24, Amit Langote wrote:
> On 2017/03/10 17:57, Amit Langote wrote:
>> On 2017/03/08 22:36, Robert Haas wrote:
>>> On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote wrote:
>>>>> -    rel = mtstate->resultRelInfo->ri_RelationDesc;
>>>>> +    nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table);
>>>>> +    nominalRel = heap_open(nominalRTE->relid, NoLock);
>>>>>
>>>>> No lock?
>>>>
>>>> Hmm, I think I missed that a partitioned parent table would not be locked
>>>> by the *executor* at all after applying this patch.  As of now, InitPlan()
>>>> takes care of locking all the result relations in the
>>>> PlannedStmt.resultRelations list.  This patch removes partitioned
>>>> parent(s) from appearing in this list.  But that makes me wonder - aren't
>>>> the locks taken by expand_inherited_rtentry() kept long enough?  Why does
>>>> InitPlan need to acquire them again?  I see this comment in
>>>> expand_inherited_rtentry:
>>>
>>> Parse-time locks, plan-time locks, and execution-time locks are all
>>> separate.  See the header comments for AcquireRewriteLocks in
>>> rewriteHandler.c; think about a view, where the parsed query has been
>>> stored and is later rewritten and planned.  Similarly, a plan can be
>>> reused even after the transaction that created that plan has
>>> committed; see AcquireExecutorLocks in plancache.c.
>>
>> Thanks for those references.
>>
>> I took a step back here and thought a bit more about the implications this
>> patch.  It occurred to me that the complete absence of partitioned table
>> RT entries in the plan-tree has more consequences than I originally
>> imagined.  I will post an updated patch by Monday latest.
> 
> Here is the updated patch.
> 
> Since this patch proposes to avoid creating scan nodes for non-leaf tables
> in a partition tree, they won't be referenced anywhere in the resulting
> plan tree.  So the executor will not lock those tables in the
> select/update/delete cases. Insert is different since we lock all tables
> in the partition tree when setting up tuple-routing in
> ExecInitModifyTable.  Not taking executor locks on the tables means that
> the cached plans that should be invalidated upon adding/removing a
> partition somewhere in the partition tree won't be.
> 
> First I thought that we could remember just the root table RT index using
> a new Index field partitionRoot in Append, MergeAppend, and ModifyTable
> nodes and use it to locate and lock the root table during executor node
> initialization.  But soon realized that that's not enough, because it
> ignores the fact that adding/removing partitions at lower levels does not
> require taking a lock on the root table; only the immediate parent.  So
> any cached select/update/delete plans won't be invalidated when a new
> lower-level partition is added/removed, because the immediate parent would
> not have been added to the query's range table and hence the
> PlannedStmt.relationOids list.  Remember that the latter is used by
> plancache.c to remember the relations that a given cached plan depends on
> remaining unchanged.  So the patch now adds a list member called
> partitioned_rels to Append, MergeAppend, and ModifyTable nodes and stores
> the RT indexes of all the non-leaf tables in a partition tree with root
> table RT index at the head (note that these RT indexes are of the RTEs
> added by expand_inherited_rtenrty; also see below).  Since the
> partitioned_rels list is constructed when building paths and must be
> propagated to the plan nodes, the same field is also present in the
> corresponding Path nodes.  ExecInit* routines for the aforementioned nodes
> now locate and lock the non-leaf tables using the RT indexes in
> partitioned_rels.  Leaf tables are locked, as before, either by InitPlan
> (update/delete result relations case) or by ExecInitAppend or
> ExecInitMergeAppend when initializing the appendplans/mergeplans (select
> case).
> 
> The previous proposal was for expand_inherited_rtentry to not create RT
> entries and AppendRelInfo's for the non-leaf tables, but I think that
> doesn't work, as I tried to explain above.  We need RTEs because that
> seems to be the only way currently for informing the executor of the
> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
> RTEs for the latter planning steps to collect them in partitioned_rels
> mentioned above. So with the latest patch, we do create the RT entry and
> AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
> a minimal one; only parent_relid and child_relid are valid.  To make the
> latter planning steps ignore these minimal AppendRelInfo's, every
> AppendRelInfo is now marked with child_relkind.  Only
> set_append_rel_pathlist() and inheritance_planner() process them to
> collect the child_relid into the partitioned_rels list to be stored in
> AppendPath/MergeAppendPath and ModifyTablePath, respectively.

Sorry, forgot to attach the patches themselves.  Attached this time.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/13 19:30, Amit Langote wrote:
>> Here is the updated patch.
>>
>> Since this patch proposes to avoid creating scan nodes for non-leaf tables
>> in a partition tree, they won't be referenced anywhere in the resulting
>> plan tree.  So the executor will not lock those tables in the
>> select/update/delete cases. Insert is different since we lock all tables
>> in the partition tree when setting up tuple-routing in
>> ExecInitModifyTable.  Not taking executor locks on the tables means that
>> the cached plans that should be invalidated upon adding/removing a
>> partition somewhere in the partition tree won't be.
>>
>> First I thought that we could remember just the root table RT index using
>> a new Index field partitionRoot in Append, MergeAppend, and ModifyTable
>> nodes and use it to locate and lock the root table during executor node
>> initialization.  But soon realized that that's not enough, because it
>> ignores the fact that adding/removing partitions at lower levels does not
>> require taking a lock on the root table; only the immediate parent.  So
>> any cached select/update/delete plans won't be invalidated when a new
>> lower-level partition is added/removed, because the immediate parent would
>> not have been added to the query's range table and hence the
>> PlannedStmt.relationOids list.  Remember that the latter is used by
>> plancache.c to remember the relations that a given cached plan depends on
>> remaining unchanged.  So the patch now adds a list member called
>> partitioned_rels to Append, MergeAppend, and ModifyTable nodes and stores
>> the RT indexes of all the non-leaf tables in a partition tree with root
>> table RT index at the head (note that these RT indexes are of the RTEs
>> added by expand_inherited_rtenrty; also see below).  Since the
>> partitioned_rels list is constructed when building paths and must be
>> propagated to the plan nodes, the same field is also present in the
>> corresponding Path nodes.  ExecInit* routines for the aforementioned nodes
>> now locate and lock the non-leaf tables using the RT indexes in
>> partitioned_rels.  Leaf tables are locked, as before, either by InitPlan
>> (update/delete result relations case) or by ExecInitAppend or
>> ExecInitMergeAppend when initializing the appendplans/mergeplans (select
>> case).
>>
>> The previous proposal was for expand_inherited_rtentry to not create RT
>> entries and AppendRelInfo's for the non-leaf tables, but I think that
>> doesn't work, as I tried to explain above.  We need RTEs because that
>> seems to be the only way currently for informing the executor of the
>> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
>> RTEs for the latter planning steps to collect them in partitioned_rels
>> mentioned above. So with the latest patch, we do create the RT entry and
>> AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
>> a minimal one; only parent_relid and child_relid are valid.  To make the
>> latter planning steps ignore these minimal AppendRelInfo's, every
>> AppendRelInfo is now marked with child_relkind.  Only
>> set_append_rel_pathlist() and inheritance_planner() process them to
>> collect the child_relid into the partitioned_rels list to be stored in
>> AppendPath/MergeAppendPath and ModifyTablePath, respectively.
> 
> Sorry, forgot to attach the patches themselves.  Attached this time.

I made some changes to 0001:

* No need to create RelOptInfo's for partitioned RTEs added by
  expand_inherited_rtentry

* Instead of storing relkind of child tables in AppendRelInfo's, simply
  store if they are partitioned; IOW, change child_relkind to
  child_is_partitioned

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The previous proposal was for expand_inherited_rtentry to not create RT
> entries and AppendRelInfo's for the non-leaf tables, but I think that
> doesn't work, as I tried to explain above.  We need RTEs because that
> seems to be the only way currently for informing the executor of the
> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
> RTEs for the latter planning steps to collect them in partitioned_rels
> mentioned above. So with the latest patch, we do create the RT entry and
> AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
> a minimal one; only parent_relid and child_relid are valid.  To make the
> latter planning steps ignore these minimal AppendRelInfo's, every
> AppendRelInfo is now marked with child_relkind.  Only
> set_append_rel_pathlist() and inheritance_planner() process them to
> collect the child_relid into the partitioned_rels list to be stored in
> AppendPath/MergeAppendPath and ModifyTablePath, respectively.

I see your point, but I still think this kind of stinks.  You've got
all kinds of logic that is now conditional on child_is_partitioned,
and that seems like a recipe for bugs and future maintenance
difficulties.  It would be much nicer if we could come up with a
design that doesn't create the AppendRelInfo in the first place,
because then all of that stuff could just work.  Can we get away with
creating an RTE for each partitioned table (other than the parent,
perhaps; for that one it would be nice to use the inh-flagged RTE we
already have) but NOT creating an AppendRelInfo to go with it?  If
we've got the list of RTIs for the new RTEs associated with the append
path in some other form, can't we get by without also having an
AppendRelInfo to hold onto that translation?

The comments in InitPlan() explain why locks on result relations are
taken in that function directly rather than during the ExecInitNode
pass over the tree; it's because we need to make sure we take the
strongest lock on any given relation first.  But the changes in
ExecInitAppend and ExecInitMergeAppend are problematic in that regard;
some AccessShareLocks may already have been taken, and now we're
taking locks that in some case may be RowShareLock, which could cause
a lock upgrade.  Remember that there's no reason that you couldn't
join a table to one of its own partitions, or something like that.  I
think you need to try to jigger things so that InitPlan() takes all
locks stronger than AccessShareLock that are required anywhere in the
query, and then other nodes can take anything at AccessShareLock that
they need.

I think that eliding the Append node when there's only one child may
be unsafe in the case where the child's attribute numbers are
different from the parent's attribute numbers.  I remember Tom making
some comment about this when I was working on MergeAppend, although I
no longer remember the specific details.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
On Wed, Mar 15, 2017 at 3:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The previous proposal was for expand_inherited_rtentry to not create RT
>> entries and AppendRelInfo's for the non-leaf tables, but I think that
>> doesn't work, as I tried to explain above.  We need RTEs because that
>> seems to be the only way currently for informing the executor of the
>> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
>> RTEs for the latter planning steps to collect them in partitioned_rels
>> mentioned above. So with the latest patch, we do create the RT entry and
>> AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
>> a minimal one; only parent_relid and child_relid are valid.  To make the
>> latter planning steps ignore these minimal AppendRelInfo's, every
>> AppendRelInfo is now marked with child_relkind.  Only
>> set_append_rel_pathlist() and inheritance_planner() process them to
>> collect the child_relid into the partitioned_rels list to be stored in
>> AppendPath/MergeAppendPath and ModifyTablePath, respectively.
>
> I see your point, but I still think this kind of stinks.  You've got
> all kinds of logic that is now conditional on child_is_partitioned,
> and that seems like a recipe for bugs and future maintenance
> difficulties.  It would be much nicer if we could come up with a
> design that doesn't create the AppendRelInfo in the first place,
> because then all of that stuff could just work.  Can we get away with
> creating an RTE for each partitioned table (other than the parent,
> perhaps; for that one it would be nice to use the inh-flagged RTE we
> already have) but NOT creating an AppendRelInfo to go with it?  If
> we've got the list of RTIs for the new RTEs associated with the append
> path in some other form, can't we get by without also having an
> AppendRelInfo to hold onto that translation?
>

Will it help to retain the partition hierarchy as inheritance
hierarchy and then collapse it while creating append paths. That will
be needed by partition-wise join, will be helpful in partition pruning
without using constraints and so on. So, may be could use that
infrastructure to simplify the logic here. The patch is available as
0013 in [1].

[1] CAFjFpRfqotRR6cM3sooBHMHEVdkFfAZ6PyYg4GRZsoMuW08HjQ@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/15 13:38, Ashutosh Bapat wrote:
> On Wed, Mar 15, 2017 at 3:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> The previous proposal was for expand_inherited_rtentry to not create RT
>>> entries and AppendRelInfo's for the non-leaf tables, but I think that
>>> doesn't work, as I tried to explain above.  We need RTEs because that
>>> seems to be the only way currently for informing the executor of the
>>> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
>>> RTEs for the latter planning steps to collect them in partitioned_rels
>>> mentioned above. So with the latest patch, we do create the RT entry and
>>> AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
>>> a minimal one; only parent_relid and child_relid are valid.  To make the
>>> latter planning steps ignore these minimal AppendRelInfo's, every
>>> AppendRelInfo is now marked with child_relkind.  Only
>>> set_append_rel_pathlist() and inheritance_planner() process them to
>>> collect the child_relid into the partitioned_rels list to be stored in
>>> AppendPath/MergeAppendPath and ModifyTablePath, respectively.
>>
>> I see your point, but I still think this kind of stinks.  You've got
>> all kinds of logic that is now conditional on child_is_partitioned,
>> and that seems like a recipe for bugs and future maintenance
>> difficulties.  It would be much nicer if we could come up with a
>> design that doesn't create the AppendRelInfo in the first place,
>> because then all of that stuff could just work.  Can we get away with
>> creating an RTE for each partitioned table (other than the parent,
>> perhaps; for that one it would be nice to use the inh-flagged RTE we
>> already have) but NOT creating an AppendRelInfo to go with it?  If
>> we've got the list of RTIs for the new RTEs associated with the append
>> path in some other form, can't we get by without also having an
>> AppendRelInfo to hold onto that translation?
>>
> 
> Will it help to retain the partition hierarchy as inheritance
> hierarchy and then collapse it while creating append paths. That will
> be needed by partition-wise join, will be helpful in partition pruning
> without using constraints and so on. So, may be could use that
> infrastructure to simplify the logic here. The patch is available as
> 0013 in [1].
> 
> [1] CAFjFpRfqotRR6cM3sooBHMHEVdkFfAZ6PyYg4GRZsoMuW08HjQ@mail.gmail.com

IMHO, it would be better to keep those patches separate because the
problems being solved are different.  By the way, one of the reasons that
patch (as I had written it) was skipped was because it didn't cover the
inheritance_planner() case [1].  Your patch 0013 at the link should be
updated (maybe I should report on the partitionwise joins thread as well)
in some way to handle the update/delete case, because this happens:

create table p (a int, b char) partition by list (a);
create table p1 partition of p for values in (1) partition by list (b);
create table p1a partition of p1 for values in ('a');
create table p2 partition of p for values in (2);

explain (costs off) update p set a = a, b = 'b';           QUERY PLAN
-----------------------------------Update on p  Update on p  Update on p1 p  Update on p2  ->  Seq Scan on p  ->
Result       ->  Append              ->  Seq Scan on p1              ->  Seq Scan on p1a  ->  Seq Scan on p2
 
(10 rows)

update p set a = a, b = 'b';       server closed the connection unexpectedlyThis probably means the server terminated
abnormallybeforeor while processing the request.
 
The connection to the server was lost. Attempting reset: Failed.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmobMy%3DrqM%3DMTN_FUEfD-PiWSCSonH%2BZ1_SjL6ZmQ2GGz1w%40mail.gmail.com





Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
>>
>> Will it help to retain the partition hierarchy as inheritance
>> hierarchy and then collapse it while creating append paths. That will
>> be needed by partition-wise join, will be helpful in partition pruning
>> without using constraints and so on. So, may be could use that
>> infrastructure to simplify the logic here. The patch is available as
>> 0013 in [1].
>>
>> [1] CAFjFpRfqotRR6cM3sooBHMHEVdkFfAZ6PyYg4GRZsoMuW08HjQ@mail.gmail.com
>
> IMHO, it would be better to keep those patches separate because the
> problems being solved are different.  By the way, one of the reasons that
> patch (as I had written it) was skipped was because it didn't cover the
> inheritance_planner() case [1].  Your patch 0013 at the link should be
> updated (maybe I should report on the partitionwise joins thread as well)
> in some way to handle the update/delete case, because this happens:
>
> create table p (a int, b char) partition by list (a);
> create table p1 partition of p for values in (1) partition by list (b);
> create table p1a partition of p1 for values in ('a');
> create table p2 partition of p for values in (2);
>
> explain (costs off) update p set a = a, b = 'b';
>             QUERY PLAN
> -----------------------------------
>  Update on p
>    Update on p
>    Update on p1 p
>    Update on p2
>    ->  Seq Scan on p
>    ->  Result
>          ->  Append
>                ->  Seq Scan on p1
>                ->  Seq Scan on p1a
>    ->  Seq Scan on p2
> (10 rows)
>
> update p set a = a, b = 'b';
>         server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Thanks for pointing that out. I am able to reproduce the crash. I
think, we will need to teach it to add the indirect children as well.
Looks like we are missing a testcase for this scenario. I had run
regression with that patch, and didn't catch any failures and crashes.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/15 7:09, Robert Haas wrote:
> On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The previous proposal was for expand_inherited_rtentry to not create RT
>> entries and AppendRelInfo's for the non-leaf tables, but I think that
>> doesn't work, as I tried to explain above.  We need RTEs because that
>> seems to be the only way currently for informing the executor of the
>> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
>> RTEs for the latter planning steps to collect them in partitioned_rels
>> mentioned above. So with the latest patch, we do create the RT entry and
>> AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
>> a minimal one; only parent_relid and child_relid are valid.  To make the
>> latter planning steps ignore these minimal AppendRelInfo's, every
>> AppendRelInfo is now marked with child_relkind.  Only
>> set_append_rel_pathlist() and inheritance_planner() process them to
>> collect the child_relid into the partitioned_rels list to be stored in
>> AppendPath/MergeAppendPath and ModifyTablePath, respectively.
> 
> I see your point, but I still think this kind of stinks.  You've got
> all kinds of logic that is now conditional on child_is_partitioned,
> and that seems like a recipe for bugs and future maintenance
> difficulties.  It would be much nicer if we could come up with a
> design that doesn't create the AppendRelInfo in the first place,
> because then all of that stuff could just work.  Can we get away with
> creating an RTE for each partitioned table (other than the parent,
> perhaps; for that one it would be nice to use the inh-flagged RTE we
> already have) but NOT creating an AppendRelInfo to go with it?  If
> we've got the list of RTIs for the new RTEs associated with the append
> path in some other form, can't we get by without also having an
> AppendRelInfo to hold onto that translation?

I think we'll need to store *somewhere* the mapping of which inh=false
partitioned table RTE is the child of which inh=true (IOW, parent)
partitioned table RTE.  I've come to think that AppendRelInfos, although
contain extraneous information that won't be used, are better than
inventing something new altogether for time being.  AppendRelInfos are
referred to a few times by query_planner() steps before we eventually get
to either set_append_rel_pathlist() or inheritance_planner(), so not
changing that approach seems less worrisome for now.  So now if we both
create child RTEs and AppendRelInfos for the partitioned tables, we don't
need to change expand_inherited_rtentry() at all with this patch.
Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the
child partitioned table RTEs, because no path/plan need to be created.  We
can do away with having to create RelOptInfos for child partitioned table
RTEs, which I found to be not that invasive.

We seem to agree that we need to carry the partitioned table RT indexes
over to the executor somehow so that they are locked properly during
execution.  To that end, Append, MergeAppend, and ModifyTable (and the
corresponding path nodes) nodes will contain a list of those RT indexes in
a field called partitioned_rels.  In the result relation case, the list
from ModifyTable is copied to a field in PlannerGlobal called
nonleafResultRelations (next to the old resultRelations) and further to a
field of the same name in PlannedStmt.  I considered if it can be done
without a new field, but realized that the old resultRelations is tied
closely with how the EState manages result relations.

InitPlan() and AcquireExecutorLocks() lock the tables denoted by the RT
indexes in PlannedStmt.nonleafResultRelations using RowExclusiveLock, just
like they would those in PlannedStmt.resultRelations.

Since expand_inherited_rtentries() adds PlanRowMarks for partitioned
tables just like previously, InitPlan() and AcquireExecutorLocks() will
lock them with RowShareLock if needed.

Finally, ExecInitAppend() and ExecInitMergeAppend() must acquire
AccessShareLock on the partition tables, if not already locked by InitPlan().

> The comments in InitPlan() explain why locks on result relations are
> taken in that function directly rather than during the ExecInitNode
> pass over the tree; it's because we need to make sure we take the
> strongest lock on any given relation first.  But the changes in
> ExecInitAppend and ExecInitMergeAppend are problematic in that regard;
> some AccessShareLocks may already have been taken, and now we're
> taking locks that in some case may be RowShareLock, which could cause
> a lock upgrade.  Remember that there's no reason that you couldn't
> join a table to one of its own partitions, or something like that.  I
> think you need to try to jigger things so that InitPlan() takes all
> locks stronger than AccessShareLock that are required anywhere in the
> query, and then other nodes can take anything at AccessShareLock that
> they need.

Thanks for the explanation.  If the scheme in the new patch that I
described above sounds OK, I think it takes care of locking the
partitioned tables without the upgrade hazards.

> I think that eliding the Append node when there's only one child may
> be unsafe in the case where the child's attribute numbers are
> different from the parent's attribute numbers.  I remember Tom making
> some comment about this when I was working on MergeAppend, although I
> no longer remember the specific details.

Append node elision does not occur in the one-child case.  With the patch:

create table q (a int) partition by list (a);
explain select * from q;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=4)
   One-Time Filter: false
(2 rows)

create table q1 partition of q for values in (1);
explain select * from q;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..35.50 rows=2550 width=4)
   ->  Seq Scan on q1  (cost=0.00..35.50 rows=2550 width=4)
(2 rows)

Maybe that should be done, but this patch doesn't implement that.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Thu, Mar 16, 2017 at 6:03 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I think we'll need to store *somewhere* the mapping of which inh=false
> partitioned table RTE is the child of which inh=true (IOW, parent)
> partitioned table RTE.

I mean, for the children you're going to scan, that seems to be
necessary so that you can do things like translate targetlists to use
the correct varno.  But for the children you're not going to scan,
well, you need to know which ones they are so you can lock them, but
do you really need the parent-child mappings?  Or just a list of which
ones there are?

> I've come to think that AppendRelInfos, although
> contain extraneous information that won't be used, are better than
> inventing something new altogether for time being.  AppendRelInfos are
> referred to a few times by query_planner() steps before we eventually get
> to either set_append_rel_pathlist() or inheritance_planner(), so not
> changing that approach seems less worrisome for now.  So now if we both
> create child RTEs and AppendRelInfos for the partitioned tables, we don't
> need to change expand_inherited_rtentry() at all with this patch.
> Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the
> child partitioned table RTEs, because no path/plan need to be created.  We
> can do away with having to create RelOptInfos for child partitioned table
> RTEs, which I found to be not that invasive.

Yes, but on the flip side, you're having to add code in a lot of
places -- I think I counted 7 -- where you turn around and ignore
those AppendRelInfos.  That's a lot; how do we know we've got them
all?  I'm not sure what the patch would look like the other way, but
I'm hoping that you could just keep the list of partitioned table RTIs
someplace that mostly gets ignored, and then all of that special
handling could be ripped out.

> Append node elision does not occur in the one-child case.  With the patch:

Oh, OK.  Somehow the commit message you included led me to the
contrary conclusion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/16 22:16, Robert Haas wrote:
> On Thu, Mar 16, 2017 at 6:03 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I think we'll need to store *somewhere* the mapping of which inh=false
>> partitioned table RTE is the child of which inh=true (IOW, parent)
>> partitioned table RTE.
> 
> I mean, for the children you're going to scan, that seems to be
> necessary so that you can do things like translate targetlists to use
> the correct varno.  But for the children you're not going to scan,
> well, you need to know which ones they are so you can lock them, but
> do you really need the parent-child mappings?  Or just a list of which
> ones there are?

I thought we'd want to keep the RT indexes segregated per controlling
node, which is why we'd want to store the mapping somehow.  There may be
multiple partitioned tables in a query and one of them might be the result
relation.  If we propagate all of them in one list to the executor,
InitPlan wouldn't know which ones are target relations, for example.  But
I guess you don't mean passing all of them in one big list to the executor.

>> I've come to think that AppendRelInfos, although
>> contain extraneous information that won't be used, are better than
>> inventing something new altogether for time being.  AppendRelInfos are
>> referred to a few times by query_planner() steps before we eventually get
>> to either set_append_rel_pathlist() or inheritance_planner(), so not
>> changing that approach seems less worrisome for now.  So now if we both
>> create child RTEs and AppendRelInfos for the partitioned tables, we don't
>> need to change expand_inherited_rtentry() at all with this patch.
>> Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the
>> child partitioned table RTEs, because no path/plan need to be created.  We
>> can do away with having to create RelOptInfos for child partitioned table
>> RTEs, which I found to be not that invasive.
> 
> Yes, but on the flip side, you're having to add code in a lot of
> places -- I think I counted 7 -- where you turn around and ignore
> those AppendRelInfos.

Perhaps you were looking at the previous version with "minimal" appinfos
containing the child_is_partitioned field?  Following are the instances in
the latest patch posted yesterday:

- In set_append_re_size, do not count the partitioned child tables

- In set_append_rel_pathlist, do not create paths, but add to the list of
  the partitioned tables to be included in the final Append path

- In create_lateral_join_info(), no need to propagate lateral join info to
  partitioned child tables

- In inheritance_planner, no need to create plans for partitioned child
  tables, but add to the list of the partitioned tables to be included in
  the final ModifyTable path

In the previous version, because AppendRelInfo structure was itself
modified, there were more instances because more places would need to know
about the new kinds of appinfos (the minimal ones), which isn't the case
with the latest patch.  Although admittedly, there will be a lot of
useless manipulation of those appinfos whose only purpose is to carry the
parent-child relid mapping up to where we finally consume that
information.  Anyway, I've tried to implement an approach that doesn't
create AppendRelInfos for partitioned child tables as described below.

> That's a lot; how do we know we've got them
> all?  I'm not sure what the patch would look like the other way, but
> I'm hoping that you could just keep the list of partitioned table RTIs
> someplace that mostly gets ignored, and then all of that special
> handling could be ripped out.

Since we'll want to save the RT indexes when we create child RTEs, that is
in expand_inherited_rtentry(), we'd do use a data structure that parallels
root->append_rel_list.  The structure might consist of a the parent_relid
and a list of partitioned child RT indexes.  Since we build this
information early during subquery_planner() it might equally be subject to
errors of omission, although that somehow sounds less likely.

I've tried to implement something like that in the attached updated 0001.
The mapping structure PartitionedChildRelInfo (a Node in fact) consists of
parent_relid and the list of partitioned child RT indexes.  A global list
of these nodes is maintained in the root PlannerInfo called pcinfo_list.
In set_append_rel_pathlist() and inheritance_planner(), we search the
aforementioned list and copy the correct list to
Append/MergeAppend/ModifyTable path created for the parent.  Thoughts?

I also noticed that PlanRowMarks that are created by
expand_inherited_rtentry() for partitioned child tables are problematic -
they must somehow be ignored in the executor.  One way to do that is to
set isParent (making it a dummy entry) when initializing them, which is
what the patch does.  Comment is updated to say that just like inheritance
parent relations, PlanRowMarks for partitioned child tables are also
marked as dummy.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Fri, Mar 17, 2017 at 4:57 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Yes, but on the flip side, you're having to add code in a lot of
>> places -- I think I counted 7 -- where you turn around and ignore
>> those AppendRelInfos.
>
> Perhaps you were looking at the previous version with "minimal" appinfos
> containing the child_is_partitioned field?

Yes, I think I was.  I think this version looks a lot better.
    /*
+     * Close the root partitioned rel if we opened it above, but keep the
+     * lock.
+     */
+    if (rel != mtstate->resultRelInfo->ri_RelationDesc)
+        heap_close(rel, NoLock);

We didn't take a lock above, though, so drop everything in the comment
from "but" onward.

-    add_paths_to_append_rel(root, rel, live_childrels);
+    add_paths_to_append_rel(root, rel, live_childrels, partitioned_rels);

I think it would make more sense to put the new logic into
add_paths_to_append_rel, instead of passing this down as an additional
parameter.

+     * do not appear anywhere else in the plan.  Situation is exactly the

The situation is...

+    if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
+    {
+        foreach(lc, root->pcinfo_list)
+        {
+            PartitionedChildRelInfo *pc = lfirst(lc);
+
+            if (pc->parent_relid == parentRTindex)
+            {
+                partitioned_rels = pc->child_rels;
+                break;
+            }
+        }
+    }

You seem to have a few copies of this logic.  I think it would be
worth factoring it out into a separate function.

+                root->glob->nonleafResultRelations =
+                    list_concat(root->glob->nonleafResultRelations,
+                                list_copy(splan->partitioned_rels));

Please add a brief comment.  One line is fine.

+            newrc->isParent = childrte->relkind == RELKIND_PARTITIONED_TABLE;

I'm not sure what project style is, but I personally find these kinds
of assignments easier to read with an extra set of parantheses:
           newrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);

+    if (partitioned_rels == NIL)
+        return;
+
+    foreach(lc, partitioned_rels)

I think the if-test is pointless; the foreach loop is going to start
by comparing the initial value with NIL.

Why doesn't ExecSerializePlan() need to transfer a proper value for
nonleafResultRelations to workers?  Seems like it should.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/21 1:16, Robert Haas wrote:
> On Fri, Mar 17, 2017 at 4:57 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Yes, but on the flip side, you're having to add code in a lot of
>>> places -- I think I counted 7 -- where you turn around and ignore
>>> those AppendRelInfos.
>>
>> Perhaps you were looking at the previous version with "minimal" appinfos
>> containing the child_is_partitioned field?
> 
> Yes, I think I was.  I think this version looks a lot better.

Just to clarify, I assume you reviewed the latest version which does not
create AppendRelInfos, but instead creates PartitionedChildRelInfos (as
also evident from your comments below).  Sorry about the confusion.

>      /*
> +     * Close the root partitioned rel if we opened it above, but keep the
> +     * lock.
> +     */
> +    if (rel != mtstate->resultRelInfo->ri_RelationDesc)
> +        heap_close(rel, NoLock);
> 
> We didn't take a lock above, though, so drop everything in the comment
> from "but" onward.

Oh, right.

> -    add_paths_to_append_rel(root, rel, live_childrels);
> +    add_paths_to_append_rel(root, rel, live_childrels, partitioned_rels);
> 
> I think it would make more sense to put the new logic into
> add_paths_to_append_rel, instead of passing this down as an additional
> parameter.

OK, done.

> +     * do not appear anywhere else in the plan.  Situation is exactly the
> 
> The situation is...

Fixed.

> 
> +    if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
> +    {
> +        foreach(lc, root->pcinfo_list)
> +        {
> +            PartitionedChildRelInfo *pc = lfirst(lc);
> +
> +            if (pc->parent_relid == parentRTindex)
> +            {
> +                partitioned_rels = pc->child_rels;
> +                break;
> +            }
> +        }
> +    }
> 
> You seem to have a few copies of this logic.  I think it would be
> worth factoring it out into a separate function.

OK, done.  Put the definition in in planner.c

> +                root->glob->nonleafResultRelations =
> +                    list_concat(root->glob->nonleafResultRelations,
> +                                list_copy(splan->partitioned_rels));
> 
> Please add a brief comment.  One line is fine.

Done.

> 
> +            newrc->isParent = childrte->relkind == RELKIND_PARTITIONED_TABLE;
> 
> I'm not sure what project style is, but I personally find these kinds
> of assignments easier to read with an extra set of parantheses:
> 
>             newrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);

Ah, done.

> 
> +    if (partitioned_rels == NIL)
> +        return;
> +
> +    foreach(lc, partitioned_rels)
> 
> I think the if-test is pointless; the foreach loop is going to start
> by comparing the initial value with NIL.

Right, fixed.

> Why doesn't ExecSerializePlan() need to transfer a proper value for
> nonleafResultRelations to workers?  Seems like it should.

It doesn't transfer resultRelations either, presumably because workers do
not handle result relations yet.  Also, both resultRelations and
nonleafResultRelations are set *only* if there is a ModifyTable node.

Attached updated patches.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Tue, Mar 21, 2017 at 5:05 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.

Committed 0001 after removing a comma.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Simon Riggs
Date:
On 16 March 2017 at 10:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/03/15 7:09, Robert Haas wrote:

>> I think that eliding the Append node when there's only one child may
>> be unsafe in the case where the child's attribute numbers are
>> different from the parent's attribute numbers.  I remember Tom making
>> some comment about this when I was working on MergeAppend, although I
>> no longer remember the specific details.
>
> Append node elision does not occur in the one-child case.  With the patch:
...
> create table q1 partition of q for values in (1);
> explain select * from q;
>                          QUERY PLAN
> ------------------------------------------------------------
>  Append  (cost=0.00..35.50 rows=2550 width=4)
>    ->  Seq Scan on q1  (cost=0.00..35.50 rows=2550 width=4)
> (2 rows)
>
> Maybe that should be done, but this patch doesn't implement that.

Robert raises the possible problem that removing the Append wouldn't
work when the parent and child attribute numbers don't match. Surely
that never happens with partitions, by definition?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Tue, Mar 21, 2017 at 12:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 16 March 2017 at 10:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/03/15 7:09, Robert Haas wrote:
>
>>> I think that eliding the Append node when there's only one child may
>>> be unsafe in the case where the child's attribute numbers are
>>> different from the parent's attribute numbers.  I remember Tom making
>>> some comment about this when I was working on MergeAppend, although I
>>> no longer remember the specific details.
>>
>> Append node elision does not occur in the one-child case.  With the patch:
> ...
>> create table q1 partition of q for values in (1);
>> explain select * from q;
>>                          QUERY PLAN
>> ------------------------------------------------------------
>>  Append  (cost=0.00..35.50 rows=2550 width=4)
>>    ->  Seq Scan on q1  (cost=0.00..35.50 rows=2550 width=4)
>> (2 rows)
>>
>> Maybe that should be done, but this patch doesn't implement that.
>
> Robert raises the possible problem that removing the Append wouldn't
> work when the parent and child attribute numbers don't match. Surely
> that never happens with partitions, by definition?

No, the attribute numbers don't have to match.  This decision was made
a long time ago, and there have been a whole bunch of followup commits
since the original partitioning patch that were dedicated to fixing up
cases where that wasn't working properly in the original commit.  It
seems superficially attractive to require the attribute numbers to
match, but it creates some really unpleasant cases.  For example,
suppose a user tries to creates a partitioned table, drops a column,
then creates a standalone table which matches the apparent column list
of the partitioned table, then tries to attach it as a partition.

ERROR: the columns you previously dropped from the parent that you
can't see and don't know about aren't the same as the ones dropped
from the standalone table you're trying to attach as a partition
DETAIL: Try recreating your proposed new partition with a
pass-by-value column of width 8 after the third column, and then
dropping that column before trying to attach it as a partition.
HINT: Abandon all hope, ye who enter here.

Not cool with that.

The decision not to require the attribute numbers to match doesn't
necessarily mean we can't get rid of the Append node, though.  First
of all, in a lot of practical cases the attribute numbers will all
match.  Second, if they don't, the most that would be required is a
projection step, which could usually be done without a separate node
because most nodes are projection-capable.  And maybe not even that
much is needed; I'd have to go back and look at what Tom was worried
about the last time this came up.  (Hmm, maybe the problem had to do
with varnos matching, rather then attribute numbers?)

Another and independent problem with eliding the Append node is that,
if we did that, we'd still have to guarantee that the parent relation
corresponding to the Append node got locked somehow.  Otherwise, we'll
be accessing the tuple routing information for a table on which we
don't have a lock.  That's probably a solvable problem, too, but it
hasn't been solved yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Simon Riggs
Date:
On 21 March 2017 at 16:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 12:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 16 March 2017 at 10:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> On 2017/03/15 7:09, Robert Haas wrote:
>>
>>>> I think that eliding the Append node when there's only one child may
>>>> be unsafe in the case where the child's attribute numbers are
>>>> different from the parent's attribute numbers.  I remember Tom making
>>>> some comment about this when I was working on MergeAppend, although I
>>>> no longer remember the specific details.
>>>
>>> Append node elision does not occur in the one-child case.  With the patch:
>> ...
>>> create table q1 partition of q for values in (1);
>>> explain select * from q;
>>>                          QUERY PLAN
>>> ------------------------------------------------------------
>>>  Append  (cost=0.00..35.50 rows=2550 width=4)
>>>    ->  Seq Scan on q1  (cost=0.00..35.50 rows=2550 width=4)
>>> (2 rows)
>>>
>>> Maybe that should be done, but this patch doesn't implement that.
>>
>> Robert raises the possible problem that removing the Append wouldn't
>> work when the parent and child attribute numbers don't match. Surely
>> that never happens with partitions, by definition?
>
> No, the attribute numbers don't have to match.  This decision was made
> a long time ago, and there have been a whole bunch of followup commits
> since the original partitioning patch that were dedicated to fixing up
> cases where that wasn't working properly in the original commit.  It
> seems superficially attractive to require the attribute numbers to
> match, but it creates some really unpleasant cases.  For example,
> suppose a user tries to creates a partitioned table, drops a column,
> then creates a standalone table which matches the apparent column list
> of the partitioned table, then tries to attach it as a partition.
>
> ERROR: the columns you previously dropped from the parent that you
> can't see and don't know about aren't the same as the ones dropped
> from the standalone table you're trying to attach as a partition
> DETAIL: Try recreating your proposed new partition with a
> pass-by-value column of width 8 after the third column, and then
> dropping that column before trying to attach it as a partition.
> HINT: Abandon all hope, ye who enter here.
>
> Not cool with that.

Thanks for the explanation.

> The decision not to require the attribute numbers to match doesn't
> necessarily mean we can't get rid of the Append node, though.  First
> of all, in a lot of practical cases the attribute numbers will all
> match.  Second, if they don't, the most that would be required is a
> projection step, which could usually be done without a separate node
> because most nodes are projection-capable.  And maybe not even that
> much is needed; I'd have to go back and look at what Tom was worried
> about the last time this came up.  (Hmm, maybe the problem had to do
> with varnos matching, rather then attribute numbers?)

There used to be some code there to fix them up, not sure where that went.

> Another and independent problem with eliding the Append node is that,
> if we did that, we'd still have to guarantee that the parent relation
> corresponding to the Append node got locked somehow.  Otherwise, we'll
> be accessing the tuple routing information for a table on which we
> don't have a lock.  That's probably a solvable problem, too, but it
> hasn't been solved yet.

Hmm, why would we need to access tuple routing information?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Tue, Mar 21, 2017 at 1:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The decision not to require the attribute numbers to match doesn't
>> necessarily mean we can't get rid of the Append node, though.  First
>> of all, in a lot of practical cases the attribute numbers will all
>> match.  Second, if they don't, the most that would be required is a
>> projection step, which could usually be done without a separate node
>> because most nodes are projection-capable.  And maybe not even that
>> much is needed; I'd have to go back and look at what Tom was worried
>> about the last time this came up.  (Hmm, maybe the problem had to do
>> with varnos matching, rather then attribute numbers?)
>
> There used to be some code there to fix them up, not sure where that went.

Me neither.  To be clear in case I haven't been already, I'm totally
fine with somebody doing the work to get rid of the Append node; I
just think it'll take some investigation and work that hasn't been
done yet.

(I'm also a little skeptical about the value of the work.  The Append
node doesn't cost much; what's expensive is that the planner isn't
smart about planning queries that involve Append nodes and so getting
rid of one can improve the whole plan shape.  But I think the answer
to that problem is optimizations like partition-wise join and
partition-wise aggregate, which can handle cases where an Append has
any number of surviving children.  Eliminating the Append only helps
when the number of surviving children is exactly one.  Now, that's not
to say I'm going to fight a patch if somebody writes one, but I think
to some extent it's just a band-aid.)

>> Another and independent problem with eliding the Append node is that,
>> if we did that, we'd still have to guarantee that the parent relation
>> corresponding to the Append node got locked somehow.  Otherwise, we'll
>> be accessing the tuple routing information for a table on which we
>> don't have a lock.  That's probably a solvable problem, too, but it
>> hasn't been solved yet.
>
> Hmm, why would we need to access tuple routing information?

I didn't state that very well.  It's not so much that we need access
to the tuple routing information as that we need to replan if it
changes, because if the tuple routing information changes then we
might need to include partitions that were previously being pruned.
If we haven't got some kind of a lock on the parent, I'm pretty sure
that's not going to work reliably.  Synchronization of invalidation
traffic relies on DDL statements holding a lock that conflicts with
the lock held by the process using the table; if there is no such
lock, we might fail to notice that we need to replan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
Off-list by accident.  Re-adding the list.

On Tue, Mar 21, 2017 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 9:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Mar 21, 2017 at 5:05 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Attached updated patches.
>>
>> Committed 0001 after removing a comma.
>
> Regarding 0002, I notice this surprising behavior:
>
> rhaas=# create table foo (a int, b text) partition by list (a);
> CREATE TABLE
> rhaas=# select relfilenode from pg_class where oid = 'foo'::regclass;
>  relfilenode
> -------------
>        16385
> (1 row)
>
> Why do we end up with a relfilenode if there's no storage?  I would
> have expected to see 0 there.
>
> Other than that, there's not much to see here.  I'm a little worried
> that you might have missed some case that can result in an access to
> the file, but I can't find one.  Stuff I tried:
>
> VACUUM
> VACUUM FULL
> CLUSTER
> ANALYZE
> CREATE INDEX
> ALTER TABLE .. ALTER COLUMN .. TYPE
> TRUNCATE
>
> It would be good to go through and make sure all of those - and any
> others you can think of - are represented in the regression tests.
>
> Also, a documentation update is probably in order to explain that
> we're not going to accept heap_reloptions on the parent because the
> parent has no storage; such options must be set on the child in order
> to have effect.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
> On Tue, Mar 21, 2017 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Mar 21, 2017 at 9:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Tue, Mar 21, 2017 at 5:05 AM, Amit Langote
>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> Attached updated patches.
>>>
>>> Committed 0001 after removing a comma.
>>
>> Regarding 0002,

Thanks for the review.

>> I notice this surprising behavior:
>>
>> rhaas=# create table foo (a int, b text) partition by list (a);
>> CREATE TABLE
>> rhaas=# select relfilenode from pg_class where oid = 'foo'::regclass;
>>  relfilenode
>> -------------
>>        16385
>> (1 row)
>>
>> Why do we end up with a relfilenode if there's no storage?  I would
>> have expected to see 0 there.

Hmm, I see that happening even with views (and other relkinds that do not
have storage):

create table foo (a int);
create view foov as select * from foo;
select relfilenode from pg_class where relname = 'foov';
-[ RECORD 1 ]------
relfilenode | 24606

create type typ as (a int);
select relfilenode from pg_class where relname = 'typ';
-[ RECORD 1 ]------
relfilenode | 24610


After staring at the relevant code, fixing things so that relfilenode is 0
for relations without storage seems to be slightly involved. In the header
comment of relmapper.c, there is a line:

Mapped catalogs have zero in their pg_class.relfilenode entries.

which is kind of significant for the relcache initialization code.
RelationInitPhysicalAddr() called when building a relcache entry
initializes the rd_rel.relfilenode from pg_class.relfilenode if it's valid
and from relation mapper if invalid (i.e. 0).  So when relcache entries
for the mapped catalogs are being built, pg_class.relfilenode being
InvalidOid is a signal to get the same from the relmapper.  Since the same
code path is exercised in other cases, all relations supposedly without
storage would wrongly be looked up in the relmapper.  It may be possible
to fix that, but will require non-trivial rearrangement of relevant code.

>> Other than that, there's not much to see here.  I'm a little worried
>> that you might have missed some case that can result in an access to
>> the file, but I can't find one.  Stuff I tried:
>>
>> VACUUM
>> VACUUM FULL
>> CLUSTER
>> ANALYZE
>> CREATE INDEX
>> ALTER TABLE .. ALTER COLUMN .. TYPE
>> TRUNCATE
>>
>> It would be good to go through and make sure all of those - and any
>> others you can think of - are represented in the regression tests.

We now have tests that cover commands that previously would try to access
file for partitioned tables (most of those your listed above).  The commit
yesterday to eliminate scan nodes is also covered by tests.

Speaking of - do you think the following error message is reasonable when
a a partitioned table is passed to pg_prewarm():

select pg_prewarm('p');
ERROR:  fork "main" does not exist for this relation

It's the same if a view name is passed.

>> Also, a documentation update is probably in order to explain that
>> we're not going to accept heap_reloptions on the parent because the
>> parent has no storage; such options must be set on the child in order
>> to have effect.

Hmm, actually I am no longer sure if we should completely get rid of the
reloptions.  Some options like fillfactor don't make sense, but we might
want to use the values for autovacuum_* options when we will improve
autovacuum's handling of partitioned tables.  Maybe even parallel_workers
could be useful to specify for parent tables.  So, I took out reloptions.c
changes from the patch for now and documented that specifying the storage
options for partitioned parents is ineffective currently.  Does that make
sense?

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Maksim Milyutin
Date:
Hi!

I have noticed that there is scheduled unlinking of nonexistent physical 
storage under partitioned table when we execute DROP TABLE statement on 
this partitioned table. Though this action doesn't generate any error 
under typical behavior of postgres because the error of storage's lack 
is caught through if-statement [1] I think it is not safe.

My patch fixes this issue.

1. src/backend/storage/smgr/md.c:1385

-- 
Maksim Milyutin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
Hi,

On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
<m.milyutin@postgrespro.ru> wrote:
> Hi!
>
> I have noticed that there is scheduled unlinking of nonexistent physical
> storage under partitioned table when we execute DROP TABLE statement on this
> partitioned table. Though this action doesn't generate any error under
> typical behavior of postgres because the error of storage's lack is caught
> through if-statement [1] I think it is not safe.
>
> My patch fixes this issue.
>
> 1. src/backend/storage/smgr/md.c:1385

Good catch, will incorporate that in the main patch.

Thanks,
Amit



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/23 23:47, Amit Langote wrote:
> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
> <m.milyutin@postgrespro.ru> wrote:
>> Hi!
>>
>> I have noticed that there is scheduled unlinking of nonexistent physical
>> storage under partitioned table when we execute DROP TABLE statement on this
>> partitioned table. Though this action doesn't generate any error under
>> typical behavior of postgres because the error of storage's lack is caught
>> through if-statement [1] I think it is not safe.
>>
>> My patch fixes this issue.
>>
>> 1. src/backend/storage/smgr/md.c:1385
> 
> Good catch, will incorporate that in the main patch.

And here is the updated patch.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: Partitioned tables and relfilenode

From
Robert Haas
Date:
On Thu, Mar 23, 2017 at 8:54 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/03/23 23:47, Amit Langote wrote:
>> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
>> <m.milyutin@postgrespro.ru> wrote:
>>> Hi!
>>>
>>> I have noticed that there is scheduled unlinking of nonexistent physical
>>> storage under partitioned table when we execute DROP TABLE statement on this
>>> partitioned table. Though this action doesn't generate any error under
>>> typical behavior of postgres because the error of storage's lack is caught
>>> through if-statement [1] I think it is not safe.
>>>
>>> My patch fixes this issue.
>>>
>>> 1. src/backend/storage/smgr/md.c:1385
>>
>> Good catch, will incorporate that in the main patch.
>
> And here is the updated patch.

I think you should go back to the earlier strategy of disallowing heap
reloptions to be set on the partitioned table.  The thing is, we don't
really know which way we're going to want to go in the future.  Maybe
we'll come up with a system where you can set options on the
partitioned table, and those options will cascade to the children.  Or
maybe we'll come up with a system where partitioned tables have a
completely different set of options to control behaviors specific to
partitioned tables.  If we do the latter, then we don't want to also
have to support useless heap reloptions for forward compatibility, nor
do we want to break backward-compatibility to remove support.  If we
do the former, then it's better if we allow it in the same release
where it starts working.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partitioned tables and relfilenode

From
Maksim Milyutin
Date:
On 24.03.2017 03:54, Amit Langote wrote:
>
> And here is the updated patch.
>

Perhaps you forgot my fix in the updated patch

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 3999e6e..36917c8 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1823,7 +1823,8 @@ heap_drop_with_catalog(Oid relid)      */     if (rel->rd_rel->relkind != RELKIND_VIEW &&
rel->rd_rel->relkind!= RELKIND_COMPOSITE_TYPE &&
 
-        rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE)
+        rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE &&
+        rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)     {         RelationDropStorage(rel);     }

-- 
Maksim Milyutin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Partitioned tables and relfilenode

From
Amit Langote
Date:
Hi,

On 2017/03/28 0:13, Maksim Milyutin wrote:
> On 24.03.2017 03:54, Amit Langote wrote:
>>
>> And here is the updated patch.
>>
> 
> Perhaps you forgot my fix in the updated patch
> 
> diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
> index 3999e6e..36917c8 100644
> --- a/src/backend/catalog/heap.c
> +++ b/src/backend/catalog/heap.c
> @@ -1823,7 +1823,8 @@ heap_drop_with_catalog(Oid relid)
>       */
>      if (rel->rd_rel->relkind != RELKIND_VIEW &&
>          rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE &&
> -        rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE)
> +        rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE &&
> +        rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
>      {
>          RelationDropStorage(rel);
>      }

Oops, my bad.  I will include it in the patch I'll send after addressing
Robert's comments.  Thanks again!

Regards,
Amit





Re: Partitioned tables and relfilenode

From
Robert Haas
Date:
On Mon, Mar 27, 2017 at 8:59 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Oops, my bad.  I will include it in the patch I'll send after addressing
> Robert's comments.  Thanks again!

That patch coming soon?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/27 23:27, Robert Haas wrote:
> On Thu, Mar 23, 2017 at 8:54 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/03/23 23:47, Amit Langote wrote:
>>> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
>>> <m.milyutin@postgrespro.ru> wrote:
>>>> Hi!
>>>>
>>>> I have noticed that there is scheduled unlinking of nonexistent physical
>>>> storage under partitioned table when we execute DROP TABLE statement on this
>>>> partitioned table. Though this action doesn't generate any error under
>>>> typical behavior of postgres because the error of storage's lack is caught
>>>> through if-statement [1] I think it is not safe.
>>>>
>>>> My patch fixes this issue.
>>>>
>>>> 1. src/backend/storage/smgr/md.c:1385
>>>
>>> Good catch, will incorporate that in the main patch.
>>
>> And here is the updated patch.
> 
> I think you should go back to the earlier strategy of disallowing heap
> reloptions to be set on the partitioned table.  The thing is, we don't
> really know which way we're going to want to go in the future.  Maybe
> we'll come up with a system where you can set options on the
> partitioned table, and those options will cascade to the children.  Or
> maybe we'll come up with a system where partitioned tables have a
> completely different set of options to control behaviors specific to
> partitioned tables.  If we do the latter, then we don't want to also
> have to support useless heap reloptions for forward compatibility, nor
> do we want to break backward-compatibility to remove support.  If we
> do the former, then it's better if we allow it in the same release
> where it starts working.

You're right, modified the patch accordingly.

By the way, the previous version of the patch didn't really "disallow"
specifying heap reloptions though.  What I'd imagine that should entail is
CREATE TABLE or ALTER TABLE raising error if one of those reloptions is
specified, which didn't really occur with the patch.  The options were
silently accepted and stored into pg_class, but their values were never
used.  I modified the patch so that an error occurs instead of silently
accepting the user input.

create table p (a int) partition by list (a) with (fillfactor = 10);
ERROR:  unrecognized parameter "fillfactor" for a partitioned table

Thanks,
Amit

Re: Partitioned tables and relfilenode

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 29 Mar 2017 15:40:20 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<c4d71df2-9e0e-3912-dc81-9a72e080c238@lab.ntt.co.jp>
> On 2017/03/27 23:27, Robert Haas wrote:
> > On Thu, Mar 23, 2017 at 8:54 PM, Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> On 2017/03/23 23:47, Amit Langote wrote:
> >>> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
> >>> <m.milyutin@postgrespro.ru> wrote:
> >>>> Hi!
> >>>>
> >>>> I have noticed that there is scheduled unlinking of nonexistent physical
> >>>> storage under partitioned table when we execute DROP TABLE statement on this
> >>>> partitioned table. Though this action doesn't generate any error under
> >>>> typical behavior of postgres because the error of storage's lack is caught
> >>>> through if-statement [1] I think it is not safe.
> >>>>
> >>>> My patch fixes this issue.
> >>>>
> >>>> 1. src/backend/storage/smgr/md.c:1385
> >>>
> >>> Good catch, will incorporate that in the main patch.
> >>
> >> And here is the updated patch.
> > 
> > I think you should go back to the earlier strategy of disallowing heap
> > reloptions to be set on the partitioned table.  The thing is, we don't
> > really know which way we're going to want to go in the future.  Maybe
> > we'll come up with a system where you can set options on the
> > partitioned table, and those options will cascade to the children.  Or
> > maybe we'll come up with a system where partitioned tables have a
> > completely different set of options to control behaviors specific to
> > partitioned tables.  If we do the latter, then we don't want to also
> > have to support useless heap reloptions for forward compatibility, nor
> > do we want to break backward-compatibility to remove support.  If we
> > do the former, then it's better if we allow it in the same release
> > where it starts working.
> 
> You're right, modified the patch accordingly.
> 
> By the way, the previous version of the patch didn't really "disallow"
> specifying heap reloptions though.  What I'd imagine that should entail is
> CREATE TABLE or ALTER TABLE raising error if one of those reloptions is
> specified, which didn't really occur with the patch.  The options were
> silently accepted and stored into pg_class, but their values were never
> used.  I modified the patch so that an error occurs instead of silently
> accepting the user input.
> 
> create table p (a int) partition by list (a) with (fillfactor = 10);
> ERROR:  unrecognized parameter "fillfactor" for a partitioned table

The following attracted my eyes.

+      if (def->defnamespace == NULL &&
+          pg_strcasecmp(def->defname, "oids") != 0)
+        ereport(ERROR,
+          (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+           errmsg("unrecognized parameter \"%s\" for a partitioned table",

This works since defnamespace is always NULL here, but if I
understand correctly what we should do here is "reject any option
other than "(default).OID"". So I think that the condition should
be like the following.

+      if (def->defnamespace != NULL ||
+          pg_strcasecmp(def->defname, "oids") != 0)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Partitioned tables and relfilenode

From
Amit Langote
Date:
Horiguchi-san,

Thanks for taking a look.

On 2017/03/29 16:49, Kyotaro HORIGUCHI wrote:
> At Wed, 29 Mar 2017 15:40:20 +0900, Amit Langote wrote:
>> On 2017/03/27 23:27, Robert Haas wrote:
>>>> And here is the updated patch.
>>>
>>> I think you should go back to the earlier strategy of disallowing heap
>>> reloptions to be set on the partitioned table.  The thing is, we don't
>>> really know which way we're going to want to go in the future.  Maybe
>>> we'll come up with a system where you can set options on the
>>> partitioned table, and those options will cascade to the children.  Or
>>> maybe we'll come up with a system where partitioned tables have a
>>> completely different set of options to control behaviors specific to
>>> partitioned tables.  If we do the latter, then we don't want to also
>>> have to support useless heap reloptions for forward compatibility, nor
>>> do we want to break backward-compatibility to remove support.  If we
>>> do the former, then it's better if we allow it in the same release
>>> where it starts working.
>>
>> You're right, modified the patch accordingly.
>>
>> By the way, the previous version of the patch didn't really "disallow"
>> specifying heap reloptions though.  What I'd imagine that should entail is
>> CREATE TABLE or ALTER TABLE raising error if one of those reloptions is
>> specified, which didn't really occur with the patch.  The options were
>> silently accepted and stored into pg_class, but their values were never
>> used.  I modified the patch so that an error occurs instead of silently
>> accepting the user input.
>>
>> create table p (a int) partition by list (a) with (fillfactor = 10);
>> ERROR:  unrecognized parameter "fillfactor" for a partitioned table
> 
> The following attracted my eyes.
> 
> +      if (def->defnamespace == NULL &&
> +          pg_strcasecmp(def->defname, "oids") != 0)
> +        ereport(ERROR,
> +          (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +           errmsg("unrecognized parameter \"%s\" for a partitioned table",
> 
> This works since defnamespace is always NULL here, but if I
> understand correctly what we should do here is "reject any option
> other than "(default).OID"". So I think that the condition should
> be like the following.

You're right.  The following *wrongly* succeeds:

create table p (a int) partition by list (a) with
(toast.autovacuum_enabled = true);
CREATE TABLE

> +      if (def->defnamespace != NULL ||
> +          pg_strcasecmp(def->defname, "oids") != 0)

Looks correct, so incorporated in the attached updated patch.  Thanks.

Regards,
Amit

Re: Partitioned tables and relfilenode

From
Kyotaro HORIGUCHI
Date:
At Wed, 29 Mar 2017 17:21:26 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<b1234f04-53e0-011d-9f02-2437b909cce4@lab.ntt.co.jp>
> > Thanks for taking a look.

This patch is small enough to look at in a short time:p

> > The following attracted my eyes.
> > 
> > +      if (def->defnamespace == NULL &&
> > +          pg_strcasecmp(def->defname, "oids") != 0)
> > +        ereport(ERROR,
> > +          (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +           errmsg("unrecognized parameter \"%s\" for a partitioned table",
> > 
> > This works since defnamespace is always NULL here, but if I
> > understand correctly what we should do here is "reject any option
> > other than "(default).OID"". So I think that the condition should
> > be like the following.
> 
> You're right.  The following *wrongly* succeeds:

Ouch! I briefly checked that by "hoge.oids" without confirming
around.

> create table p (a int) partition by list (a) with
> (toast.autovacuum_enabled = true);
> CREATE TABLE
> 
> > +      if (def->defnamespace != NULL ||
> > +          pg_strcasecmp(def->defname, "oids") != 0)
> 
> Looks correct, so incorporated in the attached updated patch.  Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Partitioned tables and relfilenode

From
Robert Haas
Date:
On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Looks correct, so incorporated in the attached updated patch.  Thanks.

This seems like a hacky way to limit the reloptions to just OIDs.
Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
like that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/03/29 23:58, Robert Haas wrote:
> On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Looks correct, so incorporated in the attached updated patch.  Thanks.
> 
> This seems like a hacky way to limit the reloptions to just OIDs.
> Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
> like that?

OK, I tried that in the updated patch.

DefineRelation() and ATExecSetRelOptions() still calls heap_reloptions(),
but passes the new relopt_kind.  None of the options listed in
reloptions.c are of this kind as of now, so parseRelOptions() simply
outputs the "unrecognized parameter %s" in the case of partitioned tables
(except in some cases described below, but not because of the limitations
of this patch).  If and when partitioned tables start supporting the
existing parameters, we'll update the relopt_gen.kinds bitmask of those
parameters to allow them to be specified for partitioned tables.

With the patch:

create table parted (a int) partition by list (a) with (fillfactor = 10);
ERROR:  unrecognized parameter "fillfactor"

create table parted (a int) partition by list (a) with (toast.fillfactor =
10);
ERROR:  unrecognized parameter "fillfactor"

and:

create table parted (a int) partition by list (a) with (oids = true);
alter table parted set (fillfactor = 90);
ERROR:  unrecognized parameter "fillfactor"

but:

-- appears to succeed, whereas an error should have been reported (I think)
alter table parted set (toast.fillfactor = 10);
ALTER TABLE

-- even with views
create table foo (a int);
create view foov with (toast.fillfactor = 10) as select * from foo;
CREATE VIEW
alter view foov set (toast.fillfactor = 20);
ALTER VIEW

Nothing is stored in pg_class.reloptions really, but the validation that
should have occurred in parseRelOptions() doesn't.  This happens because
of the way transformRelOptions() works.  It will ignore the DefElem's that
don't apply to the specified relation based on the value of the namspace
parameter ("toast" or NULL).  That means it won't be included in the array
of options later received by parseRelOptions(), which is where the
validation occurs.

Also, alter table reset (xxx) never validates anything:

alter table foo reset (foo);
ALTER TABLE
alter table foo reset (foo.bar);
ALTER TABLE

Again, no pg_class.reloptions update occurs in this case. The reason this
happens is because transformRelOptions() never includes the options to be
reset in the array of options received by parseRelOptions(), so no
validation occurs.

But since both are existing behaviors, perhaps we can worry about it some
other time.

Thanks,
Amit

Re: Partitioned tables and relfilenode

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 30 Mar 2017 18:24:16 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<ee5f1cd4-4783-42e8-0263-648ae9c11264@lab.ntt.co.jp>
> On 2017/03/29 23:58, Robert Haas wrote:
> > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> Looks correct, so incorporated in the attached updated patch.  Thanks.
> > 
> > This seems like a hacky way to limit the reloptions to just OIDs.
> > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
> > like that?
> 
> OK, I tried that in the updated patch.

The name RELOPT_KIND_PARTITIONED looks somewhat odd. RELKIND for
partitioned tables is RELKIND_PARTITIONED_TABLE, so is this
better to be _TABLE, even if a bit longer?

parseRelOptions seems to return garbage pointer if no option of
the kind is available.

> DefineRelation() and ATExecSetRelOptions() still calls heap_reloptions(),
> but passes the new relopt_kind.  None of the options listed in
> reloptions.c are of this kind as of now, so parseRelOptions() simply
> outputs the "unrecognized parameter %s" in the case of partitioned tables
> (except in some cases described below, but not because of the limitations
> of this patch).  If and when partitioned tables start supporting the
> existing parameters, we'll update the relopt_gen.kinds bitmask of those
> parameters to allow them to be specified for partitioned tables.
> 
> With the patch:
> 
> create table parted (a int) partition by list (a) with (fillfactor = 10);
> ERROR:  unrecognized parameter "fillfactor"
> 
> create table parted (a int) partition by list (a) with (toast.fillfactor =
> 10);
> ERROR:  unrecognized parameter "fillfactor"
> 
> and:
> 
> create table parted (a int) partition by list (a) with (oids = true);
> alter table parted set (fillfactor = 90);
> ERROR:  unrecognized parameter "fillfactor"
> 
> but:
> 
> -- appears to succeed, whereas an error should have been reported (I think)
> alter table parted set (toast.fillfactor = 10);
> ALTER TABLE
> 
> -- even with views
> create table foo (a int);
> create view foov with (toast.fillfactor = 10) as select * from foo;
> CREATE VIEW
> alter view foov set (toast.fillfactor = 20);
> ALTER VIEW
> 
> Nothing is stored in pg_class.reloptions really, but the validation that
> should have occurred in parseRelOptions() doesn't.  This happens because
> of the way transformRelOptions() works.  It will ignore the DefElem's that
> don't apply to the specified relation based on the value of the namspace
> parameter ("toast" or NULL).  That means it won't be included in the array
> of options later received by parseRelOptions(), which is where the
> validation occurs.
> 
> Also, alter table reset (xxx) never validates anything:
> 
> alter table foo reset (foo);
> ALTER TABLE
> alter table foo reset (foo.bar);
> ALTER TABLE
> 
> Again, no pg_class.reloptions update occurs in this case. The reason this
> happens is because transformRelOptions() never includes the options to be
> reset in the array of options received by parseRelOptions(), so no
> validation occurs.
> 
> But since both are existing behaviors, perhaps we can worry about it some
> other time.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Partitioned tables and relfilenode

From
Amit Langote
Date:
Thanks for the review.

On Thu, Mar 30, 2017 at 7:37 PM, Kyotaro HORIGUCHI wrote:
> At Thu, 30 Mar 2017 18:24:16 +0900, Amit Langote wrote:
>> On 2017/03/29 23:58, Robert Haas wrote:
>> > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
>> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> >> Looks correct, so incorporated in the attached updated patch.  Thanks.
>> >
>> > This seems like a hacky way to limit the reloptions to just OIDs.
>> > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
>> > like that?
>>
>> OK, I tried that in the updated patch.
>
> The name RELOPT_KIND_PARTITIONED looks somewhat odd. RELKIND for
> partitioned tables is RELKIND_PARTITIONED_TABLE, so is this
> better to be _TABLE, even if a bit longer?

Hm, OK.  Done.

> parseRelOptions seems to return garbage pointer if no option of
> the kind is available.

Oops, fixed that too.

Updated patch attached.

Thanks,
Amit

Attachment

Re: Partitioned tables and relfilenode

From
Kyotaro HORIGUCHI
Date:
At Thu, 30 Mar 2017 20:58:35 +0900, Amit Langote <amitlangote09@gmail.com> wrote in
<CA+HiwqFmrJGe4eid2EaqnCjjQxrAYR5hJfzc5Cz929WA9p=A+w@mail.gmail.com>
> Thanks for the review.
> 
> On Thu, Mar 30, 2017 at 7:37 PM, Kyotaro HORIGUCHI wrote:
> > At Thu, 30 Mar 2017 18:24:16 +0900, Amit Langote wrote:
> >> On 2017/03/29 23:58, Robert Haas wrote:
> >> > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
> >> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> >> Looks correct, so incorporated in the attached updated patch.  Thanks.
> >> >
> >> > This seems like a hacky way to limit the reloptions to just OIDs.
> >> > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
> >> > like that?
> >>
> >> OK, I tried that in the updated patch.
> >
> > The name RELOPT_KIND_PARTITIONED looks somewhat odd. RELKIND for
> > partitioned tables is RELKIND_PARTITIONED_TABLE, so is this
> > better to be _TABLE, even if a bit longer?
> 
> Hm, OK.  Done.
> 
> > parseRelOptions seems to return garbage pointer if no option of
> > the kind is available.
> 
> Oops, fixed that too.
> 
> Updated patch attached.

Thank you.

- Applies cleanly on master (f90d23d)
- Compiled without error
- Code seems fine.
- Documentaion seems fine.. for me.
- Passes regression test.
- Leaving the ALTER-on-toast.* problem is fine for me.

The regression contains the tests to fail with several reloptions
only for partitioned tables. Are they still required even though
it is now in the same framework with other kind of reloptions?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/04/03 11:39, Amit Langote wrote:
> On 2017/04/01 5:29, Robert Haas wrote:
>> Thanks for reviewing, and thanks to Maksim as well, and thanks to Amit
>> for writing the patch.
> 
> Thanks for committing. :)

I noticed that I had missed a couple of places that would try to scan
partitioned tables, resulting in file access.

1. In validateCheckConstraint(), along with foreign tables, must ignore
partitioned tables.

2. DefineQueryRewrite() may try to scan a partitioned table in the case of
converting a table to view, where we must make sure that the table being
converted is empty.  It's checked by scanning the heap, which we should
not do for a partitioned table.  Nor should we try to drop the storage
once ready to make the table into a REKIND_VIEW relation (because all
other checks passed okaying the conversion).

Tests are added for both the cases.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Ashutosh Bapat
Date:
Looking at the number of issues where we have to fix tests based on
the relkind checks, I think, we have to consider creating macros as
described in my reply to thread with subject " Allowing extended stats
on foreign and partitioned tables".

On Tue, Apr 11, 2017 at 2:46 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/04/03 11:39, Amit Langote wrote:
>> On 2017/04/01 5:29, Robert Haas wrote:
>>> Thanks for reviewing, and thanks to Maksim as well, and thanks to Amit
>>> for writing the patch.
>>
>> Thanks for committing. :)
>
> I noticed that I had missed a couple of places that would try to scan
> partitioned tables, resulting in file access.
>
> 1. In validateCheckConstraint(), along with foreign tables, must ignore
> partitioned tables.
>
> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of
> converting a table to view, where we must make sure that the table being
> converted is empty.  It's checked by scanning the heap, which we should
> not do for a partitioned table.  Nor should we try to drop the storage
> once ready to make the table into a REKIND_VIEW relation (because all
> other checks passed okaying the conversion).
>
> Tests are added for both the cases.
>
> Thanks,
> Amit
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Tom Lane
Date:
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of
> converting a table to view, where we must make sure that the table being
> converted is empty.  It's checked by scanning the heap, which we should
> not do for a partitioned table.  Nor should we try to drop the storage
> once ready to make the table into a REKIND_VIEW relation (because all
> other checks passed okaying the conversion).

It looks like this patch intends to allow converting a partitioned table
to a view.  I would lobby for refusing the command, instead.  There is
no good reason to allow it, and it might well be a user error.
        regards, tom lane



Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Tue, Apr 11, 2017 at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of
>> converting a table to view, where we must make sure that the table being
>> converted is empty.  It's checked by scanning the heap, which we should
>> not do for a partitioned table.  Nor should we try to drop the storage
>> once ready to make the table into a REKIND_VIEW relation (because all
>> other checks passed okaying the conversion).
>
> It looks like this patch intends to allow converting a partitioned table
> to a view.  I would lobby for refusing the command, instead.  There is
> no good reason to allow it, and it might well be a user error.

Yeah, I agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/04/12 2:41, Robert Haas wrote:
> On Tue, Apr 11, 2017 at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>>> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of
>>> converting a table to view, where we must make sure that the table being
>>> converted is empty.  It's checked by scanning the heap, which we should
>>> not do for a partitioned table.  Nor should we try to drop the storage
>>> once ready to make the table into a REKIND_VIEW relation (because all
>>> other checks passed okaying the conversion).
>>
>> It looks like this patch intends to allow converting a partitioned table
>> to a view.  I would lobby for refusing the command, instead.  There is
>> no good reason to allow it, and it might well be a user error.
> 
> Yeah, I agree.

Alright.  So I made it into two patches instead: 0001 fixes the bug that
validateCheckConstraint() tries to scan partitioned tables and 0002 makes
trying to convert a partitioned table to a view a user error.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From
Robert Haas
Date:
On Tue, Apr 11, 2017 at 10:15 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Alright.  So I made it into two patches instead: 0001 fixes the bug that
> validateCheckConstraint() tries to scan partitioned tables and 0002 makes
> trying to convert a partitioned table to a view a user error.

Committed together, after updating the regression test outputs to make
the tests pass.  You forgot to update the expected output file in
0002.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partitioned tables and relfilenode

From
Amit Langote
Date:
On 2017/04/13 0:36, Robert Haas wrote:
> On Tue, Apr 11, 2017 at 10:15 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Alright.  So I made it into two patches instead: 0001 fixes the bug that
>> validateCheckConstraint() tries to scan partitioned tables and 0002 makes
>> trying to convert a partitioned table to a view a user error.
> 
> Committed together, after updating the regression test outputs to make
> the tests pass.  You forgot to update the expected output file in
> 0002.

Oops, sorry about that.  Thanks for fixing and committing.

Regards,
Amit