Thread: [HACKERS] Partitioned tables and relfilenode

[HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

10 February 2017, 09:19:47

The new partitioned tables do not contain any data by themselves. Any
data inserted into a partitioned table is routed to and stored in one of
its partitions. In fact, it is impossible to insert *any* data before a
partition (to be precise, a leaf partition) is created. It seems wasteful
then to allocate physical storage (files) for partitioned tables. If we
do not allocate the storage, then we must make sure that the right thing
happens when a command that is intended to manipulate a table's storage
encounters a partitioned table, the "right thing" here being that the
command's code either throws an error or warning (in some cases) if the
specified table is a partitioned table or ignores any partitioned tables
when it reads the list of relations to process from pg_class. Commands
that need to be taught about this are vacuum, analyze, truncate, and alter
table. Specifically:

- In case of vacuum, specifying a partitioned table causes a warning

- In case of analyze, we do not throw an error or warning but simply
avoid calling do_analyze_rel() *non-recursively*. Further in
acquire_inherited_sample_rows(), any partitioned tables in the list
returned by find_all_inheritors() are skipped.

- In case of truncate, only the part which manipulates table's physical
storage is skipped for partitioned tables.

- ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
tables, because there is nothing to be done.

- Since we cannot create indexes on partitioned tables anyway, there is
no need to handle cluster and reindex (they throw a meaningful error
already due to the lack of indexes.)

Patches 0001 and 0002 of the attached implement the above part. 0001
teaches the above mentioned commands to do the "right thing" as described
above and 0002 teaches heap_create() and heap_create_with_catalog() to not
create any physical storage (none of the forks) for partitioned tables.

Then comes 0003, which concerns inheritance planning. In a regular
inheritance set (ie, the inheritance set corresponding to an inheritance
hierarchy whose root is a regular table), the inheritance parents are
included in their role as child members, because they might contribute
rows to the final result. So AppendRelInfo's are created for each such
table by the planner prep phase, which the later planning steps use to
create a scan plan for those tables as the Append's child plans.
Currently, the partitioned tables are also processed by the optimizer as
inheritance sets. Partitioned table inheritance parents do not own any
storage, so we *must* not create scan plans for them. So we do not need
to process them as child members of the inheritance set. 0003 teaches
expand_inherited_rtentry() to not add partitioned tables as child members.
Also, since the root partitioned table RTE is no longer added to the
Append list as the 1st child member, inheritance_planner() cannot assume
that it can install the 1st child RTE as the nominalRelation of a given
ModifyTable node, instead the original root parent table RTE is installed
as the nominalRelation.

Together the above patches implement the first item listed in "Some
notes:" part of an email [1] on the original declarative partitioning
thread, which says:

"We should try to teach the executor never to scan the parent. That's
never necessary with this system, and it might add significant overhead.
We should also try to get rid of the idea of the parent having storage
(i.e. a relfilenode)."

Re: [HACKERS] Partitioned tables and relfilenode

From

Ashutosh Bapat

Date:

21 February 2017, 16:21:53

Some comments about 0003 patch.

@@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
    Index       rti;
+   RangeTblEntry *parent_rte;
There's already another variable declared in that function within a loop
    foreach(lc, root->append_rel_list)
    {
        ...
        RangeTblEntry *parent_rte;
        RangeTblEntry *child_rte;

You might want to choose a different name or delete the one within the loop.

I am wondering whether we should deal with inh flat reset in a
slightly different way. Let expand_inherited_rtentry() mark inh =
false for the partitioned tables without any partitions and deal with
those at the time of estimating size by marking those as dummy. That
might be better than multiple changes here. I will try to cook up a
patch soon for that.

Also we should add tests to make sure the scans on partitioned tables
without any partitions do not get into problems. PFA patch which adds
those tests.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0003-Always-plan-partitioned-tables-as-inheritance-sets_v2.patch

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

22 February 2017, 11:20:42

On 2017/02/21 22:21, Ashutosh Bapat wrote:
> Some comments about 0003 patch.
> 
> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
>     Index       rti;
> +   RangeTblEntry *parent_rte;
> There's already another variable declared in that function within a loop
>     foreach(lc, root->append_rel_list)
>     {
>         ...
>         RangeTblEntry *parent_rte;
>         RangeTblEntry *child_rte;
> 
> You might want to choose a different name or delete the one within the loop.

Deleted the one within the loop.

> I am wondering whether we should deal with inh flat reset in a
> slightly different way. Let expand_inherited_rtentry() mark inh =
> false for the partitioned tables without any partitions and deal with
> those at the time of estimating size by marking those as dummy. That
> might be better than multiple changes here. I will try to cook up a
> patch soon for that.

Are thinking something along the lines of the attached rewritten patch
0003?  I also tend to think that's probably a cleaner patch.  Thanks for
the idea.

> Also we should add tests to make sure the scans on partitioned tables
> without any partitions do not get into problems. PFA patch which adds
> those tests.

I added the test case you suggest, but kept just the first one.

I am including the patches 0001 and 0002 to keep all patches being
discussed on this thread together.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From

Ashutosh Bapat

Date:

22 February 2017, 15:58:19

>
>> I am wondering whether we should deal with inh flat reset in a
>> slightly different way. Let expand_inherited_rtentry() mark inh =
>> false for the partitioned tables without any partitions and deal with
>> those at the time of estimating size by marking those as dummy. That
>> might be better than multiple changes here. I will try to cook up a
>> patch soon for that.
>
> Are thinking something along the lines of the attached rewritten patch
> 0003?  I also tend to think that's probably a cleaner patch.  Thanks for
> the idea.

Yes. Thanks for working on it.

>
>> Also we should add tests to make sure the scans on partitioned tables
>> without any partitions do not get into problems. PFA patch which adds
>> those tests.
>
> I added the test case you suggest, but kept just the first one.

The second one was testing multi-level partitioning case, which
deserves a testcase by its own.

Some more comments

The comment below seems too verbose. All it can say is "A partitioned table
without any partitions results in a dummy relation." I don't think we need to
be explicit about rte->inh. But it's more of personal preference. We will leave
it to the committer, if you don't agree.
+                   /*
+                    * An empty partitioned table, i.e., one without any leaf
+                    * partitions, as signaled by rte->inh being set to false
+                    * by the prep phase (see expand_inherited_rtentry).
+                    */

We don't need this comment as well. Rather than repeating what's been said at
the top of the function, it should just refer to it like "nominal relation has
been set above for partitioned tables. For other parent relations, we'll use
the first child ...".
+        *
+        * If the parent is a partitioned table, we do not have a separate RTE
+        * representing it as a member of the inheritance set, because we do
+        * not create a scan plan for it.  As mentioned at the top of this
+        * function, we make the parent RTE itself the nominal relation.        */

Following condition is not very readable. It's not evident that it's of the
form (A && B) || C, at a glance it looks like it's A && (B || C).
+   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
+        list_length(appinfos) < 2) || list_length(appinfos) < 1)

Instead you may rearrage it as
min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
if (list_length(appinfos) < min_child_rels)

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

23 February 2017, 08:49:31

Thanks for the review.

On 2017/02/22 21:58, Ashutosh Bapat wrote:
>>> Also we should add tests to make sure the scans on partitioned tables
>>> without any partitions do not get into problems. PFA patch which adds
>>> those tests.
>>
>> I added the test case you suggest, but kept just the first one.
> 
> The second one was testing multi-level partitioning case, which
> deserves a testcase by its own.

Ah, okay.  Added it back.

> Some more comments
> 
> The comment below seems too verbose. All it can say is "A partitioned table
> without any partitions results in a dummy relation." I don't think we need to
> be explicit about rte->inh. But it's more of personal preference. We will leave
> it to the committer, if you don't agree.
> +                   /*
> +                    * An empty partitioned table, i.e., one without any leaf
> +                    * partitions, as signaled by rte->inh being set to false
> +                    * by the prep phase (see expand_inherited_rtentry).
> +                    */

It does seem poorly worded.  How about:

                /*
                 * A partitioned table without leaf partitions is marked
                 * as a dummy rel.
                 */

> We don't need this comment as well. Rather than repeating what's been said at
> the top of the function, it should just refer to it like "nominal relation has
> been set above for partitioned tables. For other parent relations, we'll use
> the first child ...".
> +        *
> +        * If the parent is a partitioned table, we do not have a separate RTE
> +        * representing it as a member of the inheritance set, because we do
> +        * not create a scan plan for it.  As mentioned at the top of this
> +        * function, we make the parent RTE itself the nominal relation.
>          */

Rewrote that comment block as:

     *
     * If the parent is a partitioned table, we already set the nominal
     * relation.
     */


> Following condition is not very readable. It's not evident that it's of the
> form (A && B) || C, at a glance it looks like it's A && (B || C).
> +   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
> +        list_length(appinfos) < 2) || list_length(appinfos) < 1)
> 
> Instead you may rearrage it as
> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
> if (list_length(appinfos) < min_child_rels)

OK, done that way.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From

Ashutosh Bapat

Date:

23 February 2017, 09:44:51

On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for the review.
>
> On 2017/02/22 21:58, Ashutosh Bapat wrote:
>>>> Also we should add tests to make sure the scans on partitioned tables
>>>> without any partitions do not get into problems. PFA patch which adds
>>>> those tests.
>>>
>>> I added the test case you suggest, but kept just the first one.
>>
>> The second one was testing multi-level partitioning case, which
>> deserves a testcase by its own.
>
> Ah, okay.  Added it back.

Thanks.

>
>> Some more comments
>>
>> The comment below seems too verbose. All it can say is "A partitioned table
>> without any partitions results in a dummy relation." I don't think we need to
>> be explicit about rte->inh. But it's more of personal preference. We will leave
>> it to the committer, if you don't agree.
>> +                   /*
>> +                    * An empty partitioned table, i.e., one without any leaf
>> +                    * partitions, as signaled by rte->inh being set to false
>> +                    * by the prep phase (see expand_inherited_rtentry).
>> +                    */
>
> It does seem poorly worded.  How about:
>
>                 /*
>                  * A partitioned table without leaf partitions is marked
>                  * as a dummy rel.
>                  */
>
>> We don't need this comment as well. Rather than repeating what's been said at
>> the top of the function, it should just refer to it like "nominal relation has
>> been set above for partitioned tables. For other parent relations, we'll use
>> the first child ...".
>> +        *
>> +        * If the parent is a partitioned table, we do not have a separate RTE
>> +        * representing it as a member of the inheritance set, because we do
>> +        * not create a scan plan for it.  As mentioned at the top of this
>> +        * function, we make the parent RTE itself the nominal relation.
>>          */
>
> Rewrote that comment block as:
>
>      *
>      * If the parent is a partitioned table, we already set the nominal
>      * relation.
>      */
>

I reworded those comments a bit and corrected grammar. Please check in
the attached patch.

>
>> Following condition is not very readable. It's not evident that it's of the
>> form (A && B) || C, at a glance it looks like it's A && (B || C).
>> +   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
>> +        list_length(appinfos) < 2) || list_length(appinfos) < 1)
>>
>> Instead you may rearrage it as
>> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
>> if (list_length(appinfos) < min_child_rels)
>
> OK, done that way.

On a second thought, I added a boolean to check if there were any
children created and then reset rte->inh based on that value. That's
better than relying on appinfos length now.

Let me know what you think.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0003-Avoid-creating-scan-nodes-for-partitioned-tables_v2.patch

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

23 February 2017, 10:38:29

Thanks for the review.

On 2017/02/23 15:44, Ashutosh Bapat wrote:
> On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote wrote:
>> Rewrote that comment block as:
>>
>>      *
>>      * If the parent is a partitioned table, we already set the nominal
>>      * relation.
>>      */
>>
> 
> I reworded those comments a bit and corrected grammar. Please check in
> the attached patch.

What was there sounds grammatically correct to me, but fine.

>>> Following condition is not very readable. It's not evident that it's of the
>>> form (A && B) || C, at a glance it looks like it's A && (B || C).
>>> +   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
>>> +        list_length(appinfos) < 2) || list_length(appinfos) < 1)
>>>
>>> Instead you may rearrage it as
>>> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
>>> if (list_length(appinfos) < min_child_rels)
>>
>> OK, done that way.
> 
> On a second thought, I added a boolean to check if there were any
> children created and then reset rte->inh based on that value. That's
> better than relying on appinfos length now.

@@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)    /*
+     * Partitioned tables do not have storage for themselves and should not be
+     * scanned.

@@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root,
RangeTblEntry *rte, Index rti)        /*
+         * Partitioned tables themselves do not have any storage and should not
+         * be scanned. So, do not create child relations for those.
+         */

I guess we should not have to repeat "partitioned tables do not have
storage" in all these places.

+ * a partitioned relation as dummy. The duplicate RTE we added for the
+ * parent table is harmless, so we don't bother to get rid of it; ditto for
+ * the useless PlanRowMark node.

There is no duplicate RTE in the partitioned table case, which even my
original comment failed to consider.  Can you, maybe?

Thanks,
Amit

Re: [HACKERS] Partitioned tables and relfilenode

From

Ashutosh Bapat

Date:

23 February 2017, 10:48:40

On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for the review.
>
> On 2017/02/23 15:44, Ashutosh Bapat wrote:
>> On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote wrote:
>>> Rewrote that comment block as:
>>>
>>>      *
>>>      * If the parent is a partitioned table, we already set the nominal
>>>      * relation.
>>>      */
>>>
>>
>> I reworded those comments a bit and corrected grammar. Please check in
>> the attached patch.
>
> What was there sounds grammatically correct to me, but fine.
>
>>>> Following condition is not very readable. It's not evident that it's of the
>>>> form (A && B) || C, at a glance it looks like it's A && (B || C).
>>>> +   if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
>>>> +        list_length(appinfos) < 2) || list_length(appinfos) < 1)
>>>>
>>>> Instead you may rearrage it as
>>>> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
>>>> if (list_length(appinfos) < min_child_rels)
>>>
>>> OK, done that way.
>>
>> On a second thought, I added a boolean to check if there were any
>> children created and then reset rte->inh based on that value. That's
>> better than relying on appinfos length now.
>
> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
>         /*
> +        * Partitioned tables do not have storage for themselves and should not be
> +        * scanned.
>
> @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root,
> RangeTblEntry *rte, Index rti)
>                 /*
> +                * Partitioned tables themselves do not have any storage and should not
> +                * be scanned. So, do not create child relations for those.
> +                */
>
> I guess we should not have to repeat "partitioned tables do not have
> storage" in all these places.

Hmm, you are right. But they are two different functions and the code
blocks are located away from each other. So, I thought it would be
better to have complete comment there, instead of pointing to other
places. I am fine, if we can reword it without compromising
readability.

>
> + * a partitioned relation as dummy. The duplicate RTE we added for the
> + * parent table is harmless, so we don't bother to get rid of it; ditto for
> + * the useless PlanRowMark node.
>
> There is no duplicate RTE in the partitioned table case, which even my
> original comment failed to consider.  Can you, maybe?

May be we could says "If we have added duplicate RTE for the parent
table, it is harmless ...". Does that sound good?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

23 February 2017, 11:08:56

On 2017/02/23 16:48, Ashutosh Bapat wrote:
> On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote wrote:
>> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
>>         /*
>> +        * Partitioned tables do not have storage for themselves and should not be
>> +        * scanned.
>>
>> @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root,
>> RangeTblEntry *rte, Index rti)
>>                 /*
>> +                * Partitioned tables themselves do not have any storage and should not
>> +                * be scanned. So, do not create child relations for those.
>> +                */
>>
>> I guess we should not have to repeat "partitioned tables do not have
>> storage" in all these places.
> 
> Hmm, you are right. But they are two different functions and the code
> blocks are located away from each other. So, I thought it would be
> better to have complete comment there, instead of pointing to other
> places. I am fine, if we can reword it without compromising
> readability.

I was saying in general.  I agree that different sites in the optimizer
may have different considerations for why partitioned tables are to be
handled specially, common theme being that we do not have to scan the
parent relation.  If the implementation changes someday such that we don't
manipulate child tables (especially, partitions) through most planning
phases anymore, then maybe we will start using some different terminology
where we don't have to stress this fact too much.  We're not there yet though.

>> + * a partitioned relation as dummy. The duplicate RTE we added for the
>> + * parent table is harmless, so we don't bother to get rid of it; ditto for
>> + * the useless PlanRowMark node.
>>
>> There is no duplicate RTE in the partitioned table case, which even my
>> original comment failed to consider.  Can you, maybe?
> 
> May be we could says "If we have added duplicate RTE for the parent
> table, it is harmless ...". Does that sound good?

Duplicate RTE added in the non-partitioned table case is harmless, so we
don't bother to get rid of it; ditto for the useless PlanRowMark node.

Thanks,
Amit

Re: [HACKERS] Partitioned tables and relfilenode

From

Ashutosh Bapat

Date:

23 February 2017, 11:49:17

On Thu, Feb 23, 2017 at 1:38 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/02/23 16:48, Ashutosh Bapat wrote:
>> On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote wrote:
>>> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
>>>         /*
>>> +        * Partitioned tables do not have storage for themselves and should not be
>>> +        * scanned.
>>>
>>> @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root,
>>> RangeTblEntry *rte, Index rti)
>>>                 /*
>>> +                * Partitioned tables themselves do not have any storage and should not
>>> +                * be scanned. So, do not create child relations for those.
>>> +                */
>>>
>>> I guess we should not have to repeat "partitioned tables do not have
>>> storage" in all these places.
>>
>> Hmm, you are right. But they are two different functions and the code
>> blocks are located away from each other. So, I thought it would be
>> better to have complete comment there, instead of pointing to other
>> places. I am fine, if we can reword it without compromising
>> readability.
>
> I was saying in general.  I agree that different sites in the optimizer
> may have different considerations for why partitioned tables are to be
> handled specially, common theme being that we do not have to scan the
> parent relation.  If the implementation changes someday such that we don't
> manipulate child tables (especially, partitions) through most planning
> phases anymore, then maybe we will start using some different terminology
> where we don't have to stress this fact too much.  We're not there yet though.

I agree.

>
>>> + * a partitioned relation as dummy. The duplicate RTE we added for the
>>> + * parent table is harmless, so we don't bother to get rid of it; ditto for
>>> + * the useless PlanRowMark node.
>>>
>>> There is no duplicate RTE in the partitioned table case, which even my
>>> original comment failed to consider.  Can you, maybe?
>>
>> May be we could says "If we have added duplicate RTE for the parent
>> table, it is harmless ...". Does that sound good?
>
> Duplicate RTE added in the non-partitioned table case is harmless, so we
> don't bother to get rid of it; ditto for the useless PlanRowMark node.

Fine with me.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Bruce Momjian

Date:

27 February 2017, 21:54:24

On Fri, Feb 10, 2017 at 03:19:47PM +0900, Amit Langote wrote:
> The new partitioned tables do not contain any data by themselves.  Any
> data inserted into a partitioned table is routed to and stored in one of
> its partitions.  In fact, it is impossible to insert *any* data before a
> partition (to be precise, a leaf partition) is created.  It seems wasteful
> then to allocate physical storage (files) for partitioned tables.  If we
> do not allocate the storage, then we must make sure that the right thing
> happens when a command that is intended to manipulate a table's storage
> encounters a partitioned table, the "right thing" here being that the
> command's code either throws an error or warning (in some cases) if the
> specified table is a partitioned table or ignores any partitioned tables
> when it reads the list of relations to process from pg_class.  Commands
> that need to be taught about this are vacuum, analyze, truncate, and alter
> table.  Specifically:
> 
> - In case of vacuum, specifying a partitioned table causes a warning
> 
> - In case of analyze, we do not throw an error or warning but simply
>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>   acquire_inherited_sample_rows(), any partitioned tables in the list
>   returned by find_all_inheritors() are skipped.
> 
> - In case of truncate, only the part which manipulates table's physical
>   storage is skipped for partitioned tables.
> 
> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>   tables, because there is nothing to be done.
> 
> - Since we cannot create indexes on partitioned tables anyway, there is
>   no need to handle cluster and reindex (they throw a meaningful error
>   already due to the lack of indexes.)

I don't think we are doing this, but if the parent table doesn't have a
physical file pg_upgrade will need to be taught that.  We have that case
now for unlogged tables on standby servers that we need to address.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

28 February 2017, 05:53:16

On 2017/02/28 3:54, Bruce Momjian wrote:
> On Fri, Feb 10, 2017 at 03:19:47PM +0900, Amit Langote wrote:
>> The new partitioned tables do not contain any data by themselves.  Any
>> data inserted into a partitioned table is routed to and stored in one of
>> its partitions.  In fact, it is impossible to insert *any* data before a
>> partition (to be precise, a leaf partition) is created.  It seems wasteful
>> then to allocate physical storage (files) for partitioned tables.  If we
>> do not allocate the storage, then we must make sure that the right thing
>> happens when a command that is intended to manipulate a table's storage
>> encounters a partitioned table, the "right thing" here being that the
>> command's code either throws an error or warning (in some cases) if the
>> specified table is a partitioned table or ignores any partitioned tables
>> when it reads the list of relations to process from pg_class.  Commands
>> that need to be taught about this are vacuum, analyze, truncate, and alter
>> table.  Specifically:
>>
>> - In case of vacuum, specifying a partitioned table causes a warning
>>
>> - In case of analyze, we do not throw an error or warning but simply
>>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>>   acquire_inherited_sample_rows(), any partitioned tables in the list
>>   returned by find_all_inheritors() are skipped.
>>
>> - In case of truncate, only the part which manipulates table's physical
>>   storage is skipped for partitioned tables.
>>
>> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>>   tables, because there is nothing to be done.
>>
>> - Since we cannot create indexes on partitioned tables anyway, there is
>>   no need to handle cluster and reindex (they throw a meaningful error
>>   already due to the lack of indexes.)
> 
> I don't think we are doing this, but if the parent table doesn't have a
> physical file pg_upgrade will need to be taught that.  We have that case
> now for unlogged tables on standby servers that we need to address.

Partitioned tables do have physical files as of now.  This thread is to
discuss the proposal to get rid of the physical file for partitioned tables.

By the way, I just tried pg_upgrade on top of this patch, and it seems
partitioned tables without the physical file migrate just fine to the new
cluster.  To test I did: created a partitioned table and few partitions,
inserted some data into it, pg_upgraded the cluster to find the
partitioned table intact with its data in the new cluster (to be clear,
the data is contained in partitions).  Is there something that wouldn't
work that I should instead be testing?

Also, it seems that the partitioned tables (without physical files) won't
have the same issue on the standby servers as unlogged tables.  It's just
that we route inserts into a partitioned table to its partitions and hence
the partitioned table itself does not contain because all the incoming
data is routed.  Am I missing something?

Thanks,
Amit

Re: [HACKERS] Partitioned tables and relfilenode

From

Robert Haas

Date:

28 February 2017, 06:20:17

On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for the review.

In 0001, the documentation which you are patching has a section for
limitations that apply only to both partitioning and constraint
exclusion, and another for limitations that apply only to constraint
exclusion.  Surely the patch should be moving a limitation that will
no longer apply to partitioning from the first section to the second
section, rather than leaving it in the section for limitations that
apply to both systems and just adding a note that say "this doesn't
apply to partitioning any more".

In acquire_inherited_sample_rows(), instead of inserting a whole
stanza of logic just above the existing dispatch on relkind, I think
we can get by with a very slightly update to what's already there.

You can't use the result of a & b as a bool.  You need to write (a &
b) != 0, because the bool should always use 1 for true and 0 for
false; it should not set some higher-numbered bit.

The changes to autovacuum.c look useless, because
relation_needs_vacanalyze() will presumably never fire for a table
with no tuples of its own.

Updated patch with those changes and a few cosmetic tweaks attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

avoid-useless-partition-ops.patch

Re: [HACKERS] Partitioned tables and relfilenode

From

Bruce Momjian

Date:

28 February 2017, 06:23:19

On Tue, Feb 28, 2017 at 11:53:16AM +0900, Amit Langote wrote:
> > I don't think we are doing this, but if the parent table doesn't have a
> > physical file pg_upgrade will need to be taught that.  We have that case
> > now for unlogged tables on standby servers that we need to address.
> 
> Partitioned tables do have physical files as of now.  This thread is to
> discuss the proposal to get rid of the physical file for partitioned tables.
> 
> By the way, I just tried pg_upgrade on top of this patch, and it seems
> partitioned tables without the physical file migrate just fine to the new
> cluster.  To test I did: created a partitioned table and few partitions,
> inserted some data into it, pg_upgraded the cluster to find the
> partitioned table intact with its data in the new cluster (to be clear,
> the data is contained in partitions).  Is there something that wouldn't
> work that I should instead be testing?
> 
> Also, it seems that the partitioned tables (without physical files) won't
> have the same issue on the standby servers as unlogged tables.  It's just
> that we route inserts into a partitioned table to its partitions and hence
> the partitioned table itself does not contain because all the incoming
> data is routed.  Am I missing something?

I see why it works now.  pg_upgrade only upgrades relations and
materialized views:
       "  WHERE relkind IN ('r', 'm') AND "

but partitions are defined as 'P':
#define       RELKIND_PARTITIONED_TABLE 'P'     /* partitioned table */

I am a little confused by the above.  Is a partitioned table the parent
or the children?  Reading the code it seems it is the parent, which
explains why it works.  Can I clarify that?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: [HACKERS] Partitioned tables and relfilenode

From

Michael Paquier

Date:

28 February 2017, 06:29:56

On Tue, Feb 28, 2017 at 12:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
> I am a little confused by the above.  Is a partitioned table the parent
> or the children?  Reading the code it seems it is the parent, which
> explains why it works.  Can I clarify that?

As I understand things, a partitioned table is a parent relation that
exists only to link to its child portions, holding as well the
definitions linking to each partition.
-- 
Michael

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

28 February 2017, 07:08:34

On 2017/02/28 12:29, Michael Paquier wrote:
> On Tue, Feb 28, 2017 at 12:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> I am a little confused by the above.  Is a partitioned table the parent
>> or the children?  Reading the code it seems it is the parent, which
>> explains why it works.  Can I clarify that?
> 
> As I understand things, a partitioned table is a parent relation that
> exists only to link to its child portions, holding as well the
> definitions linking to each partition.

Yes.  As I mentioned in my previous email, data inserted into the
partitioned table is routed to its partitions.  A partition may be itself
a partitioned table, so the routing continues until we find a leaf
partition.  All the partitioned tables in this chain leading to the leaf
partition are RELKIND_PARTITIONED_TABLE ('P') relations and only the leaf
tables are RELKIND_RELATION tables.

Thanks,
Amit

Re: [HACKERS] Partitioned tables and relfilenode

From

Ashutosh Bapat

Date:

28 February 2017, 07:52:17

Amit's original patch had an assertion failure in
extract_autovac_opts(). His patch adds partitioned tables in the list
of tables to be auto-analyzed. But there's an assertion in
extract_autovac_opts(), which did not consider partitioned tables.
When a partitioned table is created, the assertion trips in autovacuum
worker and resets the cluster, restarting all the backends. I have
updated Robert's patch with the assertion fixed.

On Tue, Feb 28, 2017 at 8:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Thanks for the review.
>
> In 0001, the documentation which you are patching has a section for
> limitations that apply only to both partitioning and constraint
> exclusion, and another for limitations that apply only to constraint
> exclusion.  Surely the patch should be moving a limitation that will
> no longer apply to partitioning from the first section to the second
> section, rather than leaving it in the section for limitations that
> apply to both systems and just adding a note that say "this doesn't
> apply to partitioning any more".
>
> In acquire_inherited_sample_rows(), instead of inserting a whole
> stanza of logic just above the existing dispatch on relkind, I think
> we can get by with a very slightly update to what's already there.
>
> You can't use the result of a & b as a bool.  You need to write (a &
> b) != 0, because the bool should always use 1 for true and 0 for
> false; it should not set some higher-numbered bit.
>
> The changes to autovacuum.c look useless, because
> relation_needs_vacanalyze() will presumably never fire for a table
> with no tuples of its own.
>
> Updated patch with those changes and a few cosmetic tweaks attached.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

avoid-useless-partition-ops_v2.patch

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

28 February 2017, 08:03:12

On 2017/02/28 13:52, Ashutosh Bapat wrote:
> Amit's original patch had an assertion failure in
> extract_autovac_opts(). His patch adds partitioned tables in the list
> of tables to be auto-analyzed. But there's an assertion in
> extract_autovac_opts(), which did not consider partitioned tables.
> When a partitioned table is created, the assertion trips in autovacuum
> worker and resets the cluster, restarting all the backends. I have
> updated Robert's patch with the assertion fixed.

Since Robert's patch backed out all autovacuum.c changes, partitioned
tables case would no longer be able to reach that Assert.

Thanks,
Amit

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

28 February 2017, 09:05:46

Thanks for the review.  I was just about to send a new version of the patches.

On 2017/02/28 12:20, Robert Haas wrote:
> On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Thanks for the review.
> 
> In 0001, the documentation which you are patching has a section for
> limitations that apply only to both partitioning and constraint
> exclusion, and another for limitations that apply only to constraint
> exclusion.  Surely the patch should be moving a limitation that will
> no longer apply to partitioning from the first section to the second
> section, rather than leaving it in the section for limitations that
> apply to both systems and just adding a note that say "this doesn't
> apply to partitioning any more".

So we have the following headings under 5.11.6 Caveats:

"The following caveats apply to partitioned tables implemented using
either method (unless noted otherwise)"

and,

"The following caveats apply to constraint exclusion"

The latter lists only the caveats about the planner capability (constraint
exclusion) for partitioning - that it cannot be performed against volatile
WHERE conditions, only works with comparisons using btree operators, does
not scale beyond hundred partitions, etc.  These limitations are common to
both systems (inheritance and declarative partitioned tables), because
both rely on constraint exclusion.

How about separating the limitations listed under the first heading into:

"Limitations of using inheritance for partitioning"

and,

"Limitations of using partitioned tables"

This particular patch gets rid of the restriction that vacuum/analyze do
not recurse to child tables (partitions) only for the second of the above.

How about the documentation changes in the attached updated 0001?  I know
that this page needs a much larger rewrite as we are discussing in the
other thread.

> In acquire_inherited_sample_rows(), instead of inserting a whole
> stanza of logic just above the existing dispatch on relkind, I think
> we can get by with a very slightly update to what's already there.
> 
> You can't use the result of a & b as a bool.  You need to write (a &
> b) != 0, because the bool should always use 1 for true and 0 for
> false; it should not set some higher-numbered bit.

Oops, thanks for fixing that.  I suppose you are referring to this hunk in
the original patch:

-    relations = get_rel_oids(relid, relation);
+    relations = get_rel_oids(relid, relation, options & VACOPT_VACUUM);

And we need to do it this way in *this* case, because we're passing it as
a bool argument.  I see that it's OK to do this:

    stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE";

Or this:

    if (options & VACOPT_VACUUM)
    {
        PreventTransactionChain(isTopLevel, stmttype);

> The changes to autovacuum.c look useless, because
> relation_needs_vacanalyze() will presumably never fire for a table
> with no tuples of its own.

Hmm, I guess that makes sense.  Inheritance statistics are never collected
by autovacuum (at least in the partitioning setups and especially with the
partitioned tables now).  I recall that you mentioned [1] this as one of
the limitations of the system currently.

> Updated patch with those changes and a few cosmetic tweaks attached.

I have incorporated the changes in your patch in the attached updated
0001, with documentation tweaks as mentioned above.

Other than that, in acquire_inherited_sample_rows() I've used a method
suggested by Ashutosh Bapat to track if we encountered a child table,
which is much less ugly than below:

-    if (nrels < 2)
+    if ((onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE && nrels <
2) ||
+        nrels < 1)

I keep updated 0002 and 0003 here, but I've changed their order.  Don't
scan partitioned tables thing is now 0002 and don't allocate file for
partitioned tables is 0003.  Sorry about the confusion.

Thanks,
Amit

[1]
https://postgr.es/m/CA%2BTgmobTxn2%2B0x96h5Le%2BGOK5kw3J37SRveNfzEdx9s5-Yd8vA%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From

Robert Haas

Date:

02 March 2017, 12:36:52

On Tue, Feb 28, 2017 at 11:35 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> How about the documentation changes in the attached updated 0001?  I know
> that this page needs a much larger rewrite as we are discussing in the
> other thread.

Looks good.

>> In acquire_inherited_sample_rows(), instead of inserting a whole
>> stanza of logic just above the existing dispatch on relkind, I think
>> we can get by with a very slightly update to what's already there.
>>
>> You can't use the result of a & b as a bool.  You need to write (a &
>> b) != 0, because the bool should always use 1 for true and 0 for
>> false; it should not set some higher-numbered bit.
>
> Oops, thanks for fixing that.  I suppose you are referring to this hunk in
> the original patch:
>
> -    relations = get_rel_oids(relid, relation);
> +    relations = get_rel_oids(relid, relation, options & VACOPT_VACUUM);
>
> And we need to do it this way in *this* case, because we're passing it as
> a bool argument.  I see that it's OK to do this:
>
>     stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE";
>
> Or this:
>
>     if (options & VACOPT_VACUUM)
>     {
>         PreventTransactionChain(isTopLevel, stmttype);

In those cases it's still clearer, IMHO, to use != 0, but it's not
necessary.  However, when you're explicitly creating a value of type
"bool", then it's necessary.

Actually, looking at this again, I now think this part is wrong:

+            /*
+             * If only ANALYZE is to be performed, there is no need to include
+             * partitions in the list.  In a database-wide ANALYZE, we only
+             * update the inheritance statistics of partitioned tables, not
+             * the statistics of individual partitions.
+             */
+            if (!is_vacuum && classForm->relispartition)                continue;

I was thinking earlier that an ANALYZE on the parent would also update
the statistics for each child, but now I see that's not so.  So now I
think we should omit this logic (and change the documentation to
match).  That is, a database-wide ANALYZE should update the statistics
for each child as well as for the parent.  Otherwise direct queries
against the children (and partitionwise joins, once we have that) are
going to go haywire.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

02 March 2017, 13:22:35

On 2017/03/02 18:36, Robert Haas wrote:
> On Tue, Feb 28, 2017 at 11:35 AM, Amit Langote wrote:
>>> In acquire_inherited_sample_rows(), instead of inserting a whole
>>> stanza of logic just above the existing dispatch on relkind, I think
>>> we can get by with a very slightly update to what's already there.
>>>
>>> You can't use the result of a & b as a bool.  You need to write (a &
>>> b) != 0, because the bool should always use 1 for true and 0 for
>>> false; it should not set some higher-numbered bit.
>>
>> Oops, thanks for fixing that.  I suppose you are referring to this hunk in
>> the original patch:
>>
>> -    relations = get_rel_oids(relid, relation);
>> +    relations = get_rel_oids(relid, relation, options & VACOPT_VACUUM);
>>
>> And we need to do it this way in *this* case, because we're passing it as
>> a bool argument.  I see that it's OK to do this:
>>
>>     stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE";
>>
>> Or this:
>>
>>     if (options & VACOPT_VACUUM)
>>     {
>>         PreventTransactionChain(isTopLevel, stmttype);
> 
> In those cases it's still clearer, IMHO, to use != 0, but it's not
> necessary.  However, when you're explicitly creating a value of type
> "bool", then it's necessary.

Agreed.

> Actually, looking at this again, I now think this part is wrong:
> 
> +            /*
> +             * If only ANALYZE is to be performed, there is no need to include
> +             * partitions in the list.  In a database-wide ANALYZE, we only
> +             * update the inheritance statistics of partitioned tables, not
> +             * the statistics of individual partitions.
> +             */
> +            if (!is_vacuum && classForm->relispartition)
>                  continue;
> 
> I was thinking earlier that an ANALYZE on the parent would also update
> the statistics for each child, but now I see that's not so.  So now I

Yep, the patch enables ANALYZE to be propagated to partitions when the
parent table is specified in the command.  The above logic in the patch
made the database-wide ANALYZE to ignore partitions, in which case, only
the inheritance statistics would be updated.  I can also see why that'd be
undesirable.

> think we should omit this logic (and change the documentation to
> match).  That is, a database-wide ANALYZE should update the statistics
> for each child as well as for the parent.  Otherwise direct queries
> against the children (and partitionwise joins, once we have that) are
> going to go haywire.

OK, done.  I updated both analyze.sgml and vacuum.sgml to be more up to
date.  Both pages previously omitted materialized views.

Attached updated patches.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From

Robert Haas

Date:

02 March 2017, 15:48:20

On Thu, Mar 2, 2017 at 3:52 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> think we should omit this logic (and change the documentation to
>> match).  That is, a database-wide ANALYZE should update the statistics
>> for each child as well as for the parent.  Otherwise direct queries
>> against the children (and partitionwise joins, once we have that) are
>> going to go haywire.
>
> OK, done.  I updated both analyze.sgml and vacuum.sgml to be more up to
> date.  Both pages previously omitted materialized views.
>
> Attached updated patches.

Thanks, committed 0001 with a slight change to the wording.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

03 March 2017, 04:02:11

On 2017/03/02 21:48, Robert Haas wrote:
> On Thu, Mar 2, 2017 at 3:52 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> think we should omit this logic (and change the documentation to
>>> match).  That is, a database-wide ANALYZE should update the statistics
>>> for each child as well as for the parent.  Otherwise direct queries
>>> against the children (and partitionwise joins, once we have that) are
>>> going to go haywire.
>>
>> OK, done.  I updated both analyze.sgml and vacuum.sgml to be more up to
>> date.  Both pages previously omitted materialized views.
>>
>> Attached updated patches.
> 
> Thanks, committed 0001 with a slight change to the wording.

Thanks.  I noticed that 'and' is duplicated in a line added by the commit
to analyze.sgml.  Attached 0001 fixes that.  0002 and 0003 same as the
last version.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From

Michael Paquier

Date:

06 March 2017, 09:41:22

On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks.  I noticed that 'and' is duplicated in a line added by the commit
> to analyze.sgml.  Attached 0001 fixes that.  0002 and 0003 same as the
> last version.
   /*
-    * If all the children were temp tables, pretend it's a non-inheritance
-    * situation.  The duplicate RTE we added for the parent table is
-    * harmless, so we don't bother to get rid of it; ditto for the useless
-    * PlanRowMark node.
+    * If all the children were temp tables or if the parent is a partitioned
+    * table without any leaf partitions, pretend it's a non-inheritance
+    * situation.  The duplicate RTE for the parent table we added in the
+    * non-partitioned table case is harmless, so we don't bother to get rid
+    * of it; ditto for the useless PlanRowMark node.    */
-   if (list_length(appinfos) < 2)
+   if (!has_child)
This comment is not completely correct. Children can be temp tables,
they just cannot be temp tables of other backends. It seems to me that
you could still keep this code simple and remove has_child..

@@ -932,7 +932,6 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,       case RELKIND_RELATION:       case
RELKIND_TOASTVALUE:      case RELKIND_MATVIEW:
 
-       case RELKIND_PARTITIONED_TABLE:           options = heap_reloptions(classForm->relkind, datum, false);
break;
 
Partitioned tables cannot have reloptions? What about all the
autovacuum_* parameters then? Those are mainly not storage-related.
-- 
Michael

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

06 March 2017, 10:18:44

Thanks for the review.

On 2017/03/06 15:41, Michael Paquier wrote:
> On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Thanks.  I noticed that 'and' is duplicated in a line added by the commit
>> to analyze.sgml.  Attached 0001 fixes that.  0002 and 0003 same as the
>> last version.
> 
>     /*
> -    * If all the children were temp tables, pretend it's a non-inheritance
> -    * situation.  The duplicate RTE we added for the parent table is
> -    * harmless, so we don't bother to get rid of it; ditto for the useless
> -    * PlanRowMark node.
> +    * If all the children were temp tables or if the parent is a partitioned
> +    * table without any leaf partitions, pretend it's a non-inheritance
> +    * situation.  The duplicate RTE for the parent table we added in the
> +    * non-partitioned table case is harmless, so we don't bother to get rid
> +    * of it; ditto for the useless PlanRowMark node.
>      */
> -   if (list_length(appinfos) < 2)
> +   if (!has_child)
> This comment is not completely correct. Children can be temp tables,
> they just cannot be temp tables of other backends. It seems to me that
> you could still keep this code simple and remove has_child..

I updated the comment.  I recall having posted a patch for that once, but
perhaps went unnoticed.

About has_child, the other option is to make the minimum length of
appinfos list relkind-based, but  the condition quickly becomes ugly. Do
you have a suggestion?

> @@ -932,7 +932,6 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
>         case RELKIND_RELATION:
>         case RELKIND_TOASTVALUE:
>         case RELKIND_MATVIEW:
> -       case RELKIND_PARTITIONED_TABLE:
>             options = heap_reloptions(classForm->relkind, datum, false);
>             break;
> Partitioned tables cannot have reloptions? What about all the
> autovacuum_* parameters then? Those are mainly not storage-related.

AFAIK, none of the heap reloptions will be applicable to partitioned table
relations once we eliminate storage.

About autovacuum_* parameters - we currently don't handle partitioned
tables in autovacuum.c, because no statistics are reported for them. That
is, relation_needs_vacanalyze() will never return true for dovacuum,
doanalyze and wraparound if it is passed a RELKIND_PARTITIONED_TABLE
relation.  That's something to be fixed separately though.  When we add
autovacuum support for partitioned tables, we may want to add a new set of
reloptions (new because partitioned tables still won't support all options
returned by heap_reloptions()).  Am I missing something?

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 2017/03/08 2:27, Robert Haas wrote:
> On Tue, Mar 7, 2017 at 12:11 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I see that all the changes by Amit and myself to what was earlier 0003
>> patch are now part of 0002 patch. The patch looks ready for committer.
> 
> Reviewing 0002:

Thanks for the review.

> 
> This patch seems to have falsified the header comments for
> expand_inherited_rtentry.

Oops, updated.

> I don't quite understand the need for the change to set_rel_size().
> The rte->inh case is handled before reaching the new code, and IIUC
> the !rte->inh case should never happen given the changes to
> expand_inherited_rtentry.  Or am I confused?

In expand_inherited_rtentry(), patch only changes the rule about the
minimum number of child RTEs required to keep rte->inh true.  In the
traditional case, it is 2 (the table itself and at least one child).  For
a partitioned table, since we don't want to scan the table itself, that
becomes 1 (at least one leaf partition).

So, it's still possible for rte->inh to become false if the required
minimum is not met, even for partitioned tables.

> If not, I think we
> should just add an Assert() to the "plain relation" case that this is
> not RELKIND_PARTITIONED_TABLE (with a comment explaining why it can't
> happen).

How about adding Assert(rte->relkind != RELKIND_PARTITIONED_TABLE) at the
beginning of the following functions as the updated patch does:

set_plain_rel_size()
set_tablesample_rel_size()
set_plain_rel_pathlist()
set_tablesample_rel_pathlist()

> In inheritance_planner, this comment addition does not seem adequate:
> 
> +         * If the parent is a partitioned table, we already set the nominal
> +         * relation.
> 
> That basically contradicts the previous paragraph.  I think you need
> to adjust the previous paragraph instead of adding a new one, probably
> in multiple places.  For example, the parenthesized bit now only
> applies in the non-partitioned case.

Hmm, that's true.  I rewrote the entire comment.

> 
> -    rel = mtstate->resultRelInfo->ri_RelationDesc;
> +    nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table);
> +    nominalRel = heap_open(nominalRTE->relid, NoLock);
> 
> No lock?

Hmm, I think I missed that a partitioned parent table would not be locked
by the *executor* at all after applying this patch.  As of now, InitPlan()
takes care of locking all the result relations in the
PlannedStmt.resultRelations list.  This patch removes partitioned
parent(s) from appearing in this list.  But that makes me wonder - aren't
the locks taken by expand_inherited_rtentry() kept long enough?  Why does
InitPlan need to acquire them again?  I see this comment in
expand_inherited_rtentry:

   * If the parent relation is the query's result relation, then we need
   * RowExclusiveLock.  Otherwise, if it's accessed FOR UPDATE/SHARE, we
   * need RowShareLock; otherwise AccessShareLock.  We can't just grab
   * AccessShareLock because then the executor would be trying to upgrade
   * the lock, leading to possible deadlocks.  (This code should match the
   * parser and rewriter.)

So, I assumed all the necessary locking is being taken care of here.

Anyway, I changed NoLock to RowExclusiveLock here for now, but maybe it's
either not necessary or a way should be found to do that in InitPlan along
with other result relations.

> Another thing that bothers me about this is that, apparently, the use
> of nominalRelation is no longer strictly for EXPLAIN, as the comments
> in plannodes.h/relation.h still claim that it is.  I'm not sure how to
> adjust that exactly; there's not a lot of room in those comments to
> give all the details.  Maybe they should simply say something like /*
> Parent RT index */ instead of /* Parent RT index for use of EXPLAIN
> */.  But we can't just go around changing the meaning of things
> without updating the comments accordingly.

It seems that this comment and one more need to be updated.  The other
comment change is in explain.c as follows:

  * Adds the relid of each referenced RTE to *rels_used.  The result controls
  * which RTEs are assigned aliases by select_rtable_names_for_explain.
  * This ensures that we don't confusingly assign un-suffixed aliases to RTEs
- * that never appear in the EXPLAIN output (such as inheritance parents).
+ * that never appear in the EXPLAIN output (such as non-partitioned
+ * inheritance parents).
  */
 static bool
 ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)

I studied the EXPLAIN code a bit to see whether the problem cited for
using inheritance parent RTE as nominalRelation (for example, in comments
in inheritance_planner) apply to the partitioned parent RTE case, which it
doesn't.  The partitioned parent RTE won't appear anywhere else in the
plan.  So, different aliases for what's the same table/RTE won't happen in
the partitioned parent case.

> A related question is
> whether we should really be using nominalRelation for this or whether
> there's some other way we should be getting at the parent -- I don't
> have another idea, though.

This is just for the ModifyTable node.  Instead of adding any new member,
I thought nominalRelation would work fine.  But as I said above regarding
locking, we *may* need to find a different place for the partitioned
parent RTE (other than ModifyTable.nominalRelation).

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 2017/03/13 19:24, Amit Langote wrote:
> On 2017/03/10 17:57, Amit Langote wrote:
>> On 2017/03/08 22:36, Robert Haas wrote:
>>> On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote wrote:
>>>>> -    rel = mtstate->resultRelInfo->ri_RelationDesc;
>>>>> +    nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table);
>>>>> +    nominalRel = heap_open(nominalRTE->relid, NoLock);
>>>>>
>>>>> No lock?
>>>>
>>>> Hmm, I think I missed that a partitioned parent table would not be locked
>>>> by the *executor* at all after applying this patch.  As of now, InitPlan()
>>>> takes care of locking all the result relations in the
>>>> PlannedStmt.resultRelations list.  This patch removes partitioned
>>>> parent(s) from appearing in this list.  But that makes me wonder - aren't
>>>> the locks taken by expand_inherited_rtentry() kept long enough?  Why does
>>>> InitPlan need to acquire them again?  I see this comment in
>>>> expand_inherited_rtentry:
>>>
>>> Parse-time locks, plan-time locks, and execution-time locks are all
>>> separate.  See the header comments for AcquireRewriteLocks in
>>> rewriteHandler.c; think about a view, where the parsed query has been
>>> stored and is later rewritten and planned.  Similarly, a plan can be
>>> reused even after the transaction that created that plan has
>>> committed; see AcquireExecutorLocks in plancache.c.
>>
>> Thanks for those references.
>>
>> I took a step back here and thought a bit more about the implications this
>> patch.  It occurred to me that the complete absence of partitioned table
>> RT entries in the plan-tree has more consequences than I originally
>> imagined.  I will post an updated patch by Monday latest.
> 
> Here is the updated patch.
> 
> Since this patch proposes to avoid creating scan nodes for non-leaf tables
> in a partition tree, they won't be referenced anywhere in the resulting
> plan tree.  So the executor will not lock those tables in the
> select/update/delete cases. Insert is different since we lock all tables
> in the partition tree when setting up tuple-routing in
> ExecInitModifyTable.  Not taking executor locks on the tables means that
> the cached plans that should be invalidated upon adding/removing a
> partition somewhere in the partition tree won't be.
> 
> First I thought that we could remember just the root table RT index using
> a new Index field partitionRoot in Append, MergeAppend, and ModifyTable
> nodes and use it to locate and lock the root table during executor node
> initialization.  But soon realized that that's not enough, because it
> ignores the fact that adding/removing partitions at lower levels does not
> require taking a lock on the root table; only the immediate parent.  So
> any cached select/update/delete plans won't be invalidated when a new
> lower-level partition is added/removed, because the immediate parent would
> not have been added to the query's range table and hence the
> PlannedStmt.relationOids list.  Remember that the latter is used by
> plancache.c to remember the relations that a given cached plan depends on
> remaining unchanged.  So the patch now adds a list member called
> partitioned_rels to Append, MergeAppend, and ModifyTable nodes and stores
> the RT indexes of all the non-leaf tables in a partition tree with root
> table RT index at the head (note that these RT indexes are of the RTEs
> added by expand_inherited_rtenrty; also see below).  Since the
> partitioned_rels list is constructed when building paths and must be
> propagated to the plan nodes, the same field is also present in the
> corresponding Path nodes.  ExecInit* routines for the aforementioned nodes
> now locate and lock the non-leaf tables using the RT indexes in
> partitioned_rels.  Leaf tables are locked, as before, either by InitPlan
> (update/delete result relations case) or by ExecInitAppend or
> ExecInitMergeAppend when initializing the appendplans/mergeplans (select
> case).
> 
> The previous proposal was for expand_inherited_rtentry to not create RT
> entries and AppendRelInfo's for the non-leaf tables, but I think that
> doesn't work, as I tried to explain above.  We need RTEs because that
> seems to be the only way currently for informing the executor of the
> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
> RTEs for the latter planning steps to collect them in partitioned_rels
> mentioned above. So with the latest patch, we do create the RT entry and
> AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
> a minimal one; only parent_relid and child_relid are valid.  To make the
> latter planning steps ignore these minimal AppendRelInfo's, every
> AppendRelInfo is now marked with child_relkind.  Only
> set_append_rel_pathlist() and inheritance_planner() process them to
> collect the child_relid into the partitioned_rels list to be stored in
> AppendPath/MergeAppendPath and ModifyTablePath, respectively.

Sorry, forgot to attach the patches themselves.  Attached this time.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

14 March 2017, 11:18:01

On 2017/03/13 19:30, Amit Langote wrote:
>> Here is the updated patch.
>>
>> Since this patch proposes to avoid creating scan nodes for non-leaf tables
>> in a partition tree, they won't be referenced anywhere in the resulting
>> plan tree.  So the executor will not lock those tables in the
>> select/update/delete cases. Insert is different since we lock all tables
>> in the partition tree when setting up tuple-routing in
>> ExecInitModifyTable.  Not taking executor locks on the tables means that
>> the cached plans that should be invalidated upon adding/removing a
>> partition somewhere in the partition tree won't be.
>>
>> First I thought that we could remember just the root table RT index using
>> a new Index field partitionRoot in Append, MergeAppend, and ModifyTable
>> nodes and use it to locate and lock the root table during executor node
>> initialization.  But soon realized that that's not enough, because it
>> ignores the fact that adding/removing partitions at lower levels does not
>> require taking a lock on the root table; only the immediate parent.  So
>> any cached select/update/delete plans won't be invalidated when a new
>> lower-level partition is added/removed, because the immediate parent would
>> not have been added to the query's range table and hence the
>> PlannedStmt.relationOids list.  Remember that the latter is used by
>> plancache.c to remember the relations that a given cached plan depends on
>> remaining unchanged.  So the patch now adds a list member called
>> partitioned_rels to Append, MergeAppend, and ModifyTable nodes and stores
>> the RT indexes of all the non-leaf tables in a partition tree with root
>> table RT index at the head (note that these RT indexes are of the RTEs
>> added by expand_inherited_rtenrty; also see below).  Since the
>> partitioned_rels list is constructed when building paths and must be
>> propagated to the plan nodes, the same field is also present in the
>> corresponding Path nodes.  ExecInit* routines for the aforementioned nodes
>> now locate and lock the non-leaf tables using the RT indexes in
>> partitioned_rels.  Leaf tables are locked, as before, either by InitPlan
>> (update/delete result relations case) or by ExecInitAppend or
>> ExecInitMergeAppend when initializing the appendplans/mergeplans (select
>> case).
>>
>> The previous proposal was for expand_inherited_rtentry to not create RT
>> entries and AppendRelInfo's for the non-leaf tables, but I think that
>> doesn't work, as I tried to explain above.  We need RTEs because that
>> seems to be the only way currently for informing the executor of the
>> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
>> RTEs for the latter planning steps to collect them in partitioned_rels
>> mentioned above. So with the latest patch, we do create the RT entry and
>> AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
>> a minimal one; only parent_relid and child_relid are valid.  To make the
>> latter planning steps ignore these minimal AppendRelInfo's, every
>> AppendRelInfo is now marked with child_relkind.  Only
>> set_append_rel_pathlist() and inheritance_planner() process them to
>> collect the child_relid into the partitioned_rels list to be stored in
>> AppendPath/MergeAppendPath and ModifyTablePath, respectively.
> 
> Sorry, forgot to attach the patches themselves.  Attached this time.

I made some changes to 0001:

* No need to create RelOptInfo's for partitioned RTEs added by
  expand_inherited_rtentry

* Instead of storing relkind of child tables in AppendRelInfo's, simply
  store if they are partitioned; IOW, change child_relkind to
  child_is_partitioned

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 2017/03/15 7:09, Robert Haas wrote:
> On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The previous proposal was for expand_inherited_rtentry to not create RT
>> entries and AppendRelInfo's for the non-leaf tables, but I think that
>> doesn't work, as I tried to explain above.  We need RTEs because that
>> seems to be the only way currently for informing the executor of the
>> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
>> RTEs for the latter planning steps to collect them in partitioned_rels
>> mentioned above. So with the latest patch, we do create the RT entry and
>> AppendRelInfo for non-leaf tables.  AppendRelInfo created in this case is
>> a minimal one; only parent_relid and child_relid are valid.  To make the
>> latter planning steps ignore these minimal AppendRelInfo's, every
>> AppendRelInfo is now marked with child_relkind.  Only
>> set_append_rel_pathlist() and inheritance_planner() process them to
>> collect the child_relid into the partitioned_rels list to be stored in
>> AppendPath/MergeAppendPath and ModifyTablePath, respectively.
> 
> I see your point, but I still think this kind of stinks.  You've got
> all kinds of logic that is now conditional on child_is_partitioned,
> and that seems like a recipe for bugs and future maintenance
> difficulties.  It would be much nicer if we could come up with a
> design that doesn't create the AppendRelInfo in the first place,
> because then all of that stuff could just work.  Can we get away with
> creating an RTE for each partitioned table (other than the parent,
> perhaps; for that one it would be nice to use the inh-flagged RTE we
> already have) but NOT creating an AppendRelInfo to go with it?  If
> we've got the list of RTIs for the new RTEs associated with the append
> path in some other form, can't we get by without also having an
> AppendRelInfo to hold onto that translation?

I think we'll need to store *somewhere* the mapping of which inh=false
partitioned table RTE is the child of which inh=true (IOW, parent)
partitioned table RTE.  I've come to think that AppendRelInfos, although
contain extraneous information that won't be used, are better than
inventing something new altogether for time being.  AppendRelInfos are
referred to a few times by query_planner() steps before we eventually get
to either set_append_rel_pathlist() or inheritance_planner(), so not
changing that approach seems less worrisome for now.  So now if we both
create child RTEs and AppendRelInfos for the partitioned tables, we don't
need to change expand_inherited_rtentry() at all with this patch.
Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the
child partitioned table RTEs, because no path/plan need to be created.  We
can do away with having to create RelOptInfos for child partitioned table
RTEs, which I found to be not that invasive.

We seem to agree that we need to carry the partitioned table RT indexes
over to the executor somehow so that they are locked properly during
execution.  To that end, Append, MergeAppend, and ModifyTable (and the
corresponding path nodes) nodes will contain a list of those RT indexes in
a field called partitioned_rels.  In the result relation case, the list
from ModifyTable is copied to a field in PlannerGlobal called
nonleafResultRelations (next to the old resultRelations) and further to a
field of the same name in PlannedStmt.  I considered if it can be done
without a new field, but realized that the old resultRelations is tied
closely with how the EState manages result relations.

InitPlan() and AcquireExecutorLocks() lock the tables denoted by the RT
indexes in PlannedStmt.nonleafResultRelations using RowExclusiveLock, just
like they would those in PlannedStmt.resultRelations.

Since expand_inherited_rtentries() adds PlanRowMarks for partitioned
tables just like previously, InitPlan() and AcquireExecutorLocks() will
lock them with RowShareLock if needed.

Finally, ExecInitAppend() and ExecInitMergeAppend() must acquire
AccessShareLock on the partition tables, if not already locked by InitPlan().

> The comments in InitPlan() explain why locks on result relations are
> taken in that function directly rather than during the ExecInitNode
> pass over the tree; it's because we need to make sure we take the
> strongest lock on any given relation first.  But the changes in
> ExecInitAppend and ExecInitMergeAppend are problematic in that regard;
> some AccessShareLocks may already have been taken, and now we're
> taking locks that in some case may be RowShareLock, which could cause
> a lock upgrade.  Remember that there's no reason that you couldn't
> join a table to one of its own partitions, or something like that.  I
> think you need to try to jigger things so that InitPlan() takes all
> locks stronger than AccessShareLock that are required anywhere in the
> query, and then other nodes can take anything at AccessShareLock that
> they need.

Thanks for the explanation.  If the scheme in the new patch that I
described above sounds OK, I think it takes care of locking the
partitioned tables without the upgrade hazards.

> I think that eliding the Append node when there's only one child may
> be unsafe in the case where the child's attribute numbers are
> different from the parent's attribute numbers.  I remember Tom making
> some comment about this when I was working on MergeAppend, although I
> no longer remember the specific details.

Append node elision does not occur in the one-child case.  With the patch:

create table q (a int) partition by list (a);
explain select * from q;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=4)
   One-Time Filter: false
(2 rows)

create table q1 partition of q for values in (1);
explain select * from q;
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..35.50 rows=2550 width=4)
   ->  Seq Scan on q1  (cost=0.00..35.50 rows=2550 width=4)
(2 rows)

Maybe that should be done, but this patch doesn't implement that.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From

Robert Haas

Date:

16 March 2017, 16:16:25

On Thu, Mar 16, 2017 at 6:03 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I think we'll need to store *somewhere* the mapping of which inh=false
> partitioned table RTE is the child of which inh=true (IOW, parent)
> partitioned table RTE.

I mean, for the children you're going to scan, that seems to be
necessary so that you can do things like translate targetlists to use
the correct varno.  But for the children you're not going to scan,
well, you need to know which ones they are so you can lock them, but
do you really need the parent-child mappings?  Or just a list of which
ones there are?

> I've come to think that AppendRelInfos, although
> contain extraneous information that won't be used, are better than
> inventing something new altogether for time being.  AppendRelInfos are
> referred to a few times by query_planner() steps before we eventually get
> to either set_append_rel_pathlist() or inheritance_planner(), so not
> changing that approach seems less worrisome for now.  So now if we both
> create child RTEs and AppendRelInfos for the partitioned tables, we don't
> need to change expand_inherited_rtentry() at all with this patch.
> Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the
> child partitioned table RTEs, because no path/plan need to be created.  We
> can do away with having to create RelOptInfos for child partitioned table
> RTEs, which I found to be not that invasive.

Yes, but on the flip side, you're having to add code in a lot of
places -- I think I counted 7 -- where you turn around and ignore
those AppendRelInfos.  That's a lot; how do we know we've got them
all?  I'm not sure what the patch would look like the other way, but
I'm hoping that you could just keep the list of partitioned table RTIs
someplace that mostly gets ignored, and then all of that special
handling could be ripped out.

> Append node elision does not occur in the one-child case.  With the patch:

Oh, OK.  Somehow the commit message you included led me to the
contrary conclusion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

17 March 2017, 11:57:23

On 2017/03/16 22:16, Robert Haas wrote:
> On Thu, Mar 16, 2017 at 6:03 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I think we'll need to store *somewhere* the mapping of which inh=false
>> partitioned table RTE is the child of which inh=true (IOW, parent)
>> partitioned table RTE.
> 
> I mean, for the children you're going to scan, that seems to be
> necessary so that you can do things like translate targetlists to use
> the correct varno.  But for the children you're not going to scan,
> well, you need to know which ones they are so you can lock them, but
> do you really need the parent-child mappings?  Or just a list of which
> ones there are?

I thought we'd want to keep the RT indexes segregated per controlling
node, which is why we'd want to store the mapping somehow.  There may be
multiple partitioned tables in a query and one of them might be the result
relation.  If we propagate all of them in one list to the executor,
InitPlan wouldn't know which ones are target relations, for example.  But
I guess you don't mean passing all of them in one big list to the executor.

>> I've come to think that AppendRelInfos, although
>> contain extraneous information that won't be used, are better than
>> inventing something new altogether for time being.  AppendRelInfos are
>> referred to a few times by query_planner() steps before we eventually get
>> to either set_append_rel_pathlist() or inheritance_planner(), so not
>> changing that approach seems less worrisome for now.  So now if we both
>> create child RTEs and AppendRelInfos for the partitioned tables, we don't
>> need to change expand_inherited_rtentry() at all with this patch.
>> Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the
>> child partitioned table RTEs, because no path/plan need to be created.  We
>> can do away with having to create RelOptInfos for child partitioned table
>> RTEs, which I found to be not that invasive.
> 
> Yes, but on the flip side, you're having to add code in a lot of
> places -- I think I counted 7 -- where you turn around and ignore
> those AppendRelInfos.

Perhaps you were looking at the previous version with "minimal" appinfos
containing the child_is_partitioned field?  Following are the instances in
the latest patch posted yesterday:

- In set_append_re_size, do not count the partitioned child tables

- In set_append_rel_pathlist, do not create paths, but add to the list of
  the partitioned tables to be included in the final Append path

- In create_lateral_join_info(), no need to propagate lateral join info to
  partitioned child tables

- In inheritance_planner, no need to create plans for partitioned child
  tables, but add to the list of the partitioned tables to be included in
  the final ModifyTable path

In the previous version, because AppendRelInfo structure was itself
modified, there were more instances because more places would need to know
about the new kinds of appinfos (the minimal ones), which isn't the case
with the latest patch.  Although admittedly, there will be a lot of
useless manipulation of those appinfos whose only purpose is to carry the
parent-child relid mapping up to where we finally consume that
information.  Anyway, I've tried to implement an approach that doesn't
create AppendRelInfos for partitioned child tables as described below.

> That's a lot; how do we know we've got them
> all?  I'm not sure what the patch would look like the other way, but
> I'm hoping that you could just keep the list of partitioned table RTIs
> someplace that mostly gets ignored, and then all of that special
> handling could be ripped out.

Since we'll want to save the RT indexes when we create child RTEs, that is
in expand_inherited_rtentry(), we'd do use a data structure that parallels
root->append_rel_list.  The structure might consist of a the parent_relid
and a list of partitioned child RT indexes.  Since we build this
information early during subquery_planner() it might equally be subject to
errors of omission, although that somehow sounds less likely.

I've tried to implement something like that in the attached updated 0001.
The mapping structure PartitionedChildRelInfo (a Node in fact) consists of
parent_relid and the list of partitioned child RT indexes.  A global list
of these nodes is maintained in the root PlannerInfo called pcinfo_list.
In set_append_rel_pathlist() and inheritance_planner(), we search the
aforementioned list and copy the correct list to
Append/MergeAppend/ModifyTable path created for the parent.  Thoughts?

I also noticed that PlanRowMarks that are created by
expand_inherited_rtentry() for partitioned child tables are problematic -
they must somehow be ignored in the executor.  One way to do that is to
set isParent (making it a dummy entry) when initializing them, which is
what the patch does.  Comment is updated to say that just like inheritance
parent relations, PlanRowMarks for partitioned child tables are also
marked as dummy.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From

Robert Haas

Date:

20 March 2017, 19:16:19

On Fri, Mar 17, 2017 at 4:57 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Yes, but on the flip side, you're having to add code in a lot of
>> places -- I think I counted 7 -- where you turn around and ignore
>> those AppendRelInfos.
>
> Perhaps you were looking at the previous version with "minimal" appinfos
> containing the child_is_partitioned field?

Yes, I think I was.  I think this version looks a lot better.
    /*
+     * Close the root partitioned rel if we opened it above, but keep the
+     * lock.
+     */
+    if (rel != mtstate->resultRelInfo->ri_RelationDesc)
+        heap_close(rel, NoLock);

We didn't take a lock above, though, so drop everything in the comment
from "but" onward.

-    add_paths_to_append_rel(root, rel, live_childrels);
+    add_paths_to_append_rel(root, rel, live_childrels, partitioned_rels);

I think it would make more sense to put the new logic into
add_paths_to_append_rel, instead of passing this down as an additional
parameter.

+     * do not appear anywhere else in the plan.  Situation is exactly the

The situation is...

+    if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
+    {
+        foreach(lc, root->pcinfo_list)
+        {
+            PartitionedChildRelInfo *pc = lfirst(lc);
+
+            if (pc->parent_relid == parentRTindex)
+            {
+                partitioned_rels = pc->child_rels;
+                break;
+            }
+        }
+    }

You seem to have a few copies of this logic.  I think it would be
worth factoring it out into a separate function.

+                root->glob->nonleafResultRelations =
+                    list_concat(root->glob->nonleafResultRelations,
+                                list_copy(splan->partitioned_rels));

Please add a brief comment.  One line is fine.

+            newrc->isParent = childrte->relkind == RELKIND_PARTITIONED_TABLE;

I'm not sure what project style is, but I personally find these kinds
of assignments easier to read with an extra set of parantheses:
           newrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);

+    if (partitioned_rels == NIL)
+        return;
+
+    foreach(lc, partitioned_rels)

I think the if-test is pointless; the foreach loop is going to start
by comparing the initial value with NIL.

Why doesn't ExecSerializePlan() need to transfer a proper value for
nonleafResultRelations to workers?  Seems like it should.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

21 March 2017, 12:05:42

On 2017/03/21 1:16, Robert Haas wrote:
> On Fri, Mar 17, 2017 at 4:57 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Yes, but on the flip side, you're having to add code in a lot of
>>> places -- I think I counted 7 -- where you turn around and ignore
>>> those AppendRelInfos.
>>
>> Perhaps you were looking at the previous version with "minimal" appinfos
>> containing the child_is_partitioned field?
> 
> Yes, I think I was.  I think this version looks a lot better.

Just to clarify, I assume you reviewed the latest version which does not
create AppendRelInfos, but instead creates PartitionedChildRelInfos (as
also evident from your comments below).  Sorry about the confusion.

>      /*
> +     * Close the root partitioned rel if we opened it above, but keep the
> +     * lock.
> +     */
> +    if (rel != mtstate->resultRelInfo->ri_RelationDesc)
> +        heap_close(rel, NoLock);
> 
> We didn't take a lock above, though, so drop everything in the comment
> from "but" onward.

Oh, right.

> -    add_paths_to_append_rel(root, rel, live_childrels);
> +    add_paths_to_append_rel(root, rel, live_childrels, partitioned_rels);
> 
> I think it would make more sense to put the new logic into
> add_paths_to_append_rel, instead of passing this down as an additional
> parameter.

OK, done.

> +     * do not appear anywhere else in the plan.  Situation is exactly the
> 
> The situation is...

Fixed.

> 
> +    if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
> +    {
> +        foreach(lc, root->pcinfo_list)
> +        {
> +            PartitionedChildRelInfo *pc = lfirst(lc);
> +
> +            if (pc->parent_relid == parentRTindex)
> +            {
> +                partitioned_rels = pc->child_rels;
> +                break;
> +            }
> +        }
> +    }
> 
> You seem to have a few copies of this logic.  I think it would be
> worth factoring it out into a separate function.

OK, done.  Put the definition in in planner.c

> +                root->glob->nonleafResultRelations =
> +                    list_concat(root->glob->nonleafResultRelations,
> +                                list_copy(splan->partitioned_rels));
> 
> Please add a brief comment.  One line is fine.

Done.

> 
> +            newrc->isParent = childrte->relkind == RELKIND_PARTITIONED_TABLE;
> 
> I'm not sure what project style is, but I personally find these kinds
> of assignments easier to read with an extra set of parantheses:
> 
>             newrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);

Ah, done.

> 
> +    if (partitioned_rels == NIL)
> +        return;
> +
> +    foreach(lc, partitioned_rels)
> 
> I think the if-test is pointless; the foreach loop is going to start
> by comparing the initial value with NIL.

Right, fixed.

> Why doesn't ExecSerializePlan() need to transfer a proper value for
> nonleafResultRelations to workers?  Seems like it should.

It doesn't transfer resultRelations either, presumably because workers do
not handle result relations yet.  Also, both resultRelations and
nonleafResultRelations are set *only* if there is a ModifyTable node.

Attached updated patches.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

> On Tue, Mar 21, 2017 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Mar 21, 2017 at 9:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Tue, Mar 21, 2017 at 5:05 AM, Amit Langote
>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> Attached updated patches.
>>>
>>> Committed 0001 after removing a comma.
>>
>> Regarding 0002,

Thanks for the review.

>> I notice this surprising behavior:
>>
>> rhaas=# create table foo (a int, b text) partition by list (a);
>> CREATE TABLE
>> rhaas=# select relfilenode from pg_class where oid = 'foo'::regclass;
>>  relfilenode
>> -------------
>>        16385
>> (1 row)
>>
>> Why do we end up with a relfilenode if there's no storage?  I would
>> have expected to see 0 there.

Hmm, I see that happening even with views (and other relkinds that do not
have storage):

create table foo (a int);
create view foov as select * from foo;
select relfilenode from pg_class where relname = 'foov';
-[ RECORD 1 ]------
relfilenode | 24606

create type typ as (a int);
select relfilenode from pg_class where relname = 'typ';
-[ RECORD 1 ]------
relfilenode | 24610

After staring at the relevant code, fixing things so that relfilenode is 0
for relations without storage seems to be slightly involved. In the header
comment of relmapper.c, there is a line:

Mapped catalogs have zero in their pg_class.relfilenode entries.

which is kind of significant for the relcache initialization code.
RelationInitPhysicalAddr() called when building a relcache entry
initializes the rd_rel.relfilenode from pg_class.relfilenode if it's valid
and from relation mapper if invalid (i.e. 0).  So when relcache entries
for the mapped catalogs are being built, pg_class.relfilenode being
InvalidOid is a signal to get the same from the relmapper.  Since the same
code path is exercised in other cases, all relations supposedly without
storage would wrongly be looked up in the relmapper.  It may be possible
to fix that, but will require non-trivial rearrangement of relevant code.

>> Other than that, there's not much to see here.  I'm a little worried
>> that you might have missed some case that can result in an access to
>> the file, but I can't find one.  Stuff I tried:
>>
>> VACUUM
>> VACUUM FULL
>> CLUSTER
>> ANALYZE
>> CREATE INDEX
>> ALTER TABLE .. ALTER COLUMN .. TYPE
>> TRUNCATE
>>
>> It would be good to go through and make sure all of those - and any
>> others you can think of - are represented in the regression tests.

We now have tests that cover commands that previously would try to access
file for partitioned tables (most of those your listed above).  The commit
yesterday to eliminate scan nodes is also covered by tests.

Speaking of - do you think the following error message is reasonable when
a a partitioned table is passed to pg_prewarm():

select pg_prewarm('p');
ERROR:  fork "main" does not exist for this relation

It's the same if a view name is passed.

>> Also, a documentation update is probably in order to explain that
>> we're not going to accept heap_reloptions on the parent because the
>> parent has no storage; such options must be set on the child in order
>> to have effect.

Hmm, actually I am no longer sure if we should completely get rid of the
reloptions.  Some options like fillfactor don't make sense, but we might
want to use the values for autovacuum_* options when we will improve
autovacuum's handling of partitioned tables.  Maybe even parallel_workers
could be useful to specify for parent tables.  So, I took out reloptions.c
changes from the patch for now and documented that specifying the storage
options for partitioned parents is ineffective currently.  Does that make
sense?

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Do-not-allocate-storage-for-partitioned-tables.patch

Re: [HACKERS] Partitioned tables and relfilenode

From

Maksim Milyutin

Date:

23 March 2017, 17:27:14

Hi!

I have noticed that there is scheduled unlinking of nonexistent physical 
storage under partitioned table when we execute DROP TABLE statement on 
this partitioned table. Though this action doesn't generate any error 
under typical behavior of postgres because the error of storage's lack 
is caught through if-statement [1] I think it is not safe.

My patch fixes this issue.

1. src/backend/storage/smgr/md.c:1385

-- 
Maksim Milyutin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

exclude_unlinking_storage_for_partitioned_table.patch

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

23 March 2017, 17:47:02

Hi,

On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
<m.milyutin@postgrespro.ru> wrote:
> Hi!
>
> I have noticed that there is scheduled unlinking of nonexistent physical
> storage under partitioned table when we execute DROP TABLE statement on this
> partitioned table. Though this action doesn't generate any error under
> typical behavior of postgres because the error of storage's lack is caught
> through if-statement [1] I think it is not safe.
>
> My patch fixes this issue.
>
> 1. src/backend/storage/smgr/md.c:1385

Good catch, will incorporate that in the main patch.

Thanks,
Amit

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

24 March 2017, 03:54:22

On 2017/03/23 23:47, Amit Langote wrote:
> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
> <m.milyutin@postgrespro.ru> wrote:
>> Hi!
>>
>> I have noticed that there is scheduled unlinking of nonexistent physical
>> storage under partitioned table when we execute DROP TABLE statement on this
>> partitioned table. Though this action doesn't generate any error under
>> typical behavior of postgres because the error of storage's lack is caught
>> through if-statement [1] I think it is not safe.
>>
>> My patch fixes this issue.
>>
>> 1. src/backend/storage/smgr/md.c:1385
> 
> Good catch, will incorporate that in the main patch.

And here is the updated patch.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Do-not-allocate-storage-for-partitioned-tables.patch

Re: Partitioned tables and relfilenode

From

Robert Haas

Date:

27 March 2017, 17:27:09

On Thu, Mar 23, 2017 at 8:54 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/03/23 23:47, Amit Langote wrote:
>> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
>> <m.milyutin@postgrespro.ru> wrote:
>>> Hi!
>>>
>>> I have noticed that there is scheduled unlinking of nonexistent physical
>>> storage under partitioned table when we execute DROP TABLE statement on this
>>> partitioned table. Though this action doesn't generate any error under
>>> typical behavior of postgres because the error of storage's lack is caught
>>> through if-statement [1] I think it is not safe.
>>>
>>> My patch fixes this issue.
>>>
>>> 1. src/backend/storage/smgr/md.c:1385
>>
>> Good catch, will incorporate that in the main patch.
>
> And here is the updated patch.

I think you should go back to the earlier strategy of disallowing heap
reloptions to be set on the partitioned table.  The thing is, we don't
really know which way we're going to want to go in the future.  Maybe
we'll come up with a system where you can set options on the
partitioned table, and those options will cascade to the children.  Or
maybe we'll come up with a system where partitioned tables have a
completely different set of options to control behaviors specific to
partitioned tables.  If we do the latter, then we don't want to also
have to support useless heap reloptions for forward compatibility, nor
do we want to break backward-compatibility to remove support.  If we
do the former, then it's better if we allow it in the same release
where it starts working.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Partitioned tables and relfilenode

From

Maksim Milyutin

Date:

27 March 2017, 18:13:23

On 24.03.2017 03:54, Amit Langote wrote:
>
> And here is the updated patch.
>

Perhaps you forgot my fix in the updated patch

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 3999e6e..36917c8 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1823,7 +1823,8 @@ heap_drop_with_catalog(Oid relid)      */     if (rel->rd_rel->relkind != RELKIND_VIEW &&
rel->rd_rel->relkind!= RELKIND_COMPOSITE_TYPE &&
 
-        rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE)
+        rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE &&
+        rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)     {         RelationDropStorage(rel);     }

-- 
Maksim Milyutin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Partitioned tables and relfilenode

From

Amit Langote

Date:

28 March 2017, 03:59:30

Hi,

On 2017/03/28 0:13, Maksim Milyutin wrote:
> On 24.03.2017 03:54, Amit Langote wrote:
>>
>> And here is the updated patch.
>>
> 
> Perhaps you forgot my fix in the updated patch
> 
> diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
> index 3999e6e..36917c8 100644
> --- a/src/backend/catalog/heap.c
> +++ b/src/backend/catalog/heap.c
> @@ -1823,7 +1823,8 @@ heap_drop_with_catalog(Oid relid)
>       */
>      if (rel->rd_rel->relkind != RELKIND_VIEW &&
>          rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE &&
> -        rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE)
> +        rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE &&
> +        rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
>      {
>          RelationDropStorage(rel);
>      }

Oops, my bad.  I will include it in the patch I'll send after addressing
Robert's comments.  Thanks again!

Regards,
Amit

Re: Partitioned tables and relfilenode

From

Robert Haas

Date:

29 March 2017, 06:18:21

On Mon, Mar 27, 2017 at 8:59 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Oops, my bad.  I will include it in the patch I'll send after addressing
> Robert's comments.  Thanks again!

That patch coming soon?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Partitioned tables and relfilenode

From

Amit Langote

Date:

29 March 2017, 09:40:20

On 2017/03/27 23:27, Robert Haas wrote:
> On Thu, Mar 23, 2017 at 8:54 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/03/23 23:47, Amit Langote wrote:
>>> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
>>> <m.milyutin@postgrespro.ru> wrote:
>>>> Hi!
>>>>
>>>> I have noticed that there is scheduled unlinking of nonexistent physical
>>>> storage under partitioned table when we execute DROP TABLE statement on this
>>>> partitioned table. Though this action doesn't generate any error under
>>>> typical behavior of postgres because the error of storage's lack is caught
>>>> through if-statement [1] I think it is not safe.
>>>>
>>>> My patch fixes this issue.
>>>>
>>>> 1. src/backend/storage/smgr/md.c:1385
>>>
>>> Good catch, will incorporate that in the main patch.
>>
>> And here is the updated patch.
> 
> I think you should go back to the earlier strategy of disallowing heap
> reloptions to be set on the partitioned table.  The thing is, we don't
> really know which way we're going to want to go in the future.  Maybe
> we'll come up with a system where you can set options on the
> partitioned table, and those options will cascade to the children.  Or
> maybe we'll come up with a system where partitioned tables have a
> completely different set of options to control behaviors specific to
> partitioned tables.  If we do the latter, then we don't want to also
> have to support useless heap reloptions for forward compatibility, nor
> do we want to break backward-compatibility to remove support.  If we
> do the former, then it's better if we allow it in the same release
> where it starts working.

You're right, modified the patch accordingly.

By the way, the previous version of the patch didn't really "disallow"
specifying heap reloptions though.  What I'd imagine that should entail is
CREATE TABLE or ALTER TABLE raising error if one of those reloptions is
specified, which didn't really occur with the patch.  The options were
silently accepted and stored into pg_class, but their values were never
used.  I modified the patch so that an error occurs instead of silently
accepting the user input.

create table p (a int) partition by list (a) with (fillfactor = 10);
ERROR:  unrecognized parameter "fillfactor" for a partitioned table

Thanks,
Amit

Re: Partitioned tables and relfilenode

From

Kyotaro HORIGUCHI

Date:

29 March 2017, 10:49:38

Hello,

At Wed, 29 Mar 2017 15:40:20 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<c4d71df2-9e0e-3912-dc81-9a72e080c238@lab.ntt.co.jp>
> On 2017/03/27 23:27, Robert Haas wrote:
> > On Thu, Mar 23, 2017 at 8:54 PM, Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> On 2017/03/23 23:47, Amit Langote wrote:
> >>> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin
> >>> <m.milyutin@postgrespro.ru> wrote:
> >>>> Hi!
> >>>>
> >>>> I have noticed that there is scheduled unlinking of nonexistent physical
> >>>> storage under partitioned table when we execute DROP TABLE statement on this
> >>>> partitioned table. Though this action doesn't generate any error under
> >>>> typical behavior of postgres because the error of storage's lack is caught
> >>>> through if-statement [1] I think it is not safe.
> >>>>
> >>>> My patch fixes this issue.
> >>>>
> >>>> 1. src/backend/storage/smgr/md.c:1385
> >>>
> >>> Good catch, will incorporate that in the main patch.
> >>
> >> And here is the updated patch.
> > 
> > I think you should go back to the earlier strategy of disallowing heap
> > reloptions to be set on the partitioned table.  The thing is, we don't
> > really know which way we're going to want to go in the future.  Maybe
> > we'll come up with a system where you can set options on the
> > partitioned table, and those options will cascade to the children.  Or
> > maybe we'll come up with a system where partitioned tables have a
> > completely different set of options to control behaviors specific to
> > partitioned tables.  If we do the latter, then we don't want to also
> > have to support useless heap reloptions for forward compatibility, nor
> > do we want to break backward-compatibility to remove support.  If we
> > do the former, then it's better if we allow it in the same release
> > where it starts working.
> 
> You're right, modified the patch accordingly.
> 
> By the way, the previous version of the patch didn't really "disallow"
> specifying heap reloptions though.  What I'd imagine that should entail is
> CREATE TABLE or ALTER TABLE raising error if one of those reloptions is
> specified, which didn't really occur with the patch.  The options were
> silently accepted and stored into pg_class, but their values were never
> used.  I modified the patch so that an error occurs instead of silently
> accepting the user input.
> 
> create table p (a int) partition by list (a) with (fillfactor = 10);
> ERROR:  unrecognized parameter "fillfactor" for a partitioned table

The following attracted my eyes.

+      if (def->defnamespace == NULL &&
+          pg_strcasecmp(def->defname, "oids") != 0)
+        ereport(ERROR,
+          (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+           errmsg("unrecognized parameter \"%s\" for a partitioned table",

This works since defnamespace is always NULL here, but if I
understand correctly what we should do here is "reject any option
other than "(default).OID"". So I think that the condition should
be like the following.

+      if (def->defnamespace != NULL ||
+          pg_strcasecmp(def->defname, "oids") != 0)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Partitioned tables and relfilenode

From

Amit Langote

Date:

29 March 2017, 11:21:26

Horiguchi-san,

Thanks for taking a look.

On 2017/03/29 16:49, Kyotaro HORIGUCHI wrote:
> At Wed, 29 Mar 2017 15:40:20 +0900, Amit Langote wrote:
>> On 2017/03/27 23:27, Robert Haas wrote:
>>>> And here is the updated patch.
>>>
>>> I think you should go back to the earlier strategy of disallowing heap
>>> reloptions to be set on the partitioned table.  The thing is, we don't
>>> really know which way we're going to want to go in the future.  Maybe
>>> we'll come up with a system where you can set options on the
>>> partitioned table, and those options will cascade to the children.  Or
>>> maybe we'll come up with a system where partitioned tables have a
>>> completely different set of options to control behaviors specific to
>>> partitioned tables.  If we do the latter, then we don't want to also
>>> have to support useless heap reloptions for forward compatibility, nor
>>> do we want to break backward-compatibility to remove support.  If we
>>> do the former, then it's better if we allow it in the same release
>>> where it starts working.
>>
>> You're right, modified the patch accordingly.
>>
>> By the way, the previous version of the patch didn't really "disallow"
>> specifying heap reloptions though.  What I'd imagine that should entail is
>> CREATE TABLE or ALTER TABLE raising error if one of those reloptions is
>> specified, which didn't really occur with the patch.  The options were
>> silently accepted and stored into pg_class, but their values were never
>> used.  I modified the patch so that an error occurs instead of silently
>> accepting the user input.
>>
>> create table p (a int) partition by list (a) with (fillfactor = 10);
>> ERROR:  unrecognized parameter "fillfactor" for a partitioned table
> 
> The following attracted my eyes.
> 
> +      if (def->defnamespace == NULL &&
> +          pg_strcasecmp(def->defname, "oids") != 0)
> +        ereport(ERROR,
> +          (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +           errmsg("unrecognized parameter \"%s\" for a partitioned table",
> 
> This works since defnamespace is always NULL here, but if I
> understand correctly what we should do here is "reject any option
> other than "(default).OID"". So I think that the condition should
> be like the following.

You're right.  The following *wrongly* succeeds:

create table p (a int) partition by list (a) with
(toast.autovacuum_enabled = true);
CREATE TABLE

> +      if (def->defnamespace != NULL ||
> +          pg_strcasecmp(def->defname, "oids") != 0)

Looks correct, so incorporated in the attached updated patch.  Thanks.

Regards,
Amit

Re: Partitioned tables and relfilenode

From

Kyotaro HORIGUCHI

Date:

29 March 2017, 11:40:28

At Wed, 29 Mar 2017 17:21:26 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<b1234f04-53e0-011d-9f02-2437b909cce4@lab.ntt.co.jp>
> > Thanks for taking a look.

This patch is small enough to look at in a short time:p

> > The following attracted my eyes.
> > 
> > +      if (def->defnamespace == NULL &&
> > +          pg_strcasecmp(def->defname, "oids") != 0)
> > +        ereport(ERROR,
> > +          (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +           errmsg("unrecognized parameter \"%s\" for a partitioned table",
> > 
> > This works since defnamespace is always NULL here, but if I
> > understand correctly what we should do here is "reject any option
> > other than "(default).OID"". So I think that the condition should
> > be like the following.
> 
> You're right.  The following *wrongly* succeeds:

Ouch! I briefly checked that by "hoge.oids" without confirming
around.

> create table p (a int) partition by list (a) with
> (toast.autovacuum_enabled = true);
> CREATE TABLE
> 
> > +      if (def->defnamespace != NULL ||
> > +          pg_strcasecmp(def->defname, "oids") != 0)
> 
> Looks correct, so incorporated in the attached updated patch.  Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Partitioned tables and relfilenode

From

Robert Haas

Date:

29 March 2017, 17:58:05

On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Looks correct, so incorporated in the attached updated patch.  Thanks.

This seems like a hacky way to limit the reloptions to just OIDs.
Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
like that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Partitioned tables and relfilenode

From

Amit Langote

Date:

30 March 2017, 12:24:16

On 2017/03/29 23:58, Robert Haas wrote:
> On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Looks correct, so incorporated in the attached updated patch.  Thanks.
> 
> This seems like a hacky way to limit the reloptions to just OIDs.
> Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
> like that?

OK, I tried that in the updated patch.

DefineRelation() and ATExecSetRelOptions() still calls heap_reloptions(),
but passes the new relopt_kind.  None of the options listed in
reloptions.c are of this kind as of now, so parseRelOptions() simply
outputs the "unrecognized parameter %s" in the case of partitioned tables
(except in some cases described below, but not because of the limitations
of this patch).  If and when partitioned tables start supporting the
existing parameters, we'll update the relopt_gen.kinds bitmask of those
parameters to allow them to be specified for partitioned tables.

With the patch:

create table parted (a int) partition by list (a) with (fillfactor = 10);
ERROR:  unrecognized parameter "fillfactor"

create table parted (a int) partition by list (a) with (toast.fillfactor =
10);
ERROR:  unrecognized parameter "fillfactor"

and:

create table parted (a int) partition by list (a) with (oids = true);
alter table parted set (fillfactor = 90);
ERROR:  unrecognized parameter "fillfactor"

but:

-- appears to succeed, whereas an error should have been reported (I think)
alter table parted set (toast.fillfactor = 10);
ALTER TABLE

-- even with views
create table foo (a int);
create view foov with (toast.fillfactor = 10) as select * from foo;
CREATE VIEW
alter view foov set (toast.fillfactor = 20);
ALTER VIEW

Nothing is stored in pg_class.reloptions really, but the validation that
should have occurred in parseRelOptions() doesn't.  This happens because
of the way transformRelOptions() works.  It will ignore the DefElem's that
don't apply to the specified relation based on the value of the namspace
parameter ("toast" or NULL).  That means it won't be included in the array
of options later received by parseRelOptions(), which is where the
validation occurs.

Also, alter table reset (xxx) never validates anything:

alter table foo reset (foo);
ALTER TABLE
alter table foo reset (foo.bar);
ALTER TABLE

Again, no pg_class.reloptions update occurs in this case. The reason this
happens is because transformRelOptions() never includes the options to be
reset in the array of options received by parseRelOptions(), so no
validation occurs.

But since both are existing behaviors, perhaps we can worry about it some
other time.

Thanks,
Amit

Re: Partitioned tables and relfilenode

From

Kyotaro HORIGUCHI

Date:

30 March 2017, 13:37:29

Hello,

At Thu, 30 Mar 2017 18:24:16 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<ee5f1cd4-4783-42e8-0263-648ae9c11264@lab.ntt.co.jp>
> On 2017/03/29 23:58, Robert Haas wrote:
> > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> Looks correct, so incorporated in the attached updated patch.  Thanks.
> > 
> > This seems like a hacky way to limit the reloptions to just OIDs.
> > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
> > like that?
> 
> OK, I tried that in the updated patch.

The name RELOPT_KIND_PARTITIONED looks somewhat odd. RELKIND for
partitioned tables is RELKIND_PARTITIONED_TABLE, so is this
better to be _TABLE, even if a bit longer?

parseRelOptions seems to return garbage pointer if no option of
the kind is available.

> DefineRelation() and ATExecSetRelOptions() still calls heap_reloptions(),
> but passes the new relopt_kind.  None of the options listed in
> reloptions.c are of this kind as of now, so parseRelOptions() simply
> outputs the "unrecognized parameter %s" in the case of partitioned tables
> (except in some cases described below, but not because of the limitations
> of this patch).  If and when partitioned tables start supporting the
> existing parameters, we'll update the relopt_gen.kinds bitmask of those
> parameters to allow them to be specified for partitioned tables.
> 
> With the patch:
> 
> create table parted (a int) partition by list (a) with (fillfactor = 10);
> ERROR:  unrecognized parameter "fillfactor"
> 
> create table parted (a int) partition by list (a) with (toast.fillfactor =
> 10);
> ERROR:  unrecognized parameter "fillfactor"
> 
> and:
> 
> create table parted (a int) partition by list (a) with (oids = true);
> alter table parted set (fillfactor = 90);
> ERROR:  unrecognized parameter "fillfactor"
> 
> but:
> 
> -- appears to succeed, whereas an error should have been reported (I think)
> alter table parted set (toast.fillfactor = 10);
> ALTER TABLE
> 
> -- even with views
> create table foo (a int);
> create view foov with (toast.fillfactor = 10) as select * from foo;
> CREATE VIEW
> alter view foov set (toast.fillfactor = 20);
> ALTER VIEW
> 
> Nothing is stored in pg_class.reloptions really, but the validation that
> should have occurred in parseRelOptions() doesn't.  This happens because
> of the way transformRelOptions() works.  It will ignore the DefElem's that
> don't apply to the specified relation based on the value of the namspace
> parameter ("toast" or NULL).  That means it won't be included in the array
> of options later received by parseRelOptions(), which is where the
> validation occurs.
> 
> Also, alter table reset (xxx) never validates anything:
> 
> alter table foo reset (foo);
> ALTER TABLE
> alter table foo reset (foo.bar);
> ALTER TABLE
> 
> Again, no pg_class.reloptions update occurs in this case. The reason this
> happens is because transformRelOptions() never includes the options to be
> reset in the array of options received by parseRelOptions(), so no
> validation occurs.
> 
> But since both are existing behaviors, perhaps we can worry about it some
> other time.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Partitioned tables and relfilenode

From

Amit Langote

Date:

30 March 2017, 14:58:35

Thanks for the review.

On Thu, Mar 30, 2017 at 7:37 PM, Kyotaro HORIGUCHI wrote:
> At Thu, 30 Mar 2017 18:24:16 +0900, Amit Langote wrote:
>> On 2017/03/29 23:58, Robert Haas wrote:
>> > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
>> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> >> Looks correct, so incorporated in the attached updated patch.  Thanks.
>> >
>> > This seems like a hacky way to limit the reloptions to just OIDs.
>> > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
>> > like that?
>>
>> OK, I tried that in the updated patch.
>
> The name RELOPT_KIND_PARTITIONED looks somewhat odd. RELKIND for
> partitioned tables is RELKIND_PARTITIONED_TABLE, so is this
> better to be _TABLE, even if a bit longer?

Hm, OK.  Done.

> parseRelOptions seems to return garbage pointer if no option of
> the kind is available.

Oops, fixed that too.

Updated patch attached.

Thanks,
Amit

Attachment

0001-Do-not-allocate-storage-for-partitioned-tables.patch

Re: Partitioned tables and relfilenode

From

Kyotaro HORIGUCHI

Date:

31 March 2017, 09:50:28

At Thu, 30 Mar 2017 20:58:35 +0900, Amit Langote <amitlangote09@gmail.com> wrote in
<CA+HiwqFmrJGe4eid2EaqnCjjQxrAYR5hJfzc5Cz929WA9p=A+w@mail.gmail.com>
> Thanks for the review.
> 
> On Thu, Mar 30, 2017 at 7:37 PM, Kyotaro HORIGUCHI wrote:
> > At Thu, 30 Mar 2017 18:24:16 +0900, Amit Langote wrote:
> >> On 2017/03/29 23:58, Robert Haas wrote:
> >> > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote
> >> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> >> Looks correct, so incorporated in the attached updated patch.  Thanks.
> >> >
> >> > This seems like a hacky way to limit the reloptions to just OIDs.
> >> > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something
> >> > like that?
> >>
> >> OK, I tried that in the updated patch.
> >
> > The name RELOPT_KIND_PARTITIONED looks somewhat odd. RELKIND for
> > partitioned tables is RELKIND_PARTITIONED_TABLE, so is this
> > better to be _TABLE, even if a bit longer?
> 
> Hm, OK.  Done.
> 
> > parseRelOptions seems to return garbage pointer if no option of
> > the kind is available.
> 
> Oops, fixed that too.
> 
> Updated patch attached.

Thank you.

- Applies cleanly on master (f90d23d)
- Compiled without error
- Code seems fine.
- Documentaion seems fine.. for me.
- Passes regression test.
- Leaving the ALTER-on-toast.* problem is fine for me.

The regression contains the tests to fail with several reloptions
only for partitioned tables. Are they still required even though
it is now in the same framework with other kind of reloptions?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

11 April 2017, 12:16:51

On 2017/04/03 11:39, Amit Langote wrote:
> On 2017/04/01 5:29, Robert Haas wrote:
>> Thanks for reviewing, and thanks to Maksim as well, and thanks to Amit
>> for writing the patch.
> 
> Thanks for committing. :)

I noticed that I had missed a couple of places that would try to scan
partitioned tables, resulting in file access.

1. In validateCheckConstraint(), along with foreign tables, must ignore
partitioned tables.

2. DefineQueryRewrite() may try to scan a partitioned table in the case of
converting a table to view, where we must make sure that the table being
converted is empty.  It's checked by scanning the heap, which we should
not do for a partitioned table.  Nor should we try to drop the storage
once ready to make the table into a REKIND_VIEW relation (because all
other checks passed okaying the conversion).

Tests are added for both the cases.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Fix-few-places-still-scanning-partitioned-tables.patch

Re: [HACKERS] Partitioned tables and relfilenode

From

Ashutosh Bapat

Date:

11 April 2017, 15:59:44

Looking at the number of issues where we have to fix tests based on
the relkind checks, I think, we have to consider creating macros as
described in my reply to thread with subject " Allowing extended stats
on foreign and partitioned tables".

On Tue, Apr 11, 2017 at 2:46 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/04/03 11:39, Amit Langote wrote:
>> On 2017/04/01 5:29, Robert Haas wrote:
>>> Thanks for reviewing, and thanks to Maksim as well, and thanks to Amit
>>> for writing the patch.
>>
>> Thanks for committing. :)
>
> I noticed that I had missed a couple of places that would try to scan
> partitioned tables, resulting in file access.
>
> 1. In validateCheckConstraint(), along with foreign tables, must ignore
> partitioned tables.
>
> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of
> converting a table to view, where we must make sure that the table being
> converted is empty.  It's checked by scanning the heap, which we should
> not do for a partitioned table.  Nor should we try to drop the storage
> once ready to make the table into a REKIND_VIEW relation (because all
> other checks passed okaying the conversion).
>
> Tests are added for both the cases.
>
> Thanks,
> Amit
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Tom Lane

Date:

11 April 2017, 16:45:00

Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of
> converting a table to view, where we must make sure that the table being
> converted is empty.  It's checked by scanning the heap, which we should
> not do for a partitioned table.  Nor should we try to drop the storage
> once ready to make the table into a REKIND_VIEW relation (because all
> other checks passed okaying the conversion).

It looks like this patch intends to allow converting a partitioned table
to a view.  I would lobby for refusing the command, instead.  There is
no good reason to allow it, and it might well be a user error.
        regards, tom lane

Re: [HACKERS] Partitioned tables and relfilenode

From

Robert Haas

Date:

11 April 2017, 20:41:52

On Tue, Apr 11, 2017 at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of
>> converting a table to view, where we must make sure that the table being
>> converted is empty.  It's checked by scanning the heap, which we should
>> not do for a partitioned table.  Nor should we try to drop the storage
>> once ready to make the table into a REKIND_VIEW relation (because all
>> other checks passed okaying the conversion).
>
> It looks like this patch intends to allow converting a partitioned table
> to a view.  I would lobby for refusing the command, instead.  There is
> no good reason to allow it, and it might well be a user error.

Yeah, I agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

12 April 2017, 05:15:10

On 2017/04/12 2:41, Robert Haas wrote:
> On Tue, Apr 11, 2017 at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>>> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of
>>> converting a table to view, where we must make sure that the table being
>>> converted is empty.  It's checked by scanning the heap, which we should
>>> not do for a partitioned table.  Nor should we try to drop the storage
>>> once ready to make the table into a REKIND_VIEW relation (because all
>>> other checks passed okaying the conversion).
>>
>> It looks like this patch intends to allow converting a partitioned table
>> to a view.  I would lobby for refusing the command, instead.  There is
>> no good reason to allow it, and it might well be a user error.
> 
> Yeah, I agree.

Alright.  So I made it into two patches instead: 0001 fixes the bug that
validateCheckConstraint() tries to scan partitioned tables and 0002 makes
trying to convert a partitioned table to a view a user error.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Partitioned tables and relfilenode

From

Robert Haas

Date:

12 April 2017, 18:36:58

On Tue, Apr 11, 2017 at 10:15 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Alright.  So I made it into two patches instead: 0001 fixes the bug that
> validateCheckConstraint() tries to scan partitioned tables and 0002 makes
> trying to convert a partitioned table to a view a user error.

Committed together, after updating the regression test outputs to make
the tests pass.  You forgot to update the expected output file in
0002.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Partitioned tables and relfilenode

From

Amit Langote

Date:

13 April 2017, 03:35:07

On 2017/04/13 0:36, Robert Haas wrote:
> On Tue, Apr 11, 2017 at 10:15 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Alright.  So I made it into two patches instead: 0001 fixes the bug that
>> validateCheckConstraint() tries to scan partitioned tables and 0002 makes
>> trying to convert a partitioned table to a view a user error.
> 
> Committed together, after updating the regression test outputs to make
> the tests pass.  You forgot to update the expected output file in
> 0002.

Oops, sorry about that.  Thanks for fixing and committing.

Regards,
Amit