Thread: [HACKERS] expanding inheritance in partition bound order

[HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context.  They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code.  That's good.

For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations.  For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm.  For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].

So attached are two WIP patches:

0001 implements two interface functions:

  List *get_all_partition_oids(Oid, LOCKMODE)
  List *get_partition_oids(Oid, LOCKMODE)

that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users call them only for partitioned tables.
 Needless to mention, OIDs are returned with canonical order determined by
that of the partition bounds and they way partition tree structure is
traversed (top-down, breadth-first-left-to-right).  Patch replaces all the
calls of the old interface functions with the respective new ones for
partitioned table parents.  That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.

In its implementation, get_all_partition_oids() calls
RelationGetPartitionDispatchInfo(), which is useful to generate the result
list in the desired partition bound order.  But the current interface and
implementation of RelationGetPartitionDispatchInfo() needs some rework,
because it's too closely coupled with the executor's tuple routing code.

Applying just 0001 will satisfy the requirements stated in [1], but it
won't look pretty as is for too long.

So, 0002 is a patch to refactor RelationGetPartitionDispatchInfo() and
relevant data structures.  For example, PartitionDispatchData has now been
simplified to contain only the partition key, partition descriptor and
indexes array, whereas previously it also stored the relation descriptor,
partition key execution state, tuple table slot, tuple conversion map
which are required for tuple-routing.  RelationGetPartitionDispatchInfo()
no longer generates those things, but returns just enough information so
that a caller can generate and manage those things by itself.  This
simplification makes it less cumbersome to call
RelationGetPartitionDispatchInfo() in other places.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoajC0J50%3D2FqnZLvB10roY%2B68HgFWhso%3DV_StkC6PWujQ%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The current way to expand inherited tables, including partitioned tables,
> is to use either find_all_inheritors() or find_inheritance_children()
> depending on the context.  They return child table OIDs in the (ascending)
> order of those OIDs, which means the callers that need to lock the child
> tables can do so without worrying about the possibility of deadlock in
> some concurrent execution of that piece of code.  That's good.
>
> For partitioned tables, there is a possibility of returning child table
> (partition) OIDs in the partition bound order, which in addition to
> preventing the possibility of deadlocks during concurrent locking, seems
> potentially useful for other caller-specific optimizations.  For example,
> tuple-routing code can utilize that fact to implement binary-search based
> partition-searching algorithm.  For one more example, refer to the "UPDATE
> partition key" thread where it's becoming clear that it would be nice if
> the planner had put the partitions in bound order in the ModifyTable that
> it creates for UPDATE of partitioned tables [1].

Thanks a lot for working on this. Partition-wise join can benefit from
this as well. See comment about build_simple_rel's matching algorithm
in [1]. It will become O(n) instead of O(n^2).

>
> So attached are two WIP patches:
>
> 0001 implements two interface functions:
>
>   List *get_all_partition_oids(Oid, LOCKMODE)
>   List *get_partition_oids(Oid, LOCKMODE)
>
> that resemble find_all_inheritors() and find_inheritance_children(),
> respectively, but expect that users call them only for partitioned tables.
>  Needless to mention, OIDs are returned with canonical order determined by
> that of the partition bounds and they way partition tree structure is
> traversed (top-down, breadth-first-left-to-right).  Patch replaces all the
> calls of the old interface functions with the respective new ones for
> partitioned table parents.  That means expand_inherited_rtentry (among
> others) now calls get_all_partition_oids() if the RTE is for a partitioned
> table and find_all_inheritors() otherwise.
>
> In its implementation, get_all_partition_oids() calls
> RelationGetPartitionDispatchInfo(), which is useful to generate the result
> list in the desired partition bound order.  But the current interface and
> implementation of RelationGetPartitionDispatchInfo() needs some rework,
> because it's too closely coupled with the executor's tuple routing code.

May be we want to implement get_all_partition_oids() calling
get_partition_oids() on each new entry that gets added, similar to
find_all_inheritors(). That might avoid changes to DispatchInfo() and
also dependency on that structure.

Also almost every place which called find_all_inheritors() or
find_inheritance_children() is changed to if () else case calling
those functions or the new function as required. May be we should
create macros/functions to do that routing to keep the code readable.
May be find_all_inheritors() and find_inheritance_children()
themselves become the routing function and their original code moves
into new functions get_all_inheritors() and
get_inheritance_children(). We may choose other names for functions.
The idea is to create routing functions/macros instead of sprinkling
code with if () else conditions.

[1] https://www.postgresql.org/message-id/CA+TgmobeRUTu4osXA_UA4AORho83WxAvFG8n1NQcoFuujbeh7A@mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The current way to expand inherited tables, including partitioned tables,
> is to use either find_all_inheritors() or find_inheritance_children()
> depending on the context.  They return child table OIDs in the (ascending)
> order of those OIDs, which means the callers that need to lock the child
> tables can do so without worrying about the possibility of deadlock in
> some concurrent execution of that piece of code.  That's good.
>
> For partitioned tables, there is a possibility of returning child table
> (partition) OIDs in the partition bound order, which in addition to
> preventing the possibility of deadlocks during concurrent locking, seems
> potentially useful for other caller-specific optimizations.  For example,
> tuple-routing code can utilize that fact to implement binary-search based
> partition-searching algorithm.  For one more example, refer to the "UPDATE
> partition key" thread where it's becoming clear that it would be nice if
> the planner had put the partitions in bound order in the ModifyTable that
> it creates for UPDATE of partitioned tables [1].

I guess I don't really understand why we want to change the locking
order.  That is bound to make expanding the inheritance hierarchy more
expensive.  If we use this approach in all cases, it seems to me we're
bound to reintroduce the problem we fixed in commit
c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
the same vein.  But I don't see that there's any necessary relation
between the order of locking and the order of expansion inside the
relcache entry/plan/whatever else -- so I propose that we keep the
existing locking order and only change the other stuff.

While reading related code this morning, I noticed that
RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
*already* changed the locking order for certain operations, because
the PartitionDesc's OID array is bound-ordered not OID-ordered.  That
means that when RelationGetPartitionDispatchInfo uses the
PartitionDesc's OID arra to figure out what to lock, it's potentially
going to encounter partitions in a different order than would have
been the case if it had used find_all_inheritors directly.  I'm
tempted to think that RelationGetPartitionDispatchInfo shouldn't
really be doing locking at all.  The best way to have the locking
always happen in the same order is to have only one piece of code that
determines that order - and I vote for find_all_inheritors.  Aside
from the fact that it's the way we've always done it (and still do it
in most other places), that code includes infinite-loop defenses which
the new code you've introduced lacks.

Concretely, my proposal is:

1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).

2. RelationGetPartitionDispatchInfo assumes the relations are already locked.

3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it.  If not, there's no need.

4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
elog(ERROR, ...).  Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.

One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row.  However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list.  And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly.  I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE.  Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID.  It
can also return the number of initial elements of its return value
which have that flag set.

Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.

Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.

With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy.  You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.

Thoughts?

P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 4 August 2017 at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).
>
> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.

I agree. I think overall, we should keep
RelationGetPartitionDispatchInfo() only for preparing the dispatch
info in the planner, and generate the locked oids (using
find_all_inheritors() or get_partitioned_oids() or whatever) *without*
using RelationGetPartitionDispatchInfo(), since
RelationGetPartitionDispatchInfo() is generating the pd structure
which we don't want in every expansion.

>
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it.  If not, there's no need.

Yes, this way we need to open only the partitioned tables.

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

True. I also think, RelationGetPartitionDispatchInfo () should be
called while preparing the ModifyTable plan; the PartitionDispatch
data structure returned by RelationGetPartitionDispatchInfo() should
be stored in that plan, and then the execution-time fields in
PartitionDispatch would be populated in ExecInitModifyTable().


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/04 20:28, Ashutosh Bapat wrote:
> On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The current way to expand inherited tables, including partitioned tables,
>> is to use either find_all_inheritors() or find_inheritance_children()
>> depending on the context.  They return child table OIDs in the (ascending)
>> order of those OIDs, which means the callers that need to lock the child
>> tables can do so without worrying about the possibility of deadlock in
>> some concurrent execution of that piece of code.  That's good.
>>
>> For partitioned tables, there is a possibility of returning child table
>> (partition) OIDs in the partition bound order, which in addition to
>> preventing the possibility of deadlocks during concurrent locking, seems
>> potentially useful for other caller-specific optimizations.  For example,
>> tuple-routing code can utilize that fact to implement binary-search based
>> partition-searching algorithm.  For one more example, refer to the "UPDATE
>> partition key" thread where it's becoming clear that it would be nice if
>> the planner had put the partitions in bound order in the ModifyTable that
>> it creates for UPDATE of partitioned tables [1].
> 
> Thanks a lot for working on this. Partition-wise join can benefit from
> this as well. See comment about build_simple_rel's matching algorithm
> in [1]. It will become O(n) instead of O(n^2).

Nice.  It seems that we have a good demand for $subject. :)

>> So attached are two WIP patches:
>>
>> 0001 implements two interface functions:
>>
>>   List *get_all_partition_oids(Oid, LOCKMODE)
>>   List *get_partition_oids(Oid, LOCKMODE)
>>
>> that resemble find_all_inheritors() and find_inheritance_children(),
>> respectively, but expect that users call them only for partitioned tables.
>>  Needless to mention, OIDs are returned with canonical order determined by
>> that of the partition bounds and they way partition tree structure is
>> traversed (top-down, breadth-first-left-to-right).  Patch replaces all the
>> calls of the old interface functions with the respective new ones for
>> partitioned table parents.  That means expand_inherited_rtentry (among
>> others) now calls get_all_partition_oids() if the RTE is for a partitioned
>> table and find_all_inheritors() otherwise.
>>
>> In its implementation, get_all_partition_oids() calls
>> RelationGetPartitionDispatchInfo(), which is useful to generate the result
>> list in the desired partition bound order.  But the current interface and
>> implementation of RelationGetPartitionDispatchInfo() needs some rework,
>> because it's too closely coupled with the executor's tuple routing code.
> 
> May be we want to implement get_all_partition_oids() calling
> get_partition_oids() on each new entry that gets added, similar to
> find_all_inheritors(). That might avoid changes to DispatchInfo() and
> also dependency on that structure.
> 
> Also almost every place which called find_all_inheritors() or
> find_inheritance_children() is changed to if () else case calling
> those functions or the new function as required. May be we should
> create macros/functions to do that routing to keep the code readable.
> May be find_all_inheritors() and find_inheritance_children()
> themselves become the routing function and their original code moves
> into new functions get_all_inheritors() and
> get_inheritance_children(). We may choose other names for functions.
> The idea is to create routing functions/macros instead of sprinkling
> code with if () else conditions.

Given the Robert's comments [1], it seems that we might have to abandon
the idea to separate the interface for partitioned and non-partitioned
inheritance cases.  I'm thinking about the issues and alternatives he
mentioned in his email.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmobwbh12OJerqAGyPEjb_%2B2y7T0nqRKTcjed6L4NTET6Fg%40mail.gmail.com




Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Fri, Aug 4, 2017 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The current way to expand inherited tables, including partitioned tables,
>> is to use either find_all_inheritors() or find_inheritance_children()
>> depending on the context.  They return child table OIDs in the (ascending)
>> order of those OIDs, which means the callers that need to lock the child
>> tables can do so without worrying about the possibility of deadlock in
>> some concurrent execution of that piece of code.  That's good.
>>
>> For partitioned tables, there is a possibility of returning child table
>> (partition) OIDs in the partition bound order, which in addition to
>> preventing the possibility of deadlocks during concurrent locking, seems
>> potentially useful for other caller-specific optimizations.  For example,
>> tuple-routing code can utilize that fact to implement binary-search based
>> partition-searching algorithm.  For one more example, refer to the "UPDATE
>> partition key" thread where it's becoming clear that it would be nice if
>> the planner had put the partitions in bound order in the ModifyTable that
>> it creates for UPDATE of partitioned tables [1].
>
> I guess I don't really understand why we want to change the locking
> order.  That is bound to make expanding the inheritance hierarchy more
> expensive. If we use this approach in all cases, it seems to me we're
> bound to reintroduce the problem we fixed in commit
> c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
> the same vein.

I initially didn't understand this, but I think now I understand it.
Establishing the order of children by partition bounds requires
building the relcache entry right now. That's what is expensive and
would introduce the same problems as the commit you have quoted.

> But I don't see that there's any necessary relation
> between the order of locking and the order of expansion inside the
> relcache entry/plan/whatever else -- so I propose that we keep the
> existing locking order and only change the other stuff.
>
> While reading related code this morning, I noticed that
> RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
> *already* changed the locking order for certain operations, because
> the PartitionDesc's OID array is bound-ordered not OID-ordered.  That
> means that when RelationGetPartitionDispatchInfo uses the
> PartitionDesc's OID arra to figure out what to lock, it's potentially
> going to encounter partitions in a different order than would have
> been the case if it had used find_all_inheritors directly.  I'm
> tempted to think that RelationGetPartitionDispatchInfo shouldn't
> really be doing locking at all.  The best way to have the locking
> always happen in the same order is to have only one piece of code that
> determines that order - and I vote for find_all_inheritors.  Aside
> from the fact that it's the way we've always done it (and still do it
> in most other places), that code includes infinite-loop defenses which
> the new code you've introduced lacks.
>

+1.

> Concretely, my proposal is:
>
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).
>
> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
>
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it.  If not, there's no need.
>
> 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
> to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
> elog(ERROR, ...).  Might be interesting to stuff that check into the
> relation_open(..., NoLock) path, too.
>
> One objection to this line of attack is that there might be a good
> case for locking only the partitioned inheritors first and then going
> back and locking the leaf nodes in a second pass, or even only when
> required for a particular row.  However, that doesn't require putting
> everything in bound order - it only requires moving the partitioned
> children to the beginning of the list.  And I think rather than having
> new logic for that, we should teach find_inheritance_children() to do
> that directly.  I have a feeling Ashutosh is going to cringe at this
> suggestion, but my idea is to do this by denormalizing: add a column
> to pg_inherits indicating whether the child is of
> RELKIND_PARTITIONED_TABLE.  Then, when find_inheritance_children scans
> pg_inherits, it can pull that flag out for free along with the
> relation OID, and qsort() first by the flag and then by the OID.  It
> can also return the number of initial elements of its return value
> which have that flag set.

I am always uncomfortable, when we save the same information in two
places without having a way to make sure that they are in sync. That
means we have to add explicit code to make sure that that information
is kept in sync. Somebody forgetting to add that code wherever
necessary means we have contradictory information persisted in the
databases without an idea of which of them is correct. Not necessarily
in this case, but usually it is an indication of something going wrong
with the way schema is designed. May be it's better to use your idea
of using get_rel_relkind() or find a way to check that the flag is in
sync with the relkind, like when building the relcache.

>
> Then, in find_all_inheritors, we split rels_list into
> partitioned_rels_list and other_rels_list, and process
> partitioned_rels_list in its entirety before touching other_rels_list;
> they get concatenated at the end.
>
> Now, find_all_inheritors and find_inheritance_children can also grow a
> flag bool only_partitioned_children; if set, then we skip the
> unpartitioned children entirely.
>
> With all that in place, you can call find_all_inheritors(blah blah,
> false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
> true) to lock just the partitioned tables in the hierarchy.  You get a
> consistent lock order either way, and if you start with only the
> partitioned tables and later want the leaf partitions too, you just go
> through the partitioned children in the order they were returned and
> find_inheritance_children(blah blah, false) on each one of them and
> the lock order is exactly consistent with what you would have gotten
> if you'd done find_all_inheritors(blah blah, false) originally.
>
> Thoughts?

I noticed that find_all_inheritors() uses a hash table to eliminate
duplicates arising out of multiple inheritance. Partition hierarchy is
never going to have multiple inheritance, and doesn't need to
eliminate duplicates and so doesn't need the hash table. It will be
good, if we can eliminate that overhead. But that's separate task than
what this thread is about.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Mon, Aug 7, 2017 at 11:18 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>>
>> One objection to this line of attack is that there might be a good
>> case for locking only the partitioned inheritors first and then going
>> back and locking the leaf nodes in a second pass, or even only when
>> required for a particular row.  However, that doesn't require putting
>> everything in bound order - it only requires moving the partitioned
>> children to the beginning of the list.  And I think rather than having
>> new logic for that, we should teach find_inheritance_children() to do
>> that directly.  I have a feeling Ashutosh is going to cringe at this
>> suggestion, but my idea is to do this by denormalizing: add a column
>> to pg_inherits indicating whether the child is of
>> RELKIND_PARTITIONED_TABLE.  Then, when find_inheritance_children scans
>> pg_inherits, it can pull that flag out for free along with the
>> relation OID, and qsort() first by the flag and then by the OID.  It
>> can also return the number of initial elements of its return value
>> which have that flag set.
>
> I am always uncomfortable, when we save the same information in two
> places without having a way to make sure that they are in sync. That
> means we have to add explicit code to make sure that that information
> is kept in sync. Somebody forgetting to add that code wherever
> necessary means we have contradictory information persisted in the
> databases without an idea of which of them is correct. Not necessarily
> in this case, but usually it is an indication of something going wrong
> with the way schema is designed. May be it's better to use your idea
> of using get_rel_relkind() or find a way to check that the flag is in
> sync with the relkind, like when building the relcache.

Said all that, I think we will use this code quite often and so the
performance benefits by replicating the information are worth the
trouble of maintaining code to sync and check the duplicate
information.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/05 2:25, Robert Haas wrote:
> On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The current way to expand inherited tables, including partitioned tables,
>> is to use either find_all_inheritors() or find_inheritance_children()
>> depending on the context.  They return child table OIDs in the (ascending)
>> order of those OIDs, which means the callers that need to lock the child
>> tables can do so without worrying about the possibility of deadlock in
>> some concurrent execution of that piece of code.  That's good.
>>
>> For partitioned tables, there is a possibility of returning child table
>> (partition) OIDs in the partition bound order, which in addition to
>> preventing the possibility of deadlocks during concurrent locking, seems
>> potentially useful for other caller-specific optimizations.  For example,
>> tuple-routing code can utilize that fact to implement binary-search based
>> partition-searching algorithm.  For one more example, refer to the "UPDATE
>> partition key" thread where it's becoming clear that it would be nice if
>> the planner had put the partitions in bound order in the ModifyTable that
>> it creates for UPDATE of partitioned tables [1].
> 
> I guess I don't really understand why we want to change the locking
> order.  That is bound to make expanding the inheritance hierarchy more
> expensive.  If we use this approach in all cases, it seems to me we're
> bound to reintroduce the problem we fixed in commit
> c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
> the same vein.  But I don't see that there's any necessary relation
> between the order of locking and the order of expansion inside the
> relcache entry/plan/whatever else -- so I propose that we keep the
> existing locking order and only change the other stuff.

Hmm, okay.

I guess I was trying to fit one solution to what might be two (or worse,
more) problems of the current implementation, which is not good.

> While reading related code this morning, I noticed that
> RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
> *already* changed the locking order for certain operations, because
> the PartitionDesc's OID array is bound-ordered not OID-ordered.  That
> means that when RelationGetPartitionDispatchInfo uses the
> PartitionDesc's OID arra to figure out what to lock, it's potentially
> going to encounter partitions in a different order than would have
> been the case if it had used find_all_inheritors directly.

I think Amit Khandekar mentioned this on the UPDATE partition key thread [1].

> I'm
> tempted to think that RelationGetPartitionDispatchInfo shouldn't
> really be doing locking at all.  The best way to have the locking
> always happen in the same order is to have only one piece of code that
> determines that order - and I vote for find_all_inheritors.  Aside
> from the fact that it's the way we've always done it (and still do it
> in most other places), that code includes infinite-loop defenses which
> the new code you've introduced lacks.

As long as find_all_inheritors() is a place only to determine the order in
which partitions will be locked, it's fine.  My concern is about the time
of actual locking, which in the current planner implementation is too soon
that we end up needlessly locking all the partitions.

(Also in the current implementation, we open all the partitions to
construct Var translation lists, which are actually unused through most of
the planner stages, but admittedly it's a separate issue.)

The locking-partitions-too-soon issue, I think, is an important one and
may need discussing separately, but thought I'd mention it anyway.  It
also seems somewhat related to this discussion, but I may be wrong.

> Concretely, my proposal is:
> 
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).

ISTM, we'd want to lock the partitions after we've determined the specific
ones a query needs to scan using the information returned by
RelationGetPartitionDispatchInfo.  That means the latter had better locked
the relations whose cached partition descriptors will be used to determine
the result that it produces.  One way to do that might be to lock all the
tables in the list returned by find_all_inheritors that are partitioned
tables before calling RelationGetPartitionDispatchInfo.  It seems what the
approach you've outlined below will let us do that.

BTW, IIUC, there will be two lists of OIDs we'll  have: one in the
find_all_inheritors order, say, L1 and the other determined by using
partitioning-specific information for the given query, say L2.

To lock, we iterate L1 and if a given member is in L2, we lock it.  It
might be possible to make it as cheap as O(nlogn).

L2 is the order we put leaf partitions into a given plan.

> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
> 
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it.  If not, there's no need.

That's what the proposed refactoring patch 0002 actually does.

> 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
> to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
> elog(ERROR, ...).  Might be interesting to stuff that check into the
> relation_open(..., NoLock) path, too.
> 
> One objection to this line of attack is that there might be a good
> case for locking only the partitioned inheritors first and then going
> back and locking the leaf nodes in a second pass, or even only when
> required for a particular row.  However, that doesn't require putting
> everything in bound order - it only requires moving the partitioned
> children to the beginning of the list.  And I think rather than having
> new logic for that, we should teach find_inheritance_children() to do
> that directly.  I have a feeling Ashutosh is going to cringe at this
> suggestion, but my idea is to do this by denormalizing: add a column
> to pg_inherits indicating whether the child is of
> RELKIND_PARTITIONED_TABLE.  Then, when find_inheritance_children scans
> pg_inherits, it can pull that flag out for free along with the
> relation OID, and qsort() first by the flag and then by the OID.  It
> can also return the number of initial elements of its return value
> which have that flag set.

Maybe, we can make the initial patch use syscache to get the relkind for a
given child.  If the syscache bloat is unbearable, we go with the
denormalization approach.

> Then, in find_all_inheritors, we split rels_list into
> partitioned_rels_list and other_rels_list, and process
> partitioned_rels_list in its entirety before touching other_rels_list;
> they get concatenated at the end.
> 
> Now, find_all_inheritors and find_inheritance_children can also grow a
> flag bool only_partitioned_children; if set, then we skip the
> unpartitioned children entirely.
> 
> With all that in place, you can call find_all_inheritors(blah blah,
> false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
> true) to lock just the partitioned tables in the hierarchy.  You get a
> consistent lock order either way, and if you start with only the
> partitioned tables and later want the leaf partitions too, you just go
> through the partitioned children in the order they were returned and
> find_inheritance_children(blah blah, false) on each one of them and
> the lock order is exactly consistent with what you would have gotten
> if you'd done find_all_inheritors(blah blah, false) originally.
> 
> Thoughts?

So, with this in place:

1. Call find_all_inheritors to lock partitioned tables in the tree in an  order prescribed by OIDs

2. Call RelationGetPartitionDispatchInfo at an opportune time, which will  generate minimal information about the
partitiontree that it can do  without having to worry about locking anything
 

3. Determine the list of which leaf partitions will need to be scanned  using the information obtained in 2, if
possibleto do that at all [2].
 

4. Lock leaf partitions in the find_inheritance_children prescribed order,  but only those that are in the list built
in3.
 

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

Thanks.

Regards,
Amit

[1]
https://www.postgresql.org/message-id/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB%3DwY9ZLfe8cQOfG4bTqVGyQ%40mail.gmail.com

[2] We can do that in set_append_rel_size(), but not in   inheritance_planner()




Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Mon, Aug 7, 2017 at 1:48 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> with the way schema is designed. May be it's better to use your idea
> of using get_rel_relkind() or find a way to check that the flag is in
> sync with the relkind, like when building the relcache.

That's got the same problem as building a full relcache entry: cache
bloat.  It will have *less* cache bloat, but still some.  Maybe it's
little enough to be tolerable; not sure.  But we want this system to
scale to LOTS of partitions eventually, so building on a design that
we know has scaling problems seems a bit short-sighted.

> I noticed that find_all_inheritors() uses a hash table to eliminate
> duplicates arising out of multiple inheritance. Partition hierarchy is
> never going to have multiple inheritance, and doesn't need to
> eliminate duplicates and so doesn't need the hash table. It will be
> good, if we can eliminate that overhead. But that's separate task than
> what this thread is about.

I don't want to eliminate that overhead.  If the catalog is manually
modified or corrupted, the problem could still occur, and result in
backend crashes or, at best, incomprehensible errors.  The comments
allude to this problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Mon, Aug 7, 2017 at 2:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I think Amit Khandekar mentioned this on the UPDATE partition key thread [1].

Yes, similar discussion.

> As long as find_all_inheritors() is a place only to determine the order in
> which partitions will be locked, it's fine.  My concern is about the time
> of actual locking, which in the current planner implementation is too soon
> that we end up needlessly locking all the partitions.

I don't think avoiding that problem is going to be easy.  We need a
bunch of per-relation information, like the size of each relation, and
what indexes it has, and how big they are, and the statistics for each
one.  It was at one point proposed by someone that every partition
should be required to have the same indexes, but (1) we didn't
implement it like that and (2) if we had done that it wouldn't solve
this problem anyway because the sizes are still going to vary.

Note that I'm not saying this isn't a good problem to solve, just that
it's likely to be a very hard problem to solve.

> The locking-partitions-too-soon issue, I think, is an important one and
> ISTM, we'd want to lock the partitions after we've determined the specific
> ones a query needs to scan using the information returned by
> RelationGetPartitionDispatchInfo.  That means the latter had better locked
> the relations whose cached partition descriptors will be used to determine
> the result that it produces.  One way to do that might be to lock all the
> tables in the list returned by find_all_inheritors that are partitioned
> tables before calling RelationGetPartitionDispatchInfo.  It seems what the
> approach you've outlined below will let us do that.

Yeah, I think so.  I think we could possibly open and lock partitioned
children only, then prune away leaf partitions that we can determine
aren't needed, then open and lock the leaf partitions that are needed.

> BTW, IIUC, there will be two lists of OIDs we'll  have: one in the
> find_all_inheritors order, say, L1 and the other determined by using
> partitioning-specific information for the given query, say L2.
>
> To lock, we iterate L1 and if a given member is in L2, we lock it.  It
> might be possible to make it as cheap as O(nlogn).

Commonly, we'll prune no partitions or all but one; and we should be
able to make those cases very fast.  Other cases can cost a little
more, but I'll certainly complain about anything more than O(n lg n).

>> 3. While we're optimizing, in the first loop inside of
>> RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
>> use get_rel_relkind() to see whether we've got a partitioned table; if
>> so, open it.  If not, there's no need.
>
> That's what the proposed refactoring patch 0002 actually does.

Great.

> Maybe, we can make the initial patch use syscache to get the relkind for a
> given child.  If the syscache bloat is unbearable, we go with the
> denormalization approach.

Yeah.  Maybe if you write that patch, you can also test it to see how
bad the bloat is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/07 14:37, Amit Khandekar wrote:
> On 4 August 2017 at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
>> P.S. While I haven't reviewed 0002 in detail, I think the concept of
>> minimizing what needs to be built in RelationGetPartitionDispatchInfo
>> is a very good idea.
> 
> True. I also think, RelationGetPartitionDispatchInfo () should be
> called while preparing the ModifyTable plan; the PartitionDispatch
> data structure returned by RelationGetPartitionDispatchInfo() should
> be stored in that plan, and then the execution-time fields in
> PartitionDispatch would be populated in ExecInitModifyTable().

I'm not sure if we could ever store the PartitionDispatch structure itself
in the plan.

Planner would build and use it to put the leaf partition sub-plans in the
canonical order in the resulting plan (Append, ModifyTable, etc.).
Executor will have to rebuild the PartitionDispatch info again if and when
it needs the same (for example, in ExecSetupPartitionTupleRouting for
insert or update tuple routing).

The refactoring patch that I've proposed (0002) makes PartitionDispatch
structure itself contain a lot less information/state than it currently
does.  So RelationGetPartitionDispatchInfo's job per the revised patch is
to reveal the partition tree structure and the information of each
partitioned table that the tree contains.  The original design whereby it
builds and puts into PartitionDispatchData thing like partition key
execution state (ExprState), TupleTableSlot, TupleConversionMap seems
wrong to me in retrospect and we should somehow revise it.  Those things I
mentioned are only needed for tuple-routing, so they should be built and
managed by the executor, not partition.c.  Any feedback on the proposed
patch is welcome. :)

Thanks,
Amit




Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/08 4:34, Robert Haas wrote:
> On Mon, Aug 7, 2017 at 2:54 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> As long as find_all_inheritors() is a place only to determine the order in
>> which partitions will be locked, it's fine.  My concern is about the time
>> of actual locking, which in the current planner implementation is too soon
>> that we end up needlessly locking all the partitions.
> 
> I don't think avoiding that problem is going to be easy.  We need a
> bunch of per-relation information, like the size of each relation, and
> what indexes it has, and how big they are, and the statistics for each
> one.  It was at one point proposed by someone that every partition
> should be required to have the same indexes, but (1) we didn't
> implement it like that and (2) if we had done that it wouldn't solve
> this problem anyway because the sizes are still going to vary.

Sorry, I didn't mean to say we shouldn't lock and open partitions at all.
We do need their relation descriptors for planning and there is no doubt
about that.  I was just saying that we should do that only for the
partitions that are not pruned.  But, as you say, I can see that the
planner changes required to be able to do that might be hard.

>> The locking-partitions-too-soon issue, I think, is an important one and
>> ISTM, we'd want to lock the partitions after we've determined the specific
>> ones a query needs to scan using the information returned by
>> RelationGetPartitionDispatchInfo.  That means the latter had better locked
>> the relations whose cached partition descriptors will be used to determine
>> the result that it produces.  One way to do that might be to lock all the
>> tables in the list returned by find_all_inheritors that are partitioned
>> tables before calling RelationGetPartitionDispatchInfo.  It seems what the
>> approach you've outlined below will let us do that.
> 
> Yeah, I think so.  I think we could possibly open and lock partitioned
> children only, then prune away leaf partitions that we can determine
> aren't needed, then open and lock the leaf partitions that are needed.

Yes.

>> BTW, IIUC, there will be two lists of OIDs we'll  have: one in the
>> find_all_inheritors order, say, L1 and the other determined by using
>> partitioning-specific information for the given query, say L2.
>>
>> To lock, we iterate L1 and if a given member is in L2, we lock it.  It
>> might be possible to make it as cheap as O(nlogn).
> 
> Commonly, we'll prune no partitions or all but one; and we should be
> able to make those cases very fast.

Agreed.

>> Maybe, we can make the initial patch use syscache to get the relkind for a
>> given child.  If the syscache bloat is unbearable, we go with the
>> denormalization approach.
> 
> Yeah.  Maybe if you write that patch, you can also test it to see how
> bad the bloat is.

I will try and see, but maybe the syscache solution doesn't get us past
the proof-of-concept stage.

Thanks,
Amit




Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/05 2:25, Robert Haas wrote:
> Concretely, my proposal is:
> 
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).
> 
> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
> 
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it.  If not, there's no need.
> 
> 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
> to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
> elog(ERROR, ...).  Might be interesting to stuff that check into the
> relation_open(..., NoLock) path, too.
> 
> One objection to this line of attack is that there might be a good
> case for locking only the partitioned inheritors first and then going
> back and locking the leaf nodes in a second pass, or even only when
> required for a particular row.  However, that doesn't require putting
> everything in bound order - it only requires moving the partitioned
> children to the beginning of the list.  And I think rather than having
> new logic for that, we should teach find_inheritance_children() to do
> that directly.  I have a feeling Ashutosh is going to cringe at this
> suggestion, but my idea is to do this by denormalizing: add a column
> to pg_inherits indicating whether the child is of
> RELKIND_PARTITIONED_TABLE.  Then, when find_inheritance_children scans
> pg_inherits, it can pull that flag out for free along with the
> relation OID, and qsort() first by the flag and then by the OID.  It
> can also return the number of initial elements of its return value
> which have that flag set.
> 
> Then, in find_all_inheritors, we split rels_list into
> partitioned_rels_list and other_rels_list, and process
> partitioned_rels_list in its entirety before touching other_rels_list;
> they get concatenated at the end.
> 
> Now, find_all_inheritors and find_inheritance_children can also grow a
> flag bool only_partitioned_children; if set, then we skip the
> unpartitioned children entirely.
> 
> With all that in place, you can call find_all_inheritors(blah blah,
> false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
> true) to lock just the partitioned tables in the hierarchy.  You get a
> consistent lock order either way, and if you start with only the
> partitioned tables and later want the leaf partitions too, you just go
> through the partitioned children in the order they were returned and
> find_inheritance_children(blah blah, false) on each one of them and
> the lock order is exactly consistent with what you would have gotten
> if you'd done find_all_inheritors(blah blah, false) originally.

I tried implementing this in the attached set of patches.

[PATCH 2/5] Teach pg_inherits.c a bit about partitioning

Both find_inheritance_children and find_all_inheritors now list
partitioned child tables before non-partitioned ones and return
the number of partitioned tables in an optional output argument

[PATCH 3/5] Relieve RelationGetPartitionDispatchInfo() of doing locking

Anyone who wants to call RelationGetPartitionDispatchInfo() must first
acquire locks using find_all_inheritors.

TODO: Add RelationLockHeldByMe() and put if (!RelationLockHeldByMe())
elog(ERROR, ...) check in RelationGetPartitionDispatchInfo()

[PATCH 4/5] Teach expand_inherited_rtentry to use partition bound order

After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.

[PATCH 5/5] Store in pg_inherits if a child is a partitioned table

Catalog changes so that is_partitioned property of child tables is now
stored in pg_inherits.  This avoids consulting syscache to get that
property as is currently implemented in patch 2/5.


I haven't yet done anything about changing the timing of opening and
locking leaf partitions, because it will require some more thinking about
the required planner changes.  But the above set of patches will get us
far enough to get leaf partition sub-plans appear in the partition bound
order (same order as what partition tuple-routing uses in the executor).

With the above patches, we get the desired order of child sub-plans in
Append and ModifyTable plans for partitioned tables:

create table p (a int) partition by range (a);
create table p4 partition of p for values from (30) to (40);
create table p3 partition of p for values from (20) to (30);
create table p2 partition of p for values from (10) to (20);
create table p1 partition of p for values from (1) to (10);
create table p0 partition of p for values from (minvalue) to (1) partition
by list (a);
create table p00 partition of p0 for values in (0);
create table p01 partition of p0 for values in (-1);
create table p02 partition of p0 for values in (-2);


explain select count(*) from p;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=293.12..293.13 rows=1 width=8)
   ->  Append  (cost=0.00..248.50 rows=17850 width=0)
         ->  Seq Scan on p1  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p2  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p3  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p4  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p02  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p01  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p00  (cost=0.00..35.50 rows=2550 width=0)

explain update p set a = a;
                          QUERY PLAN
--------------------------------------------------------------
 Update on p  (cost=0.00..248.50 rows=17850 width=10)
   Update on p1
   Update on p2
   Update on p3
   Update on p4
   Update on p02
   Update on p01
   Update on p00
   ->  Seq Scan on p1  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p2  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p3  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p4  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p02  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p01  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p00  (cost=0.00..35.50 rows=2550 width=10)
(15 rows)

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

I put this patch ahead in the list and so it's now 0001.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Beena Emerson
Date:
Hi Amit,

On Thu, Aug 10, 2017 at 7:41 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/08/05 2:25, Robert Haas wrote:
>> Concretely, my proposal is:
>>
>> P.S. While I haven't reviewed 0002 in detail, I think the concept of
>> minimizing what needs to be built in RelationGetPartitionDispatchInfo
>> is a very good idea.
>
> I put this patch ahead in the list and so it's now 0001.
>

FYI, 0001 patch throws the warning:

execMain.c: In function ‘ExecSetupPartitionTupleRouting’:
execMain.c:3342:16: warning: ‘next_ptinfo’ may be used uninitialized
in this function [-Wmaybe-uninitialized]    next_ptinfo->parentid != ptinfo->parentid)

--

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Wed, Aug 9, 2017 at 10:11 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> P.S. While I haven't reviewed 0002 in detail, I think the concept of
>> minimizing what needs to be built in RelationGetPartitionDispatchInfo
>> is a very good idea.
>
> I put this patch ahead in the list and so it's now 0001.

I think what you've currently got as
0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch is a
bug fix that probably needs to be back-patched into v10, so it should
come first.

I think 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch and
0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch should
be merged into one patch and that should come next, followed by
0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch and
finally what you now have as
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch.

This patch series is blocking a bunch of other things, so it would be
nice if you could press forward with this quickly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/10 18:52, Beena Emerson wrote:
> Hi Amit,
> 
> On Thu, Aug 10, 2017 at 7:41 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/08/05 2:25, Robert Haas wrote:
>>> Concretely, my proposal is:
>>>
>>> P.S. While I haven't reviewed 0002 in detail, I think the concept of
>>> minimizing what needs to be built in RelationGetPartitionDispatchInfo
>>> is a very good idea.
>>
>> I put this patch ahead in the list and so it's now 0001.
>>
> 
> FYI, 0001 patch throws the warning:
> 
> execMain.c: In function ‘ExecSetupPartitionTupleRouting’:
> execMain.c:3342:16: warning: ‘next_ptinfo’ may be used uninitialized
> in this function [-Wmaybe-uninitialized]
>      next_ptinfo->parentid != ptinfo->parentid)

Thanks for the review.  Will fix in the updated version of the patch I
will post sometime later today.

Regards,
Amit




Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
Thanks for the review.

On 2017/08/16 2:27, Robert Haas wrote:
> On Wed, Aug 9, 2017 at 10:11 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> P.S. While I haven't reviewed 0002 in detail, I think the concept of
>>> minimizing what needs to be built in RelationGetPartitionDispatchInfo
>>> is a very good idea.
>>
>> I put this patch ahead in the list and so it's now 0001.
> 
> I think what you've currently got as
> 0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch is a
> bug fix that probably needs to be back-patched into v10, so it should
> come first.

That makes sense.  That patch is now 0001.  Checked that it can be
back-patched to REL_10_STABLE.

> I think 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch and
> 0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch should
> be merged into one patch and that should come next,

Merged the two into one: attached 0002.

> followed by
> 0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch and

This one is now 0003.

> finally what you now have as
> 0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch.

And 0004.

> This patch series is blocking a bunch of other things, so it would be
> nice if you could press forward with this quickly.

Attached updated patches.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

> Attached updated patches.

Thanks Amit for the patches.

I too agree with the overall approach taken for keeping the locking
order consistent: it's best to do the locking with the existing
find_all_inheritors() since it is much cheaper than to lock them in
partition-bound order, the later being expensive since it requires
opening partitioned tables.

> I haven't yet done anything about changing the timing of opening and
> locking leaf partitions, because it will require some more thinking about
> the required planner changes.  But the above set of patches will get us
> far enough to get leaf partition sub-plans appear in the partition bound
> order (same order as what partition tuple-routing uses in the executor).

So, I believe none of the changes done in pg_inherits.c are essential
for expanding the inheritence tree in bound order, right ? I am not
sure whether we are planning to commit these two things together or
incrementally :
1. Expand in bound order
2. Allow for locking only the partitioned tables first.

For #1, the changes in pg_inherits.c are not required (viz, keeping
partitioned tables at the head of the list, adding inhchildparted
column, etc).

If we are going to do #2 together with #1, then in the patch set there
is no one using the capability introduced by #2. That is, there are no
callers of find_all_inheritors() that make use of the new
num_partitioned_children parameter. Also, there is no boolean
parameter  for find_all_inheritors() to be used to lock only the
partitioned tables.

I feel we should think about
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
first get the review done for the other patches.

-------

I see that RelationGetPartitionDispatchInfo() now returns quite a
small subset of what it used to return, which is good. But I feel for
expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
still a bit heavy. We only require the oids, so the
PartitionedTableInfo data structure (including the pd->indexes array)
gets discarded.

Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind()
for each child. In expand_inherited_rtentry(), we anyway have to open
all the child tables, so we get the partition descriptors for each of
the children for free. So how about, in expand_inherited_rtentry(), we
traverse the partition tree using these descriptors similar to how it
is traversed in RelationGetPartitionDispatchInfo() ? May be to avoid
code duplication for traversing, we can have a common API.

Still looking at RelationGetPartitionDispatchInfo() changes ...


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Wed, Aug 16, 2017 at 11:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

>
>> This patch series is blocking a bunch of other things, so it would be
>> nice if you could press forward with this quickly.
>
> Attached updated patches.
>

Review for 0001. The attached patch has some minor changes to the
comments and code.

+ * All the relations in the partition tree (including 'rel') must have been
+ * locked (using at least the AccessShareLock) by the caller.
It would be good if we can Assert this in the function. But I couldn't find a
way to check whether the backend holds a lock of required strength. Is there
any?

        /*
-        * We locked all the partitions above including the leaf partitions.
-        * Note that each of the relations in *partitions are eventually
-        * closed by the caller.
+        * All the partitions were locked above.  Note that the relcache
+        * entries will be closed by ExecEndModifyTable().
         */
I don't see much value in this hunk, so removed it in the attached patch.

+   list_free(all_parts);
ExecSetupPartitionTupleRouting() will be called only once per DML statement.
Leaking the memory for the duration of DML may be worth the time spent
in the traversing
the list and freeing each cell independently. So removed the hunk in the
attached set.

0002 review
+
+     <row>
+      <entry><structfield>inhchildparted</structfield></entry>
+      <entry><type>bool</type></entry>
+      <entry></entry>
+      <entry>
+       This is <literal>true</> if the child table is a partitioned table,
+       <literal>false</> otherwise
+      </entry>
+     </row>
In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
be we should name the column as "inhchildpartitioned".

+#define OID_CMP(o1, o2) \
+       ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0));
Instead of duplicating the logic in this macro and oid_cmp(), we may want to
call this macro in oid_cmp()? Or simply call oid_cmp() from inhchildinfo_cmp()
passing pointers to the OIDs; a pointer indirection would be costly, but still
maintainable.

+   if (num_partitioned_children)
+       *num_partitioned_children = my_num_partitioned_children;
+
Instead of this conditional, why not to make every caller pass a pointer to
integer. The callers will just ignore the value if they don't want it. If we do
this change, we can get rid of my_num_partitioned_children variable and
directly update the passed in pointer variable.

        inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
-       if (numoids >= maxoids)
+       is_partitioned = ((Form_pg_inherits)
+                               GETSTRUCT(inheritsTuple))->inhchildparted;
Now that we are fetching two members from Form_pg_inherits structure, may be we
should use a local variable
Form_pg_inherits inherits_tuple = GETSTRUCT(inheritsTuple);
and use that to fetch its members.

I am still reviewing changes in this patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
Hi Amit,

Thanks for the comments.

On 2017/08/16 20:30, Amit Khandekar wrote:
> On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 
>> Attached updated patches.
> 
> Thanks Amit for the patches.
> 
> I too agree with the overall approach taken for keeping the locking
> order consistent: it's best to do the locking with the existing
> find_all_inheritors() since it is much cheaper than to lock them in
> partition-bound order, the later being expensive since it requires
> opening partitioned tables.

Yeah.  Per the Robert's design idea, we will always do the *locking* in
the order determined by find_all_inheritors/find_inheritance_children.

>> I haven't yet done anything about changing the timing of opening and
>> locking leaf partitions, because it will require some more thinking about
>> the required planner changes.  But the above set of patches will get us
>> far enough to get leaf partition sub-plans appear in the partition bound
>> order (same order as what partition tuple-routing uses in the executor).
> 
> So, I believe none of the changes done in pg_inherits.c are essential
> for expanding the inheritence tree in bound order, right ?

Right.

The changes to pg_inherits.c are only about recognizing partitioned tables
in an inheritance hierarchy and putting them ahead in the returned list.
Now that I think of it, the title of the patch (teach pg_inherits.c about
"partitioning") sounds a bit confusing.  In particular, the patch does not
teach it things like partition bound order, just that some tables in the
hierarchy could be partitioned tables.

> I am not
> sure whether we are planning to commit these two things together or
> incrementally :
> 1. Expand in bound order
> 2. Allow for locking only the partitioned tables first.
> 
> For #1, the changes in pg_inherits.c are not required (viz, keeping
> partitioned tables at the head of the list, adding inhchildparted
> column, etc).

Yes.  Changes to pg_inherits.c can be committed completely independently
of either 1 or 2, although then there would be nobody using that capability.

About 2: I think the capability to lock only the partitioned tables in
expand_inherited_rtentry() will only be used once we have the patch to do
the necessary planner restructuring that will allow us to defer child
table locking to some place that is not expand_inherited_rtentry().  I am
working on that patch now and should be able to show something soon.

> If we are going to do #2 together with #1, then in the patch set there
> is no one using the capability introduced by #2. That is, there are no
> callers of find_all_inheritors() that make use of the new
> num_partitioned_children parameter. Also, there is no boolean
> parameter  for find_all_inheritors() to be used to lock only the
> partitioned tables.
> 
> I feel we should think about
> 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
> first get the review done for the other patches.

I think that makes sense.

> I see that RelationGetPartitionDispatchInfo() now returns quite a
> small subset of what it used to return, which is good. But I feel for
> expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
> still a bit heavy. We only require the oids, so the
> PartitionedTableInfo data structure (including the pd->indexes array)
> gets discarded.

Maybe we could make the output argument optional, but I don't see much
point in being too conservative here.  Building the indexes array does not
cost us that much and if a not-too-distant-in-future patch could use that
information somehow, it could do so for free.

> Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind()
> for each child. In expand_inherited_rtentry(), we anyway have to open
> all the child tables, so we get the partition descriptors for each of
> the children for free. So how about, in expand_inherited_rtentry(), we
> traverse the partition tree using these descriptors similar to how it
> is traversed in RelationGetPartitionDispatchInfo() ? May be to avoid
> code duplication for traversing, we can have a common API.

As mentioned, one goal I'm seeking is to avoid having to open the child
tables in expand_inherited_rtentry().

Thanks,
Amit




Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
Hi Ashutosh,

Thanks for the review and the updated patch.

On 2017/08/16 21:48, Ashutosh Bapat wrote:
> On Wed, Aug 16, 2017 at 11:06 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 
>>
>>> This patch series is blocking a bunch of other things, so it would be
>>> nice if you could press forward with this quickly.
>>
>> Attached updated patches.
>>
> 
> Review for 0001. The attached patch has some minor changes to the
> comments and code.
> 
> + * All the relations in the partition tree (including 'rel') must have been
> + * locked (using at least the AccessShareLock) by the caller.
>
> It would be good if we can Assert this in the function. But I couldn't find a
> way to check whether the backend holds a lock of required strength. Is there
> any?

Currently there isn't.  Robert suggested a RelationLockHeldByMe(Oid) [1],
which is still a TODO on my plate.

>         /*
> -        * We locked all the partitions above including the leaf partitions.
> -        * Note that each of the relations in *partitions are eventually
> -        * closed by the caller.
> +        * All the partitions were locked above.  Note that the relcache
> +        * entries will be closed by ExecEndModifyTable().
>          */
> I don't see much value in this hunk,

I thought the new text was a bit clearer, but maybe that's just me.  Will
remove.

> +   list_free(all_parts);
> ExecSetupPartitionTupleRouting() will be called only once per DML statement.
> Leaking the memory for the duration of DML may be worth the time spent
> in the traversing
> the list and freeing each cell independently.

Fair enough, will remove the list_free().

> 0002 review
> +
> +     <row>
> +      <entry><structfield>inhchildparted</structfield></entry>
> +      <entry><type>bool</type></entry>
> +      <entry></entry>
> +      <entry>
> +       This is <literal>true</> if the child table is a partitioned table,
> +       <literal>false</> otherwise
> +      </entry>
> +     </row>
> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
> be we should name the column as "inhchildpartitioned".

Sure.

> +#define OID_CMP(o1, o2) \
> +       ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0));
> Instead of duplicating the logic in this macro and oid_cmp(), we may want to
> call this macro in oid_cmp()? Or simply call oid_cmp() from inhchildinfo_cmp()
> passing pointers to the OIDs; a pointer indirection would be costly, but still
> maintainable.

Actually, I avoided using oid_cmp exactly for that reason.

> +   if (num_partitioned_children)
> +       *num_partitioned_children = my_num_partitioned_children;
> +
> Instead of this conditional, why not to make every caller pass a pointer to
> integer. The callers will just ignore the value if they don't want it. If we do
> this change, we can get rid of my_num_partitioned_children variable and
> directly update the passed in pointer variable.

There are a bunch of callers of find_all_inheritors() and
find_inheritance_children.  Changes to make them all declare a pointless
variable seemed off to me.  The conditional in question doesn't seem to be
that expensive.  (To be fair, the one introduced in find_all_inheritors()
kind of is as implemented by the patch, because it's executed for every
iteration of the foreach(l, rels_list) loop, which I will fix.)

> 
>         inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
> -       if (numoids >= maxoids)
> +       is_partitioned = ((Form_pg_inherits)
> +                               GETSTRUCT(inheritsTuple))->inhchildparted;
> Now that we are fetching two members from Form_pg_inherits structure, may be we
> should use a local variable
> Form_pg_inherits inherits_tuple = GETSTRUCT(inheritsTuple);
> and use that to fetch its members.

Sure, will do.

> I am still reviewing changes in this patch.

Okay, will wait for more comments before sending the updated patches.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmobwbh12OJerqAGyPEjb_%2B2y7T0nqRKTcjed6L4NTET6Fg%40mail.gmail.com




Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
>> be we should name the column as "inhchildpartitioned".
>
> Sure.

I suggest inhpartitioned or inhispartition.  inhchildpartitioned seems too long.

> There are a bunch of callers of find_all_inheritors() and
> find_inheritance_children.  Changes to make them all declare a pointless
> variable seemed off to me.  The conditional in question doesn't seem to be
> that expensive.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/17 11:22, Robert Haas wrote:
> On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
>>> be we should name the column as "inhchildpartitioned".
>>
>> Sure.
> 
> I suggest inhpartitioned or inhispartition.  inhchildpartitioned seems too long.

inhchildpartitioned indeed seems long.

Since we storing if the child table (one with the OID inhrelid) is
partitioned, inhpartitioned seems best to me.  Will implement that.

Thanks,
Amit




Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Thu, Aug 17, 2017 at 8:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/08/17 11:22, Robert Haas wrote:
>> On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
>>>> be we should name the column as "inhchildpartitioned".
>>>
>>> Sure.
>>
>> I suggest inhpartitioned or inhispartition.  inhchildpartitioned seems too long.
>
> inhchildpartitioned indeed seems long.
>
> Since we storing if the child table (one with the OID inhrelid) is
> partitioned, inhpartitioned seems best to me.  Will implement that.

inhchildpartitioned is long but clearly tells that the child table is
partitioned, not the parent. pg_inherit can have parents which are not
partitioned, so it's better to have self-explanatory catalog name. I
am fine with some other name as long as it's clear.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/17 13:56, Ashutosh Bapat wrote:
> On Thu, Aug 17, 2017 at 8:06 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/08/17 11:22, Robert Haas wrote:
>>> On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>>> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
>>>>> be we should name the column as "inhchildpartitioned".
>>>>
>>>> Sure.
>>>
>>> I suggest inhpartitioned or inhispartition.  inhchildpartitioned seems too long.
>>
>> inhchildpartitioned indeed seems long.
>>
>> Since we storing if the child table (one with the OID inhrelid) is
>> partitioned, inhpartitioned seems best to me.  Will implement that.
> 
> inhchildpartitioned is long but clearly tells that the child table is
> partitioned, not the parent. pg_inherit can have parents which are not
> partitioned, so it's better to have self-explanatory catalog name. I
> am fine with some other name as long as it's clear.

OTOH, the pg_inherits field that stores the OID of the child table does
not mention "child" in its name (inhrelid), although you are right that
inhpartitioned can be taken to mean that the inheritance parent
(inhparent) is partitioned.  In any case, system catalog documentation
which clearly states what's what might be the best guide for the confused.

Of course, we can add a comment in pg_inherits.h next to the field
explaining what it is for those reading the source code and confused about
what inhpartitioned means.

Thanks,
Amit




Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Thu, Aug 17, 2017 at 10:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/08/17 13:56, Ashutosh Bapat wrote:
>> On Thu, Aug 17, 2017 at 8:06 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> On 2017/08/17 11:22, Robert Haas wrote:
>>>> On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
>>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>>>> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
>>>>>> be we should name the column as "inhchildpartitioned".
>>>>>
>>>>> Sure.
>>>>
>>>> I suggest inhpartitioned or inhispartition.  inhchildpartitioned seems too long.
>>>
>>> inhchildpartitioned indeed seems long.
>>>
>>> Since we storing if the child table (one with the OID inhrelid) is
>>> partitioned, inhpartitioned seems best to me.  Will implement that.
>>
>> inhchildpartitioned is long but clearly tells that the child table is
>> partitioned, not the parent. pg_inherit can have parents which are not
>> partitioned, so it's better to have self-explanatory catalog name. I
>> am fine with some other name as long as it's clear.
>
> OTOH, the pg_inherits field that stores the OID of the child table does
> not mention "child" in its name (inhrelid), although you are right that
> inhpartitioned can be taken to mean that the inheritance parent
> (inhparent) is partitioned.  In any case, system catalog documentation
> which clearly states what's what might be the best guide for the confused.
>
Sorry, I overlooked this detail. To me it means that the table is
driven by the child and inhpartitioned looks good then.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/17 10:09, Amit Langote wrote:
> On 2017/08/16 20:30, Amit Khandekar wrote:
>> On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I am not
>> sure whether we are planning to commit these two things together or
>> incrementally :
>> 1. Expand in bound order
>> 2. Allow for locking only the partitioned tables first.
>>
>> For #1, the changes in pg_inherits.c are not required (viz, keeping
>> partitioned tables at the head of the list, adding inhchildparted
>> column, etc).
> 
> Yes.  Changes to pg_inherits.c can be committed completely independently
> of either 1 or 2, although then there would be nobody using that capability.
> 
> About 2: I think the capability to lock only the partitioned tables in
> expand_inherited_rtentry() will only be used once we have the patch to do
> the necessary planner restructuring that will allow us to defer child
> table locking to some place that is not expand_inherited_rtentry().  I am
> working on that patch now and should be able to show something soon.
> 
>> If we are going to do #2 together with #1, then in the patch set there
>> is no one using the capability introduced by #2. That is, there are no
>> callers of find_all_inheritors() that make use of the new
>> num_partitioned_children parameter. Also, there is no boolean
>> parameter  for find_all_inheritors() to be used to lock only the
>> partitioned tables.
>>
>> I feel we should think about
>> 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
>> first get the review done for the other patches.
> 
> I think that makes sense.

After thinking some more on this, I think Amit's suggestion to put this
patch at the end of the priority list is good (that is, the patch that
teaches pg_inherits infrastructure to list partitioned tables ahead in the
list.)  Its purpose is mainly to fulfill the requirement that partitioned
tables be able to be locked ahead of any leaf partitions in the list (per
the design idea Robert suggested [1]).  Patch which requires that
capability is not in the picture yet.  Perhaps, we could review and commit
this patch only when that future patch shows up.  So, I will hold that
patch for now.

Thoughts?

Attached rest of the patches.  0001 has changes per Ashutosh's review
comments [2]:

0001: Relieve RelationGetPartitionDispatchInfo() of doing any locking

0002: Teach expand_inherited_rtentry to use partition bound order

0003: Decouple RelationGetPartitionDispatchInfo() from executor

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmobwbh12OJerqAGyPEjb_%2B2y7T0nqRKTcjed6L4NTET6Fg%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/CAFjFpRdXn7w7nVKv4qN9fa%2BtzRi_mJFNCsBWy%3Dbd0SLbYczUfA%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 17 August 2017 at 06:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> Thanks for the comments.
>
> On 2017/08/16 20:30, Amit Khandekar wrote:
>> On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>>> Attached updated patches.
>>
>> Thanks Amit for the patches.
>>
>> I too agree with the overall approach taken for keeping the locking
>> order consistent: it's best to do the locking with the existing
>> find_all_inheritors() since it is much cheaper than to lock them in
>> partition-bound order, the later being expensive since it requires
>> opening partitioned tables.
>
> Yeah.  Per the Robert's design idea, we will always do the *locking* in
> the order determined by find_all_inheritors/find_inheritance_children.
>
>>> I haven't yet done anything about changing the timing of opening and
>>> locking leaf partitions, because it will require some more thinking about
>>> the required planner changes.  But the above set of patches will get us
>>> far enough to get leaf partition sub-plans appear in the partition bound
>>> order (same order as what partition tuple-routing uses in the executor).
>>
>> So, I believe none of the changes done in pg_inherits.c are essential
>> for expanding the inheritence tree in bound order, right ?
>
> Right.
>
> The changes to pg_inherits.c are only about recognizing partitioned tables
> in an inheritance hierarchy and putting them ahead in the returned list.
> Now that I think of it, the title of the patch (teach pg_inherits.c about
> "partitioning") sounds a bit confusing.  In particular, the patch does not
> teach it things like partition bound order, just that some tables in the
> hierarchy could be partitioned tables.
>
>> I am not
>> sure whether we are planning to commit these two things together or
>> incrementally :
>> 1. Expand in bound order
>> 2. Allow for locking only the partitioned tables first.
>>
>> For #1, the changes in pg_inherits.c are not required (viz, keeping
>> partitioned tables at the head of the list, adding inhchildparted
>> column, etc).
>
> Yes.  Changes to pg_inherits.c can be committed completely independently
> of either 1 or 2, although then there would be nobody using that capability.
>
> About 2: I think the capability to lock only the partitioned tables in
> expand_inherited_rtentry() will only be used once we have the patch to do
> the necessary planner restructuring that will allow us to defer child
> table locking to some place that is not expand_inherited_rtentry().  I am
> working on that patch now and should be able to show something soon.
>
>> If we are going to do #2 together with #1, then in the patch set there
>> is no one using the capability introduced by #2. That is, there are no
>> callers of find_all_inheritors() that make use of the new
>> num_partitioned_children parameter. Also, there is no boolean
>> parameter  for find_all_inheritors() to be used to lock only the
>> partitioned tables.
>>
>> I feel we should think about
>> 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
>> first get the review done for the other patches.
>
> I think that makes sense.
>
>> I see that RelationGetPartitionDispatchInfo() now returns quite a
>> small subset of what it used to return, which is good. But I feel for
>> expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
>> still a bit heavy. We only require the oids, so the
>> PartitionedTableInfo data structure (including the pd->indexes array)
>> gets discarded.
>
> Maybe we could make the output argument optional, but I don't see much
> point in being too conservative here.  Building the indexes array does not
> cost us that much and if a not-too-distant-in-future patch could use that
> information somehow, it could do so for free.

Ok, so these changes are mostly kept keeping in mind some future
use-cases. Otherwise, I was thinking we could just keep a light-weight
function to generate the oids, and keep the current
RelationGetPartitionDispatchInfo() intact.

Anyways, some more comments :

In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an
array of pointers. Why can't it be an array of
PartitionTupleRoutingInfo structure  rather than pointer to that
structure ?

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
+ * Close all the leaf partitions and their indices.
*
Above comment needs to be shifted a bit down to the subsequent "for"
loop where it's actually applicable.

* node->mt_partition_dispatch_info[0] corresponds to the root partitioned
* table, for which we didn't create tupslot.
Above : node->mt_partition_dispatch_info[0] => node->mt_ptrinfos[0]

/** XXX- do we need a pinning mechanism for partition descriptors* so that there references can be managed
independentlyof* the parent relcache entry? Like PinPartitionDesc(partdesc)?*/
 
pd->partdesc = partdesc;

Any idea if the above can be handled ? I am not too sure.


>
> Thanks,
> Amit
>



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Thu, Aug 17, 2017 at 12:59 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Attached rest of the patches.  0001 has changes per Ashutosh's review
> comments [2]:
>
> 0001: Relieve RelationGetPartitionDispatchInfo() of doing any locking

[2] had a patch with some changes to the original patch you posted. I
didn't describe those changes in my mail, since they rearranged the
comments. Those changes are not part of this patch and you haven't
comments about those changes as well. If you have intentionally
excluded those changes, it's fine. In case, you haven't reviewed them,
please see if they are good to be incorporated.

>
> 0002: Teach expand_inherited_rtentry to use partition bound order

0004 in [1] expands a multi-level partition hierarchy into similar
inheritance hierarchy. That patch doesn't need all OIDs in one go. It
will have to handle the partition hierarchy level by level, so most of
the code added by this patch will need to be changed by that patch. Is
there a way we can somehow change this patch so that the delta in 0004
is reduced? That may need rethinking about using
RelationGetPartitionDispatchInfo().

[1] https://www.postgresql.org/message-id/CAFjFpRfkr7igCGBBWH1PQ__W-XPy1O79Y-qxCmJc6FizLqFz7Q@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFjFpRdXn7w7nVKv4qN9fa%2BtzRi_mJFNCsBWy%3Dbd0SLbYczUfA%40mail.gmail.com


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> [2] had a patch with some changes to the original patch you posted. I
> didn't describe those changes in my mail, since they rearranged the
> comments. Those changes are not part of this patch and you haven't
> comments about those changes as well. If you have intentionally
> excluded those changes, it's fine. In case, you haven't reviewed them,
> please see if they are good to be incorporated.

I took a quick look at your version but I think I like Amit's fine the
way it is, so committed that and back-patched it to v10.

I find 0002 pretty ugly as things stand.  We get a bunch of tuple maps
that we don't really need, only to turn around and free them.  We get
a bunch of tuple slots that we don't need, only to turn around and
drop them.  We don't really need the PartitionDispatch objects either,
except for the OIDs they contain.  There's a lot of extra stuff being
computed here that is really irrelevant for this purpose.  I think we
should try to clean that up somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
Hi Ashutosh,

On 2017/08/17 21:39, Ashutosh Bapat wrote:
> On Thu, Aug 17, 2017 at 12:59 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> Attached rest of the patches.  0001 has changes per Ashutosh's review
>> comments [2]:
>>
>> 0001: Relieve RelationGetPartitionDispatchInfo() of doing any locking
> 
> [2] had a patch with some changes to the original patch you posted. I
> didn't describe those changes in my mail, since they rearranged the
> comments. Those changes are not part of this patch and you haven't
> comments about those changes as well. If you have intentionally
> excluded those changes, it's fine. In case, you haven't reviewed them,
> please see if they are good to be incorporated.

Sorry, I thought the ones you mentioned in the email were the only changes
you made to the original patch.  I noted only those and included them when
editing the relevant commit in my local repository in an interactive
rebase session.  I didn't actually take your patch and try to merge it
with the commit in my local repository.  IMHO, simply commenting in the
email which parts of the patch you would like to see changed would be
helpful. Then we can discuss those changes and proceed with them (or not)
per the result of that discussion.

>> 0002: Teach expand_inherited_rtentry to use partition bound order
> 
> 0004 in [1] expands a multi-level partition hierarchy into similar
> inheritance hierarchy. That patch doesn't need all OIDs in one go. It
> will have to handle the partition hierarchy level by level, so most of
> the code added by this patch will need to be changed by that patch. Is
> there a way we can somehow change this patch so that the delta in 0004
> is reduced? That may need rethinking about using
> RelationGetPartitionDispatchInfo().

Regarding that, I have a question:

Does the multi-level partition-wise join planning logic depend on the
inheritance itself to be expanded in a multi-level aware manner.  That is,
expanding the partitioned table inheritance in multi-level aware manner in
expan_inherited_rtentry()?

Wouldn't it suffice to just have the resulting Append paths be nested per
multi-level partitioning hierarchy?  Creating such nested Append paths
doesn't necessarily require that the inheritance be expanded that way in
the first place (as I am finding out when working on another patch).

Thanks,
Amit




Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Fri, Aug 18, 2017 at 10:12 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote
>
>>> 0002: Teach expand_inherited_rtentry to use partition bound order
>>
>> 0004 in [1] expands a multi-level partition hierarchy into similar
>> inheritance hierarchy. That patch doesn't need all OIDs in one go. It
>> will have to handle the partition hierarchy level by level, so most of
>> the code added by this patch will need to be changed by that patch. Is
>> there a way we can somehow change this patch so that the delta in 0004
>> is reduced? That may need rethinking about using
>> RelationGetPartitionDispatchInfo().
>
> Regarding that, I have a question:
>
> Does the multi-level partition-wise join planning logic depend on the
> inheritance itself to be expanded in a multi-level aware manner.  That is,
> expanding the partitioned table inheritance in multi-level aware manner in
> expan_inherited_rtentry()?

Yes, it needs AppendRelInfos to retain the parent-child relationship.
Please refer [1], [2], [3] for details.

>
> Wouldn't it suffice to just have the resulting Append paths be nested per
> multi-level partitioning hierarchy?

We are joining RelOptInfos, so those need to be nested. For those to
be nested, we need AppendRelInfos to preserve parent-child
relationship. Nesting paths doesn't help. Append paths actually should
be flattened out to avoid any extra time consumed in nested Append
node.


[1] https://www.postgresql.org/message-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV%3DkB8Akkqg@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT=uQ@mail.gmail.com
[3] https://www.postgresql.org/message-id/CAFjFpRd6Kzx6Xn%3D7vdwwnh6rEw2VEgo--iPdhV%2BFb7bHfPzsbw%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 18 August 2017 at 01:24, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> [2] had a patch with some changes to the original patch you posted. I
>> didn't describe those changes in my mail, since they rearranged the
>> comments. Those changes are not part of this patch and you haven't
>> comments about those changes as well. If you have intentionally
>> excluded those changes, it's fine. In case, you haven't reviewed them,
>> please see if they are good to be incorporated.
>
> I took a quick look at your version but I think I like Amit's fine the
> way it is, so committed that and back-patched it to v10.
>
> I find 0002 pretty ugly as things stand.  We get a bunch of tuple maps
> that we don't really need, only to turn around and free them.  We get
> a bunch of tuple slots that we don't need, only to turn around and
> drop them.

I think in the final changes after applying all 3 patches, the
redundant tuple slot is no longer present. But ...
> We don't really need the PartitionDispatch objects either,
> except for the OIDs they contain.  There's a lot of extra stuff being
> computed here that is really irrelevant for this purpose.  I think we
> should try to clean that up somehow.
... I am of the same opinion. That's why - as I mentioned upthread - I
was thinking why not have a separate light-weight function to just
generate oids, and keep RelationGetPartitionDispatchInfo() intact, to
be used only for tuple routing.

But, I haven't yet checked Ashuthosh's requirements, which suggest
that it does not help to even get the oid list.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Fri, Aug 18, 2017 at 10:32 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> I think in the final changes after applying all 3 patches, the
> redundant tuple slot is no longer present. But ...
>> We don't really need the PartitionDispatch objects either,
>> except for the OIDs they contain.  There's a lot of extra stuff being
>> computed here that is really irrelevant for this purpose.  I think we
>> should try to clean that up somehow.
> ... I am of the same opinion. That's why - as I mentioned upthread - I
> was thinking why not have a separate light-weight function to just
> generate oids, and keep RelationGetPartitionDispatchInfo() intact, to
> be used only for tuple routing.
>
> But, I haven't yet checked Ashuthosh's requirements, which suggest
> that it does not help to even get the oid list.
>

0004 patch in partition-wise join patchset has code to expand
partition hierarchy. That patch is expanding inheritance hierarchy in
depth first manner. Robert commented that instead of depth first
manner, it will be better if we expand it in partitioned tables first
manner. With the latest changes in your patch-set I don't see the
reason for expanding in partitioned tables first order. Can you please
elaborate if we still need to expand in partitioned table first
manner? May be we should just address the expansion issue in 0004
instead of dividing it in two patches.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/18 4:54, Robert Haas wrote:
> On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> [2] had a patch with some changes to the original patch you posted. I
>> didn't describe those changes in my mail, since they rearranged the
>> comments. Those changes are not part of this patch and you haven't
>> comments about those changes as well. If you have intentionally
>> excluded those changes, it's fine. In case, you haven't reviewed them,
>> please see if they are good to be incorporated.
> 
> I took a quick look at your version but I think I like Amit's fine the
> way it is, so committed that and back-patched it to v10.

Thanks for committing.

> I find 0002 pretty ugly as things stand.  We get a bunch of tuple maps
> that we don't really need, only to turn around and free them.  We get
> a bunch of tuple slots that we don't need, only to turn around and
> drop them.  We don't really need the PartitionDispatch objects either,
> except for the OIDs they contain.  There's a lot of extra stuff being
> computed here that is really irrelevant for this purpose.  I think we
> should try to clean that up somehow.

One way to do that might be to reverse the order of the remaining patches
and put the patch to refactor RelationGetPartitionDispatchInfo() first.
With that refactoring, PartitionDispatch itself has become much simpler in
that it does not contain a relcache reference to be closed eventually by
the caller, the tuple map, and the tuple table slot.  Since those things
are required for tuple-routing, the refactoring makes
ExecSetupPartitionTupleRouting itself create them from the (minimal)
information returned by RelationGetPartitionDispatchInfo and ultimately
destroy when done using it.  I kept the indexes field in
PartitionDispatchData though, because it's essentially free to create
while we are walking the partition tree in
RelationGetPartitionDispatchInfo() and it seems undesirable to make the
caller compute that information (indexes) by traversing the partition tree
all over again, if it doesn't otherwise have to.  I am still considering
some counter-arguments raised by Amit Khandekar about this last assertion.

Thoughts?

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 18 August 2017 at 10:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/08/18 4:54, Robert Haas wrote:
>> On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> [2] had a patch with some changes to the original patch you posted. I
>>> didn't describe those changes in my mail, since they rearranged the
>>> comments. Those changes are not part of this patch and you haven't
>>> comments about those changes as well. If you have intentionally
>>> excluded those changes, it's fine. In case, you haven't reviewed them,
>>> please see if they are good to be incorporated.
>>
>> I took a quick look at your version but I think I like Amit's fine the
>> way it is, so committed that and back-patched it to v10.
>
> Thanks for committing.
>
>> I find 0002 pretty ugly as things stand.  We get a bunch of tuple maps
>> that we don't really need, only to turn around and free them.  We get
>> a bunch of tuple slots that we don't need, only to turn around and
>> drop them.  We don't really need the PartitionDispatch objects either,
>> except for the OIDs they contain.  There's a lot of extra stuff being
>> computed here that is really irrelevant for this purpose.  I think we
>> should try to clean that up somehow.
>
> One way to do that might be to reverse the order of the remaining patches
> and put the patch to refactor RelationGetPartitionDispatchInfo() first.
> With that refactoring, PartitionDispatch itself has become much simpler in
> that it does not contain a relcache reference to be closed eventually by
> the caller, the tuple map, and the tuple table slot.  Since those things
> are required for tuple-routing, the refactoring makes
> ExecSetupPartitionTupleRouting itself create them from the (minimal)
> information returned by RelationGetPartitionDispatchInfo and ultimately
> destroy when done using it.  I kept the indexes field in
> PartitionDispatchData though, because it's essentially free to create
> while we are walking the partition tree in
> RelationGetPartitionDispatchInfo() and it seems undesirable to make the
> caller compute that information (indexes) by traversing the partition tree
> all over again, if it doesn't otherwise have to.  I am still considering
> some counter-arguments raised by Amit Khandekar about this last assertion.
>
> Thoughts?

One another approach (that I have used in update-partition-key patch)
is to *not* generate the oids beforehand, and instead, call a
partition_walker_next() function to traverse through the tree. Each
next() function would return a ChildInfo that includes child oid,
parent oid, etc. All users of this would guarantee a fixed order of
oids. In the update-partition-key patch, I am opening and closing each
of the children, which of course, we need to avoid.




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Fri, Aug 18, 2017 at 1:17 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> 0004 patch in partition-wise join patchset has code to expand
> partition hierarchy. That patch is expanding inheritance hierarchy in
> depth first manner. Robert commented that instead of depth first
> manner, it will be better if we expand it in partitioned tables first
> manner. With the latest changes in your patch-set I don't see the
> reason for expanding in partitioned tables first order. Can you please
> elaborate if we still need to expand in partitioned table first
> manner? May be we should just address the expansion issue in 0004
> instead of dividing it in two patches.

Let me see if I can clarify.  I think there are three requirements here:

A. Amit wants to be able to prune leaf partitions before opening and
locking those relations, so that pruning can be done earlier and,
therefore, more cheaply.

B. Partition-wise join wants to expand the inheritance hierarchy a
level at a time instead of all at once, ending up with rte->inh = true
entries for intermediate partitioned tables.

C. Partition-wise join (and lots of other things; see numerous
mentions of EIBO in
http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html)
want to expand in bound order.

Obviously, bound-order and partitioned-tables-first are incompatible
orderings, but there's no actual conflict because the first one has to
do with the order of *expansion* and the second one has to do with the
order of *locking*.  So in the end game I think
expand_inherited_rtentry looks approximately like this:

1. Calling find_all_inheritors with a new only-lock-the-partitions
flag.  This should result in locking all partitioned tables in the
inheritance hierarchy in breadth-first, low-OID-first order.  (When
the only-lock-the-partitions isn't specified, all partitioned tables
should still be locked before any unpartitioned tables, so that the
locking order in that case is consistent with what we do here.)

2. Iterate over the partitioned tables identified in step 1 in the
order in which they were returned.  For each one:
- Decide which children can be pruned.
- Lock the unpruned, non-partitioned children in low-OID-first order.

3. Make another pass over the inheritance hierarchy, starting at the
root.  Traverse the whole hierarchy in breadth-first in *bound* order.
Add RTEs and AppendRelInfos as we go -- these will have rte->inh =
true for partitioned tables and rte->inh = false for leaf partitions.

Whether we should try to go straight to the end state here or do this
via a series of incremental changes, I'm not entirely sure right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Sat, Aug 19, 2017 at 1:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 18, 2017 at 1:17 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> 0004 patch in partition-wise join patchset has code to expand
>> partition hierarchy. That patch is expanding inheritance hierarchy in
>> depth first manner. Robert commented that instead of depth first
>> manner, it will be better if we expand it in partitioned tables first
>> manner. With the latest changes in your patch-set I don't see the
>> reason for expanding in partitioned tables first order. Can you please
>> elaborate if we still need to expand in partitioned table first
>> manner? May be we should just address the expansion issue in 0004
>> instead of dividing it in two patches.
>
> Let me see if I can clarify.  I think there are three requirements here:
>
> A. Amit wants to be able to prune leaf partitions before opening and
> locking those relations, so that pruning can be done earlier and,
> therefore, more cheaply.

We could actually prune partitioned tables thus pruning whole
partitioned tree. Do we want to then lock those partitioned tables but
not the leaves in that tree?

If there's already some discussion answering this question, please
point me to the same. Sorry for not paying attention to it.

>
> B. Partition-wise join wants to expand the inheritance hierarchy a
> level at a time instead of all at once, ending up with rte->inh = true
> entries for intermediate partitioned tables.

And create AppendRelInfos which pair children with their partitioned
parent rather than the root.

>
> C. Partition-wise join (and lots of other things; see numerous
> mentions of EIBO in
> http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html)
> want to expand in bound order.
>
> Obviously, bound-order and partitioned-tables-first are incompatible
> orderings, but there's no actual conflict because the first one has to
> do with the order of *expansion* and the second one has to do with the
> order of *locking*.

right. Thanks for making it clear.

> So in the end game I think
> expand_inherited_rtentry looks approximately like this:
>
> 1. Calling find_all_inheritors with a new only-lock-the-partitions
> flag.  This should result in locking all partitioned tables in the
> inheritance hierarchy in breadth-first, low-OID-first order.  (When
> the only-lock-the-partitions isn't specified, all partitioned tables
> should still be locked before any unpartitioned tables, so that the
> locking order in that case is consistent with what we do here.)
>

I am confused. When "only-lock-the-partitions" is true, do we expect
intermediate partitioned tables to be locked? Why then "only" in the
flag?

> 2. Iterate over the partitioned tables identified in step 1 in the
> order in which they were returned.  For each one:
> - Decide which children can be pruned.
> - Lock the unpruned, non-partitioned children in low-OID-first order.
>
> 3. Make another pass over the inheritance hierarchy, starting at the
> root.  Traverse the whole hierarchy in breadth-first in *bound* order.
> Add RTEs and AppendRelInfos as we go -- these will have rte->inh =
> true for partitioned tables and rte->inh = false for leaf partitions.

These two seem to be based on the assumption that we have to lock all
the partitioned tables even if they can be pruned.

For regular inheritance there is only a single parent, so traversing
the list returned by find_all_inheritors suffices. For partitioned
hierarchy, we need to know the parent of every child, which is not
part of the find_all_inheritors() output list. Even if it returns only
the partitioned children, they themselves may have a parent different
from the root partition. So, we have to discard the output of
find_all_inheritors() for partitioned hierarchy and traverse the
children as per their orders in oids array in PartitionDesc. May be
it's better to separate the guts of expand_inherited_rtentry(), which
create AppendRelInfos, RTEs and rowmarks for the children into a
separate routine. Use that routine in two different functions
expand_inherited_rtentry() and expand_partitioned_rtentry() for
regular inheritance and partitioned inheritance resp. The functions
will use two different traversal methods appropriate for traversing
the children in either case.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
Hi Amit,

On 2017/08/17 21:18, Amit Khandekar wrote:
> Anyways, some more comments :
> 
> In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an
> array of pointers. Why can't it be an array of
> PartitionTupleRoutingInfo structure  rather than pointer to that
> structure ?

AFAIK, assigning pointers is less expensive than assigning struct and we
end up doing a lot of assigning of the members of that array to a local
variable in get_partition_for_tuple(), for example.  Perhaps, we could
avoid those assignments and implement it the way you suggest.

> diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
> + * Close all the leaf partitions and their indices.
> *
> Above comment needs to be shifted a bit down to the subsequent "for"
> loop where it's actually applicable.

That's right, done.

> * node->mt_partition_dispatch_info[0] corresponds to the root partitioned
> * table, for which we didn't create tupslot.
> Above : node->mt_partition_dispatch_info[0] => node->mt_ptrinfos[0]

Oops, fixed.

> /*
>  * XXX- do we need a pinning mechanism for partition descriptors
>  * so that there references can be managed independently of
>  * the parent relcache entry? Like PinPartitionDesc(partdesc)?
>  */
> pd->partdesc = partdesc;
> 
> Any idea if the above can be handled ? I am not too sure.

A similar mechanism exists for TupleDesc ref-counting (see the usage of
PinTupleDesc and ReleaseTupleDesc across the backend code.)  I too am
currently unsure if such an elaborate mechanism is actually *necessary*
for rd_partdesc.


Attached updated patches.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/21 13:11, Ashutosh Bapat wrote:
> On Sat, Aug 19, 2017 at 1:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Aug 18, 2017 at 1:17 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> 0004 patch in partition-wise join patchset has code to expand
>>> partition hierarchy. That patch is expanding inheritance hierarchy in
>>> depth first manner. Robert commented that instead of depth first
>>> manner, it will be better if we expand it in partitioned tables first
>>> manner. With the latest changes in your patch-set I don't see the
>>> reason for expanding in partitioned tables first order. Can you please
>>> elaborate if we still need to expand in partitioned table first
>>> manner? May be we should just address the expansion issue in 0004
>>> instead of dividing it in two patches.
>>
>> Let me see if I can clarify.  I think there are three requirements here:
>>
>> A. Amit wants to be able to prune leaf partitions before opening and
>> locking those relations, so that pruning can be done earlier and,
>> therefore, more cheaply.
> 
> We could actually prune partitioned tables thus pruning whole
> partitioned tree. Do we want to then lock those partitioned tables but
> not the leaves in that tree?

I think it would be nice if we keep the current approach of expanding the
whole partition tree in expand_inherited_rtentry(), at least to know how
many more entries a given partitioned table will add to the query's range
table.  It would be nice, because that way, we don't have to worry *right
away* about modifying the planner to cope with some new behavior whereby
range table entries will get added at some later point.

Then, as you might already know, if we want to use the partition bound
order when expanding the whole partition tree, we will depend on the
relcache entries of the partitioned tables in that tree, which will
require us to take locks on them.

It does sound odd that we may end up locking a child *partitioned* table
that is potentially prune-able, but maybe there is some way to relinquish
that lock once we find out that it is pruned after all.

>> B. Partition-wise join wants to expand the inheritance hierarchy a
>> level at a time instead of all at once, ending up with rte->inh = true
>> entries for intermediate partitioned tables.
> 
> And create AppendRelInfos which pair children with their partitioned
> parent rather than the root.

There should be *some* way to preserve the parent-child RT index mapping
and to preserve the multi-level hierarchy, a way that doesn't map all the
child tables in a partition tree to the root table's RT index.

AppendRelInfo is one way of doing that mapping currently, but if we
continue to treat it as the only way (for the purpose of mapping), we will
be stuck with the way they are created and manipulated.  Especially, if we
are going to always depend on the fact that root->append_rel_list contains
all the required AppendRelInfos, then we will always have to fully expand
the inheritance in expand_inherited_rtentry() (by fully I mean, locking
and opening all the child tables, instead of just the partitioned tables).In a world where we don't want to open the
partitionchild tables in
 
expand_inherited_rtentry(), we cannot build the corresponding
AppendRelInfos there.  Note that this is not about completely dispelling
AppendRelInfos-for-partition-child-tables, but about doing without them
being present in root->append_rel_list.  We would still need them to be
able to use adjust_appendrel_attrs(), etc., but we can create them at a
different time and store them in a place that's not root->append_rel_list;For example, inside the RelOptInfo of the
childtable.  Or perhaps, we
 
can still add them to root->append_rel_list, but will need to be careful
about the places that depend on the timing of AppendRelInfos being present
there.

>> So in the end game I think
>> expand_inherited_rtentry looks approximately like this:
>>
>> 1. Calling find_all_inheritors with a new only-lock-the-partitions
>> flag.  This should result in locking all partitioned tables in the
>> inheritance hierarchy in breadth-first, low-OID-first order.  (When
>> the only-lock-the-partitions isn't specified, all partitioned tables
>> should still be locked before any unpartitioned tables, so that the
>> locking order in that case is consistent with what we do here.)
> 
> I am confused. When "only-lock-the-partitions" is true, do we expect
> intermediate partitioned tables to be locked? Why then "only" in the
> flag?

I guess Robert meant to say lock-only-"partitioned"-tables?

>> 2. Iterate over the partitioned tables identified in step 1 in the
>> order in which they were returned.  For each one:
>> - Decide which children can be pruned.
>> - Lock the unpruned, non-partitioned children in low-OID-first order.
>>
>> 3. Make another pass over the inheritance hierarchy, starting at the
>> root.  Traverse the whole hierarchy in breadth-first in *bound* order.
>> Add RTEs and AppendRelInfos as we go -- these will have rte->inh =
>> true for partitioned tables and rte->inh = false for leaf partitions.
> 
> These two seem to be based on the assumption that we have to lock all
> the partitioned tables even if they can be pruned.
> 
> For regular inheritance there is only a single parent, so traversing
> the list returned by find_all_inheritors suffices. For partitioned
> hierarchy, we need to know the parent of every child, which is not
> part of the find_all_inheritors() output list. Even if it returns only
> the partitioned children, they themselves may have a parent different
> from the root partition. So, we have to discard the output of
> find_all_inheritors() for partitioned hierarchy and traverse the
> children as per their orders in oids array in PartitionDesc. May be
> it's better to separate the guts of expand_inherited_rtentry(), which
> create AppendRelInfos, RTEs and rowmarks for the children into a
> separate routine. Use that routine in two different functions
> expand_inherited_rtentry() and expand_partitioned_rtentry() for
> regular inheritance and partitioned inheritance resp. The functions
> will use two different traversal methods appropriate for traversing
> the children in either case.

I just posted a patch [1] that implements something like this, but
implementation details might seem different.  It doesn't however implement
a solution to the problem that you pose that partitioned child tables that
are prune-able are locked.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/098b9c71-1915-1a2a-8d52-1a7a50ce79e8%40lab.ntt.co.jp




Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Mon, Aug 21, 2017 at 2:10 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> [ new patches ]

I am failing to understand the point of separating PartitionDispatch
into PartitionDispatch and PartitionTableInfo.  That seems like an
unnecessary multiplication of entities, as does the introduction of
PartitionKeyInfo.  I also think that replacing reldesc with reloid is
not really an improvement; any places that gets the relid has to go
open the relation to get the reldesc, whereas without that it has a
direct pointer to the information it needs.

I suggest that this patch just focus on removing the following things
from PartitionDispatchData: keystate, tupslot, tupmap.  Those things
are clearly executor-specific stuff that makes sense to move to a
different structure, what you're calling PartitionTupleRoutingInfo
(not sure that's the best name).  The other stuff all seems fine.
You're going to have to open the relation anyway, so keeping the
reldesc around seems like an optimization, if anything.  The
PartitionKey and PartitionDesc pointers may not really be needed --
they're just pointers into reldesc -- but they're trivial to compute,
so it doesn't hurt anything to have them either as a
micro-optimization for performance or even just for readability.

That just leaves indexes.  In a world where keystate, tupslot, and
tupmap are removed from the PartitionDispatchData, you must need
indexes or there would be no point in constructing a
PartitionDispatchData object in the first place; any application that
needs neither indexes nor the executor-specific stuff could just use
the Relation directly.

Regarding your XXX comments, note that if you've got a lock on a
relation, the pointers to the PartitionKey and PartitionDesc are
stable.  The PartitionKey can't change once it's established, and the
PartitionDesc can't change while we've got a lock on the relation
unless we change it ourselves (and any places that do should have
CheckTableNotInUse checks).  The keep_partkey and keep_partdesc
handling in relcache.c exists exactly so that we can guarantee that
the pointer won't go stale under us.  Now, if we *don't* have a lock
on the relation, then those pointers can easily be invalidated -- so
you can't hang onto a PartitionDispatch for longer than you hang onto
the lock on the Relation.  But that shouldn't be a problem.  I think
you only need to hang onto PartitionDispatch pointers for the lifetime
of a single query.  One can imagine optimizations where we try to
avoid rebuilding that for subsequent queries but I'm not sure there's
any demonstrated need for such a system at present.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/26 3:28, Robert Haas wrote:
> On Mon, Aug 21, 2017 at 2:10 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> [ new patches ]
> 
> I am failing to understand the point of separating PartitionDispatch
> into PartitionDispatch and PartitionTableInfo.  That seems like an
> unnecessary multiplication of entities, as does the introduction of
> PartitionKeyInfo.  I also think that replacing reldesc with reloid is
> not really an improvement; any places that gets the relid has to go
> open the relation to get the reldesc, whereas without that it has a
> direct pointer to the information it needs.

I am worried about the open relcache reference in PartitionDispatch when
we start using it in the planner.  Whereas there is a ExecEndModifyTable()
as a suitable place to close that reference, there doesn't seem to exist
one within the planner, but I guess we will have to figure something out.
For time being, the second patch closes the same in
expand_inherited_rtentry() right after picking up the OID using
RelationGetRelid(pd->reldesc).

> I suggest that this patch just focus on removing the following things
> from PartitionDispatchData: keystate, tupslot, tupmap.  Those things
> are clearly executor-specific stuff that makes sense to move to a
> different structure, what you're calling PartitionTupleRoutingInfo
> (not sure that's the best name).  The other stuff all seems fine.
> You're going to have to open the relation anyway, so keeping the
> reldesc around seems like an optimization, if anything.  The
> PartitionKey and PartitionDesc pointers may not really be needed --
> they're just pointers into reldesc -- but they're trivial to compute,
> so it doesn't hurt anything to have them either as a
> micro-optimization for performance or even just for readability.

OK, done this way in the attached updated patch.  Any suggestions about a
better name for what the patch calls PartitionTupleRoutingInfo?

> That just leaves indexes.  In a world where keystate, tupslot, and
> tupmap are removed from the PartitionDispatchData, you must need
> indexes or there would be no point in constructing a
> PartitionDispatchData object in the first place; any application that
> needs neither indexes nor the executor-specific stuff could just use
> the Relation directly.

Agreed.

> Regarding your XXX comments, note that if you've got a lock on a
> relation, the pointers to the PartitionKey and PartitionDesc are
> stable.  The PartitionKey can't change once it's established, and the
> PartitionDesc can't change while we've got a lock on the relation
> unless we change it ourselves (and any places that do should have
> CheckTableNotInUse checks).  The keep_partkey and keep_partdesc
> handling in relcache.c exists exactly so that we can guarantee that
> the pointer won't go stale under us.  Now, if we *don't* have a lock
> on the relation, then those pointers can easily be invalidated -- so
> you can't hang onto a PartitionDispatch for longer than you hang onto
> the lock on the Relation.  But that shouldn't be a problem.  I think
> you only need to hang onto PartitionDispatch pointers for the lifetime
> of a single query.  One can imagine optimizations where we try to
> avoid rebuilding that for subsequent queries but I'm not sure there's
> any demonstrated need for such a system at present.

Here too.

Attached are the updated patches.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Mon, Aug 28, 2017 at 6:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I am worried about the open relcache reference in PartitionDispatch when
> we start using it in the planner.  Whereas there is a ExecEndModifyTable()
> as a suitable place to close that reference, there doesn't seem to exist
> one within the planner, but I guess we will have to figure something out.

Yes, I think there's no real way to avoid having to figure that out.

> OK, done this way in the attached updated patch.  Any suggestions about a
> better name for what the patch calls PartitionTupleRoutingInfo?

I think this patch could be further simplified by continuing to use
the existing function signature for RelationGetPartitionDispatchInfo
instead of changing it to return a List * rather than an array.  I
don't see any benefit to such a change.  The current system is more
efficient.

I keep having the feeling that this is a big patch with a small patch
struggling to get out.  Is it really necessary to change
RelationGetPartitionDispatchInfo so much or could you just do a really
minimal surgery to remove the code that sets the stuff we don't need?
Like this:

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..4fabcf9f32 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1089,29 +1089,7 @@ RelationGetPartitionDispatchInfo(Relation rel,        pd[i] = (PartitionDispatch)
palloc(sizeof(PartitionDispatchData));       pd[i]->reldesc = partrel;        pd[i]->key = partkey;
 
-        pd[i]->keystate = NIL;        pd[i]->partdesc = partdesc;
-        if (parent != NULL)
-        {
-            /*
-             * For every partitioned table other than root, we must store a
-             * tuple table slot initialized with its tuple descriptor and a
-             * tuple conversion map to convert a tuple from its parent's
-             * rowtype to its own. That is to make sure that we are looking at
-             * the correct row using the correct tuple descriptor when
-             * computing its partition key for tuple routing.
-             */
-            pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
-            pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-                                                   tupdesc,
-
gettext_noop("could not convert row type"));
-        }
-        else
-        {
-            /* Not required for the root partitioned table */
-            pd[i]->tupslot = NULL;
-            pd[i]->tupmap = NULL;
-        }        pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
        /*


-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/29 4:26, Robert Haas wrote:
> I think this patch could be further simplified by continuing to use
> the existing function signature for RelationGetPartitionDispatchInfo
> instead of changing it to return a List * rather than an array.  I
> don't see any benefit to such a change.  The current system is more
> efficient.

OK, restored the array way.

> I keep having the feeling that this is a big patch with a small patch
> struggling to get out.  Is it really necessary to change
> RelationGetPartitionDispatchInfo so much or could you just do a really
> minimal surgery to remove the code that sets the stuff we don't need?
> Like this:

Sure, done in the attached updated patch.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I keep having the feeling that this is a big patch with a small patch
>> struggling to get out.  Is it really necessary to change
>> RelationGetPartitionDispatchInfo so much or could you just do a really
>> minimal surgery to remove the code that sets the stuff we don't need?
>> Like this:
>
> Sure, done in the attached updated patch.

On first glance, that looks pretty good.  I'll have a deeper look
tomorrow.  It strikes me that if PartitionTupleRoutingInfo is an
executor structure, we should probably (as a separate patch) relocate
FormPartitionKeyDatum and get_partition_for_tuple to someplace in
src/backend/executor and rename the accordingly; maybe it's time for
an execPartition.c?  catalog/partition.c is getting really bug, so
it's not a bad thing if some of that stuff gets moved elsewhere.  A
lingering question in my mind, though, is whether it's really correct
to think of PartitionTupleRoutingInfo as executor-specific.  Maybe
we're going to need that for other things too?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/30 12:03, Robert Haas wrote:
> On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I keep having the feeling that this is a big patch with a small patch
>>> struggling to get out.  Is it really necessary to change
>>> RelationGetPartitionDispatchInfo so much or could you just do a really
>>> minimal surgery to remove the code that sets the stuff we don't need?
>>> Like this:
>>
>> Sure, done in the attached updated patch.
> 
> On first glance, that looks pretty good.  I'll have a deeper look
> tomorrow.

Thanks.

> It strikes me that if PartitionTupleRoutingInfo is an
> executor structure, we should probably (as a separate patch) relocate
> FormPartitionKeyDatum and get_partition_for_tuple to someplace in
> src/backend/executor and rename the accordingly; maybe it's time for
> an execPartition.c? catalog/partition.c is getting really bug, so

I agree.

It would be a good idea to introduce an execPartition.c so that future
patches in this area (such as executor partition-pruning patch on the
horizon) have a convenient place to park their code.

Will see if I can make a patch for that.

> it's not a bad thing if some of that stuff gets moved elsewhere.  A
> lingering question in my mind, though, is whether it's really correct
> to think of PartitionTupleRoutingInfo as executor-specific.  Maybe
> we're going to need that for other things too?

Hmm.  Maybe, a subset of PartitionTupleRoutingInfo's fields are usable
outside the executor (only PartitionDispatch, which is exported by
partition.h anyway?), but not all of it.  For example, fields keystate,
tupslot seem pretty executor-specific.  In fact, they seem to be required
only for tuple routing.

One idea is to not introduce PartitionTupleRoutingInfo at all and add its
fields directly as ModifyTableState/CopyState fields.  We currently have,
for example, a mt_partition_tupconv_maps array with one slot for every
leaf partition.  Maybe, there could be following fields in
ModifyTableState (and similarly in CopyState):

int                   mt_num_parted    /* this one exists today */
struct PartitionDispatchData **mt_partition_dispatch_info; /* and this */
List                **mt_parted_keystate;
TupleConversionMap  **mt_parted_tupconv_maps;
TupleTableSlot      **mt_parted_tupslots;

Each of those arrays will have mt_num_parted slots.

Thanks,
Amit




Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 25 August 2017 at 23:58, Robert Haas <robertmhaas@gmail.com> wrote:
>
> That just leaves indexes.  In a world where keystate, tupslot, and
> tupmap are removed from the PartitionDispatchData, you must need
> indexes or there would be no point in constructing a
> PartitionDispatchData object in the first place; any application that
> needs neither indexes nor the executor-specific stuff could just use
> the Relation directly.

But there is expand_inherited_rtentry() which neither requires indexes
nor any executor stuff, but still requires to call
RelationGetPartitionDispatchInfo(), and so these indexes get built
unnecessarily.

Looking at the latest patch, I can see that those indexes can be
separately built in ExecSetupPartitionTupleRouting() where it is
required, instead of in RelationGetPartitionDispatchInfo(). In the
loop which iterates through the pd[] returned from
RelationGetPartitionDispatchInfo(), we can build them using the exact
code currently written to build them in
RelationGetPartitionDispatchInfo().

In the future, if we require such applications where indexes are also
required, we may have a separate function only to build indexes, and
then ExecSetupPartitionTupleRouting() would also call that function.

--------


On 21 August 2017 at 11:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an
>> array of pointers. Why can't it be an array of
>> PartitionTupleRoutingInfo structure  rather than pointer to that
>> structure ?
>
> AFAIK, assigning pointers is less expensive than assigning struct and we
> end up doing a lot of assigning of the members of that array to a local
> variable

I didn't get why exactly we would have to copy the structures. We
could just pass the address of ptrinfos[index], no ?

My only point for this was : we would not have to call palloc0() for
each of the element of ptrinfos. Instead, just allocate memory for all
of the elements in a single palloc0(). We anyways have to allocate
memory for *each* of the element.

> in get_partition_for_tuple(), for example.  Perhaps, we could
> avoid those assignments and implement it the way you suggest.

You mean at these 2 places in get_partition_for_tuple() , right ? :

1. /* start with the root partitioned table */
parent = ptrinfos[0];

2. else
parent = ptrinfos[-parent->pd->indexes[cur_index]];

Both of the above places, we could just use &ptrinfos[...] instead of
ptrinfos[...]. But I guess you meant something else.

------------

RelationGetPartitionDispatchInfo() opens all the partitioned tables.
But in ExecSetupPartitionTupleRouting(), it again opens all the
parents, that is all the partitioned tables, and closes them back.

Instead, would it be possible to do this : Instead of the
PartitionDispatch->parentoid field, PartitionDispatch can have
parentrel Relation field, which will point to reldesc field of one of
the pds[] elements.

------------

For me, the CopyStateData->ptrinfos and ModifyTableState.mt_ptrinfos
field names sound confusing. How about part_tuple_routing_info or just
tuple_routing_info ?



Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Wed, Aug 30, 2017 at 8:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I keep having the feeling that this is a big patch with a small patch
>>> struggling to get out.  Is it really necessary to change
>>> RelationGetPartitionDispatchInfo so much or could you just do a really
>>> minimal surgery to remove the code that sets the stuff we don't need?
>>> Like this:
>>
>> Sure, done in the attached updated patch.
>
> On first glance, that looks pretty good.  I'll have a deeper look
> tomorrow.
>

In one of your earlier mails on this thread, you had described how
expand_inherited_rtentry() would look like as

-- quote --
1. Calling find_all_inheritors with a new only-lock-the-partitions
flag.  This should result in locking all partitioned tables in the
inheritance hierarchy in breadth-first, low-OID-first order.  (When
the only-lock-the-partitions isn't specified, all partitioned tables
should still be locked before any unpartitioned tables, so that the
locking order in that case is consistent with what we do here.)

2. Iterate over the partitioned tables identified in step 1 in the
order in which they were returned.  For each one:
- Decide which children can be pruned.
- Lock the unpruned, non-partitioned children in low-OID-first order.

3. Make another pass over the inheritance hierarchy, starting at the
root.  Traverse the whole hierarchy in breadth-first in *bound* order.
Add RTEs and AppendRelInfos as we go -- these will have rte->inh =
true for partitioned tables and rte->inh = false for leaf partitions.

-- quote ends --

Amit's patches seem to be addressing the third point here. But the
expansion is not happening in breadth-first manner. We are expanding
all the partitioned partitions first and then leaf partitions. So
that's not exactly "breadth-first".

I tried to rebase first patch from partition-wise join patchset [1] on
top of these two patches. I am having hard time applying those
changes. The goal of the my patch is to expand the partitioned table
into an inheritance hierarchy which retains the partition hierarchy.
For that to happen, we need to know which partition belongs to which
partitioned table in the partition hierarchy. PartitionDispatch array
provided by RelationGetPartitionDispatchInfo() provides me the parent
OIDs of partitioned parents but it doesn't do so for the leaf
partitions. So, I changed the signature of that function to return the
list of parent OIDs of leaf partitions. But for building
AppendRelInfos, child RTEs and child row marks, I need parent's RTI,
RTE and row marks, which are not available directly. Given parent's
OID, I need to search root->parse->rtable to find its RTE, RTI and
then using RTI I can find rowmarks. But that seems to defeat the
purpose why partition-wise join needs EIBO i.e. to avoid O(n ^2) loop
in build_simple_rel(). For eliminating that loop we are introducing
another O(n^2) loop in expand_inherited_rtentry(). Even without
considering O(n^2) complexity this looks ugly.

A better way for translating partition hierarchy into inheritance
hierarchy may be to expand all partitions (partitioned or leaf) of a
given parent at a time in breadth-first manner. This allows us to
create child RTE, AppendRelInfo, rowmarks while we have corresponding
parent structures at hand, rather than searching for those. This would
still satisfy Amit Khandekar's requirement to expand leaf partitions
in the same order as their OIDs would be returned by
RelationGetPartitionDispatchInfo(). I have a feeling that, if we go
that route, we will replace almost all the changes that Amit Langote's
patches do to expand_inherited_rtentry().

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Wed, Aug 30, 2017 at 9:22 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Amit's patches seem to be addressing the third point here. But the
> expansion is not happening in breadth-first manner. We are expanding
> all the partitioned partitions first and then leaf partitions. So
> that's not exactly "breadth-first".

Correct, but I think Amit's ordering is what we actually want:
breadth-first, low-OID-first over interior partitioned tables, and
then breadth-first, low-OID-first again over leaves.  If we don't keep
partitioned partitions first, then we're going to have problems
keeping the locking order consistent when we start doing pruning
during expansion.

> A better way for translating partition hierarchy into inheritance
> hierarchy may be to expand all partitions (partitioned or leaf) of a
> given parent at a time in breadth-first manner. This allows us to
> create child RTE, AppendRelInfo, rowmarks while we have corresponding
> parent structures at hand, rather than searching for those. This would
> still satisfy Amit Khandekar's requirement to expand leaf partitions
> in the same order as their OIDs would be returned by
> RelationGetPartitionDispatchInfo(). I have a feeling that, if we go
> that route, we will replace almost all the changes that Amit Langote's
> patches do to expand_inherited_rtentry().

I think we will, too, but I think that's basically the problem of the
partition-wise join patch.  Either find_all_inheritors is going to
have to return enough additional information to let
expand_inherited_rtentry work efficiently, or else
expand_inherited_rtentry is going to have to duplicate some of the
logic from find_all_inheritors.  But that doesn't make what Amit is
doing here a bad idea -- getting stuff that shouldn't be part of
PartitionDispatch removed and getting the expansion order in
expand_inherited_rtentry() changed seem to be the right things to do
even if the way it's implemented has to be revised to meet other
goals.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Wed, Aug 30, 2017 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 9:22 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Amit's patches seem to be addressing the third point here. But the
>> expansion is not happening in breadth-first manner. We are expanding
>> all the partitioned partitions first and then leaf partitions. So
>> that's not exactly "breadth-first".
>
> Correct, but I think Amit's ordering is what we actually want:
> breadth-first, low-OID-first over interior partitioned tables, and
> then breadth-first, low-OID-first again over leaves.  If we don't keep
> partitioned partitions first, then we're going to have problems
> keeping the locking order consistent when we start doing pruning
> during expansion.

No, I'm wrong and you're correct.  We want the partitions to be locked
first, but we don't want them to be pulled to the front of the
expansion order, because then it's not in bound order anymore and any
optimization that tries to rely on that ordering will break.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Wed, Aug 30, 2017 at 6:08 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 25 August 2017 at 23:58, Robert Haas <robertmhaas@gmail.com> wrote:
>> That just leaves indexes.  In a world where keystate, tupslot, and
>> tupmap are removed from the PartitionDispatchData, you must need
>> indexes or there would be no point in constructing a
>> PartitionDispatchData object in the first place; any application that
>> needs neither indexes nor the executor-specific stuff could just use
>> the Relation directly.
>
> But there is expand_inherited_rtentry() which neither requires indexes
> nor any executor stuff, but still requires to call
> RelationGetPartitionDispatchInfo(), and so these indexes get built
> unnecessarily.

True, but the reason why expand_inherited_rtentry() needs to call
RelationGetPartitionDispatchInfo() is to get back the leaf partition
OIDs in bound order.  If we're using
RelationGetPartitionDispatchInfo() to get the leaf partition OIDs into
bound order, we've got to run the loop that builds leaf_part_oids, and
the same loop constructs indexes.  So I don't think we're doing much
redundant work there.

Now, if we made it the job of expand_inherited_rtentry() to loop over
the PartitionDesc, then it really wouldn't need to call
RelationGetPartitionDispatchInfo at all.  Maybe that's actually a
better plan anyway, because as Ashutosh points out, we don't want the
partitioned children to show up before the unpartitioned children in
the resulting ordering.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Wed, Aug 30, 2017 at 9:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 6:08 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 25 August 2017 at 23:58, Robert Haas <robertmhaas@gmail.com> wrote:
>>> That just leaves indexes.  In a world where keystate, tupslot, and
>>> tupmap are removed from the PartitionDispatchData, you must need
>>> indexes or there would be no point in constructing a
>>> PartitionDispatchData object in the first place; any application that
>>> needs neither indexes nor the executor-specific stuff could just use
>>> the Relation directly.
>>
>> But there is expand_inherited_rtentry() which neither requires indexes
>> nor any executor stuff, but still requires to call
>> RelationGetPartitionDispatchInfo(), and so these indexes get built
>> unnecessarily.
>
> True, but the reason why expand_inherited_rtentry() needs to call
> RelationGetPartitionDispatchInfo() is to get back the leaf partition
> OIDs in bound order.  If we're using
> RelationGetPartitionDispatchInfo() to get the leaf partition OIDs into
> bound order, we've got to run the loop that builds leaf_part_oids, and
> the same loop constructs indexes.  So I don't think we're doing much
> redundant work there.
>
> Now, if we made it the job of expand_inherited_rtentry() to loop over
> the PartitionDesc, then it really wouldn't need to call
> RelationGetPartitionDispatchInfo at all.  Maybe that's actually a
> better plan anyway, because as Ashutosh points out, we don't want the
> partitioned children to show up before the unpartitioned children in
> the resulting ordering.
>

+1. I think we should just pull out the OIDs from partition descriptor.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Wed, Aug 30, 2017 at 12:47 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> +1. I think we should just pull out the OIDs from partition descriptor.

Like this?  The first patch refactors the expansion of a single child
out into a separate function, and the second patch implements EIBO on
top of it.

I realized while doing this that we really want to expand the
partitioning hierarchy depth-first, not breadth-first.  For some
things, like partition-wise join in the case where all bounds match
exactly, we really only need a *predictable* ordering that will be the
same for two equi-partitioned tables.  A breadth-first expansion will
give us that.  But it's not actually in bound order.  For example:

create table foo (a int, b text) partition by list (a);
create table foo1 partition of foo for values in (2);
create table foo2 partition of foo for values in (1) partition by range (b);
create table foo2a partition of foo2 for values from ('b') to ('c');
create table foo2b partition of foo2 for values from ('a') to ('b');
create table foo3 partition of foo for values in (3);

The correct bound-order expansion of this is foo2b - foo2a - foo1 -
foo3, which is indeed what you get with the attached patch.  But if we
did the expansion in breadth-first fashion, we'd get foo1 - foo3 -
foo2a, foo2b, which is, well, not in bound order.  If the idea is that
you see a > 2 and rule out all partitions that appear before the first
one with an a-value >= 2, it's not going to work.

Mind you, that idea has some problems anyway in the face of default
partitions, null partitions, and list partitions which accept
non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
4, 6).  We might need to mark the PartitionDesc to indicate whether
PartitionDesc-order is in fact bound-order in a particular instance,
or something like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Ashutosh Bapat
Date:
On Thu, Aug 31, 2017 at 1:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 12:47 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> +1. I think we should just pull out the OIDs from partition descriptor.
>
> Like this?  The first patch refactors the expansion of a single child
> out into a separate function, and the second patch implements EIBO on
> top of it.
>
> I realized while doing this that we really want to expand the
> partitioning hierarchy depth-first, not breadth-first.  For some
> things, like partition-wise join in the case where all bounds match
> exactly, we really only need a *predictable* ordering that will be the
> same for two equi-partitioned table.

+1. Spotted right!

> A breadth-first expansion will
> give us that.  But it's not actually in bound order.  For example:
>
> create table foo (a int, b text) partition by list (a);
> create table foo1 partition of foo for values in (2);
> create table foo2 partition of foo for values in (1) partition by range (b);
> create table foo2a partition of foo2 for values from ('b') to ('c');
> create table foo2b partition of foo2 for values from ('a') to ('b');
> create table foo3 partition of foo for values in (3);
>
> The correct bound-order expansion of this is foo2b - foo2a - foo1 -
> foo3, which is indeed what you get with the attached patch.  But if we
> did the expansion in breadth-first fashion, we'd get foo1 - foo3 -
> foo2a, foo2b, which is, well, not in bound order.  If the idea is that
> you see a > 2 and rule out all partitions that appear before the first
> one with an a-value >= 2, it's not going to work.

Here are the patches revised a bit. I have esp changed the variable
names and arguments to reflect their true role in the functions. Also
updated prologue of expand_single_inheritance_child() to mention
"has_child". Let me know if those changes look good.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/08/31 4:45, Robert Haas wrote:
> On Wed, Aug 30, 2017 at 12:47 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> +1. I think we should just pull out the OIDs from partition descriptor.
> 
> Like this?  The first patch refactors the expansion of a single child
> out into a separate function, and the second patch implements EIBO on
> top of it.
> 
> I realized while doing this that we really want to expand the
> partitioning hierarchy depth-first, not breadth-first.  For some
> things, like partition-wise join in the case where all bounds match
> exactly, we really only need a *predictable* ordering that will be the
> same for two equi-partitioned tables.  A breadth-first expansion will
> give us that.  But it's not actually in bound order.  For example:
> 
> create table foo (a int, b text) partition by list (a);
> create table foo1 partition of foo for values in (2);
> create table foo2 partition of foo for values in (1) partition by range (b);
> create table foo2a partition of foo2 for values from ('b') to ('c');
> create table foo2b partition of foo2 for values from ('a') to ('b');
> create table foo3 partition of foo for values in (3);
> 
> The correct bound-order expansion of this is foo2b - foo2a - foo1 -
> foo3, which is indeed what you get with the attached patch.  But if we
> did the expansion in breadth-first fashion, we'd get foo1 - foo3 -
> foo2a, foo2b, which is, well, not in bound order.  If the idea is that
> you see a > 2 and rule out all partitions that appear before the first
> one with an a-value >= 2, it's not going to work.

I think, overall, this might be a good idea.  Thanks for working on it.

The patches I posted in the "path toward faster partition pruning" achieve
the same end result as your patch that the leaf partitions appear in the
partition bound order in the Append path for a partitioned table.  It
achieves that result in a somewhat different way, but let's forget about
that for a moment.  One thing the patch on that thread didn't achieve
though is getting the leaf partitions in the same (partition bound) order
in the ModifyTable path for UPDATE/DELETE, because inheritance_planner()
path is not modified in a suitable way (in fact, I'm afraid that there
might be a deadlock bug lurking there, which I must address).

Your patch, OTOH, achieves the same order in both cases, which seems
desirable.

> Mind you, that idea has some problems anyway in the face of default
> partitions, null partitions, and list partitions which accept
> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
> 4, 6).  We might need to mark the PartitionDesc to indicate whether
> PartitionDesc-order is in fact bound-order in a particular instance,
> or something like that.

ISTM, the primary motivation for the EIBO patch at this point is to get
the partitions ordered in a predictable manner so that the partition-wise
join patch and update partition key patches could implement certain logic
using O (n) algorithm rather than an O (n^2) one.  Neither of them depend
on the actual order in the sense of, say, sticking a PathKey to the
resulting Append.  Perhaps, the patch to"Make the optimiser aware of
partitions ordering" [1] will have to consider this somehow; maybe by
limiting its scope to only the cases where the root partitioned table is
range partitioned.

Thanks,
Amit

[1] https://commitfest.postgresql.org/14/1093/




Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Thu, Aug 31, 2017 at 3:36 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> ISTM, the primary motivation for the EIBO patch at this point is to get
> the partitions ordered in a predictable manner so that the partition-wise
> join patch and update partition key patches could implement certain logic
> using O (n) algorithm rather than an O (n^2) one.

That's part of it, but not the whole thing.  For example, BASIC
partition-wise join only needs a predictable order, not a
bound-ordered one.  But the next step is to be able to match up uneven
bounds - e.g. given [1000, 2000), [3000, 4000), [5000, 6000) on one
side and [1100, 2100), [2900,3900), and [5500,5600) on the other side,
we can still make it work.  That greatly benefits from being able to
iterate through all the bounds in order.

> Neither of them depend
> on the actual order in the sense of, say, sticking a PathKey to the
> resulting Append.  Perhaps, the patch to"Make the optimiser aware of
> partitions ordering" [1] will have to consider this somehow; maybe by
> limiting its scope to only the cases where the root partitioned table is
> range partitioned.

I think that doing a depth-first traversal as I've done here avoids
the need to limit it to that case.  If we did a breadth-first
traversal anything that was subpartitioned would end up having the
subpartitions at the end instead of in the sequence, but the
depth-first traversal avoids that issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Thu, Aug 31, 2017 at 2:56 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Here are the patches revised a bit. I have esp changed the variable
> names and arguments to reflect their true role in the functions. Also
> updated prologue of expand_single_inheritance_child() to mention
> "has_child". Let me know if those changes look good.

Sure.  Committed as you have it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 31 August 2017 at 13:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Mind you, that idea has some problems anyway in the face of default
>> partitions, null partitions, and list partitions which accept
>> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
>> 4, 6).  We might need to mark the PartitionDesc to indicate whether
>> PartitionDesc-order is in fact bound-order in a particular instance,
>> or something like that.
>
> ISTM, the primary motivation for the EIBO patch at this point is to get
> the partitions ordered in a predictable manner so that the partition-wise
> join patch and update partition key patches could implement certain logic
> using O (n) algorithm rather than an O (n^2) one.  Neither of them depend
> on the actual order in the sense of, say, sticking a PathKey to the
> resulting Append.

Now that the inheritance hierarchy is expanded in depth-first order,
RelationGetPartitionDispatchInfo() needs to be changed to arrange the
PartitionDispatch array and the leaf partitions in depth-first order
(as we know this is a requirement for the update-partition-key patch
for efficiently determining which of the leaf partitions are already
present in the update result rels). Amit, I am not sure if you are
already doing this as part of the patches in this mail thread. Please
let me know. Also, let me know if you think there will be any loss of
efficiency in tuple routing code if we arrange the Partition Dispatch
indexes in depth-first order.



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
Hi Amit,

On 2017/09/03 16:07, Amit Khandekar wrote:
> On 31 August 2017 at 13:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Mind you, that idea has some problems anyway in the face of default
>>> partitions, null partitions, and list partitions which accept
>>> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
>>> 4, 6).  We might need to mark the PartitionDesc to indicate whether
>>> PartitionDesc-order is in fact bound-order in a particular instance,
>>> or something like that.
>>
>> ISTM, the primary motivation for the EIBO patch at this point is to get
>> the partitions ordered in a predictable manner so that the partition-wise
>> join patch and update partition key patches could implement certain logic
>> using O (n) algorithm rather than an O (n^2) one.  Neither of them depend
>> on the actual order in the sense of, say, sticking a PathKey to the
>> resulting Append.
> 
> Now that the inheritance hierarchy is expanded in depth-first order,
> RelationGetPartitionDispatchInfo() needs to be changed to arrange the
> PartitionDispatch array and the leaf partitions in depth-first order
> (as we know this is a requirement for the update-partition-key patch
> for efficiently determining which of the leaf partitions are already
> present in the update result rels).

I was thinking the same.

> Amit, I am not sure if you are
> already doing this as part of the patches in this mail thread. Please
> let me know.

Actually, I had thought of changing the expansion order in
RelationGetPartitionDispatchInfo to depth-first after Robert committed his
patch the other day, but haven't got around to doing that yet.  Will do
that in the updated patch (the refactoring patch) I will post sometime
later today or tomorrow on a differently titled thread, because the EIBO
work seems to be done.

> Also, let me know if you think there will be any loss of
> efficiency in tuple routing code if we arrange the Partition Dispatch
> indexes in depth-first order.

I don't think there will be any loss in the efficiency of the tuple
routing code itself.  It's just that the position of the ResultRelInfos
(of leaf partitions) and PartitionDispatch objects (of partitioned tables)
will be different in their respective arrays, that is, pd->indexes will
now have different values than formerly.

And now because the planner will put leaf partitions subplans / WCOs /
RETURNING projections in that order in the ModifyTable node, we must make
sure that we adapt the same order in the executor, as you already noted.

Thanks,
Amit




Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 4 September 2017 at 06:34, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> On 2017/09/03 16:07, Amit Khandekar wrote:
>> On 31 August 2017 at 13:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> Mind you, that idea has some problems anyway in the face of default
>>>> partitions, null partitions, and list partitions which accept
>>>> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
>>>> 4, 6).  We might need to mark the PartitionDesc to indicate whether
>>>> PartitionDesc-order is in fact bound-order in a particular instance,
>>>> or something like that.
>>>
>>> ISTM, the primary motivation for the EIBO patch at this point is to get
>>> the partitions ordered in a predictable manner so that the partition-wise
>>> join patch and update partition key patches could implement certain logic
>>> using O (n) algorithm rather than an O (n^2) one.  Neither of them depend
>>> on the actual order in the sense of, say, sticking a PathKey to the
>>> resulting Append.
>>
>> Now that the inheritance hierarchy is expanded in depth-first order,
>> RelationGetPartitionDispatchInfo() needs to be changed to arrange the
>> PartitionDispatch array and the leaf partitions in depth-first order
>> (as we know this is a requirement for the update-partition-key patch
>> for efficiently determining which of the leaf partitions are already
>> present in the update result rels).
>
> I was thinking the same.
>
>> Amit, I am not sure if you are
>> already doing this as part of the patches in this mail thread. Please
>> let me know.
>
> Actually, I had thought of changing the expansion order in
> RelationGetPartitionDispatchInfo to depth-first after Robert committed his
> patch the other day, but haven't got around to doing that yet.  Will do
> that in the updated patch (the refactoring patch) I will post sometime
> later today or tomorrow on a differently titled thread, because the EIBO
> work seems to be done.

Great, thanks. Just wanted to make sure someone is working on that,
because, as you said, it is no longer an EIBO patch. Since you are
doing that, I won't work on that.

>
>> Also, let me know if you think there will be any loss of
>> efficiency in tuple routing code if we arrange the Partition Dispatch
>> indexes in depth-first order.
>
> I don't think there will be any loss in the efficiency of the tuple
> routing code itself.  It's just that the position of the ResultRelInfos
> (of leaf partitions) and PartitionDispatch objects (of partitioned tables)
> will be different in their respective arrays, that is, pd->indexes will
> now have different values than formerly.

Ok. Good to hear that.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/09/05 14:11, Amit Khandekar wrote:
> Great, thanks. Just wanted to make sure someone is working on that,
> because, as you said, it is no longer an EIBO patch. Since you are
> doing that, I won't work on that.

Here is that patch (actually two patches).  Sorry it took me a bit.

Description:

[PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor

Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code.  In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as executor tuple table
slots, tuple-conversion maps, etc.  After this refactoring,
ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo().

[PATCH 2/2] Make RelationGetPartitionDispatch expansion order
 depth-first

This is so as it matches what the planner is doing with partitioning
inheritance expansion.  Matching with planner order helps because
it helps ease matching the executor's per-partition objects with
planner-created per-partition nodes.


Actually, I'm coming to a conclusion that we should keep any
whole-partition-tree stuff out of partition.c and its interface, as Robert
has also alluded to in an earlier message on this thread [1].  But since
that's a different topic, I'll shut up about it on this thread and start a
new thread to discuss what kind of code rearrangement is possible.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoafr%3DhUrM%3Dcbx-k%3DBDHOF2OfXaw95HQSNAK4mHBwmSjtw%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
Thanks Amit for the patch. I am still reviewing it, but meanwhile
below are a few comments so far ...

On 8 September 2017 at 15:53, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> [PATCH 2/2] Make RelationGetPartitionDispatch expansion order
>  depth-first
>
> This is so as it matches what the planner is doing with partitioning
> inheritance expansion.  Matching with planner order helps because
> it helps ease matching the executor's per-partition objects with
> planner-created per-partition nodes.
>
>

+   next_parted_idx += (list_length(*pds) - next_parted_idx - 1);

I think this can be replaced just by :
+   next_parted_idx = list_length(*pds) - 1;
Or, how about removing this variable next_parted_idx altogether ?
Instead, we can just do this :
pd->indexes[i] = -(1 + list_length(*pds));
If that is not possible, I may be missing something.

-----------

+ next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx);

Didn't understand why next_leaf_idx needs to be updated in case when
the current partrelid is partitioned. I think it should be incremented
only for leaf partitions, no ? Or for that matter, again, how about
removing the variable 'next_leaf_idx' and doing this :
*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
pd->indexes[i] = list_length(*leaf_part_oids) - 1;

-----------

* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop.  This is
* a bit tricky but works because the foreach() macro doesn't fetch the
* next list element until the bottom of the loop.

I think the above comment needs to be modified with something
explaining the relevant changed code. For e.g. there is no
parted_rels, and the "tricky" part was there earlier because of the
list being iterated and at the same time being appended.

------------

I couldn't see the existing comments like "Indexes corresponding to
the internal partitions are multiplied by" anywhere in the patch. I
think those comments are still valid, and important.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
Hi Amit,

On 2017/09/11 16:16, Amit Khandekar wrote:
> Thanks Amit for the patch. I am still reviewing it, but meanwhile
> below are a few comments so far ...

Thanks for the review.

> +   next_parted_idx += (list_length(*pds) - next_parted_idx - 1);
> 
> I think this can be replaced just by :
> +   next_parted_idx = list_length(*pds) - 1;
> Or, how about removing this variable next_parted_idx altogether ?
> Instead, we can just do this :
> pd->indexes[i] = -(1 + list_length(*pds));

That seems like the simplest possible way to do it.

> + next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx);
> 
> Didn't understand why next_leaf_idx needs to be updated in case when
> the current partrelid is partitioned. I think it should be incremented
> only for leaf partitions, no ? Or for that matter, again, how about
> removing the variable 'next_leaf_idx' and doing this :
> *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
> pd->indexes[i] = list_length(*leaf_part_oids) - 1;

Yep.

Attached updated patch does it that way for both partitioned table indexes
and leaf partition indexes.  Thanks for pointing it out.


> -----------
> 
> * For every partitioned table in the tree, starting with the root
> * partitioned table, add its relcache entry to parted_rels, while also
> * queuing its partitions (in the order in which they appear in the
> * partition descriptor) to be looked at later in the same loop.  This is
> * a bit tricky but works because the foreach() macro doesn't fetch the
> * next list element until the bottom of the loop.
> 
> I think the above comment needs to be modified with something
> explaining the relevant changed code. For e.g. there is no
> parted_rels, and the "tricky" part was there earlier because of the
> list being iterated and at the same time being appended.
> 
> ------------

I think I forgot to update this comment.

> I couldn't see the existing comments like "Indexes corresponding to
> the internal partitions are multiplied by" anywhere in the patch. I
> think those comments are still valid, and important.

Again, I failed to keep this comment.  Anyway, I reworded the comments a
bit to describe what the code is doing more clearly.  Hope you find it so too.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/09/11 18:56, Amit Langote wrote:
> Attached updated patch does it that way for both partitioned table indexes
> and leaf partition indexes.  Thanks for pointing it out.

It seems to me we don't really need the first patch all that much.  That
is, let's keep PartitionDispatchData the way it is for now, since we don't
really have any need for it beside tuple-routing (EIBO as committed didn't
need it for one).  So, let's forget about "decoupling
RelationGetPartitionDispatchInfo() from the executor" thing for now and
move on.

So, attached is just the patch to make RelationGetPartitionDispatchInfo()
traverse the partition tree in depth-first manner to be applied on HEAD.

Thoughts?

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 13 September 2017 at 15:32, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/11 18:56, Amit Langote wrote:
>> Attached updated patch does it that way for both partitioned table indexes
>> and leaf partition indexes.  Thanks for pointing it out.
>
> It seems to me we don't really need the first patch all that much.  That
> is, let's keep PartitionDispatchData the way it is for now, since we don't
> really have any need for it beside tuple-routing (EIBO as committed didn't
> need it for one).  So, let's forget about "decoupling
> RelationGetPartitionDispatchInfo() from the executor" thing for now and
> move on.
>
> So, attached is just the patch to make RelationGetPartitionDispatchInfo()
> traverse the partition tree in depth-first manner to be applied on HEAD.
>
> Thoughts?

+1. If at all we need the decoupling later for some reason, we can do
that incrementally.

Will review your latest patch by tomorrow.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Wed, Sep 13, 2017 at 6:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It seems to me we don't really need the first patch all that much.  That
> is, let's keep PartitionDispatchData the way it is for now, since we don't
> really have any need for it beside tuple-routing (EIBO as committed didn't
> need it for one).  So, let's forget about "decoupling
> RelationGetPartitionDispatchInfo() from the executor" thing for now and
> move on.
>
> So, attached is just the patch to make RelationGetPartitionDispatchInfo()
> traverse the partition tree in depth-first manner to be applied on HEAD.

I like this patch.  Not only does it improve the behavior, but it's
actually less code than we have now, and in my opinion, the new code
is easier to understand, too.

A few suggestions:

- I think get_partition_dispatch_recurse() get a check_stack_depth()
call just like expand_partitioned_rtentry() did, and for the same
reasons: if the catalog contents are corrupted so that we have an
infinite loop in the partitioning hierarchy, we want to error out, not
crash.

- I think we should add a comment explaining that we're careful to do
this in the same order as expand_partitioned_rtentry() so that callers
can assume that the N'th entry in the leaf_part_oids array will also
be the N'th child of an Append node.

+         * For every partitioned table other than root, we must store a

other than the root

+     * partitioned table.  The value multiplied back by -1 is returned as the

multiplied by -1, not multiplied back by -1

+     * tables in the tree, using which, search is continued further down the
+     * partition tree.

Period after "in the tree".  Then continue: "This value is used to
continue the search in the next level of the partition tree."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/09/14 1:42, Robert Haas wrote:
> On Wed, Sep 13, 2017 at 6:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It seems to me we don't really need the first patch all that much.  That
>> is, let's keep PartitionDispatchData the way it is for now, since we don't
>> really have any need for it beside tuple-routing (EIBO as committed didn't
>> need it for one).  So, let's forget about "decoupling
>> RelationGetPartitionDispatchInfo() from the executor" thing for now and
>> move on.
>>
>> So, attached is just the patch to make RelationGetPartitionDispatchInfo()
>> traverse the partition tree in depth-first manner to be applied on HEAD.
> 
> I like this patch.  Not only does it improve the behavior, but it's
> actually less code than we have now, and in my opinion, the new code
> is easier to understand, too.
> 
> A few suggestions:

Thanks for the review.

> - I think get_partition_dispatch_recurse() get a check_stack_depth()
> call just like expand_partitioned_rtentry() did, and for the same
> reasons: if the catalog contents are corrupted so that we have an
> infinite loop in the partitioning hierarchy, we want to error out, not
> crash.

Ah, missed that.  Done.

> - I think we should add a comment explaining that we're careful to do
> this in the same order as expand_partitioned_rtentry() so that callers
> can assume that the N'th entry in the leaf_part_oids array will also
> be the N'th child of an Append node.

Done.  Since the Append/ModifyTable may skip some leaf partitions due to
pruning, added a note about that too.

> +         * For every partitioned table other than root, we must store a
> 
> other than the root
> 
> +     * partitioned table.  The value multiplied back by -1 is returned as the
> 
> multiplied by -1, not multiplied back by -1
> 
> +     * tables in the tree, using which, search is continued further down the
> +     * partition tree.
> 
> Period after "in the tree".  Then continue: "This value is used to
> continue the search in the next level of the partition tree."

Fixed.

Attached updated patch.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Khandekar
Date:
On 14 September 2017 at 06:43, Amit Langote
> Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patch.

@@ -1222,151 +1209,130 @@ PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
                                  int
 
*num_parted, List **leaf_part_oids){
+       List   *pdlist;       PartitionDispatchData **pd;

+       get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);

Above, pdlist is passed uninitialized. And then inside
get_partition_dispatch_recurse(), it is used here :
*pds = lappend(*pds, pd);

--------

pg_indent says more alignments needed. For e.g. gettext_noop() call
below needs to be aligned:
pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
tupdesc,
gettext_noop("could not convert row type"));

--------

Other than that, the patch looks good to me. I verified that the leaf
oids are ordered exaclty in the order of the UPDATE subplans output.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] expanding inheritance in partition bound order

From
Robert Haas
Date:
On Thu, Sep 14, 2017 at 7:56 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 14 September 2017 at 06:43, Amit Langote
>> Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated patch.
>
> @@ -1222,151 +1209,130 @@ PartitionDispatch *
>  RelationGetPartitionDispatchInfo(Relation rel,
>                                                                  int
> *num_parted, List **leaf_part_oids)
>  {
> +       List   *pdlist;
>         PartitionDispatchData **pd;
>
> +       get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
>
> Above, pdlist is passed uninitialized. And then inside
> get_partition_dispatch_recurse(), it is used here :
> *pds = lappend(*pds, pd);
>
> --------
>
> pg_indent says more alignments needed. For e.g. gettext_noop() call
> below needs to be aligned:
> pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
> tupdesc,
> gettext_noop("could not convert row type"));
>
> --------
>
> Other than that, the patch looks good to me. I verified that the leaf
> oids are ordered exaclty in the order of the UPDATE subplans output.

Committed with fixes for those issues and a few other cosmetic changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] expanding inheritance in partition bound order

From
Amit Langote
Date:
On 2017/09/15 1:37, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 7:56 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 14 September 2017 at 06:43, Amit Langote
>>> Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Attached updated patch.
>>
>> @@ -1222,151 +1209,130 @@ PartitionDispatch *
>>  RelationGetPartitionDispatchInfo(Relation rel,
>>                                                                  int
>> *num_parted, List **leaf_part_oids)
>>  {
>> +       List   *pdlist;
>>         PartitionDispatchData **pd;
>>
>> +       get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
>>
>> Above, pdlist is passed uninitialized. And then inside
>> get_partition_dispatch_recurse(), it is used here :
>> *pds = lappend(*pds, pd);
>>
>> --------
>>
>> pg_indent says more alignments needed. For e.g. gettext_noop() call
>> below needs to be aligned:
>> pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
>> tupdesc,
>> gettext_noop("could not convert row type"));
>>
>> --------
>>
>> Other than that, the patch looks good to me. I verified that the leaf
>> oids are ordered exaclty in the order of the UPDATE subplans output.
> 
> Committed with fixes for those issues and a few other cosmetic changes.

Thanks Amit for the review and Robert for committing.

Regards,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers