Thread: [HACKERS] expanding inheritance in partition bound order
The current way to expand inherited tables, including partitioned tables, is to use either find_all_inheritors() or find_inheritance_children(), depending on the context. They return child table OIDs in the (ascending) order of those OIDs, which means the callers that need to lock the child tables can do so without worrying about the possibility of deadlock in some concurrent execution of that piece of code. That's good.

For partitioned tables, there is a possibility of returning child table (partition) OIDs in the partition bound order, which, in addition to preventing the possibility of deadlocks during concurrent locking, seems potentially useful for other caller-specific optimizations. For example, the tuple-routing code can utilize that fact to implement a binary-search-based partition-searching algorithm. For one more example, refer to the "UPDATE partition key" thread, where it's becoming clear that it would be nice if the planner had put the partitions in bound order in the ModifyTable that it creates for UPDATE of partitioned tables [1].

So attached are two WIP patches:

0001 implements two interface functions:

    List *get_all_partition_oids(Oid, LOCKMODE)
    List *get_partition_oids(Oid, LOCKMODE)

that resemble find_all_inheritors() and find_inheritance_children(), respectively, but expect that users call them only for partitioned tables. Needless to mention, OIDs are returned in a canonical order determined by that of the partition bounds and the way the partition tree structure is traversed (top-down, breadth-first, left-to-right). The patch replaces all the calls of the old interface functions with the respective new ones for partitioned table parents. That means expand_inherited_rtentry (among others) now calls get_all_partition_oids() if the RTE is for a partitioned table and find_all_inheritors() otherwise.
In its implementation, get_all_partition_oids() calls RelationGetPartitionDispatchInfo(), which is useful for generating the result list in the desired partition bound order. But the current interface and implementation of RelationGetPartitionDispatchInfo() need some rework, because they are too closely coupled with the executor's tuple-routing code. Applying just 0001 will satisfy the requirements stated in [1], but it won't look pretty as is for too long. So 0002 is a patch to refactor RelationGetPartitionDispatchInfo() and the relevant data structures.

For example, PartitionDispatchData has now been simplified to contain only the partition key, partition descriptor, and indexes array, whereas previously it also stored the relation descriptor, partition key execution state, tuple table slot, and tuple conversion map, which are required for tuple routing. RelationGetPartitionDispatchInfo() no longer generates those things, but returns just enough information so that a caller can generate and manage them by itself. This simplification makes it less cumbersome to call RelationGetPartitionDispatchInfo() in other places.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CA%2BTgmoajC0J50%3D2FqnZLvB10roY%2B68HgFWhso%3DV_StkC6PWujQ%40mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The current way to expand inherited tables, including partitioned tables,
> is to use either find_all_inheritors() or find_inheritance_children()
> depending on the context. They return child table OIDs in the (ascending)
> order of those OIDs, which means the callers that need to lock the child
> tables can do so without worrying about the possibility of deadlock in
> some concurrent execution of that piece of code. That's good.
>
> For partitioned tables, there is a possibility of returning child table
> (partition) OIDs in the partition bound order, which in addition to
> preventing the possibility of deadlocks during concurrent locking, seems
> potentially useful for other caller-specific optimizations. For example,
> tuple-routing code can utilize that fact to implement binary-search based
> partition-searching algorithm. For one more example, refer to the "UPDATE
> partition key" thread where it's becoming clear that it would be nice if
> the planner had put the partitions in bound order in the ModifyTable that
> it creates for UPDATE of partitioned tables [1].

Thanks a lot for working on this. Partition-wise join can benefit from this as well. See the comment about build_simple_rel's matching algorithm in [1]. It will become O(n) instead of O(n^2).

> So attached are two WIP patches:
>
> 0001 implements two interface functions:
>
> List *get_all_partition_oids(Oid, LOCKMODE)
> List *get_partition_oids(Oid, LOCKMODE)
>
> that resemble find_all_inheritors() and find_inheritance_children(),
> respectively, but expect that users call them only for partitioned tables.
> Needless to mention, OIDs are returned with canonical order determined by
> that of the partition bounds and they way partition tree structure is
> traversed (top-down, breadth-first-left-to-right).
> Patch replaces all the
> calls of the old interface functions with the respective new ones for
> partitioned table parents. That means expand_inherited_rtentry (among
> others) now calls get_all_partition_oids() if the RTE is for a partitioned
> table and find_all_inheritors() otherwise.
>
> In its implementation, get_all_partition_oids() calls
> RelationGetPartitionDispatchInfo(), which is useful to generate the result
> list in the desired partition bound order. But the current interface and
> implementation of RelationGetPartitionDispatchInfo() needs some rework,
> because it's too closely coupled with the executor's tuple routing code.

Maybe we want to implement get_all_partition_oids() by calling get_partition_oids() on each new entry that gets added, similar to find_all_inheritors(). That might avoid the changes to RelationGetPartitionDispatchInfo() and also the dependency on that structure.

Also, almost every place that called find_all_inheritors() or find_inheritance_children() is changed to an if () else calling those functions or the new function as required. Maybe we should create macros/functions to do that routing to keep the code readable. Maybe find_all_inheritors() and find_inheritance_children() themselves become the routing functions and their original code moves into new functions get_all_inheritors() and get_inheritance_children(). We may choose other names for the functions. The idea is to create routing functions/macros instead of sprinkling the code with if () else conditions.

[1] https://www.postgresql.org/message-id/CA+TgmobeRUTu4osXA_UA4AORho83WxAvFG8n1NQcoFuujbeh7A@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The current way to expand inherited tables, including partitioned tables,
> is to use either find_all_inheritors() or find_inheritance_children()
> depending on the context. They return child table OIDs in the (ascending)
> order of those OIDs, which means the callers that need to lock the child
> tables can do so without worrying about the possibility of deadlock in
> some concurrent execution of that piece of code. That's good.
>
> For partitioned tables, there is a possibility of returning child table
> (partition) OIDs in the partition bound order, which in addition to
> preventing the possibility of deadlocks during concurrent locking, seems
> potentially useful for other caller-specific optimizations. For example,
> tuple-routing code can utilize that fact to implement binary-search based
> partition-searching algorithm. For one more example, refer to the "UPDATE
> partition key" thread where it's becoming clear that it would be nice if
> the planner had put the partitions in bound order in the ModifyTable that
> it creates for UPDATE of partitioned tables [1].

I guess I don't really understand why we want to change the locking order. That is bound to make expanding the inheritance hierarchy more expensive. If we use this approach in all cases, it seems to me we're bound to reintroduce the problem we fixed in commit c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in the same vein. But I don't see that there's any necessary relation between the order of locking and the order of expansion inside the relcache entry/plan/whatever else -- so I propose that we keep the existing locking order and only change the other stuff.
While reading related code this morning, I noticed that RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have *already* changed the locking order for certain operations, because the PartitionDesc's OID array is bound-ordered, not OID-ordered. That means that when RelationGetPartitionDispatchInfo uses the PartitionDesc's OID array to figure out what to lock, it's potentially going to encounter partitions in a different order than would have been the case if it had used find_all_inheritors directly.

I'm tempted to think that RelationGetPartitionDispatchInfo shouldn't really be doing locking at all. The best way to have the locking always happen in the same order is to have only one piece of code that determines that order - and I vote for find_all_inheritors. Aside from the fact that it's the way we've always done it (and still do it in most other places), that code includes infinite-loop defenses which the new code you've introduced lacks.

Concretely, my proposal is:

1. Before calling RelationGetPartitionDispatchInfo, the calling code should use find_all_inheritors to lock all the relevant relations (or the planner could use find_all_inheritors to get a list of relation OIDs, store it in the plan in order, and then at execution time we visit them in that order and lock them).

2. RelationGetPartitionDispatchInfo assumes the relations are already locked.

3. While we're optimizing, in the first loop inside of RelationGetPartitionDispatchInfo, don't call heap_open(). Instead, use get_rel_relkind() to see whether we've got a partitioned table; if so, open it. If not, there's no need.

4. For safety, add a function bool RelationLockHeldByMe(Oid) and add to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))) elog(ERROR, ...). Might be interesting to stuff that check into the relation_open(..., NoLock) path, too.
One objection to this line of attack is that there might be a good case for locking only the partitioned inheritors first and then going back and locking the leaf nodes in a second pass, or even only when required for a particular row. However, that doesn't require putting everything in bound order - it only requires moving the partitioned children to the beginning of the list. And I think rather than having new logic for that, we should teach find_inheritance_children() to do that directly. I have a feeling Ashutosh is going to cringe at this suggestion, but my idea is to do this by denormalizing: add a column to pg_inherits indicating whether the child is of RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans pg_inherits, it can pull that flag out for free along with the relation OID, and qsort() first by the flag and then by the OID. It can also return the number of initial elements of its return value which have that flag set.

Then, in find_all_inheritors, we split rels_list into partitioned_rels_list and other_rels_list, and process partitioned_rels_list in its entirety before touching other_rels_list; they get concatenated at the end.

Now, find_all_inheritors and find_inheritance_children can also grow a flag bool only_partitioned_children; if set, then we skip the unpartitioned children entirely.

With all that in place, you can call find_all_inheritors(blah blah, false) to lock the whole hierarchy, or find_all_inheritors(blah blah, true) to lock just the partitioned tables in the hierarchy. You get a consistent lock order either way, and if you start with only the partitioned tables and later want the leaf partitions too, you just go through the partitioned children in the order they were returned and find_inheritance_children(blah blah, false) on each one of them, and the lock order is exactly consistent with what you would have gotten if you'd done find_all_inheritors(blah blah, false) originally.

Thoughts?

P.S. While I haven't reviewed 0002 in detail, I think the concept of minimizing what needs to be built in RelationGetPartitionDispatchInfo is a very good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4 August 2017 at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).
>
> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.

I agree. I think overall, we should keep RelationGetPartitionDispatchInfo() only for preparing the dispatch info in the planner, and generate the locked OIDs (using find_all_inheritors() or get_partitioned_oids() or whatever) *without* using RelationGetPartitionDispatchInfo(), since RelationGetPartitionDispatchInfo() generates the pd structure, which we don't want in every expansion.

>
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it. If not, there's no need.

Yes, this way we need to open only the partitioned tables.

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

True. I also think RelationGetPartitionDispatchInfo() should be called while preparing the ModifyTable plan; the PartitionDispatch data structure returned by RelationGetPartitionDispatchInfo() should be stored in that plan, and then the execution-time fields in PartitionDispatch would be populated in ExecInitModifyTable().

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2017/08/04 20:28, Ashutosh Bapat wrote:
> On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The current way to expand inherited tables, including partitioned tables,
>> is to use either find_all_inheritors() or find_inheritance_children()
>> depending on the context. They return child table OIDs in the (ascending)
>> order of those OIDs, which means the callers that need to lock the child
>> tables can do so without worrying about the possibility of deadlock in
>> some concurrent execution of that piece of code. That's good.
>>
>> For partitioned tables, there is a possibility of returning child table
>> (partition) OIDs in the partition bound order, which in addition to
>> preventing the possibility of deadlocks during concurrent locking, seems
>> potentially useful for other caller-specific optimizations. For example,
>> tuple-routing code can utilize that fact to implement binary-search based
>> partition-searching algorithm. For one more example, refer to the "UPDATE
>> partition key" thread where it's becoming clear that it would be nice if
>> the planner had put the partitions in bound order in the ModifyTable that
>> it creates for UPDATE of partitioned tables [1].
>
> Thanks a lot for working on this. Partition-wise join can benefit from
> this as well. See comment about build_simple_rel's matching algorithm
> in [1]. It will become O(n) instead of O(n^2).

Nice. It seems that we have a good demand for $subject. :)

>> So attached are two WIP patches:
>>
>> 0001 implements two interface functions:
>>
>> List *get_all_partition_oids(Oid, LOCKMODE)
>> List *get_partition_oids(Oid, LOCKMODE)
>>
>> that resemble find_all_inheritors() and find_inheritance_children(),
>> respectively, but expect that users call them only for partitioned tables.
>> Needless to mention, OIDs are returned with canonical order determined by
>> that of the partition bounds and they way partition tree structure is
>> traversed (top-down, breadth-first-left-to-right). Patch replaces all the
>> calls of the old interface functions with the respective new ones for
>> partitioned table parents. That means expand_inherited_rtentry (among
>> others) now calls get_all_partition_oids() if the RTE is for a partitioned
>> table and find_all_inheritors() otherwise.
>>
>> In its implementation, get_all_partition_oids() calls
>> RelationGetPartitionDispatchInfo(), which is useful to generate the result
>> list in the desired partition bound order. But the current interface and
>> implementation of RelationGetPartitionDispatchInfo() needs some rework,
>> because it's too closely coupled with the executor's tuple routing code.
>
> May be we want to implement get_all_partition_oids() calling
> get_partition_oids() on each new entry that gets added, similar to
> find_all_inheritors(). That might avoid changes to DispatchInfo() and
> also dependency on that structure.
>
> Also almost every place which called find_all_inheritors() or
> find_inheritance_children() is changed to if () else case calling
> those functions or the new function as required. May be we should
> create macros/functions to do that routing to keep the code readable.
> May be find_all_inheritors() and find_inheritance_children()
> themselves become the routing function and their original code moves
> into new functions get_all_inheritors() and
> get_inheritance_children(). We may choose other names for functions.
> The idea is to create routing functions/macros instead of sprinkling
> code with if () else conditions.

Given Robert's comments [1], it seems that we might have to abandon the idea of a separate interface for partitioned and non-partitioned inheritance cases. I'm thinking about the issues and alternatives he mentioned in his email.
Thanks,
Amit

[1] https://www.postgresql.org/message-id/CA%2BTgmobwbh12OJerqAGyPEjb_%2B2y7T0nqRKTcjed6L4NTET6Fg%40mail.gmail.com
On Fri, Aug 4, 2017 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The current way to expand inherited tables, including partitioned tables,
>> is to use either find_all_inheritors() or find_inheritance_children()
>> depending on the context. They return child table OIDs in the (ascending)
>> order of those OIDs, which means the callers that need to lock the child
>> tables can do so without worrying about the possibility of deadlock in
>> some concurrent execution of that piece of code. That's good.
>>
>> For partitioned tables, there is a possibility of returning child table
>> (partition) OIDs in the partition bound order, which in addition to
>> preventing the possibility of deadlocks during concurrent locking, seems
>> potentially useful for other caller-specific optimizations. For example,
>> tuple-routing code can utilize that fact to implement binary-search based
>> partition-searching algorithm. For one more example, refer to the "UPDATE
>> partition key" thread where it's becoming clear that it would be nice if
>> the planner had put the partitions in bound order in the ModifyTable that
>> it creates for UPDATE of partitioned tables [1].
>
> I guess I don't really understand why we want to change the locking
> order. That is bound to make expanding the inheritance hierarchy more
> expensive. If we use this approach in all cases, it seems to me we're
> bound to reintroduce the problem we fixed in commit
> c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
> the same vein.

I initially didn't understand this, but I think now I understand it. Establishing the order of children by partition bounds requires building the relcache entry right now. That's what is expensive and would introduce the same problems as the commit you have quoted.
> But I don't see that there's any necessary relation
> between the order of locking and the order of expansion inside the
> relcache entry/plan/whatever else -- so I propose that we keep the
> existing locking order and only change the other stuff.
>
> While reading related code this morning, I noticed that
> RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
> *already* changed the locking order for certain operations, because
> the PartitionDesc's OID array is bound-ordered not OID-ordered. That
> means that when RelationGetPartitionDispatchInfo uses the
> PartitionDesc's OID array to figure out what to lock, it's potentially
> going to encounter partitions in a different order than would have
> been the case if it had used find_all_inheritors directly. I'm
> tempted to think that RelationGetPartitionDispatchInfo shouldn't
> really be doing locking at all. The best way to have the locking
> always happen in the same order is to have only one piece of code that
> determines that order - and I vote for find_all_inheritors. Aside
> from the fact that it's the way we've always done it (and still do it
> in most other places), that code includes infinite-loop defenses which
> the new code you've introduced lacks.

+1.

> Concretely, my proposal is:
>
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).
>
> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
>
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it. If not, there's no need.
>
> 4.
> For safety, add a function bool RelationLockHeldByMe(Oid) and add
> to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
> elog(ERROR, ...). Might be interesting to stuff that check into the
> relation_open(..., NoLock) path, too.
>
> One objection to this line of attack is that there might be a good
> case for locking only the partitioned inheritors first and then going
> back and locking the leaf nodes in a second pass, or even only when
> required for a particular row. However, that doesn't require putting
> everything in bound order - it only requires moving the partitioned
> children to the beginning of the list. And I think rather than having
> new logic for that, we should teach find_inheritance_children() to do
> that directly. I have a feeling Ashutosh is going to cringe at this
> suggestion, but my idea is to do this by denormalizing: add a column
> to pg_inherits indicating whether the child is of
> RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
> pg_inherits, it can pull that flag out for free along with the
> relation OID, and qsort() first by the flag and then by the OID. It
> can also return the number of initial elements of its return value
> which have that flag set.

I am always uncomfortable when we save the same information in two places without having a way to make sure that they stay in sync. That means we have to add explicit code to keep that information in sync. Somebody forgetting to add that code wherever necessary means we have contradictory information persisted in the database with no idea which copy is correct. Not necessarily in this case, but usually it is an indication of something going wrong with the way the schema is designed. Maybe it's better to use your idea of using get_rel_relkind(), or to find a way to check that the flag is in sync with the relkind, like when building the relcache.
>
> Then, in find_all_inheritors, we split rels_list into
> partitioned_rels_list and other_rels_list, and process
> partitioned_rels_list in its entirety before touching other_rels_list;
> they get concatenated at the end.
>
> Now, find_all_inheritors and find_inheritance_children can also grow a
> flag bool only_partitioned_children; if set, then we skip the
> unpartitioned children entirely.
>
> With all that in place, you can call find_all_inheritors(blah blah,
> false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
> true) to lock just the partitioned tables in the hierarchy. You get a
> consistent lock order either way, and if you start with only the
> partitioned tables and later want the leaf partitions too, you just go
> through the partitioned children in the order they were returned and
> find_inheritance_children(blah blah, false) on each one of them and
> the lock order is exactly consistent with what you would have gotten
> if you'd done find_all_inheritors(blah blah, false) originally.
>
> Thoughts?

I noticed that find_all_inheritors() uses a hash table to eliminate duplicates arising out of multiple inheritance. A partition hierarchy is never going to have multiple inheritance, so it doesn't need to eliminate duplicates and hence doesn't need the hash table. It would be good if we could eliminate that overhead. But that's a separate task from what this thread is about.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Aug 7, 2017 at 11:18 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> One objection to this line of attack is that there might be a good
>> case for locking only the partitioned inheritors first and then going
>> back and locking the leaf nodes in a second pass, or even only when
>> required for a particular row. However, that doesn't require putting
>> everything in bound order - it only requires moving the partitioned
>> children to the beginning of the list. And I think rather than having
>> new logic for that, we should teach find_inheritance_children() to do
>> that directly. I have a feeling Ashutosh is going to cringe at this
>> suggestion, but my idea is to do this by denormalizing: add a column
>> to pg_inherits indicating whether the child is of
>> RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
>> pg_inherits, it can pull that flag out for free along with the
>> relation OID, and qsort() first by the flag and then by the OID. It
>> can also return the number of initial elements of its return value
>> which have that flag set.
>
> I am always uncomfortable, when we save the same information in two
> places without having a way to make sure that they are in sync. That
> means we have to add explicit code to make sure that that information
> is kept in sync. Somebody forgetting to add that code wherever
> necessary means we have contradictory information persisted in the
> databases without an idea of which of them is correct. Not necessarily
> in this case, but usually it is an indication of something going wrong
> with the way schema is designed. May be it's better to use your idea
> of using get_rel_relkind() or find a way to check that the flag is in
> sync with the relkind, like when building the relcache.
That said, I think we will use this code quite often, so the performance benefits of replicating the information are worth the trouble of maintaining code to sync and check the duplicated information.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2017/08/05 2:25, Robert Haas wrote:
> On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The current way to expand inherited tables, including partitioned tables,
>> is to use either find_all_inheritors() or find_inheritance_children()
>> depending on the context. They return child table OIDs in the (ascending)
>> order of those OIDs, which means the callers that need to lock the child
>> tables can do so without worrying about the possibility of deadlock in
>> some concurrent execution of that piece of code. That's good.
>>
>> For partitioned tables, there is a possibility of returning child table
>> (partition) OIDs in the partition bound order, which in addition to
>> preventing the possibility of deadlocks during concurrent locking, seems
>> potentially useful for other caller-specific optimizations. For example,
>> tuple-routing code can utilize that fact to implement binary-search based
>> partition-searching algorithm. For one more example, refer to the "UPDATE
>> partition key" thread where it's becoming clear that it would be nice if
>> the planner had put the partitions in bound order in the ModifyTable that
>> it creates for UPDATE of partitioned tables [1].
>
> I guess I don't really understand why we want to change the locking
> order. That is bound to make expanding the inheritance hierarchy more
> expensive. If we use this approach in all cases, it seems to me we're
> bound to reintroduce the problem we fixed in commit
> c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
> the same vein. But I don't see that there's any necessary relation
> between the order of locking and the order of expansion inside the
> relcache entry/plan/whatever else -- so I propose that we keep the
> existing locking order and only change the other stuff.

Hmm, okay. I guess I was trying to fit one solution to what might be two (or worse, more) problems of the current implementation, which is not good.
> While reading related code this morning, I noticed that
> RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
> *already* changed the locking order for certain operations, because
> the PartitionDesc's OID array is bound-ordered not OID-ordered. That
> means that when RelationGetPartitionDispatchInfo uses the
> PartitionDesc's OID array to figure out what to lock, it's potentially
> going to encounter partitions in a different order than would have
> been the case if it had used find_all_inheritors directly.

I think Amit Khandekar mentioned this on the UPDATE partition key thread [1].

> I'm
> tempted to think that RelationGetPartitionDispatchInfo shouldn't
> really be doing locking at all. The best way to have the locking
> always happen in the same order is to have only one piece of code that
> determines that order - and I vote for find_all_inheritors. Aside
> from the fact that it's the way we've always done it (and still do it
> in most other places), that code includes infinite-loop defenses which
> the new code you've introduced lacks.

As long as find_all_inheritors() is only the place to determine the order in which partitions will be locked, it's fine. My concern is about the timing of the actual locking, which in the current planner implementation happens so early that we end up needlessly locking all the partitions. (Also, in the current implementation, we open all the partitions to construct Var translation lists, which are actually unused through most of the planner stages, but admittedly that's a separate issue.) The locking-partitions-too-soon issue, I think, is an important one and may need to be discussed separately, but I thought I'd mention it anyway. It also seems somewhat related to this discussion, but I may be wrong.

> Concretely, my proposal is:
>
> 1.
> Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).

ISTM, we'd want to lock the partitions after we've determined the specific ones a query needs to scan, using the information returned by RelationGetPartitionDispatchInfo. That means the latter had better have locked the relations whose cached partition descriptors will be used to determine the result that it produces. One way to do that might be to lock all the tables in the list returned by find_all_inheritors that are partitioned tables before calling RelationGetPartitionDispatchInfo. It seems the approach you've outlined below will let us do that.

BTW, IIUC, we'll have two lists of OIDs: one in the find_all_inheritors order, say, L1, and the other determined by using partitioning-specific information for the given query, say L2. To lock, we iterate L1 and, if a given member is in L2, we lock it. It might be possible to make that as cheap as O(n log n). L2 is the order in which we put leaf partitions into a given plan.

> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
>
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it. If not, there's no need.

That's what the proposed refactoring patch 0002 actually does.

> 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
> to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
> elog(ERROR, ...). Might be interesting to stuff that check into the
> relation_open(..., NoLock) path, too.
> > One objection to this line of attack is that there might be a good > case for locking only the partitioned inheritors first and then going > back and locking the leaf nodes in a second pass, or even only when > required for a particular row. However, that doesn't require putting > everything in bound order - it only requires moving the partitioned > children to the beginning of the list. And I think rather than having > new logic for that, we should teach find_inheritance_children() to do > that directly. I have a feeling Ashutosh is going to cringe at this > suggestion, but my idea is to do this by denormalizing: add a column > to pg_inherits indicating whether the child is of > RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans > pg_inherits, it can pull that flag out for free along with the > relation OID, and qsort() first by the flag and then by the OID. It > can also return the number of initial elements of its return value > which have that flag set. Maybe, we can make the initial patch use syscache to get the relkind for a given child. If the syscache bloat is unbearable, we go with the denormalization approach. > Then, in find_all_inheritors, we split rels_list into > partitioned_rels_list and other_rels_list, and process > partitioned_rels_list in its entirety before touching other_rels_list; > they get concatenated at the end. > > Now, find_all_inheritors and find_inheritance_children can also grow a > flag bool only_partitioned_children; if set, then we skip the > unpartitioned children entirely. > > With all that in place, you can call find_all_inheritors(blah blah, > false) to lock the whole hierarchy, or find_all_inheritors(blah blah, > true) to lock just the partitioned tables in the hierarchy. 
You get a
> consistent lock order either way, and if you start with only the
> partitioned tables and later want the leaf partitions too, you just go
> through the partitioned children in the order they were returned and
> find_inheritance_children(blah blah, false) on each one of them and
> the lock order is exactly consistent with what you would have gotten
> if you'd done find_all_inheritors(blah blah, false) originally.
>
> Thoughts?

So, with this in place:

1. Call find_all_inheritors to lock partitioned tables in the tree in an order prescribed by OIDs

2. Call RelationGetPartitionDispatchInfo at an opportune time, which will generate minimal information about the partition tree, which it can do without having to worry about locking anything

3. Determine the list of leaf partitions that will need to be scanned using the information obtained in 2, if it's possible to do that at all [2]

4. Lock leaf partitions in the find_inheritance_children prescribed order, but only those that are in the list built in 3

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

Thanks.

Regards,
Amit

[1] https://www.postgresql.org/message-id/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB%3DwY9ZLfe8cQOfG4bTqVGyQ%40mail.gmail.com
[2] We can do that in set_append_rel_size(), but not in inheritance_planner()
On Mon, Aug 7, 2017 at 1:48 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > with the way schema is designed. May be it's better to use your idea > of using get_rel_relkind() or find a way to check that the flag is in > sync with the relkind, like when building the relcache. That's got the same problem as building a full relcache entry: cache bloat. It will have *less* cache bloat, but still some. Maybe it's little enough to be tolerable; not sure. But we want this system to scale to LOTS of partitions eventually, so building on a design that we know has scaling problems seems a bit short-sighted. > I noticed that find_all_inheritors() uses a hash table to eliminate > duplicates arising out of multiple inheritance. Partition hierarchy is > never going to have multiple inheritance, and doesn't need to > eliminate duplicates and so doesn't need the hash table. It will be > good, if we can eliminate that overhead. But that's separate task than > what this thread is about. I don't want to eliminate that overhead. If the catalog is manually modified or corrupted, the problem could still occur, and result in backend crashes or, at best, incomprehensible errors. The comments allude to this problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 7, 2017 at 2:54 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I think Amit Khandekar mentioned this on the UPDATE partition key thread [1]. Yes, similar discussion. > As long as find_all_inheritors() is a place only to determine the order in > which partitions will be locked, it's fine. My concern is about the time > of actual locking, which in the current planner implementation is too soon > that we end up needlessly locking all the partitions. I don't think avoiding that problem is going to be easy. We need a bunch of per-relation information, like the size of each relation, and what indexes it has, and how big they are, and the statistics for each one. It was at one point proposed by someone that every partition should be required to have the same indexes, but (1) we didn't implement it like that and (2) if we had done that it wouldn't solve this problem anyway because the sizes are still going to vary. Note that I'm not saying this isn't a good problem to solve, just that it's likely to be a very hard problem to solve. > The locking-partitions-too-soon issue, I think, is an important one and > ISTM, we'd want to lock the partitions after we've determined the specific > ones a query needs to scan using the information returned by > RelationGetPartitionDispatchInfo. That means the latter had better locked > the relations whose cached partition descriptors will be used to determine > the result that it produces. One way to do that might be to lock all the > tables in the list returned by find_all_inheritors that are partitioned > tables before calling RelationGetPartitionDispatchInfo. It seems what the > approach you've outlined below will let us do that. Yeah, I think so. I think we could possibly open and lock partitioned children only, then prune away leaf partitions that we can determine aren't needed, then open and lock the leaf partitions that are needed. 
> BTW, IIUC, there will be two lists of OIDs we'll have: one in the > find_all_inheritors order, say, L1 and the other determined by using > partitioning-specific information for the given query, say L2. > > To lock, we iterate L1 and if a given member is in L2, we lock it. It > might be possible to make it as cheap as O(nlogn). Commonly, we'll prune no partitions or all but one; and we should be able to make those cases very fast. Other cases can cost a little more, but I'll certainly complain about anything more than O(n lg n). >> 3. While we're optimizing, in the first loop inside of >> RelationGetPartitionDispatchInfo, don't call heap_open(). Instead, >> use get_rel_relkind() to see whether we've got a partitioned table; if >> so, open it. If not, there's no need. > > That's what the proposed refactoring patch 0002 actually does. Great. > Maybe, we can make the initial patch use syscache to get the relkind for a > given child. If the syscache bloat is unbearable, we go with the > denormalization approach. Yeah. Maybe if you write that patch, you can also test it to see how bad the bloat is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/08/07 14:37, Amit Khandekar wrote:
> On 4 August 2017 at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
>> P.S. While I haven't reviewed 0002 in detail, I think the concept of
>> minimizing what needs to be built in RelationGetPartitionDispatchInfo
>> is a very good idea.
>
> True. I also think RelationGetPartitionDispatchInfo() should be
> called while preparing the ModifyTable plan; the PartitionDispatch
> data structure returned by RelationGetPartitionDispatchInfo() should
> be stored in that plan, and then the execution-time fields in
> PartitionDispatch would be populated in ExecInitModifyTable().

I'm not sure if we could ever store the PartitionDispatch structure itself in the plan. The planner would build and use it to put the leaf partition sub-plans in the canonical order in the resulting plan (Append, ModifyTable, etc.). The executor will have to rebuild the PartitionDispatch info again if and when it needs it (for example, in ExecSetupPartitionTupleRouting for insert or update tuple routing).

The refactoring patch that I've proposed (0002) makes the PartitionDispatch structure itself contain a lot less information/state than it currently does. So RelationGetPartitionDispatchInfo's job per the revised patch is to reveal the partition tree structure and the information about each partitioned table that the tree contains. The original design, whereby it builds and puts into PartitionDispatchData things like partition key execution state (ExprState), TupleTableSlot, and TupleConversionMap, seems wrong to me in retrospect, and we should somehow revise it. Those things are only needed for tuple-routing, so they should be built and managed by the executor, not partition.c.

Any feedback on the proposed patch is welcome. :)

Thanks,
Amit
On 2017/08/08 4:34, Robert Haas wrote: > On Mon, Aug 7, 2017 at 2:54 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> As long as find_all_inheritors() is a place only to determine the order in >> which partitions will be locked, it's fine. My concern is about the time >> of actual locking, which in the current planner implementation is too soon >> that we end up needlessly locking all the partitions. > > I don't think avoiding that problem is going to be easy. We need a > bunch of per-relation information, like the size of each relation, and > what indexes it has, and how big they are, and the statistics for each > one. It was at one point proposed by someone that every partition > should be required to have the same indexes, but (1) we didn't > implement it like that and (2) if we had done that it wouldn't solve > this problem anyway because the sizes are still going to vary. Sorry, I didn't mean to say we shouldn't lock and open partitions at all. We do need their relation descriptors for planning and there is no doubt about that. I was just saying that we should do that only for the partitions that are not pruned. But, as you say, I can see that the planner changes required to be able to do that might be hard. >> The locking-partitions-too-soon issue, I think, is an important one and >> ISTM, we'd want to lock the partitions after we've determined the specific >> ones a query needs to scan using the information returned by >> RelationGetPartitionDispatchInfo. That means the latter had better locked >> the relations whose cached partition descriptors will be used to determine >> the result that it produces. One way to do that might be to lock all the >> tables in the list returned by find_all_inheritors that are partitioned >> tables before calling RelationGetPartitionDispatchInfo. It seems what the >> approach you've outlined below will let us do that. > > Yeah, I think so. 
I think we could possibly open and lock partitioned > children only, then prune away leaf partitions that we can determine > aren't needed, then open and lock the leaf partitions that are needed. Yes. >> BTW, IIUC, there will be two lists of OIDs we'll have: one in the >> find_all_inheritors order, say, L1 and the other determined by using >> partitioning-specific information for the given query, say L2. >> >> To lock, we iterate L1 and if a given member is in L2, we lock it. It >> might be possible to make it as cheap as O(nlogn). > > Commonly, we'll prune no partitions or all but one; and we should be > able to make those cases very fast. Agreed. >> Maybe, we can make the initial patch use syscache to get the relkind for a >> given child. If the syscache bloat is unbearable, we go with the >> denormalization approach. > > Yeah. Maybe if you write that patch, you can also test it to see how > bad the bloat is. I will try and see, but maybe the syscache solution doesn't get us past the proof-of-concept stage. Thanks, Amit
On 2017/08/05 2:25, Robert Haas wrote: > Concretely, my proposal is: > > 1. Before calling RelationGetPartitionDispatchInfo, the calling code > should use find_all_inheritors to lock all the relevant relations (or > the planner could use find_all_inheritors to get a list of relation > OIDs, store it in the plan in order, and then at execution time we > visit them in that order and lock them). > > 2. RelationGetPartitionDispatchInfo assumes the relations are already locked. > > 3. While we're optimizing, in the first loop inside of > RelationGetPartitionDispatchInfo, don't call heap_open(). Instead, > use get_rel_relkind() to see whether we've got a partitioned table; if > so, open it. If not, there's no need. > > 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add > to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1)) > elog(ERROR, ...). Might be interesting to stuff that check into the > relation_open(..., NoLock) path, too. > > One objection to this line of attack is that there might be a good > case for locking only the partitioned inheritors first and then going > back and locking the leaf nodes in a second pass, or even only when > required for a particular row. However, that doesn't require putting > everything in bound order - it only requires moving the partitioned > children to the beginning of the list. And I think rather than having > new logic for that, we should teach find_inheritance_children() to do > that directly. I have a feeling Ashutosh is going to cringe at this > suggestion, but my idea is to do this by denormalizing: add a column > to pg_inherits indicating whether the child is of > RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans > pg_inherits, it can pull that flag out for free along with the > relation OID, and qsort() first by the flag and then by the OID. It > can also return the number of initial elements of its return value > which have that flag set. 
>
> Then, in find_all_inheritors, we split rels_list into
> partitioned_rels_list and other_rels_list, and process
> partitioned_rels_list in its entirety before touching other_rels_list;
> they get concatenated at the end.
>
> Now, find_all_inheritors and find_inheritance_children can also grow a
> flag bool only_partitioned_children; if set, then we skip the
> unpartitioned children entirely.
>
> With all that in place, you can call find_all_inheritors(blah blah,
> false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
> true) to lock just the partitioned tables in the hierarchy. You get a
> consistent lock order either way, and if you start with only the
> partitioned tables and later want the leaf partitions too, you just go
> through the partitioned children in the order they were returned and
> find_inheritance_children(blah blah, false) on each one of them and
> the lock order is exactly consistent with what you would have gotten
> if you'd done find_all_inheritors(blah blah, false) originally.

I tried implementing this in the attached set of patches.

[PATCH 2/5] Teach pg_inherits.c a bit about partitioning

Both find_inheritance_children and find_all_inheritors now list partitioned child tables before non-partitioned ones and return the number of partitioned tables in an optional output argument.

[PATCH 3/5] Relieve RelationGetPartitionDispatchInfo() of doing locking

Anyone who wants to call RelationGetPartitionDispatchInfo() must first acquire locks using find_all_inheritors.

TODO: Add RelationLockHeldByMe() and put an if (!RelationLockHeldByMe()) elog(ERROR, ...) check in RelationGetPartitionDispatchInfo()

[PATCH 4/5] Teach expand_inherited_rtentry to use partition bound order

After locking the child tables using find_all_inheritors, we discard the list of child table OIDs that it generates and rebuild the same using the information returned by RelationGetPartitionDispatchInfo.
[PATCH 5/5] Store in pg_inherits if a child is a partitioned table

Catalog changes so that the is_partitioned property of child tables is now stored in pg_inherits. This avoids consulting the syscache to get that property as is currently implemented in patch 2/5.

I haven't yet done anything about changing the timing of opening and locking leaf partitions, because it will require some more thinking about the required planner changes. But the above set of patches will get us far enough to get leaf partition sub-plans to appear in the partition bound order (the same order as what partition tuple-routing uses in the executor).

With the above patches, we get the desired order of child sub-plans in Append and ModifyTable plans for partitioned tables:

create table p (a int) partition by range (a);
create table p4 partition of p for values from (30) to (40);
create table p3 partition of p for values from (20) to (30);
create table p2 partition of p for values from (10) to (20);
create table p1 partition of p for values from (1) to (10);
create table p0 partition of p for values from (minvalue) to (1) partition by list (a);
create table p00 partition of p0 for values in (0);
create table p01 partition of p0 for values in (-1);
create table p02 partition of p0 for values in (-2);

explain select count(*) from p;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=293.12..293.13 rows=1 width=8)
   ->  Append  (cost=0.00..248.50 rows=17850 width=0)
         ->  Seq Scan on p1  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p2  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p3  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p4  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p02  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p01  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p00  (cost=0.00..35.50 rows=2550 width=0)

explain update p set a = a;
                          QUERY PLAN
--------------------------------------------------------------
 Update on p  (cost=0.00..248.50 rows=17850 width=10)
   Update on p1
   Update on p2
   Update on p3
   Update on p4
   Update on p02
   Update on p01
   Update on p00
   ->  Seq Scan on p1  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p2  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p3  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p4  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p02  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p01  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p00  (cost=0.00..35.50 rows=2550 width=10)
(15 rows)

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

I put this patch ahead in the list and so it's now 0001.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
- 0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch
- 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch
- 0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch
- 0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch
- 0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch
Hi Amit,

On Thu, Aug 10, 2017 at 7:41 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/08/05 2:25, Robert Haas wrote:
>> Concretely, my proposal is:
>>
>> P.S. While I haven't reviewed 0002 in detail, I think the concept of
>> minimizing what needs to be built in RelationGetPartitionDispatchInfo
>> is a very good idea.
>
> I put this patch ahead in the list and so it's now 0001.
>

FYI, the 0001 patch throws this warning:

execMain.c: In function ‘ExecSetupPartitionTupleRouting’:
execMain.c:3342:16: warning: ‘next_ptinfo’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
   next_ptinfo->parentid != ptinfo->parentid)

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 9, 2017 at 10:11 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> P.S. While I haven't reviewed 0002 in detail, I think the concept of >> minimizing what needs to be built in RelationGetPartitionDispatchInfo >> is a very good idea. > > I put this patch ahead in the list and so it's now 0001. I think what you've currently got as 0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch is a bug fix that probably needs to be back-patched into v10, so it should come first. I think 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch and 0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch should be merged into one patch and that should come next, followed by 0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch and finally what you now have as 0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch. This patch series is blocking a bunch of other things, so it would be nice if you could press forward with this quickly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/08/10 18:52, Beena Emerson wrote: > Hi Amit, > > On Thu, Aug 10, 2017 at 7:41 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/08/05 2:25, Robert Haas wrote: >>> Concretely, my proposal is: >>> >>> P.S. While I haven't reviewed 0002 in detail, I think the concept of >>> minimizing what needs to be built in RelationGetPartitionDispatchInfo >>> is a very good idea. >> >> I put this patch ahead in the list and so it's now 0001. >> > > FYI, 0001 patch throws the warning: > > execMain.c: In function ‘ExecSetupPartitionTupleRouting’: > execMain.c:3342:16: warning: ‘next_ptinfo’ may be used uninitialized > in this function [-Wmaybe-uninitialized] > next_ptinfo->parentid != ptinfo->parentid) Thanks for the review. Will fix in the updated version of the patch I will post sometime later today. Regards, Amit
Thanks for the review.

On 2017/08/16 2:27, Robert Haas wrote:
> On Wed, Aug 9, 2017 at 10:11 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> P.S. While I haven't reviewed 0002 in detail, I think the concept of
>>> minimizing what needs to be built in RelationGetPartitionDispatchInfo
>>> is a very good idea.
>>
>> I put this patch ahead in the list and so it's now 0001.
>
> I think what you've currently got as
> 0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch is a
> bug fix that probably needs to be back-patched into v10, so it should
> come first.

That makes sense. That patch is now 0001. Checked that it can be back-patched to REL_10_STABLE.

> I think 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch and
> 0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch should
> be merged into one patch and that should come next,

Merged the two into one: attached 0002.

> followed by
> 0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch and

This one is now 0003.

> finally what you now have as
> 0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch.

And 0004.

> This patch series is blocking a bunch of other things, so it would be
> nice if you could press forward with this quickly.

Attached updated patches.

Thanks,
Amit
Attachment
On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.

Thanks Amit for the patches. I too agree with the overall approach taken for keeping the locking order consistent: it's best to do the locking with the existing find_all_inheritors(), since it is much cheaper than locking them in partition-bound order, the latter being expensive since it requires opening partitioned tables.

> I haven't yet done anything about changing the timing of opening and
> locking leaf partitions, because it will require some more thinking about
> the required planner changes. But the above set of patches will get us
> far enough to get leaf partition sub-plans appear in the partition bound
> order (same order as what partition tuple-routing uses in the executor).

So, I believe none of the changes done in pg_inherits.c are essential for expanding the inheritance tree in bound order, right?

I am not sure whether we are planning to commit these two things together or incrementally:
1. Expand in bound order
2. Allow for locking only the partitioned tables first.

For #1, the changes in pg_inherits.c are not required (viz, keeping partitioned tables at the head of the list, adding the inhchildparted column, etc).

If we are going to do #2 together with #1, then in the patch set there is no one using the capability introduced by #2. That is, there are no callers of find_all_inheritors() that make use of the new num_partitioned_children parameter. Also, there is no boolean parameter for find_all_inheritors() to be used to lock only the partitioned tables.

I feel we should think about 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and first get the review done for the other patches.

-------

I see that RelationGetPartitionDispatchInfo() now returns quite a small subset of what it used to return, which is good. But I feel for expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is still a bit heavy.
We only require the OIDs, so the PartitionedTableInfo data structure (including the pd->indexes array) gets discarded. Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind() for each child.

In expand_inherited_rtentry(), we anyway have to open all the child tables, so we get the partition descriptors for each of the children for free. So how about, in expand_inherited_rtentry(), we traverse the partition tree using these descriptors similar to how it is traversed in RelationGetPartitionDispatchInfo()? Maybe, to avoid code duplication for traversing, we can have a common API.

Still looking at RelationGetPartitionDispatchInfo() changes ...

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Aug 16, 2017 at 11:06 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>> This patch series is blocking a bunch of other things, so it would be
>> nice if you could press forward with this quickly.
>
> Attached updated patches.
>

Review for 0001. The attached patch has some minor changes to the comments and code.

+ * All the relations in the partition tree (including 'rel') must have been
+ * locked (using at least the AccessShareLock) by the caller.

It would be good if we can Assert this in the function. But I couldn't find a way to check whether the backend holds a lock of required strength. Is there any?

 /*
- * We locked all the partitions above including the leaf partitions.
- * Note that each of the relations in *partitions are eventually
- * closed by the caller.
+ * All the partitions were locked above. Note that the relcache
+ * entries will be closed by ExecEndModifyTable().
 */

I don't see much value in this hunk, so removed it in the attached patch.

+ list_free(all_parts);

ExecSetupPartitionTupleRouting() will be called only once per DML statement. Leaking the memory for the duration of the DML may be worth the time spent in traversing the list and freeing each cell independently. So removed the hunk in the attached set.

0002 review

+
+ <row>
+ <entry><structfield>inhchildparted</structfield></entry>
+ <entry><type>bool</type></entry>
+ <entry></entry>
+ <entry>
+ This is <literal>true</> if the child table is a partitioned table,
+ <literal>false</> otherwise
+ </entry>
+ </row>

In the catalogs we are using the full word "partitioned", e.g. pg_partitioned_table. Maybe we should name the column "inhchildpartitioned".

+#define OID_CMP(o1, o2) \
+ ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0));

Instead of duplicating the logic in this macro and oid_cmp(), we may want to call this macro in oid_cmp()? Or simply call oid_cmp() from inhchildinfo_cmp() passing pointers to the OIDs; a pointer indirection would be costly, but still maintainable.
+ if (num_partitioned_children)
+ *num_partitioned_children = my_num_partitioned_children;
+

Instead of this conditional, why not make every caller pass a pointer to an integer? The callers will just ignore the value if they don't want it. If we do this change, we can get rid of the my_num_partitioned_children variable and directly update the passed-in pointer variable.

 inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
- if (numoids >= maxoids)
+ is_partitioned = ((Form_pg_inherits)
+ GETSTRUCT(inheritsTuple))->inhchildparted;

Now that we are fetching two members from the Form_pg_inherits structure, maybe we should use a local variable

Form_pg_inherits inherits_tuple = GETSTRUCT(inheritsTuple);

and use that to fetch its members.

I am still reviewing changes in this patch.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
Hi Amit,

Thanks for the comments.

On 2017/08/16 20:30, Amit Khandekar wrote:
> On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>> Attached updated patches.
>
> Thanks Amit for the patches.
>
> I too agree with the overall approach taken for keeping the locking
> order consistent: it's best to do the locking with the existing
> find_all_inheritors(), since it is much cheaper than locking them in
> partition-bound order, the latter being expensive since it requires
> opening partitioned tables.

Yeah. Per Robert's design idea, we will always do the *locking* in the order determined by find_all_inheritors/find_inheritance_children.

>> I haven't yet done anything about changing the timing of opening and
>> locking leaf partitions, because it will require some more thinking about
>> the required planner changes. But the above set of patches will get us
>> far enough to get leaf partition sub-plans appear in the partition bound
>> order (same order as what partition tuple-routing uses in the executor).
>
> So, I believe none of the changes done in pg_inherits.c are essential
> for expanding the inheritance tree in bound order, right?

Right. The changes to pg_inherits.c are only about recognizing partitioned tables in an inheritance hierarchy and putting them ahead in the returned list. Now that I think of it, the title of the patch (teach pg_inherits.c about "partitioning") sounds a bit confusing. In particular, the patch does not teach it things like partition bound order, just that some tables in the hierarchy could be partitioned tables.

> I am not
> sure whether we are planning to commit these two things together or
> incrementally:
> 1. Expand in bound order
> 2. Allow for locking only the partitioned tables first.
>
> For #1, the changes in pg_inherits.c are not required (viz, keeping
> partitioned tables at the head of the list, adding the inhchildparted
> column, etc).

Yes.
Changes to pg_inherits.c can be committed completely independently of either 1 or 2, although then there would be nobody using that capability. About 2: I think the capability to lock only the partitioned tables in expand_inherited_rtentry() will only be used once we have the patch to do the necessary planner restructuring that will allow us to defer child table locking to some place that is not expand_inherited_rtentry(). I am working on that patch now and should be able to show something soon. > If we are going to do #2 together with #1, then in the patch set there > is no one using the capability introduced by #2. That is, there are no > callers of find_all_inheritors() that make use of the new > num_partitioned_children parameter. Also, there is no boolean > parameter for find_all_inheritors() to be used to lock only the > partitioned tables. > > I feel we should think about > 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and > first get the review done for the other patches. I think that makes sense. > I see that RelationGetPartitionDispatchInfo() now returns quite a > small subset of what it used to return, which is good. But I feel for > expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is > still a bit heavy. We only require the oids, so the > PartitionedTableInfo data structure (including the pd->indexes array) > gets discarded. Maybe we could make the output argument optional, but I don't see much point in being too conservative here. Building the indexes array does not cost us that much and if a not-too-distant-in-future patch could use that information somehow, it could do so for free. > Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind() > for each child. In expand_inherited_rtentry(), we anyway have to open > all the child tables, so we get the partition descriptors for each of > the children for free. 
So how about, in expand_inherited_rtentry(), we > traverse the partition tree using these descriptors similar to how it > is traversed in RelationGetPartitionDispatchInfo() ? May be to avoid > code duplication for traversing, we can have a common API. As mentioned, one goal I'm seeking is to avoid having to open the child tables in expand_inherited_rtentry(). Thanks, Amit
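The canonical order discussed in this thread — top-down, breadth-first, left-to-right over the partition tree, with each level's children already in partition bound order — can be sketched as follows. This is a simplified illustration, not the actual PostgreSQL data structures (PartNode and collect_partition_oids() are hypothetical names; in the real code the bound-ordered children come from each table's PartitionDesc):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical, simplified partition-tree node: the children array is
 * assumed to already be in partition bound order, as in a PartitionDesc. */
typedef struct PartNode
{
    Oid              oid;
    struct PartNode *children;   /* bound-ordered array of children */
    int              nchildren;  /* 0 for a leaf partition */
} PartNode;

/* Collect OIDs top-down, breadth-first, left-to-right.  Returns the
 * number of OIDs written to out[]. */
static int
collect_partition_oids(PartNode *root, Oid *out, int maxoids)
{
    PartNode   *queue[64];       /* fixed-size queue, fine for a sketch */
    int         head = 0,
                tail = 0,
                n = 0;

    queue[tail++] = root;
    while (head < tail)
    {
        PartNode   *node = queue[head++];
        int         i;

        assert(n < maxoids);
        out[n++] = node->oid;
        /* enqueue children left-to-right, i.e. in bound order */
        for (i = 0; i < node->nchildren; i++)
            queue[tail++] = &node->children[i];
    }
    return n;
}
```

For a root (OID 100) with a partitioned child (200, having leaves 201 and 202) and a leaf child (300), this yields 100, 200, 300, 201, 202 — parents before children, siblings in bound order.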
Hi Ashutosh, Thanks for the review and the updated patch. On 2017/08/16 21:48, Ashutosh Bapat wrote: > On Wed, Aug 16, 2017 at 11:06 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> >>> This patch series is blocking a bunch of other things, so it would be >>> nice if you could press forward with this quickly. >> >> Attached updated patches. >> > > Review for 0001. The attached patch has some minor changes to the > comments and code. > > + * All the relations in the partition tree (including 'rel') must have been > + * locked (using at least the AccessShareLock) by the caller. > > It would be good if we can Assert this in the function. But I couldn't find a > way to check whether the backend holds a lock of required strength. Is there > any? Currently there isn't. Robert suggested a RelationLockHeldByMe(Oid) [1], which is still a TODO on my plate. > /* > - * We locked all the partitions above including the leaf partitions. > - * Note that each of the relations in *partitions are eventually > - * closed by the caller. > + * All the partitions were locked above. Note that the relcache > + * entries will be closed by ExecEndModifyTable(). > */ > I don't see much value in this hunk, I thought the new text was a bit clearer, but maybe that's just me. Will remove. > + list_free(all_parts); > ExecSetupPartitionTupleRouting() will be called only once per DML statement. > Leaking the memory for the duration of DML may be worth the time spent > in the traversing > the list and freeing each cell independently. Fair enough, will remove the list_free(). > 0002 review > + > + <row> > + <entry><structfield>inhchildparted</structfield></entry> > + <entry><type>bool</type></entry> > + <entry></entry> > + <entry> > + This is <literal>true</> if the child table is a partitioned table, > + <literal>false</> otherwise > + </entry> > + </row> > In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. 
May > be we should name the column as "inhchildpartitioned". Sure. > +#define OID_CMP(o1, o2) \ > + ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0)); > Instead of duplicating the logic in this macro and oid_cmp(), we may want to > call this macro in oid_cmp()? Or simply call oid_cmp() from inhchildinfo_cmp() > passing pointers to the OIDs; a pointer indirection would be costly, but still > maintainable. Actually, I avoided using oid_cmp exactly for that reason. > + if (num_partitioned_children) > + *num_partitioned_children = my_num_partitioned_children; > + > Instead of this conditional, why not to make every caller pass a pointer to > integer. The callers will just ignore the value if they don't want it. If we do > this change, we can get rid of my_num_partitioned_children variable and > directly update the passed in pointer variable. There are a bunch of callers of find_all_inheritors() and find_inheritance_children. Changes to make them all declare a pointless variable seemed off to me. The conditional in question doesn't seem to be that expensive. (To be fair, the one introduced in find_all_inheritors() kind of is as implemented by the patch, because it's executed for every iteration of the foreach(l, rels_list) loop, which I will fix.) > > inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid; > - if (numoids >= maxoids) > + is_partitioned = ((Form_pg_inherits) > + GETSTRUCT(inheritsTuple))->inhchildparted; > Now that we are fetching two members from Form_pg_inherits structure, may be we > should use a local variable > Form_pg_inherits inherits_tuple = GETSTRUCT(inheritsTuple); > and use that to fetch its members. Sure, will do. > I am still reviewing changes in this patch. Okay, will wait for more comments before sending the updated patches. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmobwbh12OJerqAGyPEjb_%2B2y7T0nqRKTcjed6L4NTET6Fg%40mail.gmail.com
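To make the OID_CMP discussion above concrete, here is a sketch of the macro from the quoted hunk alongside an oid_cmp()-style qsort comparator built on it (the trailing semicolon in the quoted hunk is dropped so the macro can be used inside expressions; oid_cmp_ptr is a hypothetical name for illustration):

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int Oid;

/* Three-way comparison on OID values, as in the patch hunk quoted above. */
#define OID_CMP(o1, o2) \
    ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0))

/* An oid_cmp()-style comparator: it takes pointers, which (as the review
 * notes) costs a pointer indirection but matches qsort()'s signature. */
static int
oid_cmp_ptr(const void *p1, const void *p2)
{
    Oid         o1 = *(const Oid *) p1;
    Oid         o2 = *(const Oid *) p2;

    return OID_CMP(o1, o2);
}
```

Calling the macro directly on two OID values avoids the indirection, which is the reason given above for not reusing the pointer-based comparator.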
On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May >> be we should name the column as "inhchildpartitioned". > > Sure. I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long. > There are a bunch of callers of find_all_inheritors() and > find_inheritance_children. Changes to make them all declare a pointless > variable seemed off to me. The conditional in question doesn't seem to be > that expensive. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/08/17 11:22, Robert Haas wrote: > On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May >>> be we should name the column as "inhchildpartitioned". >> >> Sure. > > I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long. inhchildpartitioned indeed seems long. Since we are storing whether the child table (the one with OID inhrelid) is partitioned, inhpartitioned seems best to me. Will implement that. Thanks, Amit
On Thu, Aug 17, 2017 at 8:06 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/08/17 11:22, Robert Haas wrote: >> On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>>> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May >>>> be we should name the column as "inhchildpartitioned". >>> >>> Sure. >> >> I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long. > > inhchildpartitioned indeed seems long. > > Since we storing if the child table (one with the OID inhrelid) is > partitioned, inhpartitioned seems best to me. Will implement that. inhchildpartitioned is long but clearly indicates that the child table is partitioned, not the parent. pg_inherits can have parents which are not partitioned, so it's better to have a self-explanatory column name. I am fine with some other name as long as it's clear. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/08/17 13:56, Ashutosh Bapat wrote: > On Thu, Aug 17, 2017 at 8:06 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/08/17 11:22, Robert Haas wrote: >>> On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote >>> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>>>> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May >>>>> be we should name the column as "inhchildpartitioned". >>>> >>>> Sure. >>> >>> I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long. >> >> inhchildpartitioned indeed seems long. >> >> Since we storing if the child table (one with the OID inhrelid) is >> partitioned, inhpartitioned seems best to me. Will implement that. > > inhchildpartitioned is long but clearly tells that the child table is > partitioned, not the parent. pg_inherit can have parents which are not > partitioned, so it's better to have self-explanatory catalog name. I > am fine with some other name as long as it's clear. OTOH, the pg_inherits field that stores the OID of the child table does not mention "child" in its name (inhrelid), although you are right that inhpartitioned can be taken to mean that the inheritance parent (inhparent) is partitioned. In any case, system catalog documentation which clearly states what's what might be the best guide for the confused. Of course, we can add a comment in pg_inherits.h next to the field explaining what it is for those reading the source code and confused about what inhpartitioned means. Thanks, Amit
On Thu, Aug 17, 2017 at 10:54 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/08/17 13:56, Ashutosh Bapat wrote: >> On Thu, Aug 17, 2017 at 8:06 AM, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> On 2017/08/17 11:22, Robert Haas wrote: >>>> On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote >>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>>>>> In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May >>>>>> be we should name the column as "inhchildpartitioned". >>>>> >>>>> Sure. >>>> >>>> I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long. >>> >>> inhchildpartitioned indeed seems long. >>> >>> Since we storing if the child table (one with the OID inhrelid) is >>> partitioned, inhpartitioned seems best to me. Will implement that. >> >> inhchildpartitioned is long but clearly tells that the child table is >> partitioned, not the parent. pg_inherit can have parents which are not >> partitioned, so it's better to have self-explanatory catalog name. I >> am fine with some other name as long as it's clear. > > OTOH, the pg_inherits field that stores the OID of the child table does > not mention "child" in its name (inhrelid), although you are right that > inhpartitioned can be taken to mean that the inheritance parent > (inhparent) is partitioned. In any case, system catalog documentation > which clearly states what's what might be the best guide for the confused. > Sorry, I overlooked this detail. To me it means that the table is driven by the child and inhpartitioned looks good then. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/08/17 10:09, Amit Langote wrote: > On 2017/08/16 20:30, Amit Khandekar wrote: >> On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I am not >> sure whether we are planning to commit these two things together or >> incrementally : >> 1. Expand in bound order >> 2. Allow for locking only the partitioned tables first. >> >> For #1, the changes in pg_inherits.c are not required (viz, keeping >> partitioned tables at the head of the list, adding inhchildparted >> column, etc). > > Yes. Changes to pg_inherits.c can be committed completely independently > of either 1 or 2, although then there would be nobody using that capability. > > About 2: I think the capability to lock only the partitioned tables in > expand_inherited_rtentry() will only be used once we have the patch to do > the necessary planner restructuring that will allow us to defer child > table locking to some place that is not expand_inherited_rtentry(). I am > working on that patch now and should be able to show something soon. > >> If we are going to do #2 together with #1, then in the patch set there >> is no one using the capability introduced by #2. That is, there are no >> callers of find_all_inheritors() that make use of the new >> num_partitioned_children parameter. Also, there is no boolean >> parameter for find_all_inheritors() to be used to lock only the >> partitioned tables. >> >> I feel we should think about >> 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and >> first get the review done for the other patches. > > I think that makes sense. After thinking some more on this, I think Amit's suggestion to put this patch at the end of the priority list is good (that is, the patch that teaches pg_inherits infrastructure to list partitioned tables ahead in the list.) Its purpose is mainly to fulfill the requirement that partitioned tables be able to be locked ahead of any leaf partitions in the list (per the design idea Robert suggested [1]). 
The patch which requires that capability is not in the picture yet. Perhaps we could review and commit this patch only when that future patch shows up. So I will hold that patch for now. Thoughts? Attached are the rest of the patches. 0001 has changes per Ashutosh's review comments [2]:

0001: Relieve RelationGetPartitionDispatchInfo() of doing any locking
0002: Teach expand_inherited_rtentry to use partition bound order
0003: Decouple RelationGetPartitionDispatchInfo() from executor

Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmobwbh12OJerqAGyPEjb_%2B2y7T0nqRKTcjed6L4NTET6Fg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFjFpRdXn7w7nVKv4qN9fa%2BtzRi_mJFNCsBWy%3Dbd0SLbYczUfA%40mail.gmail.com
On 17 August 2017 at 06:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Amit, > > Thanks for the comments. > > On 2017/08/16 20:30, Amit Khandekar wrote: >> On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >>> Attached updated patches. >> >> Thanks Amit for the patches. >> >> I too agree with the overall approach taken for keeping the locking >> order consistent: it's best to do the locking with the existing >> find_all_inheritors() since it is much cheaper than to lock them in >> partition-bound order, the later being expensive since it requires >> opening partitioned tables. > > Yeah. Per the Robert's design idea, we will always do the *locking* in > the order determined by find_all_inheritors/find_inheritance_children. > >>> I haven't yet done anything about changing the timing of opening and >>> locking leaf partitions, because it will require some more thinking about >>> the required planner changes. But the above set of patches will get us >>> far enough to get leaf partition sub-plans appear in the partition bound >>> order (same order as what partition tuple-routing uses in the executor). >> >> So, I believe none of the changes done in pg_inherits.c are essential >> for expanding the inheritence tree in bound order, right ? > > Right. > > The changes to pg_inherits.c are only about recognizing partitioned tables > in an inheritance hierarchy and putting them ahead in the returned list. > Now that I think of it, the title of the patch (teach pg_inherits.c about > "partitioning") sounds a bit confusing. In particular, the patch does not > teach it things like partition bound order, just that some tables in the > hierarchy could be partitioned tables. > >> I am not >> sure whether we are planning to commit these two things together or >> incrementally : >> 1. Expand in bound order >> 2. Allow for locking only the partitioned tables first. 
>> >> For #1, the changes in pg_inherits.c are not required (viz, keeping >> partitioned tables at the head of the list, adding inhchildparted >> column, etc). > > Yes. Changes to pg_inherits.c can be committed completely independently > of either 1 or 2, although then there would be nobody using that capability. > > About 2: I think the capability to lock only the partitioned tables in > expand_inherited_rtentry() will only be used once we have the patch to do > the necessary planner restructuring that will allow us to defer child > table locking to some place that is not expand_inherited_rtentry(). I am > working on that patch now and should be able to show something soon. > >> If we are going to do #2 together with #1, then in the patch set there >> is no one using the capability introduced by #2. That is, there are no >> callers of find_all_inheritors() that make use of the new >> num_partitioned_children parameter. Also, there is no boolean >> parameter for find_all_inheritors() to be used to lock only the >> partitioned tables. >> >> I feel we should think about >> 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and >> first get the review done for the other patches. > > I think that makes sense. > >> I see that RelationGetPartitionDispatchInfo() now returns quite a >> small subset of what it used to return, which is good. But I feel for >> expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is >> still a bit heavy. We only require the oids, so the >> PartitionedTableInfo data structure (including the pd->indexes array) >> gets discarded. > > Maybe we could make the output argument optional, but I don't see much > point in being too conservative here. Building the indexes array does not > cost us that much and if a not-too-distant-in-future patch could use that > information somehow, it could do so for free. Ok, so these changes are mostly kept keeping some future use-cases in mind.
Otherwise, I was thinking we could just keep a light-weight function to generate the oids, and keep the current RelationGetPartitionDispatchInfo() intact. Anyway, some more comments:

In ExecSetupPartitionTupleRouting(), I am not sure why the ptrinfos array is an array of pointers. Why can't it be an array of PartitionTupleRoutingInfo structures rather than pointers to those structures?

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
+ * Close all the leaf partitions and their indices.

The above comment needs to be shifted a bit down to the subsequent "for" loop where it's actually applicable.

* node->mt_partition_dispatch_info[0] corresponds to the root partitioned
* table, for which we didn't create tupslot.

Above: node->mt_partition_dispatch_info[0] => node->mt_ptrinfos[0]

/*
 * XXX - do we need a pinning mechanism for partition descriptors
 * so that their references can be managed independently of
 * the parent relcache entry? Like PinPartitionDesc(partdesc)?
 */
pd->partdesc = partdesc;

Any idea whether the above can be handled? I am not too sure. > > Thanks, > Amit > -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Thu, Aug 17, 2017 at 12:59 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > Attached rest of the patches. 0001 has changes per Ashutosh's review > comments [2]: > > 0001: Relieve RelationGetPartitionDispatchInfo() of doing any locking [2] had a patch with some changes to the original patch you posted. I didn't describe those changes in my mail, since they rearranged the comments. Those changes are not part of this patch, and you haven't commented on those changes either. If you have intentionally excluded those changes, it's fine. In case you haven't reviewed them, please see if they are good to be incorporated. > > 0002: Teach expand_inherited_rtentry to use partition bound order 0004 in [1] expands a multi-level partition hierarchy into a similar inheritance hierarchy. That patch doesn't need all OIDs in one go. It will have to handle the partition hierarchy level by level, so most of the code added by this patch will need to be changed by that patch. Is there a way we can somehow change this patch so that the delta in 0004 is reduced? That may need rethinking how RelationGetPartitionDispatchInfo() is used. [1] https://www.postgresql.org/message-id/CAFjFpRfkr7igCGBBWH1PQ__W-XPy1O79Y-qxCmJc6FizLqFz7Q@mail.gmail.com [2] https://www.postgresql.org/message-id/CAFjFpRdXn7w7nVKv4qN9fa%2BtzRi_mJFNCsBWy%3Dbd0SLbYczUfA%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > [2] had a patch with some changes to the original patch you posted. I > didn't describe those changes in my mail, since they rearranged the > comments. Those changes are not part of this patch and you haven't > comments about those changes as well. If you have intentionally > excluded those changes, it's fine. In case, you haven't reviewed them, > please see if they are good to be incorporated. I took a quick look at your version but I think I like Amit's fine the way it is, so committed that and back-patched it to v10. I find 0002 pretty ugly as things stand. We get a bunch of tuple maps that we don't really need, only to turn around and free them. We get a bunch of tuple slots that we don't need, only to turn around and drop them. We don't really need the PartitionDispatch objects either, except for the OIDs they contain. There's a lot of extra stuff being computed here that is really irrelevant for this purpose. I think we should try to clean that up somehow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi Ashutosh, On 2017/08/17 21:39, Ashutosh Bapat wrote: > On Thu, Aug 17, 2017 at 12:59 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >> Attached rest of the patches. 0001 has changes per Ashutosh's review >> comments [2]: >> >> 0001: Relieve RelationGetPartitionDispatchInfo() of doing any locking > > [2] had a patch with some changes to the original patch you posted. I > didn't describe those changes in my mail, since they rearranged the > comments. Those changes are not part of this patch and you haven't > comments about those changes as well. If you have intentionally > excluded those changes, it's fine. In case, you haven't reviewed them, > please see if they are good to be incorporated. Sorry, I thought the ones you mentioned in the email were the only changes you made to the original patch. I noted only those and included them when editing the relevant commit in my local repository in an interactive rebase session. I didn't actually take your patch and try to merge it with the commit in my local repository. IMHO, simply commenting in the email which parts of the patch you would like to see changed would be helpful. Then we can discuss those changes and proceed with them (or not) per the result of that discussion. >> 0002: Teach expand_inherited_rtentry to use partition bound order > > 0004 in [1] expands a multi-level partition hierarchy into similar > inheritance hierarchy. That patch doesn't need all OIDs in one go. It > will have to handle the partition hierarchy level by level, so most of > the code added by this patch will need to be changed by that patch. Is > there a way we can somehow change this patch so that the delta in 0004 > is reduced? That may need rethinking about using > RelationGetPartitionDispatchInfo(). Regarding that, I have a question: Does the multi-level partition-wise join planning logic depend on the inheritance itself being expanded in a multi-level-aware manner?
That is, expanding the partitioned table inheritance in a multi-level-aware manner in expand_inherited_rtentry()? Wouldn't it suffice to just have the resulting Append paths be nested per multi-level partitioning hierarchy? Creating such nested Append paths doesn't necessarily require that the inheritance be expanded that way in the first place (as I am finding out when working on another patch). Thanks, Amit
On Fri, Aug 18, 2017 at 10:12 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote > >>> 0002: Teach expand_inherited_rtentry to use partition bound order >> >> 0004 in [1] expands a multi-level partition hierarchy into similar >> inheritance hierarchy. That patch doesn't need all OIDs in one go. It >> will have to handle the partition hierarchy level by level, so most of >> the code added by this patch will need to be changed by that patch. Is >> there a way we can somehow change this patch so that the delta in 0004 >> is reduced? That may need rethinking about using >> RelationGetPartitionDispatchInfo(). > > Regarding that, I have a question: > > Does the multi-level partition-wise join planning logic depend on the > inheritance itself to be expanded in a multi-level aware manner. That is, > expanding the partitioned table inheritance in multi-level aware manner in > expan_inherited_rtentry()? Yes, it needs AppendRelInfos to retain the parent-child relationship. Please refer [1], [2], [3] for details. > > Wouldn't it suffice to just have the resulting Append paths be nested per > multi-level partitioning hierarchy? We are joining RelOptInfos, so those need to be nested. For those to be nested, we need AppendRelInfos to preserve parent-child relationship. Nesting paths doesn't help. Append paths actually should be flattened out to avoid any extra time consumed in nested Append node. [1] https://www.postgresql.org/message-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV%3DkB8Akkqg@mail.gmail.com [2] https://www.postgresql.org/message-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT=uQ@mail.gmail.com [3] https://www.postgresql.org/message-id/CAFjFpRd6Kzx6Xn%3D7vdwwnh6rEw2VEgo--iPdhV%2BFb7bHfPzsbw%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 18 August 2017 at 01:24, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> [2] had a patch with some changes to the original patch you posted. I >> didn't describe those changes in my mail, since they rearranged the >> comments. Those changes are not part of this patch and you haven't >> comments about those changes as well. If you have intentionally >> excluded those changes, it's fine. In case, you haven't reviewed them, >> please see if they are good to be incorporated. > > I took a quick look at your version but I think I like Amit's fine the > way it is, so committed that and back-patched it to v10. > > I find 0002 pretty ugly as things stand. We get a bunch of tuple maps > that we don't really need, only to turn around and free them. We get > a bunch of tuple slots that we don't need, only to turn around and > drop them. I think in the final changes after applying all 3 patches, the redundant tuple slot is no longer present. But ... > We don't really need the PartitionDispatch objects either, > except for the OIDs they contain. There's a lot of extra stuff being > computed here that is really irrelevant for this purpose. I think we > should try to clean that up somehow. ... I am of the same opinion. That's why - as I mentioned upthread - I was thinking why not have a separate light-weight function to just generate oids, and keep RelationGetPartitionDispatchInfo() intact, to be used only for tuple routing. But I haven't yet checked Ashutosh's requirements, which suggest that even getting just the oid list does not help. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Fri, Aug 18, 2017 at 10:32 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > I think in the final changes after applying all 3 patches, the > redundant tuple slot is no longer present. But ... >> We don't really need the PartitionDispatch objects either, >> except for the OIDs they contain. There's a lot of extra stuff being >> computed here that is really irrelevant for this purpose. I think we >> should try to clean that up somehow. > ... I am of the same opinion. That's why - as I mentioned upthread - I > was thinking why not have a separate light-weight function to just > generate oids, and keep RelationGetPartitionDispatchInfo() intact, to > be used only for tuple routing. > > But, I haven't yet checked Ashuthosh's requirements, which suggest > that it does not help to even get the oid list. > The 0004 patch in the partition-wise join patchset has code to expand the partition hierarchy. That patch expands the inheritance hierarchy in a depth-first manner. Robert commented that it would be better to expand it in a partitioned-tables-first manner instead of depth-first. With the latest changes in your patch set, I don't see the reason for expanding in partitioned-tables-first order. Can you please elaborate on whether we still need to expand in a partitioned-tables-first manner? Maybe we should just address the expansion issue in 0004 instead of dividing it into two patches. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/08/18 4:54, Robert Haas wrote: > On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> [2] had a patch with some changes to the original patch you posted. I >> didn't describe those changes in my mail, since they rearranged the >> comments. Those changes are not part of this patch and you haven't >> comments about those changes as well. If you have intentionally >> excluded those changes, it's fine. In case, you haven't reviewed them, >> please see if they are good to be incorporated. > > I took a quick look at your version but I think I like Amit's fine the > way it is, so committed that and back-patched it to v10. Thanks for committing. > I find 0002 pretty ugly as things stand. We get a bunch of tuple maps > that we don't really need, only to turn around and free them. We get > a bunch of tuple slots that we don't need, only to turn around and > drop them. We don't really need the PartitionDispatch objects either, > except for the OIDs they contain. There's a lot of extra stuff being > computed here that is really irrelevant for this purpose. I think we > should try to clean that up somehow. One way to do that might be to reverse the order of the remaining patches and put the patch to refactor RelationGetPartitionDispatchInfo() first. With that refactoring, PartitionDispatch itself has become much simpler in that it does not contain a relcache reference to be closed eventually by the caller, the tuple map, and the tuple table slot. Since those things are required for tuple-routing, the refactoring makes ExecSetupPartitionTupleRouting itself create them from the (minimal) information returned by RelationGetPartitionDispatchInfo and ultimately destroy when done using it. 
I kept the indexes field in PartitionDispatchData though, because it's essentially free to create while we are walking the partition tree in RelationGetPartitionDispatchInfo() and it seems undesirable to make the caller compute that information (indexes) by traversing the partition tree all over again, if it doesn't otherwise have to. I am still considering some counter-arguments raised by Amit Khandekar about this last assertion. Thoughts? Thanks, Amit
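The slimmed-down dispatch data described above can be sketched roughly as follows. This is an illustration, not the actual PostgreSQL definitions: the field names and the sign convention for indexes[] are assumptions made for the example (the real executor uses a similar scheme, where non-negative entries point into the leaf-partition array and negative entries into the dispatch array):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;

/* Rough sketch of the simplified dispatch data: just the shape of the
 * partition tree, with no relcache reference, tuple conversion map, or
 * tuple table slot for the caller to manage. */
typedef struct PartitionDispatchData
{
    Oid         reloid;     /* OID of this partitioned table */
    int         nparts;     /* number of immediate partitions */
    int        *indexes;    /* one entry per immediate partition */
} PartitionDispatchData;

/*
 * Convention assumed here for indexes[]: a value >= 0 means partition i
 * is a leaf and gives its position in the caller's leaf array; a
 * negative value -(j + 1) means partition i is itself partitioned and
 * lives at position j of the dispatch array.
 */
static bool
partition_is_leaf(const PartitionDispatchData *pd, int i)
{
    return pd->indexes[i] >= 0;
}

static int
subdispatch_index(const PartitionDispatchData *pd, int i)
{
    return -(pd->indexes[i] + 1);
}
```

Under this scheme, a caller that only wants OIDs can walk the dispatch array once, while tuple routing can follow the negative entries to descend the tree.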
On 18 August 2017 at 10:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/08/18 4:54, Robert Haas wrote: >> On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> [2] had a patch with some changes to the original patch you posted. I >>> didn't describe those changes in my mail, since they rearranged the >>> comments. Those changes are not part of this patch and you haven't >>> comments about those changes as well. If you have intentionally >>> excluded those changes, it's fine. In case, you haven't reviewed them, >>> please see if they are good to be incorporated. >> >> I took a quick look at your version but I think I like Amit's fine the >> way it is, so committed that and back-patched it to v10. > > Thanks for committing. > >> I find 0002 pretty ugly as things stand. We get a bunch of tuple maps >> that we don't really need, only to turn around and free them. We get >> a bunch of tuple slots that we don't need, only to turn around and >> drop them. We don't really need the PartitionDispatch objects either, >> except for the OIDs they contain. There's a lot of extra stuff being >> computed here that is really irrelevant for this purpose. I think we >> should try to clean that up somehow. > > One way to do that might be to reverse the order of the remaining patches > and put the patch to refactor RelationGetPartitionDispatchInfo() first. > With that refactoring, PartitionDispatch itself has become much simpler in > that it does not contain a relcache reference to be closed eventually by > the caller, the tuple map, and the tuple table slot. Since those things > are required for tuple-routing, the refactoring makes > ExecSetupPartitionTupleRouting itself create them from the (minimal) > information returned by RelationGetPartitionDispatchInfo and ultimately > destroy when done using it. 
I kept the indexes field in > PartitionDispatchData though, because it's essentially free to create > while we are walking the partition tree in > RelationGetPartitionDispatchInfo() and it seems undesirable to make the > caller compute that information (indexes) by traversing the partition tree > all over again, if it doesn't otherwise have to. I am still considering > some counter-arguments raised by Amit Khandekar about this last assertion. > > Thoughts? Another approach (one that I have used in the update-partition-key patch) is to *not* generate the oids beforehand, and instead call a partition_walker_next() function to traverse the tree. Each next() call would return a ChildInfo that includes the child oid, parent oid, etc. All users of this would be guaranteed a fixed order of oids. In the update-partition-key patch, I am opening and closing each of the children, which, of course, we need to avoid. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
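The iterator-style API floated above could look roughly like this. The names partition_walker_next() and ChildInfo come from the email; the fields and internals here are guesses for illustration only (in particular, this sketch precomputes the bound-ordered child list, whereas a real walker might open each partitioned table lazily):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical per-child record handed back by the walker. */
typedef struct ChildInfo
{
    Oid         childoid;
    Oid         parentoid;
} ChildInfo;

/* Hypothetical walker state: children in a fixed (bound) order, plus a
 * cursor.  Every caller iterating with the walker sees the same order. */
typedef struct PartitionWalker
{
    const ChildInfo *items;
    int              nitems;
    int              pos;
} PartitionWalker;

/* Return the next child in the fixed order, or NULL when the partition
 * tree is exhausted. */
static const ChildInfo *
partition_walker_next(PartitionWalker *w)
{
    if (w->pos >= w->nitems)
        return NULL;
    return &w->items[w->pos++];
}
```

The attraction of this shape is that callers that only need the OIDs never materialize a full list, and the traversal order is fixed by the walker rather than by each caller.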
On Fri, Aug 18, 2017 at 1:17 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > 0004 patch in partition-wise join patchset has code to expand > partition hierarchy. That patch is expanding inheritance hierarchy in > depth first manner. Robert commented that instead of depth first > manner, it will be better if we expand it in partitioned tables first > manner. With the latest changes in your patch-set I don't see the > reason for expanding in partitioned tables first order. Can you please > elaborate if we still need to expand in partitioned table first > manner? May be we should just address the expansion issue in 0004 > instead of dividing it in two patches. Let me see if I can clarify. I think there are three requirements here: A. Amit wants to be able to prune leaf partitions before opening and locking those relations, so that pruning can be done earlier and, therefore, more cheaply. B. Partition-wise join wants to expand the inheritance hierarchy a level at a time instead of all at once, ending up with rte->inh = true entries for intermediate partitioned tables. C. Partition-wise join (and lots of other things; see numerous mentions of EIBO in http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html) want to expand in bound order. Obviously, bound-order and partitioned-tables-first are incompatible orderings, but there's no actual conflict because the first one has to do with the order of *expansion* and the second one has to do with the order of *locking*. So in the end game I think expand_inherited_rtentry looks approximately like this: 1. Calling find_all_inheritors with a new only-lock-the-partitions flag. This should result in locking all partitioned tables in the inheritance hierarchy in breadth-first, low-OID-first order. (When the only-lock-the-partitions isn't specified, all partitioned tables should still be locked before any unpartitioned tables, so that the locking order in that case is consistent with what we do here.) 2. 
Iterate over the partitioned tables identified in step 1 in the order in which they were returned. For each one: - Decide which children can be pruned. - Lock the unpruned, non-partitioned children in low-OID-first order. 3. Make another pass over the inheritance hierarchy, starting at the root. Traverse the whole hierarchy in breadth-first in *bound* order. Add RTEs and AppendRelInfos as we go -- these will have rte->inh = true for partitioned tables and rte->inh = false for leaf partitions. Whether we should try to go straight to the end state here or do this via a series of incremental changes, I'm not entirely sure right now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Aug 19, 2017 at 1:21 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 18, 2017 at 1:17 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> 0004 patch in partition-wise join patchset has code to expand >> partition hierarchy. That patch is expanding inheritance hierarchy in >> depth first manner. Robert commented that instead of depth first >> manner, it will be better if we expand it in partitioned tables first >> manner. With the latest changes in your patch-set I don't see the >> reason for expanding in partitioned tables first order. Can you please >> elaborate if we still need to expand in partitioned table first >> manner? May be we should just address the expansion issue in 0004 >> instead of dividing it in two patches. > > Let me see if I can clarify. I think there are three requirements here: > > A. Amit wants to be able to prune leaf partitions before opening and > locking those relations, so that pruning can be done earlier and, > therefore, more cheaply. We could actually prune partitioned tables thus pruning whole partitioned tree. Do we want to then lock those partitioned tables but not the leaves in that tree? If there's already some discussion answering this question, please point me to the same. Sorry for not paying attention to it. > > B. Partition-wise join wants to expand the inheritance hierarchy a > level at a time instead of all at once, ending up with rte->inh = true > entries for intermediate partitioned tables. And create AppendRelInfos which pair children with their partitioned parent rather than the root. > > C. Partition-wise join (and lots of other things; see numerous > mentions of EIBO in > http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html) > want to expand in bound order. 
> > Obviously, bound-order and partitioned-tables-first are incompatible > orderings, but there's no actual conflict because the first one has to > do with the order of *expansion* and the second one has to do with the > order of *locking*. right. Thanks for making it clear. > So in the end game I think > expand_inherited_rtentry looks approximately like this: > > 1. Calling find_all_inheritors with a new only-lock-the-partitions > flag. This should result in locking all partitioned tables in the > inheritance hierarchy in breadth-first, low-OID-first order. (When > the only-lock-the-partitions isn't specified, all partitioned tables > should still be locked before any unpartitioned tables, so that the > locking order in that case is consistent with what we do here.) > I am confused. When "only-lock-the-partitions" is true, do we expect intermediate partitioned tables to be locked? Why then "only" in the flag? > 2. Iterate over the partitioned tables identified in step 1 in the > order in which they were returned. For each one: > - Decide which children can be pruned. > - Lock the unpruned, non-partitioned children in low-OID-first order. > > 3. Make another pass over the inheritance hierarchy, starting at the > root. Traverse the whole hierarchy in breadth-first in *bound* order. > Add RTEs and AppendRelInfos as we go -- these will have rte->inh = > true for partitioned tables and rte->inh = false for leaf partitions. These two seem to be based on the assumption that we have to lock all the partitioned tables even if they can be pruned. For regular inheritance there is only a single parent, so traversing the list returned by find_all_inheritors suffices. For partitioned hierarchy, we need to know the parent of every child, which is not part of the find_all_inheritors() output list. Even if it returns only the partitioned children, they themselves may have a parent different from the root partition. 
So, we have to discard the output of find_all_inheritors() for a partitioned hierarchy and traverse the children in the order they appear in the oids array in PartitionDesc. Maybe it's better to separate the guts of expand_inherited_rtentry(), which create the AppendRelInfos, RTEs and rowmarks for the children, into a separate routine. That routine could then be used by two different functions, expand_inherited_rtentry() and expand_partitioned_rtentry(), for regular inheritance and partitioned inheritance respectively. The two functions would use the traversal method appropriate for their kind of children. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Hi Amit, On 2017/08/17 21:18, Amit Khandekar wrote: > Anyways, some more comments : > > In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an > array of pointers. Why can't it be an array of > PartitionTupleRoutingInfo structure rather than pointer to that > structure ? AFAIK, assigning pointers is less expensive than assigning a struct, and we end up doing a lot of assigning of the members of that array to a local variable in get_partition_for_tuple(), for example. Perhaps we could avoid those assignments and implement it the way you suggest. > diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c > + * Close all the leaf partitions and their indices. > * > Above comment needs to be shifted a bit down to the subsequent "for" > loop where it's actually applicable. That's right, done. > * node->mt_partition_dispatch_info[0] corresponds to the root partitioned > * table, for which we didn't create tupslot. > Above : node->mt_partition_dispatch_info[0] => node->mt_ptrinfos[0] Oops, fixed. > /* > * XXX- do we need a pinning mechanism for partition descriptors > * so that their references can be managed independently of > * the parent relcache entry? Like PinPartitionDesc(partdesc)? > */ > pd->partdesc = partdesc; > > Any idea if the above can be handled? I am not too sure. A similar mechanism exists for TupleDesc ref-counting (see the usage of PinTupleDesc and ReleaseTupleDesc across the backend code.) I too am currently unsure if such an elaborate mechanism is actually *necessary* for rd_partdesc. Attached updated patches. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/08/21 13:11, Ashutosh Bapat wrote: > On Sat, Aug 19, 2017 at 1:21 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Aug 18, 2017 at 1:17 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> 0004 patch in partition-wise join patchset has code to expand >>> partition hierarchy. That patch is expanding inheritance hierarchy in >>> depth first manner. Robert commented that instead of depth first >>> manner, it will be better if we expand it in partitioned tables first >>> manner. With the latest changes in your patch-set I don't see the >>> reason for expanding in partitioned tables first order. Can you please >>> elaborate if we still need to expand in partitioned table first >>> manner? May be we should just address the expansion issue in 0004 >>> instead of dividing it in two patches. >> >> Let me see if I can clarify. I think there are three requirements here: >> >> A. Amit wants to be able to prune leaf partitions before opening and >> locking those relations, so that pruning can be done earlier and, >> therefore, more cheaply. > > We could actually prune partitioned tables thus pruning whole > partitioned tree. Do we want to then lock those partitioned tables but > not the leaves in that tree? I think it would be nice if we keep the current approach of expanding the whole partition tree in expand_inherited_rtentry(), at least to know how many more entries a given partitioned table will add to the query's range table. It would be nice, because that way, we don't have to worry *right away* about modifying the planner to cope with some new behavior whereby range table entries will get added at some later point. Then, as you might already know, if we want to use the partition bound order when expanding the whole partition tree, we will depend on the relcache entries of the partitioned tables in that tree, which will require us to take locks on them. 
It does sound odd that we may end up locking a child *partitioned* table that is potentially prune-able, but maybe there is some way to relinquish that lock once we find out that it is pruned after all. >> B. Partition-wise join wants to expand the inheritance hierarchy a >> level at a time instead of all at once, ending up with rte->inh = true >> entries for intermediate partitioned tables. > > And create AppendRelInfos which pair children with their partitioned > parent rather than the root. There should be *some* way to preserve the parent-child RT index mapping and to preserve the multi-level hierarchy, a way that doesn't map all the child tables in a partition tree to the root table's RT index. AppendRelInfo is one way of doing that mapping currently, but if we continue to treat it as the only way (for the purpose of mapping), we will be stuck with the way they are created and manipulated. Especially, if we are going to always depend on the fact that root->append_rel_list contains all the required AppendRelInfos, then we will always have to fully expand the inheritance in expand_inherited_rtentry() (by fully I mean, locking and opening all the child tables, instead of just the partitioned tables). In a world where we don't want to open the partition child tables in expand_inherited_rtentry(), we cannot build the corresponding AppendRelInfos there. Note that this is not about completely dispelling AppendRelInfos-for-partition-child-tables, but about doing without them being present in root->append_rel_list. We would still need them to be able to use adjust_appendrel_attrs(), etc., but we can create them at a different time and store them in a place that's not root->append_rel_list; for example, inside the RelOptInfo of the child table. Or perhaps, we can still add them to root->append_rel_list, but will need to be careful about the places that depend on the timing of AppendRelInfos being present there.
>> So in the end game I think >> expand_inherited_rtentry looks approximately like this: >> >> 1. Calling find_all_inheritors with a new only-lock-the-partitions >> flag. This should result in locking all partitioned tables in the >> inheritance hierarchy in breadth-first, low-OID-first order. (When >> the only-lock-the-partitions isn't specified, all partitioned tables >> should still be locked before any unpartitioned tables, so that the >> locking order in that case is consistent with what we do here.) > > I am confused. When "only-lock-the-partitions" is true, do we expect > intermediate partitioned tables to be locked? Why then "only" in the > flag? I guess Robert meant to say lock-only-"partitioned"-tables? >> 2. Iterate over the partitioned tables identified in step 1 in the >> order in which they were returned. For each one: >> - Decide which children can be pruned. >> - Lock the unpruned, non-partitioned children in low-OID-first order. >> >> 3. Make another pass over the inheritance hierarchy, starting at the >> root. Traverse the whole hierarchy in breadth-first in *bound* order. >> Add RTEs and AppendRelInfos as we go -- these will have rte->inh = >> true for partitioned tables and rte->inh = false for leaf partitions. > > These two seem to be based on the assumption that we have to lock all > the partitioned tables even if they can be pruned. > > For regular inheritance there is only a single parent, so traversing > the list returned by find_all_inheritors suffices. For partitioned > hierarchy, we need to know the parent of every child, which is not > part of the find_all_inheritors() output list. Even if it returns only > the partitioned children, they themselves may have a parent different > from the root partition. So, we have to discard the output of > find_all_inheritors() for partitioned hierarchy and traverse the > children as per their orders in oids array in PartitionDesc. 
May be > it's better to separate the guts of expand_inherited_rtentry(), which > create AppendRelInfos, RTEs and rowmarks for the children into a > separate routine. Use that routine in two different functions > expand_inherited_rtentry() and expand_partitioned_rtentry() for > regular inheritance and partitioned inheritance resp. The functions > will use two different traversal methods appropriate for traversing > the children in either case. I just posted a patch [1] that implements something like this, though the implementation details may look different. It doesn't, however, solve the problem you pose: that prune-able partitioned child tables still get locked. Thanks, Amit [1] https://www.postgresql.org/message-id/098b9c71-1915-1a2a-8d52-1a7a50ce79e8%40lab.ntt.co.jp
On Mon, Aug 21, 2017 at 2:10 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > [ new patches ] I am failing to understand the point of separating PartitionDispatch into PartitionDispatch and PartitionTableInfo. That seems like an unnecessary multiplication of entities, as does the introduction of PartitionKeyInfo. I also think that replacing reldesc with reloid is not really an improvement; any place that gets the relid has to go open the relation to get the reldesc, whereas without that it has a direct pointer to the information it needs. I suggest that this patch just focus on removing the following things from PartitionDispatchData: keystate, tupslot, tupmap. Those things are clearly executor-specific stuff that makes sense to move to a different structure, what you're calling PartitionTupleRoutingInfo (not sure that's the best name). The other stuff all seems fine. You're going to have to open the relation anyway, so keeping the reldesc around seems like an optimization, if anything. The PartitionKey and PartitionDesc pointers may not really be needed -- they're just pointers into reldesc -- but they're trivial to compute, so it doesn't hurt anything to have them either as a micro-optimization for performance or even just for readability. That just leaves indexes. In a world where keystate, tupslot, and tupmap are removed from the PartitionDispatchData, you must need indexes or there would be no point in constructing a PartitionDispatchData object in the first place; any application that needs neither indexes nor the executor-specific stuff could just use the Relation directly. Regarding your XXX comments, note that if you've got a lock on a relation, the pointers to the PartitionKey and PartitionDesc are stable. The PartitionKey can't change once it's established, and the PartitionDesc can't change while we've got a lock on the relation unless we change it ourselves (and any places that do should have CheckTableNotInUse checks).
The keep_partkey and keep_partdesc handling in relcache.c exists exactly so that we can guarantee that the pointer won't go stale under us. Now, if we *don't* have a lock on the relation, then those pointers can easily be invalidated -- so you can't hang onto a PartitionDispatch for longer than you hang onto the lock on the Relation. But that shouldn't be a problem. I think you only need to hang onto PartitionDispatch pointers for the lifetime of a single query. One can imagine optimizations where we try to avoid rebuilding that for subsequent queries but I'm not sure there's any demonstrated need for such a system at present. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/08/26 3:28, Robert Haas wrote: > On Mon, Aug 21, 2017 at 2:10 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> [ new patches ] > > I am failing to understand the point of separating PartitionDispatch > into PartitionDispatch and PartitionTableInfo. That seems like an > unnecessary multiplication of entities, as does the introduction of > PartitionKeyInfo. I also think that replacing reldesc with reloid is > not really an improvement; any places that gets the relid has to go > open the relation to get the reldesc, whereas without that it has a > direct pointer to the information it needs. I am worried about the open relcache reference in PartitionDispatch when we start using it in the planner. Whereas ExecEndModifyTable() is a suitable place to close that reference in the executor, there doesn't seem to be an equivalent place within the planner, but I guess we will have to figure something out. For the time being, the second patch closes the reference in expand_inherited_rtentry() right after picking up the OID using RelationGetRelid(pd->reldesc). > I suggest that this patch just focus on removing the following things > from PartitionDispatchData: keystate, tupslot, tupmap. Those things > are clearly executor-specific stuff that makes sense to move to a > different structure, what you're calling PartitionTupleRoutingInfo > (not sure that's the best name). The other stuff all seems fine. > You're going to have to open the relation anyway, so keeping the > reldesc around seems like an optimization, if anything. The > PartitionKey and PartitionDesc pointers may not really be needed -- > they're just pointers into reldesc -- but they're trivial to compute, > so it doesn't hurt anything to have them either as a > micro-optimization for performance or even just for readability. OK, done this way in the attached updated patch. Any suggestions about a better name for what the patch calls PartitionTupleRoutingInfo? > That just leaves indexes.
In a world where keystate, tupslot, and > tupmap are removed from the PartitionDispatchData, you must need > indexes or there would be no point in constructing a > PartitionDispatchData object in the first place; any application that > needs neither indexes nor the executor-specific stuff could just use > the Relation directly. Agreed. > Regarding your XXX comments, note that if you've got a lock on a > relation, the pointers to the PartitionKey and PartitionDesc are > stable. The PartitionKey can't change once it's established, and the > PartitionDesc can't change while we've got a lock on the relation > unless we change it ourselves (and any places that do should have > CheckTableNotInUse checks). The keep_partkey and keep_partdesc > handling in relcache.c exists exactly so that we can guarantee that > the pointer won't go stale under us. Now, if we *don't* have a lock > on the relation, then those pointers can easily be invalidated -- so > you can't hang onto a PartitionDispatch for longer than you hang onto > the lock on the Relation. But that shouldn't be a problem. I think > you only need to hang onto PartitionDispatch pointers for the lifetime > of a single query. One can imagine optimizations where we try to > avoid rebuilding that for subsequent queries but I'm not sure there's > any demonstrated need for such a system at present. Here too. Attached are the updated patches. Thanks, Amit
On Mon, Aug 28, 2017 at 6:38 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I am worried about the open relcache reference in PartitionDispatch when > we start using it in the planner. Whereas there is a ExecEndModifyTable() > as a suitable place to close that reference, there doesn't seem to exist > one within the planner, but I guess we will have to figure something out. Yes, I think there's no real way to avoid having to figure that out. > OK, done this way in the attached updated patch. Any suggestions about a > better name for what the patch calls PartitionTupleRoutingInfo? I think this patch could be further simplified by continuing to use the existing function signature for RelationGetPartitionDispatchInfo instead of changing it to return a List * rather than an array. I don't see any benefit to such a change. The current system is more efficient. I keep having the feeling that this is a big patch with a small patch struggling to get out. Is it really necessary to change RelationGetPartitionDispatchInfo so much or could you just do a really minimal surgery to remove the code that sets the stuff we don't need? Like this:

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..4fabcf9f32 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1089,29 +1089,7 @@ RelationGetPartitionDispatchInfo(Relation rel,
         pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
         pd[i]->reldesc = partrel;
         pd[i]->key = partkey;
-        pd[i]->keystate = NIL;
         pd[i]->partdesc = partdesc;
-        if (parent != NULL)
-        {
-            /*
-             * For every partitioned table other than root, we must store a
-             * tuple table slot initialized with its tuple descriptor and a
-             * tuple conversion map to convert a tuple from its parent's
-             * rowtype to its own. That is to make sure that we are looking at
-             * the correct row using the correct tuple descriptor when
-             * computing its partition key for tuple routing.
-             */
-            pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
-            pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-                                                   tupdesc,
-                                  gettext_noop("could not convert row type"));
-        }
-        else
-        {
-            /* Not required for the root partitioned table */
-            pd[i]->tupslot = NULL;
-            pd[i]->tupmap = NULL;
-        }
         pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
         /*

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/08/29 4:26, Robert Haas wrote: > I think this patch could be further simplified by continuing to use > the existing function signature for RelationGetPartitionDispatchInfo > instead of changing it to return a List * rather than an array. I > don't see any benefit to such a change. The current system is more > efficient. OK, restored the array way. > I keep having the feeling that this is a big patch with a small patch > struggling to get out. Is it really necessary to change > RelationGetPartitionDispatchInfo so much or could you just do a really > minimal surgery to remove the code that sets the stuff we don't need? > Like this: Sure, done in the attached updated patch. Thanks, Amit
On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I keep having the feeling that this is a big patch with a small patch >> struggling to get out. Is it really necessary to change >> RelationGetPartitionDispatchInfo so much or could you just do a really >> minimal surgery to remove the code that sets the stuff we don't need? >> Like this: > > Sure, done in the attached updated patch. On first glance, that looks pretty good. I'll have a deeper look tomorrow. It strikes me that if PartitionTupleRoutingInfo is an executor structure, we should probably (as a separate patch) relocate FormPartitionKeyDatum and get_partition_for_tuple to someplace in src/backend/executor and rename them accordingly; maybe it's time for an execPartition.c? catalog/partition.c is getting really big, so it's not a bad thing if some of that stuff gets moved elsewhere. A lingering question in my mind, though, is whether it's really correct to think of PartitionTupleRoutingInfo as executor-specific. Maybe we're going to need that for other things too? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/08/30 12:03, Robert Haas wrote: > On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> I keep having the feeling that this is a big patch with a small patch >>> struggling to get out. Is it really necessary to change >>> RelationGetPartitionDispatchInfo so much or could you just do a really >>> minimal surgery to remove the code that sets the stuff we don't need? >>> Like this: >> >> Sure, done in the attached updated patch. > > On first glance, that looks pretty good. I'll have a deeper look > tomorrow. Thanks. > It strikes me that if PartitionTupleRoutingInfo is an > executor structure, we should probably (as a separate patch) relocate > FormPartitionKeyDatum and get_partition_for_tuple to someplace in > src/backend/executor and rename them accordingly; maybe it's time for > an execPartition.c? catalog/partition.c is getting really big, so I agree. It would be a good idea to introduce an execPartition.c so that future patches in this area (such as the executor partition-pruning patch on the horizon) have a convenient place to park their code. Will see if I can make a patch for that. > it's not a bad thing if some of that stuff gets moved elsewhere. A > lingering question in my mind, though, is whether it's really correct > to think of PartitionTupleRoutingInfo as executor-specific. Maybe > we're going to need that for other things too? Hmm. Maybe a subset of PartitionTupleRoutingInfo's fields are usable outside the executor (only PartitionDispatch, which is exported by partition.h anyway?), but not all of it. For example, the keystate and tupslot fields seem pretty executor-specific. In fact, they seem to be required only for tuple routing. One idea is to not introduce PartitionTupleRoutingInfo at all and add its fields directly as ModifyTableState/CopyState fields. We currently have, for example, a mt_partition_tupconv_maps array with one slot for every leaf partition.
Maybe there could be the following fields in ModifyTableState (and similarly in CopyState):

    int mt_num_parted;  /* this one exists today */
    struct PartitionDispatchData **mt_partition_dispatch_info;  /* and this */
    List **mt_parted_keystate;
    TupleConversionMap **mt_parted_tupconv_maps;
    TupleTableSlot **mt_parted_tupslots;

Each of those arrays will have mt_num_parted slots. Thanks, Amit
On 25 August 2017 at 23:58, Robert Haas <robertmhaas@gmail.com> wrote: > > That just leaves indexes. In a world where keystate, tupslot, and > tupmap are removed from the PartitionDispatchData, you must need > indexes or there would be no point in constructing a > PartitionDispatchData object in the first place; any application that > needs neither indexes nor the executor-specific stuff could just use > the Relation directly. But there is expand_inherited_rtentry(), which requires neither indexes nor any executor stuff, but still needs to call RelationGetPartitionDispatchInfo(), and so these indexes get built unnecessarily. Looking at the latest patch, I can see that those indexes can be built separately in ExecSetupPartitionTupleRouting(), where they are required, instead of in RelationGetPartitionDispatchInfo(). In the loop which iterates through the pd[] returned from RelationGetPartitionDispatchInfo(), we can build them using the exact code currently used to build them in RelationGetPartitionDispatchInfo(). In the future, if we have applications where the indexes are also required, we may add a separate function only to build the indexes, and then ExecSetupPartitionTupleRouting() would also call that function. -------- On 21 August 2017 at 11:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an >> array of pointers. Why can't it be an array of >> PartitionTupleRoutingInfo structure rather than pointer to that >> structure ? > > AFAIK, assigning pointers is less expensive than assigning struct and we > end up doing a lot of assigning of the members of that array to a local > variable I didn't get why exactly we would have to copy the structures. We could just pass the address of ptrinfos[index], no? My only point here was: we would not have to call palloc0() for each element of ptrinfos. Instead, just allocate memory for all of the elements in a single palloc0().
We anyway have to allocate memory for *each* of the elements. > in get_partition_for_tuple(), for example. Perhaps, we could > avoid those assignments and implement it the way you suggest. You mean at these two places in get_partition_for_tuple(), right?:

1.
    /* start with the root partitioned table */
    parent = ptrinfos[0];

2.
    else
        parent = ptrinfos[-parent->pd->indexes[cur_index]];

In both of the above places, we could just use &ptrinfos[...] instead of ptrinfos[...]. But I guess you meant something else. ------------ RelationGetPartitionDispatchInfo() opens all the partitioned tables. But in ExecSetupPartitionTupleRouting(), it again opens all the parents, that is, all the partitioned tables, and closes them back. Instead, would it be possible to do this: instead of the PartitionDispatch->parentoid field, PartitionDispatch can have a parentrel Relation field, which will point to the reldesc field of one of the pds[] elements. ------------ For me, the CopyStateData->ptrinfos and ModifyTableState.mt_ptrinfos field names sound confusing. How about part_tuple_routing_info or just tuple_routing_info?
On Wed, Aug 30, 2017 at 8:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> I keep having the feeling that this is a big patch with a small patch >>> struggling to get out. Is it really necessary to change >>> RelationGetPartitionDispatchInfo so much or could you just do a really >>> minimal surgery to remove the code that sets the stuff we don't need? >>> Like this: >> >> Sure, done in the attached updated patch. > > On first glance, that looks pretty good. I'll have a deeper look > tomorrow. > In one of your earlier mails on this thread, you described what expand_inherited_rtentry() would look like:

-- quote --

1. Calling find_all_inheritors with a new only-lock-the-partitions flag. This should result in locking all partitioned tables in the inheritance hierarchy in breadth-first, low-OID-first order. (When the only-lock-the-partitions isn't specified, all partitioned tables should still be locked before any unpartitioned tables, so that the locking order in that case is consistent with what we do here.)

2. Iterate over the partitioned tables identified in step 1 in the order in which they were returned. For each one: - Decide which children can be pruned. - Lock the unpruned, non-partitioned children in low-OID-first order.

3. Make another pass over the inheritance hierarchy, starting at the root. Traverse the whole hierarchy in breadth-first in *bound* order. Add RTEs and AppendRelInfos as we go -- these will have rte->inh = true for partitioned tables and rte->inh = false for leaf partitions.

-- quote ends --

Amit's patches seem to be addressing the third point here, but the expansion is not happening in a breadth-first manner. We are expanding all the partitioned partitions first and then the leaf partitions, so that's not exactly "breadth-first". I tried to rebase the first patch from the partition-wise join patchset [1] on top of these two patches.
I am having a hard time applying those changes. The goal of my patch is to expand the partitioned table into an inheritance hierarchy which retains the partition hierarchy. For that to happen, we need to know which partition belongs to which partitioned table in the partition hierarchy. The PartitionDispatch array provided by RelationGetPartitionDispatchInfo() gives me the parent OIDs of the partitioned parents, but it doesn't do so for the leaf partitions. So, I changed the signature of that function to return the list of parent OIDs of the leaf partitions. But for building AppendRelInfos, child RTEs and child row marks, I need the parent's RTI, RTE and row marks, which are not available directly. Given a parent's OID, I need to search root->parse->rtable to find its RTE and RTI, and then using the RTI I can find the row marks. But that seems to defeat the very reason partition-wise join needs EIBO, i.e. avoiding the O(n^2) loop in build_simple_rel(). To eliminate that loop, we would be introducing another O(n^2) loop in expand_inherited_rtentry(). Even without considering the O(n^2) complexity, this looks ugly.

A better way of translating the partition hierarchy into an inheritance hierarchy may be to expand all partitions (partitioned or leaf) of a given parent at a time, in a breadth-first manner. This allows us to create the child RTEs, AppendRelInfos and row marks while we have the corresponding parent structures at hand, rather than searching for them. This would still satisfy Amit Khandekar's requirement to expand leaf partitions in the same order as their OIDs would be returned by RelationGetPartitionDispatchInfo(). I have a feeling that, if we go that route, we will replace almost all the changes that Amit Langote's patches make to expand_inherited_rtentry(). -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Wed, Aug 30, 2017 at 9:22 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Amit's patches seem to be addressing the third point here. But the > expansion is not happening in breadth-first manner. We are expanding > all the partitioned partitions first and then leaf partitions. So > that's not exactly "breadth-first". Correct, but I think Amit's ordering is what we actually want: breadth-first, low-OID-first over interior partitioned tables, and then breadth-first, low-OID-first again over leaves. If we don't keep partitioned partitions first, then we're going to have problems keeping the locking order consistent when we start doing pruning during expansion. > A better way for translating partition hierarchy into inheritance > hierarchy may be to expand all partitions (partitioned or leaf) of a > given parent at a time in breadth-first manner. This allows us to > create child RTE, AppendRelInfo, rowmarks while we have corresponding > parent structures at hand, rather than searching for those. This would > still satisfy Amit Khandekar's requirement to expand leaf partitions > in the same order as their OIDs would be returned by > RelationGetPartitionDispatchInfo(). I have a feeling that, if we go > that route, we will replace almost all the changes that Amit Langote's > patches do to expand_inherited_rtentry(). I think we will, too, but I think that's basically the problem of the partition-wise join patch. Either find_all_inheritors is going to have to return enough additional information to let expand_inherited_rtentry work efficiently, or else expand_inherited_rtentry is going to have to duplicate some of the logic from find_all_inheritors. But that doesn't make what Amit is doing here a bad idea -- getting stuff that shouldn't be part of PartitionDispatch removed and getting the expansion order in expand_inherited_rtentry() changed seem to be the right things to do even if the way it's implemented has to be revised to meet other goals. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Aug 30, 2017 at 9:22 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Amit's patches seem to be addressing the third point here. But the >> expansion is not happening in breadth-first manner. We are expanding >> all the partitioned partitions first and then leaf partitions. So >> that's not exactly "breadth-first". > > Correct, but I think Amit's ordering is what we actually want: > breadth-first, low-OID-first over interior partitioned tables, and > then breadth-first, low-OID-first again over leaves. If we don't keep > partitioned partitions first, then we're going to have problems > keeping the locking order consistent when we start doing pruning > during expansion. No, I'm wrong and you're correct. We want the partitions to be locked first, but we don't want them to be pulled to the front of the expansion order, because then it's not in bound order anymore and any optimization that tries to rely on that ordering will break. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 6:08 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On 25 August 2017 at 23:58, Robert Haas <robertmhaas@gmail.com> wrote: >> That just leaves indexes. In a world where keystate, tupslot, and >> tupmap are removed from the PartitionDispatchData, you must need >> indexes or there would be no point in constructing a >> PartitionDispatchData object in the first place; any application that >> needs neither indexes nor the executor-specific stuff could just use >> the Relation directly. > > But there is expand_inherited_rtentry() which neither requires indexes > nor any executor stuff, but still requires to call > RelationGetPartitionDispatchInfo(), and so these indexes get built > unnecessarily. True, but the reason why expand_inherited_rtentry() needs to call RelationGetPartitionDispatchInfo() is to get back the leaf partition OIDs in bound order. If we're using RelationGetPartitionDispatchInfo() to get the leaf partition OIDs into bound order, we've got to run the loop that builds leaf_part_oids, and the same loop constructs indexes. So I don't think we're doing much redundant work there. Now, if we made it the job of expand_inherited_rtentry() to loop over the PartitionDesc, then it really wouldn't need to call RelationGetPartitionDispatchInfo at all. Maybe that's actually a better plan anyway, because as Ashutosh points out, we don't want the partitioned children to show up before the unpartitioned children in the resulting ordering. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 9:42 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Aug 30, 2017 at 6:08 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 25 August 2017 at 23:58, Robert Haas <robertmhaas@gmail.com> wrote: >>> That just leaves indexes. In a world where keystate, tupslot, and >>> tupmap are removed from the PartitionDispatchData, you must need >>> indexes or there would be no point in constructing a >>> PartitionDispatchData object in the first place; any application that >>> needs neither indexes nor the executor-specific stuff could just use >>> the Relation directly. >> >> But there is expand_inherited_rtentry() which neither requires indexes >> nor any executor stuff, but still requires to call >> RelationGetPartitionDispatchInfo(), and so these indexes get built >> unnecessarily. > > True, but the reason why expand_inherited_rtentry() needs to call > RelationGetPartitionDispatchInfo() is to get back the leaf partition > OIDs in bound order. If we're using > RelationGetPartitionDispatchInfo() to get the leaf partition OIDs into > bound order, we've got to run the loop that builds leaf_part_oids, and > the same loop constructs indexes. So I don't think we're doing much > redundant work there. > > Now, if we made it the job of expand_inherited_rtentry() to loop over > the PartitionDesc, then it really wouldn't need to call > RelationGetPartitionDispatchInfo at all. Maybe that's actually a > better plan anyway, because as Ashutosh points out, we don't want the > partitioned children to show up before the unpartitioned children in > the resulting ordering. > +1. I think we should just pull out the OIDs from partition descriptor. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Wed, Aug 30, 2017 at 12:47 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > +1. I think we should just pull out the OIDs from partition descriptor.

Like this? The first patch refactors the expansion of a single child out into a separate function, and the second patch implements EIBO on top of it.

I realized while doing this that we really want to expand the partitioning hierarchy depth-first, not breadth-first. For some things, like partition-wise join in the case where all bounds match exactly, we really only need a *predictable* ordering that will be the same for two equi-partitioned tables. A breadth-first expansion will give us that. But it's not actually in bound order. For example:

create table foo (a int, b text) partition by list (a);
create table foo1 partition of foo for values in (2);
create table foo2 partition of foo for values in (1) partition by range (b);
create table foo2a partition of foo2 for values from ('b') to ('c');
create table foo2b partition of foo2 for values from ('a') to ('b');
create table foo3 partition of foo for values in (3);

The correct bound-order expansion of this is foo2b - foo2a - foo1 - foo3, which is indeed what you get with the attached patch. But if we did the expansion in breadth-first fashion, we'd get foo1 - foo3 - foo2a - foo2b, which is, well, not in bound order. If the idea is that you see a > 2 and rule out all partitions that appear before the first one with an a-value >= 2, it's not going to work.

Mind you, that idea has some problems anyway in the face of default partitions, null partitions, and list partitions which accept non-contiguous values (e.g. one partition for 1, 3, 5; another for 2, 4, 6). We might need to mark the PartitionDesc to indicate whether PartitionDesc-order is in fact bound-order in a particular instance, or something like that.
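The two expansion orders over this example can be modeled with a toy traversal. This is an illustrative Python sketch, not PostgreSQL code; both traversals walk children in PartitionDesc (bound) order, so the breadth-first result here lists foo2's leaves in bound order rather than creation order, but either way they end up pushed behind the level-1 leaves:

```python
# Robert's example hierarchy, with children stored in bound order as in
# a PartitionDesc: foo2 is the list partition for value 1, so it sorts
# first; within foo2, foo2b ('a'..'b') precedes foo2a ('b'..'c').
from collections import deque

foo = ("foo", [
    ("foo2", [("foo2b", []), ("foo2a", [])]),  # for values in (1)
    ("foo1", []),                              # for values in (2)
    ("foo3", []),                              # for values in (3)
])

def leaves_depth_first(node):
    name, children = node
    if not children:
        return [name]
    result = []
    for child in children:
        result.extend(leaves_depth_first(child))
    return result

def leaves_breadth_first(root):
    result, queue = [], deque([root])
    while queue:
        name, children = queue.popleft()
        if children:
            queue.extend(children)
        else:
            result.append(name)
    return result

print(leaves_depth_first(foo))    # ['foo2b', 'foo2a', 'foo1', 'foo3'] -- bound order
print(leaves_breadth_first(foo))  # ['foo1', 'foo3', 'foo2b', 'foo2a'] -- not bound order
```

The depth-first walk reproduces the foo2b - foo2a - foo1 - foo3 ordering the message describes, while the breadth-first walk moves the sub-partitioned table's leaves to the end.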
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Aug 31, 2017 at 1:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Aug 30, 2017 at 12:47 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> +1. I think we should just pull out the OIDs from partition descriptor. > > Like this? The first patch refactors the expansion of a single child > out into a separate function, and the second patch implements EIBO on > top of it. > > I realized while doing this that we really want to expand the > partitioning hierarchy depth-first, not breadth-first. For some > things, like partition-wise join in the case where all bounds match > exactly, we really only need a *predictable* ordering that will be the > same for two equi-partitioned tables.

+1. Well spotted!

> A breadth-first expansion will > give us that. But it's not actually in bound order. For example: > > create table foo (a int, b text) partition by list (a); > create table foo1 partition of foo for values in (2); > create table foo2 partition of foo for values in (1) partition by range (b); > create table foo2a partition of foo2 for values from ('b') to ('c'); > create table foo2b partition of foo2 for values from ('a') to ('b'); > create table foo3 partition of foo for values in (3); > > The correct bound-order expansion of this is foo2b - foo2a - foo1 - > foo3, which is indeed what you get with the attached patch. But if we > did the expansion in breadth-first fashion, we'd get foo1 - foo3 - > foo2a, foo2b, which is, well, not in bound order. If the idea is that > you see a > 2 and rule out all partitions that appear before the first > one with an a-value >= 2, it's not going to work.

Here are the patches revised a bit. I have especially changed the variable names and arguments to reflect their true roles in the functions. Also updated the prologue of expand_single_inheritance_child() to mention "has_child". Let me know if those changes look good.
-- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/08/31 4:45, Robert Haas wrote: > On Wed, Aug 30, 2017 at 12:47 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> +1. I think we should just pull out the OIDs from partition descriptor. > > Like this? The first patch refactors the expansion of a single child > out into a separate function, and the second patch implements EIBO on > top of it. > > I realized while doing this that we really want to expand the > partitioning hierarchy depth-first, not breadth-first. For some > things, like partition-wise join in the case where all bounds match > exactly, we really only need a *predictable* ordering that will be the > same for two equi-partitioned tables. A breadth-first expansion will > give us that. But it's not actually in bound order. For example: > > create table foo (a int, b text) partition by list (a); > create table foo1 partition of foo for values in (2); > create table foo2 partition of foo for values in (1) partition by range (b); > create table foo2a partition of foo2 for values from ('b') to ('c'); > create table foo2b partition of foo2 for values from ('a') to ('b'); > create table foo3 partition of foo for values in (3); > > The correct bound-order expansion of this is foo2b - foo2a - foo1 - > foo3, which is indeed what you get with the attached patch. But if we > did the expansion in breadth-first fashion, we'd get foo1 - foo3 - > foo2a, foo2b, which is, well, not in bound order. If the idea is that > you see a > 2 and rule out all partitions that appear before the first > one with an a-value >= 2, it's not going to work.

I think, overall, this might be a good idea. Thanks for working on it. The patches I posted on the "path toward faster partition pruning" thread achieve the same end result as your patch, namely that the leaf partitions appear in the partition bound order in the Append path for a partitioned table. They achieve that result in a somewhat different way, but let's forget about that for a moment.
One thing the patch on that thread didn't achieve, though, is getting the leaf partitions in the same (partition bound) order in the ModifyTable path for UPDATE/DELETE, because the inheritance_planner() path is not modified in a suitable way (in fact, I'm afraid that there might be a deadlock bug lurking there, which I must address). Your patch, OTOH, achieves the same order in both cases, which seems desirable.

> Mind you, that idea has some problems anyway in the face of default > partitions, null partitions, and list partitions which accept > non-contiguous values (e.g. one partition for 1, 3, 5; another for 2, > 4, 6). We might need to mark the PartitionDesc to indicate whether > PartitionDesc-order is in fact bound-order in a particular instance, > or something like that.

ISTM, the primary motivation for the EIBO patch at this point is to get the partitions ordered in a predictable manner so that the partition-wise join and update-partition-key patches could implement certain logic using an O(n) algorithm rather than an O(n^2) one. Neither of them depends on the actual order in the sense of, say, sticking a PathKey on the resulting Append. Perhaps the patch to "Make the optimiser aware of partitions ordering" [1] will have to consider this somehow; maybe by limiting its scope to only the cases where the root partitioned table is range partitioned. Thanks, Amit [1] https://commitfest.postgresql.org/14/1093/
On Thu, Aug 31, 2017 at 3:36 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > ISTM, the primary motivation for the EIBO patch at this point is to get > the partitions ordered in a predictable manner so that the partition-wise > join patch and update partition key patches could implement certain logic > using O (n) algorithm rather than an O (n^2) one. That's part of it, but not the whole thing. For example, BASIC partition-wise join only needs a predictable order, not a bound-ordered one. But the next step is to be able to match up uneven bounds - e.g. given [1000, 2000), [3000, 4000), [5000, 6000) on one side and [1100, 2100), [2900,3900), and [5500,5600) on the other side, we can still make it work. That greatly benefits from being able to iterate through all the bounds in order. > Neither of them depend > on the actual order in the sense of, say, sticking a PathKey to the > resulting Append. Perhaps, the patch to"Make the optimiser aware of > partitions ordering" [1] will have to consider this somehow; maybe by > limiting its scope to only the cases where the root partitioned table is > range partitioned. I think that doing a depth-first traversal as I've done here avoids the need to limit it to that case. If we did a breadth-first traversal anything that was subpartitioned would end up having the subpartitions at the end instead of in the sequence, but the depth-first traversal avoids that issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
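The uneven-bounds matching described above -- made linear by having both relations' bounds available in sorted order -- can be sketched as a merge-style pass. The overlap-pairing rule below is a simplification for illustration, not the partition-wise join patch's actual algorithm:

```python
# Match range bounds of two partitioned relations by a single
# merge-style pass over their sorted bound lists.

def match_overlapping(bounds_a, bounds_b):
    """bounds_*: sorted lists of half-open ranges [lo, hi)."""
    pairs, i, j = [], 0, 0
    while i < len(bounds_a) and j < len(bounds_b):
        a_lo, a_hi = bounds_a[i]
        b_lo, b_hi = bounds_b[j]
        if a_hi <= b_lo:
            i += 1                  # a's range ends before b's begins
        elif b_hi <= a_lo:
            j += 1                  # b's range ends before a's begins
        else:
            pairs.append((i, j))    # the ranges overlap: join these two
            # advance whichever range ends first, so one range can
            # pair with several overlapping partners on the other side
            if a_hi <= b_hi:
                i += 1
            else:
                j += 1
    return pairs

# The bounds from the message above:
a = [(1000, 2000), (3000, 4000), (5000, 6000)]
b = [(1100, 2100), (2900, 3900), (5500, 5600)]
print(match_overlapping(a, b))  # [(0, 0), (1, 1), (2, 2)]
```

Each side is scanned once, so the pairing costs O(n) rather than the O(n^2) of comparing every bound against every other.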
On Thu, Aug 31, 2017 at 2:56 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Here are the patches revised a bit. I have esp changed the variable > names and arguments to reflect their true role in the functions. Also > updated prologue of expand_single_inheritance_child() to mention > "has_child". Let me know if those changes look good. Sure. Committed as you have it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 31 August 2017 at 13:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Mind you, that idea has some problems anyway in the face of default >> partitions, null partitions, and list partitions which accept >> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2, >> 4, 6). We might need to mark the PartitionDesc to indicate whether >> PartitionDesc-order is in fact bound-order in a particular instance, >> or something like that. > > ISTM, the primary motivation for the EIBO patch at this point is to get > the partitions ordered in a predictable manner so that the partition-wise > join patch and update partition key patches could implement certain logic > using O (n) algorithm rather than an O (n^2) one. Neither of them depend > on the actual order in the sense of, say, sticking a PathKey to the > resulting Append. Now that the inheritance hierarchy is expanded in depth-first order, RelationGetPartitionDispatchInfo() needs to be changed to arrange the PartitionDispatch array and the leaf partitions in depth-first order (as we know this is a requirement for the update-partition-key patch for efficiently determining which of the leaf partitions are already present in the update result rels). Amit, I am not sure if you are already doing this as part of the patches in this mail thread. Please let me know. Also, let me know if you think there will be any loss of efficiency in tuple routing code if we arrange the Partition Dispatch indexes in depth-first order.
Hi Amit, On 2017/09/03 16:07, Amit Khandekar wrote: > On 31 August 2017 at 13:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> Mind you, that idea has some problems anyway in the face of default >>> partitions, null partitions, and list partitions which accept >>> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2, >>> 4, 6). We might need to mark the PartitionDesc to indicate whether >>> PartitionDesc-order is in fact bound-order in a particular instance, >>> or something like that. >> >> ISTM, the primary motivation for the EIBO patch at this point is to get >> the partitions ordered in a predictable manner so that the partition-wise >> join patch and update partition key patches could implement certain logic >> using O (n) algorithm rather than an O (n^2) one. Neither of them depend >> on the actual order in the sense of, say, sticking a PathKey to the >> resulting Append. > > Now that the inheritance hierarchy is expanded in depth-first order, > RelationGetPartitionDispatchInfo() needs to be changed to arrange the > PartitionDispatch array and the leaf partitions in depth-first order > (as we know this is a requirement for the update-partition-key patch > for efficiently determining which of the leaf partitions are already > present in the update result rels). I was thinking the same. > Amit, I am not sure if you are > already doing this as part of the patches in this mail thread. Please > let me know. Actually, I had thought of changing the expansion order in RelationGetPartitionDispatchInfo to depth-first after Robert committed his patch the other day, but haven't got around to doing that yet. Will do that in the updated patch (the refactoring patch) I will post sometime later today or tomorrow on a differently titled thread, because the EIBO work seems to be done. > Also, let me know if you think there will be any loss of > efficiency in tuple routing code if we arrange the Partition Dispatch > indexes in depth-first order. 
I don't think there will be any loss in the efficiency of the tuple routing code itself. It's just that the position of the ResultRelInfos (of leaf partitions) and PartitionDispatch objects (of partitioned tables) will be different in their respective arrays, that is, pd->indexes will now have different values than formerly. And now because the planner will put leaf partitions subplans / WCOs / RETURNING projections in that order in the ModifyTable node, we must make sure that we adapt the same order in the executor, as you already noted. Thanks, Amit
On 4 September 2017 at 06:34, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Amit, > > On 2017/09/03 16:07, Amit Khandekar wrote: >> On 31 August 2017 at 13:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>>> Mind you, that idea has some problems anyway in the face of default >>>> partitions, null partitions, and list partitions which accept >>>> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2, >>>> 4, 6). We might need to mark the PartitionDesc to indicate whether >>>> PartitionDesc-order is in fact bound-order in a particular instance, >>>> or something like that. >>> >>> ISTM, the primary motivation for the EIBO patch at this point is to get >>> the partitions ordered in a predictable manner so that the partition-wise >>> join patch and update partition key patches could implement certain logic >>> using O (n) algorithm rather than an O (n^2) one. Neither of them depend >>> on the actual order in the sense of, say, sticking a PathKey to the >>> resulting Append. >> >> Now that the inheritance hierarchy is expanded in depth-first order, >> RelationGetPartitionDispatchInfo() needs to be changed to arrange the >> PartitionDispatch array and the leaf partitions in depth-first order >> (as we know this is a requirement for the update-partition-key patch >> for efficiently determining which of the leaf partitions are already >> present in the update result rels). > > I was thinking the same. > >> Amit, I am not sure if you are >> already doing this as part of the patches in this mail thread. Please >> let me know. > > Actually, I had thought of changing the expansion order in > RelationGetPartitionDispatchInfo to depth-first after Robert committed his > patch the other day, but haven't got around to doing that yet. Will do > that in the updated patch (the refactoring patch) I will post sometime > later today or tomorrow on a differently titled thread, because the EIBO > work seems to be done. Great, thanks. 
Just wanted to make sure someone is working on that, because, as you said, it is no longer an EIBO patch. Since you are doing that, I won't work on that. > >> Also, let me know if you think there will be any loss of >> efficiency in tuple routing code if we arrange the Partition Dispatch >> indexes in depth-first order. > > I don't think there will be any loss in the efficiency of the tuple > routing code itself. It's just that the position of the ResultRelInfos > (of leaf partitions) and PartitionDispatch objects (of partitioned tables) > will be different in their respective arrays, that is, pd->indexes will > now have different values than formerly. Ok. Good to hear that. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 2017/09/05 14:11, Amit Khandekar wrote: > Great, thanks. Just wanted to make sure someone is working on that, > because, as you said, it is no longer an EIBO patch. Since you are > doing that, I won't work on that.

Here is that patch (actually two patches). Sorry it took me a bit. Description:

[PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor

Currently, it and the structures it generates, viz. PartitionDispatch objects, are too closely coupled with the executor's tuple-routing code. In particular, it's pretty undesirable that it makes it the responsibility of the caller to release some resources, such as executor tuple table slots, tuple-conversion maps, etc. After this refactoring, ExecSetupPartitionTupleRouting() now needs to do some of the work that was previously done in RelationGetPartitionDispatchInfo().

[PATCH 2/2] Make RelationGetPartitionDispatch expansion order depth-first

This is so that it matches what the planner is doing with partitioning inheritance expansion. Matching the planner's order helps because it makes it easier to match the executor's per-partition objects with the planner-created per-partition nodes.

Actually, I'm coming to the conclusion that we should keep any whole-partition-tree stuff out of partition.c and its interface, as Robert has also alluded to in an earlier message on this thread [1]. But since that's a different topic, I'll shut up about it on this thread and start a new thread to discuss what kind of code rearrangement is possible. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmoafr%3DhUrM%3Dcbx-k%3DBDHOF2OfXaw95HQSNAK4mHBwmSjtw%40mail.gmail.com
Thanks Amit for the patch. I am still reviewing it, but meanwhile below are a few comments so far...

On 8 September 2017 at 15:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > [PATCH 2/2] Make RelationGetPartitionDispatch expansion order > depth-first > > This is so as it matches what the planner is doing with partitioning > inheritance expansion. Matching with planner order helps because > it helps ease matching the executor's per-partition objects with > planner-created per-partition nodes.

+ next_parted_idx += (list_length(*pds) - next_parted_idx - 1);

I think this can be replaced just by:

+ next_parted_idx = list_length(*pds) - 1;

Or, how about removing this variable next_parted_idx altogether? Instead, we can just do this:

pd->indexes[i] = -(1 + list_length(*pds));

If that is not possible, I may be missing something.

-----------

+ next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx);

I didn't understand why next_leaf_idx needs to be updated in the case when the current partrelid is partitioned. I think it should be incremented only for leaf partitions, no? Or, for that matter, again, how about removing the variable 'next_leaf_idx' and doing this:

*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
pd->indexes[i] = list_length(*leaf_part_oids) - 1;

-----------

* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop. This is
* a bit tricky but works because the foreach() macro doesn't fetch the
* next list element until the bottom of the loop.

I think the above comment needs to be modified with something explaining the relevant changed code. For example, there is no parted_rels, and the "tricky" part was there earlier because the list was being iterated over and appended to at the same time.
------------ I couldn't see the existing comments like "Indexes corresponding to the internal partitions are multiplied by" anywhere in the patch. I think those comments are still valid, and important. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Hi Amit, On 2017/09/11 16:16, Amit Khandekar wrote: > Thanks Amit for the patch. I am still reviewing it, but meanwhile > below are a few comments so far ... Thanks for the review. > + next_parted_idx += (list_length(*pds) - next_parted_idx - 1); > > I think this can be replaced just by : > + next_parted_idx = list_length(*pds) - 1; > Or, how about removing this variable next_parted_idx altogether ? > Instead, we can just do this : > pd->indexes[i] = -(1 + list_length(*pds)); That seems like the simplest possible way to do it. > + next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx); > > Didn't understand why next_leaf_idx needs to be updated in case when > the current partrelid is partitioned. I think it should be incremented > only for leaf partitions, no ? Or for that matter, again, how about > removing the variable 'next_leaf_idx' and doing this : > *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid); > pd->indexes[i] = list_length(*leaf_part_oids) - 1; Yep. Attached updated patch does it that way for both partitioned table indexes and leaf partition indexes. Thanks for pointing it out. > ----------- > > * For every partitioned table in the tree, starting with the root > * partitioned table, add its relcache entry to parted_rels, while also > * queuing its partitions (in the order in which they appear in the > * partition descriptor) to be looked at later in the same loop. This is > * a bit tricky but works because the foreach() macro doesn't fetch the > * next list element until the bottom of the loop. > > I think the above comment needs to be modified with something > explaining the relevant changed code. For e.g. there is no > parted_rels, and the "tricky" part was there earlier because of the > list being iterated and at the same time being appended. > > ------------ I think I forgot to update this comment. 
> I couldn't see the existing comments like "Indexes corresponding to > the internal partitions are multiplied by" anywhere in the patch. I > think those comments are still valid, and important. Again, I failed to keep this comment. Anyway, I reworded the comments a bit to describe what the code is doing more clearly. Hope you find it so too. Thanks, Amit
On 2017/09/11 18:56, Amit Langote wrote: > Attached updated patch does it that way for both partitioned table indexes > and leaf partition indexes. Thanks for pointing it out.

It seems to me we don't really need the first patch all that much. That is, let's keep PartitionDispatchData the way it is for now, since we don't really have any need for it besides tuple routing (EIBO as committed didn't need it, for one). So, let's forget about the "decoupling RelationGetPartitionDispatchInfo() from the executor" thing for now and move on.

So, attached is just the patch to make RelationGetPartitionDispatchInfo() traverse the partition tree in depth-first manner, to be applied on HEAD. Thoughts? Thanks, Amit
On 13 September 2017 at 15:32, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/11 18:56, Amit Langote wrote:
>> Attached updated patch does it that way for both partitioned table indexes
>> and leaf partition indexes. Thanks for pointing it out.
>
> It seems to me we don't really need the first patch all that much. That
> is, let's keep PartitionDispatchData the way it is for now, since we
> don't really have any need for it beyond tuple routing (EIBO as
> committed didn't need it, for one). So, let's forget about the
> "decoupling RelationGetPartitionDispatchInfo() from the executor" thing
> for now and move on.
>
> So, attached is just the patch to make RelationGetPartitionDispatchInfo()
> traverse the partition tree in depth-first manner, to be applied on HEAD.
>
> Thoughts?

+1. If at all we need the decoupling later for some reason, we can do
that incrementally. I will review your latest patch by tomorrow.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Sep 13, 2017 at 6:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It seems to me we don't really need the first patch all that much. That
> is, let's keep PartitionDispatchData the way it is for now, since we
> don't really have any need for it beyond tuple routing (EIBO as
> committed didn't need it, for one). So, let's forget about the
> "decoupling RelationGetPartitionDispatchInfo() from the executor" thing
> for now and move on.
>
> So, attached is just the patch to make RelationGetPartitionDispatchInfo()
> traverse the partition tree in depth-first manner, to be applied on HEAD.

I like this patch. Not only does it improve the behavior, but it's
actually less code than we have now, and in my opinion, the new code is
easier to understand, too.

A few suggestions:

- I think get_partition_dispatch_recurse() should get a
check_stack_depth() call just like expand_partitioned_rtentry() did,
and for the same reasons: if the catalog contents are corrupted so that
we have an infinite loop in the partitioning hierarchy, we want to
error out, not crash.

- I think we should add a comment explaining that we're careful to do
this in the same order as expand_partitioned_rtentry() so that callers
can assume that the N'th entry in the leaf_part_oids array will also be
the N'th child of an Append node.

+ * For every partitioned table other than root, we must store a

"other than the root"

+ * partitioned table. The value multiplied back by -1 is returned as the

"multiplied by -1", not "multiplied back by -1"

+ * tables in the tree, using which, search is continued further down the
+ * partition tree.

Period after "in the tree". Then continue: "This value is used to
continue the search in the next level of the partition tree."

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
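[Editor's note: the stack-depth guard Robert suggests can be sketched in miniature. PostgreSQL's real check_stack_depth() measures actual stack consumption and raises an error via ereport(); in this illustrative standalone sketch a plain depth counter and an error flag stand in for that machinery, and all names besides check_stack_depth's role are invented. The point is the failure mode: a corrupted catalog that makes a table its own ancestor must produce an error, not a stack-overflow crash.]

```c
#include <assert.h>

#define MAX_DEPTH 100           /* stand-in for max_stack_depth */

static int walk_error = 0;      /* set in place of ereport(ERROR, ...) */

/*
 * Recursive partition-tree walk guarded against a cyclic hierarchy.
 * self_referencing simulates catalog corruption in which a table
 * appears among its own descendants, producing unbounded recursion.
 */
static void
walk_partition_tree(int self_referencing, int depth)
{
    if (depth > MAX_DEPTH)      /* where check_stack_depth() would trip */
    {
        walk_error = 1;
        return;
    }
    if (self_referencing)
        walk_partition_tree(self_referencing, depth + 1);
    /* ... otherwise recurse into genuine children and terminate ... */
}
```

A healthy tree bottoms out normally; the corrupted case trips the guard and unwinds cleanly instead of crashing.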
On 2017/09/14 1:42, Robert Haas wrote:
> On Wed, Sep 13, 2017 at 6:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It seems to me we don't really need the first patch all that much. [...]
>>
>> So, attached is just the patch to make RelationGetPartitionDispatchInfo()
>> traverse the partition tree in depth-first manner, to be applied on HEAD.
>
> I like this patch. Not only does it improve the behavior, but it's
> actually less code than we have now, and in my opinion, the new code
> is easier to understand, too.
>
> A few suggestions:

Thanks for the review.

> - I think get_partition_dispatch_recurse() should get a
> check_stack_depth() call just like expand_partitioned_rtentry() did,
> and for the same reasons: if the catalog contents are corrupted so
> that we have an infinite loop in the partitioning hierarchy, we want
> to error out, not crash.

Ah, I missed that. Done.

> - I think we should add a comment explaining that we're careful to do
> this in the same order as expand_partitioned_rtentry() so that callers
> can assume that the N'th entry in the leaf_part_oids array will also
> be the N'th child of an Append node.

Done. Since the Append/ModifyTable may skip some leaf partitions due to
pruning, I added a note about that too.

> + * For every partitioned table other than root, we must store a
>
> "other than the root"
>
> + * partitioned table. The value multiplied back by -1 is returned as the
>
> "multiplied by -1", not "multiplied back by -1"
>
> + * tables in the tree, using which, search is continued further down the
> + * partition tree.
>
> Period after "in the tree". Then continue: "This value is used to
> continue the search in the next level of the partition tree."

Fixed.

Attached updated patch.

Thanks,
Amit
On 14 September 2017 at 06:43, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patch.

@@ -1222,151 +1209,130 @@ PartitionDispatch *
RelationGetPartitionDispatchInfo(Relation rel,
                                int *num_parted, List **leaf_part_oids)
{
+   List       *pdlist;
    PartitionDispatchData **pd;

+   get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);

Above, pdlist is passed uninitialized. And then inside
get_partition_dispatch_recurse(), it is used here:
    *pds = lappend(*pds, pd);

--------

pgindent says more alignment is needed. For example, the
gettext_noop() call below needs to be aligned:
    pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
                                        tupdesc,
                                        gettext_noop("could not convert row type"));

--------

Other than that, the patch looks good to me. I verified that the leaf
OIDs are ordered exactly in the order of the UPDATE subplans' output.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
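[Editor's note: the uninitialized-list bug flagged above comes down to lappend()'s contract: it treats a NIL (null) list as empty and allocates a fresh one, so the pointer itself must be set to NIL before the recursive call; a garbage pointer would be read as an existing list. The fix is simply `List *pdlist = NIL;`. This illustrative standalone sketch uses a simplified singly-linked list in place of PostgreSQL's List API; the names lappend_int_sketch and build_list are invented.]

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ListCell
{
    int datum;
    struct ListCell *next;
} ListCell;

/* Append datum; a NULL list plays the role of NIL and starts a new list. */
static ListCell *
lappend_int_sketch(ListCell *list, int datum)
{
    ListCell *cell = malloc(sizeof(ListCell));

    cell->datum = datum;
    cell->next = NULL;
    if (list == NULL)           /* NIL: allocate the list on first append */
        return cell;
    ListCell *tail = list;
    while (tail->next)
        tail = tail->next;
    tail->next = cell;
    return list;
}

/*
 * Stand-in for get_partition_dispatch_recurse(): appends through an
 * output pointer that the CALLER must have initialized to NULL (NIL),
 * mirroring the fix of initializing pdlist before recursing.
 */
static int
build_list(ListCell **out)
{
    int n = 0;

    for (int i = 0; i < 3; i++)
        *out = lappend_int_sketch(*out, i);
    for (ListCell *c = *out; c; c = c->next)
        n++;
    return n;
}
```

Calling build_list() with an uninitialized pointer would be undefined behavior; with the NIL initialization the first append allocates the list correctly.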
On Thu, Sep 14, 2017 at 7:56 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 14 September 2017 at 06:43, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated patch.
>
> @@ -1222,151 +1209,130 @@ PartitionDispatch *
> RelationGetPartitionDispatchInfo(Relation rel,
>                                 int *num_parted, List **leaf_part_oids)
> {
> +   List       *pdlist;
>     PartitionDispatchData **pd;
>
> +   get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
>
> Above, pdlist is passed uninitialized. And then inside
> get_partition_dispatch_recurse(), it is used here:
>     *pds = lappend(*pds, pd);
>
> --------
>
> pgindent says more alignment is needed. For example, the
> gettext_noop() call below needs to be aligned:
>     pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
>                                         tupdesc,
>                                         gettext_noop("could not convert row type"));
>
> --------
>
> Other than that, the patch looks good to me. I verified that the leaf
> OIDs are ordered exactly in the order of the UPDATE subplans' output.

Committed with fixes for those issues and a few other cosmetic changes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/09/15 1:37, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 7:56 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> [...]
>>
>> Other than that, the patch looks good to me. I verified that the leaf
>> OIDs are ordered exactly in the order of the UPDATE subplans' output.
>
> Committed with fixes for those issues and a few other cosmetic changes.

Thanks Amit for the review and Robert for committing.

Regards,
Amit