Thread: Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Thu, Aug 10, 2017 at 9:28 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Aug 10, 2017 at 1:39 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On my computer it took ~1.5 seconds to plan a 1000 partition join, >> ~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a >> 4000 partition join. I poked around in a profiler a bit and saw that >> for the 2000 partition case I spent almost half the time in >> create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle, >> and about half of that was in bms_is_subset. The other half the time >> was in query_planner->make_one_rel which spent 2/3 of its time in >> set_rel_size->add_child_rel_equivalences->bms_overlap and the other >> 1/3 in standard_join_search. > > Ashutosh asked me how I did that. Please see attached. I was > explaining simple joins like SELECT * FROM foofoo JOIN barbar USING > (a, b). Here also is the experimental hack I tried when I saw > bitmapset.c eating my CPU. > On my machine I observed following planning times 1000 partitions, without partition-wise join, 100ms; with partition-wise join 500ms 2000 partitions, without partition-wise join, 320ms; with partition-wise join 2.2s 4000 partitions, without partition-wise join, 1.3ms; with partition-wise join 17s So, even without partition-wise join the planning time increases at a superlinear rate with the number of partitions. Your patch didn't improve planning time without partition-wise join, so it's something good to have along-with partition-wise join. Given that Bitmapsets are used in other parts of code as well, the optimization may affect those parts as well, esp. the overhead of maintaining first_non_empty_wordnum. The comment at the beginning of the file bitmapset.c says 3 * bitmapset.c 4 * PostgreSQL generic bitmap set package 5 * 6 * A bitmap set can represent any set of nonnegative integers, although 7 * it is mainly intended for setswhere the maximum value is not large, 8 * say at most a few hundred. When we created thousands of children, we have certainly crossed the few hundred threashold. So, there may be other optimizations possible there. Probably we should leave that out of partition-wise join patches. Do you think we solving this problem is a prerequisite for partition-wise join? Or should we propose that patch as a separate enhancement? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Thomas Munro
Date:
On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Your patch didn't improve planning time without partition-wise join, > so it's something good to have along-with partition-wise join. Given > that Bitmapsets are used in other parts of code as well, the > optimization may affect those parts as well, esp. the overhead of > maintaining first_non_empty_wordnum. Maybe, but if you consider that this container already deals with the upper bound moving up by reallocating and copying the whole thing, adjusting an int when the lower bound moves down doesn't seem like anything to worry about... > The comment at the beginning of the file bitmapset.c says > 3 * bitmapset.c > 4 * PostgreSQL generic bitmap set package > 5 * > 6 * A bitmap set can represent any set of nonnegative integers, although > 7 * it is mainly intended for sets where the maximum value is not large, > 8 * say at most a few hundred. > > When we created thousands of children, we have certainly crossed the > few hundred threashold. So, there may be other optimizations possible > there. Probably we should leave that out of partition-wise join > patches. +1 > Do you think we solving this problem is a prerequisite for > partition-wise join? Or should we propose that patch as a separate > enhancement? No, I'm not proposing anything yet. For now I just wanted to share this observation about where hot CPU time goes in simple tests, and since it turned out to be a loop in a loop that I could see an easy to way to fix for singleton sets and sets with a small range, I couldn't help trying it... But I'm still trying to understand the bigger picture. I'll be interested to compare profiles with the ordered append_rel_list version you have mentioned, to see how that moves the hot spots. I guess one very practical question to ask is: can we plan queries with realistic numbers of partitioned tables and partitions in reasonable times? Well, it certainly looks very good for hundreds of partitions so far... My own experience of partitioning with other RDBMSs has been on that order, 'monthly partitions covering the past 10 years' and similar, but on the other hand it wouldn't be surprising to learn that people want to go to many thousands, especially for schemas which just keep adding partitions over time and don't want to drop them. As for hash partitioning, that seems to be typically done with numbers like 16, 32 or 64 in other products from what I can glean. Speculation: perhaps hash partitioning is more motivated by parallelism than data maintenance and thus somehow anchored to the ground by core counts; if so no planning performance worries there I guess (until core counts double quite a few more times). One nice thing about the planning time is that restrictions on the partition key cut down planning time; so where I measure ~7 seconds to plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions, if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so if someone has a keep-adding-more-partitions-over-time model then at least their prunable current day/week/whatever queries will not suffer quite so badly. (Yeah my computer seems to be a lot slower than yours for these tests; clang -O2 no asserts on a mid 2014 MBP with i7 @ 2.2Ghz). Curious: would you consider joins between partitioned tables and non-partitioned tables where the join is pushed down to be a kind of "partition-wise join", or something else? If so, would that be a special case, or just the logical extreme case for 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where one single "partition" on the non-partitioned side maps to all the partitions on the partitioned size? -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Aug 9, 2017 at 7:09 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > I started reviewing this. It's nicely commented, but it's also very > complicated, and it's going to take me several rounds to understand > what all the parts do, but here's some assorted feedback after reading > some parts of the patches, some tinkering and quite a bit of time > spent trying to break it (so far unsuccessfully). > Thanks for testing the patch. Good to know it has withstood your testing. > On my computer it took ~1.5 seconds to plan a 1000 partition join, > ~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a > 4000 partition join. I poked around in a profiler a bit and saw that > for the 2000 partition case I spent almost half the time in > create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle, > and about half of that was in bms_is_subset. The other half the time > was in query_planner->make_one_rel which spent 2/3 of its time in > set_rel_size->add_child_rel_equivalences->bms_overlap and the other > 1/3 in standard_join_search. Thanks for profiling. I have separately mailed about bitmapset improvements. Equivalence classes contain all the expressions which are known to be equal in EquivalenceClass::ec_members. For a partitioned table, there will be as many expressions as the number of children. The child expressions are marked as em_is_child and are looked at only when child relids are available to the function scanning the members. The number of equivalence members increases linearly with the number of partitions, and the number of words in the bitmaps increases linearly with the number of partitions, effectively the the number of words scanned increases quadratically. Hence the superlinear increase in time with the number of partitions. When I took separate profiles with 1000, 2000 and 4000 partitions resp. I see that 15%, 29% and 40% time spent in bms_is_subset() resp. I am not sure how much we can do in this patchset to reduce this problem. Apart from your bitmapset optimization, we could perhaps use some more efficient data structure other than list to search members based on the relids OR re-use parent's expressions for child somehow. I have been thinking about the second option, but never got a chance to work on it. > > When using list-based partitions, it must be possible to omit the part > of a join key that is implied by the partition because the partition > has only one list value. For example, if I create a two level > hierarchy with one partition per US state and then time-based range > partitions under that, the state part of this merge condition is > redundant: > > Merge Cond: ((sales_wy_2017_10.state = > purchases_wy_2017_10.state) AND (sales_wy_2017_10.created = > purchases_wy_2017_10.created)) That's a good idea. In fact, we could use a similar trick when the condition is sales_wy_2017_10.state = 'state'. We can not use the trick in case of DML or when there are locking clauses, since we need to evaluate the qual in case the row underneath changes while locking it. We also can not do this when one of the keys being compared is a nullable partition key (a concept explained in partition-wise join implementation patch), since a partition can have also have rows with NULL values for such partition keys in that partition. I think the idea has merit, although, I think we should handle it targetting more generic cases like the one stated above. > > 0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch > > -partition_bounds_equal(PartitionKey key, > +partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval, > PartitionBoundInfo b1, > PartitionBoundInfo b2) > > I wonder is there any value in creating a struct to represent the > common part of PartitionKey and PartitionScheme that functions like > this and others need? Maybe not. Perhaps you didn't want to make > PartitionKey contain a PartitionScheme because then you'd lose the > property that every pointer to PartitionScheme in the system must be a > pointer to an interned (canonical) PartitionScheme, so it's always > safe to compare pointers to test for equality? Right. Other reason to keep those two separate, is we might change the contents of PartitionScheme as we move forward with the reviews. May be we should revisit it after we have finalised the design. > > 0005-Canonical-partition-scheme.patch: > > +/* > + * get_relation_partition_info > + * > + * Retrieves partitioning information for a given relation. > + * > + * For a partitioned table it sets partitioning scheme, partition key > + * expressions, number of partitions and OIDs of partitions in the given > + * RelOptInfo. > + */ > +static void > +get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel, > + Relation relation) > > Would this be better called "set_relation_partition_info"? It doesn't > really "retrieve" the information, it "installs" it. Yes. Done. > > +{ > + PartitionDesc part_desc; > + > + /* No partitioning information for an unpartitioned relation. */ > + if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE || > + !(rel->part_scheme = find_partition_scheme(root, relation))) > + return; > > Here and elsewhere you use the idiom !(foo = bar), which is perfectly > good C in my book but I understand the project convention is to avoid > implicit pointer->boolean conversion and to prefer expressions like > (foo = bar) != NULL and there is certainly a lot more code like that. PG code uses both the styles, search "if (!" in execExpr.c, createplan.c for example. I find this style useful, when I want to code, say "if this relation does not have a partitioning scheme" rather than "if this relation have NULL partitioning scheme". > > 0007-Partition-wise-join-implementation.patch > > + {"enable_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD, > > This GUC should appear in postgresql.conf.sample. Done. Attached patches with the comments addressed. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Thu, Aug 10, 2017 at 3:13 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Your patch didn't improve planning time without partition-wise join, >> so it's something good to have along-with partition-wise join. Given >> that Bitmapsets are used in other parts of code as well, the >> optimization may affect those parts as well, esp. the overhead of >> maintaining first_non_empty_wordnum. > > Maybe, but if you consider that this container already deals with the > upper bound moving up by reallocating and copying the whole thing, > adjusting an int when the lower bound moves down doesn't seem like > anything to worry about... Yeah. May be we should check whether that makes any difference to planning times of TPC-H queries. If it shows any difference. > >> Do you think we solving this problem is a prerequisite for >> partition-wise join? Or should we propose that patch as a separate >> enhancement? > > No, I'm not proposing anything yet. For now I just wanted to share > this observation about where hot CPU time goes in simple tests, and > since it turned out to be a loop in a loop that I could see an easy to > way to fix for singleton sets and sets with a small range, I couldn't > help trying it... But I'm still trying to understand the bigger > picture. I'll be interested to compare profiles with the ordered > append_rel_list version you have mentioned, to see how that moves the > hot spots. build_simple_rel() which contains that loop takes only .23% of planning time. So, I doubt if that's going to change much. + 0.23% postgres postgres [.] build_simple_rel ▒ > > I guess one very practical question to ask is: can we plan queries > with realistic numbers of partitioned tables and partitions in > reasonable times? Well, it certainly looks very good for hundreds of > partitions so far... My own experience of partitioning with other > RDBMSs has been on that order, 'monthly partitions covering the past > 10 years' and similar, but on the other hand it wouldn't be surprising > to learn that people want to go to many thousands, especially for > schemas which just keep adding partitions over time and don't want to > drop them. As for hash partitioning, that seems to be typically done > with numbers like 16, 32 or 64 in other products from what I can > glean. Speculation: perhaps hash partitioning is more motivated by > parallelism than data maintenance and thus somehow anchored to the > ground by core counts; if so no planning performance worries there I > guess (until core counts double quite a few more times). Agreed. > > One nice thing about the planning time is that restrictions on the > partition key cut down planning time; so where I measure ~7 seconds to > plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions, > if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so > if someone has a keep-adding-more-partitions-over-time model then at > least their prunable current day/week/whatever queries will not suffer > quite so badly. (Yeah my computer seems to be a lot slower than yours > for these tests; clang -O2 no asserts on a mid 2014 MBP with i7 @ > 2.2Ghz). That's interesting observation. Thanks for sharing it. > > Curious: would you consider joins between partitioned tables and > non-partitioned tables where the join is pushed down to be a kind of > "partition-wise join", or something else? If so, would that be a > special case, or just the logical extreme case for > 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where > one single "partition" on the non-partitioned side maps to all the > partitions on the partitioned size? > Parameterized nest loop joins with partition key as parameters simulate something like that. Apart from that case, I don't see any case where such a join would be more efficient compared to the current method of ganging all partitions and joining them to the unpartitioned table. But oh wait, that could be useful in sharding, when the unpartitioned table is replicated and partitioned table is distributed across shards. So, yes, that's a useful case. I am not sure whether it's some kind of partition-wise join; it doesn't matter, it looks useful. Said that, I am not planning to handle it in the near future. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Thu, Aug 10, 2017 at 5:43 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> Do you think we solving this problem is a prerequisite for >> partition-wise join? Or should we propose that patch as a separate >> enhancement? > > No, I'm not proposing anything yet. For now I just wanted to share > this observation about where hot CPU time goes in simple tests, and > since it turned out to be a loop in a loop that I could see an easy to > way to fix for singleton sets and sets with a small range, I couldn't > help trying it... But I'm still trying to understand the bigger > picture. I'll be interested to compare profiles with the ordered > append_rel_list version you have mentioned, to see how that moves the > hot spots. Perhaps this is stating the obvious, but it's often better to optimize things like this at a higher level, rather than by tinkering with stuff like Bitmapset. On the other hand, sometimes micro-optimizations are the way to go, because optimizing find_ec_member_for_tle(), for example, might involve a much broader rethink of the planner code than we want to undertake right now. > I guess one very practical question to ask is: can we plan queries > with realistic numbers of partitioned tables and partitions in > reasonable times? Well, it certainly looks very good for hundreds of > partitions so far... My own experience of partitioning with other > RDBMSs has been on that order, 'monthly partitions covering the past > 10 years' and similar, but on the other hand it wouldn't be surprising > to learn that people want to go to many thousands, especially for > schemas which just keep adding partitions over time and don't want to > drop them. I've been thinking that it would be good if this feature - and other new partitioning features - could scale to about 1000 partitions without too much trouble. Eventually, it might be nice to scale higher, but there's not much point in making partition-wise join scale to 100,000 partitions if we've got some other part of the system that runs into trouble beyond 250. > Curious: would you consider joins between partitioned tables and > non-partitioned tables where the join is pushed down to be a kind of > "partition-wise join", or something else? If so, would that be a > special case, or just the logical extreme case for > 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where > one single "partition" on the non-partitioned side maps to all the > partitions on the partitioned size? I think this is actually a really important case which we've just excluded from the initial scope because the problem is hard enough already. But it's quite possible that if you are joining partitioned tables A and B with unpartitioned table X, the right join order could be A-X-B; the A-X join might knock out a lot of rows. It's not great to have to pick between doing the A-B join partitionwise and doing the A-X join first; you want to get both things. But we can't do everything at once. Further down the road, there's more than one way of doing the A-X join. You could join each partition of A to all of X, which is likely optimal if for example you are doing a nested loop with an inner index scan on X. But you could also partition X on the fly using A's partitioning scheme and then join partitions of A against the on-the-fly-partitioned version of X. That's likely to be a lot better for a merge join with an underlying sort on X. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Attached patches with the comments addressed. I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4 and e139f1953f29db245f60a7acb72fccb1e05d2442. 0004 doesn't apply any more, probably due to commit d57929afc7063431f80a0ac4c510fc39aacd22e6. I think something along these lines could be separately committed prior to the main patch, and I think that would be a good idea just to flush out any bugs in this part independently of the rest. However, I also think that we probably ought to try to get Amit Langote's changes to this function to repair the locking order and expand in bound order committed before proceeding with these changes. In fact, I think there's a certain amount of conflict between what's being discussed over there and what you're trying to do here. In that thread, we propose to move partitioned tables at any level to the front of the inheritance expansion. Here, however, you want to expand level by level. I think partitioned-tables-first is the right approach for the reasons discussed on the other thread; namely, we want to be able to prune leaf partitions before expanding them, but that requires us to expand all the non-leaf tables first to maintain a consistent locking order in all scenarios. So the approach you've taken in this patch may need to be re-thought somewhat. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Aug 15, 2017 at 10:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Attached patches with the comments addressed. > > I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4 > and e139f1953f29db245f60a7acb72fccb1e05d2442. Thanks a lot Robert. Some less patches to maintain :). > > 0004 doesn't apply any more, probably due to commit > d57929afc7063431f80a0ac4c510fc39aacd22e6. I think something along > these lines could be separately committed prior to the main patch, and > I think that would be a good idea just to flush out any bugs in this > part independently of the rest. However, I also think that we > probably ought to try to get Amit Langote's changes to this function > to repair the locking order and expand in bound order committed before > proceeding with these changes. I am reviewing those changes and contribute to that thread if necessary. > > In fact, I think there's a certain amount of conflict between what's > being discussed over there and what you're trying to do here. In that > thread, we propose to move partitioned tables at any level to the > front of the inheritance expansion. Here, however, you want to expand > level by level. I think partitioned-tables-first is the right > approach for the reasons discussed on the other thread; namely, we > want to be able to prune leaf partitions before expanding them, but > that requires us to expand all the non-leaf tables first to maintain a > consistent locking order in all scenarios. So the approach you've > taken in this patch may need to be re-thought somewhat. > There are two ways we can do this 1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex) or just childRTIndex (using this we can fetch childRTE calling rtfetch()) of intermediate partitioned tables. Once we are done expanding immediate children, call expand_inherited_rtentry() recursively on this list. 2. expand_inherited_tables() scans root->parse->rtable only upto the end of original range table list. Make it go beyond that end, expanding any new entries added for intermediate partitions. FWIW, the first option allows us to keep all AppendRelInfos corresponding to one partitioned relation together and also expands the whole partition hierarchy in one go. Second will require minimal changes to expand_inherited_rtentry(). Both approaches will spend time scanning same number of RTE; the first will have them in different lists, and second will have them in root->parse->rtable. I don't see one being more attractive than the other. Do you have any opinion? I will submit the rebased patches after reviewing/adjusting Amit's changes and also the changes in expand_inherited_rtentry() after we have concluded the approach to be taken. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Wed, Aug 16, 2017 at 3:31 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > There are two ways we can do this > 1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex) > or just childRTIndex (using this we can fetch childRTE calling > rtfetch()) of intermediate partitioned tables. Once we are done > expanding immediate children, call expand_inherited_rtentry() > recursively on this list. > > 2. expand_inherited_tables() scans root->parse->rtable only upto the > end of original range table list. Make it go beyond that end, > expanding any new entries added for intermediate partitions. > > FWIW, the first option allows us to keep all AppendRelInfos > corresponding to one partitioned relation together and also expands > the whole partition hierarchy in one go. Second will require minimal > changes to expand_inherited_rtentry(). Both approaches will spend time > scanning same number of RTE; the first will have them in different > lists, and second will have them in root->parse->rtable. I don't see > one being more attractive than the other. Do you have any opinion? I don't like option (2). I'm not sure about option (1). I think maybe we should have two nested loops in expanded_inherited_rtentry(), the outer one iterating over partitioned tables (or just the original parent RTE if partitioning is not involved) and then inner one looping over individual leaf partitions for each partitioned table. Probably we'd end up wanting to move at least some of the logic inside the existing loop into a subroutine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Aug 16, 2017 at 5:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Aug 16, 2017 at 3:31 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> There are two ways we can do this >> 1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex) >> or just childRTIndex (using this we can fetch childRTE calling >> rtfetch()) of intermediate partitioned tables. Once we are done >> expanding immediate children, call expand_inherited_rtentry() >> recursively on this list. >> >> 2. expand_inherited_tables() scans root->parse->rtable only upto the >> end of original range table list. Make it go beyond that end, >> expanding any new entries added for intermediate partitions. >> >> FWIW, the first option allows us to keep all AppendRelInfos >> corresponding to one partitioned relation together and also expands >> the whole partition hierarchy in one go. Second will require minimal >> changes to expand_inherited_rtentry(). Both approaches will spend time >> scanning same number of RTE; the first will have them in different >> lists, and second will have them in root->parse->rtable. I don't see >> one being more attractive than the other. Do you have any opinion? > > I don't like option (2). I'm not sure about option (1). I think > maybe we should have two nested loops in expanded_inherited_rtentry(), > the outer one iterating over partitioned tables (or just the original > parent RTE if partitioning is not involved) and then inner one looping > over individual leaf partitions for each partitioned table. Probably > we'd end up wanting to move at least some of the logic inside the > existing loop into a subroutine. I originally thought to provide it along-with the changes to expand_inherited_rtentry(), but that thread is taking longer. Jeevan Chalke needs rebased patches for his work on aggregate pushdown and Thomas might need them for further review. So, here they are. The last two patches in this set implement the advanced partition matching algorithm. Those patches are here for ready reference. One can observe that patch doesn't change much of the basic partition-wise join implementation. I am starting a new thread for discussing the advanced partition matching algorithm. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
From
Antonin Houska
Date:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I originally thought to provide it along-with the changes to > expand_inherited_rtentry(), but that thread is taking longer. Jeevan > Chalke needs rebased patches for his work on aggregate pushdown and > Thomas might need them for further review. So, here they are. Since I have related patch in the current commitfest (https://commitfest.postgresql.org/14/1247/), I spent some time reviewing your patch: * generate_partition_wise_join_paths() Right parenthesis is missing in the prologue. * get_partitioned_child_rels_for_join() I think the Assert() statement is easier to understand inside the loop, see the assert.diff attachment. * have_partkey_equi_join() As the function handles generic join, this comment doesn't seem to me relevant: /* * The equi-join between partition keys is strict if equi-join between * at least one partition key is using a strict operator. See * explanation about outer join reordering identity 3 in * optimizer/README */ strict_op = op_strict(opexpr->opno); And I think the function can return true even if strict_op is false for all the operators evaluated in the loop. * match_expr_to_partition_keys() I'm not sure this comment is clear enough: /* * If it's a strict equi-join a NULL partition key on one side will * not join a NULL partition key on the other side. So, rows with NULL * partition key from a partition on one side can not join with those * from a non-matching partition on the other side. So, search the * nullable partition keys as well. */ if (!strict_op) continue; My understanding of the problem of NULL values generated by outer join is: these NULL values --- if evaluated by non-strict expression --- can make row of N-th partition on one side of the join match row(s) of *other than* N-th partition(s) on the other side. Thus the nullable input expressions may only be evaluated by strict operators. I think it'd be clearer if you stressed that (undesired) *match* of partition keys can be a problem, rather than mismatch. If you insist on your wording, then I think you should at least move the comment below to the part that only deals with strict operators. * There are several places where lfirst_node() macro should be used. For example rel = lfirst_node(RelOptInfo, lc); instead of rel = (RelOptInfo *) lfirst(lc); * map_and_merge_partitions() Besides a few changes proposed in map_and_merge_partitions.diff (a few of them to suppress compiler warnings) I think that this part needs more thought: { Assert(mergemap1[index1] != mergemap2[index2] && mergemap1[index1] >= 0 && mergemap2[index2] >= 0); /* * Both the partitions map to different merged partitions. This * means that multiple partitions from one relation matches to one * partition from the other relation. Partition-wise join does not * handle this case right now, since it requires ganging multiple * partitions together (into one RelOptInfo). */ merged_index = -1; } I could hit this path with the following test: CREATE TABLE a(i int) PARTITION BY LIST(i); CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2); CREATE TABLE b(j int) PARTITION BY LIST(j); CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2); SET enable_partition_wise_join TO on; SELECT * FROM a FULL JOIN b ON i = j; I don't think there's a reason not to join a_0 partition to b_0, is there? -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
Here's set of patches rebased on the latest head. This rebase mainly changes patch 0001, which translates partition hierarchy into inheritance hierarchy creating AppendRelInfos and RelOptInfos for partitioned partitions. Because of that, it's not necessary to record the partitioned partitions in a PartitionedChildRelInfos::child_rels. The only RTI that goes in there is the RTI of child RTE which is same as the parent RTE except inh flag. I tried removing that with a series of changes but it seems that following code in ExecInitModifyTable() requires it. 1897 /* The root table RT index is at the head of the partitioned_rels list */ 1898 if (node->partitioned_rels) 1899 { 1900 Index root_rti; 1901 Oid root_oid; 1902 1903 root_rti = linitial_int(node->partitioned_rels); 1904 root_oid = getrelid(root_rti, estate->es_range_table); 1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ 1906 } 1907 else 1908 rel = mtstate->resultRelInfo->ri_RelationDesc; I don't know whether we could change this code not to use PartitionedChildRelInfos::child_rels. Removing PartitionedChildRelInfos machinary seems like a separate patch. The last two patches implement the advanced partition matching algorithm and are here in this set for ready reference. Please use [1] for discussing/reviewing those. [1] https://www.postgresql.org/message-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Etsuro Fujita
Date:
On 2017/09/02 4:10, Ashutosh Bapat wrote: > This rebase mainly changes patch 0001, which translates partition > hierarchy into inheritance hierarchy creating AppendRelInfos and > RelOptInfos for partitioned partitions. Because of that, it's not > necessary to record the partitioned partitions in a > PartitionedChildRelInfos::child_rels. The only RTI that goes in there > is the RTI of child RTE which is same as the parent RTE except inh > flag. I tried removing that with a series of changes but it seems that > following code in ExecInitModifyTable() requires it. > 1897 /* The root table RT index is at the head of the > partitioned_rels list */ > 1898 if (node->partitioned_rels) > 1899 { > 1900 Index root_rti; > 1901 Oid root_oid; > 1902 > 1903 root_rti = linitial_int(node->partitioned_rels); > 1904 root_oid = getrelid(root_rti, estate->es_range_table); > 1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ > 1906 } > 1907 else > 1908 rel = mtstate->resultRelInfo->ri_RelationDesc; > > I don't know whether we could change this code not to use > PartitionedChildRelInfos::child_rels. Though I haven't read the patch yet, I think the above code is useless. And I proposed a patch to clean it up before [1]. I'll add that patch to the next commitfest. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/93cf9816-2f7d-0f67-8ed2-4a4e497a6ab8%40lab.ntt.co.jp
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/04 12:38, Etsuro Fujita wrote: > On 2017/09/02 4:10, Ashutosh Bapat wrote: >> This rebase mainly changes patch 0001, which translates partition >> hierarchy into inheritance hierarchy creating AppendRelInfos and >> RelOptInfos for partitioned partitions. Because of that, it's not >> necessary to record the partitioned partitions in a >> PartitionedChildRelInfos::child_rels. The only RTI that goes in there >> is the RTI of child RTE which is same as the parent RTE except inh >> flag. I tried removing that with a series of changes but it seems that >> following code in ExecInitModifyTable() requires it. >> 1897 /* The root table RT index is at the head of the >> partitioned_rels list */ >> 1898 if (node->partitioned_rels) >> 1899 { >> 1900 Index root_rti; >> 1901 Oid root_oid; >> 1902 >> 1903 root_rti = linitial_int(node->partitioned_rels); >> 1904 root_oid = getrelid(root_rti, estate->es_range_table); >> 1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ >> 1906 } >> 1907 else >> 1908 rel = mtstate->resultRelInfo->ri_RelationDesc; >> >> I don't know whether we could change this code not to use >> PartitionedChildRelInfos::child_rels. For a root partitioned tables, ModifyTable.partitioned_rels comes from PartitionedChildRelInfo.child_rels recorded for the table by expand_inherited_rtnentry(). In fact, the latter is copied verbatim to ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same. The only point of keeping those RT indexes around in the ModifyTable node is for the executor to be able to locate partitioned table RT entries and lock them. Without them, the executor wouldn't know about those tables at all, because there won't be subplans corresponding to partitioned tables in the tree and hence their RT indexes won't appear in the ModifyTable.resultRelations list. If your patch adds partitioned child rel AppendRelInfos back for whatever reason, you should also make sure in inheritance_planner() that their RT indexes don't end up the resultRelations list. See this piece of code in inheritance_planner(): 1351 /* Build list of sub-paths */ 1352 subpaths = lappend(subpaths, subpath); 1353 1354 /* Build list of modified subroots, too */ 1355 subroots = lappend(subroots, subroot); 1356 1357 /* Build list of target-relation RT indexes */ 1358 resultRelations = lappend_int(resultRelations, appinfo->child_relid); Maybe it won't happen, because if this appinfo corresponds to a partitioned child table, recursion would occur and we'll get to this piece of code for only the leaf children. By the way, if you want to get rid of PartitionedChildRelInfo, you can do that as long as you find some other way of putting together the partitioned_rels list to add into the ModifyTable (Append/MergeAppend) node created for the root partitioned table. Currently, PartitionedChildRelInfo (and the root->pcinfo_list) is the way for expand_inherited_rtentry() to pass that information to the planner's path-generating code. We may be able to generate that list when actually creating the path using set_append_rel_pathlist() or inheritance_planner(), without having created a PartitionedChildRelInfo node beforehand. > Though I haven't read the patch yet, I think the above code is useless. > And I proposed a patch to clean it up before [1]. I'll add that patch to > the next commitfest. +1. Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <ah@cybertec.at> wrote: > Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > >> I originally thought to provide it along-with the changes to >> expand_inherited_rtentry(), but that thread is taking longer. Jeevan >> Chalke needs rebased patches for his work on aggregate pushdown and >> Thomas might need them for further review. So, here they are. > > Since I have related patch in the current commitfest > (https://commitfest.postgresql.org/14/1247/), I spent some time reviewing your > patch: > > * generate_partition_wise_join_paths() > > Right parenthesis is missing in the prologue. Thanks for pointing that out. Fixed. > > > * get_partitioned_child_rels_for_join() > > I think the Assert() statement is easier to understand inside the loop, see > the assert.diff attachment. The assert at the end of function also checks that we have got child_rels lists for all the parents passed in. That is not checked by your version. Furthermore, we would checked that each child_rels has at least one element while buildings paths for base relations. Checking the same again for joins doesn't add any value. > > > * have_partkey_equi_join() > > As the function handles generic join, this comment doesn't seem to me > relevant: > > /* > * The equi-join between partition keys is strict if equi-join between > * at least one partition key is using a strict operator. See > * explanation about outer join reordering identity 3 in > * optimizer/README > */ > strict_op = op_strict(opexpr->opno); What in that comment is not exactly relevant? > > And I think the function can return true even if strict_op is false for all > the operators evaluated in the loop. I think it does that. Do you have a case where it doesn't? > > > * match_expr_to_partition_keys() > > I'm not sure this comment is clear enough: > > /* > * If it's a strict equi-join a NULL partition key on one side will > * not join a NULL partition key on the other side. So, rows with NULL > * partition key from a partition on one side can not join with those > * from a non-matching partition on the other side. So, search the > * nullable partition keys as well. > */ > if (!strict_op) > continue; > > My understanding of the problem of NULL values generated by outer join is: > these NULL values --- if evaluated by non-strict expression --- can make row > of N-th partition on one side of the join match row(s) of *other than* N-th > partition(s) on the other side. Thus the nullable input expressions may only > be evaluated by strict operators. I think it'd be clearer if you stressed that > (undesired) *match* of partition keys can be a problem, rather than mismatch Sorry, I am not able to understand this. To me it looks like my wording conveys what you are saying. > > If you insist on your wording, then I think you should at least move the > comment below to the part that only deals with strict operators. Done. > > > * There are several places where lfirst_node() macro should be used. For > example > > rel = lfirst_node(RelOptInfo, lc); > > instead of > > rel = (RelOptInfo *) lfirst(lc); Thanks for that. > > > * map_and_merge_partitions() > > Besides a few changes proposed in map_and_merge_partitions.diff (a few of them > to suppress compiler warnings) I think that this part needs more thought: > > { > Assert(mergemap1[index1] != mergemap2[index2] && > mergemap1[index1] >= 0 && mergemap2[index2] >= 0); > > /* > * Both the partitions map to different merged partitions. This > * means that multiple partitions from one relation matches to one > * partition from the other relation. Partition-wise join does not > * handle this case right now, since it requires ganging multiple > * partitions together (into one RelOptInfo). > */ > merged_index = -1; > } > > I could hit this path with the following test: > > CREATE TABLE a(i int) PARTITION BY LIST(i); > CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2); > CREATE TABLE b(j int) PARTITION BY LIST(j); > CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2); > > SET enable_partition_wise_join TO on; > > SELECT * > FROM a > FULL JOIN > b ON i = j; > > I don't think there's a reason not to join a_0 partition to b_0, is there? With the latest patchset I am seeing that partition-wise join is used in this case. I have started a new thread [1] for advanced partition matching patches. Please post review comments about the last two patches on that thread. [1] https://www.postgresql.org/message-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com Attached patchset with above comments addressed. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/04 12:38, Etsuro Fujita wrote: >> On 2017/09/02 4:10, Ashutosh Bapat wrote: >>> This rebase mainly changes patch 0001, which translates partition >>> hierarchy into inheritance hierarchy creating AppendRelInfos and >>> RelOptInfos for partitioned partitions. Because of that, it's not >>> necessary to record the partitioned partitions in a >>> PartitionedChildRelInfos::child_rels. The only RTI that goes in there >>> is the RTI of child RTE which is same as the parent RTE except inh >>> flag. I tried removing that with a series of changes but it seems that >>> following code in ExecInitModifyTable() requires it. >>> 1897 /* The root table RT index is at the head of the >>> partitioned_rels list */ >>> 1898 if (node->partitioned_rels) >>> 1899 { >>> 1900 Index root_rti; >>> 1901 Oid root_oid; >>> 1902 >>> 1903 root_rti = linitial_int(node->partitioned_rels); >>> 1904 root_oid = getrelid(root_rti, estate->es_range_table); >>> 1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ >>> 1906 } >>> 1907 else >>> 1908 rel = mtstate->resultRelInfo->ri_RelationDesc; >>> >>> I don't know whether we could change this code not to use >>> PartitionedChildRelInfos::child_rels. > > For a root partitioned tables, ModifyTable.partitioned_rels comes from > PartitionedChildRelInfo.child_rels recorded for the table by > expand_inherited_rtnentry(). In fact, the latter is copied verbatim to > ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same. > The only point of keeping those RT indexes around in the ModifyTable node > is for the executor to be able to locate partitioned table RT entries and > lock them. Without them, the executor wouldn't know about those tables at > all, because there won't be subplans corresponding to partitioned tables > in the tree and hence their RT indexes won't appear in the > ModifyTable.resultRelations list. If your patch adds partitioned child > rel AppendRelInfos back for whatever reason, you should also make sure in > inheritance_planner() that their RT indexes don't end up the > resultRelations list. See this piece of code in inheritance_planner(): > > 1351 /* Build list of sub-paths */ > 1352 subpaths = lappend(subpaths, subpath); > 1353 > 1354 /* Build list of modified subroots, too */ > 1355 subroots = lappend(subroots, subroot); > 1356 > 1357 /* Build list of target-relation RT indexes */ > 1358 resultRelations = lappend_int(resultRelations, > appinfo->child_relid); > > Maybe it won't happen, because if this appinfo corresponds to a > partitioned child table, recursion would occur and we'll get to this piece > of code for only the leaf children. You are right. We don't execute above lines for partitioned partitions. > > By the way, if you want to get rid of PartitionedChildRelInfo, you can do > that as long as you find some other way of putting together the > partitioned_rels list to add into the ModifyTable (Append/MergeAppend) > node created for the root partitioned table. Currently, > PartitionedChildRelInfo (and the root->pcinfo_list) is the way for > expand_inherited_rtentry() to pass that information to the planner's > path-generating code. We may be able to generate that list when actually > creating the path using set_append_rel_pathlist() or > inheritance_planner(), without having created a PartitionedChildRelInfo > node beforehand. AFAIU, the list contained RTIs of the relations, which didnt' have corresponding AppendRelInfos to lock those relations. Now that we create AppendRelInfos even for partitioned partitions, I don't think we need the list to take care of the locks. Is there any other reason why we maintain that list (apart from the trigger case I have raised and Fujita-san says that the list is not required in that case as well.) > >> Though I haven't read the patch yet, I think the above code is useless. >> And I proposed a patch to clean it up before [1]. I'll add that patch to >> the next commitfest. > > +1. +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Etsuro Fujita
Date:
On 2017/09/04 21:32, Ashutosh Bapat wrote: > On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote >> By the way, if you want to get rid of PartitionedChildRelInfo, you can do >> that as long as you find some other way of putting together the >> partitioned_rels list to add into the ModifyTable (Append/MergeAppend) >> node created for the root partitioned table. Currently, >> PartitionedChildRelInfo (and the root->pcinfo_list) is the way for >> expand_inherited_rtentry() to pass that information to the planner's >> path-generating code. We may be able to generate that list when actually >> creating the path using set_append_rel_pathlist() or >> inheritance_planner(), without having created a PartitionedChildRelInfo >> node beforehand. > > AFAIU, the list contained RTIs of the relations, which didnt' have > corresponding AppendRelInfos to lock those relations. Now that we > create AppendRelInfos even for partitioned partitions, I don't think > we need the list to take care of the locks. I don't think so either. (Since I haven't followed discussions on this thread in detail yet, I don't understand the idea/need of creating AppendRelInfos for partitioned partitions, though.) >>> Though I haven't read the patch yet, I think the above code is useless. >>> And I proposed a patch to clean it up before [1]. I'll add that patch to >>> the next commitfest. >> >> +1. > +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list? No. The patch just removes the partitioned_rels list from nodeModifyTable.c. Best regards, Etsuro Fujita
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/04 21:32, Ashutosh Bapat wrote: > On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote wrote: >> By the way, if you want to get rid of PartitionedChildRelInfo, you can do >> that as long as you find some other way of putting together the >> partitioned_rels list to add into the ModifyTable (Append/MergeAppend) >> node created for the root partitioned table. Currently, >> PartitionedChildRelInfo (and the root->pcinfo_list) is the way for >> expand_inherited_rtentry() to pass that information to the planner's >> path-generating code. We may be able to generate that list when actually >> creating the path using set_append_rel_pathlist() or >> inheritance_planner(), without having created a PartitionedChildRelInfo >> node beforehand. > > AFAIU, the list contained RTIs of the relations, which didnt' have > corresponding AppendRelInfos to lock those relations. Now that we > create AppendRelInfos even for partitioned partitions, I don't think > we need the list to take care of the locks. Is there any other reason > why we maintain that list (apart from the trigger case I have raised > and Fujita-san says that the list is not required in that case as > well.) We do *need* the list in ModifyTable (Append/MergeAppend) node itself. We can, however, get rid of the PartitionedChildRelInfo node that carries the partitioned child RT indexes from an earlier planning phase (expand_inherited_rtentry) to a later phase (create_{modifytable|append|merge_append}_path). The later phase can build that list from the AppendRelInfos that you mention we now [1] build. >>> Though I haven't read the patch yet, I think the above code is useless. >>> And I proposed a patch to clean it up before [1]. I'll add that patch to >>> the next commitfest. >> >> +1. > +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list? As Fujita-san mentioned, his patch won't. Actually, I suppose he didn't say that partitioned_rels itself is useless, just that its particular usage in ExecInitModifyTable is. We still need that list for planner to tell the executor that there are some RT entries the latter would need to lock before executing a given plan. Without that dedicated list, the executor cannot know at all that certain tables in the partition tree (viz. the partitioned ones) need to be locked. I mentioned the reason - (Merge)Append.subplans, ModifyTable.resultRelations does not contain respective entries corresponding to the partitioned tables, and traditionally, the executor looks at those lists to figure out the tables to lock. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/05 13:20, Amit Langote wrote: > The later phase can > build that list from the AppendRelInfos that you mention we now [1] build. > > [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154 Looking at that commit again, AppendRelInfos are still not created for partitioned child tables. Looking at the code in expand_single_inheritance_child(), which exists as of 30833ba154: /* * Build an AppendRelInfo for this parent and child, unless the child is a * partitioned table. */ if (childrte->relkind!= RELKIND_PARTITIONED_TABLE && !childrte->inh) { ...code that builds AppendRelInfo... } else *partitioned_child_rels = lappend_int(*partitioned_child_rels, childRTindex); you can see that an AppendRelInfo won't get built for partitioned child tables. Also, even if the commit changed things so that the child RT entries (and AppendRelInfos) now get built in an order determined by depth-first traversal of the partition tree, the same original parent RT index is used to mark all AppendRelInfos, so the expansion essentially flattens the hierarchy. In the updated patch I will post on the "path toward faster partition pruning" thread [1], I am planning to rejigger things so that two things start to happen: 1. For partitioned child tables, build the child RT entry with inh = true and also build an AppendRelInfos 2. When recursively expanding a partitioned child table in expand_partitioned_rtentry(), pass its new RT index as the parentRTindexto the recursive call of expand_partitioned_rtentry(), so that the resulting AppendRelInfos reflect immediateparent-child relationship With 1 in place, build_simple_rel() will build RelOptInfos even for partitioned child tables, so that for each one, we can recursively build an Append path. So, instead of just one Append path for the root partitioned table, there is one for each partitioned table in the tree. I will be including the above described change in the partition-pruning patch, because the other code in that patch relies on the same and I know Ashuotsh has wanted that for a long time. :) Thanks, Amit [1] https://www.postgresql.org/message-id/044e2e09-9690-7aff-1749-2d318da38a11%40lab.ntt.co.jp
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 5, 2017 at 11:54 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/05 13:20, Amit Langote wrote: >> The later phase can >> build that list from the AppendRelInfos that you mention we now [1] build. >> >> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154 > > Looking at that commit again, AppendRelInfos are still not created for > partitioned child tables. Looking at the code in > expand_single_inheritance_child(), which exists as of 30833ba154: > > > /* > * Build an AppendRelInfo for this parent and child, unless the child is a > * partitioned table. > */ > if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh) > { > ...code that builds AppendRelInfo... > } > else > *partitioned_child_rels = lappend_int(*partitioned_child_rels, > childRTindex); > > you can see that an AppendRelInfo won't get built for partitioned child > tables. > > Also, even if the commit changed things so that the child RT entries (and > AppendRelInfos) now get built in an order determined by depth-first > traversal of the partition tree, the same original parent RT index is used > to mark all AppendRelInfos, so the expansion essentially flattens the > hierarchy. In the updated patch I will post on the "path toward faster > partition pruning" thread [1], I am planning to rejigger things so that > two things start to happen: > > 1. For partitioned child tables, build the child RT entry with inh = true > and also build an AppendRelInfos > > 2. When recursively expanding a partitioned child table in > expand_partitioned_rtentry(), pass its new RT index as the > parentRTindex to the recursive call of expand_partitioned_rtentry(), so > that the resulting AppendRelInfos reflect immediate parent-child > relationship > > With 1 in place, build_simple_rel() will build RelOptInfos even for > partitioned child tables, so that for each one, we can recursively build > an Append path. So, instead of just one Append path for the root > partitioned table, there is one for each partitioned table in the tree. > > I will be including the above described change in the partition-pruning > patch, because the other code in that patch relies on the same and I know > Ashuotsh has wanted that for a long time. :) Those changes are already part of my updated 0001 patch. Aren't they? May be you should just review those and see if those are suitable for you? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Etsuro Fujita
Date:
On 2017/09/05 13:20, Amit Langote wrote: > On 2017/09/04 21:32, Ashutosh Bapat wrote: >> +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list? > > As Fujita-san mentioned, his patch won't. Actually, I suppose he didn't > say that partitioned_rels itself is useless, just that its particular > usage in ExecInitModifyTable is. That's right. (I thought there would probably be no need to create that list if we created AppendRelInfos even for partitioned partitions.) > We still need that list for planner to > tell the executor that there are some RT entries the latter would need to > lock before executing a given plan. Without that dedicated list, the > executor cannot know at all that certain tables in the partition tree > (viz. the partitioned ones) need to be locked. I mentioned the reason - > (Merge)Append.subplans, ModifyTable.resultRelations does not contain > respective entries corresponding to the partitioned tables, and > traditionally, the executor looks at those lists to figure out the tables > to lock. I think so too. Best regards, Etsuro Fujita
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/05 15:30, Ashutosh Bapat wrote: > Those changes are already part of my updated 0001 patch. Aren't they? > May be you should just review those and see if those are suitable for > you? Yeah, I think it's going to be the same patch, functionality-wise. And sorry, I didn't realize you were talking about the case after applying your patch on HEAD. Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 5, 2017 at 12:06 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/05 15:30, Ashutosh Bapat wrote: >> Those changes are already part of my updated 0001 patch. Aren't they? >> May be you should just review those and see if those are suitable for >> you? > > Yeah, I think it's going to be the same patch, functionality-wise. > > And sorry, I didn't realize you were talking about the case after applying > your patch on HEAD. > Ok. Can you please answer my previous questions? AFAIU, the list contained RTIs of the relations, which didnt' have corresponding AppendRelInfos to lock those relations. Now that we create AppendRelInfos even for partitioned partitions with my 0001 patch, I don't think we need the list to take care of the locks. Is there any other reason why we maintain that list (apart from the trigger case I have raised and Fujita-san says that the list is not required in that case as well.)? Having asked that, I think my patch shouldn't deal with removing partitioned_rels lists and related structures and code. It should be done as a separate patch. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/05 15:43, Ashutosh Bapat wrote: > Ok. Can you please answer my previous questions? > > AFAIU, the list contained RTIs of the relations, which didnt' have > corresponding AppendRelInfos to lock those relations. Now that we > create AppendRelInfos even for partitioned partitions with my 0001 > patch, I don't think > we need the list to take care of the locks. Is there any other reason > why we maintain that list (apart from the trigger case I have raised > and Fujita-san says that the list is not required in that case as > well.)? AppendRelInfos exist within the planner (they come to be and go away within the planner). Once we leave the planner, that information is gone. Executor will receive a plan node that won't contain that information: 1. Append has an appendplans field, which contains one plan tree for every leaf partition. None of its fields, other thanpartitined_rels, contains the RT indexes of the partitioned tables. Similarly in the case of MergeAppend. 2. ModifyTable has a resultRelations fields which contains a list of leaf partition RT indexes and a plans field which containsone plan tree for every RT index in the resultRelations list (that is a plan tree that will scan the particularleaf partition). None of its fields, other than partitined_rels, contains the RT indexes of the partitioned tables. I learned over the course of developing the patch that added this partitioned_rels field [1] that the executor needs to identify all the affected tables by a given plan tree so that it could lock them. Executor needs to lock them separately even if the plan itself was built after locking all the relevant tables [2]. For example, see ExecLockNonLeafAppendTables(), which will lock the tables in the (Merge)Append.partitioned_rels list. While I've been thinking all along that the same thing must be happening for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of times on this thread), it's actually not. Instead, ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are merged into PlannedStmt.nonleafResultRelations (which happens in set_plan_refs) and that's where the executor finds them to lock them (which happens in InitPlan). So, it appears that ModifyTable.partitioned_rels is indeed unused in the executor. But we still can't get rid of it from the ModifyTable node itself without figuring out a way (a channel) to transfer that information into PlannedStmt.nonleafResultRelations. > Having asked that, I think my patch shouldn't deal with removing > partitioned_rels lists and related structures and code. It should be> done as a separate patch. Going back to your original email which started this discussion, it seems that we agree on that the PartitionedChildRelInfo node can be removed, and I agree that it shouldn't be done in the partitionwise-join patch series but as a separate patch. As described above, we shouldn't try yet to get rid of the partitioned_rels list that appears in some plan nodes. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3cc37f1d8 [2] https://www.postgresql.org/message-id/CA%2BTgmoYiwviCDRi3Zk%2BQuXj1r7uMu9T_kDNq%2B17PCWgzrbzw8A%40mail.gmail.com
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 5, 2017 at 1:16 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/05 15:43, Ashutosh Bapat wrote: >> Ok. Can you please answer my previous questions? >> >> AFAIU, the list contained RTIs of the relations, which didnt' have >> corresponding AppendRelInfos to lock those relations. Now that we >> create AppendRelInfos even for partitioned partitions with my 0001 >> patch, I don't think >> we need the list to take care of the locks. Is there any other reason >> why we maintain that list (apart from the trigger case I have raised >> and Fujita-san says that the list is not required in that case as >> well.)? > > AppendRelInfos exist within the planner (they come to be and go away > within the planner). Once we leave the planner, that information is gone. > > Executor will receive a plan node that won't contain that information: > > 1. Append has an appendplans field, which contains one plan tree for every > leaf partition. None of its fields, other than partitined_rels, > contains the RT indexes of the partitioned tables. Similarly in the > case of MergeAppend. > > 2. ModifyTable has a resultRelations fields which contains a list of leaf > partition RT indexes and a plans field which contains one plan tree for > every RT index in the resultRelations list (that is a plan tree that > will scan the particular leaf partition). None of its fields, other > than partitined_rels, contains the RT indexes of the partitioned > tables. > > I learned over the course of developing the patch that added this > partitioned_rels field [1] that the executor needs to identify all the > affected tables by a given plan tree so that it could lock them. Executor > needs to lock them separately even if the plan itself was built after > locking all the relevant tables [2]. For example, see > ExecLockNonLeafAppendTables(), which will lock the tables in the > (Merge)Append.partitioned_rels list. > > While I've been thinking all along that the same thing must be happening > for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of > times on this thread), it's actually not. Instead, > ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are > merged into PlannedStmt.nonleafResultRelations (which happens in > set_plan_refs) and that's where the executor finds them to lock them > (which happens in InitPlan). > > So, it appears that ModifyTable.partitioned_rels is indeed unused in the > executor. But we still can't get rid of it from the ModifyTable node > itself without figuring out a way (a channel) to transfer that information > into PlannedStmt.nonleafResultRelations. Thanks a lot for this detailed analysis. IIUC, in my 0001 patch, I am not adding any partitioned partition other than the parent itself. But since every partitioned partition in turn acts as parent, it appears its own list. The list obtained by concatenating all such lists together contains all the partitioned partition RTIs. In my patch, I need to teach accumulate_append_subpath() to accumulate partitioned_rels as well. > >> Having asked that, I think my patch shouldn't deal with removing >> partitioned_rels lists and related structures and code. It should be> done as a separate patch. > > Going back to your original email which started this discussion, it seems > that we agree on that the PartitionedChildRelInfo node can be removed, and > I agree that it shouldn't be done in the partitionwise-join patch series > but as a separate patch. As described above, we shouldn't try yet to get > rid of the partitioned_rels list that appears in some plan nodes. Thanks. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 5, 2017 at 3:00 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Tue, Sep 5, 2017 at 1:16 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/09/05 15:43, Ashutosh Bapat wrote: >>> Ok. Can you please answer my previous questions? >>> >>> AFAIU, the list contained RTIs of the relations, which didnt' have >>> corresponding AppendRelInfos to lock those relations. Now that we >>> create AppendRelInfos even for partitioned partitions with my 0001 >>> patch, I don't think >>> we need the list to take care of the locks. Is there any other reason >>> why we maintain that list (apart from the trigger case I have raised >>> and Fujita-san says that the list is not required in that case as >>> well.)? >> >> AppendRelInfos exist within the planner (they come to be and go away >> within the planner). Once we leave the planner, that information is gone. >> >> Executor will receive a plan node that won't contain that information: >> >> 1. Append has an appendplans field, which contains one plan tree for every >> leaf partition. None of its fields, other than partitined_rels, >> contains the RT indexes of the partitioned tables. Similarly in the >> case of MergeAppend. >> >> 2. ModifyTable has a resultRelations fields which contains a list of leaf >> partition RT indexes and a plans field which contains one plan tree for >> every RT index in the resultRelations list (that is a plan tree that >> will scan the particular leaf partition). None of its fields, other >> than partitined_rels, contains the RT indexes of the partitioned >> tables. >> >> I learned over the course of developing the patch that added this >> partitioned_rels field [1] that the executor needs to identify all the >> affected tables by a given plan tree so that it could lock them. Executor >> needs to lock them separately even if the plan itself was built after >> locking all the relevant tables [2]. For example, see >> ExecLockNonLeafAppendTables(), which will lock the tables in the >> (Merge)Append.partitioned_rels list. >> >> While I've been thinking all along that the same thing must be happening >> for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of >> times on this thread), it's actually not. Instead, >> ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are >> merged into PlannedStmt.nonleafResultRelations (which happens in >> set_plan_refs) and that's where the executor finds them to lock them >> (which happens in InitPlan). >> >> So, it appears that ModifyTable.partitioned_rels is indeed unused in the >> executor. But we still can't get rid of it from the ModifyTable node >> itself without figuring out a way (a channel) to transfer that information >> into PlannedStmt.nonleafResultRelations. > > Thanks a lot for this detailed analysis. IIUC, in my 0001 patch, I am > not adding any partitioned partition other than the parent itself. But > since every partitioned partition in turn acts as parent, it appears > its own list. The list obtained by concatenating all such lists > together contains all the partitioned partition RTIs. In my patch, I > need to teach accumulate_append_subpath() to accumulate > partitioned_rels as well. > accumulate_append_subpath() is executed for every path instead of every relation, so changing it would collect the same list multiple times. Instead, I found the old way of associating all intermediate partitions with the root partitioned relation work better. Here's the updated patch set. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
From
Antonin Houska
Date:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > * get_partitioned_child_rels_for_join() > > > > I think the Assert() statement is easier to understand inside the loop, see > > the assert.diff attachment. > The assert at the end of function also checks that we have got > child_rels lists for all the parents passed in. Really? I can imagine that some instances of PartitionedChildRelInfo have the child_rels list empty, while other ones have these lists long enough to compensate for the empty lists. > > > > > > * have_partkey_equi_join() > > > > As the function handles generic join, this comment doesn't seem to me > > relevant: > > > > /* > > * The equi-join between partition keys is strict if equi-join between > > * at least one partition key is using a strict operator. See > > * explanation about outer join reordering identity 3 in > > * optimizer/README > > */ > > strict_op = op_strict(opexpr->opno); > > What in that comment is not exactly relevant? Basically I don't understand why you mention join reordering here. The join ordering questions must all have been resolved by the time have_partkey_equi_join() is called. > > > > And I think the function can return true even if strict_op is false for all > > the operators evaluated in the loop. > > I think it does that. Do you have a case where it doesn't? Here I refer to this part of the comment above: "... if equi-join between at least one partition key is using a strict operator." My understanding of the code (especially match_expr_to_partition_keys) is that no operator actually needs to be strict as long as each operator involved in the join matches at least one non-nullable expression on both sides of the join. > > * match_expr_to_partition_keys() > > > > I'm not sure this comment is clear enough: > > > > /* > > * If it's a strict equi-join a NULL partition key on one side will > > * not join a NULL partition key on the other side. So, rows with NULL > > * partition key from a partition on one side can not join with those > > * from a non-matching partition on the other side. So, search the > > * nullable partition keys as well. > > */ > > if (!strict_op) > > continue; > > > > My understanding of the problem of NULL values generated by outer join is: > > these NULL values --- if evaluated by non-strict expression --- can make row > > of N-th partition on one side of the join match row(s) of *other than* N-th > > partition(s) on the other side. Thus the nullable input expressions may only > > be evaluated by strict operators. I think it'd be clearer if you stressed that > > (undesired) *match* of partition keys can be a problem, rather than mismatch > > Sorry, I am not able to understand this. To me it looks like my > wording conveys what you are saying. I just tried to expreess the idea in a way that is clearer to me. I think we both mean the same. Not sure I should spend more effort on another version of the comment. > > If you insist on your wording, then I think you should at least move the > > comment below to the part that only deals with strict operators. > > Done. o.k. > > > > * map_and_merge_partitions() > > > > Besides a few changes proposed in map_and_merge_partitions.diff (a few of them > > to suppress compiler warnings) I think that this part needs more thought: > > > > { > > Assert(mergemap1[index1] != mergemap2[index2] && > > mergemap1[index1] >= 0 && mergemap2[index2] >= 0); > > > > /* > > * Both the partitions map to different merged partitions. This > > * means that multiple partitions from one relation matches to one > > * partition from the other relation. Partition-wise join does not > > * handle this case right now, since it requires ganging multiple > > * partitions together (into one RelOptInfo). > > */ > > merged_index = -1; > > } > > > > I could hit this path with the following test: > > > > CREATE TABLE a(i int) PARTITION BY LIST(i); > > CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2); > > CREATE TABLE b(j int) PARTITION BY LIST(j); > > CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2); > > > > SET enable_partition_wise_join TO on; > > > > SELECT * > > FROM a > > FULL JOIN > > b ON i = j; > > > > I don't think there's a reason not to join a_0 partition to b_0, is there? > > With the latest patchset I am seeing that partition-wise join is used > in this case. I have started a new thread [1] for advanced partition > matching patches. What plan do you get, with the patches from https://www.postgresql.org/message-id/CAFjFpRfdXpuSu0pxON3dKcr8WndJkaXLzHUVax_Laod0Tgc6UQ@mail.gmail.com I still see the join above Append, not below: QUERY PLAN -------------------------------------------------------------------------Merge Full Join (cost=359.57..860.00 rows=32512width=8) Merge Cond: (a_0.i = b_0.j) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: a_0.i -> Append (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on a_0 (cost=0.00..35.50 rows=2550width=4) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: b_0.j -> Append (cost=0.00..35.50rows=2550 width=4) -> Seq Scan on b_0 (cost=0.00..35.50 rows=2550 width=4) > Please post review comments about the last two patches on that thread. ok, I'll do if I find any problem. > [1] https://www.postgresql.org/message-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Thu, Sep 7, 2017 at 4:32 PM, Antonin Houska <ah@cybertec.at> wrote: > Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > >> On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <ah@cybertec.at> wrote: >> > >> > >> > >> > * get_partitioned_child_rels_for_join() >> > >> > I think the Assert() statement is easier to understand inside the loop, see >> > the assert.diff attachment. > >> The assert at the end of function also checks that we have got >> child_rels lists for all the parents passed in. > > Really? I can imagine that some instances of PartitionedChildRelInfo have the > child_rels list empty, while other ones have these lists long enough to > compensate for the empty lists. > That isn't true. Each child_rels list will at least have one entry. Please see get_partitioned_child_rels(). >> > >> > >> > * have_partkey_equi_join() >> > >> > As the function handles generic join, this comment doesn't seem to me >> > relevant: >> > >> > /* >> > * The equi-join between partition keys is strict if equi-join between >> > * at least one partition key is using a strict operator. See >> > * explanation about outer join reordering identity 3 in >> > * optimizer/README >> > */ >> > strict_op = op_strict(opexpr->opno); >> >> What in that comment is not exactly relevant? > > Basically I don't understand why you mention join reordering here. The join > ordering questions must all have been resolved by the time > have_partkey_equi_join() is called. I am referring to a particular section in README which talks about the relation between strict operator and legal join order. > >> > >> > And I think the function can return true even if strict_op is false for all >> > the operators evaluated in the loop. >> >> I think it does that. Do you have a case where it doesn't? > > Here I refer to this part of the comment above: > > "... if equi-join between at least one partition key is using a strict > operator." > > My understanding of the code (especially match_expr_to_partition_keys) is that > no operator actually needs to be strict as long as each operator involved in > the join matches at least one non-nullable expression on both sides of the > join. I don't think so. A strict operator returns NULL when either of the inputs is NULL. We can not say so for non-strict operators, which may deem NULL and non-NULL arguments as equal, even though that looks insane. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > accumulate_append_subpath() is executed for every path instead of > every relation, so changing it would collect the same list multiple > times. Instead, I found the old way of associating all intermediate > partitions with the root partitioned relation work better. Here's the > updated patch set. When I tried out patch 0001, it crashed repeatedly during 'make check' because of an assertion failure in get_partitioned_child_rels. It seemed to me that the way the patch was refactoring expand_inherited_rtentry involved more code rearrangement than necessary, so I reverted all the code rearrangement and just kept the functional changes, and all the crashes went away. (That refactoring also failed to initialize has_child properly.) In so doing, I reintroduced the problem that the PartitionedChildRelInfo lists weren't getting set up correctly, but after some thought I realized that was just because expand_single_inheritance_child() was choosing between adding an RTE and adding the OID to partitioned_child_rels, whereas for an intermediate partitioned table it needs to do both. So I inserted a trivial fix for that problem (replacing "else" with a new "if"-test), basically: - else + + if (childrte->relkind == RELKIND_PARTITIONED_TABLE) Please check out the attached version of the patch. In addition to the above simplifications, I did some adjustments to the comments in various places - some just grammar and others a bit more substantive. And I think I broke a long line in one place, too. One thing I notice is that if I rip out the changes to initsplan.c, the new regression test still passes. If it's possible to write a test that fails without those changes, I think it would be a good idea to include one in the patch. That's certainly one of the subtler parts of this patch, IMHO. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/08 4:04, Robert Haas wrote: > On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> accumulate_append_subpath() is executed for every path instead of >> every relation, so changing it would collect the same list multiple >> times. Instead, I found the old way of associating all intermediate >> partitions with the root partitioned relation work better. Here's the >> updated patch set. > > When I tried out patch 0001, it crashed repeatedly during 'make check' > because of an assertion failure in get_partitioned_child_rels. It > seemed to me that the way the patch was refactoring > expand_inherited_rtentry involved more code rearrangement than > necessary, so I reverted all the code rearrangement and just kept the > functional changes, and all the crashes went away. (That refactoring > also failed to initialize has_child properly.) When I tried the attached patch, it doesn't seem to expand partitioning inheritance in step-wise manner as the patch's title says. I think the rewritten patch forgot to include Ashutosh's changes to expand_single_inheritance_child() whereby the AppendRelInfo of the child will be marked with the direct parent instead of always the root parent. I updated the patch to include just those changes. I'm not sure about one of the Ashutosh's changes whereby the child PlanRowMark is also passed to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think the child RTE, child RT index and child Relation are fine, because they are necessary for creating AppendRelInfos in a desired way for later planning steps. But PlanRowMarks are not processed within the planner afterwards and do not need to be marked with the immediate parent-child association in the same way that AppendRelInfos need to be. I also included the changes to add_paths_to_append_rel() from my patch on the "path toward faster partition pruning" thread. We'd need that change, because while add_paths_to_append_rel() is called for all partitioned table RTEs in a given partition tree, expand_inherited_rtentry() would have set up a PartitionedChildRelInfo only for the root parent, so get_partitioned_child_rels() would not find the same for non-root partitioned table rels and crash failing the Assert. The changes I made are such that we call get_partitioned_child_rels() only for the parent rels that are known to correspond root partitioned tables (or as you pointed out on the thread, "the table named in the query" as opposed those added to the query as result of inheritance expansion). In addition to the relkind check on the input RTE, it would seem that checking that the reloptkind is RELOPT_BASEREL would be enough. But actually not, because if a partitioned table is accessed in a UNION ALL query, reloptkind even for the root partitioned table (the table named in the query) would be RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is actually the root partitioned table is to check whether its parent rel is not RTE_RELATION, because the parent in case of UNION ALL Append is a RTE_SUBQUERY RT entry. > One thing I notice is that if I rip out the changes to initsplan.c, > the new regression test still passes. If it's possible to write a > test that fails without those changes, I think it would be a good idea > to include one in the patch. That's certainly one of the subtler > parts of this patch, IMHO. Back when this (step-wise expansion of partition inheritance) used to be a patch in the original declarative partitioning patch series, Ashutosh had reported a test query [1] that would fail getting a plan, for which we came up with the initsplan.c changes in this patch as the solution: ERROR: could not devise a query plan for the given query I tried that query again without the initsplan.c changes and somehow the same error does not occur anymore. It's strange because without the initsplan.c changes, there is no way for partitions lower in the tree than the first level to get the direct_lateral_relids and lateral_relids from the root parent rel. Maybe, Ashutosh has a way to devise the failing query again. I also confirmed that the partition-pruning patch set works fine with this patch instead of the patch on that thread with the same functionality, which I will now drop from that patch set. Sorry about the wasted time. Thanks, Amit [1] https://www.postgresql.org/message-id/CAFjFpReZF34MDbY95xoATi0xVj2mAry4-LHBWVBayOc8gj%3Diqg%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/08 14:47, Amit Langote wrote: > When I tried the attached patch, it doesn't seem to expand partitioning > inheritance in step-wise manner as the patch's title says. Oops. By "attached patch", I had meant to say the Robert's patch, that is, expand-stepwise-rmh.patch. Not expand-stepwise-rmh-2.patch, which is the updated version of the patch attached with the quoted message. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Sep 8, 2017 at 12:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> accumulate_append_subpath() is executed for every path instead of >> every relation, so changing it would collect the same list multiple >> times. Instead, I found the old way of associating all intermediate >> partitions with the root partitioned relation work better. Here's the >> updated patch set. > > When I tried out patch 0001, it crashed repeatedly during 'make check' > because of an assertion failure in get_partitioned_child_rels. Running "make check" on the whole patchset doesn't give that failure. So I didn't notice the crash since I was running regression on the whole patchset. Sorry for that. Fortunately git rebase -i allows us to execute shell commands while applying patches, so I have set it up to compile each patch and run regression. Hopefully that will catch such errors in future. That process showed me that patch 0003-In-add_paths_to_append_rel-get-partitioned_rels-for-.patch fixes that crash by calling get_partitioned_child_rels() only on the root partitioned table for which we have set up child_rels list. Amit Langote has a similar fix reported in his reply. So, we will discuss it there. > It > seemed to me that the way the patch was refactoring > expand_inherited_rtentry involved more code rearrangement than > necessary, so I reverted all the code rearrangement and just kept the > functional changes, and all the crashes went away. (That refactoring > also failed to initialize has_child properly.) In so doing, I > reintroduced the problem that the PartitionedChildRelInfo lists > weren't getting set up correctly, but after some thought I realized > that was just because expand_single_inheritance_child() was choosing > between adding an RTE and adding the OID to partitioned_child_rels, > whereas for an intermediate partitioned table it needs to do both. So > I inserted a trivial fix for that problem (replacing "else" with a new > "if"-test), basically: > > - else > + > + if (childrte->relkind == RELKIND_PARTITIONED_TABLE) > > Please check out the attached version of the patch. In addition to > the above simplifications, I did some adjustments to the comments in > various places - some just grammar and others a bit more substantive. > And I think I broke a long line in one place, too. > > One thing I notice is that if I rip out the changes to initsplan.c, > the new regression test still passes. If it's possible to write a > test that fails without those changes, I think it would be a good idea > to include one in the patch. That's certainly one of the subtler > parts of this patch, IMHO. Amit Langote has replied on these points as well. So, I will comment in a reply to his reply. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Fri, Sep 8, 2017 at 1:47 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > When I tried the attached patch, it doesn't seem to expand partitioning > inheritance in step-wise manner as the patch's title says. I think the > rewritten patch forgot to include Ashutosh's changes to > expand_single_inheritance_child() whereby the AppendRelInfo of the child > will be marked with the direct parent instead of always the root parent. Woops. > I updated the patch to include just those changes. I'm not sure about > one of the Ashutosh's changes whereby the child PlanRowMark is also passed > to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think > the child RTE, child RT index and child Relation are fine, because they > are necessary for creating AppendRelInfos in a desired way for later > planning steps. But PlanRowMarks are not processed within the planner > afterwards and do not need to be marked with the immediate parent-child > association in the same way that AppendRelInfos need to be. We probably need some better comments to explain which things need to be marked using the immediate parent and which need to be marked using the baserel, and why. > I also included the changes to add_paths_to_append_rel() from my patch on > the "path toward faster partition pruning" thread. We'd need that change, > because while add_paths_to_append_rel() is called for all partitioned > table RTEs in a given partition tree, expand_inherited_rtentry() would > have set up a PartitionedChildRelInfo only for the root parent, so > get_partitioned_child_rels() would not find the same for non-root > partitioned table rels and crash failing the Assert. The changes I made > are such that we call get_partitioned_child_rels() only for the parent > rels that are known to correspond root partitioned tables (or as you > pointed out on the thread, "the table named in the query" as opposed those > added to the query as result of inheritance expansion). In addition to > the relkind check on the input RTE, it would seem that checking that the > reloptkind is RELOPT_BASEREL would be enough. But actually not, because > if a partitioned table is accessed in a UNION ALL query, reloptkind even > for the root partitioned table (the table named in the query) would be > RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is > actually the root partitioned table is to check whether its parent rel is > not RTE_RELATION, because the parent in case of UNION ALL Append is a > RTE_SUBQUERY RT entry. OK, so this needs some good comments, too... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/08 4:04, Robert Haas wrote: >> On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> accumulate_append_subpath() is executed for every path instead of >>> every relation, so changing it would collect the same list multiple >>> times. Instead, I found the old way of associating all intermediate >>> partitions with the root partitioned relation work better. Here's the >>> updated patch set. >> >> When I tried out patch 0001, it crashed repeatedly during 'make check' >> because of an assertion failure in get_partitioned_child_rels. It >> seemed to me that the way the patch was refactoring >> expand_inherited_rtentry involved more code rearrangement than >> necessary, so I reverted all the code rearrangement and just kept the >> functional changes, and all the crashes went away. (That refactoring >> also failed to initialize has_child properly.) > > When I tried the attached patch, it doesn't seem to expand partitioning > inheritance in step-wise manner as the patch's title says. I think the > rewritten patch forgot to include Ashutosh's changes to > expand_single_inheritance_child() whereby the AppendRelInfo of the child > will be marked with the direct parent instead of always the root parent. Right. If we apply 0002 from partition-wise join patchset, which has changed build_simple_rel() to collect direct children of a given partitioned table, it introduces another crash because of Assertion failure; for a partitioned table build_simple_rel() finds more children than expected because indirect children are also counted as direct children. > > I updated the patch to include just those changes. I'm not sure about > one of the Ashutosh's changes whereby the child PlanRowMark is also passed > to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think > the child RTE, child RT index and child Relation are fine, because they > are necessary for creating AppendRelInfos in a desired way for later > planning steps. But PlanRowMarks are not processed within the planner > afterwards and do not need to be marked with the immediate parent-child > association in the same way that AppendRelInfos need to be. Passing top parent's row mark works today, since there is no parent-child specific translation happening there. But if in future, we introduce such a translation, row marks for indirect children in a multi-level partitioned hierarchy won't be accurate. So, I think it's better to pass row marks of the direct parent. > > I also included the changes to add_paths_to_append_rel() from my patch on > the "path toward faster partition pruning" thread. We'd need that change, > because while add_paths_to_append_rel() is called for all partitioned > table RTEs in a given partition tree, expand_inherited_rtentry() would > have set up a PartitionedChildRelInfo only for the root parent, so > get_partitioned_child_rels() would not find the same for non-root > partitioned table rels and crash failing the Assert. The changes I made > are such that we call get_partitioned_child_rels() only for the parent > rels that are known to correspond root partitioned tables (or as you > pointed out on the thread, "the table named in the query" as opposed those > added to the query as result of inheritance expansion). In addition to > the relkind check on the input RTE, it would seem that checking that the > reloptkind is RELOPT_BASEREL would be enough. But actually not, because > if a partitioned table is accessed in a UNION ALL query, reloptkind even > for the root partitioned table (the table named in the query) would be > RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is > actually the root partitioned table is to check whether its parent rel is > not RTE_RELATION, because the parent in case of UNION ALL Append is a > RTE_SUBQUERY RT entry. > There was a change in my 0003 patch, which fixed the crash. It checked for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in my 0001 patch. It no more crashes. I tried various queries involving set operations and bare multi-level partitioned table scan with my patch, but none of them showed any anomaly. Do you have a testcase which shows problem with my patch? May be your suggestion is correct, but corresponding code implementation is slightly longer than I would expect. So, we should go with it, if there is corresponding testcase which shows why it's needed. In your patch + parent_rel = root->simple_rel_array[parent_relid]; + get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY); Do you mean RTE_RELATION as you explained above? >> One thing I notice is that if I rip out the changes to initsplan.c, >> the new regression test still passes. If it's possible to write a >> test that fails without those changes, I think it would be a good idea >> to include one in the patch. That's certainly one of the subtler >> parts of this patch, IMHO. > > Back when this (step-wise expansion of partition inheritance) used to be a > patch in the original declarative partitioning patch series, Ashutosh had > reported a test query [1] that would fail getting a plan, for which we > came up with the initsplan.c changes in this patch as the solution: > > ERROR: could not devise a query plan for the given query > > I tried that query again without the initsplan.c changes and somehow the > same error does not occur anymore. It's strange because without the > initsplan.c changes, there is no way for partitions lower in the tree than > the first level to get the direct_lateral_relids and lateral_relids from > the root parent rel. Maybe, Ashutosh has a way to devise the failing > query again. > Thanks a lot for the reference. I devised a testcase slightly modifying my original test. I have included the test in the latest patch set. I have included Robert's changes to parts other than expand_inherited_rtentry() in the patch. > > I also confirmed that the partition-pruning patch set works fine with this > patch instead of the patch on that thread with the same functionality, > which I will now drop from that patch set. Sorry about the wasted time. > Thanks a lot. Please review the patch in the updated patchset. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> I also confirmed that the partition-pruning patch set works fine with this >> patch instead of the patch on that thread with the same functionality, >> which I will now drop from that patch set. Sorry about the wasted time. > > Thanks a lot. Please review the patch in the updated patchset. In set_append_rel_size(), I don't find the comment too clear (and this part was taken from Amit's patch, right?). I suggest: + /* + * Associate the partitioned tables which are descendents of the table + * named in the query with the topmost append path (i.e. the one where + * rel->reloptkind is RELOPT_BASEREL). This ensures that they get properly + * locked at execution time. + */ I'm a bit suspicious about the fact that there are now executor changes related to the PlanRowMarks. If the rowmark's prti is now the intermediate parent's RT index rather than the top-parent's RT index, it'd seem like that'd matter somehow. Maybe it doesn't, because the code that cares about prti seems to only care about whether it's different from rti. But if that's true everywhere, then why even change this? I think we might be well off not to tinker with things that don't need to be changed. Apart from that concern, now that I understand (from my own failed attempt and some off-list discussion) why this patch works the way it does, I think this is in fairly good shape. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/09 2:38, Ashutosh Bapat wrote: > On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote: >> I updated the patch to include just those changes. I'm not sure about >> one of the Ashutosh's changes whereby the child PlanRowMark is also passed >> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think >> the child RTE, child RT index and child Relation are fine, because they >> are necessary for creating AppendRelInfos in a desired way for later >> planning steps. But PlanRowMarks are not processed within the planner >> afterwards and do not need to be marked with the immediate parent-child >> association in the same way that AppendRelInfos need to be. > > Passing top parent's row mark works today, since there is no > parent-child specific translation happening there. But if in future, > we introduce such a translation, row marks for indirect children in a > multi-level partitioned hierarchy won't be accurate. So, I think it's > better to pass row marks of the direct parent. IMHO, we should make it the responsibility of the future patch to set a child PlanRowMark's prti to the direct parent's RT index, when we actually know that it's needed for something. We clearly know today why we need to pass the other objects like child RT entry, RT index, and Relation, so we should limit this patch to pass only those objects to the recursive call. That makes this patch a relatively easy to understand change. >> I also included the changes to add_paths_to_append_rel() from my patch on >> the "path toward faster partition pruning" thread. We'd need that change, >> because while add_paths_to_append_rel() is called for all partitioned >> table RTEs in a given partition tree, expand_inherited_rtentry() would >> have set up a PartitionedChildRelInfo only for the root parent, so >> get_partitioned_child_rels() would not find the same for non-root >> partitioned table rels and crash failing the Assert. The changes I made >> are such that we call get_partitioned_child_rels() only for the parent >> rels that are known to correspond root partitioned tables (or as you >> pointed out on the thread, "the table named in the query" as opposed those >> added to the query as result of inheritance expansion). In addition to >> the relkind check on the input RTE, it would seem that checking that the >> reloptkind is RELOPT_BASEREL would be enough. But actually not, because >> if a partitioned table is accessed in a UNION ALL query, reloptkind even >> for the root partitioned table (the table named in the query) would be >> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is >> actually the root partitioned table is to check whether its parent rel is >> not RTE_RELATION, because the parent in case of UNION ALL Append is a >> RTE_SUBQUERY RT entry. >> > > There was a change in my 0003 patch, which fixed the crash. It checked > for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in > my 0001 patch. It no more crashes. I tried various queries involving > set operations and bare multi-level partitioned table scan with my > patch, but none of them showed any anomaly. Do you have a testcase > which shows problem with my patch? May be your suggestion is correct, > but corresponding code implementation is slightly longer than I would > expect. So, we should go with it, if there is corresponding testcase > which shows why it's needed. If we go with your patch, partitioned tables won't get locked, for example, in case of the following query (p is a partitioned table): select 1 from p union all select 2 from p; That's because the RelOptInfos for the two instances of p in the above query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children of the Append corresponding to the UNION ALL subquery RTE. So, partitioned_rels does not get set per your proposed code. > > In your patch > > + parent_rel = root->simple_rel_array[parent_relid]; > + get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY); > > Do you mean RTE_RELATION as you explained above? No, I mean RTE_SUBQUERY. If the partitioned table RTE in question corresponds to one named in the query, we should be able to find its pcinfo in root->pcinfo_list. If the partitioned table RTE is one added as result of inheritance expansion, it won't have an associated pcinfo. So, we should find a way to distinguish them from one another. The first idea that had occurred to me was the same as yours -- RelOptInfo of the partitioned table RTE named in the query would be of reloptkind RELOPT_BASEREL and those of the partitioned table RTE added as result of inheritance expansion will be of reloptkind RELOPT_OTHER_MEMBER_REL. Although the latter is always true, the former is not. If the partitioned table named in the query appears under UNION ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL. That means we have to use some other means to distinguish partitioned table RTEs that have an associated pcinfo from those that don't. So, I devised this method of looking at the parent RTE (if any) for distinguishing the two. Partitioned table named in the query either doesn't have the parent or if it does, the parent could only ever be a UNION ALL subquery (RTE_SUBQUERY). Partitioned tables added as part of inheritance expansion will always have the parent and the parent will only ever be a table (RTE_RELATION). >>> One thing I notice is that if I rip out the changes to initsplan.c, >>> the new regression test still passes. If it's possible to write a >>> test that fails without those changes, I think it would be a good idea >>> to include one in the patch. That's certainly one of the subtler >>> parts of this patch, IMHO. >> >> Back when this (step-wise expansion of partition inheritance) used to be a >> patch in the original declarative partitioning patch series, Ashutosh had >> reported a test query [1] that would fail getting a plan, for which we >> came up with the initsplan.c changes in this patch as the solution: >> >> ERROR: could not devise a query plan for the given query >> >> I tried that query again without the initsplan.c changes and somehow the >> same error does not occur anymore. It's strange because without the >> initsplan.c changes, there is no way for partitions lower in the tree than >> the first level to get the direct_lateral_relids and lateral_relids from >> the root parent rel. Maybe, Ashutosh has a way to devise the failing >> query again. > > Thanks a lot for the reference. I devised a testcase slightly > modifying my original test. I have included the test in the latest > patch set. > > I have included Robert's changes to parts other than > expand_inherited_rtentry() in the patch. > >> I also confirmed that the partition-pruning patch set works fine with this >> patch instead of the patch on that thread with the same functionality, >> which I will now drop from that patch set. Sorry about the wasted time. >> > > Thanks a lot. Please review the patch in the updated patchset. Some comments: In create_lateral_join_info(): + Assert(IS_SIMPLE_REL(brel)); + Assert(brte); The second Assert is either unnecessary or should be placed first. The following comment could be made a bit clearer. + * In the case of table inheritance, the parent RTE is directly linked + * to every child table via an AppendRelInfo. In the case of table + * partitioning, the inheritance hierarchy is expanded one level at a + * time rather than flattened. Therefore, an other member rel that is + * a partitioned table may have children of its own, and must + * therefore be marked with the appropriate lateral info so that those + * children eventually get marked also. How about: In the case of partitioned table inheritance, the original parent RTE is linked, via AppendRelInfo, only to its immediate partitions.Partitions below the first level are accessibleonly via their immediate parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so consider those as well. In expand_inherited_rtentry(), the following comment fragment is obsolete, because we *do* now create AppendRelInfo's for partitioned children: + /* + * We keep a list of objects in root, each of which maps a partitioned + * parent RT index to the list of RT indexes of its partitioned child + * tables which do not have AppendRelInfos associated with those. By the way, when we call expand_single_inheritance_child() in the non-partitioned inheritance case, we should pass NULL for childrte_p, childRTindex_p, childrc_p, instead of declaring variables that won't be used. Hence, expand_single_inheritance_child() should make those arguments optional. + + /* + * If the partitioned table has no partitions or all the partitions are + * temporary tables from other backends, treat this as non-inheritance + * case. + */ + if (!has_child) + parentrte->inh = false; I guess the above applies to all partitioned tables in the tree, so, I think we should update the comment in set_rel_size(): else if (rte->relkind == RELKIND_PARTITIONED_TABLE) { /* * A partitioned table without leaf partitions is marked * as a dummy rel. */ set_dummy_rel_pathlist(rel); } to say: a partitioned table without partitions is marked as a dummy rel. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/09 9:58, Robert Haas wrote: > I'm a bit suspicious about the fact that there are now executor > changes related to the PlanRowMarks. If the rowmark's prti is now the > intermediate parent's RT index rather than the top-parent's RT index, > it'd seem like that'd matter somehow. Maybe it doesn't, because the > code that cares about prti seems to only care about whether it's > different from rti. Yes, it doesn't matter. The important point though is that nothing we want to do in the short term requires us to set a child PlanRowMark's prti to its immediate parent's RT index, as I also mentioned in reply to Ashutosh. > But if that's true everywhere, then why even > change this? I think we might be well off not to tinker with things > that don't need to be changed. +1. > Apart from that concern, now that I understand (from my own failed > attempt and some off-list discussion) why this patch works the way it > does, I think this is in fairly good shape. I too think so, except we still need to incorporate changes to add_paths_to_append_rel() necessary to correctly set partitioned_rels, as I explained in reply Ashutosh. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>> I also confirmed that the partition-pruning patch set works fine with this >>> patch instead of the patch on that thread with the same functionality, >>> which I will now drop from that patch set. Sorry about the wasted time. >> >> Thanks a lot. Please review the patch in the updated patchset. > > In set_append_rel_size(), I don't find the comment too clear (and this > part was taken from Amit's patch, right?). I suggest: No, I didn't take it from Amit's patch. Both of us have different wordings. But yours is better than both of us. Included it in the attached patches. > > + /* > + * Associate the partitioned tables which are descendents of the table > + * named in the query with the topmost append path (i.e. the one where > + * rel->reloptkind is RELOPT_BASEREL). This ensures that they get properly > + * locked at execution time. > + */ > > I'm a bit suspicious about the fact that there are now executor > changes related to the PlanRowMarks. If the rowmark's prti is now the > intermediate parent's RT index rather than the top-parent's RT index, > it'd seem like that'd matter somehow. Maybe it doesn't, because the > code that cares about prti seems to only care about whether it's > different from rti. But if that's true everywhere, then why even > change this? I think we might be well off not to tinker with things > that don't need to be changed. In the definition of ExecRowMark, I see Index prti; /* parent range table index, if child */ It just says parent, by which I take as direct parent. For inheritance, which earlier flattened inheritance hierarchy, direct parent was top parent. So, probably nobody thought whether a parent is direct parent or top parent. But now that we have introduced that concept we need to interpret this comment anew. And I think interpreting it as direct parent is non-lossy. If we set top parent's index, parent RTI in AppendRelInfo and PlanRowMark would not agree. So, it looks quite natural that we set the direct parent's index in PlanRowMark. From that POV, we aren't changing anything, we are setting the same parent RTI in AppendRelInfo and PlanRowMark. Chaning different parent RTIs in those two structure would be a real change. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/11 16:23, Ashutosh Bapat wrote: > On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I'm a bit suspicious about the fact that there are now executor >> changes related to the PlanRowMarks. If the rowmark's prti is now the >> intermediate parent's RT index rather than the top-parent's RT index, >> it'd seem like that'd matter somehow. Maybe it doesn't, because the >> code that cares about prti seems to only care about whether it's >> different from rti. But if that's true everywhere, then why even >> change this? I think we might be well off not to tinker with things >> that don't need to be changed. > > In the definition of ExecRowMark, I see > Index prti; /* parent range table index, if child */ > It just says parent, by which I take as direct parent. For > inheritance, which earlier flattened inheritance hierarchy, direct > parent was top parent. So, probably nobody thought whether a parent is > direct parent or top parent. But now that we have introduced that > concept we need to interpret this comment anew. And I think > interpreting it as direct parent is non-lossy. But then we also don't have anything to say about why we're making that change. If you could describe what non-lossy is in this context, then fine. But that we'd like to match with what we're going to do for AppendRelInfos does not seem to be a sufficient explanation for this change. > If we set top parent's > index, parent RTI in AppendRelInfo and PlanRowMark would not agree. > So, it looks quite natural that we set the direct parent's index in > PlanRowMark. They would not agree, yes, but aren't they unrelated? If we have a reason for them to agree, (for example, row-locking breaks in the inherited table case if we didn't), then we should definitely make them agree. Updating the comment for prti definition might be something that this patch could (should?) do, but I'm not quite sure about that too. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/09 2:38, Ashutosh Bapat wrote: >> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote: >>> I updated the patch to include just those changes. I'm not sure about >>> one of the Ashutosh's changes whereby the child PlanRowMark is also passed >>> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think >>> the child RTE, child RT index and child Relation are fine, because they >>> are necessary for creating AppendRelInfos in a desired way for later >>> planning steps. But PlanRowMarks are not processed within the planner >>> afterwards and do not need to be marked with the immediate parent-child >>> association in the same way that AppendRelInfos need to be. >> >> Passing top parent's row mark works today, since there is no >> parent-child specific translation happening there. But if in future, >> we introduce such a translation, row marks for indirect children in a >> multi-level partitioned hierarchy won't be accurate. So, I think it's >> better to pass row marks of the direct parent. > > IMHO, we should make it the responsibility of the future patch to set a > child PlanRowMark's prti to the direct parent's RT index, when we actually > know that it's needed for something. We clearly know today why we need to > pass the other objects like child RT entry, RT index, and Relation, so we > should limit this patch to pass only those objects to the recursive call. > That makes this patch a relatively easy to understand change. I think you are mixing two issues here 1. setting parent RTI in child PlanRowMark and 2. passing immediate parent's PlanRowMark to expand_single_inheritance_child(). I have discussed 1 in my reply to Robert. About 2 you haven't given any particular comments to my reply. To me it looks like it's this patch that introduces the notion of multi-level expansion, so it's natural for this patch to pass PlanRowMark in cascaded fashion similar to other structures. > >>> I also included the changes to add_paths_to_append_rel() from my patch on >>> the "path toward faster partition pruning" thread. We'd need that change, >>> because while add_paths_to_append_rel() is called for all partitioned >>> table RTEs in a given partition tree, expand_inherited_rtentry() would >>> have set up a PartitionedChildRelInfo only for the root parent, so >>> get_partitioned_child_rels() would not find the same for non-root >>> partitioned table rels and crash failing the Assert. The changes I made >>> are such that we call get_partitioned_child_rels() only for the parent >>> rels that are known to correspond root partitioned tables (or as you >>> pointed out on the thread, "the table named in the query" as opposed those >>> added to the query as result of inheritance expansion). In addition to >>> the relkind check on the input RTE, it would seem that checking that the >>> reloptkind is RELOPT_BASEREL would be enough. But actually not, because >>> if a partitioned table is accessed in a UNION ALL query, reloptkind even >>> for the root partitioned table (the table named in the query) would be >>> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is >>> actually the root partitioned table is to check whether its parent rel is >>> not RTE_RELATION, because the parent in case of UNION ALL Append is a >>> RTE_SUBQUERY RT entry. >>> >> >> There was a change in my 0003 patch, which fixed the crash. It checked >> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in >> my 0001 patch. It no more crashes. I tried various queries involving >> set operations and bare multi-level partitioned table scan with my >> patch, but none of them showed any anomaly. Do you have a testcase >> which shows problem with my patch? May be your suggestion is correct, >> but corresponding code implementation is slightly longer than I would >> expect. So, we should go with it, if there is corresponding testcase >> which shows why it's needed. > > If we go with your patch, partitioned tables won't get locked, for > example, in case of the following query (p is a partitioned table): > > select 1 from p union all select 2 from p; > > That's because the RelOptInfos for the two instances of p in the above > query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children > of the Append corresponding to the UNION ALL subquery RTE. So, > partitioned_rels does not get set per your proposed code. Session 1: postgres=# begin; BEGIN postgres=# select 1 from t1 union all select 2 from t1;?column? ---------- (0 rows) postgres=# select pg_backend_pid();pg_backend_pid ---------------- 28843 (1 row) Session 2 postgres=# select locktype, relation::regclass, virtualxid, virtualtransaction, pid, mode, granted, fastpath from pg_locks; locktype | relation | virtualxid | virtualtransaction | pid |mode | granted | fastpath ------------+----------+------------+--------------------+-------+-----------------+---------+----------relation | pg_locks| | 4/14 | 28854 | AccessShareLock | t | tvirtualxid | | 4/14 | 4/14 | 28854 | ExclusiveLock | t | trelation | t1p1p1 | | 3/9 | 28843 | AccessShareLock | t | trelation | t1p1 | | 3/9 | 28843 | AccessShareLock | t | trelation | t1 | | 3/9 | 28843 | AccessShareLock | t | tvirtualxid | | 3/9 | 3/9 | 28843 | ExclusiveLock | t | t (6 rows) So, all partitioned partitions are getting locked correctly. Am I missing something? > >> >> In your patch >> >> + parent_rel = root->simple_rel_array[parent_relid]; >> + get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY); >> >> Do you mean RTE_RELATION as you explained above? > > No, I mean RTE_SUBQUERY. > > If the partitioned table RTE in question corresponds to one named in the > query, we should be able to find its pcinfo in root->pcinfo_list. If the > partitioned table RTE is one added as result of inheritance expansion, it > won't have an associated pcinfo. So, we should find a way to distinguish > them from one another. The first idea that had occurred to me was the > same as yours -- RelOptInfo of the partitioned table RTE named in the > query would be of reloptkind RELOPT_BASEREL and those of the partitioned > table RTE added as result of inheritance expansion will be of reloptkind > RELOPT_OTHER_MEMBER_REL. Although the latter is always true, the former > is not. If the partitioned table named in the query appears under UNION > ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL. That > means we have to use some other means to distinguish partitioned table > RTEs that have an associated pcinfo from those that don't. So, I devised > this method of looking at the parent RTE (if any) for distinguishing the > two. Partitioned table named in the query either doesn't have the parent > or if it does, the parent could only ever be a UNION ALL subquery > (RTE_SUBQUERY). Partitioned tables added as part of inheritance expansion > will always have the parent and the parent will only ever be a table > (RTE_RELATION). > Actually, the original problem that caused this discussion started with an assertion failure in get_partitioned_child_rels() as Assert(list_length(result) >= 1); This assertion fails if result is NIL when an intermediate partitioned table is passed. May be we should assert (result == NIL || list_length(result) == 1) and allow that function to be called even for intermediate partitioned partitions for which the function will return NIL. That will leave the code in add_paths_to_append_rel() simple. Thoughts? > > In create_lateral_join_info(): > > + Assert(IS_SIMPLE_REL(brel)); > + Assert(brte); > > The second Assert is either unnecessary or should be placed first. simple_rte_array[] may have some NULL entries. Second assert makes sure that we aren't dealing with a NULL entry. Any particular reason to reorder the asserts? > > The following comment could be made a bit clearer. > > + * In the case of table inheritance, the parent RTE is directly > linked > + * to every child table via an AppendRelInfo. In the case of table > + * partitioning, the inheritance hierarchy is expanded one level at a > + * time rather than flattened. Therefore, an other member rel > that is > + * a partitioned table may have children of its own, and must > + * therefore be marked with the appropriate lateral info so that > those > + * children eventually get marked also. > > How about: In the case of partitioned table inheritance, the original > parent RTE is linked, via AppendRelInfo, only to its immediate partitions. > Partitions below the first level are accessible only via their immediate > parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so > consider those as well. I don't see much difference between those two. We usually do not use macros in comments, so usually comments mention "other member" rel. Let's leave this for the committer to judge. > > In expand_inherited_rtentry(), the following comment fragment is obsolete, > because we *do* now create AppendRelInfo's for partitioned children: > > + /* > + * We keep a list of objects in root, each of which maps a > partitioned > + * parent RT index to the list of RT indexes of its partitioned child > + * tables which do not have AppendRelInfos associated with those. Good catch. I have reworded it as /* * We keep a list of objects in root, each of which maps a root *partitioned parent RT index to the list of RT indexes of descendant * partitioned child tables. Does that look good? > > > By the way, when we call expand_single_inheritance_child() in the > non-partitioned inheritance case, we should pass NULL for childrte_p, > childRTindex_p, childrc_p, instead of declaring variables that won't be > used. Hence, expand_single_inheritance_child() should make those > arguments optional. That introduces an extra "if" condition, which is costlier than an assignment. We have used both the styles in the code. Previously, I have got comments otherwise. So, I am not sure. > > + > + /* > + * If the partitioned table has no partitions or all the partitions are > + * temporary tables from other backends, treat this as non-inheritance > + * case. > + */ > + if (!has_child) > + parentrte->inh = false; > > I guess the above applies to all partitioned tables in the tree, so, I > think we should update the comment in set_rel_size(): > > else if (rte->relkind == RELKIND_PARTITIONED_TABLE) > { > /* > * A partitioned table without leaf partitions is marked > * as a dummy rel. > */ > set_dummy_rel_pathlist(rel); > } > > to say: a partitioned table without partitions is marked as a dummy rel. Done. Thanks again for the catch. I will update the patches once we have some resolution about 1. prti in PlanRowMarks and 2. detection of root partitioned table in add_paths_to_append_rel(). -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Mon, Sep 11, 2017 at 2:16 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/11 16:23, Ashutosh Bapat wrote: >> On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I'm a bit suspicious about the fact that there are now executor >>> changes related to the PlanRowMarks. If the rowmark's prti is now the >>> intermediate parent's RT index rather than the top-parent's RT index, >>> it'd seem like that'd matter somehow. Maybe it doesn't, because the >>> code that cares about prti seems to only care about whether it's >>> different from rti. But if that's true everywhere, then why even >>> change this? I think we might be well off not to tinker with things >>> that don't need to be changed. >> >> In the definition of ExecRowMark, I see >> Index prti; /* parent range table index, if child */ >> It just says parent, by which I take as direct parent. For >> inheritance, which earlier flattened inheritance hierarchy, direct >> parent was top parent. So, probably nobody thought whether a parent is >> direct parent or top parent. But now that we have introduced that >> concept we need to interpret this comment anew. And I think >> interpreting it as direct parent is non-lossy. > > But then we also don't have anything to say about why we're making that > change. If you could describe what non-lossy is in this context, then > fine. By setting prti to the topmost parent RTI we are loosing information that this child may be an intermediate child similar to what we did earlier to AppendRelInfo. That's the lossy-ness in this context. > But that we'd like to match with what we're going to do for > AppendRelInfos does not seem to be a sufficient explanation for this change. The purpose of this patch is to change the parent-child linkages for partitioned table and prti is one of them. So, in fact, I am wondering why not to change that along with AppendRelInfo. > >> If we set top parent's >> index, parent RTI in AppendRelInfo and PlanRowMark would not agree. >> So, it looks quite natural that we set the direct parent's index in >> PlanRowMark. > > They would not agree, yes, but aren't they unrelated? If we have a reason > for them to agree, (for example, row-locking breaks in the inherited table > case if we didn't), then we should definitely make them agree. > > Updating the comment for prti definition might be something that this > patch could (should?) do, but I'm not quite sure about that too. > To me that looks backwards again for the reasons described above. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > So, all partitioned partitions are getting locked correctly. Am I > missing something? That's not a valid test. In that scenario, you're going to hold all the locks acquired by the planner, all the locks acquired by the rewriter, and all the locks acquired by the executor, but when using prepared queries, it's possible to execute the plan after the planner and rewriter locks are no longer held. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> So, all partitioned partitions are getting locked correctly. Am I >> missing something? > > That's not a valid test. In that scenario, you're going to hold all > the locks acquired by the planner, all the locks acquired by the > rewriter, and all the locks acquired by the executor, but when using > prepared queries, it's possible to execute the plan after the planner > and rewriter locks are no longer held. > I see the same thing when I use prepare and execute Session 1 postgres=# prepare stmt as select 1 from t1 union all select 2 from t1; PREPARE postgres=# select pg_backend_pid();pg_backend_pid ---------------- 50912 (1 row) postgres=# begin; BEGIN postgres=# execute stmt;?column? ---------- (0 rows) Session 2 postgres=# select locktype, relation::regclass, virtualxid, virtualtransaction, pid, mode, granted, fastpath from pg_locks; locktype | relation | virtualxid | virtualtransaction | pid |mode | granted | fastpath ------------+----------+------------+--------------------+-------+-----------------+---------+----------relation | pg_locks| | 4/4 | 50914 | AccessShareLock | t | tvirtualxid | | 4/4 | 4/4 | 50914 | ExclusiveLock | t | trelation | t1p1p1 | | 3/12 | 50912 | AccessShareLock | t | trelation | t1p1 | | 3/12 | 50912 | AccessShareLock | t | trelation | t1 | | 3/12 | 50912 | AccessShareLock | t | tvirtualxid | | 3/12 | 3/12 | 50912 | ExclusiveLock | t | t (6 rows) -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Mon, Sep 11, 2017 at 8:07 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I see the same thing when I use prepare and execute Hmm. Well, that's good, but it doesn't prove there's no bug. We have to understand where and why it's getting locked to know whether the behavior will be correct in all cases. I haven't had time to look at Amit's comments in detail yet so I don't know whether I agree with his analysis or not, but we have to look at what's going on under the hood to know whether the engine is working -- not just listen to the noise it makes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/11 19:45, Ashutosh Bapat wrote: > On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote: >> IMHO, we should make it the responsibility of the future patch to set a >> child PlanRowMark's prti to the direct parent's RT index, when we actually >> know that it's needed for something. We clearly know today why we need to >> pass the other objects like child RT entry, RT index, and Relation, so we >> should limit this patch to pass only those objects to the recursive call. >> That makes this patch a relatively easy to understand change. > > I think you are mixing two issues here 1. setting parent RTI in child > PlanRowMark and 2. passing immediate parent's PlanRowMark to > expand_single_inheritance_child(). > > I have discussed 1 in my reply to Robert. > > About 2 you haven't given any particular comments to my reply. To me > it looks like it's this patch that introduces the notion of > multi-level expansion, so it's natural for this patch to pass > PlanRowMark in cascaded fashion similar to other structures. You patch does 2 to be able to do 1, doesn't it? That is, to be able to set the child PlanRowMark's prti to the direct parent's RT index, you pass the immediate parent's PlanRowMark to the recursive call of expand_single_inheritance_child(). All I am trying to say is that this patch's mission is to expand inheritance step-wise to be able to do certain things in the *planner* that weren't possible before. The patch accomplishes that by creating child AppendRelInfos such that its parent_relid field is set to the immediate parent's RT index. It's quite clear why we're doing so. It's not clear why we should do so for PlanRowMarks too. Maybe it's fine as long as nothing breaks. >> If we go with your patch, partitioned tables won't get locked, for >> example, in case of the following query (p is a partitioned table): >> >> select 1 from p union all select 2 from p; >> >> That's because the RelOptInfos for the two instances of p in the above >> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children >> of the Append corresponding to the UNION ALL subquery RTE. So, >> partitioned_rels does not get set per your proposed code. > [...] > So, all partitioned partitions are getting locked correctly. Am I > missing something? Will reply to this separately to your other email. > Actually, the original problem that caused this discussion started > with an assertion failure in get_partitioned_child_rels() as > Assert(list_length(result) >= 1); > > This assertion fails if result is NIL when an intermediate partitioned > table is passed. May be we should assert (result == NIL || > list_length(result) == 1) and allow that function to be called even > for intermediate partitioned partitions for which the function will > return NIL. That will leave the code in add_paths_to_append_rel() > simple. Thoughts? Yeah, I guess that could work. We'll just have to write comments to describe why the Assert is written that way. >> In create_lateral_join_info(): >> >> + Assert(IS_SIMPLE_REL(brel)); >> + Assert(brte); >> >> The second Assert is either unnecessary or should be placed first. > > simple_rte_array[] may have some NULL entries. Second assert makes > sure that we aren't dealing with a NULL entry. Any particular reason > to reorder the asserts? Sorry, I missed that the 2nd Assert has b"rte". I thought it's b"rel". >> The following comment could be made a bit clearer. >> >> + * In the case of table inheritance, the parent RTE is directly >> linked >> + * to every child table via an AppendRelInfo. In the case of table >> + * partitioning, the inheritance hierarchy is expanded one level at a >> + * time rather than flattened. Therefore, an other member rel >> that is >> + * a partitioned table may have children of its own, and must >> + * therefore be marked with the appropriate lateral info so that >> those >> + * children eventually get marked also. >> >> How about: In the case of partitioned table inheritance, the original >> parent RTE is linked, via AppendRelInfo, only to its immediate partitions. >> Partitions below the first level are accessible only via their immediate >> parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so >> consider those as well. > > I don't see much difference between those two. We usually do not use > macros in comments, so usually comments mention "other member" rel. > Let's leave this for the committer to judge. Sure. >> In expand_inherited_rtentry(), the following comment fragment is obsolete, >> because we *do* now create AppendRelInfo's for partitioned children: >> >> + /* >> + * We keep a list of objects in root, each of which maps a >> partitioned >> + * parent RT index to the list of RT indexes of its partitioned child >> + * tables which do not have AppendRelInfos associated with those. > > Good catch. I have reworded it as > /* > * We keep a list of objects in root, each of which maps a root > * partitioned parent RT index to the list of RT indexes of descendant > * partitioned child tables. > > Does that look good? Looks fine. >> By the way, when we call expand_single_inheritance_child() in the >> non-partitioned inheritance case, we should pass NULL for childrte_p, >> childRTindex_p, childrc_p, instead of declaring variables that won't be >> used. Hence, expand_single_inheritance_child() should make those >> arguments optional. > > That introduces an extra "if" condition, which is costlier than an > assignment. We have used both the styles in the code. Previously, I > have got comments otherwise. So, I am not sure. OK. expand_single_inheritance_child's header comment does not mention the new result fields. Maybe add a comment describing what their role is and that they're not optional arguments. > I will update the patches once we have some resolution about 1. prti > in PlanRowMarks and 2. detection of root partitioned table in > add_paths_to_append_rel(). OK. About 2, I somewhat agree with your proposed solution above, which might be simpler to explain in comments than the code I proposed. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/11 19:45, Ashutosh Bapat wrote: >> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote: >>> IMHO, we should make it the responsibility of the future patch to set a >>> child PlanRowMark's prti to the direct parent's RT index, when we actually >>> know that it's needed for something. We clearly know today why we need to >>> pass the other objects like child RT entry, RT index, and Relation, so we >>> should limit this patch to pass only those objects to the recursive call. >>> That makes this patch a relatively easy to understand change. >> >> I think you are mixing two issues here 1. setting parent RTI in child >> PlanRowMark and 2. passing immediate parent's PlanRowMark to >> expand_single_inheritance_child(). >> >> I have discussed 1 in my reply to Robert. >> >> About 2 you haven't given any particular comments to my reply. To me >> it looks like it's this patch that introduces the notion of >> multi-level expansion, so it's natural for this patch to pass >> PlanRowMark in cascaded fashion similar to other structures. > > You patch does 2 to be able to do 1, doesn't it? That is, to be able to > set the child PlanRowMark's prti to the direct parent's RT index, you pass > the immediate parent's PlanRowMark to the recursive call of > expand_single_inheritance_child(). No. child PlanRowMark's prti is set to parentRTIndex, which is a separate argument and is used to also set parent_relid in AppendRelInfo. > >> Actually, the original problem that caused this discussion started >> with an assertion failure in get_partitioned_child_rels() as >> Assert(list_length(result) >= 1); >> >> This assertion fails if result is NIL when an intermediate partitioned >> table is passed. May be we should assert (result == NIL || >> list_length(result) == 1) and allow that function to be called even >> for intermediate partitioned partitions for which the function will >> return NIL. That will leave the code in add_paths_to_append_rel() >> simple. Thoughts? > > Yeah, I guess that could work. We'll just have to write comments to > describe why the Assert is written that way. > >>> By the way, when we call expand_single_inheritance_child() in the >>> non-partitioned inheritance case, we should pass NULL for childrte_p, >>> childRTindex_p, childrc_p, instead of declaring variables that won't be >>> used. Hence, expand_single_inheritance_child() should make those >>> arguments optional. >> >> That introduces an extra "if" condition, which is costlier than an >> assignment. We have used both the styles in the code. Previously, I >> have got comments otherwise. So, I am not sure. > > OK. expand_single_inheritance_child's header comment does not mention the > new result fields. Maybe add a comment describing what their role is and > that they're not optional arguments. > >> I will update the patches once we have some resolution about 1. prti >> in PlanRowMarks and 2. detection of root partitioned table in >> add_paths_to_append_rel(). > > OK. > > About 2, I somewhat agree with your proposed solution above, which might > be simpler to explain in comments than the code I proposed. After testing a few queries I am getting a feeling that ExecLockNonLeafAppendTables isn't really locking anything. I will write more about that in my reply to Robert's mail. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/11 21:07, Ashutosh Bapat wrote: > On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> So, all partitioned partitions are getting locked correctly. Am I >>> missing something? >> >> That's not a valid test. In that scenario, you're going to hold all >> the locks acquired by the planner, all the locks acquired by the >> rewriter, and all the locks acquired by the executor, but when using >> prepared queries, it's possible to execute the plan after the planner >> and rewriter locks are no longer held. > > I see the same thing when I use prepare and execute So I looked at this a bit closely and came to the conclusion that we may not need to keep partitioned table RT indexes in the (Merge)Append.partitioned_rels after all, as far as execution-time locking is concerned. Consider two cases: 1. Plan is created and executed in the same transaction In this case, locks taken on the partitioned tables by the planner will suffice. 2. Plan is executed in a different transaction from the one in which it was created (a cached plan) In this case, AcquireExecutorLocks will lock all the relations in PlannedStmt.rtable, which must include all partitioned tables of all partition trees involved in the query. Of those, it will lock the tables whose RT indexes appear in PlannedStmt.nonleafResultRelations with RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global list of all partitioned table RT indexes obtained by concatenating partitioned_rels lists of all ModifyTable nodes involved in the query (set_plan_refs does that). We need to distinguish nonleafResultRelations, because we need to take the stronger lock on a given table before any weaker one if it happens to appear in the query as a non-result relation too, to avoid lock strength upgrade deadlock hazard. Moreover, because all the tables from plannedstmt->rtable, including the partitioned tables, will be added to PlannedStmt.relationsOids, any invalidation events affecting the partitioned tables (for example, add/remove a partition) will cause the plan involving partitioned tables to be recreated. In none of this do we rely on the partitioned table RT indexes appearing in the (Merge)Append node itself. Maybe, we should just remove partitioned_rels from (Merge)AppendPath and (Merge)Append node in a separate patch and move on. Thoughts? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/11 21:07, Ashutosh Bapat wrote: >> On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat >>> <ashutosh.bapat@enterprisedb.com> wrote: >>>> So, all partitioned partitions are getting locked correctly. Am I >>>> missing something? >>> >>> That's not a valid test. In that scenario, you're going to hold all >>> the locks acquired by the planner, all the locks acquired by the >>> rewriter, and all the locks acquired by the executor, but when using >>> prepared queries, it's possible to execute the plan after the planner >>> and rewriter locks are no longer held. >> >> I see the same thing when I use prepare and execute > > So I looked at this a bit closely and came to the conclusion that we may > not need to keep partitioned table RT indexes in the > (Merge)Append.partitioned_rels after all, as far as execution-time locking > is concerned. > > Consider two cases: > > 1. Plan is created and executed in the same transaction > > In this case, locks taken on the partitioned tables by the planner will > suffice. > > 2. Plan is executed in a different transaction from the one in which it > was created (a cached plan) > > In this case, AcquireExecutorLocks will lock all the relations in > PlannedStmt.rtable, which must include all partitioned tables of all > partition trees involved in the query. Of those, it will lock the tables > whose RT indexes appear in PlannedStmt.nonleafResultRelations with > RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global > list of all partitioned table RT indexes obtained by concatenating > partitioned_rels lists of all ModifyTable nodes involved in the query > (set_plan_refs does that). We need to distinguish nonleafResultRelations, > because we need to take the stronger lock on a given table before any > weaker one if it happens to appear in the query as a non-result relation > too, to avoid lock strength upgrade deadlock hazard. > > Moreover, because all the tables from plannedstmt->rtable, including the > partitioned tables, will be added to PlannedStmt.relationsOids, any > invalidation events affecting the partitioned tables (for example, > add/remove a partition) will cause the plan involving partitioned tables > to be recreated. > > In none of this do we rely on the partitioned table RT indexes appearing > in the (Merge)Append node itself. Maybe, we should just remove > partitioned_rels from (Merge)AppendPath and (Merge)Append node in a > separate patch and move on. > > Thoughts? Yes, I did the same analysis (to which I refer in my earlier reply to you). I too think we should just remove partitioned_rels from Append paths. But then the question is those are then transferred to ModifyTable node in create_modifytable_plan() and use it for something else. What should we do about that code? I don't think we are really using that list from ModifyTable node as well, so may be we could remove it from there as well. What do you think? Does that mean partitioned_rels isn't used at all in the code? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/12 16:55, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote: >> So I looked at this a bit closely and came to the conclusion that we may >> not need to keep partitioned table RT indexes in the >> (Merge)Append.partitioned_rels after all, as far as execution-time locking >> is concerned. >> >> Consider two cases: >> >> 1. Plan is created and executed in the same transaction >> >> In this case, locks taken on the partitioned tables by the planner will >> suffice. >> >> 2. Plan is executed in a different transaction from the one in which it >> was created (a cached plan) >> >> In this case, AcquireExecutorLocks will lock all the relations in >> PlannedStmt.rtable, which must include all partitioned tables of all >> partition trees involved in the query. Of those, it will lock the tables >> whose RT indexes appear in PlannedStmt.nonleafResultRelations with >> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global >> list of all partitioned table RT indexes obtained by concatenating >> partitioned_rels lists of all ModifyTable nodes involved in the query >> (set_plan_refs does that). We need to distinguish nonleafResultRelations, >> because we need to take the stronger lock on a given table before any >> weaker one if it happens to appear in the query as a non-result relation >> too, to avoid lock strength upgrade deadlock hazard. >> >> Moreover, because all the tables from plannedstmt->rtable, including the >> partitioned tables, will be added to PlannedStmt.relationsOids, any >> invalidation events affecting the partitioned tables (for example, >> add/remove a partition) will cause the plan involving partitioned tables >> to be recreated. >> >> In none of this do we rely on the partitioned table RT indexes appearing >> in the (Merge)Append node itself. Maybe, we should just remove >> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a >> separate patch and move on. >> >> Thoughts? > > Yes, I did the same analysis (to which I refer in my earlier reply to > you). I too think we should just remove partitioned_rels from Append > paths. But then the question is those are then transferred to > ModifyTable node in create_modifytable_plan() and use it for something > else. What should we do about that code? I don't think we are really > using that list from ModifyTable node as well, so may be we could > remove it from there as well. What do you think? Does that mean > partitioned_rels isn't used at all in the code? No, we cannot simply get rid of partitioned_rels altogether. We'll need to keep it in the ModifyTable node, because we *do* need the nonleafResultRelations list in PlannedStmt to distinguish partitioned table result relations, which set_plan_refs builds by concatenating partitioned_rels lists of various ModifyTable nodes of the query. The PlannedStmt.nonleafResultRelations list actually has some use (which parallels PlannedStmt.resultRelations), but partitioned_rels list in the individual (Merge)Append, as it turns out, doesn't. So, we can remove partitioned_rels from (Merge)AppendPath and (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/12 16:39, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/09/11 19:45, Ashutosh Bapat wrote: >>> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote: >>>> IMHO, we should make it the responsibility of the future patch to set a >>>> child PlanRowMark's prti to the direct parent's RT index, when we actually >>>> know that it's needed for something. We clearly know today why we need to >>>> pass the other objects like child RT entry, RT index, and Relation, so we >>>> should limit this patch to pass only those objects to the recursive call. >>>> That makes this patch a relatively easy to understand change. >>> >>> I think you are mixing two issues here 1. setting parent RTI in child >>> PlanRowMark and 2. passing immediate parent's PlanRowMark to >>> expand_single_inheritance_child(). >>> >>> I have discussed 1 in my reply to Robert. >>> >>> About 2 you haven't given any particular comments to my reply. To me >>> it looks like it's this patch that introduces the notion of >>> multi-level expansion, so it's natural for this patch to pass >>> PlanRowMark in cascaded fashion similar to other structures. >> >> You patch does 2 to be able to do 1, doesn't it? That is, to be able to >> set the child PlanRowMark's prti to the direct parent's RT index, you pass >> the immediate parent's PlanRowMark to the recursive call of >> expand_single_inheritance_child(). > > No. child PlanRowMark's prti is set to parentRTIndex, which is a > separate argument and is used to also set parent_relid in > AppendRelInfo. OK. So, to keep the old behavior (if at all), we'd actually need a new argument rootParentRTindex. Old behavior being that all child PlanRowMarks has the rootParentRTindex as their prti. It seems though that the new behavior where prti will now be set to the direct parent's RT index is more or less harmless, because whatever we set prti to, as long as it's different from rti, we can consider it a child PlanRowMark. So it might be fine to set prti to direct parent's RT index. That said, I noticed that we might need to be careful about what the value of the root parent's PlanRowMark's allMarkType field gets set to. We need to make sure that it reflects markType of all partitions in the tree, including those that are not root parent's direct children. Is that true with the proposed implementation? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/12 16:55, Ashutosh Bapat wrote: >> On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote: >>> So I looked at this a bit closely and came to the conclusion that we may >>> not need to keep partitioned table RT indexes in the >>> (Merge)Append.partitioned_rels after all, as far as execution-time locking >>> is concerned. >>> >>> Consider two cases: >>> >>> 1. Plan is created and executed in the same transaction >>> >>> In this case, locks taken on the partitioned tables by the planner will >>> suffice. >>> >>> 2. Plan is executed in a different transaction from the one in which it >>> was created (a cached plan) >>> >>> In this case, AcquireExecutorLocks will lock all the relations in >>> PlannedStmt.rtable, which must include all partitioned tables of all >>> partition trees involved in the query. Of those, it will lock the tables >>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with >>> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global >>> list of all partitioned table RT indexes obtained by concatenating >>> partitioned_rels lists of all ModifyTable nodes involved in the query >>> (set_plan_refs does that). We need to distinguish nonleafResultRelations, >>> because we need to take the stronger lock on a given table before any >>> weaker one if it happens to appear in the query as a non-result relation >>> too, to avoid lock strength upgrade deadlock hazard. >>> >>> Moreover, because all the tables from plannedstmt->rtable, including the >>> partitioned tables, will be added to PlannedStmt.relationsOids, any >>> invalidation events affecting the partitioned tables (for example, >>> add/remove a partition) will cause the plan involving partitioned tables >>> to be recreated. >>> >>> In none of this do we rely on the partitioned table RT indexes appearing >>> in the (Merge)Append node itself. Maybe, we should just remove >>> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a >>> separate patch and move on. >>> >>> Thoughts? >> >> Yes, I did the same analysis (to which I refer in my earlier reply to >> you). I too think we should just remove partitioned_rels from Append >> paths. But then the question is those are then transferred to >> ModifyTable node in create_modifytable_plan() and use it for something >> else. What should we do about that code? I don't think we are really >> using that list from ModifyTable node as well, so may be we could >> remove it from there as well. What do you think? Does that mean >> partitioned_rels isn't used at all in the code? > > No, we cannot simply get rid of partitioned_rels altogether. We'll need > to keep it in the ModifyTable node, because we *do* need the > nonleafResultRelations list in PlannedStmt to distinguish partitioned > table result relations, which set_plan_refs builds by concatenating > partitioned_rels lists of various ModifyTable nodes of the query. The > PlannedStmt.nonleafResultRelations list actually has some use (which > parallels PlannedStmt.resultRelations), but partitioned_rels list in the > individual (Merge)Append, as it turns out, doesn't. > > So, we can remove partitioned_rels from (Merge)AppendPath and > (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). Don't we need partitioned_rels from Append paths to be transferred to ModifyTable node or we have a different way of calculating nonleafResultRelations? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/12 17:53, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote: >> So, we can remove partitioned_rels from (Merge)AppendPath and >> (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). > > Don't we need partitioned_rels from Append paths to be transferred to > ModifyTable node or we have a different way of calculating > nonleafResultRelations? No, we don't transfer partitioned_rels from Append path to ModifyTable node. inheritance_planner(), that builds the ModifyTable path for UPDATE/DELETE on a partitioned table, fetches partitioned_rels from root->pcinfo_list itself and passes it to create_modifytable_path. No Append path is involved in that case. PlannedStmt.nonleafResultRelations is built by concatenating the partitioned_rels lists of all ModifyTable nodes appearing in the query. It does not depend on Append's or AppendPath's partitioned_rels. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > That said, I noticed that we might need to be careful about what the value > of the root parent's PlanRowMark's allMarkType field gets set to. We need > to make sure that it reflects markType of all partitions in the tree, > including those that are not root parent's direct children. Is that true > with the proposed implementation? Yes. We include child's allMarkTypes into parent's allMarkTypes. So, top parent's PlanRowMarks should have all descendant's allMarkTypes, which is not happening in the patch right now. There are two ways to fix that. 1. Pass top parent's PlanRowMark all the way down to the leaf partitions, so that current expand_single_inheritance_child() collects allMarkTypes of all children correctly. But this way, PlanRowMarks of intermediate parent does not reflect allMarkTypes of its children, only top root records that. 2. Pass immediate parent's PlanRowMark to expand_single_inheritance_child(), so that it records allMarkTypes of its children. In expand_partitioned_rtentry() have following sequence expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel, parentrc,childrel, appinfos, &childrte, &childRTindex, &childrc); /* If this child is itself partitioned, recurse */ if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { expand_partitioned_rtentry(root, childrte, childRTindex, childrel, childrc, lockmode, appinfos, partitioned_child_rels); /* Include child's rowmark type in parent's allMarkTypes */ parentrc->allMarkTypes |= childrc->allMarkTypes; } so that we push allMarkTypes up the hierarchy. I like the second way, since every intermediate parent records allMarkTypes of its descendants. Thoughts? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/12 18:49, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >> That said, I noticed that we might need to be careful about what the value >> of the root parent's PlanRowMark's allMarkType field gets set to. We need >> to make sure that it reflects markType of all partitions in the tree, >> including those that are not root parent's direct children. Is that true >> with the proposed implementation? > > Yes. We include child's allMarkTypes into parent's allMarkTypes. So, > top parent's PlanRowMarks should have all descendant's allMarkTypes, > which is not happening in the patch right now. There are two ways to > fix that. > > 1. Pass top parent's PlanRowMark all the way down to the leaf > partitions, so that current expand_single_inheritance_child() collects > allMarkTypes of all children correctly. But this way, PlanRowMarks of > intermediate parent does not reflect allMarkTypes of its children, > only top root records that. > 2. Pass immediate parent's PlanRowMark to > expand_single_inheritance_child(), so that it records allMarkTypes of > its children. In expand_partitioned_rtentry() have following sequence > > expand_single_inheritance_child(root, parentrte, parentRTindex, > parentrel, parentrc, childrel, > appinfos, &childrte, &childRTindex, > &childrc); > > /* If this child is itself partitioned, recurse */ > if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > { > expand_partitioned_rtentry(root, childrte, childRTindex, > childrel, childrc, lockmode, appinfos, > partitioned_child_rels); > > /* Include child's rowmark type in parent's allMarkTypes */ > parentrc->allMarkTypes |= childrc->allMarkTypes; > } > so that we push allMarkTypes up the hierarchy. > > I like the second way, since every intermediate parent records > allMarkTypes of its descendants. I like the second way, too. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 12, 2017 at 2:35 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/12 17:53, Ashutosh Bapat wrote: >> On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote: >>> So, we can remove partitioned_rels from (Merge)AppendPath and >>> (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). >> >> Don't we need partitioned_rels from Append paths to be transferred to >> ModifyTable node or we have a different way of calculating >> nonleafResultRelations? > > No, we don't transfer partitioned_rels from Append path to ModifyTable > node. inheritance_planner(), that builds the ModifyTable path for > UPDATE/DELETE on a partitioned table, fetches partitioned_rels from > root->pcinfo_list itself and passes it to create_modifytable_path. No > Append path is involved in that case. PlannedStmt.nonleafResultRelations > is built by concatenating the partitioned_rels lists of all ModifyTable > nodes appearing in the query. It does not depend on Append's or > AppendPath's partitioned_rels. Ok. Thanks for the explanation. This make me examine inheritance_planner() closely and I think I have spotted a thinko there. In inheritance_planner() parent_rte is set to the RTE of parent to start with and then in the loop 1132 /* 1133 * And now we can get on with generating a plan for each child table. 1134 */ 1135 foreach(lc, root->append_rel_list) 1136 { ... code clipped 1165 /* 1166 * If there are securityQuals attached to the parent, move them to the 1167 * child rel (they've already been transformed properly for that). 1168 */ 1169 parent_rte = rt_fetch(parentRTindex, subroot->parse->rtable); 1170 child_rte = rt_fetch(appinfo->child_relid, subroot->parse->rtable); 1171 child_rte->securityQuals = parent_rte->securityQuals; 1172 parent_rte->securityQuals = NIL; we set parent_rte to the one obtained from subroot->parse, which happens to be the same (at least in contents) as original parent_rte. Later we use this parent_rte to pull partitioned_rels outside that loop 1371 if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE) 1372 { 1373 partitioned_rels = get_partitioned_child_rels(root, parentRTindex); 1374 /* The root partitioned table is included as a child rel */ 1375 Assert(list_length(partitioned_rels) >= 1); 1376 } I think the code here expects the original parent_rte and not the one we set around line 1169. This isn't a bug right now, since both the parent_rte s have same content. But I am not sure if that will remain to be so. Here's patch to fix the thinko. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > In this case, AcquireExecutorLocks will lock all the relations in > PlannedStmt.rtable, which must include all partitioned tables of all > partition trees involved in the query. Of those, it will lock the tables > whose RT indexes appear in PlannedStmt.nonleafResultRelations with > RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global > list of all partitioned table RT indexes obtained by concatenating > partitioned_rels lists of all ModifyTable nodes involved in the query > (set_plan_refs does that). We need to distinguish nonleafResultRelations, > because we need to take the stronger lock on a given table before any > weaker one if it happens to appear in the query as a non-result relation > too, to avoid lock strength upgrade deadlock hazard. Hmm. The problem with this theory in my view is that it doesn't explain why InitPlan() and ExecOpenScanRelation() lock the relations instead of just assuming that they are already locked either by AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables() doesn't really need to take locks, then ExecOpenScanRelation() must not need to do it either. We invented ExecLockNonLeafAppendTables() on the occasion of removing the scans of those tables which would previously have caused ExecOpenScanRelation() to be invoked, so as to keep the locking behavior unchanged. AcquireExecutorLocks() looks like an odd bit of code to me. The executor itself locks result tables in InitPlan() and then everything else during InitPlan() and all of the others later on while walking the plan tree -- comments in InitPlan() say that this is to avoid a lock upgrade hazard if a result rel is also a source rel. But AcquireExecutorLocks() has no such provision; it just locks everything in RTE order. In theory, that's a deadlock hazard of another kind, as we just talked about in the context of EIBO. In fact, expanding in bound order has made the situation worse: before, expansion order and locking order were the same, so maybe having AcquireExecutorLocks() work in RTE order coincidentally happened to give the same result as the executor code itself as long as there are no result relations. But this is certainly not true any more. I'm not sure it's worth expending a lot of time on this -- it's evidently not a problem in practice, or somebody probably would've complained before now. But that having been said, I don't think we should assume that all the locks taken from the executor are worthless because plancache.c will always do the job for us. I don't know of a case where we execute a saved plan without going through the plan cache, but that doesn't mean that there isn't one or that there couldn't be one in the future. It's not the job of these partitioning patches to whack around the way we do locking in general -- they should preserve the existing behavior as much as possible. If we want to get rid of the locking in the executor altogether, that's a separate discussion where, I have a feeling, there will prove to be better reasons for the way things are than we are right now supposing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/12 19:56, Ashutosh Bapat wrote: > I think the code here expects the original parent_rte and not the one > we set around line 1169. > > This isn't a bug right now, since both the parent_rte s have same > content. But I am not sure if that will remain to be so. Here's patch > to fix the thinko. Instead of the new bool is_parent_partitioned, why not move the code to set partitioned_rels to the block where you're now setting is_parent_partitioned. Also, since we know this isn't a bug at the moment but will turn into one once we have step-wise expansion, why not include this fix in that patch itself? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Khandekar
Date:
Hi, Rafia had done some testing on TPCH queries using Partition-wise join patch along with Parallel Append patch. There, we had observed that for query 4, even though the partition wise joins are under a Parallel Append, the join are all non-partial. Specifically, the partition-wise join has non-partial nested loop joins when actually it was expected to have partial nested loop joins. (The difference can be seen by the observation that the outer relation of that join is scanned by non-parallel Bitmap Heap scan when it should have used Parallel Bitmap Heap Scan). Here is the detailed analysis , including where I think is the issue : https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com All the TPCH results are posted in the same above mail thread. Thanks -Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> In this case, AcquireExecutorLocks will lock all the relations in >> PlannedStmt.rtable, which must include all partitioned tables of all >> partition trees involved in the query. Of those, it will lock the tables >> whose RT indexes appear in PlannedStmt.nonleafResultRelations with >> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global >> list of all partitioned table RT indexes obtained by concatenating >> partitioned_rels lists of all ModifyTable nodes involved in the query >> (set_plan_refs does that). We need to distinguish nonleafResultRelations, >> because we need to take the stronger lock on a given table before any >> weaker one if it happens to appear in the query as a non-result relation >> too, to avoid lock strength upgrade deadlock hazard. > > Hmm. The problem with this theory in my view is that it doesn't > explain why InitPlan() and ExecOpenScanRelation() lock the relations > instead of just assuming that they are already locked either by > AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables() > doesn't really need to take locks, then ExecOpenScanRelation() must > not need to do it either. We invented ExecLockNonLeafAppendTables() > on the occasion of removing the scans of those tables which would > previously have caused ExecOpenScanRelation() to be invoked, so as to > keep the locking behavior unchanged. > > AcquireExecutorLocks() looks like an odd bit of code to me. The > executor itself locks result tables in InitPlan() and then everything > else during InitPlan() and all of the others later on while walking > the plan tree -- comments in InitPlan() say that this is to avoid a > lock upgrade hazard if a result rel is also a source rel. But > AcquireExecutorLocks() has no such provision; it just locks everything > in RTE order. In theory, that's a deadlock hazard of another kind, as > we just talked about in the context of EIBO. In fact, expanding in > bound order has made the situation worse: before, expansion order and > locking order were the same, so maybe having AcquireExecutorLocks() > work in RTE order coincidentally happened to give the same result as > the executor code itself as long as there are no result relations. > But this is certainly not true any more. I'm not sure it's worth > expending a lot of time on this -- it's evidently not a problem in > practice, or somebody probably would've complained before now. > > But that having been said, I don't think we should assume that all the > locks taken from the executor are worthless because plancache.c will > always do the job for us. I don't know of a case where we execute a > saved plan without going through the plan cache, but that doesn't mean > that there isn't one or that there couldn't be one in the future. > It's not the job of these partitioning patches to whack around the way > we do locking in general -- they should preserve the existing behavior > as much as possible. If we want to get rid of the locking in the > executor altogether, that's a separate discussion where, I have a > feeling, there will prove to be better reasons for the way things are > than we are right now supposing. > I agree that it's not the job of these patches to change the locking or even get rid of partitioned_rels. In order to continue returning partitioned_rels in Append paths esp. in the case of queries involving set operations and partitioned table e.g "select 1 from t1 union all select 2 from t1;" in which t1 is multi-level partitioned table, we need a fix in add_paths_to_append_rels(). The fix provided in [1] is correct but we will need a longer explanation of why we have to involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation is complicated. If we get rid of partitioned_rels, we don't need to fix that code in add_paths_to_append_rel(). I suggested that [2] -- (excerpt from [2]) Actually, the original problem that caused this discussion started with an assertion failure in get_partitioned_child_rels() as Assert(list_length(result) >= 1); This assertion fails if result is NIL when an intermediate partitioned table is passed. May be we should assert (result == NIL || list_length(result) == 1) and allow that function to be called even for intermediate partitioned partitions for which the function will return NIL. That will leave the code in add_paths_to_append_rel() simple. Thoughts? -- Amit Langote agrees with this. It kind of makes the assertion lame but keeps the code sane. What do you think? [1] https://www.postgresql.org/message-id/d2f1cdcb-ebb4-76c5-e471-79348ca5d7a7@lab.ntt.co.jp [2] https://www.postgresql.org/message-id/CAFjFpRfJ3GRRmmOugaMA-q4i=se5P6yjZ_C6A6HDRDQQTGXy1A@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Sep 13, 2017 at 11:29 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/12 19:56, Ashutosh Bapat wrote: >> I think the code here expects the original parent_rte and not the one >> we set around line 1169. >> >> This isn't a bug right now, since both the parent_rte s have same >> content. But I am not sure if that will remain to be so. Here's patch >> to fix the thinko. > > Instead of the new bool is_parent_partitioned, why not move the code to > set partitioned_rels to the block where you're now setting > is_parent_partitioned. > > Also, since we know this isn't a bug at the moment but will turn into one > once we have step-wise expansion, why not include this fix in that patch > itself? It won't turn into a bug with step-wise expansion since every parent_rte will have RELKIND_PARTITIONED_TABLE for a partitioned top parent, which is used to extract the partitioned_rels. But I guess, it's better to fix the thinko in step-wise expansion since parent_rte itself changes. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Hi, > > Rafia had done some testing on TPCH queries using Partition-wise join > patch along with Parallel Append patch. > > There, we had observed that for query 4, even though the partition > wise joins are under a Parallel Append, the join are all non-partial. > > Specifically, the partition-wise join has non-partial nested loop > joins when actually it was expected to have partial nested loop joins. > (The difference can be seen by the observation that the outer relation > of that join is scanned by non-parallel Bitmap Heap scan when it > should have used Parallel Bitmap Heap Scan). > > Here is the detailed analysis , including where I think is the issue : > > https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com > > All the TPCH results are posted in the same above mail thread. Can you please check if the attached patch fixes the issue. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/13 16:21, Ashutosh Bapat wrote: > On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> locks taken from the executor are worthless because plancache.c will >> always do the job for us. I don't know of a case where we execute a >> saved plan without going through the plan cache, but that doesn't mean >> that there isn't one or that there couldn't be one in the future. >> It's not the job of these partitioning patches to whack around the way >> we do locking in general -- they should preserve the existing behavior >> as much as possible. If we want to get rid of the locking in the >> executor altogether, that's a separate discussion where, I have a >> feeling, there will prove to be better reasons for the way things are >> than we are right now supposing. >> > > I agree that it's not the job of these patches to change the locking > or even get rid of partitioned_rels. In order to continue returning > partitioned_rels in Append paths esp. in the case of queries involving > set operations and partitioned table e.g "select 1 from t1 union all > select 2 from t1;" in which t1 is multi-level partitioned table, we > need a fix in add_paths_to_append_rels(). The fix provided in [1] is > correct but we will need a longer explanation of why we have to > involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation > is complicated. If we get rid of partitioned_rels, we don't need to > fix that code in add_paths_to_append_rel(). Yeah, let's get on with setting partitioned_rels in AppendPath correctly in this patch. Ashutosh's suggested approach seems fine, although it needlessly requires to scan root->pcinfo_list. But it shouldn't be longer than the number of partitioned tables in the query, so maybe that's fine too. At least, it doesn't require us to add code to add_paths_to_append_rel() that can be pretty hard to wrap one's head around. That said, we might someday need to look carefully at some things that Robert mentioned carefully, especially around the order of locks taken by AcquireExecutorLocks() in light of the EIBO patch getting committed. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Khandekar
Date:
On 13 September 2017 at 13:05, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> Hi, >> >> Rafia had done some testing on TPCH queries using Partition-wise join >> patch along with Parallel Append patch. >> >> There, we had observed that for query 4, even though the partition >> wise joins are under a Parallel Append, the join are all non-partial. >> >> Specifically, the partition-wise join has non-partial nested loop >> joins when actually it was expected to have partial nested loop joins. >> (The difference can be seen by the observation that the outer relation >> of that join is scanned by non-parallel Bitmap Heap scan when it >> should have used Parallel Bitmap Heap Scan). >> >> Here is the detailed analysis , including where I think is the issue : >> >> https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com >> >> All the TPCH results are posted in the same above mail thread. > > Can you please check if the attached patch fixes the issue. Thanks Ashutosh. Yes, it does fix the issue. Partial Nested Loop joins are generated now. If I see any unexpected differences in the estimated or actual costs, I will report that in the Parallel Append thread. As far as Partition-wise join is concerned, this issue is solved, because Partial nested loop join does get created. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Sep 13, 2017 at 12:51 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> In this case, AcquireExecutorLocks will lock all the relations in >>> PlannedStmt.rtable, which must include all partitioned tables of all >>> partition trees involved in the query. Of those, it will lock the tables >>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with >>> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global >>> list of all partitioned table RT indexes obtained by concatenating >>> partitioned_rels lists of all ModifyTable nodes involved in the query >>> (set_plan_refs does that). We need to distinguish nonleafResultRelations, >>> because we need to take the stronger lock on a given table before any >>> weaker one if it happens to appear in the query as a non-result relation >>> too, to avoid lock strength upgrade deadlock hazard. >> >> Hmm. The problem with this theory in my view is that it doesn't >> explain why InitPlan() and ExecOpenScanRelation() lock the relations >> instead of just assuming that they are already locked either by >> AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables() >> doesn't really need to take locks, then ExecOpenScanRelation() must >> not need to do it either. We invented ExecLockNonLeafAppendTables() >> on the occasion of removing the scans of those tables which would >> previously have caused ExecOpenScanRelation() to be invoked, so as to >> keep the locking behavior unchanged. >> >> AcquireExecutorLocks() looks like an odd bit of code to me. The >> executor itself locks result tables in InitPlan() and then everything >> else during InitPlan() and all of the others later on while walking >> the plan tree -- comments in InitPlan() say that this is to avoid a >> lock upgrade hazard if a result rel is also a source rel. But >> AcquireExecutorLocks() has no such provision; it just locks everything >> in RTE order. In theory, that's a deadlock hazard of another kind, as >> we just talked about in the context of EIBO. In fact, expanding in >> bound order has made the situation worse: before, expansion order and >> locking order were the same, so maybe having AcquireExecutorLocks() >> work in RTE order coincidentally happened to give the same result as >> the executor code itself as long as there are no result relations. >> But this is certainly not true any more. I'm not sure it's worth >> expending a lot of time on this -- it's evidently not a problem in >> practice, or somebody probably would've complained before now. >> >> But that having been said, I don't think we should assume that all the >> locks taken from the executor are worthless because plancache.c will >> always do the job for us. I don't know of a case where we execute a >> saved plan without going through the plan cache, but that doesn't mean >> that there isn't one or that there couldn't be one in the future. >> It's not the job of these partitioning patches to whack around the way >> we do locking in general -- they should preserve the existing behavior >> as much as possible. If we want to get rid of the locking in the >> executor altogether, that's a separate discussion where, I have a >> feeling, there will prove to be better reasons for the way things are >> than we are right now supposing. >> > > I agree that it's not the job of these patches to change the locking > or even get rid of partitioned_rels. In order to continue returning > partitioned_rels in Append paths esp. in the case of queries involving > set operations and partitioned table e.g "select 1 from t1 union all > select 2 from t1;" in which t1 is multi-level partitioned table, we > need a fix in add_paths_to_append_rels(). The fix provided in [1] is > correct but we will need a longer explanation of why we have to > involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation > is complicated. If we get rid of partitioned_rels, we don't need to > fix that code in add_paths_to_append_rel(). > > I suggested that [2] > -- (excerpt from [2]) > > Actually, the original problem that caused this discussion started > with an assertion failure in get_partitioned_child_rels() as > Assert(list_length(result) >= 1); > > This assertion fails if result is NIL when an intermediate partitioned > table is passed. May be we should assert (result == NIL || > list_length(result) == 1) and allow that function to be called even > for intermediate partitioned partitions for which the function will > return NIL. That will leave the code in add_paths_to_append_rel() > simple. Thoughts? > -- > > Amit Langote agrees with this. It kind of makes the assertion lame but > keeps the code sane. What do you think? I debugged what happens in case of query "select 1 from t1 union all select 2 from t1;" with the current HEAD (without multi-level expansion patch attached). It doesn't set partitioned_rels in Append path that gets converted into Append plan. Remember t1 is a multi-level partitioned table here with t1p1 as its immediate partition and t1p1p1 as partition of t1p1. So, the set_append_rel_pathlist() recurses once as shown in the following stack trace. #0 add_paths_to_append_rel (root=0x23e4308, rel=0x23fb768, live_childrels=0x23ff5f0) at allpaths.c:1281 #1 0x000000000076e170 in set_append_rel_pathlist (root=0x23e4308, rel=0x23fb768, rti=4, rte=0x23f3268) at allpaths.c:1262 #2 0x000000000076cf23 in set_rel_pathlist (root=0x23e4308, rel=0x23fb768, rti=4, rte=0x23f3268) at allpaths.c:431 #3 0x000000000076e0f6 in set_append_rel_pathlist (root=0x23e4308, rel=0x23fb478, rti=1, rte=0x2382070) at allpaths.c:1247 #4 0x000000000076cf23 in set_rel_pathlist (root=0x23e4308, rel=0x23fb478, rti=1, rte=0x2382070) at allpaths.c:431 #5 0x000000000076cc22 in set_base_rel_pathlists (root=0x23e4308) at allpaths.c:309 When add_paths_to_append_rel() (frame 0) is called for t1, it gets partitioned_rels and stuffs it in append path/s it creates. But those paths are flattened into the append paths created for the set operations when add_paths_to_append_rels() is called from frame 3. While flattening the append paths in accumulate_append_subpath() we do not pull any partitioned_rels that are stuffed in those paths and thus the final append path/s created does not have partitioned_rels in there. The same behaviour is retained by my v30 patchset [1]. I think we should go ahead by fixing add_paths_to_append_rel() as done in that patchset. partitioned_rels needs to be removed from append paths anyway, so that code will be removed when we do that. [1] https://www.postgresql.org/message-id/CAFjFpRfHkJW3G=_PnSUc6PbXJE48AWYwyRzaGqtfKzzoU4wXXw@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I debugged what happens in case of query "select 1 from t1 union all > select 2 from t1;" with the current HEAD (without multi-level > expansion patch attached). It doesn't set partitioned_rels in Append > path that gets converted into Append plan. Remember t1 is a > multi-level partitioned table here with t1p1 as its immediate > partition and t1p1p1 as partition of t1p1. So, the > set_append_rel_pathlist() recurses once as shown in the following > stack trace. Nice debugging. I spent some time today looking at this and I think it's a bug in v10, and specifically in add_paths_to_append_rel(), which only sets partitioned_rels correctly when the appendrel is a partitioned rel, and not when it's a subquery RTE with one or more partitioned queries beneath it. Attached are two patches either one of which will fix it. First, I wrote mechanical-partrels-fix.patch, which just mechanically propagates partitioned_rels lists from accumulated subpaths into the list used to construct the parent (Merge)AppendPath. I wasn't entire happy with that, because it ends up building multiple partitioned_rels lists for the same RelOptInfo. That seems silly, but there's no principled way to avoid it; avoiding it amounts to hoping that all the paths for the same relation carry the same partitioned_rels list, which is uncomfortable. So then I wrote pcinfo-for-subquery.patch. That patch notices when an RTE_SUBQUERY appendrel is processed and accumulates the partitioned_rels of its immediate children; in case there can be multiple nested levels of subqueries before we get down to the actual partitioned rel, it also adds a PartitionedChildRelInfo for the subquery RTE, so that there's no need to walk the whole tree to build the partitioned_rels list at higher levels, just the immediate children. I find this fix a lot more satisfying. It adds less code and does no extra work in the common case. Notice that the choice of fix we adopt has consequences for your 0001-Multi-level-partitioned-table-expansion.patch -- with mechanical-partrels-fix.patch, that patch could either associated all partitioned_rels with the top-parent or it could work level by level and everything would get properly assembled later. But with pcinfo-for-subquery.patch, we need everything associated with the top-parent. That doesn't seem like a problem to me, but it's something to note. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/14 7:43, Robert Haas wrote: > On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I debugged what happens in case of query "select 1 from t1 union all >> select 2 from t1;" with the current HEAD (without multi-level >> expansion patch attached). It doesn't set partitioned_rels in Append >> path that gets converted into Append plan. Remember t1 is a >> multi-level partitioned table here with t1p1 as its immediate >> partition and t1p1p1 as partition of t1p1. So, the >> set_append_rel_pathlist() recurses once as shown in the following >> stack trace. > > Nice debugging. +1. > I spent some time today looking at this and I think > it's a bug in v10, and specifically in add_paths_to_append_rel(), > which only sets partitioned_rels correctly when the appendrel is a > partitioned rel, and not when it's a subquery RTE with one or more > partitioned queries beneath it. > > Attached are two patches either one of which will fix it. First, I > wrote mechanical-partrels-fix.patch, which just mechanically > propagates partitioned_rels lists from accumulated subpaths into the > list used to construct the parent (Merge)AppendPath. I wasn't entire > happy with that, because it ends up building multiple partitioned_rels > lists for the same RelOptInfo. That seems silly, but there's no > principled way to avoid it; avoiding it amounts to hoping that all the > paths for the same relation carry the same partitioned_rels list, > which is uncomfortable. > > So then I wrote pcinfo-for-subquery.patch. That patch notices when an > RTE_SUBQUERY appendrel is processed and accumulates the > partitioned_rels of its immediate children; in case there can be > multiple nested levels of subqueries before we get down to the actual > partitioned rel, it also adds a PartitionedChildRelInfo for the > subquery RTE, so that there's no need to walk the whole tree to build > the partitioned_rels list at higher levels, just the immediate > children. I find this fix a lot more satisfying. It adds less code > and does no extra work in the common case. I very much like pcinfo-for-subquery.patch, although I'm not sure if we need to create PartitionedChildRelInfo for the sub-query parent RTE as the patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL subqueries are flattened way before we get to add_paths_to_append_rel(); if it could not be flattened, there wouldn't be a call to add_paths_to_append_rel() in the first place, because no AppendRelInfos would be generated. See what happens when is_simple_union_all_recurse() returns false to flatten_simple_union_all() -- no AppendRelInfos will be generated and added to root->append_rel_list in that case. IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries like we're setting out to build for multi-level partitioned tables. So, as things stand today, there can at most be one recursive call of add_path_to_append_rel() for a sub-query parent RTE, that is, if its child sub-queries contain partitioned tables, but not more. The other patch (multi-level expansion of partitioned tables) will change that, but even then we won't need sub-query's own PartitioendChildRelInfo. > Notice that the choice of fix we adopt has consequences for your > 0001-Multi-level-partitioned-table-expansion.patch -- with > mechanical-partrels-fix.patch, that patch could either associated all > partitioned_rels with the top-parent or it could work level by level > and everything would get properly assembled later. But with > pcinfo-for-subquery.patch, we need everything associated with the > top-parent. That doesn't seem like a problem to me, but it's > something to note. I think it's fine. With 0001-Multi-level-partitioned-table-expansion.patch, get_partitioned_child_rels() will get called even for non-root partitioned tables, for which it won't find a valid pcinfo. I think that patch must also change its callers to stop Asserting that a valid pcinfo is returned. Spotted a typo in pcinfo-for-subquery.patch: + * A plain relation will alread have Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Thu, Sep 14, 2017 at 4:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I debugged what happens in case of query "select 1 from t1 union all >> select 2 from t1;" with the current HEAD (without multi-level >> expansion patch attached). It doesn't set partitioned_rels in Append >> path that gets converted into Append plan. Remember t1 is a >> multi-level partitioned table here with t1p1 as its immediate >> partition and t1p1p1 as partition of t1p1. So, the >> set_append_rel_pathlist() recurses once as shown in the following >> stack trace. > > Nice debugging. I spent some time today looking at this and I think > it's a bug in v10, and specifically in add_paths_to_append_rel(), > which only sets partitioned_rels correctly when the appendrel is a > partitioned rel, and not when it's a subquery RTE with one or more > partitioned queries beneath it. > > Attached are two patches either one of which will fix it. First, I > wrote mechanical-partrels-fix.patch, which just mechanically > propagates partitioned_rels lists from accumulated subpaths into the > list used to construct the parent (Merge)AppendPath. I wasn't entire > happy with that, because it ends up building multiple partitioned_rels > lists for the same RelOptInfo. That seems silly, but there's no > principled way to avoid it; avoiding it amounts to hoping that all the > paths for the same relation carry the same partitioned_rels list, > which is uncomfortable. > > So then I wrote pcinfo-for-subquery.patch. That patch notices when an > RTE_SUBQUERY appendrel is processed and accumulates the > partitioned_rels of its immediate children; in case there can be > multiple nested levels of subqueries before we get down to the actual > partitioned rel, it also adds a PartitionedChildRelInfo for the > subquery RTE, so that there's no need to walk the whole tree to build > the partitioned_rels list at higher levels, just the immediate > children. I find this fix a lot more satisfying. It adds less code > and does no extra work in the common case. Thanks a lot for the patch. I have included pcinfo-for-subquery.patch in my patchset as the first patch with typo corrections suggested by Amit Langote. > > Notice that the choice of fix we adopt has consequences for your > 0001-Multi-level-partitioned-table-expansion.patch -- with > mechanical-partrels-fix.patch, that patch could either associated all > partitioned_rels with the top-parent or it could work level by level > and everything would get properly assembled later. But with > pcinfo-for-subquery.patch, we need everything associated with the > top-parent. That doesn't seem like a problem to me, but it's > something to note. > I have few changes to multi-level expansion patch as per discussion in earlier mails 1. expand_single_inheritance_child() gets the top parent's PlanRowMark from which it builds the child's PlanRowMark and also update allMarkTypes of the top parent's PlanRowMark. The chlid's PlanRowMark contains the RTI of the top parent, which is pulled from the top parent's PlanRowMark. This is to keep the old behaviour intact. 2. Updated expand_single_inheritance_child's prologue to explain various output arguments, per suggestion from Amit Langote. Also included comments about the way we construct child PlanRowMark. Please see if the comments look good. 3. As suggested by Amit Langote, with multi-level partitioned table expansion, intermediate partitioned tables won't have pcinfo associated them. So, that patch removes the assertion Assert(list_length(partitioned_rels) >= 1) in add_paths_to_append_rels(). I didn't remove that assertion from your patch so that you could cherry-pick that commit to v10 where that assertion holds true. 4. Fixed inheritance_planner() to use top parent's RTE to pull partitioned_rels per discussion with Amit few mails back [1]. Please let me know if I have missed anything; it's been some long discussion. Apart from this I have included fix to reparameterize parallel nested loop paths as per discussion in [2]. Please note that I have removed the advanced partitioning patches from the attached patchset since those need a rebase because of default partition support. [1] https://www.postgresql.org/message-id/CAFjFpRe62H0rTb4Rb7wOVSR25xfNW+mt1Ncp-OtzGaEtZBTLwA@mail.gmail.com [2] https://www.postgresql.org/message-id/CAJ3gD9ctVgv6r0-7B6js7Z5uPHXx+KA5jK-3=uFsGwKOXfTddg@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Wed, Sep 13, 2017 at 10:57 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I very much like pcinfo-for-subquery.patch, although I'm not sure if we > need to create PartitionedChildRelInfo for the sub-query parent RTE as the > patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL > subqueries are flattened way before we get to add_paths_to_append_rel(); > if it could not be flattened, there wouldn't be a call to > add_paths_to_append_rel() in the first place, because no AppendRelInfos > would be generated. See what happens when is_simple_union_all_recurse() > returns false to flatten_simple_union_all() -- no AppendRelInfos will be > generated and added to root->append_rel_list in that case. > > IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries > like we're setting out to build for multi-level partitioned tables. > > So, as things stand today, there can at most be one recursive call of > add_path_to_append_rel() for a sub-query parent RTE, that is, if its child > sub-queries contain partitioned tables, but not more. The other patch > (multi-level expansion of partitioned tables) will change that, but even > then we won't need sub-query's own PartitioendChildRelInfo. OK, let's assume you're correct unless some contrary evidence emerges. Committed without that part; thanks for the review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I have few changes to multi-level expansion patch as per discussion in > earlier mails OK, I have committed 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic changes. Phew, getting that sorted out has been an astonishing amount of work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/09/15 4:43, Robert Haas wrote: > On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I have few changes to multi-level expansion patch as per discussion in >> earlier mails > > OK, I have committed > 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic > changes. > > Phew, getting that sorted out has been an astonishing amount of work. Yeah, thanks to both of you. Now on to other complicated stuff. :) Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rafia Sabih
Date:
On Thu, Sep 14, 2017 at 8:27 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/14 7:43, Robert Haas wrote: >> On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> I debugged what happens in case of query "select 1 from t1 union all >>> select 2 from t1;" with the current HEAD (without multi-level >>> expansion patch attached). It doesn't set partitioned_rels in Append >>> path that gets converted into Append plan. Remember t1 is a >>> multi-level partitioned table here with t1p1 as its immediate >>> partition and t1p1p1 as partition of t1p1. So, the >>> set_append_rel_pathlist() recurses once as shown in the following >>> stack trace. >> >> Nice debugging. > > +1. > >> I spent some time today looking at this and I think >> it's a bug in v10, and specifically in add_paths_to_append_rel(), >> which only sets partitioned_rels correctly when the appendrel is a >> partitioned rel, and not when it's a subquery RTE with one or more >> partitioned queries beneath it. >> >> Attached are two patches either one of which will fix it. First, I >> wrote mechanical-partrels-fix.patch, which just mechanically >> propagates partitioned_rels lists from accumulated subpaths into the >> list used to construct the parent (Merge)AppendPath. I wasn't entire >> happy with that, because it ends up building multiple partitioned_rels >> lists for the same RelOptInfo. That seems silly, but there's no >> principled way to avoid it; avoiding it amounts to hoping that all the >> paths for the same relation carry the same partitioned_rels list, >> which is uncomfortable. >> >> So then I wrote pcinfo-for-subquery.patch. That patch notices when an >> RTE_SUBQUERY appendrel is processed and accumulates the >> partitioned_rels of its immediate children; in case there can be >> multiple nested levels of subqueries before we get down to the actual >> partitioned rel, it also adds a PartitionedChildRelInfo for the >> subquery RTE, so that there's no need to walk the whole tree to build >> the partitioned_rels list at higher levels, just the immediate >> children. I find this fix a lot more satisfying. It adds less code >> and does no extra work in the common case. > > I very much like pcinfo-for-subquery.patch, although I'm not sure if we > need to create PartitionedChildRelInfo for the sub-query parent RTE as the > patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL > subqueries are flattened way before we get to add_paths_to_append_rel(); > if it could not be flattened, there wouldn't be a call to > add_paths_to_append_rel() in the first place, because no AppendRelInfos > would be generated. See what happens when is_simple_union_all_recurse() > returns false to flatten_simple_union_all() -- no AppendRelInfos will be > generated and added to root->append_rel_list in that case. > > IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries > like we're setting out to build for multi-level partitioned tables. > > So, as things stand today, there can at most be one recursive call of > add_path_to_append_rel() for a sub-query parent RTE, that is, if its child > sub-queries contain partitioned tables, but not more. The other patch > (multi-level expansion of partitioned tables) will change that, but even > then we won't need sub-query's own PartitioendChildRelInfo. > >> Notice that the choice of fix we adopt has consequences for your >> 0001-Multi-level-partitioned-table-expansion.patch -- with >> mechanical-partrels-fix.patch, that patch could either associated all >> partitioned_rels with the top-parent or it could work level by level >> and everything would get properly assembled later. But with >> pcinfo-for-subquery.patch, we need everything associated with the >> top-parent. That doesn't seem like a problem to me, but it's >> something to note. > > I think it's fine. > > With 0001-Multi-level-partitioned-table-expansion.patch, > get_partitioned_child_rels() will get called even for non-root partitioned > tables, for which it won't find a valid pcinfo. I think that patch must > also change its callers to stop Asserting that a valid pcinfo is returned. > > Spotted a typo in pcinfo-for-subquery.patch: > > + * A plain relation will alread have > > Thanks, > Amit > On TPC-H benchmarking of this patch, I found a regression in Q7. It was taking some 1500s with the patch and some 900s without the patch. Please find the attached pwd_reg.zip for the output of explain analyse on head and with patch. The experimental settings used were, commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629 patch-version = v26 Server settings: work_mem = 1GB shared_buffers = 10GB effective_cache_size = 10GB max_parallel_workers_per_gather = 4 Partitioning information: Partitioning scheme = by range Number of partitions in lineitem and orders table = 106 partition key for lineitem = l_orderkey partition key for orders = o_orderkey Apart from these there is a regression case on a custom table, on head query completes in 20s and with this patch it takes 27s. Please find the attached .out and .sql file for the output and schema for the test case respectively. I have reported this case before (sometime around March this year) as well, but I am not sure if it was overlooked or is an unimportant and expected behaviour for some reason. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Sep 15, 2017 at 1:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I have few changes to multi-level expansion patch as per discussion in >> earlier mails > > OK, I have committed > 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic > changes. > > Phew, getting that sorted out has been an astonishing amount of work. Thanks a lot Robert. Here are rebased patches. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > On TPC-H benchmarking of this patch, I found a regression in Q7. It > was taking some 1500s with the patch and some 900s without the patch. > Please find the attached pwd_reg.zip for the output of explain analyse > on head and with patch. > > The experimental settings used were, > commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629 > patch-version = v26 > > Server settings: > work_mem = 1GB > shared_buffers = 10GB > effective_cache_size = 10GB > max_parallel_workers_per_gather = 4 > > Partitioning information: > Partitioning scheme = by range > Number of partitions in lineitem and orders table = 106 > partition key for lineitem = l_orderkey > partition key for orders = o_orderkey I observe that with partition-wise join patch the planner is using GatherMerge along-with partition-wise join and on head its not using GatherMerge. Just to make sure that its partition-wise join which is causing regression and not GatherMerge, can you please run the query with enable_gathermerge = false? I see following lines explain analyze output 7_1.out without the patch -> Sort (cost=84634030.40..84638520.55 rows=1796063 width=72) (actual time=1061001.435..1061106.608 rows=437209 loops=1) Sort Key: n1.n_name, n2.n_name, (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without time zone)) Sort Method: quicksort Memory: 308912kB -> Hash Join (cost=16080591.94..84447451.72 rows=1796063 width=72) (actual time=252745.701..1057447.219 rows=1749956 loops=1) Since Sort doesn't filter any rows, we would expect it to output the same number of rows as hash join underneath it. But the number of rows differ in this case. I am wondering whether there's some problem with the explain analyze output itself. > > Apart from these there is a regression case on a custom table, on head > query completes in 20s and with this patch it takes 27s. Please find > the attached .out and .sql file for the output and schema for the test > case respectively. I have reported this case before (sometime around > March this year) as well, but I am not sure if it was overlooked or is > an unimportant and expected behaviour for some reason. > Are you talking about [1]? I have explained about the regression in [2] and [3]. This looks like an issue with the existing costing model. [1] https://www.postgresql.org/message-id/CAOGQiiMwcjNrunJ_fCDBscrTLeJ-CLp7exfzzipe2ut71n4LUA@mail.gmail.com [2] https://www.postgresql.org/message-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eYfb7ieG5d0D1PbsQ@mail.gmail.com [3] https://www.postgresql.org/message-id/CAFjFpReJKSdCfaeuZjGD79hOETzpz5BKDxLJgxr7qznrXX+TRw@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Thanks a lot Robert. > > Here are rebased patches. I didn't get quite as much time to look at these today as I would have liked, but here's what I've got so far. Comments on 0001: - In the RelOptInfo, part_oids is defined in a completely different part of the structure than nparts, but you can't use it without nparts because you don't know how long it is. I suggest moving the definition to just after nparts. - On the other hand, maybe we should just remove it completely. I don't see it in any of the subsequent patches. If it's used by the advanced matching code, let's leave it out of 0001 for now and add it back after the basic feature is committed. - Similarly, partsupfunc isn't used by the later patches either. It seems it could also be removed, at least for now. - The comment for partexprs isn't very clear about how the lists inside the array work. My understanding is that the lists have as many members as the partition key has columns/expressions. - I'm not entirely sure whether maintaining partexprs and nullable_partexprs is the right design. If I understand correctly, whether or not a partexpr is nullable is really a per-RTI property, not a per-expression property. You could consider something like "Relids nullable_rels". Comments on 0002: - The relationship between deciding to set partition scheme and related details and the configured value of enable_partition_wise_join needs some consideration. If the only use of the partition scheme is partition-wise join, there's no point in setting it even for a baserel unless enable_partition_wise_join is set -- but if there are other important uses for that data, such as Amit's partition pruning work, then we might want to always set it. And similarly for a join: if the details are only needed in the partition-wise join case, then we only need to set them in that case, but if there are other uses, then it's different. If it turns out that setting these details for a baserel is useful in other cases but that it's only for a joinrel in the partition-wise join case, then the patch gets it exactly right. But is that correct? I'm not sure. - The naming of enable_partition_wise_join might also need some thought. What happens when we also have partition-wise aggregate? What about the proposal to strength-reduce MergeAppend to Append -- would that use this infrastructure? I wonder if we out to call this enable_partition_wise or enable_partition_wise_planning to make it a bit more general. Then, too, I've never really liked having partition_wise in the GUC name because it might make someone think that it makes you partitions have a lot of wisdom. Removing the underscore might help: partitionwise. Or maybe there is some whole different name that would be better. If anyone wants to bikeshed, now's the time. - It seems to me that build_joinrel_partition_info() could be simplified a bit. One thing is that list_copy() is perfectly capable of handling a NIL input, so there's no need to test for that before calling it. Comments on 0003: - Instead of reorganizing add_paths_to_append_rel as you did, could you just add an RTE_JOIN case to the switch? Not sure if there's some problem with that idea, but it seems like it might come out nicer. On the overall patch set: - I am curious to know how this has been tested. How much of the new code is covered by the tests in 0007-Partition-wise-join-tests.patch? How much does coverage improve with 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What code, if any, is not covered by either of those test suites? Could we do meaningful testing of this with something like Andreas Seltenreich's sqlsmith? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Thomas Munro
Date:
On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On the overall patch set: > > - I am curious to know how this has been tested. How much of the new > code is covered by the tests in 0007-Partition-wise-join-tests.patch? > How much does coverage improve with > 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What > code, if any, is not covered by either of those test suites? Could we > do meaningful testing of this with something like Andreas > Seltenreich's sqlsmith? FWIW I'm working on an answer to both of those question, but keep getting distracted by other things catching on fire... -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Thomas Munro
Date:
On Sat, Sep 16, 2017 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On the overall patch set: >> >> - I am curious to know how this has been tested. How much of the new >> code is covered by the tests in 0007-Partition-wise-join-tests.patch? >> How much does coverage improve with >> 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What >> code, if any, is not covered by either of those test suites? Could we >> do meaningful testing of this with something like Andreas >> Seltenreich's sqlsmith? > > FWIW I'm working on an answer to both of those question, but keep > getting distracted by other things catching on fire... I cobbled together some scripts to figure out the test coverage of lines actually modified by this patch set. Please see attached. I'm not sure if there is an established or better way to do this, but I used git-blame to figure out which lines of gcov output can be blamed on Ashutosh and prepended that to the lines of gcov's output. That allowed me to find new/changed code not covered by "make check". I found 94 untested new lines with 0007 applied and 88 untested new lines with 0008 applied. The 6 lines that 0008 reaches and 0007 doesn't are: ======== src/backend/optimizer/path/allpaths.c ======== -[TOUCHED BY PATCH SET] #####: 3303: mark_dummy_rel(rel); -[TOUCHED BY PATCH SET] #####: 3304: return; -[TOUCHED BY PATCH SET] #####: 1515: continue; -[TOUCHED BY PATCH SET] #####: 1526: continue; ======== src/backend/optimizer/util/pathnode.c ======== -[TOUCHED BY PATCH SET] #####: 3433: break; -[TOUCHED BY PATCH SET] #####: 3435: return NULL; -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rafia Sabih
Date:
On Fri, Sep 15, 2017 at 5:29 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih > <rafia.sabih@enterprisedb.com> wrote: >> On TPC-H benchmarking of this patch, I found a regression in Q7. It >> was taking some 1500s with the patch and some 900s without the patch. >> Please find the attached pwd_reg.zip for the output of explain analyse >> on head and with patch. >> >> The experimental settings used were, >> commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629 >> patch-version = v26 >> >> Server settings: >> work_mem = 1GB >> shared_buffers = 10GB >> effective_cache_size = 10GB >> max_parallel_workers_per_gather = 4 >> >> Partitioning information: >> Partitioning scheme = by range >> Number of partitions in lineitem and orders table = 106 >> partition key for lineitem = l_orderkey >> partition key for orders = o_orderkey > > I observe that with partition-wise join patch the planner is using > GatherMerge along-with partition-wise join and on head its not using > GatherMerge. Just to make sure that its partition-wise join which is > causing regression and not GatherMerge, can you please run the query > with enable_gathermerge = false? > That does not sound plausible since around 130s are already spent till append node. Anyhow, I executed the query with enable_gathermerge = false, and still it is taking some 1500 secs. Please find the attached file for the explain analyse output. > I see following lines explain analyze output 7_1.out without the patch > -> Sort (cost=84634030.40..84638520.55 rows=1796063 > width=72) (actual time=1061001.435..1061106.608 rows=437209 loops=1) > Sort Key: n1.n_name, n2.n_name, > (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without > time zone)) > Sort Method: quicksort Memory: 308912kB > -> Hash Join (cost=16080591.94..84447451.72 > rows=1796063 width=72) (actual time=252745.701..1057447.219 > rows=1749956 loops=1) > Since Sort doesn't filter any rows, we would expect it to output the > same number of rows as hash join underneath it. But the number of rows > differ in this case. I am wondering whether there's some problem with > the explain analyze output itself. > Limit (cost=83341943.28..83341943.35 rows=1 width=92) (actual time=1556989.996..1556989.997 rows=1 loops=1) -> Finalize GroupAggregate (cost=83341943.28..83342723.24 rows=10064 width=92) (actual time=1556989.994..1556989.994 rows=1 loops=1) Group Key: n1.n_name, n2.n_name, (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without time zone)) -> Sort (cost=83341943.28..83342043.92 rows=40256 width=92) (actual time=1556989.910..1556989.911 rows=6 loops=1) Sort Key: n1.n_name, n2.n_name, (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without time zone)) Sort Method: quicksort Memory: 27kB -> Gather (cost=83326804.81..83338864.31 rows=40256 width=92) (actual time=1550598.855..1556989.760 rows=20 loops=1) Workers Planned: 4 Workers Launched: 4 AFAICU the node above sort is group-aggregate and then there is limit, and the number of rows for sort node in explain analyse is returned number of rows. So, what is happening here is once one group is completed it is aggregated and fetched by limit, now there is no need for sort to return any more rows and hence the result. >> >> Apart from these there is a regression case on a custom table, on head >> query completes in 20s and with this patch it takes 27s. Please find >> the attached .out and .sql file for the output and schema for the test >> case respectively. I have reported this case before (sometime around >> March this year) as well, but I am not sure if it was overlooked or is >> an unimportant and expected behaviour for some reason. >> > > Are you talking about [1]? I have explained about the regression in > [2] and [3]. This looks like an issue with the existing costing model. > > [1] https://www.postgresql.org/message-id/CAOGQiiMwcjNrunJ_fCDBscrTLeJ-CLp7exfzzipe2ut71n4LUA@mail.gmail.com > [2] https://www.postgresql.org/message-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eYfb7ieG5d0D1PbsQ@mail.gmail.com > [3] https://www.postgresql.org/message-id/CAFjFpReJKSdCfaeuZjGD79hOETzpz5BKDxLJgxr7qznrXX+TRw@mail.gmail.com > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Sat, Sep 16, 2017 at 2:53 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Thanks a lot Robert. >> >> Here are rebased patches. > > I didn't get quite as much time to look at these today as I would have > liked, but here's what I've got so far. > > Comments on 0001: > > - In the RelOptInfo, part_oids is defined in a completely different > part of the structure than nparts, but you can't use it without nparts > because you don't know how long it is. I suggest moving the > definition to just after nparts. > > - On the other hand, maybe we should just remove it completely. I > don't see it in any of the subsequent patches. If it's used by the > advanced matching code, let's leave it out of 0001 for now and add it > back after the basic feature is committed. No, it's not used by advanced partition matching code. It was used by to match OIDs with the child rels to order those in the array. But now that we are expanding in EIBO fashion, it is not useful. Should have removed it earlier. Removed now. > > - Similarly, partsupfunc isn't used by the later patches either. It > seems it could also be removed, at least for now. It's used by advanced partition matching code to compare bounds. It will be required by partition pruning patch. But removed for now. > > - The comment for partexprs isn't very clear about how the lists > inside the array work. My understanding is that the lists have as > many members as the partition key has columns/expressions. Actually we are doing some preparation for partition-wise join here. partexprs and nullable_partexprs are used in partition-wise join implementation patch. I have updated prologue of RelOptInfo structure with the comments like below * Note: A base relation will always have only one set of partition keys. But a* join relation has as many sets of partitionkeys as the number of joining* relations. The number of partition keys is given by* "part_scheme->partnatts". "partexprs"and "nullable_partexprs" are arrays* containing part_scheme->partnatts elements. Each element of the array* containsa list of partition key expressions. For a base relation each list* contains only one expression. For a join relationeach list contains at* most as many expressions as the joining relations. The expressions in a list* at a given positionin the array correspond to the partition key at that* position. "partexprs" contains partition keys of non-nullablejoining* relations and "nullable_partexprs" contains partition keys of nullable* joining relations. For a baserelation only "partexprs" is populated. Let me know this looks fine. The logic to match the partition keys of joining relations in have_partkey_equi_join() and match_expr_to_partition_keys() becomes simpler if we arrange the partition key expressions as array indexed by position of partition key and each array element as list of partition key expressions at that position. partition pruning might need partexprs look up relevant quals, but nullable_partexprs doesn't have any use there. So may be we should add nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join implementation) instead of 0001. What do you think? > > - I'm not entirely sure whether maintaining partexprs and > nullable_partexprs is the right design. If I understand correctly, > whether or not a partexpr is nullable is really a per-RTI property, > not a per-expression property. You could consider something like > "Relids nullable_rels". That's true. However in order to decide whether an expression falls on nullable side of a join, we will need to call pull_varnos() on it and check the output against nullable_rels. Separating the expressions themselves avoids that step. > > Comments on 0002: > > - The relationship between deciding to set partition scheme and > related details and the configured value of enable_partition_wise_join > needs some consideration. If the only use of the partition scheme is > partition-wise join, there's no point in setting it even for a baserel > unless enable_partition_wise_join is set -- but if there are other > important uses for that data, such as Amit's partition pruning work, > then we might want to always set it. And similarly for a join: if the > details are only needed in the partition-wise join case, then we only > need to set them in that case, but if there are other uses, then it's > different. If it turns out that setting these details for a baserel > is useful in other cases but that it's only for a joinrel in the > partition-wise join case, then the patch gets it exactly right. But > is that correct? I'm not sure. Partition scheme contains the information about data types of partition keys, which is required to compare partition bounds. Partition pruning will need to compare constants with partition bounds and hence will need information contained in partition scheme. So, we will need to set it for base relations whether or not partition-wise join is enabled. > > - The naming of enable_partition_wise_join might also need some > thought. What happens when we also have partition-wise aggregate? > What about the proposal to strength-reduce MergeAppend to Append -- > would that use this infrastructure? I wonder if we out to call this > enable_partition_wise or enable_partition_wise_planning to make it a > bit more general. Then, too, I've never really liked having > partition_wise in the GUC name because it might make someone think > that it makes you partitions have a lot of wisdom. Removing the > underscore might help: partitionwise. Or maybe there is some whole > different name that would be better. If anyone wants to bikeshed, > now's the time. partitions having a lot of wisdom would be wise_partitions rather than partition_wise ;). If partition-wise join is disabled, partition-wise aggregates, strength reduction of MergeAppend won't be possible on a join tree, but those will be possible on a base relation. Even if partition-wise join enabled, one may want to disable other partition-wise optimizations individually. So, they are somewhat independent switches. I don't think we should bundle all of those into one. Whatever names we choose for those GUCs, I think they should have same naming convention e.g. "partition_wise_xyz". I am open to suggestions about the names. > > - It seems to me that build_joinrel_partition_info() could be > simplified a bit. One thing is that list_copy() is perfectly capable > of handling a NIL input, so there's no need to test for that before > calling it. partexprs may be NULL for FULL JOIN and nullable_partexprs may be NULL when there is no nullable relation. So, we have to check existence of those arrays before accessing lists containing partition key expressions. list_copy() is being called on individual array elements and "if" conditions check for the existence of array. The functions might have become complicated because I am using outer/inner_partexprs to hold one of the lists and partexprs contains the array of lists. We may use better named, but I don't have any better ideas right now. Will think about them. We could simplify that function according to your suggestion of nullable_relids. Basically partexprs then contains partition key expressions all relations nullable and non-nullable. nullable_relids + pull_varnos() tells us which of those fall on nullable side and which ones don't. Is this how you are thinking of simplifying it? If we go with this scheme, again nullable_relids will not be useful for partition pruning, so may be we should add it as part of 0002 (partition-wise join implementation) instead of 0001. > > Comments on 0003: > > - Instead of reorganizing add_paths_to_append_rel as you did, could > you just add an RTE_JOIN case to the switch? Not sure if there's some > problem with that idea, but it seems like it might come out nicer. RTE_JOIN is created only for joins specified using JOIN clause i.e syntactic joins. The joins created during query planner like rel1, rel2, rel3 do not have RTE_JOIN. So, we can't use RTE_JOIN there. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > partition pruning might need partexprs look up relevant quals, but > nullable_partexprs doesn't have any use there. So may be we should add > nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join > implementation) instead of 0001. What do you think? +1. >> - I'm not entirely sure whether maintaining partexprs and >> nullable_partexprs is the right design. If I understand correctly, >> whether or not a partexpr is nullable is really a per-RTI property, >> not a per-expression property. You could consider something like >> "Relids nullable_rels". > > That's true. However in order to decide whether an expression falls on > nullable side of a join, we will need to call pull_varnos() on it and > check the output against nullable_rels. Separating the expressions > themselves avoids that step. Good point. Also, I'm not sure about cases like this: SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; Suppose the relations are all partitioned by (x, y) but that the = operator is not strict. A partition-wise join is valid between a and b, but we can't regard w as partitioned any more, because w.x might contain nulls in partitions where the partitioning scheme wouldn't allow them. On the other hand, if the subquery were to select a.x, a.y then clearly it would be fine: there would be no possibility of a NULL having been substituted for a proper value. What if the subquery selected a.x, b.y? Initially, I thought that would be OK too, because of the fact that the a.y = b.y clause is in the WHERE clause rather than the join condition. But on further thought I think that probably doesn't work, because with = being a non-strict operator there's no guarantee that it would remove any nulls introduced by the left join. Of course, if the subselect had a WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT list mention those columns would be fine. >> - The naming of enable_partition_wise_join might also need some >> thought. What happens when we also have partition-wise aggregate? >> What about the proposal to strength-reduce MergeAppend to Append -- >> would that use this infrastructure? I wonder if we out to call this >> enable_partition_wise or enable_partition_wise_planning to make it a >> bit more general. Then, too, I've never really liked having >> partition_wise in the GUC name because it might make someone think >> that it makes you partitions have a lot of wisdom. Removing the >> underscore might help: partitionwise. Or maybe there is some whole >> different name that would be better. If anyone wants to bikeshed, >> now's the time. > > partitions having a lot of wisdom would be wise_partitions rather than > partition_wise ;). Well, maybe it's the joins that have a lot of wisdom, then. enable_partition_wise_join could be read to mean that we should allow partitioning of joins, but only if those joins know the secret of true happiness. > If partition-wise join is disabled, partition-wise aggregates, > strength reduction of MergeAppend won't be possible on a join tree, > but those will be possible on a base relation. Even if partition-wise > join enabled, one may want to disable other partition-wise > optimizations individually. So, they are somewhat independent > switches. I don't think we should bundle all of those into one. > Whatever names we choose for those GUCs, I think they should have same > naming convention e.g. "partition_wise_xyz". I am open to suggestions > about the names. I think the chances of you getting multiple GUCs for different partition-wise optimizations past Tom are pretty low. >> - Instead of reorganizing add_paths_to_append_rel as you did, could >> you just add an RTE_JOIN case to the switch? Not sure if there's some >> problem with that idea, but it seems like it might come out nicer. > > RTE_JOIN is created only for joins specified using JOIN clause i.e > syntactic joins. The joins created during query planner like rel1, > rel2, rel3 do not have RTE_JOIN. So, we can't use RTE_JOIN there. OK, never mind that then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rafia Sabih
Date:
On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: >> > On TPC-H benchmarking of this patch, I found a regression in Q7. It > was taking some 1500s with the patch and some 900s without the patch. > Please find the attached pwd_reg.zip for the output of explain analyse > on head and with patch. > > The experimental settings used were, > commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629 > patch-version = v26 > > Server settings: > work_mem = 1GB > shared_buffers = 10GB > effective_cache_size = 10GB > max_parallel_workers_per_gather = 4 > > Partitioning information: > Partitioning scheme = by range > Number of partitions in lineitem and orders table = 106 > partition key for lineitem = l_orderkey > partition key for orders = o_orderkey > > Apart from these there is a regression case on a custom table, on head > query completes in 20s and with this patch it takes 27s. Please find > the attached .out and .sql file for the output and schema for the test > case respectively. I have reported this case before (sometime around > March this year) as well, but I am not sure if it was overlooked or is > an unimportant and expected behaviour for some reason. > On completing the benchmark for all queries for the above mentioned setup, following performance improvement can be seen, Query | Patch | Head 3 | 1455 | 1631 4 | 499 | 4344 5 | 1464 | 1606 10 | 1475 | 1599 12 | 1465 | 1790 Note that all values of execution time are in seconds. To summarise, apart from Q4, all other queries are showing somewhat 10-20% improvement. Though it is good but honestly, I expected more from this feature atleast on this scale factor. I am yet to compare these performances with the unpartitioned version of the database. Please find attached file for the output of explain analyse for all the queries on head and with patch. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 19, 2017 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> partition pruning might need partexprs look up relevant quals, but >> nullable_partexprs doesn't have any use there. So may be we should add >> nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join >> implementation) instead of 0001. What do you think? > > +1. Done. > >>> - I'm not entirely sure whether maintaining partexprs and >>> nullable_partexprs is the right design. If I understand correctly, >>> whether or not a partexpr is nullable is really a per-RTI property, >>> not a per-expression property. You could consider something like >>> "Relids nullable_rels". >> >> That's true. However in order to decide whether an expression falls on >> nullable side of a join, we will need to call pull_varnos() on it and >> check the output against nullable_rels. Separating the expressions >> themselves avoids that step. > > Good point. Also, I'm not sure about cases like this: > > SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE > a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; > > Suppose the relations are all partitioned by (x, y) but that the = > operator is not strict. A partition-wise join is valid between a and > b, but we can't regard w as partitioned any more, because w.x might > contain nulls in partitions where the partitioning scheme wouldn't > allow them. On the other hand, if the subquery were to select a.x, > a.y then clearly it would be fine: there would be no possibility of a > NULL having been substituted for a proper value. > > What if the subquery selected a.x, b.y? Initially, I thought that > would be OK too, because of the fact that the a.y = b.y clause is in > the WHERE clause rather than the join condition. But on further > thought I think that probably doesn't work, because with = being a > non-strict operator there's no guarantee that it would remove any > nulls introduced by the left join. Of course, if the subselect had a > WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT > list mention those columns would be fine. > I am actually not sure whether we can use partition-wise join for a LEFT JOIN b when the partition key equalities are spread across ON and WHERE clauses. I am not able to find any example against it, but I am not able to prove it as well. The reference I used for partition-wise join [1], mentions JOIN conditions i.e. ON clause conditions. But all the examples used in that paper are that of INNER join. So, I am not sure what exactly the authors meant by JOIN conditions. Right now I am restricting the patch to work with only conditions in the ON clause. Practically most of the operators are strict. OUTER join's WHERE clause has any partition key equality with strict operator, optimizer will turn that OUTER join into an INNER one, turning all clauses into join clauses. That will enable partition-wise join. So, the current restriction doesn't restrict any practical cases. OTOH, I have seen that treating ON and WHERE clauses as same for an OUTER join leads to surprising results. So, I am leaning to treat them separate for partition-wise join as well and only use ON clause conditions for partition-wise join. If we get complaints about partition-wise join not being picked we will fix them after proving that it's not harmful. Lifting that restriction is not so difficult. have_partition_key_equijoin() ignores "pushed down" quals. We have to just change that condition. Your last sentence about a clause b.x IS NOT NULL or b.y IS NOT NULL is interesting. If those conditions are in ON clause, we may still have a result where b.x and b.y as NULL when no row in "a" matches a row in "b". If those conditions are in WHERE clause, I think optimizer will turn the join into an INNER join irrespective of whether the equality operator is strict. > >> If partition-wise join is disabled, partition-wise aggregates, >> strength reduction of MergeAppend won't be possible on a join tree, >> but those will be possible on a base relation. Even if partition-wise >> join enabled, one may want to disable other partition-wise >> optimizations individually. So, they are somewhat independent >> switches. I don't think we should bundle all of those into one. >> Whatever names we choose for those GUCs, I think they should have same >> naming convention e.g. "partition_wise_xyz". I am open to suggestions >> about the names. > > I think the chances of you getting multiple GUCs for different > partition-wise optimizations past Tom are pretty low. We do have enable_hashjoin and enable_hashagg to control use of hashing for aggregate and join. On similar lines we can have three GUCs to enable use of partition-wise strategy, one for each of join, aggregation and sorting. Having granular switches would be useful for debugging and may be to turn partition-wise strategies off when they are not optimal. Do we want a switch to turn ON/OFF partition pruning? Said, that I am fine with single GUC controlling all. We won't set any partitioning information in RelOptInfo if that GUC is turned OFF. [1] https://pdfs.semanticscholar.org/27c2/ba75f8b6a39d4bce85d5579dace609c9abaa.pdf -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Alvaro Herrera
Date:
Rafia Sabih wrote: > On completing the benchmark for all queries for the above mentioned > setup, following performance improvement can be seen, > Query | Patch | Head > 3 | 1455 | 1631 > 4 | 499 | 4344 > 5 | 1464 | 1606 > 10 | 1475 | 1599 > 12 | 1465 | 1790 > > Note that all values of execution time are in seconds. > To summarise, apart from Q4, all other queries are showing somewhat > 10-20% improvement. Saving 90% of time on the slowest query looks like a worthy improvement on its own right. However, you're reporting execution time only, right? What happens to planning time? In a quick look, $ grep 'Planning time' pg_part_*/4* pg_part_head/4_1.out: Planning time: 3390.699 ms pg_part_head/4_2.out: Planning time: 194.211 ms pg_part_head/4_3.out: Planning time: 210.964 ms pg_part_head/4_4.out: Planning time: 4150.647 ms pg_part_patch/4_1.out: Planning time: 7577.247 ms pg_part_patch/4_2.out: Planning time: 312.421 ms pg_part_patch/4_3.out: Planning time: 304.697 ms pg_part_patch/4_4.out: Planning time: 269.778 ms I think the noise in these few results is too large to draw any conclusions. Maybe a few dozen runs of EXPLAIN (w/o ANALYZE) would tell something significant? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Thomas Munro
Date:
On Sat, Sep 16, 2017 at 2:41 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sat, Sep 16, 2017 at 9:38 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On the overall patch set: >>> >>> - I am curious to know how this has been tested. How much of the new >>> code is covered by the tests in 0007-Partition-wise-join-tests.patch? >>> How much does coverage improve with >>> 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What >>> code, if any, is not covered by either of those test suites? Could we >>> do meaningful testing of this with something like Andreas >>> Seltenreich's sqlsmith? >> >> FWIW I'm working on an answer to both of those question, but keep >> getting distracted by other things catching on fire... > > I cobbled together some scripts to figure out the test coverage of > lines actually modified by this patch set. Please see attached. > > I'm not sure if there is an established or better way to do this, but > I used git-blame to figure out which lines of gcov output can be > blamed on Ashutosh and prepended that to the lines of gcov's output. > That allowed me to find new/changed code not covered by "make check". > I found 94 untested new lines with 0007 applied and 88 untested new > lines with 0008 applied. The 6 lines that 0008 reaches and 0007 > doesn't are: > > ======== src/backend/optimizer/path/allpaths.c ======== > -[TOUCHED BY PATCH SET] #####: 3303: mark_dummy_rel(rel); > -[TOUCHED BY PATCH SET] #####: 3304: return; > -[TOUCHED BY PATCH SET] #####: 1515: continue; > -[TOUCHED BY PATCH SET] #####: 1526: continue; > ======== src/backend/optimizer/util/pathnode.c ======== > -[TOUCHED BY PATCH SET] #####: 3433: break; > -[TOUCHED BY PATCH SET] #####: 3435: return NULL; Two obvious questions: 1. What are we missing in the ~90 lines of non-covered code, and are there bugs lurking there? First, here's an easier to read report than the one I posted earlier. It's based on the whole patch stack (including the extra tests) from your v33 tarball: https://codecov.io/gh/postgresql-cfbot/postgresql/commit/19dace6fca0d9c2bca5022158cf28d99aa237550 The main areas of uncovered lines are: code in get_wholerow_ref_from_convert_row_type() and code that calls it, and per node type cases in reparameterize_path_by_child(). It seems like the former could use a test case, and I wonder if there is some way we could write "flat-copy and then apply recursively to all subpaths" code like this without having to handle these cases explicitly. There are a couple of other tiny return cases other than just sanity check errors which it might be interesting to hit too. 2. What queries in the 0008 patch are hitting lines that 0007 doesn't hit? I thought about how to answer questions like this and came up with a shell script that (1) makes computers run really hot for quite a long time and (2) tells you which blocks of SQL hit which lines of C. Please find attached the shell script and its output. The .sql files have been annotated with "block" numbers (blocks being chunks of SQL stuff separated by blank lines), and the C files annotated with references to those block numbers where A<n> = block <n> partition_join.sql and B<n> = block <n> in partition_join_extras.sql. Then to find lines that B queries hit but A queries don't and know which particular queries hit them, you might use something like: grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \ grep 'SQL blocks: .*B[0-9]' (Off topic but by way of explanation: the attachment name ending .tarball.gz avoids .tgz or .tar.gz so my stupid cfbot doesn't think it's a new patch set. I need to figure something better out for that...) -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rafia Sabih
Date:
On Tue, Sep 19, 2017 at 3:50 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Rafia Sabih wrote: > >> On completing the benchmark for all queries for the above mentioned >> setup, following performance improvement can be seen, >> Query | Patch | Head >> 3 | 1455 | 1631 >> 4 | 499 | 4344 >> 5 | 1464 | 1606 >> 10 | 1475 | 1599 >> 12 | 1465 | 1790 >> >> Note that all values of execution time are in seconds. >> To summarise, apart from Q4, all other queries are showing somewhat >> 10-20% improvement. > > Saving 90% of time on the slowest query looks like a worthy improvement > on its own right. However, you're reporting execution time only, right? > What happens to planning time? In a quick look, Definitely. The planning time issue has been discussed upthread, On Mon, Mar 20, 2017 at 12:07 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > Another minor thing to note that is planning time is almost twice with > this patch, though I understand that this is for scenarios with really > big 'big data' so this may not be a serious issue in such cases, but > it'd be good if we can keep an eye on this that it doesn't exceed the > computational bounds for a really large number of tables.. To which Robert replied as, Yes, this is definitely going to use significant additional planning time and memory. There are several possible strategies for improving that situation, but I think we need to get the basics in place first. That's why the proposal is now to have this turned off by default. People joining really big tables that happen to be equipartitioned are likely to want to turn it on, though, even before those optimizations are done. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > The main areas of uncovered lines are: code in > get_wholerow_ref_from_convert_row_type() and code that calls it, and > per node type cases in reparameterize_path_by_child(). It seems like > the former could use a test case, and I wonder if there is some way we > could write "flat-copy and then apply recursively to all subpaths" > code like this without having to handle these cases explicitly. There > are a couple of other tiny return cases other than just sanity check > errors which it might be interesting to hit too. Under the debugger I checked that the test in partition_join.sql -- left outer join, with whole-row reference EXPLAIN (COSTS OFF) SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; covers get_wholerow_ref_from_convert_row_type(). But it doesn't cover a couple of lines in the case of nested ConvertRowTypeExpr in that function. We can add/modify a testcase in multi-level partitioned table section to cover that. reparameterize_path_by_child() coverage is hard. It would require that many different kinds of paths survive in lower joins in the join tree. It's hard to come up with queries that would do that with limited amount of data and a reasonable number of tests. Me and Thomas discussed about his suggestion about "flat-copy and then apply recursively to all subpaths" which he sees as a path tree mutator. It won't improve the test coverage. Like expression_tree_mutator() path mutation is not that widely used phenomenon, so we do not yet know what should be the characteristics of a path mutator could be. In case we see more of path mutation code in future, it's an idea worth considering. > > 2. What queries in the 0008 patch are hitting lines that 0007 doesn't hit? > > I thought about how to answer questions like this and came up with a > shell script that (1) makes computers run really hot for quite a long > time and (2) tells you which blocks of SQL hit which lines of C. > Please find attached the shell script and its output. The .sql files > have been annotated with "block" numbers (blocks being chunks of SQL > stuff separated by blank lines), and the C files annotated with > references to those block numbers where A<n> = block <n> > partition_join.sql and B<n> = block <n> in partition_join_extras.sql. > > Then to find lines that B queries hit but A queries don't and know > which particular queries hit them, you might use something like: > > grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \ > grep 'SQL blocks: .*B[0-9]' > Thanks for this. It generates a lot of output (970 lines over all the coverage files). It will take some time for getting anything meaningful out of this. May be there's some faster way by looking at the lines that are covered by B but not A. BTW, I checked those lines to see if there could be any bug there. But I don't see what could go wrong with those lines. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> >>>> - I'm not entirely sure whether maintaining partexprs and >>>> nullable_partexprs is the right design. If I understand correctly, >>>> whether or not a partexpr is nullable is really a per-RTI property, >>>> not a per-expression property. You could consider something like >>>> "Relids nullable_rels". >>> >>> That's true. However in order to decide whether an expression falls on >>> nullable side of a join, we will need to call pull_varnos() on it and >>> check the output against nullable_rels. Separating the expressions >>> themselves avoids that step. >> >> Good point. Also, I'm not sure about cases like this: >> >> SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE >> a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; >> >> Suppose the relations are all partitioned by (x, y) but that the = >> operator is not strict. A partition-wise join is valid between a and >> b, but we can't regard w as partitioned any more, because w.x might >> contain nulls in partitions where the partitioning scheme wouldn't >> allow them. On the other hand, if the subquery were to select a.x, >> a.y then clearly it would be fine: there would be no possibility of a >> NULL having been substituted for a proper value. >> >> What if the subquery selected a.x, b.y? Initially, I thought that >> would be OK too, because of the fact that the a.y = b.y clause is in >> the WHERE clause rather than the join condition. But on further >> thought I think that probably doesn't work, because with = being a >> non-strict operator there's no guarantee that it would remove any >> nulls introduced by the left join. Of course, if the subselect had a >> WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT >> list mention those columns would be fine. >> > In my previous reply to this, I probably didn't answer your question while I explained the restriction on where equality conditions on partition keys can appear. Here's answer to your questions assuming those restrictions don't exist. Actually in the example you have given, optimizer flattens w as a LJ b which kind of makes the explanations below a bit complicated. 1. SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; partition-wise join will be possible between a and b but not between w and c for the reasons you have explained above. 2. SELECT * FROM (SELECT a.x, a.y FROM a LEFT JOIN b ON a.x = b.x WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; partition-wise join will be possible between a and b and also between w and c for the reasons you have explained above. 3. SELECT * FROM (SELECT a.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y; partition-wise join will be possible between a and b but not w and c as you have explained. In this case b.x and b.y will appear as nullable_partexprs in w (represented as a LJ b in optimizer) and a.x and a.y will appear in partexprs. Depending upon what gets projected out of w, the join between w and c will use corresponding keys for equality conditions. Since the operator is non-strict, any expression which is part of nullable_partexprs will be discarded in match_expr_to_partition_keys(). Hope that helps. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Jeevan Chalke
Date:
On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
I think having a granular control over each of these optimization will be
handy for the DBAs too.
On Tue, Sep 19, 2017 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> partition pruning might need partexprs look up relevant quals, but
>> nullable_partexprs doesn't have any use there. So may be we should add
>> nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
>> implementation) instead of 0001. What do you think?
>
> +1.
Done.I am actually not sure whether we can use partition-wise join for a
>
>>> - I'm not entirely sure whether maintaining partexprs and
>>> nullable_partexprs is the right design. If I understand correctly,
>>> whether or not a partexpr is nullable is really a per-RTI property,
>>> not a per-expression property. You could consider something like
>>> "Relids nullable_rels".
>>
>> That's true. However in order to decide whether an expression falls on
>> nullable side of a join, we will need to call pull_varnos() on it and
>> check the output against nullable_rels. Separating the expressions
>> themselves avoids that step.
>
> Good point. Also, I'm not sure about cases like this:
>
> SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
> a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
>
> Suppose the relations are all partitioned by (x, y) but that the =
> operator is not strict. A partition-wise join is valid between a and
> b, but we can't regard w as partitioned any more, because w.x might
> contain nulls in partitions where the partitioning scheme wouldn't
> allow them. On the other hand, if the subquery were to select a.x,
> a.y then clearly it would be fine: there would be no possibility of a
> NULL having been substituted for a proper value.
>
> What if the subquery selected a.x, b.y? Initially, I thought that
> would be OK too, because of the fact that the a.y = b.y clause is in
> the WHERE clause rather than the join condition. But on further
> thought I think that probably doesn't work, because with = being a
> non-strict operator there's no guarantee that it would remove any
> nulls introduced by the left join. Of course, if the subselect had a
> WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
> list mention those columns would be fine.
>
LEFT JOIN b when the partition key equalities are spread across ON and
WHERE clauses. I am not able to find any example against it, but I am
not able to prove it as well. The reference I used for partition-wise
join [1], mentions JOIN conditions i.e. ON clause conditions. But all
the examples used in that paper are that of INNER join. So, I am not
sure what exactly the authors meant by JOIN conditions. Right now I am
restricting the patch to work with only conditions in the ON clause.
Practically most of the operators are strict. OUTER join's WHERE
clause has any partition key equality with strict operator, optimizer
will turn
that OUTER join into an INNER one, turning all clauses into join
clauses. That will enable partition-wise join. So, the current
restriction doesn't restrict any practical cases.
OTOH, I have seen that treating ON and WHERE clauses as same for an
OUTER join leads to surprising results. So, I am leaning to treat them
separate for partition-wise join as well and only use ON clause
conditions for partition-wise join. If we get complaints about
partition-wise join not being picked we will fix them after proving
that it's not harmful. Lifting that restriction is not so difficult.
have_partition_key_equijoin() ignores "pushed down" quals. We have to
just change that condition.
Your last sentence about a clause b.x IS NOT NULL or b.y IS NOT NULL
is interesting. If those conditions are in ON clause, we may still
have a result where b.x and b.y as NULL when no row in "a" matches a
row in "b". If those conditions are in WHERE clause, I think optimizer
will turn the join into an INNER join irrespective of whether the
equality operator is strict.
>
>> If partition-wise join is disabled, partition-wise aggregates,
>> strength reduction of MergeAppend won't be possible on a join tree,
>> but those will be possible on a base relation. Even if partition-wise
>> join enabled, one may want to disable other partition-wise
>> optimizations individually. So, they are somewhat independent
>> switches. I don't think we should bundle all of those into one.
>> Whatever names we choose for those GUCs, I think they should have same
>> naming convention e.g. "partition_wise_xyz". I am open to suggestions
>> about the names.
>
> I think the chances of you getting multiple GUCs for different
> partition-wise optimizations past Tom are pretty low.
We do have enable_hashjoin and enable_hashagg to control use of
hashing for aggregate and join. On similar lines we can have three
GUCs to enable use of partition-wise strategy, one for each of join,
aggregation and sorting. Having granular switches would be useful for
debugging and may be to turn partition-wise strategies off when they
are not optimal.
I think having a granular control over each of these optimization will be
handy for the DBAs too.
Do we want a switch to turn ON/OFF partition pruning?
Said, that I am fine with single GUC controlling all. We won't set any
partitioning information in RelOptInfo if that GUC is turned OFF.
[1] https://pdfs.semanticscholar.org/27c2/ ba75f8b6a39d4bce85d5579dace609 c9abaa.pdf --
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rajkumar Raghuwanshi
Date:
On Wed, Sep 20, 2017 at 3:13 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> 2. What queries in the 0008 patch are hitting lines that 0007 doesn't hit?
>
> I thought about how to answer questions like this and came up with a
> shell script that (1) makes computers run really hot for quite a long
> time and (2) tells you which blocks of SQL hit which lines of C.
> Please find attached the shell script and its output. The .sql files
> have been annotated with "block" numbers (blocks being chunks of SQL
> stuff separated by blank lines), and the C files annotated with
> references to those block numbers where A<n> = block <n>
> partition_join.sql and B<n> = block <n> in partition_join_extras.sql.
>
> Then to find lines that B queries hit but A queries don't and know
> which particular queries hit them, you might use something like:
>
> grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
> grep 'SQL blocks: .*B[0-9]'
>
Thanks for this. It generates a lot of output (970 lines over all the
coverage files). It will take some time for getting anything
meaningful out of this. May be there's some faster way by looking at
the lines that are covered by B but not A. BTW, I checked those lines
to see if there could be any bug there. But I don't see what could go
wrong with those lines.
I have also tried to find test cases in B which hits some extra line which is not
hitting by A, with the help of results attached by Thomas in coverage.tarball_FILES.
It took lot of time but I am able to find some test cases. which if adding in partition_join.sql
hitting by A, with the help of results attached by Thomas in coverage.tarball_FILES.
It took lot of time but I am able to find some test cases. which if adding in partition_join.sql
increasing no of lines hit by 14. but for hitting these 14 extra line attached patch is doing
900+ line inserts in partition_join.sql and partition_join.out file.
I have used gcov-lcov to find coverage for files changed by partition-wise-join patches
with and without attached patch which is below.
with existing partition_join.sql | partition_join.sql + some test cases of partition_join_extra.sql | |||||||
Modifed Files | Line Coverage | Functions | Line Coverage | Functions | ||||
src/backend/optimizer/geqo | 79.4 % | 269/339 | 96.6 % | 28/29 | 79.4 % | 269/339 | 96.6 % | 28/29 |
src/backend/optimizer/path/allpaths.c | 92.3 % | 787 / 853 | 95.5 % | 42 / 44 | 92.6 % | 790 / 853 | 95.5 % | 42 / 44 |
src/backend/optimizer/path/costsize.c | 96.8 % | 1415 / 1462 | 98.4 % | 61 / 62 | 96.9 % | 1416 / 1462 | 98.4 % | 61 / 62 |
src/backend/optimizer/path/joinpath.c | 95.5 % | 404 / 423 | 100.0 % | 16 / 16 | 95.5 % | 404 / 423 | 100.0 % | 16 / 16 |
src/backend/optimizer/path/joinrels.c | 92.5 % | 422 / 456 | 100.0 % | 16 / 16 | 93.0 % | 424 / 456 | 100.0 % | 16 / 16 |
src/backend/optimizer/plan/createplan.c | 90.9 % | 1928 / 2122 | 96.3 % | 103 / 107 | 91.0 % | 1930 / 2122 | 96.3 % | 103 / 107 |
src/backend/optimizer/plan/planner.c | 94.9 % | 1609 / 1696 | 97.6 % | 41 / 42 | 94.9 % | 1609 / 1696 | 97.6 % | 41 / 42 |
src/backend/optimizer/plan/setrefs.c | 91.3 % | 806 / 883 | 94.3 % | 33 / 35 | 91.3 % | 806 / 883 | 94.3 % | 33 / 35 |
src/backend/optimizer/prep/prepunion.c | 95.5 % | 661 / 692 | 100.0 % | 25 / 25 | 95.5 % | 661 / 692 | 100.0 % | 25 / 25 |
src/backend/optimizer/util/pathnode.c | 88.7 % | 1144 / 1290 | 98.1 % | 52 / 53 | 88.8 % | 1146 / 1290 | 98.1 % | 52 / 53 |
src/backend/optimizer/util/placeholder.c | 96.5 % | 139 / 144 | 100.0 % | 10 / 10 | 96.5 % | 139 / 144 | 100.0 % | 10 / 10 |
src/backend/optimizer/util/plancat.c | 89.0 % | 540 / 607 | 94.7 % | 18 / 19 | 89.6 % | 544 / 607 | 94.7 % | 18 / 19 |
src/backend/optimizer/util/relnode.c | 95.3 % | 548 / 575 | 100.0 % | 24 / 24 | 95.3 % | 548 / 575 | 100.0 % | 24 / 24 |
src/backend/utils/misc/guc.c | 67.4 % | 1536 / 2278 | 89.7 % | 113 / 126 | 67.4 % | 1536 / 2278 | 89.7 % | 113 / 126 |
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB CorporationAttachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Tue, Sep 19, 2017 at 5:47 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Done. Committed 0001 with extensive editorialization. I did not think it was a good idea to include a partition.h a file in src/include/nodes, so I worked around that. The include of pg_inherits_fn.h was unneeded. I rewrote a lot of the comments and made some other style tweaks. Don't look now, but I think it might be about time for the main act. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Thu, Sep 21, 2017 at 9:12 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 19, 2017 at 5:47 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Done. > > Committed 0001 with extensive editorialization. I did not think it > was a good idea to include a partition.h a file in src/include/nodes, > so I worked around that. The include of pg_inherits_fn.h was > unneeded. I rewrote a lot of the comments and made some other style > tweaks. Thanks a lot Robert. Thanks for changing comments to be more precise and crisp. Here's set of rebased patches. The patch with extra tests is not for committing. All other patches, except the last one, will need to be committed together. The last patch may be committed along with other patches or as a separate patch. About your earlier comment of making build_joinrel_partition_info() simpler. Right now, the code assumes that partexprs or nullable_partexpr can be NULL when either of them is not populated. That may be saves a sizeof(pointer) * (number of keys) byes of memory. Saving that much memory may not be worth the complexity of code. So, we may always allocate memory for those arrays and fill it with NIL values when there are no key expressions to populate those. That will simplify the code. I haven't done that change in this patchset. I was busy debugging the Q7 regression. Let me know your comments about that. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > About your earlier comment of making build_joinrel_partition_info() > simpler. Right now, the code assumes that partexprs or > nullable_partexpr can be NULL when either of them is not populated. > That may be saves a sizeof(pointer) * (number of keys) byes of memory. > Saving that much memory may not be worth the complexity of code. So, > we may always allocate memory for those arrays and fill it with NIL > values when there are no key expressions to populate those. That will > simplify the code. I haven't done that change in this patchset. I was > busy debugging the Q7 regression. Let me know your comments about > that. Hmm, I'm not sure that's the best approach, but let me look at it more carefully before I express a firm opinion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Mon, Sep 18, 2017 at 10:18 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: >> > > Limit (cost=83341943.28..83341943.35 rows=1 width=92) (actual > time=1556989.996..1556989.997 rows=1 loops=1) > -> Finalize GroupAggregate (cost=83341943.28..83342723.24 > rows=10064 width=92) (actual time=1556989.994..1556989.994 rows=1 > loops=1) > Group Key: n1.n_name, n2.n_name, (date_part('year'::text, > (lineitem_001.l_shipdate)::timestamp without time zone)) > -> Sort (cost=83341943.28..83342043.92 rows=40256 width=92) > (actual time=1556989.910..1556989.911 rows=6 loops=1) > Sort Key: n1.n_name, n2.n_name, > (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without > time zone)) > Sort Method: quicksort Memory: 27kB > -> Gather (cost=83326804.81..83338864.31 rows=40256 > width=92) (actual time=1550598.855..1556989.760 rows=20 loops=1) > Workers Planned: 4 > Workers Launched: 4 > > AFAICU the node above sort is group-aggregate and then there is limit, > and the number of rows for sort node in explain analyse is returned > number of rows. So, what is happening here is once one group is > completed it is aggregated and fetched by limit, now there is no need > for sort to return any more rows and hence the result. Thanks for your explanation. That makes sense. I forgot about LIMIT node on top. I debugged the plans today and performed some experiments. Here are my observations The join order with and without partition-wise join changes. Without partition-wise join it is (lineitem, (suppliers, nation1)), (orders, (customer, nation2)). The join (lineitem, (suppliers, nation1)) is executed by one gather node and (orders, (customer, nation2)) is executed by other. Thus the plan has two gather nodes, which feed to the topmost join. With partition-wise join the join order is ((lineitem, orders), (supplier, nation1)), (customer, nation2). The join (lineitem, orders) uses partition-wise join. This plan executes the whole join tree along with partial group aggregation under a gather merge. The rows estimated for various nodes under Gather/GatherMerge are different from the actual rows e.g. -> Hash Join (cost=113164.47..61031454.40 rows=10789501 width=46) (actual time=3379.931..731987.943 rows=8744357 loops=5) (in non-partition-wise join plan) OR -> Append (cost=179532.36..80681785.95 rows=134868761 width=24) (actual time=9437.573..1360219.567 rows=109372134 loops=5) (in partition-wise join plan). I first thought that this is a real estimation error and spent some time investigating the estimation error. But eventually realised that this is how a parallel query plan reports, when I saw that Gather node estimated correct number of rows even though the nodes under it showed this difference. Here's the explanation of this report. There are 4 parallel workers, so, the leaders contribution would be estimated to be 0 by get_parallel_divisor(). So these estimates are per worker and so the total estimated rows produced by any of the nodes is 4 times the reported. But when the query actually runs, the leader also participates, so number of loops = 5 and the actual rows reported are (total actual rows) / (number of loops i.e. number of backends that executed the query). The total estimates rows and total actual rows are roughly equal. So there's no real estimation error, as I thought earlier. May be we want to make EXPLAIN (ANALYZE) output easier to understand. When I tried the same query on laptop with scale 20, I found that the leader is really contributing as much as other workers. So, the partial paths were really created based on an estimate which was 20% off. The cost difference between partition-wise join plan and non-partition-wise join plan is hardly 1.5%. So, it's possible that if we correct this estimation error, partition-wise join plan won't be chosen because of it will have a higher cost. Remember there are two gather nodes in non-partition-wise join plan and partition-wise join plan has one gather. So, non-partition-wise join path gets the 20% decreased estimates twice and partition-wise join gets it only once. The explain (analyze, verbose) of a parallel node looks like -> Parallel Seq Scan on public.lineitem_002 (cost=0.00..168752.99 rows=573464 width=24) (actual time=1.395..3075.485 rows=454464 loops=5) Filter: ((lineitem_002.l_shipdate >= '1995-01-01'::date) AND (lineitem_002.l_shipdate <= '1996-12-31'::date)) Rows Removed by Filter: 1045065 Worker 0: actual time=3.358..3131.426 rows=458267 loops=1 Worker 1: actual time=0.860..3146.282 rows=447231 loops=1 Worker 2: actual time=1.317..3123.646 rows=489960 loops=1 Worker 3: actual time=0.927..3130.497 rows=475545 loops=1 If we sum the rows returned by each worker they don't add up to (actual rows) * (actual loops). So I assumed that the unreported number of rows were processed by the leader. Is that right? I might be misunderstanding how parallel query works, but here's my analysis so far. I will continue investigating further. Any clues would be helpful. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rafia Sabih
Date:
On Tue, Sep 19, 2017 at 2:58 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih > <rafia.sabih@enterprisedb.com> wrote: >>> >> On TPC-H benchmarking of this patch, I found a regression in Q7. It >> was taking some 1500s with the patch and some 900s without the patch. >> Please find the attached pwd_reg.zip for the output of explain analyse >> on head and with patch. >> >> The experimental settings used were, >> commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629 >> patch-version = v26 >> >> Server settings: >> work_mem = 1GB >> shared_buffers = 10GB >> effective_cache_size = 10GB >> max_parallel_workers_per_gather = 4 >> >> Partitioning information: >> Partitioning scheme = by range >> Number of partitions in lineitem and orders table = 106 >> partition key for lineitem = l_orderkey >> partition key for orders = o_orderkey >> >> Apart from these there is a regression case on a custom table, on head >> query completes in 20s and with this patch it takes 27s. Please find >> the attached .out and .sql file for the output and schema for the test >> case respectively. I have reported this case before (sometime around >> March this year) as well, but I am not sure if it was overlooked or is >> an unimportant and expected behaviour for some reason. >> > > On completing the benchmark for all queries for the above mentioned > setup, following performance improvement can be seen, > Query | Patch | Head > 3 | 1455 | 1631 > 4 | 499 | 4344 > 5 | 1464 | 1606 > 10 | 1475 | 1599 > 12 | 1465 | 1790 > > Note that all values of execution time are in seconds. I compared this experiment with non-partitioned database and following is the result, Query | Non-partitioned head 3 | 1752 4 | 315 5 | 2319 10 | 1535 12 | 1739 In summary, the query that appears slowest in partitioned database is not so otherwise. It is good to see that in Q4 partition-wise join helps in achieving performance closer to it's non-partitioned case, otherwise partitioning alone causes it to suffer greatly. Apart from Q4 it does not looks like partitioning hurts anywhere else, though the maximum improvement is ~35% for Q5. Another point to note here is that the performance on partitioned and unpartitioned heads are quite close (except Q4) which is something atleast I wasn't expecting. It looks like we need not to partition the tables anyway, or atleast this set of queries doesn't benefit from partitioning. Please let me know if somebody has better ideas on how partitioning schemes should be applied to make it more beneficial for these queries. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Sep 22, 2017 at 10:45 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: >> >> On completing the benchmark for all queries for the above mentioned >> setup, following performance improvement can be seen, >> Query | Patch | Head >> 3 | 1455 | 1631 >> 4 | 499 | 4344 >> 5 | 1464 | 1606 >> 10 | 1475 | 1599 >> 12 | 1465 | 1790 >> >> Note that all values of execution time are in seconds. > > I compared this experiment with non-partitioned database and following > is the result, > Query | Non-partitioned head > 3 | 1752 > 4 | 315 > 5 | 2319 > 10 | 1535 > 12 | 1739 > > In summary, the query that appears slowest in partitioned database is > not so otherwise. It is good to see that in Q4 partition-wise join > helps in achieving performance closer to it's non-partitioned case, > otherwise partitioning alone causes it to suffer greatly. Apart from > Q4 it does not looks like partitioning hurts anywhere else, though the > maximum improvement is ~35% for Q5. > Another point to note here is that the performance on partitioned and > unpartitioned heads are quite close (except Q4) which is something > atleast I wasn't expecting. It looks like we need not to partition the > tables anyway, or atleast this set of queries doesn't benefit from > partitioning. Please let me know if somebody has better ideas on how > partitioning schemes should be applied to make it more beneficial for > these queries. Just partitioning is not expected to improve query performance (but we still see some performance improvement). Partitioning + partition-wise operations, pruning is expected to show performance gains. IIUC the results you reported, Q3 takes 1752 seconds with non-partitioned head, with partitioning it completes in 1631 seconds and with partition-wise join it completes in 1455, so net improvement because of partitioning is 300 seconds is almost 16% improvement, which is a lot for very large data. So, except Q4, every query improves when the tables are partitioned. Am I interpreting the results correctly? There may be some other way of partitioning, which may give better results, but I think what we have now shows the importance of partitioning in case of very large data e.g. scale 300 TPCH. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Sep 15, 2017 at 5:29 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > >> >> Apart from these there is a regression case on a custom table, on head >> query completes in 20s and with this patch it takes 27s. Please find >> the attached .out and .sql file for the output and schema for the test >> case respectively. I have reported this case before (sometime around >> March this year) as well, but I am not sure if it was overlooked or is >> an unimportant and expected behaviour for some reason. >> > > Are you talking about [1]? I have explained about the regression in > [2] and [3]. This looks like an issue with the existing costing model. > I debugged this case further. There are two partitioned tables being joined prt (with partitions prt_p1, prt_p2 and so on) and prt2 (with partitions prt2_p1, prt2_p2, and so on). When join is executed without partition-wise join, prt2 is used to build hash table and prt is used to probe that hash table. prt2 has lesser number of rows than prt. But when partition-wise join is used, individual partitions are joined in reverse join order i.e. partitions of prt are used to build the hash table and partitions of prt2 are used to probe. This happens because the path for the other join order (partition of prt2 used to build the hash table and partition of prt used to probe) has huge cost compared to the first one (74459 and 313109) and a portion worth 259094 comes from lines 3226/7 of final_cost_hashjoin() 3215 /* 3216 * The number of tuple comparisons needed is the number of outer 3217 * tuples times the typical number of tuples in a hash bucket, which 3218 * is the inner relation size times its bucketsize fraction. At each 3219 * one, we need to evaluate the hashjoin quals. But actually, 3220 * charging the full qual eval cost at each tuple is pessimistic, 3221 * since we don't evaluate the quals unless the hash values match 3222 * exactly. For lack of a better idea, halve the cost estimate to 3223 * allow for that. 3224 */ 3225 startup_cost += hash_qual_cost.startup; 3226 run_cost += hash_qual_cost.per_tuple * outer_path_rows * 3227 clamp_row_est(inner_path_rows * innerbucketsize) * 0.5; That's because for some reason innerbucketsize for partition of prt is 22 times more than that for partition of prt2. Looks like we have some estimation error in estimating bucket sizes. If I force partitions to be joined with the same order as partitioned tables (without partition-wise join), child-joins execute faster and in turn partition-wise join performs better than the non-partition-wise join. So, this is clearly some estimation and costing problem with regular joins. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Here's set of rebased patches. The patch with extra tests is not for > committing. All other patches, except the last one, will need to be > committed together. The last patch may be committed along with other > patches or as a separate patch. In set_append_rel_size, is it necessary to set attr_needed = bms_copy(rel->attr_needed[index]) rather than just pointing to the existing value? If so, perhaps the comments should explain the reasons. I would have thought that the values wouldn't change after this point, in which case it might not be necessary to copy them. Regarding nomenclature and my previous griping about wisdom, I was wondering about just calling this a "partition join" like you have in the regression test. So the GUC would be enable_partition_join, you'd have generate_partition_join_paths(), etc. Basically just delete "wise" throughout. The elog(DEBUG3) in try_partition_wise_join() doesn't follow message style guidelines and I think should just be removed. It was useful for development, I'm sure, but it's time for it to go. + elog(ERROR, "unrecognized path node type %d", (int) nodeTag(path)); I think we should use the same formulation as elsewhere, namely "unrecognized node type: %d". And likewise probably "unexpected join type: %d". partition_join_extras.sql has a bunch of whitespace damage, although it doesn't really matter since, as you say, that's not for commit. (This is not a full review, just a few thoughts.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Oct 3, 2017 at 7:48 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Here's set of rebased patches. The patch with extra tests is not for >> committing. All other patches, except the last one, will need to be >> committed together. The last patch may be committed along with other >> patches or as a separate patch. > > In set_append_rel_size, is it necessary to set attr_needed = > bms_copy(rel->attr_needed[index]) rather than just pointing to the > existing value? If so, perhaps the comments should explain the > reasons. I would have thought that the values wouldn't change after > this point, in which case it might not be necessary to copy them. Right. The only places where attr_needed is changed is in remove_rel_from_query() (useless join removal) and add_vars_to_targetlist(). Both of those happen before set_append_rel_size(). Since parent and child join should project same attributes, having them share the Relids set makes more sense. So, changed accordingly and explained the same in comments. Also, changed list_nth() in the following code block to use list_nth_node(). > > Regarding nomenclature and my previous griping about wisdom, I was > wondering about just calling this a "partition join" like you have in > the regression test. So the GUC would be enable_partition_join, you'd > have generate_partition_join_paths(), etc. Basically just delete > "wise" throughout. Partition-wise join is standard term used in literature and in documentation of other popular DBMSes, so partition_wise makes more sense. But I am fine with partition_join as well. Do you want it partition_join or partitionjoin like enable_mergejoin/enable_hashjoin etc.? > > The elog(DEBUG3) in try_partition_wise_join() doesn't follow message > style guidelines and I think should just be removed. It was useful > for development, I'm sure, but it's time for it to go. Done. > > + elog(ERROR, "unrecognized path node type %d", (int) nodeTag(path)); > > I think we should use the same formulation as elsewhere, namely > "unrecognized node type: %d". And likewise probably "unexpected join > type: %d". Changed "unrecognized path node type" to "unrecognized node type". "unrecognized join type: %d" seems to be used everywhere except postgres_fdw. So, used that. Also added a cast to int similar to other places. > > partition_join_extras.sql has a bunch of whitespace damage, although > it doesn't really matter since, as you say, that's not for commit. > Right. I will remove that patch from the patch-set when those tests are no more needed i.e. once we are done with code changes to the patches. Attached the updated patch-set. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> Regarding nomenclature and my previous griping about wisdom, I was >> wondering about just calling this a "partition join" like you have in >> the regression test. So the GUC would be enable_partition_join, you'd >> have generate_partition_join_paths(), etc. Basically just delete >> "wise" throughout. > > Partition-wise join is standard term used in literature and in > documentation of other popular DBMSes, so partition_wise makes more > sense. But I am fine with partition_join as well. Do you want it > partition_join or partitionjoin like enable_mergejoin/enable_hashjoin > etc.? Well, you're making me have second thoughts. It's really just that partition_wise looks a little awkward to me, and maybe that's not enough reason to change anything. I suppose if I commit it this way and somebody really hates it, it can always be changed later. We're not getting a lot of input from anyone else at the moment. > Attached the updated patch-set. I decided to skip over 0001 for today and spend some time looking at 0002-0006. Comments below. 0002: Looks fine. 0003: The commit message mentions estimate_num_groups but the patch doesn't touch it. I am concerned that this patch might introduce some problem fixed by commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that patch say, at one place, that "This protects against possible incorrect matches to child expressions that contain no Vars." However, if a child expression has no Vars, then I think em->em_relids will be empty, so the bms_is_equal() test that is there now will fail but your proposed bms_is_subset() test will pass. 0004: I suggest renaming get_wholerow_ref_from_convert_row_type to is_converted_whole_row_reference and making it return a bool. The coding of that function is a little strange; why not move Var to an inner scope? Like this: if (IsA(convexpr->arg, var)) { Var *var = castNode(Var, convexpr->arg; if (var->varattno == 0) return var; } Will the statement that "In case of multi-level partitioning, we will have as many nested ConvertRowtypeExpr as there are levels in partition hierarchy" be falsified by Amit Khandekar's pending patch to avoid sticking a ConvertRowTypeExpr on top of another ConvertRowTypeExpr? Even if the answer is "no", I think it might be better to drop this part of the comment; it would be easy for it to become false in the future, because we might want to optimize that case in the future and we'll probably forget to update this comment when we do. In fix_upper_expr_mutator(), you have an if statement whose entire contents are another if statement. I think you should use && instead, and maybe reverse the order of the tests, since context->subplan_itlist->has_conv_whole_rows is probably cheaper to test than a function call. It's also a little strange that this code isn't adjacent too, or merged with, the existing has_non_vars case. Maybe: converted_whole_row = is_converted_whole_row_reference(node); if (context->outer_itlist && (context->outer_itlist->has_non_vars || (context->outer_itlist->has_conv_whole_rows && converted_whole_row)) ... if (context->inner_itlist && (context->inner_itlist->has_non_vars || (context->inner_itlist->has_conv_whole_rows && converted_whole_row)) ... 0005: The comment explaining why the ParamPathInfo is allocated in the same context as the RelOptInfo is a modified copy of an existing comment that still reads like the original, a manner of commenting I find a bit undesirable as it leads to filling up the source base with duplicate comments. I don't think I believe that comment, either. In the case from which that comment was copied (mark_dummy_rel), it was talking about a RelOptInfo, and geqo_eval() takes care to remove any leftover pointers to joinrels creating during a GEQO cycle. But there's no similar logic for ppilist, so I think what will happen here is that you'll end up with a freed node in the middle of the list. I think reparameterize_path_by_chid() could use a helper function reparameterize_pathlist_by_child() that iterates over a list of paths and returns a list of paths. That would remove some of the loops. I think the comments for reparameterize_path_by_child() need to be expanded. They don't explain how you decided which nodes need to be handled here or which fields within those nodes need some kind of handling other than a flat-copy. I think these kinds of explanations will be important for future maintenance of this code. You know why you did it this way, I can mostly guess what you did it this way, but what about the next person who comes along who hasn't made a detailed study of partition-wise join? I don't see much point in the T_SubqueryScanPath and T_ResultPath cases in reparameterize_path_by_child(). It's just falling through to the default case. I wonder if reparameterize_path_by_child() ought to default to returning NULL rather than throwing an error; the caller would then have to be prepared for that and skip building the path. But that would be more like what reparameterize_path() does, and it would make failure to include some relevant path type here a corner-case performance bug rather than a correctness issue. It seems like someone adding a new path type could quite easily fail to realize that it might need to be added here, or might be unsure whether it's necessary to add it here. 0006: I have some doubts about how stable all of the EXPLAIN outputs are going to be on the buildfarm. I'm not sure what we can really do about that in advance of trying them, but it's a lot of EXPLAIN output. If you have an ideas about how to tighten it up without losing test coverage, that would be good. For example, maybe the "full outer join" case isn't needed given the following test case which is also a full outer join but which covers additional behavior. I think it would be good to have a test case that shows multi-level partition-wise join working across multiple levels. I wrote the attached test, which you're welcome to use if you like it, adapt if you sorta like it, or replace if you dislike it. The table names at least should be changed to something less likely to duplicate other tests. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Amit Langote
Date:
On 2017/10/04 4:27, Robert Haas wrote: > On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>> Regarding nomenclature and my previous griping about wisdom, I was >>> wondering about just calling this a "partition join" like you have in >>> the regression test. So the GUC would be enable_partition_join, you'd >>> have generate_partition_join_paths(), etc. Basically just delete >>> "wise" throughout. >> >> Partition-wise join is standard term used in literature and in >> documentation of other popular DBMSes, so partition_wise makes more >> sense. But I am fine with partition_join as well. Do you want it >> partition_join or partitionjoin like enable_mergejoin/enable_hashjoin >> etc.? > > Well, you're making me have second thoughts. It's really just that > partition_wise looks a little awkward to me, and maybe that's not > enough reason to change anything. I suppose if I commit it this way > and somebody really hates it, it can always be changed later. We're > not getting a lot of input from anyone else at the moment. FWIW, the name enable_partition_join seems enough to convey the core feature, that is, I see "_wise" as redundant, even though I'm now quite used to seeing "_wise" in the emails here and saying it out loud every now and then. Ashutosh may have a point though that users coming from other databases might miss the "_wise". :) Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Oct 4, 2017 at 12:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > 0003: > > The commit message mentions estimate_num_groups but the patch doesn't touch it. This was fixed when we converted many rel->reloptkind == RELOPT_BASEREL to IS_SIMPLE_REL(). I have removed this section from the commit message. > > I am concerned that this patch might introduce some problem fixed by > commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that > patch say, at one place, that "This protects against possible > incorrect matches to child expressions that contain no Vars." > However, if a child expression has no Vars, then I think em->em_relids > will be empty, so the bms_is_equal() test that is there now will fail > but your proposed bms_is_subset() test will pass. bms_is_equal() was enough when there was only a single member in relids but it doesn't work now that there can be multiple of them. bms_is_equal() was replaced with bms_is_subset() to accomodate for ec_members with only a subset of relids when we are searching for a join relation. I am not sure whether your assumption that expression with no Vars would have em_relids empty is correct. I wonder whether we will add any em_is_child members with empty em_relids; looking at process_equivalence() those come from RestrictInfo::left/right_relids which just indicates the relids at which that particular expression can be evaluated. Place holder vars is an example when that can happen, but there may be others. To verify this, I tried attached patch on master and ran make check. The assertion didn't trip. If em_relids is not NULL, bms_is_subset() is fine. If em_relids could indeed go NULL when em_is_child is true, passing NULL relids (for parent rels) to that function can cause unwanted behaviour. bms_is_equal(em->em_relids, relids) will return true turning the if (em->em_is_child && !bms_is_equal(em->em_relids, relids)) to false. This means that we will consider a child member with em_relids NULL even while matching a parent relation. What surprises me is, that commit added a bunch of testcases and none of them failed with this change. Nonetheless, I have changed "matches" with "belongs to" in the prologue of those functions since an exact match won't be possible with child-joins. > > 0004: > > I suggest renaming get_wholerow_ref_from_convert_row_type to > is_converted_whole_row_reference and making it return a bool. Done. > > The coding of that function is a little strange; why not move Var to > an inner scope? Like this: if (IsA(convexpr->arg, var)) { Var *var = > castNode(Var, convexpr->arg; if (var->varattno == 0) return var; } I probably went too far to avoid indented code :). Fixed now. > > Will the statement that "In case of multi-level partitioning, we will > have as many nested ConvertRowtypeExpr as there are levels in > partition hierarchy" be falsified by Amit Khandekar's pending patch to > avoid sticking a ConvertRowTypeExpr on top of another > ConvertRowTypeExpr? Even if the answer is "no", I think it might be > better to drop this part of the comment; it would be easy for it to > become false in the future, because we might want to optimize that > case in the future and we'll probably forget to update this comment > when we do. That might keep someone wondering where the nested ConvertRowtypeExpr's came from. But may be in future those can arise from something other than multi-level partition hierarchy and in that case too the comment would be rendered inaccurate. So done. > > In fix_upper_expr_mutator(), you have an if statement whose entire > contents are another if statement. I think you should use && instead, > and maybe reverse the order of the tests, since > context->subplan_itlist->has_conv_whole_rows is probably cheaper to > test than a function call. It's also a little strange that this code > isn't adjacent too, or merged with, the existing has_non_vars case. > Maybe: > > converted_whole_row = is_converted_whole_row_reference(node); > if (context->outer_itlist && (context->outer_itlist->has_non_vars || > (context->outer_itlist->has_conv_whole_rows && converted_whole_row)) > ... > if (context->inner_itlist && (context->inner_itlist->has_non_vars || > (context->inner_itlist->has_conv_whole_rows && converted_whole_row)) I placed it with the other node types since it's for a specific node type, but I guess your suggestion avoids duplicates and looks better. Done. > ... > > 0005: > > The comment explaining why the ParamPathInfo is allocated in the same > context as the RelOptInfo is a modified copy of an existing comment > that still reads like the original, a manner of commenting I find a > bit undesirable as it leads to filling up the source base with > duplicate comments. I have pointed to mark_dummy_rel() in that comment instead of duplicating the whole paragraph. > > I don't think I believe that comment, either. In the case from which > that comment was copied (mark_dummy_rel), it was talking about a > RelOptInfo, and geqo_eval() takes care to remove any leftover pointers > to joinrels creating during a GEQO cycle. But there's no similar > logic for ppilist, so I think what will happen here is that you'll end > up with a freed node in the middle of the list. In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist with dummy path being created in the same context as the RelOptInfo. Same applies here. While reparameterizing a path tree, we may reach a path for a base relation and create a PPI for a base relation. This may happen when GEQO is planning a join, and thus we are in a short-lived context created by that GEQO cycle. We don't want a base rel PPI to be created in that context, so instead we use the context of base rel itself. Other way round, we don't want to use a longer context for creating PPI for a join relation when it's created by a GEQO cycle. So, we use join relation's context.The code doesn't free up a node in the middle of the list but it avoids such an anomaly. See [1] > > I think reparameterize_path_by_chid() could use a helper function > reparameterize_pathlist_by_child() that iterates over a list of paths > and returns a list of paths. That would remove some of the loops. That's a good idea. Done. > > I think the comments for reparameterize_path_by_child() need to be > expanded. They don't explain how you decided which nodes need to be > handled here or which fields within those nodes need some kind of > handling other than a flat-copy. I think these kinds of explanations > will be important for future maintenance of this code. You know why > you did it this way, I can mostly guess what you did it this way, but > what about the next person who comes along who hasn't made a detailed > study of partition-wise join? We need to reparameterize any path which contains further paths and/or contains expressions that point to the parent relation. For a given path we need to reparameterize any paths that it contains and translate any expressions that are specific to that path. Expressions common across the paths are translated after the switch case. I have added this rule to the comment just above the switch case /* * Copy of the given path. Reparameterize any paths referenced by the given * path. Replace parent Vars in path specific expressions by corresponding * child Vars. */ Does that look fine or we want to add explanation for every node handled here. > > I don't see much point in the T_SubqueryScanPath and T_ResultPath > cases in reparameterize_path_by_child(). It's just falling through to > the default case. I added those cases separately to explain why we should not see those cases in that switch case. I think that explanation is important (esp. considering your comment above) and associating those comment with "case" statement looks better. Are you suggesting that we should add that explanation in default case? > > I wonder if reparameterize_path_by_child() ought to default to > returning NULL rather than throwing an error; the caller would then > have to be prepared for that and skip building the path. But that > would be more like what reparameterize_path() does, and it would make > failure to include some relevant path type here a corner-case > performance bug rather than a correctness issue. It seems like > someone adding a new path type could quite easily fail to realize that > it might need to be added here, or might be unsure whether it's > necessary to add it here. I am OK with that. However reparameterize_path_by_child() and reparameterize_paths_by_child() are callers of reparameterize_path_by_child() so they will need to deal with NULL return. I am fine with that too, but making sure that we are on the same page. If we do that, we could simply assert that the switch case doesn't see T_SubqueryScanPath and T_ResultPath. > > 0006: > > I have some doubts about how stable all of the EXPLAIN outputs are > going to be on the buildfarm. I'm not sure what we can really do > about that in advance of trying them, but it's a lot of EXPLAIN > output. If you have an ideas about how to tighten it up without > losing test coverage, that would be good. For example, maybe the > "full outer join" case isn't needed given the following test case > which is also a full outer join but which covers additional behavior. Yes, I too am thinking about the same. The only reason I have EXPLAIN output there is to check whether partition-wise join is being used or not. The testcase is not interested in the actual shape. It doesn't make sense to just test the output if partition-wise join is not used. May be a function examining the plan tree would help. The function will have to handle Result/Sort nodes on top and make sure that Append has join children. Do you have any other idea to check the shape of the plan tree without the details? Any EXPLAIN switch, existing functions etc.? Removed the extra full outer join testcase. > > I think it would be good to have a test case that shows multi-level > partition-wise join working across multiple levels. I wrote the > attached test, which you're welcome to use if you like it, adapt if > you sorta like it, or replace if you dislike it. The table names at > least should be changed to something less likely to duplicate other > tests. > There are tests for multi-level partitioned table in the file. They test whole partition hierarchy join, part of it being joined based on the quals. Search for -- -- multi-leveled partitions -- Have you looked at those? They test two-level partitioned tables and your test tests three-level partitioned table. I can modify the tests to have three levels of partitions and different partition schemes on different levels. Is that what you expect? [1] https://www.postgresql.org/message-id/CAFjFpRcPutbr4nVAsrY-5q%3DwCFrNK25_3MNhHgyYYM0yeOoj%3DQ%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> About your earlier comment of making build_joinrel_partition_info() >> simpler. Right now, the code assumes that partexprs or >> nullable_partexpr can be NULL when either of them is not populated. >> That may be saves a sizeof(pointer) * (number of keys) byes of memory. >> Saving that much memory may not be worth the complexity of code. So, >> we may always allocate memory for those arrays and fill it with NIL >> values when there are no key expressions to populate those. That will >> simplify the code. I haven't done that change in this patchset. I was >> busy debugging the Q7 regression. Let me know your comments about >> that. > > Hmm, I'm not sure that's the best approach, but let me look at it more > carefully before I express a firm opinion. Having studied this a bit more, I now think your proposed approach is a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I decided to skip over 0001 for today and spend some time looking at > 0002-0006. Back to 0001. + Enables or disables the query planner's use of partition-wise join + plans. When enabled, it spends time in creating paths for joins between + partitions and consumes memory to construct expression nodes to be used + for those joins, even if partition-wise join does not result in the + cheapest path. The time and memory increase exponentially with the + number of partitioned tables being joined and they increase linearly + with the number of partitions. The default is <literal>off</>. I think this is too scary and too much technical detail. I think you could just say something like: Enables or disables use of partition-wise join, which allows a join between partitioned tables to be performed by joining the matching partitions. Partition-wise join currently applies only when the join conditions include all the columns of the partition keys, which must be of the same data type and have exactly matching sets of child partitions. Because partition-wise join planning can use significantly increase CPU time and memory usage during planning, the default is <literal>off</>. +partitioned table. The join partners can not be found in other partitions. This +condition allows the join between partitioned tables to be broken into joins +between the matching partitions. The resultant join is partitioned in the same "The join partners can not be found in other partitions." is redundant with the previous sentence. I suggest deleting it. I also suggest "This condition allows the join between partitioned tables to be broken" -> "Because of this, the join between partitioned tables can be broken". +relation" for both partitioned table as well as join between partitioned tables +which can use partition-wise join technique. for either a partitioned table or a join between compatibly partitioned tables +Partitioning properties of a partitioned relation are stored in +PartitionSchemeData structure. Planner maintains a list of canonical partition +schemes (distinct PartitionSchemeData objects) so that any two partitioned +relations with same partitioning scheme share the same PartitionSchemeData +object. This reduces memory consumed by PartitionSchemeData objects and makes +it easy to compare the partition schemes of joining relations. Not all of the partitioning properties are stored in the PartitionSchemeData structure any more. I think this needs some rethinking and maybe some expansion. As written, each of the first two sentences needs a "the" at the beginning. + /* + * Create "append" paths for partitioned joins. Do this before + * creating GatherPaths so that partial "append" paths in + * partitioned joins will be considered. + */ I think you could shorten this to a single-line comment and just keep the first sentence. Similarly in the other location where you have the same sort of thing. + * child-joins. Otherwise, add_path might delete a path that some "append" + * path has reference to. to which some path generated here has a reference. Here and elsewhere, you use "append" rather than Append to refer to the paths added. I suppose that's weasel-wording to work around the fact that they might be either Append or MergeAppend paths, but I'm not sure it's really going to convey that to anyone. I suggest rephrasing those comments more generically, e.g.: + /* Add "append" paths containing paths from child-joins. */ You could say: Build additional paths for this rel from child-join paths. Or something. + if (!REL_HAS_ALL_PART_PROPS(rel)) + return; Isn't this an unnecessarily expensive test? I mean, it shouldn't be possible for it to have some arbitrary subset. + /* + * Every pair of joining relations we see here should have an equi-join + * between partition keys if this join has been deemed as a partitioned + * join. See build_joinrel_partition_info() for reasons. + */ + Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype, + parent_restrictlist)); I suggest removing this assertion. Seems like overkill to me. + child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo, + child_rel1->relids, + child_rel2->relids); It seems like we might end up doing this multiple times for the same child join, if there are more than 2 tables involved. Not sure if there's a good way to avoid that. Similarly for child_restrictlist. + pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks); Just do bool pk_has_clause[PARTITION_MAX_KEYS] instead. Stack allocation is a lot faster, and then you don't need to pfree it. + /* Remove the relabel decoration. */ the -> any, decoration -> decorations + /* + * Replace the Var nodes of parent with those of children in expressions. + * This function may be called within a temporary context, but the + * expressions will be shallow-copied into the plan. Hence copy those in + * the planner's context. + */ I can't make heads or tails of this comment. --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -23,7 +23,9 @@#include "optimizer/pathnode.h"#include "optimizer/paths.h"#include "optimizer/planmain.h" +#include "optimizer/prep.h"#include "optimizer/restrictinfo.h" +#include "optimizer/tlist.h"#include "optimizer/var.h"#include "parser/parsetree.h"#include "utils/lsyscache.h" Maybe not needed? This is the only hunk in this file? Or should this be part of one of the later patches? + Assert(IS_JOIN_REL(childrel) && IS_JOIN_REL(parentrel)); + + /* Ensure child relation is really what it claims to be. */ + Assert(IS_OTHER_REL(childrel)); I suggest tightening this up a bit by removing the comment and the blank line that precedes it. + foreach(lc, parentrel->reltarget->exprs) + { + PlaceHolderVar *phv = lfirst(lc); + + if (IsA(phv, PlaceHolderVar)) + { + /* + * In case the placeholder Var refers to any of the parent + * relations, translate it to refer to the corresponding child. + */ + if (bms_overlap(phv->phrels, parentrel->relids) && + childrel->reloptkind == RELOPT_OTHER_JOINREL) + { + phv = (PlaceHolderVar *) adjust_appendrel_attrs(root, + (Node *) phv, + nappinfos, + appinfos); + } + + childrel->reltarget->exprs = lappend(childrel->reltarget->exprs, + phv); + phv_added = true; + } + } What if the PHV is buried down inside the expression someplace rather than being at the top level? More generally, why are we not just applying adjust_appendrel_attrs() to the whole expression? + /* Adjust the cost and width of child targetlist. */ + if (phv_added) + { + childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup; + childrel->reltarget->cost.per_tuple = parentrel->reltarget->cost.per_tuple; + childrel->reltarget->width = parentrel->reltarget->width; + } Making this conditional on phv_added is probably not saving anything. Branches are expensive. /* * Otherwise, anything in a baserel or joinrel targetlist ought to be - * a Var. (More general cases can only appear in appendrel child - * rels, which will never be seen here.) + * a Var or ConvertRowtypeExpr. For either of those, find the original + * baserel where they originate. */ Hmm, but now we could potentially see an appendrel child rel here, so don't we need to worry about more general cases? If not, let's explain why not. + * if, it's a ConvertRowtypeExpr, it will be computed only for the American usage does not put a comma after if like this (unless you are writing writing if, for example, blah blah blah -- but there the commas are to surround for example, not due to the if itself). +/* + * build_joinrel_partition_info + * If the join between given partitioned relations is possibly partitioned + * set the partitioning scheme and partition keys expressions for the + * join. + * + * If the two relations have same partitioning scheme, their join may be + * partitioned and will follow the same partitioning scheme as the joining + * relations. + */ I think you could drop the primary comment block and use the secondary block as the primary one. That is, get rid of "If the join between..." and promote "If the two relations...". + * The join is not partitioned, if any of the relations being joined are Another comma that's not typical of American usage. + * For an N-way inner join, where every syntactic inner join has equi-join has -> has an + * For an N-way join with outer joins, where every syntactic join has an + * equi-join between partition keys and a matching partitioning scheme, + * outer join reordering identities in optimizer/README imply that only + * those pairs of join are legal which have an equi-join between partition + * keys. Thus every pair of joining relations we see for this join should + * have an equi-join between partition keys if this join has been deemed + * as a partitioned join. In line 2, partition keys -> the partition keys In line 3, outer join -> the outer join "pairs of join" sounds wrong too, although I'm not sure how to reword it. More broadly: I don't think I understand this comment. The statement about "those pairs of join are legal which have an equi-join between partition keys" doesn't match my understanding e.g. A IJ B ON A.x = B.x LJ C ON A.x = C.x surely allows a B-C join, but there's no such clause syntatically. Maybe you could replace this whole comment block with something like this: We can only consider this join as an input to further partition-wise joins if (a) the input relations are partitioned, (b) the partition schemes match, and (c) we can identify an equi-join between the partition keys. Note that if it were possible for have_partkey_equi_join to return different answers for the same joinrel depending on which join ordering we try first, this logic would break. That shouldn't happen, though, because of the way the query planner deduces implied equalities. + * Join relation is partitioned using same partitioning scheme as the + * joining relations and has same bounds. the same partitioning scheme + * An INNER join between two partitioned relations is partitioned by key + * expressions from both the relations. For tables A and B partitioned by + * a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by + * both A.a and B.b. + * + * A SEMI/ANTI join only retains data from the outer side and is + * partitioned by the partition keys of the outer side. I would write: An INNER join between two partitioned relations can be regarded as partitioned by either key expression. For example, A INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on B.b; they are equivalent. For a SEMI or ANTI join, the result can only be regarded as being partitioned in the same manner as the outer side, since the inner columns are not retained. + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with + * B.b NULL. These rows may not fit the partitioning conditions imposed on + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. Good. + * Strictly speaking, partition keys of an OUTER join should include + * partition key expressions from the OUTER side only. Consider a join I would join this with the previous sentence instead of repeating strictly speaking: ...and thus the partition keys should include partition key expressions from the OUTER side only. After that sentence, I'd skip a lot of the intermediate words here and continue this way: However, because all commonly-used comparison operators are strict, the presence of nulls on the outer side doesn't cause any problem; they can't match anything at future join levels anyway. Therefore, we track two sets of expressions: those that authentically partition the relation (partexprs) and those that partition the relation with the exception that extra nulls may be present (nullable_partexprs). When the comparison operator is strict, the latter is just as good as the former. Then, I think you can omit the rest of what you have; it should be clear enough what's going on for the full and right cases given that explanation. + * being joined. partexprs and nullable_partexprs are arrays containing part_scheme->partnatts Long line, needs reflowing. I don't think this is too far from being committable. You've done some nice work here! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Wed, Oct 4, 2017 at 11:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: > + Enables or disables the query planner's use of partition-wise join > + plans. When enabled, it spends time in creating paths for joins between > + partitions and consumes memory to construct expression nodes to be used > + for those joins, even if partition-wise join does not result in the > + cheapest path. The time and memory increase exponentially with the > + number of partitioned tables being joined and they increase linearly > + with the number of partitions. The default is <literal>off</>. > > I think this is too scary and too much technical detail. I think you > could just say something like: Enables or disables use of > partition-wise join, which allows a join between partitioned tables to > be performed by joining the matching partitions. Partition-wise join > currently applies only when the join conditions include all the > columns of the partition keys, which must be of the same data type and > have exactly matching sets of child partitions. Because > partition-wise join planning can use significantly increase CPU time > and memory usage during planning, the default is <literal>off</>. Not enough caffeine, obviously: should have been something like -- Because partition-wise join can significantly increase the CPU and memory costs of planning... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Wed, Oct 4, 2017 at 8:23 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I am not sure whether your assumption that expression with no Vars > would have em_relids empty is correct. I wonder whether we will add > any em_is_child members with empty em_relids; looking at > process_equivalence() those come from RestrictInfo::left/right_relids > which just indicates the relids at which that particular expression > can be evaluated. Place holder vars is an example when that can > happen, but there may be others. To verify this, I tried attached > patch on master and ran make check. The assertion didn't trip. If > em_relids is not NULL, bms_is_subset() is fine. I spent some more time experimenting with this. I found that cases where an em_is_child equivalence class contains multiple relids are quite easy to generate, e.g. select * from foo, bar where foo.a + bar.a = 0, where foo and bar are partitioned. However, I wasn't able to generate a case where an em_is_child equivalence class has no relids at all, and I'm out of ideas about how such a thing could occur. I suspect it can't. I wondered whether there was some problem with the multiple-relids case, but I can't find an example where that misbehaves either. So maybe it's fine (or maybe I'm just not smart enough to find the case where it breaks). >> I don't think I believe that comment, either. In the case from which >> that comment was copied (mark_dummy_rel), it was talking about a >> RelOptInfo, and geqo_eval() takes care to remove any leftover pointers >> to joinrels creating during a GEQO cycle. But there's no similar >> logic for ppilist, so I think what will happen here is that you'll end >> up with a freed node in the middle of the list. > > In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist > with dummy path being created in the same context as the RelOptInfo. > Same applies here. Oops. I was thinking that the ppilist was attached to some planner-global structure, but it's not; it's hanging off the RelOptInfo. So you're entirely right, and I'm just being dumb. > We need to reparameterize any path which contains further paths and/or > contains expressions that point to the parent relation. For a given > path we need to reparameterize any paths that it contains and > translate any expressions that are specific to that path. Expressions > common across the paths are translated after the switch case. I have > added this rule to the comment just above the switch case > /* > * Copy of the given path. Reparameterize any paths referenced by the given > * path. Replace parent Vars in path specific expressions by corresponding > * child Vars. > */ > Does that look fine or we want to add explanation for every node handled here. No, I don't think we want something for every node, just a general explanation at the top of the function. Maybe something like this: Most fields from the original path can simply be flat-copied, but any expressions must be adjusted to refer to the correct varnos, and any paths must be recursively reparameterized. Other fields that refer to specific relids also need adjustment. >> I don't see much point in the T_SubqueryScanPath and T_ResultPath >> cases in reparameterize_path_by_child(). It's just falling through to >> the default case. > > I added those cases separately to explain why we should not see those > cases in that switch case. I think that explanation is important > (esp. considering your comment above) and associating those comment > with "case" statement looks better. Are you suggesting that we should > add that explanation in default case? Or leave the explanation out altogether. >> I wonder if reparameterize_path_by_child() ought to default to >> returning NULL rather than throwing an error; the caller would then >> have to be prepared for that and skip building the path. But that >> would be more like what reparameterize_path() does, and it would make >> failure to include some relevant path type here a corner-case >> performance bug rather than a correctness issue. It seems like >> someone adding a new path type could quite easily fail to realize that >> it might need to be added here, or might be unsure whether it's >> necessary to add it here. > > I am OK with that. However reparameterize_path_by_child() and > reparameterize_paths_by_child() are callers of > reparameterize_path_by_child() so they will need to deal with NULL > return. I am fine with that too, but making sure that we are on the > same page. If we do that, we could simply assert that the switch case > doesn't see T_SubqueryScanPath and T_ResultPath. Or do nothing at all about those cases. I noticed today that the version of the patchset I have here says in the header comments for reparameterize_path_by_child() that it returns NULL if it can't reparameterize, but that's not what it actually does. If you make this change, the existing comment will become correct. The problem with the NULL return convention is that it's not very convenient when this function is recursing. Maybe we should change this function's signature to be bool reparameterize_path_by_child(PlannerInfo *root, RelOptInfo *child_rel, Path **path); then you could do, e.g. if (!reparameterize_path_by_child(root, child_rel, &bhpath->bitmapqual)) return; But I don't really like that approach; it's still quite long-winded. Instead, I suggest Stupid Macro Tricks: #define ADJUST_CHILD_ATTRS(val) \ val = (List *) adjust_appendrel_attrs_multilevel((Node *) val, child_rel->relids, child_rel->top_parent_relids); #define REPARAMETERIZE_CHILD_PATH(val) \ val = reparameterize_path_by_child(root, val, child_rel); \ if (val == NULL)\ return NULL; #define REPARAMETERIZE_CHILD_PATH_LIST(val) \ if (val != NIL) \ { \ val = reparameterize_pathlist_by_child(root,val, child_rel); \ if (val == NIL) \ return NULL; \ } With that, a complicated case like T_NestPath becomes just: JoinPath *jpath; FLAT_COPY_PATH(jpath, path, NestPath); REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath); REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath); ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo); new_path = (Path *) jpath; Now, I admit that hiding stuff inside the macro definitions like that is ugly. But I think it's still better than repeating boilerplate code with finnicky internal bits lots of times. > Yes, I too am thinking about the same. The only reason I have EXPLAIN > output there is to check whether partition-wise join is being used or > not. The testcase is not interested in the actual shape. It doesn't > make sense to just test the output if partition-wise join is not used. > May be a function examining the plan tree would help. The function > will have to handle Result/Sort nodes on top and make sure that Append > has join children. Do you have any other idea to check the shape of > the plan tree without the details? Any EXPLAIN switch, existing > functions etc.? No, not really. We may just need to be prepared to fix whatever breaks. >> I think it would be good to have a test case that shows multi-level >> partition-wise join working across multiple levels. I wrote the >> attached test, which you're welcome to use if you like it, adapt if >> you sorta like it, or replace if you dislike it. The table names at >> least should be changed to something less likely to duplicate other >> tests. >> > > There are tests for multi-level partitioned table in the file. They > test whole partition hierarchy join, part of it being joined based on > the quals. Search for > -- > -- multi-leveled partitions > -- > > Have you looked at those? They test two-level partitioned tables and > your test tests three-level partitioned table. I can modify the tests > to have three levels of partitions and different partition schemes on > different levels. Is that what you expect? Oops, no, I just missed the test case. I saw the one that said "inner join, qual covering only top-level partitions" and missed that there were others later where the quals covered lower levels also. Instead of "multi-leveled partitions" it might read better to say "multiple levels of partitioning". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Oct 4, 2017 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> About your earlier comment of making build_joinrel_partition_info() >>> simpler. Right now, the code assumes that partexprs or >>> nullable_partexpr can be NULL when either of them is not populated. >>> That may be saves a sizeof(pointer) * (number of keys) byes of memory. >>> Saving that much memory may not be worth the complexity of code. So, >>> we may always allocate memory for those arrays and fill it with NIL >>> values when there are no key expressions to populate those. That will >>> simplify the code. I haven't done that change in this patchset. I was >>> busy debugging the Q7 regression. Let me know your comments about >>> that. >> >> Hmm, I'm not sure that's the best approach, but let me look at it more >> carefully before I express a firm opinion. > > Having studied this a bit more, I now think your proposed approach is > a good idea. Thanks. Done. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Alvaro Herrera
Date:
Robert Haas wrote: > Regarding nomenclature and my previous griping about wisdom, I was > wondering about just calling this a "partition join" like you have in > the regression test. So the GUC would be enable_partition_join, you'd > have generate_partition_join_paths(), etc. Basically just delete > "wise" throughout. If I understand correctly, what's being used here is the "-wise" suffix, unrelated to wisdom, which Merriam Webster lists as "adverb combining form" here https://www.merriam-webster.com/dictionary/wise (though you have to scroll down a lot), which is defined as 1 a :in the manner of * crabwise * fanwise b :in the position or direction of * slantwise * clockwise 2 :with regard to :in respect of * dollarwise According to that, the right way to write this is "partitionwise join" (no dash), which means "join in respect of partitions", "join with regard to partitions". -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Thu, Oct 5, 2017 at 9:48 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> Regarding nomenclature and my previous griping about wisdom, I was >> wondering about just calling this a "partition join" like you have in >> the regression test. So the GUC would be enable_partition_join, you'd >> have generate_partition_join_paths(), etc. Basically just delete >> "wise" throughout. > > If I understand correctly, what's being used here is the "-wise" suffix, > unrelated to wisdom, which Merriam Webster lists as "adverb combining > form" here https://www.merriam-webster.com/dictionary/wise (though you > have to scroll down a lot), which is defined as > > 1 a :in the manner of * crabwise * fanwise > b :in the position or direction of * slantwise * clockwise > 2 :with regard to :in respect of * dollarwise > > According to that, the right way to write this is "partitionwise join" > (no dash), which means "join in respect of partitions", "join with > regard to partitions". I'm fine with that, if others like it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Thu, Oct 5, 2017 at 7:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: > >> Regarding nomenclature and my previous griping about wisdom, I was >> wondering about just calling this a "partition join" like you have in >> the regression test. So the GUC would be enable_partition_join, you'd >> have generate_partition_join_paths(), etc. Basically just delete >> "wise" throughout. > > If I understand correctly, what's being used here is the "-wise" suffix, > unrelated to wisdom, which Merriam Webster lists as "adverb combining > form" here https://www.merriam-webster.com/dictionary/wise (though you > have to scroll down a lot), which is defined as > > 1 a :in the manner of * crabwise * fanwise > b :in the position or direction of * slantwise * clockwise > 2 :with regard to :in respect of * dollarwise > That's right. > According to that, the right way to write this is "partitionwise join" > (no dash), which means "join in respect of partitions", "join with > regard to partitions". Google lists mostly "partition wise" or "partition-wise" and very rarely "partitionwise". The first being used in other DBMS literature. The paper (there aren't many on this subject) I referred [1] uses "partition-wise". It made more sense to replace " " or "-" with "_" when syntax doesn't allow the first two. I am not against "partitionwise" but I don't see any real reason why we should move away from popular usage of this term. [1] https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Oct 4, 2017 at 9:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I decided to skip over 0001 for today and spend some time looking at >> 0002-0006. > > Back to 0001. > > + Enables or disables the query planner's use of partition-wise join > + plans. When enabled, it spends time in creating paths for joins between > + partitions and consumes memory to construct expression nodes to be used > + for those joins, even if partition-wise join does not result in the > + cheapest path. The time and memory increase exponentially with the > + number of partitioned tables being joined and they increase linearly > + with the number of partitions. The default is <literal>off</>. > > I think this is too scary and too much technical detail. I think you > could just say something like: Enables or disables use of > partition-wise join, which allows a join between partitioned tables to > be performed by joining the matching partitions. Partition-wise join > currently applies only when the join conditions include all the > columns of the partition keys, which must be of the same data type and > have exactly matching sets of child partitions. Because > partition-wise join planning can use significantly increase CPU time > and memory usage during planning, the default is <literal>off</>. Done. With slight change. "include all the columns of the partition keys" has a different meaning when partition key is an expression, so I have used "include all the partition keys". Also changed the last sentence as "... can use significantly more CPU time and memory during planning ...". Please feel free to revert those changes, if you don't like them. > > +partitioned table. The join partners can not be found in other partitions. This > +condition allows the join between partitioned tables to be broken into joins > +between the matching partitions. The resultant join is partitioned in the same > > "The join partners can not be found in other partitions." is redundant > with the previous sentence. I suggest deleting it. I also suggest > "This condition allows the join between partitioned tables to be > broken" -> "Because of this, the join between partitioned tables can > be broken". Done. > > +relation" for both partitioned table as well as join between partitioned tables > +which can use partition-wise join technique. > > for either a partitioned table or a join between compatibly partitioned tables Done. > > +Partitioning properties of a partitioned relation are stored in > +PartitionSchemeData structure. Planner maintains a list of canonical partition > +schemes (distinct PartitionSchemeData objects) so that any two partitioned > +relations with same partitioning scheme share the same PartitionSchemeData > +object. This reduces memory consumed by PartitionSchemeData objects and makes > +it easy to compare the partition schemes of joining relations. > > Not all of the partitioning properties are stored in the > PartitionSchemeData structure any more. I think this needs some > rethinking and maybe some expansion. As written, each of the first > two sentences needs a "the" at the beginning. Changed to The partitioning properties of a partitioned relation are stored in its RelOptInfo. The information about data types of partition keys are stored in PartitionSchemeData structure. The planner maintains a list of canonical partition schemes (distinct PartitionSchemeData objects) so that RelOptInfo of any two partitioned relations with same partitioning scheme point to the same PartitionSchemeData object. This reduces memory consumed by PartitionSchemeData objects and makes it easy to compare the partition schemes of joining relations. Let me know if this looks good. > > + /* > + * Create "append" paths for > partitioned joins. Do this before > + * creating GatherPaths so that > partial "append" paths in > + * partitioned joins will be considered. > + */ > > I think you could shorten this to a single-line comment and just keep > the first sentence. Similarly in the other location where you have > the same sort of thing. Done. > > + * child-joins. Otherwise, add_path might delete a path that some "append" > + * path has reference to. > > to which some path generated here has a reference. Done. > > Here and elsewhere, you use "append" rather than Append to refer to > the paths added. I suppose that's weasel-wording to work around the > fact that they might be either Append or MergeAppend paths, but I'm > not sure it's really going to convey that to anyone. I suggest > rephrasing those comments more generically, e.g.: > > + /* Add "append" paths containing paths from child-joins. */ > > You could say: Build additional paths for this rel from child-join paths. > > Or something. Done. Removed word "append" from the comments in merge_clump(), standard_join_search() and prologue of generate_partition_wise_join_paths(). Changed the last comment as per your suggestion. > > + if (!REL_HAS_ALL_PART_PROPS(rel)) > + return; > Isn't this an unnecessarily expensive test? I mean, it shouldn't be > possible for it to have some arbitrary subset. All this function cares about is whether the given relation has any partitions which can be simply checked by rel->nparts > 0 and rel->part_rels != NULL. We need to explicitly check part_rels because an outer join which has empty inner side in every pair will have part_scheme, partbounds, nparts all set, but not part_rels. See relevant comments in try_partition_wise_join() for more details. I have now replaced macro with checks on rel->nparts and rel->part_rels. This would change with the last patch dealing with partition-wise join involving dummy relations. Once we have that an outer join like above will also have part_rels set. But even then I think checking for part_rels and nparts makes more sense than part_scheme and partbounds. > > + /* > + * Every pair of joining relations we see here should have an equi-join > + * between partition keys if this join has been deemed as a partitioned > + * join. See build_joinrel_partition_info() for reasons. > + */ > + Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype, > + > parent_restrictlist)); > > I suggest removing this assertion. Seems like overkill to me. I thought it was good to have there to catch any bug breaking that rule. But I have removed it as per your suggestion. Do you think we should remove following assertions as well? /* * Since we allow partition-wise join only when the partitionbounds of * the joining relations exactly match, the partition bounds of the join * should match those ofthe joining relations. */ Assert(partition_bounds_equal(joinrel->part_scheme->partnatts, joinrel->part_scheme->parttyplen, joinrel->part_scheme->parttypbyval, joinrel->boundinfo, rel1->boundinfo)); Assert(partition_bounds_equal(joinrel->part_scheme->partnatts, joinrel->part_scheme->parttyplen, joinrel->part_scheme->parttypbyval, joinrel->boundinfo, rel2->boundinfo)); > > + child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo, > + > child_rel1->relids, > + > child_rel2->relids); > > It seems like we might end up doing this multiple times for the same > child join, if there are more than 2 tables involved. Not sure if > there's a good way to avoid that. IIUC every pair of joining relations will use a different sjinfo, A LEFT JOIN B LEFT JOIN C will have two sjinfos one for AB and other for BC. For ABC we will use the one for AB to join A with BC and we will use one for BC to join AB with C. I agree that we are building sjinfo for AB twice once for joining AB and then for A(BC). In order to avoid that we will have to somehow link the parent sjinfo with child sjinfo and avoid translating parent sjinfo again and again. May be the parent sjinfo can contain a cache of child sjinfos.Do we want to do that in this patch set? We could avoid translation entirely, if we could use parent sjinfo for joining children. But that's a pretty deep surgery. > Similarly for child_restrictlist. Similary for restrictlist. Every joining pair has a different restrictlist. Otherwise, we would have saved restrictlist in the joinrel itself. > > + pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks); > > Just do bool pk_has_clause[PARTITION_MAX_KEYS] instead. Stack > allocation is a lot faster, and then you don't need to pfree it. That's a good idea. Done. > > + /* Remove the relabel decoration. */ > > the -> any, decoration -> decorations Done. > > + /* > + * Replace the Var nodes of parent with those of children in > expressions. > + * This function may be called within a temporary context, but the > + * expressions will be shallow-copied into the plan. Hence copy those in > + * the planner's context. > + */ > > I can't make heads or tails of this comment. haha! My bad. the second sentence is something left of the code where the child-joins used to be planned in a temporary memory context. That's not true any more. Removed the entire comment. > > --- a/src/backend/optimizer/util/pathnode.c > +++ b/src/backend/optimizer/util/pathnode.c > @@ -23,7 +23,9 @@ > #include "optimizer/pathnode.h" > #include "optimizer/paths.h" > #include "optimizer/planmain.h" > +#include "optimizer/prep.h" > #include "optimizer/restrictinfo.h" > +#include "optimizer/tlist.h" > #include "optimizer/var.h" > #include "parser/parsetree.h" > #include "utils/lsyscache.h" > > Maybe not needed? This is the only hunk in this file? Or should this > be part of one of the later patches? I think 0005. Sorry. I will move it there. > > + Assert(IS_JOIN_REL(childrel) && IS_JOIN_REL(parentrel)); > + > + /* Ensure child relation is really what it claims to be. */ > + Assert(IS_OTHER_REL(childrel)); > > I suggest tightening this up a bit by removing the comment and the > blank line that precedes it. Done. > > + foreach(lc, parentrel->reltarget->exprs) > + { > + PlaceHolderVar *phv = lfirst(lc); > + > + if (IsA(phv, PlaceHolderVar)) > + { > + /* > + * In case the placeholder Var refers to any > of the parent > + * relations, translate it to refer to the > corresponding child. > + */ > + if (bms_overlap(phv->phrels, parentrel->relids) && > + childrel->reloptkind == RELOPT_OTHER_JOINREL) > + { > + phv = (PlaceHolderVar *) > adjust_appendrel_attrs(root, > + > (Node *) phv, > + > nappinfos, > + > appinfos); > + } > + > + childrel->reltarget->exprs = > lappend(childrel->reltarget->exprs, > + > phv); > + phv_added = true; > + } > + } > > What if the PHV is buried down inside the expression someplace rather > than being at the top level? That can't happen. See add_placeholders_to_joinrel(), which adds these placeholders to joinrel's target. That function adds PHVs as bare nodes, not embedded into something else. > More generally, why are we not just > applying adjust_appendrel_attrs() to the whole expression? Usually targetlists of join have Var nodes which bubble up from the base relations. Even PHVs bubble up from the lowest join where they can be evaluated. If we translate reltarget, we will allocate new Var nodes for every join relation consuming more memory and then setrefs will need to compare the contents of those nodes instead of just pointer comparison. We use this code and attr_needed to avoid memory consumption and setref's CPU consumption. > > + /* Adjust the cost and width of child targetlist. */ > + if (phv_added) > + { > + childrel->reltarget->cost.startup = > parentrel->reltarget->cost.startup; > + childrel->reltarget->cost.per_tuple = > parentrel->reltarget->cost.per_tuple; > + childrel->reltarget->width = parentrel->reltarget->width; > + } > > Making this conditional on phv_added is probably not saving anything. > Branches are expensive. Ok. If there are not PHVs in the query i.e. when root->placeholders_list is NIL, we don't need to scan reltarget->exprs. I have added that optimization. > > /* > * Otherwise, anything in a baserel or joinrel > targetlist ought to be > - * a Var. (More general cases can only appear in > appendrel child > - * rels, which will never be seen here.) > + * a Var or ConvertRowtypeExpr. For either of those, > find the original > + * baserel where they originate. > */ > > Hmm, but now we could potentially see an appendrel child rel here, so > don't we need to worry about more general cases? If not, let's > explain why not. By more general cases, that comment means ConvertRowtypeExpr or RowExpr, nothing else. A base relation's tlist can have only Var nodes when it reaches this comment. When a parent Var node is subjected to adjust_appendrel_attrs() it is translated to a Var node for all varattnos except 0, which indicates a whole-row var. For a child table, a whole-row var is always a named row type and hence gets translated to a ConvertRowExpr. Other kinds of children (subqueries in union etc.) can not appear here since they do not participate in a join directly. So it's really a Var and ConvertRowtypeExpr. I have modified the comment to explain this. > > + * if, it's a ConvertRowtypeExpr, it will be > computed only for the > > American usage does not put a comma after if like this (unless you are > writing writing if, for example, blah blah blah -- but there the > commas are to surround for example, not due to the if itself). That comma was unintentional. Removed. > > +/* > + * build_joinrel_partition_info > + * If the join between given partitioned relations is > possibly partitioned > + * set the partitioning scheme and partition keys > expressions for the > + * join. > + * > + * If the two relations have same partitioning scheme, their join may be > + * partitioned and will follow the same partitioning scheme as the joining > + * relations. > + */ > > I think you could drop the primary comment block and use the secondary > block as the primary one. That is, get rid of "If the join > between..." and promote "If the two relations...". Done. > > + * The join is not partitioned, if any of the relations being joined are > > Another comma that's not typical of American usage. Done. > > + * For an N-way inner join, where every syntactic inner join > has equi-join > > has -> has an > > + * For an N-way join with outer joins, where every syntactic join has an > + * equi-join between partition keys and a matching partitioning scheme, > + * outer join reordering identities in optimizer/README imply that only > + * those pairs of join are legal which have an equi-join > between partition > + * keys. Thus every pair of joining relations we see for this > join should > + * have an equi-join between partition keys if this join has been deemed > + * as a partitioned join. > > In line 2, partition keys -> the partition keys > In line 3, outer join -> the outer join > > "pairs of join" sounds wrong too, although I'm not sure how to reword it. > > More broadly: I don't think I understand this comment. The statement > about "those pairs of join are legal which have an equi-join between > partition keys" doesn't match my understanding e.g. A IJ B ON A.x = > B.x LJ C ON A.x = C.x surely allows a B-C join, but there's no such > clause syntatically. > > Maybe you could replace this whole comment block with something like > this: We can only consider this join as an input to further > partition-wise joins if (a) the input relations are partitioned, (b) > the partition schemes match, and (c) we can identify an equi-join > between the partition keys. Note that if it were possible for > have_partkey_equi_join to return different answers for the same > joinrel depending on which join ordering we try first, this logic > would break. That shouldn't happen, though, because of the way the > query planner deduces implied equalities. Hmm. I meant the second para to be read in the context of the first. Since AB is inner join A.x and B.x are replaceable (I forgot the correct term, identity?) and thus A.x = C.x implies B.x = C.x thus allowing join BC. But I think your version of the comment is easy to understand. But I think it should also refer to the way planner reorders joins; that's what causes us to worry about every join order being partitioned. I think we should redirect a reader, who wants to understand more about implied equalities and join orders, to optimizer/README. So, I have changed the last sentence to read "That shouldn't happen, though, because of the way the query planner deduces implied equalities and reorders joins. See optimizer/README for details." If you don't like my changes, please feel free to drop those. In the code block following this comment, I have used shorter variable names instead of accurate but long ones. E.g. outer_expr should have been outer_partexpr and outer_null_expr should have been outer_nullable_partexpr. Please feel free to change those if you don't like them or let me know if you have any better ideas and I will update the patch with those ideas. > > + * Join relation is partitioned using same partitioning scheme as the > + * joining relations and has same bounds. > > the same partitioning scheme Done. > > + * An INNER join between two partitioned relations is partitioned by key > + * expressions from both the relations. For tables A and B > partitioned by > + * a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by > + * both A.a and B.b. > + * > + * A SEMI/ANTI join only retains data from the outer side and is > + * partitioned by the partition keys of the outer side. > > I would write: An INNER join between two partitioned relations can be > regarded as partitioned by either key expression. For example, A > INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on > B.b; they are equivalent. For a SEMI or ANTI join, the result can > only be regarded as being partitioned in the same manner as the outer > side, since the inner columns are not retained. Done. > > + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with > + * B.b NULL. These rows may not fit the partitioning > conditions imposed on > + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. > > Good. > > + * Strictly speaking, partition keys of an OUTER join should include > + * partition key expressions from the OUTER side only. Consider a join > > I would join this with the previous sentence instead of repeating > strictly speaking: ...and thus the partition keys should include > partition key expressions from the OUTER side only. After that > sentence, I'd skip a lot of the intermediate words here and continue > this way: However, because all commonly-used comparison operators are > strict, the presence of nulls on the outer side doesn't cause any > problem; they can't match anything at future join levels anyway. > Therefore, we track two sets of expressions: those that authentically > partition the relation (partexprs) and those that partition the > relation with the exception that extra nulls may be present > (nullable_partexprs). When the comparison operator is strict, the > latter is just as good as the former. > > Then, I think you can omit the rest of what you have; it should be > clear enough what's going on for the full and right cases given that > explanation. I liked this version. Changed the comments as per your suggestions. > > + * being joined. partexprs and nullable_partexprs are arrays > containing part_scheme->partnatts > > Long line, needs reflowing. Done. Also fixed a grammatical mistake: contains -> contain in the last line of that paragraph. > > I don't think this is too far from being committable. You've done > some nice work here! > Thanks a lot for your detailed reviews and guidance. I will post the updated patchset with my next reply. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Thu, Oct 5, 2017 at 12:24 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> We need to reparameterize any path which contains further paths and/or >> contains expressions that point to the parent relation. For a given >> path we need to reparameterize any paths that it contains and >> translate any expressions that are specific to that path. Expressions >> common across the paths are translated after the switch case. I have >> added this rule to the comment just above the switch case >> /* >> * Copy of the given path. Reparameterize any paths referenced by the given >> * path. Replace parent Vars in path specific expressions by corresponding >> * child Vars. >> */ >> Does that look fine or we want to add explanation for every node handled here. > > No, I don't think we want something for every node, just a general > explanation at the top of the function. Maybe something like this: > > Most fields from the original path can simply be flat-copied, but any > expressions must be adjusted to refer to the correct varnos, and any > paths must be recursively reparameterized. Other fields that refer to > specific relids also need adjustment. Done. > >>> I don't see much point in the T_SubqueryScanPath and T_ResultPath >>> cases in reparameterize_path_by_child(). It's just falling through to >>> the default case. >> >> I added those cases separately to explain why we should not see those >> cases in that switch case. I think that explanation is important >> (esp. considering your comment above) and associating those comment >> with "case" statement looks better. Are you suggesting that we should >> add that explanation in default case? > > Or leave the explanation out altogether. Ok. Removed the explanation and the cases. > >>> I wonder if reparameterize_path_by_child() ought to default to >>> returning NULL rather than throwing an error; the caller would then >>> have to be prepared for that and skip building the path. But that >>> would be more like what reparameterize_path() does, and it would make >>> failure to include some relevant path type here a corner-case >>> performance bug rather than a correctness issue. It seems like >>> someone adding a new path type could quite easily fail to realize that >>> it might need to be added here, or might be unsure whether it's >>> necessary to add it here. >> >> I am OK with that. However reparameterize_path_by_child() and >> reparameterize_paths_by_child() are callers of >> reparameterize_path_by_child() so they will need to deal with NULL >> return. I am fine with that too, but making sure that we are on the >> same page. If we do that, we could simply assert that the switch case >> doesn't see T_SubqueryScanPath and T_ResultPath. > > Or do nothing at all about those cases. > > I noticed today that the version of the patchset I have here says in > the header comments for reparameterize_path_by_child() that it returns > NULL if it can't reparameterize, but that's not what it actually does. > If you make this change, the existing comment will become correct. > > The problem with the NULL return convention is that it's not very > convenient when this function is recursing. Maybe we should change > this function's signature to be bool > reparameterize_path_by_child(PlannerInfo *root, RelOptInfo *child_rel, > Path **path); then you could do, e.g. if > (!reparameterize_path_by_child(root, child_rel, &bhpath->bitmapqual)) > return; > > But I don't really like that approach; it's still quite long-winded. > Instead, I suggest Stupid Macro Tricks: > > #define ADJUST_CHILD_ATTRS(val) \ > val = (List *) adjust_appendrel_attrs_multilevel((Node *) val, > child_rel->relids, child_rel->top_parent_relids); It so happens that every node we subject to adjust_appendrel_attrs_multilevel is List, so this is ok. In case we need to adjust some other type of node in future, we will pass node type too. For now, I have used the macro with (List *) hardcoded there. Do we write the whole macro on the same line even if it overflows? I see that being done for CONSIDER_PATH_STARTUP_COST defined in the same file and you also seem to suggest the same. But macros at other places are indented. For now, I have indented the macro on multiple lines. > > #define REPARAMETERIZE_CHILD_PATH(val) \ > val = reparameterize_path_by_child(root, val, child_rel); \ > if (val == NULL) \ > return NULL; > > #define REPARAMETERIZE_CHILD_PATH_LIST(val) \ > if (val != NIL) \ > { \ > val = reparameterize_pathlist_by_child(root, val, child_rel); \ > if (val == NIL) \ > return NULL; \ > } I added do { } while (0) around these code blocks like other places. Please feel free to remove it if you don't think that's not needed. > > With that, a complicated case like T_NestPath becomes just: > > JoinPath *jpath; > > FLAT_COPY_PATH(jpath, path, NestPath); > REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath); > REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath); > ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo); > new_path = (Path *) jpath; > > Now, I admit that hiding stuff inside the macro definitions like that > is ugly. But I think it's still better than repeating boilerplate > code with finnicky internal bits lots of times. I too do not like hiding stuff under macros since that make debugging hard, but with these macros code looks really elegant. Thanks for the suggestion. Also fixed some lines overflowing character limit. > >> Yes, I too am thinking about the same. The only reason I have EXPLAIN >> output there is to check whether partition-wise join is being used or >> not. The testcase is not interested in the actual shape. It doesn't >> make sense to just test the output if partition-wise join is not used. >> May be a function examining the plan tree would help. The function >> will have to handle Result/Sort nodes on top and make sure that Append >> has join children. Do you have any other idea to check the shape of >> the plan tree without the details? Any EXPLAIN switch, existing >> functions etc.? > > No, not really. We may just need to be prepared to fix whatever breaks. Sure. > > Instead of "multi-leveled partitions" it might read better to say > "multiple levels of partitioning". Done. Here's updated set of patches, rebased on top of the latest head. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Oct 6, 2017 at 5:09 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > Here's updated set of patches, rebased on top of the latest head. In this patchset reparameterize_pathlist_by_child() ignores NULL return from reparameterize_path_by_child(). Fixed that in the attached patchset. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Oct 6, 2017 at 5:17 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Oct 6, 2017 at 5:09 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> >> Here's updated set of patches, rebased on top of the latest head. > > In this patchset reparameterize_pathlist_by_child() ignores NULL > return from reparameterize_path_by_child(). Fixed that in the attached > patchset. > Sorry. I sent a wrong file. Here's the real v37. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Sorry. I sent a wrong file. Here's the real v37. Committed 0001-0006. I made some assorted comment and formatting changes and two small substantive changes: - In try_nestloop_path, bms_free(outerrelids) before returning if we can't reparameterize. - Moved the call to try_partition_wise_join inside populate_joinrel_with_paths, instead of always calling it just after that function is called. I think this is very good work and I'm excited about the feature. Now I'll wait to see whether the buildfarm, or Tom, yell at me for whatever problems this may still have... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Sorry. I sent a wrong file. Here's the real v37. > > Committed 0001-0006. I made some assorted comment and formatting > changes and two small substantive changes: > > - In try_nestloop_path, bms_free(outerrelids) before returning if we > can't reparameterize. Hmm. I missed that. > > - Moved the call to try_partition_wise_join inside > populate_joinrel_with_paths, instead of always calling it just after > that function is called. This looks good too. > > I think this is very good work and I'm excited about the feature. Thanks a lot Robert for detailed review and guidance. Thanks a lot Rafia for benchmarking the feature with TPCH and esp. very large scale database and also for testing and reported some real issues. Thanks Rajkumar for testing it with an exhaustive testset. Thanks Amit Langote, Thomas Munro, Dilip Kumar, Antonin Houska, Alvaro Herrera and Amit Khandekar for their review comments and suggestions. Thanks Jeevan Chalke, who used the patchset to implement partition-wise aggregates and provided some insights offlist. Sorry if I have missed anybody. As Robert says in the commit message, there's more to do but now that we have basic feature, improving it incrementally becomes a lot easier. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > I think this is very good work and I'm excited about the feature. Now > I'll wait to see whether the buildfarm, or Tom, yell at me for > whatever problems this may still have... > Buildfarm animal prion turned red. Before going into that failure, good news is that the other animals are green. So the plans are stable. prion runs the regression with -DRELCACHE_FORCE_RELEASE, which destroys a relcache entry as soon as its reference count drops down to 0. This destroys everything that's there in corresponding relcache entry including partition key information and partition descriptor information. find_partition_scheme() and set_relation_partition_info() both assume that the relcache information will survive as long as the relation lock is held. They do not copy the relevant partitioning information but just copy the pointers. That assumption is wrong. Because of -DRELCACHE_FORCE_RELEASE, as soon as refcount drops to zero, the data in partition scheme and partition bounds goes invalid and various checks to see if partition wise join is possible fail. That causes partition_join test to fail on prion. But I think, the bug could cause crash as well. The fix is to copy the relevant partitioning information from relcache into PartitionSchemeData and RelOptInfo. Here's a quick patch with that fix. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Fri, Oct 6, 2017 at 3:07 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think this is very good work and I'm excited about the feature. Now >> I'll wait to see whether the buildfarm, or Tom, yell at me for >> whatever problems this may still have... > > Buildfarm animal prion turned red. Before going into that failure, > good news is that the other animals are green. So the plans are > stable. > > prion runs the regression with -DRELCACHE_FORCE_RELEASE, which > destroys a relcache entry as soon as its reference count drops down to > 0. This destroys everything that's there in corresponding relcache > entry including partition key information and partition descriptor > information. find_partition_scheme() and set_relation_partition_info() > both assume that the relcache information will survive as long as the > relation lock is held. They do not copy the relevant partitioning > information but just copy the pointers. That assumption is wrong. > Because of -DRELCACHE_FORCE_RELEASE, as soon as refcount drops to > zero, the data in partition scheme and partition bounds goes invalid > and various checks to see if partition wise join is possible fail. > That causes partition_join test to fail on prion. But I think, the bug > could cause crash as well. > > The fix is to copy the relevant partitioning information from relcache > into PartitionSchemeData and RelOptInfo. Here's a quick patch with > that fix. Committed. I hope that makes things less red rather than more, because I'm going to be AFK for a few hours anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > Committed. I hope that makes things less red rather than more, > because I'm going to be AFK for a few hours anyway. > Here's the last patch, dealing with the dummy relations, rebased. With this fix every join order of a partitioned join can be considered partitioned. (This wasn't the case earlier when dummy relation was involved.). So, we can allocate the child-join RelOptInfo array in build_joinrel_partition_info(), instead of waiting for an appropriate pair to arrive in try_partition_wise_join(). -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Committed. I hope that makes things less red rather than more, >> because I'm going to be AFK for a few hours anyway. > > Here's the last patch, dealing with the dummy relations, rebased. With > this fix every join order of a partitioned join can be considered > partitioned. (This wasn't the case earlier when dummy relation was > involved.). So, we can allocate the child-join RelOptInfo array in > build_joinrel_partition_info(), instead of waiting for an appropriate > pair to arrive in try_partition_wise_join(). Wouldn't a far more general approach be to allow a partition-wise join between a partitioned table and an unpartitioned table, considering the result as partitioned? That seems like it would very often yield much better query plans than what we have right now, and also make the need for this particular thing go away. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Oct 11, 2017 at 7:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Committed. I hope that makes things less red rather than more, >>> because I'm going to be AFK for a few hours anyway. >> >> Here's the last patch, dealing with the dummy relations, rebased. With >> this fix every join order of a partitioned join can be considered >> partitioned. (This wasn't the case earlier when dummy relation was >> involved.). So, we can allocate the child-join RelOptInfo array in >> build_joinrel_partition_info(), instead of waiting for an appropriate >> pair to arrive in try_partition_wise_join(). > > Wouldn't a far more general approach be to allow a partition-wise join > between a partitioned table and an unpartitioned table, considering > the result as partitioned? That seems like it would very often yield > much better query plans than what we have right now, and also make the > need for this particular thing go away. > You are suggesting that a dummy partitioned table be treated as an un-partitioned table and apply above suggested optimization. A join between a partitioned and unpartitioned table is partitioned by the keys of only partitioned table. An unpartitioned table doesn't have any keys, so this is fine. But a dummy partitioned table does have keys. Recording them as keys of the join relation helps when it joins to other relations. Furthermore a join between partitioned and unpartitioned table doesn't require any equi-join condition on partition keys of partitioned table but a join between partitioned tables is considered to be partitioned by keys on both sides only when there is an equi-join. So, when implementing a partitioned join between a partitioned and an unpartitioned table, we will have to make a special case to record partition keys when the unpartitioned side is actually a dummy partitioned table. That might be awkward. Because we don't have dummy children relation in all cases, we already have some awkwardness like allocating part_rels array only when we encounter a join order which has all the children. This patch removes that. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > You are suggesting that a dummy partitioned table be treated as an > un-partitioned table and apply above suggested optimization. A join > between a partitioned and unpartitioned table is partitioned by the > keys of only partitioned table. An unpartitioned table doesn't have > any keys, so this is fine. But a dummy partitioned table does have > keys. Recording them as keys of the join relation helps when it joins > to other relations. Furthermore a join between partitioned and > unpartitioned table doesn't require any equi-join condition on > partition keys of partitioned table but a join between partitioned > tables is considered to be partitioned by keys on both sides only when > there is an equi-join. So, when implementing a partitioned join > between a partitioned and an unpartitioned table, we will have to make > a special case to record partition keys when the unpartitioned side is > actually a dummy partitioned table. That might be awkward. It seems to me that what we really need here is to move all of this stuff into a separate struct: /* used for partitioned relations */ PartitionScheme part_scheme; /* Partitioning scheme. */ int nparts; /* number of partitions */ struct PartitionBoundInfoData *boundinfo; /* Partition bounds */ struct RelOptInfo **part_rels; /* Array of RelOptInfos of partitions, * stored in the same order of bounds */ List **partexprs; /* Non-nullable partition key expressions. */ List **nullable_partexprs; /* Nullable partition key expressions. */ ...and then have a RelOptInfo carry a pointer to a list of those structures. That lets us consider multiple possible partition schemes for the same relation. For instance, suppose that a user joins four relations, P1, P2, Q1, and Q2. P1 and P2 are compatibly partitioned. Q1 and Q2 are compatibly partitioned (but not compatible with P1 and P2). Furthermore, let's suppose that the optimal join order begins with a join between P1 and Q1. When we construct the paths for that joinrel, we can either join all of P1 to all of Q1 (giving up on partition-wise join), or we can join each partition of P1 to all of Q1 (producing a result partitioned compatibly with P1 and allowing for a future partition-wise join to P2), or we can join each partition of Q1 to all of P1 (producing a result partitioned compatibly with Q1 and allowing for a future partition-wise join to Q2). Any of those could win depending on the details. With the data structure as it is today, we'd have to choose whether to mark the joinrel as partitioned like P1 or like Q1, but that's not really what we need here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Thu, Oct 12, 2017 at 10:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> You are suggesting that a dummy partitioned table be treated as an >> un-partitioned table and apply above suggested optimization. A join >> between a partitioned and unpartitioned table is partitioned by the >> keys of only partitioned table. An unpartitioned table doesn't have >> any keys, so this is fine. But a dummy partitioned table does have >> keys. Recording them as keys of the join relation helps when it joins >> to other relations. Furthermore a join between partitioned and >> unpartitioned table doesn't require any equi-join condition on >> partition keys of partitioned table but a join between partitioned >> tables is considered to be partitioned by keys on both sides only when >> there is an equi-join. So, when implementing a partitioned join >> between a partitioned and an unpartitioned table, we will have to make >> a special case to record partition keys when the unpartitioned side is >> actually a dummy partitioned table. That might be awkward. > > It seems to me that what we really need here is to move all of this > stuff into a separate struct: > > /* used for partitioned relations */ > PartitionScheme part_scheme; /* Partitioning scheme. */ > int nparts; /* number of > partitions */ > struct PartitionBoundInfoData *boundinfo; /* Partition bounds */ > struct RelOptInfo **part_rels; /* Array of RelOptInfos of partitions, > > * stored in the same order of bounds */ > List **partexprs; /* Non-nullable partition key > expressions. */ > List **nullable_partexprs; /* Nullable partition key > expressions. */ > In a very early patch I had PartitionOptInfo to hold all of this. RelOptInfo then had a pointer of PartitionOptInfo, if it was partitioned. When a relation can be partitioned in multiple ways like what you describe or because join by re-partitioning is efficient, RelOptInfo would have a list of those. But the representation needs to be thought through. I am wondering whether this should be modelled like IndexOptInfo. I am not sure. This is a topic of much larger discussion. I think we are digressing. We were discussing my patch to handle dummy partitioned relation, whose children are not marked dummy and do not have pathlists set. Do you still think that we should leave that aside? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> The fix is to copy the relevant partitioning information from relcache >> into PartitionSchemeData and RelOptInfo. Here's a quick patch with >> that fix. > > Committed. I hope that makes things less red rather than more, > because I'm going to be AFK for a few hours anyway. > set_append_rel_size() crashes when it encounters a partitioned table with a dropped column. Dropped columns do not have any translations saved in AppendInfo::translated_vars; the corresponding entry is NULL per make_inh_translation_list(). 1802 att = TupleDescAttr(old_tupdesc, old_attno); 1803 if (att->attisdropped) 1804 { 1805 /* Just put NULL into this list entry */ 1806 vars = lappend(vars, NULL); 1807 continue; 1808 } In set_append_rel_size() we try to attr_needed for child tables. While doing so we try to translate a user attribute number of parent to that of a child and crash since the translated Var is NULL. Here's patch to fix the crash. The patch also contains a testcase to test dropped columns in partitioned table. Sorry for not noticing this problem earlier. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Mon, Oct 16, 2017 at 5:03 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > set_append_rel_size() crashes when it encounters a partitioned table > with a dropped column. Dropped columns do not have any translations > saved in AppendInfo::translated_vars; the corresponding entry is NULL > per make_inh_translation_list(). > 1802 att = TupleDescAttr(old_tupdesc, old_attno); > 1803 if (att->attisdropped) > 1804 { > 1805 /* Just put NULL into this list entry */ > 1806 vars = lappend(vars, NULL); > 1807 continue; > 1808 } > > In set_append_rel_size() we try to attr_needed for child tables. While > doing so we try to translate a user attribute number of parent to that > of a child and crash since the translated Var is NULL. Here's patch to > fix the crash. The patch also contains a testcase to test dropped > columns in partitioned table. > > Sorry for not noticing this problem earlier. OK, committed. This is a good example of how having good code coverage doesn't necessarily mean you've found all the bugs. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Robert Haas
Date:
On Fri, Dec 1, 2017 at 1:36 AM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: > On Tue, Oct 31, 2017 at 2:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> OK, committed. This is a good example of how having good code >> coverage doesn't necessarily mean you've found all the bugs. :-) >> > As of now partition_join.sql is not having test cases covering cases > where partition table have default partition, attaching a small test > case patch to cover those. That's not that small, and to me it looks like overkill. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Sat, Dec 2, 2017 at 2:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 1, 2017 at 1:36 AM, Rajkumar Raghuwanshi > <rajkumar.raghuwanshi@enterprisedb.com> wrote: >> On Tue, Oct 31, 2017 at 2:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> OK, committed. This is a good example of how having good code >>> coverage doesn't necessarily mean you've found all the bugs. :-) >>> >> As of now partition_join.sql is not having test cases covering cases >> where partition table have default partition, attaching a small test >> case patch to cover those. > > That's not that small, and to me it looks like overkill. > I agree, the patch looks longer than expected. I think, it's important to have some testcases to test partition-wise join with default partitions. I think we need at least one test for range default partitions, one test for list partitioning, one for multi-level partitioning and one negative testcase with default partition missing from one side of join. May be we could reduce the number of SQL commands and queries in the patch by adding default partition to every table that participates in partition-wise join (leave the tables participating in negative tests aside.). But that's going to increase the size of EXPLAIN outputs and query results. The negative test may simply drop the default partition from one of the tables. For every table being tested, the patch adds two ALTER TABLE commands, one for detaching an existing partition and then attach the same as default partition. Alternative to that is just add a new default partition without detaching and existing partition. But then the default partition needs to populated with some data, which requires 1 INSERT statement at least. That doesn't reduce the size of patch, but increases the output of query and EXPLAIN plan. May be in case of multi-level partitioning test, we don't need to add DEFAULT in every partitioned relation; adding to one of them would be enough. May be add it to the parent, but that too can be avoided. That would reduce the size of patch a bit. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rajkumar Raghuwanshi
Date:
On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I agree, the patch looks longer than expected. I think, it's important > to have some testcases to test partition-wise join with default > partitions. I think we need at least one test for range default > partitions, one test for list partitioning, one for multi-level > partitioning and one negative testcase with default partition missing > from one side of join. > > May be we could reduce the number of SQL commands and queries in the > patch by adding default partition to every table that participates in > partition-wise join (leave the tables participating in negative tests > aside.). But that's going to increase the size of EXPLAIN outputs and > query results. The negative test may simply drop the default partition > from one of the tables. > > For every table being tested, the patch adds two ALTER TABLE commands, > one for detaching an existing partition and then attach the same as > default partition. Alternative to that is just add a new default > partition without detaching and existing partition. But then the > default partition needs to populated with some data, which requires 1 > INSERT statement at least. That doesn't reduce the size of patch, but > increases the output of query and EXPLAIN plan. > > May be in case of multi-level partitioning test, we don't need to add > DEFAULT in every partitioned relation; adding to one of them would be > enough. May be add it to the parent, but that too can be avoided. That > would reduce the size of patch a bit. Thanks Ashutosh for suggestions. I have reduced test cases as suggested. Attaching updated patch. Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rajkumar Raghuwanshi
Date:
On Tue, Dec 5, 2017 at 11:04 AM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: > On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I agree, the patch looks longer than expected. I think, it's important >> to have some testcases to test partition-wise join with default >> partitions. I think we need at least one test for range default >> partitions, one test for list partitioning, one for multi-level >> partitioning and one negative testcase with default partition missing >> from one side of join. >> >> May be we could reduce the number of SQL commands and queries in the >> patch by adding default partition to every table that participates in >> partition-wise join (leave the tables participating in negative tests >> aside.). But that's going to increase the size of EXPLAIN outputs and >> query results. The negative test may simply drop the default partition >> from one of the tables. >> >> For every table being tested, the patch adds two ALTER TABLE commands, >> one for detaching an existing partition and then attach the same as >> default partition. Alternative to that is just add a new default >> partition without detaching and existing partition. But then the >> default partition needs to populated with some data, which requires 1 >> INSERT statement at least. That doesn't reduce the size of patch, but >> increases the output of query and EXPLAIN plan. >> >> May be in case of multi-level partitioning test, we don't need to add >> DEFAULT in every partitioned relation; adding to one of them would be >> enough. May be add it to the parent, but that too can be avoided. That >> would reduce the size of patch a bit. > > Thanks Ashutosh for suggestions. > > I have reduced test cases as suggested. Attaching updated patch. > Sorry Attached wrong patch. attaching correct patch now.
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Tue, Dec 5, 2017 at 1:24 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: > On Tue, Dec 5, 2017 at 11:04 AM, Rajkumar Raghuwanshi > <rajkumar.raghuwanshi@enterprisedb.com> wrote: >> On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> I agree, the patch looks longer than expected. I think, it's important >>> to have some testcases to test partition-wise join with default >>> partitions. I think we need at least one test for range default >>> partitions, one test for list partitioning, one for multi-level >>> partitioning and one negative testcase with default partition missing >>> from one side of join. >>> >>> May be we could reduce the number of SQL commands and queries in the >>> patch by adding default partition to every table that participates in >>> partition-wise join (leave the tables participating in negative tests >>> aside.). But that's going to increase the size of EXPLAIN outputs and >>> query results. The negative test may simply drop the default partition >>> from one of the tables. >>> >>> For every table being tested, the patch adds two ALTER TABLE commands, >>> one for detaching an existing partition and then attach the same as >>> default partition. Alternative to that is just add a new default >>> partition without detaching and existing partition. But then the >>> default partition needs to populated with some data, which requires 1 >>> INSERT statement at least. That doesn't reduce the size of patch, but >>> increases the output of query and EXPLAIN plan. >>> >>> May be in case of multi-level partitioning test, we don't need to add >>> DEFAULT in every partitioned relation; adding to one of them would be >>> enough. May be add it to the parent, but that too can be avoided. That >>> would reduce the size of patch a bit. >> >> Thanks Ashutosh for suggestions. >> >> I have reduced test cases as suggested. Attaching updated patch. >> > Sorry Attached wrong patch. > > attaching correct patch now. Thanks. Here are some comments +-- test default partition behavior for range +ALTER TABLE prt1 DETACH PARTITION prt1_p3; +ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT; +ALTER TABLE prt2 DETACH PARTITION prt2_p3; +ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT; I think we need an ANALYZE here in case the statistics gets updated while DETACH and ATTACH is going on. Other testcases also need to be updated with ANALYZE, including the negative one. +-- partition-wise join can not be applied if the only one of joining table have Correction: ... if only one of the joining tables has ... Please add the patch to the next commitfest so that it's not forgotten. I think we can get rid of the multi-level partition-wise testcase as well. Also, since we are re-attaching existing partition tables as default partitions, we don't need to check the output as well; just plan should be enough. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rajkumar Raghuwanshi
Date:
On Wed, Dec 20, 2017 at 5:21 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Thanks. Here are some comments
Thanks Ashutosh for review and suggestions.
+-- test default partition behavior for range
+ALTER TABLE prt1 DETACH PARTITION prt1_p3;
+ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
+ALTER TABLE prt2 DETACH PARTITION prt2_p3;
+ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;
I think we need an ANALYZE here in case the statistics gets updated while
DETACH and ATTACH is going on. Other testcases also need to be updated with
ANALYZE, including the negative one.
Done.
+-- partition-wise join can not be applied if the only one of joining table have
Correction: ... if only one of the joining tables has ...
Done.
Please add the patch to the next commitfest so that it's not
forgotten.
I think we can get rid of the multi-level partition-wise
testcase as well. Also, since we are re-attaching existing partition
tables as default partitions, we don't need to check the output as
well; just plan should be enough.
Ok. Done.
updated test patch attached.
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Fri, Dec 22, 2017 at 3:00 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: > On Wed, Dec 20, 2017 at 5:21 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> >> Thanks. Here are some comments >> > Thanks Ashutosh for review and suggestions. > >> >> +-- test default partition behavior for range >> +ALTER TABLE prt1 DETACH PARTITION prt1_p3; >> +ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT; >> +ALTER TABLE prt2 DETACH PARTITION prt2_p3; >> +ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT; >> >> I think we need an ANALYZE here in case the statistics gets updated while >> DETACH and ATTACH is going on. Other testcases also need to be updated >> with >> ANALYZE, including the negative one. > > Done. > >> >> >> +-- partition-wise join can not be applied if the only one of joining >> table have >> >> Correction: ... if only one of the joining tables has ... > > Done. > >> >> Please add the patch to the next commitfest so that it's not >> forgotten. > > Done. > Added to CF: https://commitfest.postgresql.org/16/1426/ > >> >> I think we can get rid of the multi-level partition-wise >> testcase as well. Also, since we are re-attaching existing partition >> tables as default partitions, we don't need to check the output as >> well; just plan should be enough. > > Ok. Done. > > updated test patch attached. > The patch looks good to me. I don't think we can reduce it further. But we need some tests to test PWJ with default partitions. Marking this as ready for committer. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rajkumar Raghuwanshi
Date:
On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Fri, Dec 22, 2017 at 3:00 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> updated test patch attached.
Changed partition-wise statement to partitionwise.
Attached re-based patch.
Attached re-based patch.
The patch looks good to me. I don't think we can reduce it further.
But we need some tests to test PWJ with default partitions. Marking
this as ready for committer.
Thanks Ashutosh.
Attachment
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Thomas Munro
Date:
On Mon, Mar 5, 2018 at 8:13 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: > On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat > Changed partition-wise statement to partitionwise. > Attached re-based patch. > >> The patch looks good to me. I don't think we can reduce it further. >> But we need some tests to test PWJ with default partitions. Marking >> this as ready for committer. Hi Rajkumar, partition_join ... FAILED The regression test currently fails with your v4 patch because a redundant Result node has been removed from a query plan. That may be due to commit 11cf92f6 or nearby commits. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Ashutosh Bapat
Date:
On Wed, Jun 6, 2018 at 8:11 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Mon, Mar 5, 2018 at 8:13 PM, Rajkumar Raghuwanshi > <rajkumar.raghuwanshi@enterprisedb.com> wrote: >> On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat >> Changed partition-wise statement to partitionwise. >> Attached re-based patch. >> >>> The patch looks good to me. I don't think we can reduce it further. >>> But we need some tests to test PWJ with default partitions. Marking >>> this as ready for committer. > > Hi Rajkumar, > > partition_join ... FAILED > That made my heart stop for fraction of a second. I thought, something happened which caused partition_join test fail in master. But then I realised you are talking about Rajkumar's patch and test in that patch. I think it's better to start a separate thread discussing his patch, before I loose my heart ;) -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
From
Rajkumar Raghuwanshi
Date:
On Wed, Jun 6, 2018 at 9:21 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Jun 6, 2018 at 8:11 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Mon, Mar 5, 2018 at 8:13 PM, Rajkumar Raghuwanshi
> <rajkumar.raghuwanshi@enterprisedb.com> wrote:
>> On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat
>> Changed partition-wise statement to partitionwise.
>> Attached re-based patch.
>>
>>> The patch looks good to me. I don't think we can reduce it further.
>>> But we need some tests to test PWJ with default partitions. Marking
>>> this as ready for committer.
>
> Hi Rajkumar,
>
> partition_join ... FAILED
>
Thanks Thomas for patch review.
That made my heart stop for fraction of a second. I thought, something
happened which caused partition_join test fail in master. But then I
realised you are talking about Rajkumar's patch and test in that
patch. I think it's better to start a separate thread discussing his
patch, before I loose my heart ;)
Yeah, that would be better.
here is the new thread with updated patch.
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation