Thread: Ordered Partitioned Table Scans
RANGE partitioning of time-series data is quite a common way to use partitioning, and such tables tend to grow fairly large. I thought since we always store RANGE partitioned tables in the PartitionDesc in ascending range order that it might be useful to make use of this and, when the required pathkeys match the order of the ranges, use an Append node instead of a MergeAppend, since the MergeAppend would just needlessly exhaust each subplan one at a time, in order.

It does not seem very hard to implement this and it does not add much in the way of additional processing to the planner.

Performance wise it seems to give a good boost to getting sorted results from a partitioned table. I performed a quick test just on my laptop with:

Setup:

CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION BY RANGE (id);
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench FOR VALUES FROM (' || (x*100000)::text || ') TO (' || ((x+1)*100000)::text || ');' from generate_Series(0,299) x;
\gexec
\o
INSERT INTO partbench SELECT x,1,2,3,4,5 from generate_Series(0,29999999) x;
create index on partbench (id);
vacuum analyze;

Test:

select * from partbench order by id limit 1 offset 29999999;

Results:

Patched:
Time: 4234.807 ms (00:04.235)
Time: 4237.928 ms (00:04.238)
Time: 4241.289 ms (00:04.241)
Time: 4234.030 ms (00:04.234)
Time: 4244.197 ms (00:04.244)
Time: 4266.000 ms (00:04.266)

Unpatched:
Time: 5917.288 ms (00:05.917)
Time: 5937.775 ms (00:05.938)
Time: 5911.146 ms (00:05.911)
Time: 5906.881 ms (00:05.907)
Time: 5918.309 ms (00:05.918)

(about 39% faster)

The implementation is fairly simple. One thing I don't like about it is that I'd rather build_partition_pathkeys() performed all the checks to know if the partition should support a natural pathkey, but as of now, I have the calling code ensuring that there are no sub-partitioned tables. These could cause tuples to be output in the wrong order.

Does this idea seem like something we'd want?

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
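A quick way to see which plan shape a given build produces for the partbench example is a costs-off EXPLAIN on the test query; the expectation in the comment below follows from the description in the message above rather than from captured output:

-- Assuming the partbench setup above has been run.  On unpatched master this
-- is expected to come out as a MergeAppend over the per-partition index scans;
-- with the patch it should become a plain Append whose subplans are already in
-- partition (and therefore id) order.
EXPLAIN (COSTS OFF)
SELECT * FROM partbench ORDER BY id LIMIT 1 OFFSET 29999999;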
On 2018/10/26 11:50, David Rowley wrote:
> RANGE partitioning of time-series data is quite a common way to use partitioning, and such tables tend to grow fairly large. I thought since we always store RANGE partitioned tables in the PartitionDesc in ascending range order that it might be useful to make use of this and, when the required pathkeys match the order of the ranges, use an Append node instead of a MergeAppend, since the MergeAppend would just needlessly exhaust each subplan one at a time, in order.
>
> It does not seem very hard to implement this and it does not add much in the way of additional processing to the planner.
>
> Performance wise it seems to give a good boost to getting sorted results from a partitioned table. I performed a quick test just on my laptop with:
>
> Setup:
>
> CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION BY RANGE (id);
> select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench FOR VALUES FROM (' || (x*100000)::text || ') TO (' || ((x+1)*100000)::text || ');' from generate_Series(0,299) x;
> \gexec
> \o
> INSERT INTO partbench SELECT x,1,2,3,4,5 from generate_Series(0,29999999) x;
> create index on partbench (id);
> vacuum analyze;
>
> Test:
>
> select * from partbench order by id limit 1 offset 29999999;
>
> Results:
>
> Patched:
> Time: 4234.807 ms (00:04.235)
> Time: 4237.928 ms (00:04.238)
> Time: 4241.289 ms (00:04.241)
> Time: 4234.030 ms (00:04.234)
> Time: 4244.197 ms (00:04.244)
> Time: 4266.000 ms (00:04.266)
>
> Unpatched:
> Time: 5917.288 ms (00:05.917)
> Time: 5937.775 ms (00:05.938)
> Time: 5911.146 ms (00:05.911)
> Time: 5906.881 ms (00:05.907)
> Time: 5918.309 ms (00:05.918)
>
> (about 39% faster)
>
> The implementation is fairly simple. One thing I don't like about it is that I'd rather build_partition_pathkeys() performed all the checks to know if the partition should support a natural pathkey, but as of now, I have the calling code ensuring that there are no sub-partitioned tables. These could cause tuples to be output in the wrong order.
>
> Does this idea seem like something we'd want?

Definitely! Thanks for creating the patch.

I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this last year, but the partitioning-related planning code hadn't advanced then as much as it has today, so they sort of postponed working on it. Eventually their patch was returned with feedback last November. Here's the link to their email in case you wanted to read some comments their proposal and patch got, although some of them might be obsolete.

https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop

Thanks,
Amit
On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this last year, but the partitioning-related planning code hadn't advanced then as much as it has today, so they sort of postponed working on it. Eventually their patch was returned with feedback last November. Here's the link to their email in case you wanted to read some comments their proposal and patch got, although some of them might be obsolete.
>
> https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop

Thanks. I wasn't aware, or ... forgot. Looks like back then was a tricky time to be doing this. Hopefully, the dust has settled a little bit now.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On Fri, Oct 26, 2018 at 6:40 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this last year, but the partitioning-related planning code hadn't advanced then as much as it has today, so they sort of postponed working on it. Eventually their patch was returned with feedback last November. Here's the link to their email in case you wanted to read some comments their proposal and patch got, although some of them might be obsolete.
> >
> > https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop
>
> Thanks. I wasn't aware, or ... forgot. Looks like back then was a tricky time to be doing this. Hopefully, the dust has settled a little bit now.

Yes, back then I unfortunately had limited time to work on that, and I had to spend all of it rebasing the patch instead of working on the various issues :(  Sadly, I have even less time now, but I'll try to look at your patch this weekend!

As far as I remember, the biggest problems we had were handling multi-level partitioning, when the query is ordered by all or a subset of the partition keys, and/or with a mix of ASC/DESC clauses. It also required some extra processing on the costing side for queries that can be naturally ordered and contain a LIMIT clause, since we can estimate how many partitions will have to be scanned.
Hi,

On Fri, Oct 26, 2018 at 1:01 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Fri, Oct 26, 2018 at 6:40 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
> >
> > On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > > I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this last year, but the partitioning-related planning code hadn't advanced then as much as it has today, so they sort of postponed working on it. Eventually their patch was returned with feedback last November. Here's the link to their email in case you wanted to read some comments their proposal and patch got, although some of them might be obsolete.
> > >
> > > https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop
> >
> > Thanks. I wasn't aware, or ... forgot. Looks like back then was a tricky time to be doing this. Hopefully, the dust has settled a little bit now.
>
> As far as I remember, the biggest problems we had were handling multi-level partitioning, when the query is ordered by all or a subset of the partition keys, and/or with a mix of ASC/DESC clauses. It also required some extra processing on the costing side for queries that can be naturally ordered and contain a LIMIT clause, since we can estimate how many partitions will have to be scanned.

I just had a look at your patch. I see that you implemented only a subset of the possible optimizations (only the case of range partitioning without subpartitions). This has been previously discussed, but we should be able to do a similar optimization for list partitioning if there are no interleaved values, and also for some cases of multi-level partitioning.

Concerning the implementation, there's at least one issue: it assumes that each subpath of a range-partitioned table will be ordered, which is not guaranteed. You need to generate explicit Sort nodes (in the same order as the query_pathkeys) for partitions that don't have an ordered path, and make sure that those paths are used in the Append. Here's a simplistic case showing the issue (sorry, the partition names are poorly chosen):

CREATE TABLE simple (id integer, val text) PARTITION BY RANGE (id);
CREATE TABLE simple_1_2 PARTITION OF simple FOR VALUES FROM (1) TO (100000);
CREATE TABLE simple_2_3 PARTITION OF simple FOR VALUES FROM (100000) TO (200000);
CREATE TABLE simple_0_1 PARTITION OF simple FOR VALUES FROM (-100000) TO (1);

INSERT INTO simple SELECT id, 'line ' || id FROM generate_series(-19999, 199999) id;

CREATE INDEX ON simple_1_2 (id);
CREATE INDEX ON simple_2_3 (id);

EXPLAIN SELECT * FROM simple ORDER BY id ;
                                             QUERY PLAN
---------------------------------------------------------------------------------------------------
 Append  (cost=0.00..7705.56 rows=219999 width=15)
   ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
   ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
   ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
(4 rows)

Also, if a LIMIT is specified, it should be able to give better estimates, at least if there's no qual.
For instance:

EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.35 rows=10 width=15)
   ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
         ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
         ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
         ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
(5 rows)

In this case, we should estimate that the SeqScan (or, in a corrected version, the Sort) node should not return more than 10 rows, and that each following partition should not need to be scanned at all, and cost each path accordingly. I think that this is quite important, for instance to make sure that a natively sorted Append is chosen over a MergeAppend when there are some subpaths with explicit sorts, because with the Append we probably won't have to execute all the sorts if the previous partition scans returned enough rows.

FWIW, both those cases were handled (probably with some bugs though) in the previous patches Ronan and I sent some time ago. Also, I did not forget about this feature, I planned to work on it in hope to have it included in pg12. However, I won't have a lot of time to work on it before December.
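One way to observe, at run time, the effect Julien is after (this is just a way to watch execution, not part of the proposal itself): under a Limit, Append children that are never pulled from are reported by EXPLAIN ANALYZE as never executed.

-- Using the "simple" table from the message above, on a build with the
-- ordered-Append patch applied; children the Limit never needs should show
-- up as "(never executed)" in the output.
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM simple ORDER BY id LIMIT 10;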
Thanks for looking at this.

On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
> I just had a look at your patch. I see that you implemented only a subset of the possible optimizations (only the case of range partitioning without subpartitions). This has been previously discussed, but we should be able to do a similar optimization for list partitioning if there are no interleaved values, and also for some cases of multi-level partitioning.

I had thought about these cases but originally had thought they would be more complex to implement than I could justify. On review, I've found some pretty cheap ways to handle both sub-partitions and LIST partitioned tables. Currently, with LIST partitioned tables I've coded it to only allow the optimisation if there's no DEFAULT partition and all partitions are defined with exactly 1 Datum. This guarantees that there are no interleaved values, but it'll just fail to optimise cases like FOR VALUES IN(1,2) + FOR VALUES IN(3,4). The reason that I didn't go to the trouble of the additional checks was that I don't really want to add any per-partition overhead to this.

If RelOptInfo had a Bitmapset of live partitions then we could just check the partitions that survived pruning. Amit Langote has a pending patch which does that and some other useful stuff, so maybe we can delay fixing that until the dust settles a bit in that area. Amit and I are both working hard to remove all these per-partition overheads. I imagine he'd also not be in favour of adding code that does something for all partitions when we've pruned down to just 1. I've personally no objection to doing the required additional processing for the non-pruned partitions only. We could also then fix the case where we disable the optimisation if there's a DEFAULT partition without any regard to whether it's been pruned or not.

> Concerning the implementation, there's at least one issue: it assumes that each subpath of a range-partitioned table will be ordered, which is not guaranteed. You need to generate explicit Sort nodes (in the same order as the query_pathkeys) for partitions that don't have an ordered path, and make sure that those paths are used in the Append. Here's a simplistic case showing the issue (sorry, the partition names are poorly chosen):

Thanks for noticing this. I had been thrown off due to the fact that Paths are never actually created for these sorts. On looking further I see that we do checks during createplan to see if the path is suitably sorted and just create a sort node if it's not. This seems to go against the whole point of paths, but I'm not going to fight for changing it, so I've just done the Append the same way as MergeAppend handles it.

> Also, if a LIMIT is specified, it should be able to give better estimates, at least if there's no qual. For instance:
>
> EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
>                                                QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..0.35 rows=10 width=15)
>    ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
>          ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
>          ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
>          ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
> (5 rows)
>
> In this case, we should estimate that the SeqScan (or, in a corrected version, the Sort) node should not return more than 10 rows, and that each following partition should not need to be scanned at all, and cost each path accordingly. I think that this is quite important, for instance to make sure that a natively sorted Append is chosen over a MergeAppend when there are some subpaths with explicit sorts, because with the Append we probably won't have to execute all the sorts if the previous partition scans returned enough rows.

In my patch, I'm not adding any additional paths. I'm just adding an Append instead of a MergeAppend. For what you're talking about, the limit only needs to be passed into any underlying Sort so that it can become a top-N sort. This is handled already in create_limit_path(). Notice in the plan you pasted above that the Limit has a lower total cost than its Append subnode. That's because create_limit_path() weighted the Limit total cost based on the row count of the limit and its subpath ... 7705.56 / 219999 * 10 = ~0.35.

> FWIW, both those cases were handled (probably with some bugs though) in the previous patches Ronan and I sent some time ago. Also, I did not forget about this feature, I planned to work on it in hope to have it included in pg12. However, I won't have a lot of time to work on it before December.

I apologise for not noticing your patch. I only went as far as checking the November commitfest to see if anything existed already and I found nothing there. I have time to work on this now, so likely it's better if I continue, just in case your time in December does not materialise.

v2 of the patch is attached. I've not had time yet to give it a thorough post-write review, but on first look it seems okay. The known limitations are:

* Disables the optimisation even if the DEFAULT partition is pruned.
* Disables the optimisation if LIST partitioned tables have any partitions allowing > 1 value.
* Fails to optimise UNION ALLs with partitioned tables.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
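As an illustration of the LIST restriction David describes above (the table names here are made up for the example; they are not from the patch or its regression tests): a list-partitioned table whose partitions each accept exactly one value is necessarily free of interleaving, whereas multi-value partitions are rejected by the current check even when their values do not actually interleave.

-- Accepted by the patch's current check: one datum per partition, no DEFAULT.
CREATE TABLE lp1 (a int) PARTITION BY LIST (a);
CREATE TABLE lp1_1 PARTITION OF lp1 FOR VALUES IN (1);
CREATE TABLE lp1_2 PARTITION OF lp1 FOR VALUES IN (2);

-- Not accepted yet, even though the values don't interleave across partitions.
CREATE TABLE lp2 (a int) PARTITION BY LIST (a);
CREATE TABLE lp2_12 PARTITION OF lp2 FOR VALUES IN (1, 2);
CREATE TABLE lp2_34 PARTITION OF lp2 FOR VALUES IN (3, 4);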
On 29 October 2018 at 13:44, David Rowley <david.rowley@2ndquadrant.com> wrote:
> v2 of the patch is attached. I've not had time yet to give it a thorough post-write review, but on first look it seems okay.

Added to the November 'fest.

https://commitfest.postgresql.org/20/1850/

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 29, 2018 at 1:44 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
> > I just had a look at your patch. I see that you implemented only a subset of the possible optimizations (only the case of range partitioning without subpartitions). This has been previously discussed, but we should be able to do a similar optimization for list partitioning if there are no interleaved values, and also for some cases of multi-level partitioning.
>
> I had thought about these cases but originally had thought they would be more complex to implement than I could justify. On review, I've found some pretty cheap ways to handle both sub-partitions and LIST partitioned tables. Currently, with LIST partitioned tables I've coded it to only allow the optimisation if there's no DEFAULT partition and all partitions are defined with exactly 1 Datum. This guarantees that there are no interleaved values, but it'll just fail to optimise cases like FOR VALUES IN(1,2) + FOR VALUES IN(3,4). The reason that I didn't go to the trouble of the additional checks was that I don't really want to add any per-partition overhead to this.

I see, but the overhead you mention is because you're doing that check during planning in build_partition_pathkeys(). As advised by Robert quite some time ago (https://www.postgresql.org/message-id/CA+TgmobOWgT1=zyjx-q=7s8akXNODix46qG0_-YX7K369P6ADA@mail.gmail.com), we can store that information when the PartitionDesc is built, so it wouldn't be problematic. Since checking for overlapping values is straightforward with the BoundInfoData infrastructure, it'd be a pity to miss this optimization in such cases, which I believe would not be rare.

> If RelOptInfo had a Bitmapset of live partitions then we could just check the partitions that survived pruning. Amit Langote has a pending patch which does that and some other useful stuff, so maybe we can delay fixing that until the dust settles a bit in that area. Amit and I are both working hard to remove all these per-partition overheads. I imagine he'd also not be in favour of adding code that does something for all partitions when we've pruned down to just 1. I've personally no objection to doing the required additional processing for the non-pruned partitions only. We could also then fix the case where we disable the optimisation if there's a DEFAULT partition without any regard to whether it's been pruned or not.

Those are quite worthwhile enhancements, and being able to avoid a MergeAppend if the problematic partitions have been pruned would be great! I didn't follow thoroughly all the discussions about the various optimizations Amit and you are working on, but I don't think it would be incompatible with a new flag and the possibility to have the sorted append with multi-valued list partitions?

> > Concerning the implementation, there's at least one issue: it assumes that each subpath of a range-partitioned table will be ordered, which is not guaranteed. You need to generate explicit Sort nodes (in the same order as the query_pathkeys) for partitions that don't have an ordered path, and make sure that those paths are used in the Append. Here's a simplistic case showing the issue (sorry, the partition names are poorly chosen):
>
> Thanks for noticing this. I had been thrown off due to the fact that Paths are never actually created for these sorts.
> On looking further I see that we do checks during createplan to see if the path is suitably sorted and just create a sort node if it's not. This seems to go against the whole point of paths, but I'm not going to fight for changing it, so I've just done the Append the same way as MergeAppend handles it.

Yes, I had quite the same reaction when I saw how MergeAppend handles it.

> > Also, if a LIMIT is specified, it should be able to give better estimates, at least if there's no qual. For instance:
> >
> > EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
> >                                                QUERY PLAN
> > ----------------------------------------------------------------------------------------------------------
> >  Limit  (cost=0.00..0.35 rows=10 width=15)
> >    ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
> >          ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
> >          ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
> >          ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
> > (5 rows)
> >
> > In this case, we should estimate that the SeqScan (or, in a corrected version, the Sort) node should not return more than 10 rows, and that each following partition should not need to be scanned at all, and cost each path accordingly. I think that this is quite important, for instance to make sure that a natively sorted Append is chosen over a MergeAppend when there are some subpaths with explicit sorts, because with the Append we probably won't have to execute all the sorts if the previous partition scans returned enough rows.
>
> In my patch, I'm not adding any additional paths. I'm just adding an Append instead of a MergeAppend. For what you're talking about, the limit only needs to be passed into any underlying Sort so that it can become a top-N sort. This is handled already in create_limit_path(). Notice in the plan you pasted above that the Limit has a lower total cost than its Append subnode. That's because create_limit_path() weighted the Limit total cost based on the row count of the limit and its subpath ... 7705.56 / 219999 * 10 = ~0.35.

Yes. But the cost of the first partition in this example is wrong, since there was no additional sort on top of the seq scan. However, I now realize that, as you said, what your patch does is to generate an Append *instead* of a MergeAppend if the optimization was possible. So there can't be the problem of a MergeAppend chosen over a cheaper Append in some cases, sorry for the noise. I totally missed that because when I worked on the same topic last year we had to generate both Append and MergeAppend. At that time Append was not parallel-aware yet, so there could be a faster parallel MergeAppend in some cases.

> > FWIW, both those cases were handled (probably with some bugs though) in the previous patches Ronan and I sent some time ago. Also, I did not forget about this feature, I planned to work on it in hope to have it included in pg12. However, I won't have a lot of time to work on it before December.
>
> I apologise for not noticing your patch. I only went as far as checking the November commitfest to see if anything existed already and I found nothing there.

No worries, it's more than a year old now (I'm quite ashamed I didn't come back to this sooner).

> I have time to work on this now, so likely it's better if I continue, just in case your time in December does not materialise.

I entirely agree.
> v2 of the patch is attached. I've not had time yet to give it a thorough post-write review, but on first look it seems okay.

I've registered as a reviewer. I still haven't had a deep look at the patch yet, but thanks a lot for working on it!
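To make "interleaved values" concrete (again with made-up table names, not anything from the patch): the overlap check Julien suggests would have to reject a layout like the one below, where no ordering of the partitions returns the values in sorted order.

-- Values 1 and 3 live in one partition, 2 and 4 in the other, so scanning the
-- partitions one after the other can never produce 1,2,3,4; this is the
-- interleaved case that must be detected and refused an ordered Append.
CREATE TABLE lp3 (a int) PARTITION BY LIST (a);
CREATE TABLE lp3_13 PARTITION OF lp3 FOR VALUES IN (1, 3);
CREATE TABLE lp3_24 PARTITION OF lp3 FOR VALUES IN (2, 4);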
On 31 October 2018 at 12:24, Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Mon, Oct 29, 2018 at 1:44 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
>>
>> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
>> > I just had a look at your patch. I see that you implemented only a subset of the possible optimizations (only the case of range partitioning without subpartitions). This has been previously discussed, but we should be able to do a similar optimization for list partitioning if there are no interleaved values, and also for some cases of multi-level partitioning.
>>
>> I had thought about these cases but originally had thought they would be more complex to implement than I could justify. On review, I've found some pretty cheap ways to handle both sub-partitions and LIST partitioned tables. Currently, with LIST partitioned tables I've coded it to only allow the optimisation if there's no DEFAULT partition and all partitions are defined with exactly 1 Datum. This guarantees that there are no interleaved values, but it'll just fail to optimise cases like FOR VALUES IN(1,2) + FOR VALUES IN(3,4). The reason that I didn't go to the trouble of the additional checks was that I don't really want to add any per-partition overhead to this.
>
> I see, but the overhead you mention is because you're doing that check during planning in build_partition_pathkeys(). As advised by Robert quite some time ago (https://www.postgresql.org/message-id/CA+TgmobOWgT1=zyjx-q=7s8akXNODix46qG0_-YX7K369P6ADA@mail.gmail.com), we can store that information when the PartitionDesc is built, so it wouldn't be problematic. Since checking for overlapping values is straightforward with the BoundInfoData infrastructure, it'd be a pity to miss this optimization in such cases, which I believe would not be rare.

Thanks for looking at this again. I retrospectively read that thread after Amit mentioned your patch. I just disagree with Robert about caching this flag. The reason is, if the flag is false due to some problematic partitions, and we then go and prune those, we needlessly fail to optimise that case.

I propose we come back and do the remaining optimisations with interleaved LIST partitions and partitioned tables with DEFAULT partitions later, once we have a new "live_parts" field in RelOptInfo. That way we can just check the live parts to ensure they're compatible with the optimization. If we get what's done already in, then we're already a big step forward.

[...]

>> v2 of the patch is attached. I've not had time yet to give it a thorough post-write review, but on first look it seems okay.
>
> I've registered as a reviewer. I still haven't had a deep look at the patch yet, but thanks a lot for working on it!

Thanks for signing up to review. I need to send another revision of the patch to add a missing call to truncate_useless_pathkeys(). Will try to do that today.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 31 October 2018 at 13:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
>> I've registered as a reviewer. I still haven't had a deep look at the patch yet, but thanks a lot for working on it!
>
> Thanks for signing up to review. I need to send another revision of the patch to add a missing call to truncate_useless_pathkeys(). Will try to do that today.

I've attached a patch that removes the redundant pathkeys. This allows cases like the following to work:

explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
                         QUERY PLAN
-------------------------------------------------------------
 Append
   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
         Index Cond: (a = 10)
   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
         Index Cond: (a = 10)
(5 rows)

One thing that could work but currently does not is when LIST partitions allow just a single value: we could allow the Append to have pathkeys even if there are no indexes. One way to do this would be to add PathKeys to the seqscan path of the supporting partitions. However, that's adding code in another area, so it likely should be another patch. This could allow cases like:

create table bool_rp (b bool) partition by list(b);
create table bool_rp_true partition of bool_rp for values in(true);
create table bool_rp_false partition of bool_rp for values in(false);
explain (costs off) select * from bool_rp order by b;
            QUERY PLAN
----------------------------------
 Append
   ->  Seq Scan on bool_rp_false
   ->  Seq Scan on bool_rp_true
(3 rows)

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
David Rowley <david.rowley@2ndquadrant.com> wrote:

> On 31 October 2018 at 13:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
> >>> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >> I've registered as a reviewer. I still haven't had a deep look at the patch yet, but thanks a lot for working on it!
> >
> > Thanks for signing up to review. I need to send another revision of the patch to add a missing call to truncate_useless_pathkeys(). Will try to do that today.
>
> I've attached a patch that ...

I've picked this one up when looking around for what I could review.

* As for the logic, I found generate_mergeappend_paths() to be the most interesting part:

Imagine a table partitioned by "i", so "partition_pathkeys" is {i}.

partition 1:

 i | j
---+---
 0 | 0
 1 | 1
 0 | 1
 1 | 0

partition 2:

 i | j
---+---
 3 | 0
 2 | 0
 2 | 1
 3 | 1

Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the ordering of the subpaths should not change the way tuples are split into partitions.

Obviously a problem is if the "partition_pathkeys" and "pathkeys" lists start with different items. To propose a more generic rule, I used this example of a range-partitioned table, where "i" and "j" are the partitioning keys:

partition 1:

 i | j | k
---+---+---
 0 | 0 | 1
 0 | 0 | 0

partition 2:

 i | j | k
---+---+---
 0 | 1 | 0
 0 | 1 | 1

If the output "pathkeys" is {i, k}, then the Append path makes rows of both partitions interleave:

 i | j | k
---+---+---
 0 | 0 | 0
 0 | 1 | 0
 0 | 0 | 1
 0 | 1 | 1

So in general I think the restriction is that no valid position of "pathkeys" and "partition_pathkeys" may differ. Or in other words: the shorter of the 2 pathkey lists must be contained in the longer one. Does it make sense to you?

Another problem I see is that build_partition_pathkeys() continues even if it fails to create a pathkey for some partitioning column. In the example above it would mean that the table can have "partition_pathkeys" equal to {j} instead of {i, j} under some EC-related conditions. However such a key does not correspond to reality - this is easier to imagine if another partition is considered.

partition 3:

 i | j | k
---+---+---
 1 | 0 | 1
 1 | 0 | 0

So I think no "partition_pathkeys" should be generated in that case. On the other hand, if the function returned the part of the list it could construct so far, it'd be wrong because such incomplete pathkeys could pass the checks I proposed above for reasons unrelated to the partitioning scheme.

The following comments are mostly on coding:

* Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have this sentence in the header comment:

	Note: If changing this, see build_partition_pathkeys()

However I could not find other use besides that in RelationBuildPartitionDesc().

* create_append_path():

	/*
	 * Apply query-wide LIMIT if known and path is for sole base relation.
	 * (Handling this at this low level is a bit klugy.)
	 */
	if (root != NULL && pathkeys != NULL &&
		bms_equal(rel->relids, root->all_baserels))
		pathnode->limit_tuples = root->limit_tuples;
	else
		pathnode->limit_tuples = -1.0;

I think this optimization is not specific to AppendPath / MergeAppendPath, so it could be moved elsewhere (as a separate patch of course). But specifically for AppendPath, why do we have to test pathkeys? The pathkeys of the AppendPath do not necessarily describe the order of the set to which LIMIT is applied, so their existence should not be important here.
* If pathkeys is passed, shouldn't create_append_path() include the cost_sort() of subpaths just like create_merge_append_path() does? And if so, then create_append_path() and create_merge_append_path() might eventually have some common code (at least for the subpath processing) to be put into a separate function.

* Likewise, create_append_plan() / create_merge_append_plan() are going to be more similar than before, so some refactoring could also make sense.

Although it's not too much code, I admit the patch is not trivial, so I'm curious about your opinion.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
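A concrete way to reproduce the interleaving Antonin describes (the DDL below is illustrative only and mirrors his second example; none of it comes from the patch):

-- Range partitioned on (i, j); both partitions share i = 0 but differ on j.
CREATE TABLE ij (i int, j int, k int) PARTITION BY RANGE (i, j);
CREATE TABLE ij_p1 PARTITION OF ij FOR VALUES FROM (0, 0) TO (0, 1);
CREATE TABLE ij_p2 PARTITION OF ij FOR VALUES FROM (0, 1) TO (0, 2);
INSERT INTO ij VALUES (0, 0, 1), (0, 0, 0), (0, 1, 0), (0, 1, 1);

-- ORDER BY i, k skips the second partition key j, so rows from the two
-- partitions interleave, and simply concatenating sorted per-partition scans
-- would return them in the wrong order.
SELECT * FROM ij ORDER BY i, k;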
On 1 November 2018 at 04:01, Antonin Houska <ah@cybertec.at> wrote:
> * As for the logic, I found generate_mergeappend_paths() to be the most interesting part:
>
> Imagine a table partitioned by "i", so "partition_pathkeys" is {i}.
>
> partition 1:
>
>  i | j
> ---+---
>  0 | 0
>  1 | 1
>  0 | 1
>  1 | 0
>
> partition 2:
>
>  i | j
> ---+---
>  3 | 0
>  2 | 0
>  2 | 1
>  3 | 1
>
> Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the ordering of the subpaths should not change the way tuples are split into partitions.
>
> Obviously a problem is if the "partition_pathkeys" and "pathkeys" lists start with different items. To propose a more generic rule, I used this example of a range-partitioned table, where "i" and "j" are the partitioning keys:
>
> partition 1:
>
>  i | j | k
> ---+---+---
>  0 | 0 | 1
>  0 | 0 | 0
>
> partition 2:
>
>  i | j | k
> ---+---+---
>  0 | 1 | 0
>  0 | 1 | 1
>
> If the output "pathkeys" is {i, k}, then the Append path makes rows of both partitions interleave:
>
>  i | j | k
> ---+---+---
>  0 | 0 | 0
>  0 | 1 | 0
>  0 | 0 | 1
>  0 | 1 | 1
>
> So in general I think the restriction is that no valid position of "pathkeys" and "partition_pathkeys" may differ. Or in other words: the shorter of the 2 pathkey lists must be contained in the longer one. Does it make sense to you?

I understand what you're saying. I just don't understand what you think is wrong with the patch in this area.

> Another problem I see is that build_partition_pathkeys() continues even if it fails to create a pathkey for some partitioning column. In the example above it would mean that the table can have "partition_pathkeys" equal to {j} instead of {i, j} under some EC-related conditions. However such a key does not correspond to reality - this is easier to imagine if another partition is considered.
>
> partition 3:
>
>  i | j | k
> ---+---+---
>  1 | 0 | 1
>  1 | 0 | 0
>
> So I think no "partition_pathkeys" should be generated in that case. On the other hand, if the function returned the part of the list it could construct so far, it'd be wrong because such incomplete pathkeys could pass the checks I proposed above for reasons unrelated to the partitioning scheme.

Oops. That's a mistake. We should return what we have so far if we can't make one of the pathkeys. Will fix.

> The following comments are mostly on coding:
>
> * Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have this sentence in the header comment:
>
> 	Note: If changing this, see build_partition_pathkeys()
>
> However I could not find other use besides that in RelationBuildPartitionDesc().

While the new code does not call those directly, the new code does depend on the sort order of the partitions inside the PartitionDesc, which those functions are responsible for. Perhaps there's a better way to communicate that.

Actually, I think the partitioning checking code I added to pathkeys.c does not belong there. Likely those checks should live with the other partitioning code in the form of a bool-returning function. I'll change that now. It means we don't have to work that out twice, as I'm currently running it once for the forward and once for the backwards scan case. Currently the code is very simple, but if we start analysing list partition bounds then it will become slower.

> * create_append_path():
>
> 	/*
> 	 * Apply query-wide LIMIT if known and path is for sole base relation.
> 	 * (Handling this at this low level is a bit klugy.)
> 	 */
> 	if (root != NULL && pathkeys != NULL &&
> 		bms_equal(rel->relids, root->all_baserels))
> 		pathnode->limit_tuples = root->limit_tuples;
> 	else
> 		pathnode->limit_tuples = -1.0;
>
> I think this optimization is not specific to AppendPath / MergeAppendPath, so it could be moved elsewhere (as a separate patch of course). But specifically for AppendPath, why do we have to test pathkeys? The pathkeys of the AppendPath do not necessarily describe the order of the set to which LIMIT is applied, so their existence should not be important here.

The pathkeys != NULL test could be removed. I was just trying to maintain the status quo for Appends without pathkeys. In reality it currently does not matter, since that's only used as a parameter for cost_sort(). There'd previously have been no reason to have a Sort path as a subpath in an Append node, since the order would be destroyed after the Append. Perhaps we should just pass it through, as one day it might be useful. I just can't currently imagine why.

> * If pathkeys is passed, shouldn't create_append_path() include the cost_sort() of subpaths just like create_merge_append_path() does? And if so, then create_append_path() and create_merge_append_path() might eventually have some common code (at least for the subpath processing) to be put into a separate function.

It does. It's just done via the call to cost_append().

> * Likewise, create_append_plan() / create_merge_append_plan() are going to be more similar than before, so some refactoring could also make sense.
>
> Although it's not too much code, I admit the patch is not trivial, so I'm curious about your opinion.

I think the costing code is sufficiently different to warrant not sharing more. For example, the startup costing is completely different. Append can start at the startup cost of the first subpath, but MergeAppend takes the sum of the startup costs of all subpaths.

I've attached v4 of the patch. I think this addresses all that you mentioned apart from the first one, due to not understanding the problem.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
David Rowley <david.rowley@2ndquadrant.com> wrote:

> On 1 November 2018 at 04:01, Antonin Houska <ah@cybertec.at> wrote:
> > * As for the logic, I found generate_mergeappend_paths() to be the most interesting part:
> >
> > Imagine a table partitioned by "i", so "partition_pathkeys" is {i}.
> >
> > partition 1:
> >
> >  i | j
> > ---+---
> >  0 | 0
> >  1 | 1
> >  0 | 1
> >  1 | 0
> >
> > partition 2:
> >
> >  i | j
> > ---+---
> >  3 | 0
> >  2 | 0
> >  2 | 1
> >  3 | 1
> >
> > Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the ordering of the subpaths should not change the way tuples are split into partitions.
> >
> > ...
>
> I understand what you're saying. I just don't understand what you think is wrong with the patch in this area.

I think these conditions are too restrictive:

	/*
	 * Determine if these pathkeys match the partition order, or reverse
	 * partition order. It can't match both, so only go to the trouble of
	 * checking the reverse order when it's not in ascending partition
	 * order.
	 */
	partition_order = pathkeys_contained_in(pathkeys,
											partition_pathkeys);
	partition_order_desc = !partition_order &&
							pathkeys_contained_in(pathkeys,
												  partition_pathkeys_desc);

In the example above I wanted to show that your new feature should work even if "pathkeys" is not contained in "partition_pathkeys".

> > Another problem I see is that build_partition_pathkeys() continues even if it fails to create a pathkey for some partitioning column. In the example above it would mean that the table can have "partition_pathkeys" equal to {j} instead of {i, j} under some EC-related conditions. However such a key does not correspond to reality - this is easier to imagine if another partition is considered.
> >
> > partition 3:
> >
> >  i | j | k
> > ---+---+---
> >  1 | 0 | 1
> >  1 | 0 | 0
> >
> > So I think no "partition_pathkeys" should be generated in that case. On the other hand, if the function returned the part of the list it could construct so far, it'd be wrong because such incomplete pathkeys could pass the checks I proposed above for reasons unrelated to the partitioning scheme.
>
> Oops. That's a mistake. We should return what we have so far if we can't make one of the pathkeys. Will fix.

I think no "partition_pathkeys" should be created in this case, but before we can discuss this in detail there needs to be an agreement on the evaluation of the relationship between "pathkeys" and "partition_pathkeys", see above.

> > The following comments are mostly on coding:
> >
> > * Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have this sentence in the header comment:
> >
> > 	Note: If changing this, see build_partition_pathkeys()
> >
> > However I could not find other use besides that in RelationBuildPartitionDesc().
>
> While the new code does not call those directly, the new code does depend on the sort order of the partitions inside the PartitionDesc, which those functions are responsible for. Perhaps there's a better way to communicate that.

I pointed this out because I suspect that changes to these functions would affect more features, not only the one you're trying to implement.

> > * If pathkeys is passed, shouldn't create_append_path() include the cost_sort() of subpaths just like create_merge_append_path() does? And if so, then create_append_path() and create_merge_append_path() might eventually have some common code (at least for the subpath processing) to be put into a separate function.
>
> It does. It's just done via the call to cost_append().

ok, I missed that.

> > * Likewise, create_append_plan() / create_merge_append_plan() are going to be more similar than before, so some refactoring could also make sense.
> >
> > Although it's not too much code, I admit the patch is not trivial, so I'm curious about your opinion.
>
> I think the costing code is sufficiently different to warrant not sharing more. For example, the startup costing is completely different. Append can start at the startup cost of the first subpath, but MergeAppend takes the sum of the startup costs of all subpaths.

ok

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:
> I think these conditions are too restrictive:
>
> 	/*
> 	 * Determine if these pathkeys match the partition order, or reverse
> 	 * partition order. It can't match both, so only go to the trouble of
> 	 * checking the reverse order when it's not in ascending partition
> 	 * order.
> 	 */
> 	partition_order = pathkeys_contained_in(pathkeys,
> 											partition_pathkeys);
> 	partition_order_desc = !partition_order &&
> 							pathkeys_contained_in(pathkeys,
> 												  partition_pathkeys_desc);
>
> In the example above I wanted to show that your new feature should work even if "pathkeys" is not contained in "partition_pathkeys".

Okay, after a bit more time looking at this I see what you're saying now and I agree, but see below.

>> > Another problem I see is that build_partition_pathkeys() continues even if it fails to create a pathkey for some partitioning column. In the example above it would mean that the table can have "partition_pathkeys" equal to {j} instead of {i, j} under some EC-related conditions. However such a key does not correspond to reality - this is easier to imagine if another partition is considered.
>> >
>> > partition 3:
>> >
>> >  i | j | k
>> > ---+---+---
>> >  1 | 0 | 1
>> >  1 | 0 | 0
>> >
>> > So I think no "partition_pathkeys" should be generated in that case. On the other hand, if the function returned the part of the list it could construct so far, it'd be wrong because such incomplete pathkeys could pass the checks I proposed above for reasons unrelated to the partitioning scheme.
>>
>> Oops. That's a mistake. We should return what we have so far if we can't make one of the pathkeys. Will fix.
>
> I think no "partition_pathkeys" should be created in this case, but before we can discuss this in detail there needs to be an agreement on the evaluation of the relationship between "pathkeys" and "partition_pathkeys", see above.

The problem with doing that is that if the partition keys are better than the pathkeys then we'll most likely fail to generate any partition pathkeys at all due to the lack of any existing eclass to use for the pathkeys. It's unsafe to use just the prefix in this case, as the eclass may not have been found due to, for example, one of the partition keys having a different collation than the required sort order of the query. In other words, we can't rely on a failure to create the pathkey meaning that a more strict sort order is not required.

I'm a bit unsure how safe it would be to pass "create_it" as true to make_pathkey_from_sortinfo(). We might be building partition pathkeys for some sub-partitioned table. In this case the eclass should likely have its member added with em_is_child = true. The existing code always sets em_is_child to false. It's not that clear to me that setting up a new eclass with a single em_is_child = true member is correct.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
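A sketch of the collation situation David mentions (illustrative DDL only; the table and collation names are made up, and the second collation just has to be anything other than "C" that exists on the system): the partition key below sorts under one collation, so an ORDER BY under a different collation cannot reuse the partition order even though the column matches.

-- Partitioned on a text column under the "C" collation.
CREATE TABLE t_coll (name text) PARTITION BY RANGE (name COLLATE "C");
CREATE TABLE t_coll_am PARTITION OF t_coll FOR VALUES FROM ('a') TO ('m');
CREATE TABLE t_coll_mz PARTITION OF t_coll FOR VALUES FROM ('m') TO ('z');

-- Ordering under a different collation is not the partition order, so no
-- partition pathkey can match here and an ordered Append must not be used.
SELECT * FROM t_coll ORDER BY name COLLATE "en_US";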
On Mon, 5 Nov 2018 at 10:46, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:
> > I think these conditions are too restrictive:
> >
> > 	/*
> > 	 * Determine if these pathkeys match the partition order, or reverse
> > 	 * partition order. It can't match both, so only go to the trouble of
> > 	 * checking the reverse order when it's not in ascending partition
> > 	 * order.
> > 	 */
> > 	partition_order = pathkeys_contained_in(pathkeys,
> > 											partition_pathkeys);
> > 	partition_order_desc = !partition_order &&
> > 							pathkeys_contained_in(pathkeys,
> > 												  partition_pathkeys_desc);
>
> The problem with doing that is that if the partition keys are better than the pathkeys then we'll most likely fail to generate any partition pathkeys at all due to the lack of any existing eclass to use for the pathkeys. It's unsafe to use just the prefix in this case, as the eclass may not have been found due to, for example, one of the partition keys having a different collation than the required sort order of the query. In other words, we can't rely on a failure to create the pathkey meaning that a more strict sort order is not required.

I had another look at this patch and it seems okay just to add a new flag to build_partition_pathkeys() to indicate whether the pathkey List was truncated or not. In generate_mergeappend_paths() we can then just check that flag before checking if the partition pathkeys are contained in the pathkeys. It's fine if the partition pathkeys were truncated for the reverse of that check.

I've done this in the attached and added additional regression tests for this case.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi,

On Thu, Nov 22, 2018 at 11:27 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Mon, 5 Nov 2018 at 10:46, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:
> > > I think these conditions are too restrictive:
> > >
> > > 	/*
> > > 	 * Determine if these pathkeys match the partition order, or reverse
> > > 	 * partition order. It can't match both, so only go to the trouble of
> > > 	 * checking the reverse order when it's not in ascending partition
> > > 	 * order.
> > > 	 */
> > > 	partition_order = pathkeys_contained_in(pathkeys,
> > > 											partition_pathkeys);
> > > 	partition_order_desc = !partition_order &&
> > > 							pathkeys_contained_in(pathkeys,
> > > 												  partition_pathkeys_desc);
> >
> > The problem with doing that is that if the partition keys are better than the pathkeys then we'll most likely fail to generate any partition pathkeys at all due to the lack of any existing eclass to use for the pathkeys. It's unsafe to use just the prefix in this case, as the eclass may not have been found due to, for example, one of the partition keys having a different collation than the required sort order of the query. In other words, we can't rely on a failure to create the pathkey meaning that a more strict sort order is not required.
>
> I had another look at this patch and it seems okay just to add a new flag to build_partition_pathkeys() to indicate whether the pathkey List was truncated or not. In generate_mergeappend_paths() we can then just check that flag before checking if the partition pathkeys are contained in the pathkeys. It's fine if the partition pathkeys were truncated for the reverse of that check.
>
> I've done this in the attached and added additional regression tests for this case.

I started to look at v5.

If I understand correctly, the new behavior is controlled by partitions_are_ordered(), but it only checks for declared partitions, not partitions that survived pruning. Did I miss something or is it the intended behavior?

Also, generate_mergeappend_paths should probably be renamed to something like generate_sortedappend_paths, since it can now generate either Append or MergeAppend paths.

I'm also wondering if that's ok to only generate either a (sorted) Append or a MergeAppend. Is it possible that in some cases it's better to have a MergeAppend rather than a sorted Append, given that MergeAppend is parallel-aware and the sorted Append isn't?
On Wed, 19 Dec 2018 at 20:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
> I started to look at v5.

Thanks for giving this a look over.

> If I understand correctly, the new behavior is controlled by partitions_are_ordered(), but it only checks for declared partitions, not partitions that survived pruning. Did I miss something or is it the intended behavior?

Yeah, it was mentioned up-thread a bit.

I wrote:
> I retrospectively read that thread after Amit mentioned your patch. I just disagree with Robert about caching this flag. The reason is, if the flag is false due to some problematic partitions, and we then go and prune those, we needlessly fail to optimise that case.
>
> I propose we come back and do the remaining optimisations with interleaved LIST partitions and partitioned tables with DEFAULT partitions later, once we have a new "live_parts" field in RelOptInfo. That way we can just check the live parts to ensure they're compatible with the optimization. If we get what's done already in, then we're already a big step forward.

The reason I'm keen to leave this alone today is that determining which partitions are pruned requires looking at each partition's RelOptInfo and checking if it's marked as a dummy rel. I'm trying to minimise the overhead of this patch by avoiding doing any per-partition processing. If we get the "live_parts" Bitmapset, then this becomes cheaper, as Bitmapsets are fairly efficient at finding the next set member, even when they're large and sparsely populated.

> Also, generate_mergeappend_paths should probably be renamed to something like generate_sortedappend_paths, since it can now generate either Append or MergeAppend paths.

You might be right about naming this something else, but "sortedappend" sounds like an Append node with a Sort node above it. "orderedappend" feels slightly better, although my personal vote would be not to rename it at all. That it sometimes generates an Append seems like an easy enough corner case to mention in the function body.

> I'm also wondering if that's ok to only generate either a (sorted) Append or a MergeAppend. Is it possible that in some cases it's better to have a MergeAppend rather than a sorted Append, given that MergeAppend is parallel-aware and the sorted Append isn't?

That might have been worth a thought if we had parallel MergeAppends, but we don't. You might be thinking of GatherMerge.

I've attached a v6 patch. The only change is that I've renamed the generate_mergeappend_paths() function to generate_orderedappend_paths(), plus the required comment updates to go with it.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Wed, Dec 19, 2018 at 10:51 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Wed, 19 Dec 2018 at 20:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> > If I understand correctly, the new behavior is controlled by partitions_are_ordered(), but it only checks for declared partitions, not partitions that survived pruning. Did I miss something or is it the intended behavior?
>
> Yeah, it was mentioned up-thread a bit.
>
> I wrote:
> > I retrospectively read that thread after Amit mentioned your patch. I just disagree with Robert about caching this flag. The reason is, if the flag is false due to some problematic partitions, and we then go and prune those, we needlessly fail to optimise that case.
> >
> > I propose we come back and do the remaining optimisations with interleaved LIST partitions and partitioned tables with DEFAULT partitions later, once we have a new "live_parts" field in RelOptInfo. That way we can just check the live parts to ensure they're compatible with the optimization. If we get what's done already in, then we're already a big step forward.

Ah, sorry, I did read this but I misunderstood it. I really need to catch up more thoroughly on what changed for partitioning since pg11.

> The reason I'm keen to leave this alone today is that determining which partitions are pruned requires looking at each partition's RelOptInfo and checking if it's marked as a dummy rel. I'm trying to minimise the overhead of this patch by avoiding doing any per-partition processing. If we get the "live_parts" Bitmapset, then this becomes cheaper, as Bitmapsets are fairly efficient at finding the next set member, even when they're large and sparsely populated.

I see. But since for now the optimisation will only be done considering all partitions, I still think that it's better to store a bool flag in the PartitionDesc to describe whether it's natively ordered or not, and therefore also handle the case of non-interleaved multi-datum list partitioning. It won't add much overhead and will benefit way more cases. We can still revisit that when a live_parts Bitmapset is available in RelOptInfo (and maybe other flags that say whether partitions were pruned or not, and/or whether the default partition was pruned).

> > Also, generate_mergeappend_paths should probably be renamed to something like generate_sortedappend_paths, since it can now generate either Append or MergeAppend paths.
>
> You might be right about naming this something else, but "sortedappend" sounds like an Append node with a Sort node above it. "orderedappend" feels slightly better, although my personal vote would be not to rename it at all. That it sometimes generates an Append seems like an easy enough corner case to mention in the function body.

Ok, I don't have a very strong opinion on it, and orderedappend sounds less ambiguous.

> > I'm also wondering if that's ok to only generate either a (sorted) Append or a MergeAppend. Is it possible that in some cases it's better to have a MergeAppend rather than a sorted Append, given that MergeAppend is parallel-aware and the sorted Append isn't?
>
> That might have been worth a thought if we had parallel MergeAppends, but we don't. You might be thinking of GatherMerge.

Ah, oops, indeed :)
On Wed, 19 Dec 2018 at 23:25, Julien Rouhaud <rjuju123@gmail.com> wrote: > On Wed, Dec 19, 2018 at 10:51 AM David Rowley > <david.rowley@2ndquadrant.com> wrote: > > The reason I'm keen to leave this alone today is that determining > > which partitions are pruned requires looking at each partition's > > RelOptInfo and checking if it's marked as a dummy rel. I'm trying to > > minimise the overhead of this patch by avoiding doing any > > per-partition processing. If we get the "live_parts" Bitmapset, then > > this becomes cheaper as Bitmapsets are fairly efficient at finding the > > next set member, even when they're large and sparsely populated. > > I see. But since for now the optimisation will only be done > considering all partitions, I still think that it's better to store a > bool flag in the PartitionDesc to describe if it's natively ordered or > not, and therefore also handle the case for > non-interleaved-multi-datums list partitioning. It won't add much > overhead and will benefit way more cases. I'm not really in favour of adding a flag there only to remove it again once we can more easily determine the pruned partitions. Remember the flag, because it's stored in the relation cache, must be set accounting for all partitions. As soon as we want to add smarts for pruned partitions, then the flag becomes completely useless for everything. If covering all cases in the first hit is your aim then the way to go is to add the live_parts field to RelOptInfo in this patch rather than in Amit's patch in [1]. I'd much rather add the pruned partitions smarts as part of another effort. The most likely cases to benefit from this are already covered by the current patch; range partitioned tables. [1] https://commitfest.postgresql.org/21/1778/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 19, 2018 at 1:18 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Wed, 19 Dec 2018 at 23:25, Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > I see. But since for now the optimisation will only be done > > considering all partitions, I still think that it's better to store a > > bool flag in the PartitionDesc to describe if it's natively ordered or > > not, and therefore also handle the case for > > non-interleaved-multi-datums list partitioning. It won't add much > > overhead and will benefit way more cases. > > I'm not really in favour of adding a flag there only to remove it > again once we can more easily determine the pruned partitions. > Remember the flag, because it's stored in the relation cache, must be > set accounting for all partitions. As soon as we want to add smarts > for pruned partitions, then the flag becomes completely useless for > everything. I don't see why we should drop this flag. If we know that the partitions are naturally ordered, they'll still be ordered after some partitions have been pruned, so we can skip later checks if we already have the information. The only remaining cases this flag doesn't cover are: - partitions are naturally ordered but there's a default partition. We could store this information and later check if the default partition has been pruned or not - partitions are not naturally ordered, but become naturally ordered if enough partitions are pruned. I may be wrong but that doesn't seem like a very frequent use case to me. I'd imagine that in a lot of cases either almost no partitions are pruned (or at least not enough so that the remaining ones are ordered), or all but one partition is pruned. So keeping a low overhead for the almost-no-pruned-partition with naturally ordered partitions case still seems like a good idea to me. > If covering all cases in the first hit is your aim then > the way to go is to add the live_parts field to RelOptInfo in this > patch rather than in Amit's patch in [1]. I'd much rather add the > pruned partitions smarts as part of another effort. The most likely > cases to benefit from this are already covered by the current patch; > range partitioned tables. Covering all cases is definitely not my goal here, just grabbing the low-hanging fruit. The multi-level partitioning case is another thing that would need to be handled for instance (and that's the main reason I couldn't submit a new patch when I was working on it), and I'm definitely not arguing to cover it in this patch. That being said, I'll try to have a look at this patch too, but as I said I have a lot of catch-up to do in this part of the code, so I'm afraid that I'll not be super efficient.
On Thu, 20 Dec 2018 at 01:58, Julien Rouhaud <rjuju123@gmail.com> wrote: > I don't see why we should drop this flag. If we know that the > partitions are naturally ordered, they'll still be ordered after some > partitions have been pruned, so we can skip later checks if we already > have the information. The only remaining cases this flag doesn't > cover are: > > - partitions are naturally ordered but there's a default partition. > We could store this information and later check if the default > partition has been pruned or not > - partitions are not naturally ordered, but become naturally ordered > if enough partitions are pruned. I may be wrong but that doesn't seem > like a very frequent use case to me. I'd imagine that in a lot of > cases either almost no partitions are pruned (or at least not enough so > that the remaining ones are ordered), or all but one partition is > pruned. So keeping a low overhead for the > almost-no-pruned-partition with naturally ordered partitions case > still seems like a good idea to me. I'm objecting to processing for all partitions, but processing for just non-pruned partitions seems fine to me. If there are 10k partitions and we pruned none of them, then planning will be slow anyway. I'm not too worried about slowing it down a further microsecond or two. It'll be a drop in the ocean. When we have the live_parts flag in RelOptInfo then we can allow all of the cases you've mentioned above, we'll just need to look at the non-pruned partitions, and in partition order, determine if the lowest LIST partitioned value sorts earlier than some earlier partition's highest LIST value and disable the optimisation for such cases. The flag you've mentioned will become redundant when support is added for the cases you've mentioned above. I don't see any reason not to support all these cases, once the live_parts flag makes it into RelOptInfo. I'm also a bit confused at why you think it's so important to make multi-valued LIST partitions work when no values are interleaved, but you suddenly don't care about the optimisation when the interleaved value partitions get pruned. Can you share your reasoning for that? If you're really so keen on this flag, can you share the design you have in mind? If it's just a single bool flag like "parts_ordered", and that's set to false, then how would you know there is some natural order when the DEFAULT partition gets pruned? Or are you proposing multiple flags, maybe two flags, one for when the default is pruned and one when it's not? If so, I'd question why the default partition is so special? Pruning of any of the other partitions could turn a naturally unordered LIST partitioned table into a naturally ordered partitioned table if the pruned partition happened to be the only one with interleaved values. Handling only the DEFAULT partition in a special way seems to violate the principle of least astonishment. But in short, I just really don't like the flags idea and I'm not really willing to work on it or put my name on it. I'd much rather wait then build a proper solution that works in all cases. I feel the current patch is worthwhile as it stands. > The multi-level partitioning case is another > thing that would need to be handled for instance (and that's the main > reason I couldn't submit a new patch when I was working on it), and > I'm definitely not arguing to cover it in this patch. As far as I'm aware, the multi-level partitioning should work just fine with the current patch. I added code for that a while ago.
There are regression tests to exercise it. I'm not aware of any cases where it does not work. > That being > said, I'll try to have a look at this patch too, but as I said I have > a lot of catch-up to do in this part of the code, so I'm afraid that > I'll not be super efficient. Thanks for your time on this so far. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 19, 2018 at 3:01 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Thu, 20 Dec 2018 at 01:58, Julien Rouhaud <rjuju123@gmail.com> wrote: > > I'm objecting to processing for all partitions, but processing for > just non-pruned partitions seems fine to me. If there are 10k > partitions and we pruned none of them, then planning will be slow > anyway. I'm not too worried about slowing it down a further > microsecond or two. It'll be a drop in the ocean. When we have the > live_parts flag in RelOptInfo then we can allow all of the cases > you've mentioned above, we'll just need to look at the non-pruned > partitions, and in partition order, determine if the lowest LIST > partitioned value sorts earlier than some earlier partition's highest > LIST value and disable the optimisation for such cases. My concern is more for a more moderate number of partitions (a few hundred?). I don't know how expensive that'll be, but it just seems sad to recompute their ordering each time and waste cycles if we can do it only once in non-corner cases. > The flag you've mentioned will become redundant when support is added > for the cases you've mentioned above. I don't see any reason not to > support all these cases, once the live_parts flag makes it into > RelOptInfo. I'm also a bit confused at why you think it's so > important to make multi-valued LIST partitions work when no values are > interleaved, but you suddenly don't care about the optimisation when > the interleaved value partitions get pruned. Can you share your > reasoning for that? I never said that I don't care about interleaved partitions being pruned. I do think it might not be a super frequent thing, but I certainly wish we handled it. I just agree with your argument that the pruned partitions problem will be better handled with the live_parts that should be added in another patch. > If you're really so keen on this flag, can you share the design you > have in mind? If it's just a single bool flag like "parts_ordered", > and that's set to false, then how would you know there is some natural > order when the DEFAULT partition gets pruned? Or are you proposing > multiple flags, maybe two flags, one for when the default is pruned > and one when it's not? I don't think that the design is a big problem here. You can either have a flag that says if the partitions are ordered whether there's a default partition or not, in which case callers will have to check if the default partition is still there, or just store an enum to distinguish the different cases. > If so, I'd question why the default partition > is so special? Pruning of any of the other partitions could turn a > naturally unordered LIST partitioned table into a naturally ordered > partitioned table if the pruned partition happened to be the only one > with interleaved values. Handling only the DEFAULT partition in a > special way seems to violate the principle of least astonishment. I'm not sure I'm following you, the default partition is by nature a special partition, and its simple presence prevents this optimisation. We can't possibly store all the sets of subsets of partitions that would make the partitioned table naturally ordered if they were pruned, so it seems like a different problem. > But in short, I just really don't like the flags idea and I'm not > really willing to work on it or put my name on it. I'd much rather > wait then build a proper solution that works in all cases. I feel the > current patch is worthwhile as it stands. Ok, fine.
> > The multi-level partitioning case is another > > thing that would need to be handled for instance (and that's the main > > reason I couldn't submit a new patch when I was working on it), and > > I'm definitely not arguing to cover it in this patch. > > As far as I'm aware, the multi-level partitioning should work just > fine with the current patch. I added code for that a while ago. There > are regression tests to exercise it. I'm not aware of any cases where > it does not work. Ok.
On Thu, 20 Dec 2018 at 09:48, Julien Rouhaud <rjuju123@gmail.com> wrote: > On Wed, Dec 19, 2018 at 3:01 PM David Rowley > <david.rowley@2ndquadrant.com> wrote: > > If so, I'd question why the default partition > > is so special? Pruning of any of the other partitions could turn a > > naturally unordered LIST partitioned table into a naturally ordered > > partitioned table if the pruned partition happened to be the only one > > with interleaved values. Handling only the DEFAULT partition in a > > special way seems to violate the principle of least astonishment. > > I'm not sure I'm following you, the default partition is by nature a > special partition, and its simple presence prevents this optimisation. > We can't possibly store all the sets of subsets of partitions that > would make the partitioned table naturally ordered if they were > pruned, so it seems like a different problem. For example: create table listp (a int) partition by list (a); create table listp12 partition of listp for values in(1,2); create table listp03 partition of listp for values in(0,3); create table listp45 partition of listp for values in(4,5); create table listpd partition of listp default; select * from listp where a in(1,2,4,5); Here we prune all but listp12 and listp45. Since the default is pruned and listp03 is pruned then there are no interleaved values. By your proposed design the natural ordering is not detected since we're storing a flag that says the partitions are unordered due to listp03. With my idea for using live_parts, we'll process the partitions looking for interleaved values on each query, after pruning takes place. In this case, we'll see the partitions are naturally ordered. I don't really foresee any issues with that additional processing since it will only be a big effort when there are a large number of partitions, and in those cases the planner already has lots of work to do. Such processing is just a drop in the ocean when compared to path generation for all those partitions. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 19, 2018 at 11:08 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Thu, 20 Dec 2018 at 09:48, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Wed, Dec 19, 2018 at 3:01 PM David Rowley > > <david.rowley@2ndquadrant.com> wrote: > > > If so, I'd question why the default partition > > > is so special? Pruning of any of the other partitions could turn a > > > naturally unordered LIST partitioned table into a naturally ordered > > > partitioned table if the pruned partition happened to be the only one > > > with interleaved values. Handling only the DEFAULT partition in a > > > special way seems to violate the principle of least astonishment. > > > > I'm not sure I'm following you, the default partition is by nature a > > special partition, and its simple presence prevents this optimisation. > > We can't possibly store all the sets of subsets of partitions that > > would make the partitioned table naturally ordered if they were > > pruned, so it seems like a different problem. > > For example: > > create table listp (a int) partition by list (a); > create table listp12 partition of listp for values in(1,2); > create table listp03 partition of listp for values in(0,3); > create table listp45 partition of listp for values in(4,5); > create table listpd partition of listp default; > > select * from listp where a in(1,2,4,5); > > Here we prune all but listp12 and listp45. Since the default is pruned > and listp03 is pruned then there are no interleaved values. By your > proposed design the natural ordering is not detected since we're > storing a flag that says the partitions are unordered due to listp03. No, what I'm proposing is to store if the partitions are naturally ordered or not, *and* recheck after pruning if that property could have changed (so if some partitions have been pruned). So we avoid extra processing if we already knew that the partitions were ordered (possibly with the default partition pruning information), or if we know that the partitions are not ordered and no partitions have been pruned.
On Thu, 20 Dec 2018 at 18:20, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Wed, Dec 19, 2018 at 11:08 PM David Rowley > <david.rowley@2ndquadrant.com> wrote: > > create table listp (a int) partition by list (a); > > create table listp12 partition of listp for values in(1,2); > > create table listp03 partition of listp for values in(0,3); > > create table listp45 partition of listp for values in(4,5); > > create table listpd partition of listp default; > > > > select * from listp where a in(1,2,4,5); > > > > Here we prune all but listp12 and listp45. Since the default is pruned > > and listp03 is pruned then there are no interleaved values. By your > > proposed design the natural ordering is not detected since we're > > storing a flag that says the partitions are unordered due to listp03. > > No, what I'm proposing is to store if the partitions are naturally > ordered or not, *and* recheck after pruning if that property could > have changed (so if some partitions have been pruned). So we avoid > extra processing if we already knew that the partitions were ordered > (possibly with the default partition pruning information), or if we > know that the partitions are not ordered and no partitions have been > pruned. I see. So if the flag says "Yes", then we can skip the plan-time check; if it says "No" and partitions were pruned, then we need to re-check as the reason the flag says "No" might have been pruned away. I guess that works, but I had imagined that the test wouldn't have been particularly expensive. As more partitions are left unpruned, I thought such a test costing a bit more would have been unlikely to be measurable, but then I've not written the code yet. Are you saying that you think this patch should have this? Or are you happy to leave it until the next round? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jan 6, 2019 at 4:24 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Thu, 20 Dec 2018 at 18:20, Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > > > No, what I'm proposing is to store if the partitions are naturally > > ordered or not, *and* recheck after pruning if that property could > > have changed (so if some partitions have been pruned). So we avoid > > extra processing if we already knew that the partitions were ordered > > (possibly with the default partition pruning information), or if we > > know that the partitions are not ordered and no partitions have been > > pruned. > > I see. So if the flag says "Yes", then we can skip the plan-time > check; if it says "No" and partitions were pruned, then we need to > re-check as the reason the flag says "No" might have been pruned away. Exactly. > I guess that works, but I had imagined that the test wouldn't have > been particularly expensive. As more partitions are left unpruned, I > thought such a test costing a bit more would have been unlikely to > be measurable, but then I've not written the code yet. That's where my objection is, I think. IIUC, the tests aren't especially expensive, one reason being that the multi-value list partitioning case is out of scope. I was thinking that this flag would allow us to keep this case in scope while not adding much overhead, and could still be useful with future enhancements (though optimizing some cycles with a huge number of partitions is probably, as you said, a drop in the ocean). > Are you saying that you think this patch should have this? Or are you > happy to leave it until the next round? I'd be happy if we can eventually handle ordered partitioned table scans in an efficient way, including multi-value list partitioning. So if that means that this optimization is not the best way to handle it, or if it's just not the best time to implement it, I'm perfectly fine with it.
I've attached a rebased patch which fixes up the recent conflicts. No other changes were made. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Jan 31, 2019 at 04:29:56PM +1300, David Rowley wrote: > I've attached a rebased patch which fixes up the recent conflicts. No > other changes were made. Moved to next CF, waiting for review. -- Michael
Attachment
On Thu, 31 Jan 2019 at 16:29, David Rowley <david.rowley@2ndquadrant.com> wrote: > I've attached a rebased patch which fixes up the recent conflicts. No > other changes were made. Rebased version due to a new call to create_append_path() added in ab5fcf2b0. No other changes made. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
David Rowley <david.rowley@2ndquadrant.com> wrote: > On Mon, 5 Nov 2018 at 10:46, David Rowley <david.rowley@2ndquadrant.com> wrote: > > On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote: > > > I think these conditions are too restrictive: > > > > > > /* > > > * Determine if these pathkeys match the partition order, or reverse > > > * partition order. It can't match both, so only go to the trouble of > > > * checking the reverse order when it's not in ascending partition > > > * order. > > > */ > > > partition_order = pathkeys_contained_in(pathkeys, > > > partition_pathkeys); > > > partition_order_desc = !partition_order && > > > pathkeys_contained_in(pathkeys, > > > partition_pathkeys_desc); > > > > > > The problem with doing that is that if the partition keys are better > > than the pathkeys then we'll most likely fail to generate any > > partition keys at all due to lack of any existing eclass to use for > > the pathkeys. It's unsafe to use just the prefix in this case as the > > eclass may not have been found due to, for example one of the > > partition keys having a different collation than the required sort > > order of the query. In other words, we can't rely on a failure to > > create the pathkey meaning that a more strict sort order is not > > required. > > I had another look at this patch and it seems okay just to add a new > flag to build_partition_pathkeys() to indicate if the pathkey List was > truncated or not. In generate_mergeappend_paths() we can then just > check that flag before checking if the partition pathkeys are > contained in pathkeys. It's fine if the partition keys were truncated > for the reverse of that check. > > I've done this in the attached and added additional regression tests > for this case. I agree with this approach and I'm also fine with your other comments / code changes to address my review. As for the latest version (v8-0001-...) I've only caught a small typo: "When the first subpath needs sorted ...". It was probably meant "... needs sort ...". -- Antonin Houska https://www.cybertec-postgresql.com
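A rough sketch of the idea being reviewed here, assuming build_partition_pathkeys() grows a boolean output argument (called partial_pathkeys below) reporting whether the returned list was truncated; the variable names are illustrative rather than the patch's actual code. The truncated-list case only matters when relying on the partition keys being a prefix of the query's pathkeys.

/*
 * Sketch only.  A truncated partition pathkey list is still safe for the
 * "query pathkeys contained in partition pathkeys" direction, but not for
 * concluding that the partition order alone satisfies the query.
 */
partition_pathkeys = build_partition_pathkeys(root, partrel,
                                              ForwardScanDirection,
                                              &partial_pathkeys);

match_partition_order =
    pathkeys_contained_in(pathkeys, partition_pathkeys) ||
    (!partial_pathkeys &&
     pathkeys_contained_in(partition_pathkeys, pathkeys));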
Thanks a lot for taking the time to look at this. On Tue, 5 Mar 2019 at 03:03, Antonin Houska <ah@cybertec.at> wrote: > As for the latest version (v8-0001-...) I've only caught a small typo: "When > the first subpath needs sorted ...". It was probably meant "... needs sort > ...". That was a sort of short way of saying "needs [to be] sorted". I've added in the missing "to be" in the attached. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Wed, Dec 19, 2018 at 5:08 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > With my idea for using live_parts, we'll process the partitions > looking for interleaved values on each query, after pruning takes > place. In this case, we'll see the partitions are naturally ordered. I > don't really foresee any issues with that additional processing since > it will only be a big effort when there are a large number of > partitions, and in those cases the planner already has lots of work to > do. Such processing is just a drop in the ocean when compared to path > generation for all those partitions. I agree that partitions_are_ordered() is cheap enough in this patch that it probably doesn't matter whether we cache the result. On the other hand, that's mostly because you haven't handled the hard cases - e.g. interleaved list partitions. If you did, then it would be expensive, and it probably *would* be worth caching the result. Now maybe those hard cases aren't worth handling anyway. You also seem to be saying that since run-time partition pruning might change the answer, caching the initial answer is pointless. But I think Julien has made a good argument for why that's wrong: if the initial answer is that the partitions are ordered, which will often be true, then we can skip all later checks. So I am OK with the fact that this patch doesn't choose to cache it, but I don't really buy any of your arguments for why it would be a bad idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 6 Mar 2019 at 07:17, Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Dec 19, 2018 at 5:08 PM David Rowley > <david.rowley@2ndquadrant.com> wrote: > > With my idea for using live_parts, we'll process the partitions > > looking for interleaved values on each query, after pruning takes > > place. In this case, we'll see the partitions are naturally ordered. I > > don't really foresee any issues with that additional processing since > > it will only be a big effort when there are a large number of > > partitions, and in those cases the planner already has lots of work to > > do. Such processing is just a drop in the ocean when compared to path > > generation for all those partitions. > > I agree that partitions_are_ordered() is cheap enough in this patch > that it probably doesn't matter whether we cache the result. On the > other hand, that's mostly because you haven't handled the hard cases - > e.g. interleaved list partitions. If you did, then it would be > expensive, and it probably *would* be worth caching the result. Now > maybe those hard cases aren't worth handling anyway. I admit that I didn't understand the idea of the flag at the time, having failed to see the point of it since if partitions are plan-time pruned then I had thought the flag would be useless. However, as Julien explained, it would be a flag where "Yes" means "Yes, okay to do ordered scans", and "No" means "Recheck if there are pruned partitions using only the non-pruned ones". That seems fine and very sane to me now that I understand it. FWIW, my moment of realisation came in [1]. However, my thoughts are that adding new flags and the live_parts field in RelOptInfo raise the bar a bit for this patch. There's already quite a number of partition-related fields in RelOptInfo. Understanding what each of those does is not trivial, so I figured that this patch would be much easier to consider if I skipped that part for the first cut version. I feared a lot of instability of what fields exist from Amit's planner improvement patches and I didn't want to deal with dependencies from WIP. I had to deal with that last year on run-time pruning and it turned out not to be fun. > You also seem to be saying that since run-time partition pruning > might change the answer, caching the initial answer is pointless. But > I think Julien has made a good argument for why that's wrong: if the > initial answer is that the partitions are ordered, which will often be > true, then we can skip all later checks. > > So I am OK with the fact that this patch doesn't choose to cache it, > but I don't really buy any of your arguments for why it would be a bad > idea. OK, good. I agree. For the record, I want to steer clear of the flag in this first cut version, especially so now given what time it is. [1] https://www.postgresql.org/message-id/CAKJS1f_r51OAPsN1oC4i36D7vznnihNk+1wiDFG0qRVb_eOKWg@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > [ v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch ] I took a quick look through this and I'm not very happy with it. It seems to me that the premise ought to be "just use an Append if we can prove the output would be ordered anyway", but that's not what we actually have here: instead you're adding more infrastructure onto Append, which notably involves invasive changes to the API of create_append_path, which is the main reason why the patch keeps breaking. (It's broken again as of HEAD, though the cfbot doesn't seem to have noticed yet.) Likewise there's a bunch of added complication in cost_append, create_append_plan, etc. I think you should remove all that and restrict this optimization to the case where all the subpaths are natively ordered --- if we have to insert Sorts, it's hardly going to move the needle to worry about simplifying the parent MergeAppend to Append. There also seem to be bits that duplicate functionality of the drop-single-child-[Merge]Append patch (specifically I'm looking at get_singleton_append_subpath). Why do we need that? The logic in build_partition_pathkeys is noticeably stupider than build_index_pathkeys, in particular it's not bright about boolean columns. Maybe that's fine, but if so it deserves a comment explaining why we're not bothering. Also, the comment for build_index_pathkeys specifies that callers should do truncate_useless_pathkeys, which they do; why is that not relevant here? regards, tom lane
On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <david.rowley@2ndquadrant.com> writes: > > [ v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch ] > > I think you should remove all that > and restrict this optimization to the case where all the subpaths are > natively ordered --- if we have to insert Sorts, it's hardly going to move > the needle to worry about simplifying the parent MergeAppend to Append. This can be a huge win for queries of the form "ORDER BY partkey LIMIT x". Even if the first subpath(s) aren't natively ordered, not all of the sorts should actually be performed.
Julien Rouhaud <rjuju123@gmail.com> writes: > On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think you should remove all that >> and restrict this optimization to the case where all the subpaths are >> natively ordered --- if we have to insert Sorts, it's hardly going to move >> the needle to worry about simplifying the parent MergeAppend to Append. > This can be a huge win for queries of the form "ORDER BY partkey LIMIT > x". Even if the first subpath(s) aren't natively ordered, not all of > the sorts should actually be performed. [ shrug... ] We've got no realistic chance of estimating such situations properly, so I'd have no confidence in a plan choice based on such a thing. Nor do I believe that this case is all that important. regards, tom lane
On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Julien Rouhaud <rjuju123@gmail.com> writes: > > On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I think you should remove all that > >> and restrict this optimization to the case where all the subpaths are > >> natively ordered --- if we have to insert Sorts, it's hardly going to move > >> the needle to worry about simplifying the parent MergeAppend to Append. > > > This can be a huge win for queries of the form "ORDER BY partkey LIMIT > > x". Even if the first subpath(s) aren't natively ordered, not all of > > the sorts should actually be performed. > > [ shrug... ] We've got no realistic chance of estimating such situations > properly, so I'd have no confidence in a plan choice based on such a > thing. With all due respect, I'd say that's not even close to being true. A MergeAppend's startup cost ends up set to the sum of all of its subplans' startup costs, plus the cost of any Sort that will be required if the subpath is not sufficiently ordered already. An Append's startup cost will just be the startup cost of the first subpath. This can happen since, unlike MergeAppend, we don't need to pull the first tuple out of each subnode to find the lowest one. In Julien's case, such an Append plan has the potential to weigh in massively cheaper than a MergeAppend plan. Just imagine some large sorts in some later subpath. Can you explain why you think that's not properly being estimated in the patch? > Nor do I believe that this case is all that important. Can you explain why you believe that? I see you were the author of b1577a7c78d which was committed over 19 years ago, so I'm surprised to hear you say cheap startup plans are not important. Or is it that you just don't think they're important for partitioned tables? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
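As a toy illustration of that startup cost difference, with made-up numbers rather than planner code: if only the first child has a cheap start and the later children would each need an expensive sort, MergeAppend pays for every child's startup before it can return a row, while an ordered Append only pays for its first child.

#include <stdio.h>

/* Toy model only: each child path has a startup and a total cost. */
typedef struct
{
    double startup_cost;
    double total_cost;
} ChildCost;

int
main(void)
{
    /* first child is an index scan, the other two would need big sorts */
    ChildCost child[] = {{0.5, 1000.0}, {9000.0, 9500.0}, {9000.0, 9500.0}};
    int     nchildren = 3;
    double  merge_startup = 0.0;
    double  append_startup = child[0].startup_cost;

    for (int i = 0; i < nchildren; i++)
        merge_startup += child[i].startup_cost; /* MergeAppend starts every child */

    printf("MergeAppend startup ~ %.1f, ordered Append startup ~ %.1f\n",
           merge_startup, append_startup);
    return 0;
}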
On Sat, 9 Mar 2019 at 09:14, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I took a quick look through this and I'm not very happy with it. > It seems to me that the premise ought to be "just use an Append > if we can prove the output would be ordered anyway", but that's not > what we actually have here: instead you're adding more infrastructure > onto Append, which notably involves invasive changes to the API of > create_append_path, which is the main reason why the patch keeps breaking. Can you suggest how else we could teach higher paths that an Append is ordered by some path keys without giving the append some pathkeys? That's what pathkeys are for, so I struggle to imagine how else this could work. If we don't do this, then how is a MergeJoin going to know it does not need to sort before joining? As for "the patch keeps breaking"... those are just conflicts with other changes that have been made in master. That seems fairly normal to me. > (It's broken again as of HEAD, though the cfbot doesn't seem to have > noticed yet.) I think it's not been updating itself for a few days. > Likewise there's a bunch of added complication in > cost_append, create_append_plan, etc. I think you should remove all that > and restrict this optimization to the case where all the subpaths are > natively ordered --- if we have to insert Sorts, it's hardly going to move > the needle to worry about simplifying the parent MergeAppend to Append. I think the patch would be less than half as useful if we do that. Can you explain why you think that fast startup plans are less important for partitioned tables? I could perhaps understand an argument against this if the patch added masses of complex code to achieve the goal, but in my opinion, the code is fairly easy to understand and there's not very much extra code added. > There also seem to be bits that duplicate functionality of the > drop-single-child-[Merge]Append patch (specifically I'm looking > at get_singleton_append_subpath). Why do we need that? Hmm, that patch is separate functionality. The patch you're talking about, as you know, just removes Append/MergeAppends that have a single subpath. Over here we add smarts to allow conversion of MergeAppends into Appends when the order of the partitions is defined the same as the required order of the would-be MergeAppend path. get_singleton_append_subpath() allows a sub-partitioned table's MergeAppend or Append subpaths to be pulled into the top-level Append when they just contain a single subpath. An example from the tests: Append -> Index Scan using mcrparted0_a_abs_c_idx on mcrparted0 -> Index Scan using mcrparted1_a_abs_c_idx on mcrparted1 -> Index Scan using mcrparted2_a_abs_c_idx on mcrparted2 -> Index Scan using mcrparted3_a_abs_c_idx on mcrparted3 -> Index Scan using mcrparted4_a_abs_c_idx on mcrparted4 -> Merge Append Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c -> Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a -> Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def If the nested MergeAppend path just had a single node then get_singleton_append_subpath() would have pulled the subpath into the top-level Append. However, in this example, since there are multiple MergeAppend subpaths, the pull-up would be invalid since the top-level Append can't guarantee the sort order of those MergeAppend subpaths. 
In fact, the test directly after that one drops the mcrparted5_def table which turns the plan into: Append -> Index Scan using mcrparted0_a_abs_c_idx on mcrparted0 -> Index Scan using mcrparted1_a_abs_c_idx on mcrparted1 -> Index Scan using mcrparted2_a_abs_c_idx on mcrparted2 -> Index Scan using mcrparted3_a_abs_c_idx on mcrparted3 -> Index Scan using mcrparted4_a_abs_c_idx on mcrparted4 -> Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a > The logic in build_partition_pathkeys is noticeably stupider than > build_index_pathkeys, in particular it's not bright about boolean columns. > Maybe that's fine, but if so it deserves a comment explaining why we're > not bothering. Good point. That's required to allow cases like: SELECT * FROM parttable WHERE boolcol = true ORDER BY boolcol, ordercol; I've fixed that in the attached. > Also, the comment for build_index_pathkeys specifies that > callers should do truncate_useless_pathkeys, which they do; why is that > not relevant here? I've neglected to explain that in the comments. The problem with that is that doing so would break cases where we use an Append when the partition keys are a prefix of the query's pathkeys. Say we had a range partition table on (a,b) and an index on (a, b, c): SELECT * FROM range_ab ORDER BY a, b, c; With the current patch, we can use an Append for that as no earlier value of (a,b) can come in a later partition. If we truncate_useless_pathkeys() then it'll strip out all pathkeys as the partition pathkeys are not contained within the query pathkeys. Maybe the patch should perform truncate_useless_pathkeys for the check where we see if the query pathkeys are contained in the partition's pathkeys. However, we can't do it for the reverse check. I've added a comment to explain about the lack of truncate_useless_pathkeys() in the attached. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
I've attached an updated patch which fixes the conflict with 0a9d7e1f6d8 -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Fri, Mar 8, 2019 at 3:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I took a quick look through this and I'm not very happy with it. > It seems to me that the premise ought to be "just use an Append > if we can prove the output would be ordered anyway", but that's not > what we actually have here: instead you're adding more infrastructure > onto Append, which notably involves invasive changes to the API of > create_append_path, which is the main reason why the patch keeps breaking. > (It's broken again as of HEAD, though the cfbot doesn't seem to have > noticed yet.) Likewise there's a bunch of added complication in > cost_append, create_append_plan, etc. I think you should remove all that > and restrict this optimization to the case where all the subpaths are > natively ordered --- if we have to insert Sorts, it's hardly going to move > the needle to worry about simplifying the parent MergeAppend to Append. Other people have already said that they don't think this is true; I agree with those people. Even if you have to sort *every* path, sorting a bunch of reasonably large data sets individually is possibly better than sorting all the data together, because (1) you can start emitting rows sooner, (2) it might make you fit in memory instead of having to spill to disk, and (3) O(n lg n) is supralinear. Still, if that were the only case this handled, I wouldn't be too excited, because it seems at least plausible that lumping a bunch of small partitions together and sorting it all at once could save some start-up and tear-down costs vs. sorting them individually. But it isn't; the ability to consider that sort of plan is just a fringe benefit. If a substantial fraction of the partitions have indexes -- half, three-quarters, all-but-one -- sorting only the remaining ones should win big. Admittedly, I think this case is less common than it was a few years ago, because with table inheritance one often ended up with a parent partition that was empty and had no indexes so it produced a dummy-seqscan in every plan, and that's gone with partitioning. Moreover, because of Alvaro's work on cascaded CREATE INDEX, people are probably now more likely to have matching indexes on all the partitions. Still, it's not that hard to imagine a case where older data that doesn't change much is more heavily indexed than tables that are still suffering DML of whatever kind on a regular basis. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
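For what it's worth, point (3) can be stated precisely: splitting a fixed total of N rows into per-partition chunks of sizes n_1, ..., n_k never makes the combined comparison cost worse than one big sort, since

    \sum_i n_i \log n_i \le \sum_i n_i \log N = N \log N, \quad \text{where } N = \sum_i n_i.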
On Sat, 16 Mar 2019 at 04:22, David Rowley <david.rowley@2ndquadrant.com> wrote: > I've attached an updated patch which fixes the conflict with 0a9d7e1f6d8 ... and here's the one that I should have sent. (renamed to v12 to prevent confusion) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Julien Rouhaud <rjuju123@gmail.com> writes: > > On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I think you should remove all that > >> and restrict this optimization to the case where all the subpaths are > >> natively ordered --- if we have to insert Sorts, it's hardly going to move > >> the needle to worry about simplifying the parent MergeAppend to Append. > > > This can be a huge win for queries of the form "ORDER BY partkey LIMIT > > x". Even if the first subpath(s) aren't natively ordered, not all of > > the sorts should actually be performed. > > [ shrug... ] We've got no realistic chance of estimating such situations > properly, so I'd have no confidence in a plan choice based on such a > thing. Nor do I believe that this case is all that important. Hi Tom, Wondering if you can provide more details on why you think there's no realistic chance of the planner costing these cases correctly? It would be unfortunate to reject this patch based on a sentence that starts with "[ shrug... ]", especially so when three people have stood up and disagreed with you. I've explained why I think you're wrong. Would you be able to explain to me why you think I'm wrong? You also mentioned that you didn't like the fact I'd changed the API for create_append_plan(). Could you suggest why you think passing pathkeys in is the wrong thing to do? The Append path obviously needs pathkeys so that upper paths know what order the path guarantees. Passing pathkeys in allows us to verify that pathkeys are a valid thing to have for the AppendPath. They're not valid in the Parallel Append case, for example, so setting them afterwards does not seem like an improvement. They also allow us to cost the cheaper startup cost properly, however, you did seem to argue that you have no confidence in cheap startup plans, which I'm still confused by. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> This can be a huge win for queries of the form "ORDER BY partkey LIMIT >>> x". Even if the first subpath(s) aren't natively ordered, not all of >>> the sorts should actually be performed. >> [ shrug... ] We've got no realistic chance of estimating such situations >> properly, so I'd have no confidence in a plan choice based on such a >> thing. Nor do I believe that this case is all that important. > Wondering if you can provide more details on why you think there's no > realistic chance of the planner costing these cases correctly? The reason why I'm skeptical about LIMIT with a plan of the form append-some-sorts-together is that there are going to be large discontinuities in the cost-vs-number-of-rows-returned graph, ie, every time you finish one child plan and start the next one, there'll be a hiccup while the sort happens. This is something that our plan cost approximation (startup cost followed by linear output up to total cost) can't even represent. Sticking a LIMIT on top of that is certainly not going to lead to any useful estimate of the actual cost, meaning that if you get a correct plan choice it'll just be by luck, and if you don't there'll be nothing to be done about it. If we don't incorporate that, then the situation that the planner will have to model is a MergeAppend with possibly some sorts in front of it, and it will correctly cost that as if all the sorts happen before any output occurs, and so you can hope that reasonable plan choices will ensue. I believe that the cases where a LIMIT query actually ought to go for a fast-start plan are where *all* the child rels have fast-start (non-sort) paths --- which is exactly the cases we'd get if we only allow "sorted" Appends when none of the inputs require a sort. Imagining that we'll get good results in cases where some of them need to be sorted is just wishful thinking. > It would be unfortunate to reject this patch based on a sentence that > starts with "[ shrug... ]", especially so when three people have stood > up and disagreed with you. I don't want to reject the patch altogether, I just want it to not include a special hack to allow Append rather than MergeAppend in such cases. I am aware that I'm probably going to be shouted down on this point, but that will not change my opinion that the shouters are wrong. regards, tom lane
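The cost-model limitation described above can be sketched with a toy version of that linear interpolation (illustrative function and numbers, not PostgreSQL code): the assumed cost of fetching the first few rows ramps linearly from startup_cost towards total_cost, so a sort that actually happens partway through an Append's output has nowhere to show up.

#include <stdio.h>

/*
 * Toy sketch of the "startup cost followed by linear output up to total
 * cost" approximation.  Fetching k of n rows is assumed to cost the
 * startup cost plus a proportional share of the run cost; any hiccup
 * partway through the scan is not representable.
 */
static double
cost_of_first_rows(double startup_cost, double total_cost, double n, double k)
{
    if (k >= n)
        return total_cost;
    return startup_cost + (total_cost - startup_cost) * (k / n);
}

int
main(void)
{
    /* 1,000,000-row Append whose later children would need big sorts */
    printf("assumed cost of first 10 rows: %.2f\n",
           cost_of_first_rows(1.0, 250000.0, 1000000.0, 10.0));
    return 0;
}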
On Fri, 22 Mar 2019 at 11:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> This can be a huge win for queries of the form "ORDER BY partkey LIMIT
> >>> x". Even if the first subpath(s) aren't natively ordered, not all of
> >>> the sorts should actually be performed.
> >> [ shrug... ] We've got no realistic chance of estimating such situations
> >> properly, so I'd have no confidence in a plan choice based on such a
> >> thing. Nor do I believe that this case is all that important.
> > Wondering if you can provide more details on why you think there's no
> > realistic chance of the planner costing these cases correctly?
> The reason why I'm skeptical about LIMIT with a plan of the form
> append-some-sorts-together is that there are going to be large
> discontinuities in the cost-vs-number-of-rows-returned graph,
> ie, every time you finish one child plan and start the next one,
> there'll be a hiccup while the sort happens. This is something
> that our plan cost approximation (startup cost followed by linear
> output up to total cost) can't even represent. Sticking a
> LIMIT on top of that is certainly not going to lead to any useful
> estimate of the actual cost, meaning that if you get a correct
> plan choice it'll just be by luck, and if you don't there'll be
> nothing to be done about it.
> If we don't incorporate that, then the situation that the planner
> will have to model is a MergeAppend with possibly some sorts in
> front of it, and it will correctly cost that as if all the sorts
> happen before any output occurs, and so you can hope that reasonable
> plan choices will ensue.
> I believe that the cases where a LIMIT query actually ought to go
> for a fast-start plan are where *all* the child rels have fast-start
> (non-sort) paths --- which is exactly the cases we'd get if we only
> allow "sorted" Appends when none of the inputs require a sort.
> Imagining that we'll get good results in cases where some of them
> need to be sorted is just wishful thinking.
> > It would be unfortunate to reject this patch based on a sentence that
> > starts with "[ shrug... ]", especially so when three people have stood
> > up and disagreed with you.
> I don't want to reject the patch altogether, I just want it to not
> include a special hack to allow Append rather than MergeAppend in such
> cases. I am aware that I'm probably going to be shouted down on this
> point, but that will not change my opinion that the shouters are wrong.
I agree that the issue of mixing sorts at various points will make nonsense of the startup cost/total cost results.
What you say about LIMIT is exactly right. But ISTM that LIMIT itself is the issue there and it needs more smarts to correctly calculate costs.
I don't see LIMIT costing being broken as a reason to restrict this optimization. I would ask that we allow improvements to the important use case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, 23 Mar 2019 at 04:12, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The reason why I'm skeptical about LIMIT with a plan of the form > append-some-sorts-together is that there are going to be large > discontinuities in the cost-vs-number-of-rows-returned graph, > ie, every time you finish one child plan and start the next one, > there'll be a hiccup while the sort happens. This is something > that our plan cost approximation (startup cost followed by linear > output up to total cost) can't even represent. Sticking a > LIMIT on top of that is certainly not going to lead to any useful > estimate of the actual cost, meaning that if you get a correct > plan choice it'll just be by luck, and if you don't there'll be > nothing to be done about it. Thanks for explaining. I see where you're coming from now. I think this point would carry more weight if using the Append instead of the MergeAppend were worse in some cases as we could end up using an inferior plan accidentally. However, that's not the case. The Append plan should always perform better both for startup and pulling a single row all the way to pulling the final row. The underlying subplans are the same in each case, but Append has the additional saving of not having to perform a sort comparison on the top row from each subpath. I also think that cost-vs-number-of-rows-returned is not any worse than a SeqScan where the required rows are unevenly distributed throughout the table. In fact, the SeqScan case is much worse as we could end up choosing that over an index scan, which could be significantly better, but as mentioned above, and benchmarked in the initial post of this thread, Append always wins over MergeAppend, so I don't quite understand your reasoning here. I could understand it if Append needed the sorts but MergeAppend did not, but they both need the sorts if there's not a path that already provides the required ordering. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes: > I agree that the issue of mixing sorts at various points will make nonsense > of the startup cost/total cost results. Right. > I don't see LIMIT costing being broken as a reason to restrict this > optimization. I would ask that we allow improvements to the important use > case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly. There's not time to reinvent LIMIT costing for v12. I'd be happy to see some work done on that in the future, and when it does get done, I'd be happy to see Append planning extended to allow this case. I just don't think it's wise to ship one without the other. regards, tom lane
On Fri, 22 Mar 2019 at 11:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I agree that the issue of mixing sorts at various points will make nonsense
> > of the startup cost/total cost results.
> Right.
> > I don't see LIMIT costing being broken as a reason to restrict this
> > optimization. I would ask that we allow improvements to the important use
> > case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.
> There's not time to reinvent LIMIT costing for v12. I'd be happy to
> see some work done on that in the future, and when it does get done,
> I'd be happy to see Append planning extended to allow this case.
> I just don't think it's wise to ship one without the other.
I was hoping to motivate you to look at this personally, and soon. LIMIT is so broken that any improvements count as bug fixes in my book.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > Thanks for explaining. I see where you're coming from now. I think > this point would carry more weight if using the Append instead of the > MergeAppend were worse in some cases as we could end up using an > inferior plan accidentally. However, that's not the case. The Append > plan should always perform better both for startup and pulling a > single row all the way to pulling the final row. The underlying > subplans are the same in each case, but Append has the additional > saving of not having to determine to perform a sort on the top row > from each subpath. Uh, what? sorted-Append and MergeAppend would need pre-sorts on exactly the same set of children. It's true that the Append path might not have to actually execute some of those sorts, if it's able to stop in an earlier child. The problem here is basically that it's hard to predict whether that will happen. > Append always wins over MergeAppend, so I > don't quite understand your reasoning here. The problem is that the planner is likely to favor a "fast-start" Append *too much*, and prefer it over some other plan altogether. In cases where, say, the first child requires no sort but also doesn't emit very many rows, while the second child requires an expensive sort, the planner will have a ridiculously optimistic opinion of the cost of fetching slightly more rows than are available from the first child. This might lead it to wrongly choose a merge join over a hash for example. Yes, there are cases where Append-with-some-sorts is preferable to MergeAppend-with-some-sorts, and maybe I'd even believe that it always is. But I don't believe that it's necessarily preferable to plans that don't require a sort at all, and I'm afraid that we are likely to find the planner making seriously bad choices when it's presented with such situations. I'd rather we leave that case out for now, until we have some better way of modelling it. The fact that the patch also requires a lot of extra hacking just to support this case badly doesn't make me any more favorably disposed. regards, tom lane
On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > In cases where, say, the first child requires no sort but also doesn't > emit very many rows, while the second child requires an expensive sort, > the planner will have a ridiculously optimistic opinion of the cost of > fetching slightly more rows than are available from the first child. > This might lead it to wrongly choose a merge join over a hash for example. I think this is very much a valid point, especially in view of the fact that we already choose supposedly fast-start plans too often. I don't know whether it's a death sentence for this patch, but it should at least make us stop and think hard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> In cases where, say, the first child requires no sort but also doesn't >> emit very many rows, while the second child requires an expensive sort, >> the planner will have a ridiculously optimistic opinion of the cost of >> fetching slightly more rows than are available from the first child. >> This might lead it to wrongly choose a merge join over a hash for example. > I think this is very much a valid point, especially in view of the > fact that we already choose supposedly fast-start plans too often. I > don't know whether it's a death sentence for this patch, but it should > at least make us stop and think hard. Once again: this objection is not a "death sentence for this patch". I simply wish to suppress the option to generate an ordered Append when some of the inputs would require an added sort step. As long as we have pre-ordered paths for all children, go for it. regards, tom lane
On Fri, Mar 22, 2019 at 12:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Once again: this objection is not a "death sentence for this patch". > I simply wish to suppress the option to generate an ordered Append > when some of the inputs would require an added sort step. As long > as we have pre-ordered paths for all children, go for it. I stand corrected. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> In cases where, say, the first child requires no sort but also doesn't >> emit very many rows, while the second child requires an expensive sort, >> the planner will have a ridiculously optimistic opinion of the cost of >> fetching slightly more rows than are available from the first child. >> This might lead it to wrongly choose a merge join over a hash for example. > I think this is very much a valid point, especially in view of the > fact that we already choose supposedly fast-start plans too often. I > don't know whether it's a death sentence for this patch, but it should > at least make us stop and think hard. BTW, another thing we could possibly do to answer this objection is to give the ordered-Append node an artificially pessimistic startup cost, such as the sum or the max of its children's startup costs. That's pretty ugly and unprincipled, but maybe it's better than not having the ability to generate the plan shape at all? regards, tom lane
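With purely illustrative numbers: if an ordered Append has three children whose startup costs are 0.29, 150.00 and 0.42 (say the middle child needs a Sort), the node's startup cost would be reported as 150.71 under the "sum" rule or 150.00 under the "max" rule, rather than the 0.29 a naive fast-start estimate would claim. That is roughly how a MergeAppend over the same children would be costed anyway, since it has to pull the first row from every child before it can return anything.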
On Fri, Mar 22, 2019 at 12:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> In cases where, say, the first child requires no sort but also doesn't > >> emit very many rows, while the second child requires an expensive sort, > >> the planner will have a ridiculously optimistic opinion of the cost of > >> fetching slightly more rows than are available from the first child. > >> This might lead it to wrongly choose a merge join over a hash for example. > > > I think this is very much a valid point, especially in view of the > > fact that we already choose supposedly fast-start plans too often. I > > don't know whether it's a death sentence for this patch, but it should > > at least make us stop and think hard. > > BTW, another thing we could possibly do to answer this objection is to > give the ordered-Append node an artificially pessimistic startup cost, > such as the sum or the max of its children's startup costs. That's > pretty ugly and unprincipled, but maybe it's better than not having the > ability to generate the plan shape at all? Yeah, I'm not sure whether that's a good idea or not. I think one of the problems with a cost-based optimizer is that once you start putting things in with the wrong cost because you think it will give the right answer, you're sorta playing with fire, because you can't necessarily predict how things are going to turn out in more complex scenarios. On the other hand, it may sometimes be the right thing to do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 22, 2019 at 7:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Mar 22, 2019 at 12:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> In cases where, say, the first child requires no sort but also doesn't > > >> emit very many rows, while the second child requires an expensive sort, > > >> the planner will have a ridiculously optimistic opinion of the cost of > > >> fetching slightly more rows than are available from the first child. > > >> This might lead it to wrongly choose a merge join over a hash for example. > > > > > I think this is very much a valid point, especially in view of the > > > fact that we already choose supposedly fast-start plans too often. I > > > don't know whether it's a death sentence for this patch, but it should > > > at least make us stop and think hard. > > > > BTW, another thing we could possibly do to answer this objection is to > > give the ordered-Append node an artificially pessimistic startup cost, > > such as the sum or the max of its children's startup costs. That's > > pretty ugly and unprincipled, but maybe it's better than not having the > > ability to generate the plan shape at all? > > Yeah, I'm not sure whether that's a good idea or not. I think one of > the problems with a cost-based optimizer is that once you start > putting things in with the wrong cost because you think it will give > the right answer, you're sorta playing with fire, because you can't > necessarily predict how things are going to turn out in more > complex scenarios. On the other hand, it may sometimes be the right > thing to do. I've been bitten too many times by super inefficient plans of the form "let's use the wrong index instead of the good one because I'll probably find the tuple I want there very quickly", due to LIMIT assuming an even distribution. Since those queries can end up taking dozens of minutes instead of less than a millisecond, and there's not a lot of control to fix this kind of problem, I definitely don't want to introduce another similar source of pain for users. However, what we're talking about here is still a corner case. People having partitioned tables with a mix of partitions with and without indexes suitable for ORDER BY x LIMIT y queries should already have at best average performance. And trying to handle this case cannot hurt cases where all partitions have suitable indexes, so that may be an acceptable risk? I also have mixed feelings about this artificial startup cost penalty, but if we go this way we can at least accumulate the startup costs of all the sorts that we think we'll have to perform (according to each path's estimated rows and the given limit_tuples). That probably won't be enough though, especially with fractional paths.
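The kind of plan Julien is alluding to can be sketched with a hypothetical query (the table and column names are invented, not taken from the thread):

SELECT * FROM orders WHERE customer_id = 12345 ORDER BY created_at DESC LIMIT 10;

Given an index on created_at alone, the planner may choose to walk that index backwards and filter on customer_id, betting that matching rows are spread evenly through the index, rather than using an index on (customer_id, created_at). When the customer's rows are rare or clustered at the wrong end, that "fast-start" plan ends up reading most of the table.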
On Sat, 23 Mar 2019 at 04:56, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <david.rowley@2ndquadrant.com> writes: > > Append has the additional > > saving of not having to determine to perform a sort on the top row > > from each subpath. > > Uh, what? sorted-Append and MergeAppend would need pre-sorts on > exactly the same set of children. I was talking about the binary heap code that MergeAppend uses to decide which subplan to pull from next. > In cases where, say, the first child requires no sort but also doesn't > emit very many rows, while the second child requires an expensive sort, > the planner will have a ridiculously optimistic opinion of the cost of > fetching slightly more rows than are available from the first child. > This might lead it to wrongly choose a merge join over a hash for example. umm.. Yeah, that's a good point. I seem to have failed to consider that the fast-startup plan could lower the cost of a merge join with a limit. I agree with that concern. I also find it slightly annoying since we already make other plan shapes that can suffer from similar problems, e.g. Index Scan + Filter + Limit, but I agree we don't need any more of those as they're pretty painful when they hit. I'll change the patch around to pull out the code you've mentioned. Thanks for spelling out your point to me. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, another thing we could possibly do to answer this objection is to > give the ordered-Append node an artificially pessimistic startup cost, > such as the sum or the max of its children's startup costs. That's > pretty ugly and unprincipled, but maybe it's better than not having the > ability to generate the plan shape at all? I admit to having thought of that while trying to get to sleep last night, but I was too scared to even suggest it. It's pretty much how MergeAppend would cost it anyway. I agree it's not pretty to lie about the startup cost, but it does kinda seem silly to fall back on a more expensive MergeAppend when we know full well Append is cheaper. Probably the danger would be that someone pulls it out thinking it's a bug. So we'd need to clearly comment why we're doing it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, another thing we could possibly do to answer this objection is to >> give the ordered-Append node an artificially pessimistic startup cost, >> such as the sum or the max of its children's startup costs. That's >> pretty ugly and unprincipled, but maybe it's better than not having the >> ability to generate the plan shape at all? > I admit to having thought of that while trying to get to sleep last > night, but I was too scared to even suggest it. It's pretty much how > MergeAppend would cost it anyway. I agree it's not pretty to lie > about the startup cost, but it does kinda seem silly to fall back on a > more expensive MergeAppend when we know fine well Append is cheaper. Yeah. I'm starting to think that this might actually be the way to go, and here's why: my argument here is basically that a child plan that has a large startup cost is going to screw up our ability to estimate whether the parent Append is really a fast-start plan or not. Now, if we have to insert a Sort to make a child plan be correctly ordered, that clearly is a case where the child could have a large startup cost ... but what if a correctly-ordered child has a large startup cost for some other reason? Simply refusing to insert Sort nodes won't keep us out of the weeds if that's true. However, if we stick in a hack like the one suggested above, that will keep us from being too optimistic about the fast-start properties of the Append node no matter whether the problem arises from an added Sort node or is intrinsic to the child plan. It may well be that as things stand today, this scenario is only hypothetical, because we can only prove that a plain-Append plan is correctly sorted if it's arising from a suitably partitioned table, and the child plans in such cases will all be IndexScans with minimal startup cost. But we should look ahead to scenarios where that's not true. (mumble maybe a foreign table as a partition is already a counterexample? mumble) regards, tom lane
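A hedged sketch of the foreign-table case Tom alludes to, assuming postgres_fdw and using invented server and table names:

CREATE TABLE measurements (ts timestamptz NOT NULL, val int) PARTITION BY RANGE (ts);
CREATE TABLE measurements_2018 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
-- A remote partition: even a child path that is already correctly ordered may
-- carry a noticeable startup cost here (connection setup, a sort pushed to the
-- remote side), with no local Sort node in the plan to betray it.
CREATE FOREIGN TABLE measurements_2019 PARTITION OF measurements
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01')
    SERVER archive_srv OPTIONS (table_name 'measurements_2019');

In such a case, simply refusing to add Sort nodes would not protect the Append's startup-cost estimate, which is the point of the argument above.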
On Wed, Dec 19, 2018 at 3:01 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Thu, 20 Dec 2018 at 01:58, Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > The multi-level partitioning case is another > > thing that would need to be handled for instance (and that's the main > > reason I couldn't submit a new patch when I was working on it), and > > I'm definitely not arguing to cover it in this patch. > > As far as I'm aware, the multi-level partitioning should work just > fine with the current patch. I added code for that a while ago. There > are regression tests to exercise it. I'm not aware of any cases where > it does not work. Sorry to come back to this so late. What I was mentioning about sub-partitioning is that when a whole partition hierarchy is natively ordered, we could avoid generating MergeAppends. But unless I'm missing something with your patch, that won't happen. Considering: CREATE TABLE nested (id1 integer, id2 integer, val text) PARTITION BY LIST (id1); CREATE TABLE nested_1 PARTITION OF nested FOR VALUES IN (1) PARTITION BY RANGE (id2); CREATE TABLE nested_1_1 PARTITION OF nested_1 FOR VALUES FROM (1) TO (100000); CREATE TABLE nested_1_2 PARTITION OF nested_1 FOR VALUES FROM (100000) TO (200000); CREATE TABLE nested_1_3 PARTITION OF nested_1 FOR VALUES FROM (200000) TO (300000); CREATE TABLE nested_2 PARTITION OF nested FOR VALUES IN (2) PARTITION BY RANGE (id2); CREATE TABLE nested_2_1 PARTITION OF nested_2 FOR VALUES FROM (1) TO (100000); CREATE TABLE nested_2_2 PARTITION OF nested_2 FOR VALUES FROM (100000) TO (200000); CREATE TABLE nested_2_3 PARTITION OF nested_2 FOR VALUES FROM (200000) TO (300000); CREATE INDEX ON nested(id1, id2); ISTM that a query like SELECT * FROM nested ORDER BY 1, 2; could simply append all the partitions in the right order (or generate a tree of ordered appends), but: QUERY PLAN ------------------------------------------------------------------- Append -> Merge Append Sort Key: nested_1_1.id1, nested_1_1.id2 -> Index Scan using nested_1_1_id1_id2_idx on nested_1_1 -> Index Scan using nested_1_2_id1_id2_idx on nested_1_2 -> Index Scan using nested_1_3_id1_id2_idx on nested_1_3 -> Merge Append Sort Key: nested_2_1.id1, nested_2_1.id2 -> Index Scan using nested_2_1_id1_id2_idx on nested_2_1 -> Index Scan using nested_2_2_id1_id2_idx on nested_2_2 -> Index Scan using nested_2_3_id1_id2_idx on nested_2_3 (11 rows) Also, a query like SELECT * FROM nested_1 ORDER BY 1, 2; could generate an append path, since the first column is guaranteed to be identical in all partitions, but instead: QUERY PLAN ------------------------------------------------------------- Merge Append Sort Key: nested_1_1.id1, nested_1_1.id2 -> Index Scan using nested_1_1_id1_id2_idx on nested_1_1 -> Index Scan using nested_1_2_id1_id2_idx on nested_1_2 -> Index Scan using nested_1_3_id1_id2_idx on nested_1_3 (5 rows) and of course # EXPLAIN (costs off) SELECT * FROM nested_1 ORDER BY 2; QUERY PLAN ------------------------------------ Sort Sort Key: nested_1_1.id2 -> Append -> Seq Scan on nested_1_1 -> Seq Scan on nested_1_2 -> Seq Scan on nested_1_3 (6 rows) I admit that I didn't re-read the whole thread, so maybe I'm missing something (if that's the case my apologies, and feel free to point me to any relevant discussion). I'm just trying to make sure that we don't miss some cases, as these seem possible and useful to handle. Or is that out of scope?
On Sun, 24 Mar 2019 at 05:16, Julien Rouhaud <rjuju123@gmail.com> wrote: > ISTM that a query like > SELECT * FROM nested ORDER BY 1, 2; > could simply append all the partitions in the right order (or generate > a tree of ordered appends), but: > > QUERY PLAN > ------------------------------------------------------------------- > Append > -> Merge Append > Sort Key: nested_1_1.id1, nested_1_1.id2 > -> Index Scan using nested_1_1_id1_id2_idx on nested_1_1 > -> Index Scan using nested_1_2_id1_id2_idx on nested_1_2 > -> Index Scan using nested_1_3_id1_id2_idx on nested_1_3 > -> Merge Append > Sort Key: nested_2_1.id1, nested_2_1.id2 > -> Index Scan using nested_2_1_id1_id2_idx on nested_2_1 > -> Index Scan using nested_2_2_id1_id2_idx on nested_2_2 > -> Index Scan using nested_2_3_id1_id2_idx on nested_2_3 > (11 rows) > > > Also, a query like > SELECT * FROM nested_1 ORDER BY 1, 2; > could generate an append path, since the first column is guaranteed to > be identical in all partitions, but instead: > > QUERY PLAN > ------------------------------------------------------------- > Merge Append > Sort Key: nested_1_1.id1, nested_1_1.id2 > -> Index Scan using nested_1_1_id1_id2_idx on nested_1_1 > -> Index Scan using nested_1_2_id1_id2_idx on nested_1_2 > -> Index Scan using nested_1_3_id1_id2_idx on nested_1_3 > (5 rows) > > and of course > > # EXPLAIN (costs off) SELECT * FROM nested_1 ORDER BY 2; > QUERY PLAN > ------------------------------------ > Sort > Sort Key: nested_1_1.id2 > -> Append > -> Seq Scan on nested_1_1 > -> Seq Scan on nested_1_2 > -> Seq Scan on nested_1_3 > (6 rows) I think both these cases could be handled, but I think the way it would likely have to be done would be to run the partition constraints through equivalence class processing. Likely doing that would need some new field in EquivalenceClass that indicated that the eclass did not need to be applied to the partition. If it was done that way then pathkey_is_redundant() would be true for the id1 column's pathkey in the sub-partitioned tables. The last plan you show above could also use an index scan, since build_index_pathkeys() would also find the pathkey redundant. Doing this would also mean that a query like "select * from nested_1_1 where id1 = 1;" would not apply "id1 = 1" as a base qual to the partition, since the partition constraint already guarantees it. That's good for 2 reasons. 1) No wasted effort filtering rows that always match; and 2) A Seq Scan can be used instead of the planner possibly thinking that an index scan might be useful to filter rows. Stats might tell the planner that anyway... but... I suggested some changes to equivalence classes a few years ago in [1] and I failed to get that idea off the ground. In some ways, this is similar, as it requires having equivalence classes that are not used in all cases. I think getting something working a week before code cutoff is a step too far for this, but certainly, it would be interesting to look into fixing it in a later release. [1] https://www.postgresql.org/message-id/flat/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A%40mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 23 Mar 2019 at 19:42, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <david.rowley@2ndquadrant.com> writes: > > On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> BTW, another thing we could possibly do to answer this objection is to > >> give the ordered-Append node an artificially pessimistic startup cost, > >> such as the sum or the max of its children's startup costs. That's > >> pretty ugly and unprincipled, but maybe it's better than not having the > >> ability to generate the plan shape at all? > > > I admit to having thought of that while trying to get to sleep last > > night, but I was too scared to even suggest it. It's pretty much how > > MergeAppend would cost it anyway. I agree it's not pretty to lie > > about the startup cost, but it does kinda seem silly to fall back on a > > more expensive MergeAppend when we know fine well Append is cheaper. > > Yeah. I'm starting to think that this might actually be the way to go, Here's a version with it done that way. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Sun, Mar 24, 2019 at 11:06 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Sat, 23 Mar 2019 at 19:42, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > David Rowley <david.rowley@2ndquadrant.com> writes: > > > On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> BTW, another thing we could possibly do to answer this objection is to > > >> give the ordered-Append node an artificially pessimistic startup cost, > > >> such as the sum or the max of its children's startup costs. That's > > >> pretty ugly and unprincipled, but maybe it's better than not having the > > >> ability to generate the plan shape at all? > > > > > I admit to having thought of that while trying to get to sleep last > > > night, but I was too scared to even suggest it. It's pretty much how > > > MergeAppend would cost it anyway. I agree it's not pretty to lie > > > about the startup cost, but it does kinda seem silly to fall back on a > > > more expensive MergeAppend when we know fine well Append is cheaper. > > > > Yeah. I'm starting to think that this might actually be the way to go, > > Here's a version with it done that way. FTR this patch doesn't apply since single child [Merge]Append suppression (8edd0e7946) has been pushed.
On Tue, 26 Mar 2019 at 09:02, Julien Rouhaud <rjuju123@gmail.com> wrote: > FTR this patch doesn't apply since single child [Merge]Append > suppression (8edd0e7946) has been pushed. Thanks for letting me know. I've attached v14 based on current master. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Tue, Mar 26, 2019 at 3:13 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Tue, 26 Mar 2019 at 09:02, Julien Rouhaud <rjuju123@gmail.com> wrote: > > FTR this patch doesn't apply since single child [Merge]Append > > suppression (8edd0e7946) has been pushed. > > Thanks for letting me know. I've attached v14 based on current master. Thanks! So, AFAICT everything works as intended; I don't see any problem in the code, and the special costing heuristic should avoid dramatic plans. A few, mostly nitpicking, comments: + if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) && + partitions_are_ordered(root, rel)) shouldn't the test be IS_PARTITIONED_REL(rel) instead of testing part_scheme? I'm thinking of 1d33858406 and related discussions. + * partitions_are_ordered + * For the partitioned table given in 'partrel', returns true if the + * partitioned table guarantees that tuples which sort earlier according + * to the partition bound are stored in an earlier partition. Returns + * false this is not possible, or if we have insufficient means to prove + * it. [...] + * partkey_is_bool_constant_for_query + * + * If a partition key column is constrained to have a constant value by the + * query's WHERE conditions, then we needn't take the key into consideration + * when checking if scanning partitions in order can't cause lower-order + * values to appear in later partitions. Maybe it's because I'm not a native English speaker, but I had to read those comments multiple times. I'd also add a note about default_partition to the partitions_are_ordered comments (even though the function is super short). + if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts) + return false; there are a few overly long lines, maybe a pgindent run would be useful. + * build_partition_pathkeys + * Build a pathkeys list that describes the ordering induced by the + * partitions of 'partrel'. (Callers must ensure that this partitioned + * table guarantees that lower order tuples never will be found in a + * later partition.). Sets *partialkeys to false if pathkeys were only + * built for a prefix of the partition key, otherwise sets it to true. + */ +List * +build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel, + ScanDirection scandir, bool *partialkeys) Maybe also add an assert on partitions_are_ordered()? And finally, should this optimisation be mentioned in ddl.sgml (or somewhere else)?
On Wed, 27 Mar 2019 at 19:48, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Sorry if this was discussed before, but why does this patch add any new > code to partprune.c? AFAICT, there's no functionality changes to the > pruning code. You're right. It probably shouldn't be there. There's a bit of a lack of a good home for partition code relating to the planner it seems. > seem like their logic is specialized enough to be confined to pathkeys.c, > only because it's needed there. Yeah maybe. > Regarding > > +bool > +partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel) > > I think this could simply be: > > bool > partitions_are_ordered(PartitionBoundInfo *boundinfo) > > and be defined in partitioning/partbounds.c. If you think any future > modifications to this will require access to the partition key info in > PartitionScheme, maybe the following is fine: > > bool > partitions_are_ordered(RelOptInfo *partrel) It does need to know how many partitions the partitioned table has, which it gets from partrel->nparts, so yeah, RelOptInfo is probably needed. I don't think passing in int nparts is a good solution to that. The problem with moving it to partbounds.c is that nothing there knows about RelOptInfo currently. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, 27 Mar 2019 at 21:24, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Noticed a typo. > > + * multiple subpaths then we can't make guarantees about the > + * order tuples in those subpaths, so we must leave the > > order of tuples? I'll fix that. Thanks. > Again, sorry if this was discussed, but I got curious about why > partitions_are_ordered() thinks it can say true even for an otherwise > ordered list-partitioned table, but containing a null-only partition, > which is *always* scanned last. If a query says ORDER BY partkey NULLS > FIRST, then it's not alright to proceed with assuming partitions are > ordered even if partitions_are_ordered() said so. If it's *always* scanned last then it's fine for ORDER BY partkey NULLS LAST. If they have ORDER BY partkey NULLS FIRST then we won't match on the pathkeys. Or if they do ORDER BY partkey DESC NULLS FIRST, then we're also fine, since we reverse the order of the subpaths list in that case. ORDER BY partkey DESC NULLS LAST is not okay, and we don't optimise that since it won't match the pathkeys we generate in build_partition_pathkeys(). > Related, why does build_partition_pathkeys() pass what it does for > nulls_first parameter of make_pathkey_from_sortinfo()? > > cpathkey = make_pathkey_from_sortinfo(root, > keyCol, > NULL, > partscheme->partopfamily[i], > partscheme->partopcintype[i], > partscheme->partcollation[i], > ScanDirectionIsBackward(scandir), > ==> ScanDirectionIsBackward(scandir), > 0, > partrel->relids, > false); > > I think null values are almost always placed in the last partition, unless > the null-accepting list partition also accepts some other non-null value. > I'm not sure exactly how we can determine the correct value to pass here, > but what's there in the patch now doesn't seem to be it. The code looks okay to me. It'll generate pathkeys like ORDER BY partkey NULLS LAST for forward scans and ORDER BY partkey DESC NULLS FIRST for backwards scans. Can you explain what cases you think the code gets wrong? Here's a preview of the actual and expected behaviour: # explain (costs off) select * from listp order by a asc nulls last; QUERY PLAN ------------------------------------------------------------ Append -> Index Only Scan using listp1_a_idx on listp1 -> Index Only Scan using listp2_a_idx on listp2 -> Index Only Scan using listp_null_a_idx on listp_null (4 rows) # explain (costs off) select * from listp order by a asc nulls first; -- not optimised QUERY PLAN ------------------------------------ Sort Sort Key: listp1.a NULLS FIRST -> Append -> Seq Scan on listp1 -> Seq Scan on listp2 -> Seq Scan on listp_null (6 rows) # explain (costs off) select * from listp order by a desc nulls first; -- subpath list is simply reversed in this case. QUERY PLAN --------------------------------------------------------------------- Append -> Index Only Scan Backward using listp_null_a_idx on listp_null -> Index Only Scan Backward using listp2_a_idx on listp2 -> Index Only Scan Backward using listp1_a_idx on listp1 (4 rows) # explain (costs off) select * from listp order by a desc nulls last; -- not optimised QUERY PLAN -------------------------------------- Sort Sort Key: listp1.a DESC NULLS LAST -> Append -> Seq Scan on listp1 -> Seq Scan on listp2 -> Seq Scan on listp_null (6 rows) We could likely improve the two cases that are not optimized by putting the NULL partition in the correct place in the append subpaths, but for now, we don't really have an efficient means to identify which subpath that is. 
I've not looked at your partition planning improvements patch for a while to see if you're storing a Bitmapset of the non-pruned partitions in RelOptInfo. Something like that would allow us to make this better. Julien and I have talked about other possible cases to optimise if we have that. e.g if the default partition is pruned then we can optimise a RANGE partitioned table with a default. So there's definitely more to be done on this. I think there's a general consensus that what we're doing in the patch already is enough to be useful. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
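The listp table used in the plans above is not defined anywhere in the message; a setup matching the partition and index names shown might look like this (the exact list values are a guess):

CREATE TABLE listp (a int) PARTITION BY LIST (a);
CREATE TABLE listp1 PARTITION OF listp FOR VALUES IN (1);
CREATE TABLE listp2 PARTITION OF listp FOR VALUES IN (2);
CREATE TABLE listp_null PARTITION OF listp FOR VALUES IN (NULL);
CREATE INDEX ON listp (a);  -- cascades to listp1_a_idx, listp2_a_idx, listp_null_a_idx

With a layout like this, NULLs can only appear in listp_null, which PartitionBoundInfo lists last, matching the ORDER BY ... NULLS LAST and DESC NULLS FIRST plans shown above.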
Hi, On 2019/03/28 7:29, David Rowley wrote: > On Wed, 27 Mar 2019 at 19:48, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Sorry if this was discussed before, but why does this patch add any new >> code to partprune.c? AFAICT, there's no functionality changes to the >> pruning code. > > You're right. It probably shouldn't be there. There's a bit of a lack > of a good home for partition code relating to the planner it seems. partprune.c is outside the optimizer sub-directory, but exports planning-related functions like prune_append_rel_partitions(), make_partition_pruneinfo(), etc. Similarly, partbounds.c can grow a bit of planning-related functionality. >> Regarding >> >> +bool >> +partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel) >> >> I think this could simply be: >> >> bool >> partitions_are_ordered(PartitionBoundInfo *boundinfo) >> >> and be defined in partitioning/partbounds.c. If you think any future >> modifications to this will require access to the partition key info in >> PartitionScheme, maybe the following is fine: >> >> bool >> partitions_are_ordered(RelOptInfo *partrel) > > It does need to know how many partitions the partitioned table has, > which it gets from partrel->nparts, so yeah, RelOptInfo is probably > needed. I don't think passing in int nparts is a good solution to > that. The problem with moving it to partbounds.c is that nothing > there knows about RelOptInfo currently. Maybe, this could be a start. Also, there is a patch in a nearby thread which adds additional functionality to partbounds.c to be used by partitionwise join code in the optimizer [1]. Thanks, Amit [1] https://commitfest.postgresql.org/22/1553/
On Thu, 28 Mar 2019 at 14:34, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > On 2019/03/28 7:29, David Rowley wrote: > > On Wed, 27 Mar 2019 at 19:48, Amit Langote > > It does need to know how many partitions the partitioned table has, > > which it gets from partrel->nparts, so yeah, RelOptInfo is probably > > needed. I don't think passing in int nparts is a good solution to > > that. The problem with moving it to partbounds.c is that nothing > > there knows about RelOptInfo currently. > > Maybe, this could be a start. Also, there is a patch in nearby thread > which adds additional functionality to partbounds.c to be used by > partitionwise join code in the optimizer [1]. Thanks for the review. I've attached a patch that mostly just moves the code around. I also changed the comment in build_partition_pathkeys() to explain the nulls_first argument and why I just pass ScanDirectionIsBackward(scandir). I also added another comment in struct PlannerInfo mentioning the guarantee that partitions defined earlier in the PartitionDesc appear earlier in append_rel_list. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, 28 Mar 2019 at 15:40, David Rowley <david.rowley@2ndquadrant.com> wrote: > Thanks for the review. I've attached a patch that mostly just moved > the code around. Here's one that fixes up the compiler warning from the last one. Thanks CF bot... -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi David, On 2019/03/28 8:04, David Rowley wrote: > On Wed, 27 Mar 2019 at 21:24, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Noticed a typo. >> >> + * multiple subpaths then we can't make guarantees about the >> + * order tuples in those subpaths, so we must leave the >> >> order of tuples? > > I'll fix that. Thanks. > >> Again, sorry if this was discussed, but I got curious about why >> partitions_are_ordered() thinks it can say true even for an otherwise >> ordered list-partitioned table, but containing a null-only partition, >> which is *always* scanned last. If a query says ORDER BY partkey NULLS >> FIRST, then it's not alright to proceed with assuming partitions are >> ordered even if partitions_are_ordered() said so. > > If it's *always* scanned last then it's fine for ORDER BY partkey > NULLS LAST. If they have ORDER BY partkey NULLS FIRST then we won't > match on the pathkeys. Sorry, I had meant to say that null values may or may not appear in the last partition depending on how the null-accepting partition is defined. I see that the code in partitions_are_ordered() correctly returns false if null partition is not the last one, for example, for: create table p (a int) partition by list (a); create table p1 partition of p for values in (1); create table p2_null partition of p for values in (2, null); create table p3 partition of p for values in (3); Maybe, a small comment regarding how that works correctly would be nice. > Or if they do ORDER BY partkey DESC NULLS FIRST, then we're also fine, > since we reverse the order of the subpaths list in that case. ORDER > BY partkey DESC NULLS LAST is not okay, and we don't optimise that > since it won't match the pathkeys we generate in > build_partition_pathkeys(). OK. >> Related, why does build_partition_pathkeys() pass what it does for >> nulls_first parameter of make_pathkey_from_sortinfo()? >> >> cpathkey = make_pathkey_from_sortinfo(root, >> keyCol, >> NULL, >> partscheme->partopfamily[i], >> partscheme->partopcintype[i], >> partscheme->partcollation[i], >> ScanDirectionIsBackward(scandir), >> ==> ScanDirectionIsBackward(scandir), >> 0, >> partrel->relids, >> false); >> >> I think null values are almost always placed in the last partition, unless >> the null-accepting list partition also accepts some other non-null value. >> I'm not sure exactly how we can determine the correct value to pass here, >> but what's there in the patch now doesn't seem to be it. > > The code looks okay to me. It'll generate pathkeys like ORDER BY > partkey NULLS LAST for forward scans and ORDER BY partkey DESC NULLS > FIRST for backwards scans. > > Can you explain what cases you think the code gets wrong? > > Here's a preview of the actual and expected behaviour: > > # explain (costs off) select * from listp order by a asc nulls last; > QUERY PLAN > ------------------------------------------------------------ > Append > -> Index Only Scan using listp1_a_idx on listp1 > -> Index Only Scan using listp2_a_idx on listp2 > -> Index Only Scan using listp_null_a_idx on listp_null > (4 rows) > > > # explain (costs off) select * from listp order by a asc nulls first; > -- not optimised > QUERY PLAN > ------------------------------------ > Sort > Sort Key: listp1.a NULLS FIRST > -> Append > -> Seq Scan on listp1 > -> Seq Scan on listp2 > -> Seq Scan on listp_null > (6 rows) > > # explain (costs off) select * from listp order by a desc nulls first; > -- subpath list is simply reversed in this case. 
> QUERY PLAN > --------------------------------------------------------------------- > Append > -> Index Only Scan Backward using listp_null_a_idx on listp_null > -> Index Only Scan Backward using listp2_a_idx on listp2 > -> Index Only Scan Backward using listp1_a_idx on listp1 > (4 rows) > > > # explain (costs off) select * from listp order by a desc nulls last; > -- not optimised > QUERY PLAN > -------------------------------------- > Sort > Sort Key: listp1.a DESC NULLS LAST > -> Append > -> Seq Scan on listp1 > -> Seq Scan on listp2 > -> Seq Scan on listp_null > (6 rows) Ah, everything seems to be working correctly. Thanks for the explanation and sorry about the noise. > We could likely improve the two cases that are not optimized by > putting the NULL partition in the correct place in the append > subpaths, but for now, we don't really have an efficient means to > identify which subpath that is. I've not looked at your partition > planning improvements patch for a while to see if you're storing a > Bitmapset of the non-pruned partitions in RelOptInfo. Something like > that would allow us to make this better. Julien and I have talked > about other possible cases to optimise if we have that. e.g if the > default partition is pruned then we can optimise a RANGE partitioned > table with a default. So there's definitely more to be done on this. I > think there's a general consensus that what we're doing in the patch > already is enough to be useful. Certainly. When trying out various combinations of ORDER BY ASC/DESC NULL FIRST/LAST yesterday, I wrongly thought the plan came out wrong in one or two cases, but now see that that's not the case. Also, the comment you added in the latest patch sheds some light on the matter, so that helps too. Thanks, Amit
On Fri, 29 Mar 2019 at 00:00, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > On 2019/03/28 8:04, David Rowley wrote: > > If it's *always* scanned last then it's fine for ORDER BY partkey > > NULLS LAST. If they have ORDER BY partkey NULLS FIRST then we won't > > match on the pathkeys. > > Sorry, I had meant to say that null values may or may not appear in the > last partition depending on how the null-accepting partition is defined. > I see that the code in partitions_are_ordered() correctly returns false if > null partition is not the last one, for example, for: > > create table p (a int) partition by list (a); > create table p1 partition of p for values in (1); > create table p2_null partition of p for values in (2, null); > create table p3 partition of p for values in (3); > > Maybe, a small comment regarding how that works correctly would be nice. hmm, but there is a comment. It says: /* * LIST partitions can also guarantee ordering, but we'd need to * ensure that partitions don't allow interleaved values. We * could likely check for this looking at each partition, in * order, and checking which Datums are accepted. If we find a * Datum in a partition that's greater than one previously already * seen, then values could become out of order and we'd have to * disable the optimization. For now, let's just keep it simple * and just accept LIST partitions without a DEFAULT partition * which only accept a single Datum per partition. This is cheap * as it does not require any per-partition processing. Maybe * we'd like to handle more complex cases in the future. */ Your example above breaks the "don't allow interleaved values" and "just accept LIST partitions without a DEFAULT partition which only accept a single Datum per partition." Do you think I need to add something like "and if there is a NULL partition, that it only accepts NULL values"? I think that's implied already, but if you think it's confusing then maybe it's worth adding something along those lines. David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
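For illustration (not taken from the patch or its regression tests), a LIST layout that satisfies the rule in the quoted comment — no DEFAULT partition, a single non-null Datum per partition, and NULLs only by themselves in one partition — would be:

CREATE TABLE listq (a int) PARTITION BY LIST (a);
CREATE TABLE listq1 PARTITION OF listq FOR VALUES IN (1);
CREATE TABLE listq2 PARTITION OF listq FOR VALUES IN (2);
CREATE TABLE listq_null PARTITION OF listq FOR VALUES IN (NULL);

whereas a partition accepting several values, e.g. FOR VALUES IN (1, 3), or Amit's earlier example where one partition accepts both 2 and NULL, makes the simple ndatums-based check quoted earlier in the review return false.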
Hi, On 2019/03/29 7:59, David Rowley wrote: > On Fri, 29 Mar 2019 at 00:00, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >> On 2019/03/28 8:04, David Rowley wrote: >>> If it's *always* scanned last then it's fine for ORDER BY partkey >>> NULLS LAST. If they have ORDER BY partkey NULLS FIRST then we won't >>> match on the pathkeys. >> >> Sorry, I had meant to say that null values may or may not appear in the >> last partition depending on how the null-accepting partition is defined. >> I see that the code in partitions_are_ordered() correctly returns false if >> null partition is not the last one, for example, for: >> >> create table p (a int) partition by list (a); >> create table p1 partition of p for values in (1); >> create table p2_null partition of p for values in (2, null); >> create table p3 partition of p for values in (3); >> >> Maybe, a small comment regarding how that works correctly would be nice. > > hmm, but there is a comment. It says: > > /* > * LIST partitions can also guarantee ordering, but we'd need to > * ensure that partitions don't allow interleaved values. We > * could likely check for this looking at each partition, in > * order, and checking which Datums are accepted. If we find a > * Datum in a partition that's greater than one previously already > * seen, then values could become out of order and we'd have to > * disable the optimization. For now, let's just keep it simple > * and just accept LIST partitions without a DEFAULT partition > * which only accept a single Datum per partition. This is cheap > * as it does not require any per-partition processing. Maybe > * we'd like to handle more complex cases in the future. > */ > > Your example above breaks the "don't allow interleaved values" and > "just accept LIST partitions without a DEFAULT partition which only > accept a single Datum per partition." > > Do you think I need to add something like "and if there is a NULL > partition, that it only accepts NULL values"? I think that's implied > already, but if you think it's confusing then maybe it's worth adding > something along those lines. How about extending the sentence about when the optimization is disabled as follows: "If we find a Datum in a partition that's greater than one previously already seen, then values could become out of order and we'd have to disable the optimization. We'd also need to disable the optimization if NULL values are interleaved with other Datum values, because the calling code expect them to be present in the last partition." Further, extend the "For now..." sentence as: "For now, let's just keep it simple and just accept LIST partitioned table without a DEFAULT partition where each partition only accepts a single Datum or NULL. It's OK to always accept NULL partition in that case, because PartitionBoundInfo lists it as the last partition." Thanks, Amit
On Tue, 2 Apr 2019 at 14:26, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > How about extending the sentence about when the optimization is disabled > as follows: > > "If we find a Datum in a partition that's greater than one previously > already seen, then values could become out of order and we'd have to > disable the optimization. We'd also need to disable the optimization if > NULL values are interleaved with other Datum values, because the calling > code expect them to be present in the last partition." > > Further, extend the "For now..." sentence as: > > "For now, let's just keep it simple and just accept LIST partitioned table > without a DEFAULT partition where each partition only accepts a single > Datum or NULL. It's OK to always accept NULL partition in that case, > because PartitionBoundInfo lists it as the last partition." I ended up rewording the entire thing and working on the header comment for the function too. I think previously it wasn't that well defined what "ordered" meant. I added a mention that we expect that NULLs, if the partitioning allows them at all, must come in the last partition. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi David, On Tue, Apr 2, 2019 at 8:49 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > I ended up rewording the entire thing and working on the header > comment for the function too. I think previously it wasn't that well > defined what "ordered" meant. I added a mention that we expect that > NULLs, if possible must come in the last partition. Thanks for the updated patch. New descriptions look good, although I was amused by this: diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c index bdd0d23854..9dd378d7a0 100644 --- a/src/backend/partitioning/partbounds.c +++ b/src/backend/partitioning/partbounds.c @@ -25,6 +25,7 @@ #include "miscadmin.h" #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" +#include "nodes/pathnodes.h" ... +partitions_are_ordered(struct RelOptInfo *partrel) Maybe, "struct" is unnecessary? Thanks, Amit
On Wed, 3 Apr 2019 at 01:26, Amit Langote <amitlangote09@gmail.com> wrote: > +#include "nodes/pathnodes.h" > ... > +partitions_are_ordered(struct RelOptInfo *partrel) > > Maybe, "struct" is unnecessary? I just left it there so that the signature matched the header file. Looking around for examples I see make_partition_pruneinfo() has the structs only in the header file, so I guess that is how we do things; I've changed to that in the attached. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2019/04/03 3:10, David Rowley wrote: > On Wed, 3 Apr 2019 at 01:26, Amit Langote <amitlangote09@gmail.com> wrote: >> +#include "nodes/pathnodes.h" >> ... >> +partitions_are_ordered(struct RelOptInfo *partrel) >> >> Maybe, "struct" is unnecessary? > > I just left it there so that the signature matched the header file. > Looking around for examples I see make_partition_pruneinfo() has the > structs only in the header file, so I guess that is how we do things, > so changed to that in the attached. Ah, I see. Thanks for updating the patch. I don't have any more comments. Thanks, Amit
On Wed, 3 Apr 2019 at 14:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I don't have any more comments. Great. Many thanks for having a look at this. Going by [1], Julien also seems pretty happy, so I'm going to change this over to ready for committer. [1] https://www.postgresql.org/message-id/CAOBaU_YpTQbFqcP5jYJZETPL6mgYuTwVTVVBZKZKC6XDBTDkfQ@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On Wed, Apr 3, 2019 at 7:47 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Wed, 3 Apr 2019 at 14:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > I don't have any more comments. > > Great. Many thanks for having a look at this. Going by [1], Julien > also seems pretty happy, so I'm going to change this over to ready for > committer. Yes, I'm very sorry I didn't have time to come back to this thread yesterday. I was fine with the patch (I hadn't noticed the partprune.c part though), and I'm happy with Amit's further comments and this last patch, so +1 for me!
David Rowley <david.rowley@2ndquadrant.com> writes: > On Wed, 3 Apr 2019 at 14:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I don't have any more comments. > Great. Many thanks for having a look at this. Going by [1], Julien > also seems pretty happy, so I'm going to change this over to ready for > committer. Pushed with some hacking, mostly trying to improve the comments. The only really substantive thing I changed was that you'd missed teaching ExecSetTupleBound about descending through Append. Now that we can have plans that are Limit-above-Append-above-Sort, that's important. regards, tom lane
On Sat, 6 Apr 2019 at 12:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Pushed with some hacking, mostly trying to improve the comments. Great! Many thanks for improving those and pushing it. Many thanks to Julien and Antonin for their detailed reviews on this. Thanks, Amit, for your input on this as well. Much appreciated. > The only really substantive thing I changed was that you'd missed > teaching ExecSetTupleBound about descending through Append. > Now that we can have plans that are Limit-above-Append-above-Sort, > that's important. Oops. Wasn't aware of that until now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Apr 6, 2019 at 2:45 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Sat, 6 Apr 2019 at 12:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Pushed with some hacking, mostly trying to improve the comments. > > Great! Many thanks for improving those and pushing it. > > Many thanks to Julien, Antonin for their detailed reviews on this. > Thanks Amit for your input on this as well. Much appreciated. Thanks! I'm glad that we'll have this optimization for pg12.
On 2019/04/06 18:06, Julien Rouhaud wrote: > On Sat, Apr 6, 2019 at 2:45 AM David Rowley > <david.rowley@2ndquadrant.com> wrote: >> >> On Sat, 6 Apr 2019 at 12:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Pushed with some hacking, mostly trying to improve the comments. >> >> Great! Many thanks for improving those and pushing it. >> >> Many thanks to Julien, Antonin for their detailed reviews on this. >> Thanks Amit for your input on this as well. Much appreciated. > > Thanks! I'm glad that we'll have this optimization for pg12. +1, thank you for working on this. It's nice to see partitioning being useful to optimize query execution in even more cases. Regards, Amit