Thread: Ordered Partitioned Table Scans

Ordered Partitioned Table Scans

From
David Rowley
Date:
RANGE partitioning is quite a common way to partition time-series
data, and such tables tend to grow fairly large.  I thought
since we always store RANGE partitioned tables in the PartitionDesc in
ascending range order that it might be useful to make use of this and
when the required pathkeys match the order of the range, then we could
make use of an Append node instead of uselessly using a MergeAppend,
since the MergeAppend will just exhaust each subplan one at a time, in
order.

It does not seem very hard to implement this and it does not add much
in the way of additional processing to the planner.

Performance-wise, it seems to give a good boost to getting sorted
results from a partitioned table. I performed a quick test just on my
laptop with:

Setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
BY RANGE (id);
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (' || (x*100000)::text || ') TO (' ||
((x+1)*100000)::text || ');' from generate_Series(0,299) x;
\gexec
\o
INSERT INTO partbench SELECT x,1,2,3,4,5 from generate_Series(0,29999999) x;
create index on partbench (id);
vacuum analyze;

Test:
select * from partbench order by id limit 1 offset 29999999;

Results Patched:

Time: 4234.807 ms (00:04.235)
Time: 4237.928 ms (00:04.238)
Time: 4241.289 ms (00:04.241)
Time: 4234.030 ms (00:04.234)
Time: 4244.197 ms (00:04.244)
Time: 4266.000 ms (00:04.266)

Unpatched:

Time: 5917.288 ms (00:05.917)
Time: 5937.775 ms (00:05.938)
Time: 5911.146 ms (00:05.911)
Time: 5906.881 ms (00:05.907)
Time: 5918.309 ms (00:05.918)

(about 39% faster)

The implementation is fairly simple. One thing I don't like about it is
that I'd rather have build_partition_pathkeys() perform all the checks
needed to know whether the partitioned table should support a natural
pathkey; as of now, I have the calling code ensuring that there are no
sub-partitioned tables, since those could cause tuples to be output in
the wrong order.
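For illustration, here's a hypothetical schema (invented here, not from the patch's regression tests) showing the hazard: the top-level partitions are stored in ascending range order, but a sub-partitioned child may be partitioned on a different column entirely, so its leaf scans need not return tuples in the parent's partition key order.

```sql
-- Hypothetical example: t_p2 is itself partitioned by a different column,
-- so appending its leaves cannot guarantee output ordered by "id".
CREATE TABLE t (id bigint, ts timestamptz) PARTITION BY RANGE (id);
CREATE TABLE t_p1 PARTITION OF t FOR VALUES FROM (0) TO (100);
CREATE TABLE t_p2 PARTITION OF t FOR VALUES FROM (100) TO (200)
    PARTITION BY RANGE (ts);
CREATE TABLE t_p2_a PARTITION OF t_p2
    FOR VALUES FROM ('2018-01-01') TO ('2018-07-01');
CREATE TABLE t_p2_b PARTITION OF t_p2
    FOR VALUES FROM ('2018-07-01') TO ('2019-01-01');
-- Rows with id 100..199 are spread across t_p2_a and t_p2_b in no
-- particular "id" order.
```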

Does this idea seem like something we'd want?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Amit Langote
Date:
On 2018/10/26 11:50, David Rowley wrote:
> RANGE partitioning of time-series data is quite a common range to use
> partitioning, and such tables tend to grow fairly large.  I thought
> since we always store RANGE partitioned tables in the PartitionDesc in
> ascending range order that it might be useful to make use of this and
> when the required pathkeys match the order of the range, then we could
> make use of an Append node instead of uselessly using a MergeAppend,
> since the MergeAppend will just exhaust each subplan one at a time, in
> order.
> 
> It does not seem very hard to implement this and it does not add much
> in the way of additional processing to the planner.
> 
> Performance wise it seems to give a good boost to getting sorted
> results from a partitioned table. I performed a quick test just on my
> laptop with:
> 
> Setup:
> CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
> NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
> BY RANGE (id);
> select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
> FOR VALUES FROM (' || (x*100000)::text || ') TO (' ||
> ((x+1)*100000)::text || ');' from generate_Series(0,299) x;
> \gexec
> \o
> INSERT INTO partbench SELECT x,1,2,3,4,5 from generate_Series(0,29999999) x;
> create index on partbench (id);
> vacuum analyze;
> 
> Test:
> select * from partbench order by id limit 1 offset 29999999;
> 
> Results Patched:
> 
> Time: 4234.807 ms (00:04.235)
> Time: 4237.928 ms (00:04.238)
> Time: 4241.289 ms (00:04.241)
> Time: 4234.030 ms (00:04.234)
> Time: 4244.197 ms (00:04.244)
> Time: 4266.000 ms (00:04.266)
> 
> Unpatched:
> 
> Time: 5917.288 ms (00:05.917)
> Time: 5937.775 ms (00:05.938)
> Time: 5911.146 ms (00:05.911)
> Time: 5906.881 ms (00:05.907)
> Time: 5918.309 ms (00:05.918)
> 
> (about 39% faster)
> 
> The implementation is fairly simple. One thing I don't like about is
> I'd rather build_partition_pathkeys() performed all the checks to know
> if the partition should support a natural pathkey, but as of now, I
> have the calling code ensuring that there are no sub-partitioned
> tables. These could cause tuples to be output in the wrong order.
> 
> Does this idea seem like something we'd want?

Definitely!  Thanks for creating the patch.

I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
last year, but the partitioning-related planning code hadn't advanced then
as much as it has today, so they sort of postponed working on it.
Eventually their patch was returned with feedback last November.  Here's
the link to their email in case you want to read the comments their
proposal and patch received, although some of them might be obsolete.

https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop

Thanks,
Amit



Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
> last year, but the partitioning-related planning code hadn't advanced then
> as much as it has today, so they sort of postponed working on it.
> Eventually their patch was returned with feedback last November.  Here's
> the link to their email in case you wanted to read some comments their
> proposal and patch got, although some of them might be obsolete.
>
> https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop

Thanks. I wasn't aware of it, or ... forgot. It looks like those were
tricky times to be doing this. Hopefully, the dust has settled a little
bit now.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
Hi,

On Fri, Oct 26, 2018 at 6:40 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
> > last year, but the partitioning-related planning code hadn't advanced then
> > as much as it has today, so they sort of postponed working on it.
> > Eventually their patch was returned with feedback last November.  Here's
> > the link to their email in case you wanted to read some comments their
> > proposal and patch got, although some of them might be obsolete.
> >
> > https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop
>
> Thanks. I wasn't aware, or ... forgot. Looks like back then was tricky
> times to be doing this. Hopefully, the dust has settled a little bit
> now.

Yes, back then I unfortunately had limited time to work on that, and
I had to spend all of it rebasing the patch instead of working on the
various issues :(

Sadly, I have even less time now, but I'll try to look at your patch
this weekend!  As far as I remember, the biggest problem we had was
handling multi-level partitioning when the query is ordered by all
or a subset of the partition keys, and/or with a mix of ASC/DESC
clauses.  It also required some extra processing on the cost part for
queries that can be naturally ordered and contain a LIMIT clause,
since we can estimate how many partitions will have to be scanned.


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
Hi,

On Fri, Oct 26, 2018 at 1:01 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Fri, Oct 26, 2018 at 6:40 AM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> >
> > On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > > I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
> > > last year, but the partitioning-related planning code hadn't advanced then
> > > as much as it has today, so they sort of postponed working on it.
> > > Eventually their patch was returned with feedback last November.  Here's
> > > the link to their email in case you wanted to read some comments their
> > > proposal and patch got, although some of them might be obsolete.
> > >
> > > https://www.postgresql.org/message-id/2401607.SfZhPQhbS4%40ronan_laptop
> >
> > Thanks. I wasn't aware, or ... forgot. Looks like back then was tricky
> > times to be doing this. Hopefully, the dust has settled a little bit
> > now.
>
> As far as I remember, the biggest problems we had was
> to handle multi-level partitionning, when the query is ordered by all
> or a subset of the partition keys, and/or with a mix of ASC/DESC
> clauses.  It also required some extra processing on the cost part for
> queries that can be naturally ordered and contain a LIMIT clause,
> since we can estimate how many partitions will have to be scanned.

I just had a look at your patch.  I see that you implemented only a
subset of the possible optimizations (only the case of range
partitioning without subpartitions).  This has been previously
discussed, but we should be able to do a similar optimization for list
partitioning if there are no interleaved values, and also for some
cases of multi-level partitioning.
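As a hypothetical illustration (names and values invented here), interleaved list values are exactly what makes an ordered Append unsafe:

```sql
-- Interleaved LIST bounds: scanning the partitions in bound order yields
-- a = 1, 3, 2, 4 -- not sorted -- so a plain Append cannot replace a
-- MergeAppend for ORDER BY a.  With non-interleaved sets such as
-- IN (1, 2) and IN (3, 4) it would be safe.
CREATE TABLE lp (a int) PARTITION BY LIST (a);
CREATE TABLE lp_odd  PARTITION OF lp FOR VALUES IN (1, 3);
CREATE TABLE lp_even PARTITION OF lp FOR VALUES IN (2, 4);
```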

Concerning the implementation, there's at least one issue: it assumes
that each subpath of a range-partitioned table will be ordered, which
is not guaranteed.  You need to generate explicit Sort nodes (in the
same order as the query_pathkeys) for partitions that don't have an
ordered path, and make sure that this path is used in the Append.
Here's a simplistic case showing the issue (sorry, the partition names
are poorly chosen):

CREATE TABLE simple (id integer, val text) PARTITION BY RANGE (id);
CREATE TABLE simple_1_2 PARTITION OF simple FOR VALUES FROM (1) TO (100000);
CREATE TABLE simple_2_3 PARTITION OF simple FOR VALUES FROM (100000)
TO (200000);
CREATE TABLE simple_0_1 PARTITION OF simple FOR VALUES FROM (-100000) TO (1);

INSERT INTO simple SELECT id, 'line ' || id FROM
generate_series(-19999, 199999) id;

CREATE INDEX ON simple_1_2 (id);
CREATE INDEX ON simple_2_3 (id);

EXPLAIN SELECT * FROM simple ORDER BY id;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..7705.56 rows=219999 width=15)
   ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
   ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
   ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
(4 rows)

Also, if a LIMIT is specified, it should be able to give better
estimates, at least if there's no qual.  For instance:

EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.35 rows=10 width=15)
   ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
         ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
         ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
         ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
(5 rows)

In this case, we should estimate that the SeqScan (or, in a corrected
version, the Sort) node should not return more than 10 rows, that each
following partition should not need to be scanned at all, and cost each
path accordingly.  I think that this is quite important, for instance
to make sure that a natively sorted Append is chosen over a MergeAppend
when there are some subpaths with explicit sorts, because with the
Append we probably won't have to execute all the sorts if the previous
partition scans returned enough rows.

FWIW, both those cases were handled (probably with some bugs though)
in the previous patches Ronan and I sent some time ago.  Also, I did
not forget about this feature; I planned to work on it in the hope of
having it included in pg12.  However, I won't have a lot of time to
work on it before December.


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
Thanks for looking at this.

On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
> I just had a look at your patch.  I see that you implemented only a
> subset of the possible optimizations (only the case for range
> partitionoing without subpartitions).  This has been previously
> discussed, but we should be able to do similar optimization for list
> partitioning if there's no interleaved values, and also for some cases
> of multi-level partitioning.

I had thought about these cases but originally had thought they would
be more complex to implement than I could justify. On review, I've
found some pretty cheap ways to handle both sub-partitions and
LIST partitioned tables.  Currently, with LIST partitioned tables I've
coded it to only allow the optimisation if there's no DEFAULT
partition and all partitions are defined with exactly 1 Datum. This
guarantees that there are no interleaved values, but it'll just fail
to optimise cases like FOR VALUES IN (1,2) + FOR VALUES IN (3,4).   The
reason that I didn't go to the trouble of the additional checks was
that I don't really want to add any per-partition overhead to this.
If RelOptInfo had a Bitmapset of live partitions then we could just
check the partitions that survived pruning.  Amit Langote has a
pending patch which does that and some other useful stuff, so maybe we
can delay fixing that until the dust settles a bit in that area. Amit
and I are both working hard to remove all these per-partition
overheads. I imagine he'd also not be in favour of adding code that
does something for all partitions when we've pruned down to just 1.
I've personally no objection to doing the required additional
processing for the non-pruned partitions only. We could also then fix
the case where we disable the optimisation if there's a DEFAULT
partition, without any regard to whether it's been pruned or not.

> Concerning the implementation, there's at least one issue: it assumes
> that each subpath of a range-partitioned table will be ordered, with
> is not guaranteed.  You need to to generate explicit Sort nodes nodes
> (in the same order as the query_pathkey) for partitions that don't
> have an ordered path and make sure that this path is used in the
> Append.  Here's a simplistic case showing the issue (sorry, the
> partition names are poorly chosen):

Thanks for noticing this. I had been thrown off due to the fact that
Paths are never actually created for these sorts. On looking further I
see that we do checks during createplan to see if the path is
suitably sorted and just create a Sort node if it's not. This seems
to go against the whole point of paths, but I'm not going to fight for
changing it, so I've just done the Append the same way as MergeAppend
handles it.

> Also, if a LIMIT is specified, it should be able to give better
> estimates, at least if there's no qual.  For instance:
>
> EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
>                                               QUERY PLAN
> ------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..0.35 rows=10 width=15)
>    ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
>          ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
>          ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
>          ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
> (5 rows)
>
> In this case, we should estimate that the SeqScan (or in a corrected
> version the Sort) node should not return more than 10 rows, and each
> following partition should be scanned at all, and cost each path
> accordingly.  I think that this is quite important, for instance to
> make sure that natively sorted Append is chosen over a MergeAppend
> when there are some subpath with explicit sorts, because with the
> Append we probably won't have to execute all the sorts if the previous
> partition scans returned enough rows.

In my patch, I'm not adding any additional paths. I'm just adding an
Append instead of a MergeAppend.  For what you're talking about, the
limit only needs to be passed into any underlying Sort so that it can
become a top-N sort.  This is handled already in create_limit_path().
Notice in the plan you pasted above that the limit has a lower total
cost than its Append subnode. That's because create_limit_path()
weighted the Limit total cost based on the row count of the limit and
its subpath. ... 7705.56 / 219999 * 10 = ~0.35.

> FWIW, both those cases were handled (probably with some bugs though)
> in the previous patches Ronan and I sent some time ago.  Also, I did
> not forget about this feature, I planned to work on it in hope to have
> it included in pg12.  However, I won't have a lot of time to work on
> it before December.

I apologise for not noticing your patch. I only went as far as
checking the November commitfest to see if anything existed already
and I found nothing there.  I have time to work on this now, so likely
it's better if I continue, just in case your time in December does not
materialise.

v2 of the patch is attached. I've not had time yet to give it a
thorough post-write review, but on first look it seems okay.

The known limitations are:

* Disables the optimisation even if the DEFAULT partition is pruned.
* Disables the optimisation if LIST partitioned tables have any
partitions allowing > 1 value.
* Fails to optimise UNION ALLs with partitioned tables.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On 29 October 2018 at 13:44, David Rowley <david.rowley@2ndquadrant.com> wrote:
> v2 of the patch is attached. I've not had time yet to give it a
> throughout post write review, but on first look it seems okay.

Added to the November 'fest.

https://commitfest.postgresql.org/20/1850/

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Mon, Oct 29, 2018 at 1:44 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
> > I just had a look at your patch.  I see that you implemented only a
> > subset of the possible optimizations (only the case for range
> > partitionoing without subpartitions).  This has been previously
> > discussed, but we should be able to do similar optimization for list
> > partitioning if there's no interleaved values, and also for some cases
> > of multi-level partitioning.
>
> I had thought about these cases but originally had thought they would
> be more complex to implement than I could justify. On review, I've
> found some pretty cheap ways to handle both sub-partitions and for
> LIST partitioned tables.  Currently, with LIST partitioned tables I've
> coded it to only allow the optimisation if there's no DEFAULT
> partition and all partitions are defined with exactly 1 Datum. This
> guarantees that there are no interleaved values, but it'll just fail
> to optimise cases like FOR VALUES IN(1,2) + FOR VALUES In(3,4).   The
> reason that I didn't go to the trouble of the additional checks was
> that I don't really want to add any per-partition overhead to this.

I see, but the overhead you mention is because you're doing that check
during planning in build_partition_pathkeys().  As advised by
Robert quite some time ago
(https://www.postgresql.org/message-id/CA+TgmobOWgT1=zyjx-q=7s8akXNODix46qG0_-YX7K369P6ADA@mail.gmail.com),
we can store that information when the PartitionDesc is built, so
that it wouldn't be problematic.  Since checking for overlapping
values is straightforward with the BoundInfoData infrastructure, it'd
be a pity to miss this optimization in such cases, which I believe
would not be rare.

> If RelOptInfo had a Bitmapset of live partitions then we could just
> check the partitions that survived pruning.  Amit Langote has a
> pending patch which does that and some other useful stuff, so maybe we
> can delay fixing that until the dust settles a bit in that area. Amit
> and I are both working hard to remove all these per-partition
> overheads. I imagine he'd also not be in favour of adding code that
> does something for all partitions when we've pruned down to just 1.
> I've personally no objection to doing the required additional
> processing for the non-pruned partitions only. We could also then fix
> the case where we disable the optimisation if there's a DEFAULT
> partition without any regards to if it's been pruned or not.

Those are quite worthwhile enhancements, and being able to avoid a
MergeAppend if the problematic partitions have been pruned would be
great!  I didn't follow thoroughly all the discussions about the
various optimizations Amit and you are working on, but I don't think
they would be incompatible with a new flag and the possibility of
having the sorted Append with multi-valued list partitions?

>
> > Concerning the implementation, there's at least one issue: it assumes
> > that each subpath of a range-partitioned table will be ordered, with
> > is not guaranteed.  You need to to generate explicit Sort nodes nodes
> > (in the same order as the query_pathkey) for partitions that don't
> > have an ordered path and make sure that this path is used in the
> > Append.  Here's a simplistic case showing the issue (sorry, the
> > partition names are poorly chosen):
>
> Thanks for noticing this. I had been thrown off due to the fact that
> Paths are never actually created for these sorts. On looking further I
> see that we do checks during createplan to see if the path is
> suitability sorted and just create a sort node if it's not. This seems
> to go against the whole point of paths, but I'm not going to fight for
> changing it, so I've just done the Append the same way as MergeAppend
> handles it.

Yes, I had quite the same reaction when I saw how MergeAppend handles it.

>
> > Also, if a LIMIT is specified, it should be able to give better
> > estimates, at least if there's no qual.  For instance:
> >
> > EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
> >                                               QUERY PLAN
> > ------------------------------------------------------------------------------------------------------
> >  Limit  (cost=0.00..0.35 rows=10 width=15)
> >    ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
> >          ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
> >          ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
> >          ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
> > (5 rows)
> >
> > In this case, we should estimate that the SeqScan (or in a corrected
> > version the Sort) node should not return more than 10 rows, and each
> > following partition should be scanned at all, and cost each path
> > accordingly.  I think that this is quite important, for instance to
> > make sure that natively sorted Append is chosen over a MergeAppend
> > when there are some subpath with explicit sorts, because with the
> > Append we probably won't have to execute all the sorts if the previous
> > partition scans returned enough rows.
>
> In my patch, I'm not adding any additional paths. I'm just adding an
> Append instead of a MergeAppend.  For what you're talking about the
> limit only needs to be passed into any underlying Sort so that it can
> become a top-N sort.  This is handled already in create_limit_path().
> Notice in the plan you pasted above that the limit has a lower total
> cost than its Append subnode. That's because create_limit_path()
> weighted the Limit total cost based on the row count of the limit and
> its subpath. ... 7705.56 / 219999 * 10 = ~0.35.

Yes.  But the cost of the first partition in this example is wrong
since there was no additional sort on top of the seq scan.

However, I now realize that, as you said, what your patch does is to
generate an Append *instead* of a MergeAppend if the optimization was
possible.  So there can't be the problem of a MergeAppend chosen over
a cheaper Append in some cases, sorry for the noise.  I totally missed
that because when I worked on the same topic last year we had to
generate both Append and MergeAppend.  At that time Append were not
parallel-aware yet, so there could be faster parallel MergeAppend in
some cases.

> > FWIW, both those cases were handled (probably with some bugs though)
> > in the previous patches Ronan and I sent some time ago.  Also, I did
> > not forget about this feature, I planned to work on it in hope to have
> > it included in pg12.  However, I won't have a lot of time to work on
> > it before December.
>
> I apologise for not noticing your patch. I only went as far as
> checking the November commitfest to see if anything existed already
> and I found nothing there.

No worries, it's more than a year old now (I'm quite ashamed I didn't
come back to this sooner).

>  I have time to work on this now, so likely
> it's better if I continue, just in case your time in December does not
> materialise.

I entirely agree.

> v2 of the patch is attached. I've not had time yet to give it a
> throughout post write review, but on first look it seems okay.


I've registered as a reviewer.  I haven't had a deep look at
the patch yet, but thanks a lot for working on it!


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On 31 October 2018 at 12:24, Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Mon, Oct 29, 2018 at 1:44 AM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>>
>> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
>> > I just had a look at your patch.  I see that you implemented only a
>> > subset of the possible optimizations (only the case for range
>> > partitionoing without subpartitions).  This has been previously
>> > discussed, but we should be able to do similar optimization for list
>> > partitioning if there's no interleaved values, and also for some cases
>> > of multi-level partitioning.
>>
>> I had thought about these cases but originally had thought they would
>> be more complex to implement than I could justify. On review, I've
>> found some pretty cheap ways to handle both sub-partitions and for
>> LIST partitioned tables.  Currently, with LIST partitioned tables I've
>> coded it to only allow the optimisation if there's no DEFAULT
>> partition and all partitions are defined with exactly 1 Datum. This
>> guarantees that there are no interleaved values, but it'll just fail
>> to optimise cases like FOR VALUES IN(1,2) + FOR VALUES In(3,4).   The
>> reason that I didn't go to the trouble of the additional checks was
>> that I don't really want to add any per-partition overhead to this.
>
> I see, but the overhead you mention is because you're doing that check
> during the planning in build_partition_pathkeys().  As advised by
> Robert quite some time ago
> (https://www.postgresql.org/message-id/CA+TgmobOWgT1=zyjx-q=7s8akXNODix46qG0_-YX7K369P6ADA@mail.gmail.com),
>  we can store that information when the PartitionDesc is built, so
> that would it wouldn't be problematic.  Since checking for overlapping
> values is straightforward with the BoundInfoData infrastructure, it'd
> be a pity to miss this optimization in such cases, which I believe
> would not be rare.

Thanks for looking at this again.

I retrospectively read that thread after Amit mentioned your
patch. I just disagree with Robert about caching this flag.  The
reason is that if the flag is false due to some problematic partitions,
and we then go and prune those, we needlessly fail to optimise that
case.  I propose we come back and do the remaining optimisations for
interleaved LIST partitions and partitioned tables with DEFAULT
partitions later, once we have a new "live_parts" field in
RelOptInfo.  That way we can just check the live parts to ensure
they're compatible with the optimisation. If we get what's done
already in, then we're already a big step forward.

[...]

>> v2 of the patch is attached. I've not had time yet to give it a
>> throughout post write review, but on first look it seems okay.
>
>
> I've registered as a reviewer.   I still didn't have a deep look at
> the patch yet, but thanks a lot for working on it!

Thanks for signing up to review.  I need to send another revision of
the patch to add a missing call to truncate_useless_pathkeys(). Will
try to do that today.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On 31 October 2018 at 13:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
>> I've registered as a reviewer.   I still didn't have a deep look at
>> the patch yet, but thanks a lot for working on it!
>
> Thanks for signing up to review.  I need to send another revision of
> the patch to add a missing call to truncate_useless_pathkeys(). Will
> try to do that today.

I've attached a patch that removes the redundant pathkeys. This allows
cases like the following to work:

explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
                         QUERY PLAN
-------------------------------------------------------------
 Append
   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
         Index Cond: (a = 10)
   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
         Index Cond: (a = 10)
(5 rows)
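(For reference, a table shaped something like the following would produce that plan; this is my assumed definition for illustration, not the one from the patch's tests:)

```sql
-- Assumed multi-column range partitioning on (a, abs(b), c); rows with
-- a = 10 can fall on either side of the (10, 5, 10) boundary, hence the
-- two index scans in the plan above.
CREATE TABLE mcrparted (a int, b int, c int)
    PARTITION BY RANGE (a, abs(b), c);
CREATE TABLE mcrparted1 PARTITION OF mcrparted
    FOR VALUES FROM (1, 1, 1) TO (10, 5, 10);
CREATE TABLE mcrparted2 PARTITION OF mcrparted
    FOR VALUES FROM (10, 5, 10) TO (20, 10, 10);
CREATE INDEX ON mcrparted1 (a, abs(b), c);
CREATE INDEX ON mcrparted2 (a, abs(b), c);
```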

One thing that could work but currently does not is the case where
LIST partitions each allow just a single value; there we could allow
the Append to have pathkeys even if there are no indexes.  One way to
do this would be to add PathKeys to the seqscan path of each supporting
partition. However, that's adding code in another area so it likely
should be another patch.

This could allow cases like:

create table bool_rp (b bool) partition by list(b);
create table bool_rp_true partition of bool_rp for values in(true);
create table bool_rp_false partition of bool_rp for values in(false);
explain (costs off) select * from bool_rp order by b;
                            QUERY PLAN
------------------------------------------------------------------
 Append
   ->  Seq Scan on bool_rp_false
   ->  Seq Scan on bool_rp_true
(3 rows)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Antonin Houska
Date:
David Rowley <david.rowley@2ndquadrant.com> wrote:

> On 31 October 2018 at 13:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
> >>> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >> I've registered as a reviewer.   I still didn't have a deep look at
> >> the patch yet, but thanks a lot for working on it!
> >
> > Thanks for signing up to review.  I need to send another revision of
> > the patch to add a missing call to truncate_useless_pathkeys(). Will
> > try to do that today.
>
> I've attached a patch that ...

I've picked this one when looking around what I could review.

* As for the logic, I found generate_mergeappend_paths() to be the most
  interesting part:

Imagine table partitioned by "i", so "partition_pathkeys" is {i}.

partition 1:

i | j
--+--
0 | 0
1 | 1
0 | 1
1 | 0

partition 2:

i | j
--+--
3 | 0
2 | 0
2 | 1
3 | 1

Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
ordering of the subpaths should not change the way tuples are split into
partitions.

Obviously a problem is if "partition_pathkeys" and "pathkeys" lists start with
different items. To propose more generic rule, I used this example of
range-partitioned table, where "i" and "j" are the partitioning keys:

partition 1:

 i | j | k
---+---+---
 0 | 0 | 1
 0 | 0 | 0

partition 2:

 i | j | k
---+---+---
 0 | 1 | 0
 0 | 1 | 1

If the output "pathkey" is {i, k}, then the Append path makes rows of both
partitions interleave:

 i | j | k
---+---+---
 0 | 0 | 0
 0 | 1 | 0
 0 | 0 | 1
 0 | 1 | 1

So in general I think the restriction is that "pathkeys" and
"partition_pathkeys" may not differ at any common position. Or in other
words: the shorter of the two pathkey lists must be contained in the
longer one. Does it make sense to you?


Another problem I see is that build_partition_pathkeys() continues even if it
fails to create a pathkey for some partitioning column. In the example above
it would mean that the table can have "partition_pathkeys" equal to {j}
instead of {i, j} on some EC-related conditions. However such a key does not
correspond to reality - this is easier to imagine if another partition is
considered.

partition 3:

 i | j | k
---+---+---
 1 | 0 | 1
 1 | 0 | 0

So I think no "partition_pathkeys" should be generated in that case. On the
other hand, if the function returned the part of the list it could construct
so far, it'd be wrong because such incomplete pathkeys could pass the checks I
proposed above for reasons unrelated to the partitioning scheme.


The following comments are mostly on coding:

* Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
  this sentence in the header comment:

Note: If changing this, see build_partition_pathkeys()

However, I could not find any use other than the one in
RelationBuildPartitionDesc().

* create_append_path():

    /*
     * Apply query-wide LIMIT if known and path is for sole base relation.
     * (Handling this at this low level is a bit klugy.)
     */
    if (root != NULL && pathkeys != NULL &&
        bms_equal(rel->relids, root->all_baserels))
        pathnode->limit_tuples = root->limit_tuples;
    else
        pathnode->limit_tuples = -1.0;

  I think this optimization is not specific to AppendPath / MergeAppendPath,
  so it could be moved elsewhere (as a separate patch of course). But
  specifically for AppendPath, why do we have to test pathkeys? The pathkeys
  of the AppendPath do not necessarily describe the order of the set to which
  LIMIT is applied, so their existence should not be important here.

* If pathkeys is passed, shouldn't create_append_path() include the
  cost_sort() of subpaths just like create_merge_append_path() does?  And if
  so, then create_append_path() and create_merge_append_path() might
  eventually have some common code (at least for the subpath processing) to be
  put into a separate function.

* Likewise, create_append_plan() / create_merge_append_plan() are going to be
  more similar than before, so some refactoring could also make sense.

Although it's not too much code, I admit the patch is not trivial, so I'm
curious about your opinion.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On 1 November 2018 at 04:01, Antonin Houska <ah@cybertec.at> wrote:
> * As for the logic, I found generate_mergeappend_paths() to be the most
>   interesting part:
>
> Imagine table partitioned by "i", so "partition_pathkeys" is {i}.
>
> partition 1:
>
> i | j
> --+--
> 0 | 0
> 1 | 1
> 0 | 1
> 1 | 0
>
> partition 2:
>
> i | j
> --+--
> 3 | 0
> 2 | 0
> 2 | 1
> 3 | 1
>
> Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
> ordering of the subpaths should not change the way tuples are split into
> partitions.
>
> Obviously a problem is if "partition_pathkeys" and "pathkeys" lists start with
> different items. To propose more generic rule, I used this example of
> range-partitioned table, where "i" and "j" are the partitioning keys:
>
> partition 1:
>
>  i | j | k
> ---+---+---
>  0 | 0 | 1
>  0 | 0 | 0
>
> partition 2:
>
>  i | j | k
> ---+---+---
>  0 | 1 | 0
>  0 | 1 | 1
>
> If the output "pathkey" is {i, k}, then the Append path makes rows of both
> partitions interleave:
>
>  i | j | k
> ---+---+---
>  0 | 0 | 0
>  0 | 1 | 0
>  0 | 0 | 1
>  0 | 1 | 1
>
> So in general I think the restriction is that "pathkeys" and
> "partition_pathkeys" may not differ at any common position. Or in other
> words: the shorter of the two pathkey lists must be contained in the
> longer one. Does it make sense to you?

I understand what you're saying. I just don't understand what you
think is wrong with the patch in this area.

> Another problem I see is that build_partition_pathkeys() continues even if it
> fails to create a pathkey for some partitioning column. In the example above
> it would mean that the table can have "partition_pathkeys" equal to {j}
> instead of {i, j} on some EC-related conditions. However such a key does not
> correspond to reality - this is easier to imagine if another partition is
> considered.
>
> partition 3:
>
>  i | j | k
> ---+---+---
>  1 | 0 | 1
>  1 | 0 | 0
>
> So I think no "partition_pathkeys" should be generated in that case. On the
> other hand, if the function returned the part of the list it could construct
> so far, it'd be wrong because such incomplete pathkeys could pass the checks I
> proposed above for reasons unrelated to the partitioning scheme.

Oops. That's a mistake. We should return what we have so far if we
can't make one of the pathkeys. Will fix.

> The following comments are mostly on coding:
>
> * Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
>   this sentence in the header comment:
>
> Note: If changing this, see build_partition_pathkeys()
>
> However I could not find other use besides that in
> RelationBuildPartitionDesc().

While the new code does not call those directly, the new code does
depend on the sort order of the partitions inside the PartitionDesc,
which those functions are responsible for. Perhaps there's a better
way to communicate that.

Actually, I think the partitioning checking code I added to pathkeys.c
does not belong there. Those checks should likely live with the other
partitioning code in the form of a bool-returning function. I'll
change that now. It also means we won't have to work it out twice, as
I'm currently running the check once for the forward and once for the
backward scan case.  Currently the code is very simple, but if we
start analysing list partition bounds then it will become slower.

> * create_append_path():
>
>         /*
>          * Apply query-wide LIMIT if known and path is for sole base relation.
>          * (Handling this at this low level is a bit klugy.)
>          */
>         if (root != NULL && pathkeys != NULL &&
>                 bms_equal(rel->relids, root->all_baserels))
>                 pathnode->limit_tuples = root->limit_tuples;
>         else
>                 pathnode->limit_tuples = -1.0;
>
>   I think this optimization is not specific to AppendPath / MergeAppendPath,
>   so it could be moved elsewhere (as a separate patch of course). But
>   specifically for AppendPath, why do we have to test pathkeys? The pathkeys
>   of the AppendPath do not necessarily describe the order of the set to which
>   LIMIT is applied, so their existence should not be important here.

The pathkeys != NULL test could be removed. I was just trying to
maintain the status quo for Appends without pathkeys. In reality it
currently does not matter, since limit_tuples is only used as a
parameter for cost_sort(). Previously there'd have been no reason to
have a Sort path as a subpath of an Append node, since the order would
be destroyed after the Append. Perhaps we should just pass it through,
as one day it might be useful. I just can't currently imagine why.

> * If pathkeys is passed, shouldn't create_append_path() include the
>   cost_sort() of subpaths just like create_merge_append_path() does?  And if
>   so, then create_append_path() and create_merge_append_path() might
>   eventually have some common code (at least for the subpath processing) to be
>   put into a separate function.

It does. It's just done via the call to cost_append().

> * Likewise, create_append_plan() / create_merge_append_plan() are going to be
> >   more similar than before, so some refactoring could also make sense.
>
> Although it's not too much code, I admit the patch is not trivial, so I'm
> curious about your opinion.

I think the costing code is sufficiently different to warrant not
sharing more. For example, the startup costing is completely
different. Append can start on the startup cost of the first subpath,
but MergeAppend takes the sum of the startup cost of all subpaths.

I've attached v4 of the patch. I think this addresses all that you
mentioned apart from the first one, due to not understanding the
problem.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Antonin Houska
Date:
David Rowley <david.rowley@2ndquadrant.com> wrote:

> On 1 November 2018 at 04:01, Antonin Houska <ah@cybertec.at> wrote:
> > * As for the logic, I found generate_mergeappend_paths() to be the most
> >   interesting part:
> >
> > Imagine table partitioned by "i", so "partition_pathkeys" is {i}.
> >
> > partition 1:
> >
> > i | j
> > --+--
> > 0 | 0
> > 1 | 1
> > 0 | 1
> > 1 | 0
> >
> > partition 2:
> >
> > i | j
> > --+--
> > 3 | 0
> > 2 | 0
> > 2 | 1
> > 3 | 1
> >
> > Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
> > ordering of the subpaths should not change the way tuples are split into
> > partitions.
> >
> > ...
>
> I understand what you're saying. I just don't understand what you
> think is wrong with the patch in this area.

I think these conditions are too restrictive:

    /*
     * Determine if these pathkeys match the partition order, or reverse
     * partition order.  It can't match both, so only go to the trouble of
     * checking the reverse order when it's not in ascending partition
     * order.
     */
    partition_order = pathkeys_contained_in(pathkeys,
                        partition_pathkeys);
    partition_order_desc = !partition_order &&
                pathkeys_contained_in(pathkeys,
                            partition_pathkeys_desc);


In the example above I wanted to show that your new feature should work even
if "pathkeys" is not contained in "partition_pathkeys".

> > Another problem I see is that build_partition_pathkeys() continues even if it
> > fails to create a pathkey for some partitioning column. In the example above
> > it would mean that the table can have "partition_pathkeys" equal to {j}
> > instead of {i, j} on some EC-related conditions. However such a key does not
> > correspond to reality - this is easier to imagine if another partition is
> > considered.
> >
> > partition 3:
> >
> >  i | j | k
> > ---+---+---
> >  1 | 0 | 1
> >  1 | 0 | 0
> >
> > So I think no "partition_pathkeys" should be generated in that case. On the
> > other hand, if the function returned the part of the list it could construct
> > so far, it'd be wrong because such incomplete pathkeys could pass the checks I
> > proposed above for reasons unrelated to the partitioning scheme.
>
> Oops. That's a mistake. We should return what we have so far if we
> can't make one of the pathkeys. Will fix.

I think no "partition_pathkeys" should be created in this case, but before we
can discuss this in detail there needs to be an agreement on the evaluation of
the relationship between "pathkeys" and "partition_pathkeys", see above.

> > The following comments are mostly on coding:
> >
> > * Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
> >   this sentence in the header comment:
> >
> > Note: If changing this, see build_partition_pathkeys()
> >
> > However I could not find other use besides that in
> > RelationBuildPartitionDesc().
>
> While the new code does not call those directly, the new code does
> depend on the sort order of the partitions inside the PartitionDesc,
> which those functions are responsible for. Perhaps there's a better
> way to communicate that.

I pointed this out because I suspect that changes of these functions would
affect more features, not only the one you're trying to implement.

> > * If pathkeys is passed, shouldn't create_append_path() include the
> >   cost_sort() of subpaths just like create_merge_append_path() does?  And if
> >   so, then create_append_path() and create_merge_append_path() might
> >   eventually have some common code (at least for the subpath processing) to be
> >   put into a separate function.
>
> It does. It's just done via the call to cost_append().

ok, I missed that.

> > * Likewise, create_append_plan() / create_merge_append_plan() are going to be
> >   more similar than before, so some refactoring could also make sense.
> >
> > Although it's not too much code, I admit the patch is not trivial, so I'm
> > curious about your opinion.
>
> I think the costing code is sufficiently different to warrant not
> sharing more. For example, the startup costing is completely
> different. Append can start on the startup cost of the first subpath,
> but MergeAppend takes the sum of the startup cost of all subpaths.

ok

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:
> I think these conditions are too restrictive:
>
>         /*
>          * Determine if these pathkeys match the partition order, or reverse
>          * partition order.  It can't match both, so only go to the trouble of
>          * checking the reverse order when it's not in ascending partition
>          * order.
>          */
>         partition_order = pathkeys_contained_in(pathkeys,
>                                                 partition_pathkeys);
>         partition_order_desc = !partition_order &&
>                                 pathkeys_contained_in(pathkeys,
>                                                         partition_pathkeys_desc);
>
>
> In the example above I wanted to show that your new feature should work even
> if "pathkeys" is not contained in "partition_pathkeys".

Okay, after a bit more time looking at this I see what you're saying
now, and I agree; but see below.

>> > Another problem I see is that build_partition_pathkeys() continues even if it
>> > fails to create a pathkey for some partitioning column. In the example above
>> > it would mean that the table can have "partition_pathkeys" equal to {j}
>> > instead of {i, j} on some EC-related conditions. However such a key does not
>> > correspond to reality - this is easier to imagine if another partition is
>> > considered.
>> >
>> > partition 3:
>> >
>> >  i | j | k
>> > ---+---+---
>> >  1 | 0 | 1
>> >  1 | 0 | 0
>> >
>> > So I think no "partition_pathkeys" should be generated in that case. On the
>> > other hand, if the function returned the part of the list it could construct
>> > so far, it'd be wrong because such incomplete pathkeys could pass the checks I
>> > proposed above for reasons unrelated to the partitioning scheme.
>>
>> Oops. That's a mistake. We should return what we have so far if we
>> can't make one of the pathkeys. Will fix.
>
> I think no "partition_pathkeys" should be created in this case, but before we
> can discuss this in detail there needs to be an agreement on the evaluation of
> the relationship between "pathkeys" and "partition_pathkeys", see above.

The problem with doing that is that if the partition keys are better
than the pathkeys then we'll most likely fail to generate any
partition keys at all due to lack of any existing eclass to use for
the pathkeys. It's unsafe to use just the prefix in this case as the
eclass may not have been found due to, for example one of the
partition keys having a different collation than the required sort
order of the query. In other words, we can't rely on a failure to
create the pathkey meaning that a more strict sort order is not
required.

I'm a bit unsure on how safe it would be to pass "create_it" as true
to make_pathkey_from_sortinfo(). We might be building partition path
keys for some sub-partitioned table. In this case the eclass should
likely have its member added with em_is_child = true.  The existing
code always sets em_is_child to false. It's not that clear to me that
setting up a new eclass with a single em_is_child = true member is
correct.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Mon, 5 Nov 2018 at 10:46, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:
> > I think these conditions are too restrictive:
> >
> >         /*
> >          * Determine if these pathkeys match the partition order, or reverse
> >          * partition order.  It can't match both, so only go to the trouble of
> >          * checking the reverse order when it's not in ascending partition
> >          * order.
> >          */
> >         partition_order = pathkeys_contained_in(pathkeys,
> >                                                 partition_pathkeys);
> >         partition_order_desc = !partition_order &&
> >                                 pathkeys_contained_in(pathkeys,
> >                                                         partition_pathkeys_desc);
> >

> The problem with doing that is that if the partition keys are better
> than the pathkeys then we'll most likely fail to generate any
> partition keys at all due to lack of any existing eclass to use for
> the pathkeys. It's unsafe to use just the prefix in this case as the
> eclass may not have been found due to, for example one of the
> partition keys having a different collation than the required sort
> order of the query. In other words, we can't rely on a failure to
> create the pathkey meaning that a more strict sort order is not
> required.

I had another look at this patch and it seems okay just to add a new
flag to build_partition_pathkeys() to indicate if the pathkey List was
truncated or not.  In generate_mergeappend_paths() we can then just
check that flag before checking if the partition pathkeys are
contained in pathkeys. It's fine if the partition keys were truncated
for the reverse of that check.

I've done this in the attached and added additional regression tests
for this case.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
Hi,

On Thu, Nov 22, 2018 at 11:27 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Mon, 5 Nov 2018 at 10:46, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:
> > > I think these conditions are too restrictive:
> > >
> > >         /*
> > >          * Determine if these pathkeys match the partition order, or reverse
> > >          * partition order.  It can't match both, so only go to the trouble of
> > >          * checking the reverse order when it's not in ascending partition
> > >          * order.
> > >          */
> > >         partition_order = pathkeys_contained_in(pathkeys,
> > >                                                 partition_pathkeys);
> > >         partition_order_desc = !partition_order &&
> > >                                 pathkeys_contained_in(pathkeys,
> > >                                                         partition_pathkeys_desc);
> > >
>
> > The problem with doing that is that if the partition keys are better
> > than the pathkeys then we'll most likely fail to generate any
> > partition keys at all due to lack of any existing eclass to use for
> > the pathkeys. It's unsafe to use just the prefix in this case as the
> > eclass may not have been found due to, for example one of the
> > partition keys having a different collation than the required sort
> > order of the query. In other words, we can't rely on a failure to
> > create the pathkey meaning that a more strict sort order is not
> > required.
>
> I had another look at this patch and it seems okay just to add a new
> flag to build_partition_pathkeys() to indicate if the pathkey List was
> truncated or not.  In generate_mergeappend_paths() we can then just
> check that flag before checking if the partition pathkeys are
> contained in pathkeys. It's fine if the partition keys were truncated
> for the reverse of that check.
>
> I've done this in the attached and added additional regression tests
> for this case.

I started to look at v5.

If I understand correctly, the new behavior is controlled by
partitions_are_ordered(), but it only checks for declared partitions,
not partitions that survived pruning.  Did I miss something or is it
the intended behavior?  Also, generate_mergeappend_paths should
probably be renamed to something like generate_sortedappend_paths
since it can now generate either Append or MergeAppend paths.

I'm also wondering if that's ok to only generate either a (sorted)
Append or a MergeAppend.  Is it possible that in some cases it's
better to have a MergeAppend rather than a sorted Append, given that
MergeAppend is parallel-aware and the sorted Append isn't?


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Wed, 19 Dec 2018 at 20:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
> I started to look at v5.

Thanks for giving this a look over.

> If I understand correctly, the new behavior is controlled by
> partitions_are_ordered(), but it only checks for declared partitions,
> not partitions that survived pruning.  Did I miss something or is it
> the intended behavior?

Yeah, it was mentioned up-thread a bit.

I wrote:
> I retrospectively read that thread after Amit mentioned about your
> patch. I just disagree with Robert about caching this flag.  The
> reason is, if the flag is false due to some problematic partitions, if
> we go and prune those, then we needlessly fail to optimise that case.
> I propose we come back and do the remaining optimisations with
> interleaved LIST partitions and partitioned tables with DEFAULT
> partitions later,  once we have a new "live_parts" field in
> RelOptInfo.  That way we can just check the live parts to ensure
> they're compatible with the optimization. If we get what's done
> already in, then we're already a big step forward.

The reason I'm keen to leave this alone today is that determining
which partitions are pruned requires looking at each partition's
RelOptInfo and checking if it's marked as a dummy rel. I'm trying to
minimise the overhead of this patch by avoiding doing any
per-partition processing. If we get the "live_parts" Bitmapset, then
this becomes cheaper as Bitmapsets are fairly efficient at finding the
next set member, even when they're large and sparsely populated.

> Also, generate_mergeappend_paths should
> probably be renamed to something like generate_sortedappend_paths
> since it can now generate either Append or MergeAppend paths.

You might be right about naming this something else, but
"sortedappend" sounds like an Append node with a Sort node above it.
"orderedappend" feels slightly better, although my personal vote would
be not to rename it at all. The fact that it sometimes generates an
Append seems like an easy enough corner case to just mention in the
function body.

> I'm also wondering if that's ok to only generate either a (sorted)
> Append or a MergeAppend.  Is it possible that in some cases it's
> better to have a MergeAppend rather than a sorted Append, given that
> MergeAppend is parallel-aware and the sorted Append isn't?

That might have been worth a thought if we had parallel MergeAppends,
but we don't. You might be thinking of GatherMerge.

I've attached a v6 patch. The only change is the renamed the
generate_mergeappend_paths() function to
generate_orderedappend_paths(), and also the required comment updates
to go with it.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Wed, Dec 19, 2018 at 10:51 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Wed, 19 Dec 2018 at 20:40, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> > If I understand correctly, the new behavior is controlled by
> > partitions_are_ordered(), but it only checks for declared partitions,
> > not partitions that survived pruning.  Did I miss something or is it
> > the intended behavior?
>
> Yeah, it was mentioned up-thread a bit.
>
> I wrote:
> > I retrospectively read that thread after Amit mentioned about your
> > patch. I just disagree with Robert about caching this flag.  The
> > reason is, if the flag is false due to some problematic partitions, if
> > we go and prune those, then we needlessly fail to optimise that case.
> > I propose we come back and do the remaining optimisations with
> > interleaved LIST partitions and partitioned tables with DEFAULT
> > partitions later,  once we have a new "live_parts" field in
> > RelOptInfo.  That way we can just check the live parts to ensure
> > they're compatible with the optimization. If we get what's done
> > already in, then we're already a big step forward.

Ah, sorry, I did read this but misunderstood it.  I really need to
catch up more thoroughly on what has changed in partitioning since pg11.

> The reason I'm keen to leave this alone today is that determining
> which partitions are pruned requires looking at each partition's
> RelOptInfo and checking if it's marked as a dummy rel. I'm trying to
> minimise the overhead of this patch by avoiding doing any
> per-partition processing. If we get the "live_parts" Bitmapset, then
> this becomes cheaper as Bitmapsets are fairly efficient at finding the
> next set member, even when they're large and sparsely populated.

I see.  But since for now the optimisation will only be done
considering all partitions, I still think that it's better to store a
bool flag in the PartitionDesc to describe whether it's natively
ordered or not, and thereby also handle the case of non-interleaved
multi-datum list partitioning.  It won't add much overhead and will
benefit way more cases.

We can still revisit that when a live_parts Bitmapset is available in
RelOptInfo (and maybe other flags that say whether partitions were
pruned, and/or whether the default partition was pruned).

> > Also, generate_mergeappend_paths should
> > probably be renamed to something like generate_sortedappend_paths
> > since it can now generate either Append or MergeAppend paths.
>
> You might be right about naming this something else, but
> "sortedappend" sounds like an Append node with a Sort node above it.
> "orderedappend" feels slightly better, although my personal vote would
> be not to rename it at all. Sometimes generating an Append seems like
> an easy enough corner case to mention in the function body.

Ok, I don't have a very strong opinion on it and orderedappend sounds
less ambiguous.

> > I'm also wondering if that's ok to only generate either a (sorted)
> > Append or a MergeAppend.  Is it possible that in some cases it's
> > better to have a MergeAppend rather than a sorted Append, given that
> > MergeAppend is parallel-aware and the sorted Append isn't?
>
> That might have been worth a thought if we had parallel MergeAppends,
> but we don't. You might be thinking of GatherMerge.

Ah, oups indeed :)


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Wed, 19 Dec 2018 at 23:25, Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Wed, Dec 19, 2018 at 10:51 AM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > The reason I'm keen to leave this alone today is that determining
> > which partitions are pruned requires looking at each partition's
> > RelOptInfo and checking if it's marked as a dummy rel. I'm trying to
> > minimise the overhead of this patch by avoiding doing any
> > per-partition processing. If we get the "live_parts" Bitmapset, then
> > this becomes cheaper as Bitmapsets are fairly efficient at finding the
> > next set member, even when they're large and sparsely populated.
>
> I see.  But since for now the optimisation will only be done
> considering all partitions, I still think that it's better to store a
> bool flag in the PartitionDesc to describe if it's natively ordered or
> not, and therefore also handle the case for
> non-intervleaved-multi-datums list partitioning.  It won't add much
> overhead and will benefit way more cases.

I'm not really in favour of adding a flag there only to remove it
again once we can more easily determine the pruned partitions.
Remember that the flag, because it's stored in the relation cache,
must be set accounting for all partitions. As soon as we want to add
smarts for pruned partitions, the flag becomes completely useless for
everything.  If covering all cases in the first hit is your aim, then
the way to go is to add the live_parts field to RelOptInfo in this
patch rather than in Amit's patch in [1].  I'd much rather add the
pruned-partition smarts as part of another effort.  The most likely
cases to benefit from this are already covered by the current patch:
range-partitioned tables.

[1] https://commitfest.postgresql.org/21/1778/

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Wed, Dec 19, 2018 at 1:18 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Wed, 19 Dec 2018 at 23:25, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > I see.  But since for now the optimisation will only be done
> > considering all partitions, I still think that it's better to store a
> > bool flag in the PartitionDesc to describe whether it's natively
> > ordered or not, and thereby also handle the case of non-interleaved
> > multi-datum list partitioning.  It won't add much overhead and will
> > benefit way more cases.
>
> I'm not really in favour of adding a flag there only to remove it
> again once we can more easily determine the pruned partitions.
> Remember the flag, because it's stored in the relation cache, must be
> set accounting for all partitions. As soon as we want to add smarts
> for pruned partitions, then the flag becomes completely useless for
> everything.

I don't see why we should drop this flag.  If we know that the
partitions are naturally ordered, they'll still be ordered after some
partitions have been pruned, so we can skip later checks if we already
have the information.  The only remaining cases this flag doesn't
cover are:

- partitions are naturally ordered but there's a default partition.
We could store this information and later check whether the default
partition has been pruned or not.
- partitions are not naturally ordered, but become naturally ordered
if enough partitions are pruned.  I may be wrong, but that doesn't
seem like a very frequent use case to me.  I'd imagine that in a lot
of cases either almost no partitions are pruned (or at least not
enough for the remaining ones to be ordered), or all but one partition
is pruned.  So keeping the overhead low for the naturally-ordered,
almost-no-partitions-pruned case still seems like a good idea to me.

>  If covering all cases in the first hit is your aim then
> the way to go is to add the live_parts field to RelOptInfo in this
> patch rather than in Amit's patch in [1].  I'd much rather add the
> pruned partitions smarts as part of another effort.  The most likely
> cases to benefit from this are already covered by the current patch;
> range partitioned tables.

Covering all cases is definitely not my goal here, just grabbing the
low hanging fruits.  The multi-level partitioning case is another
thing that would need to be handled for instance (and that's the main
reason I couldn't submit a new patch when I was working on it), and
I'm definitely not arguing to cover it in this patch.  That being
said, I'll try to have a look at this patch too, but as I said I have
a lot of catch-up to do in this part of the code, so I'm afraid that
I'll not be super efficient.


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Thu, 20 Dec 2018 at 01:58, Julien Rouhaud <rjuju123@gmail.com> wrote:
> I don't see why we should drop this flag.  If we know that the
> partitions are naturally ordered, they'll still be ordered after some
> partitions have been pruned, so we can skip later checks if we already
> have the information.  The only remaining cases this flag doesn't
> cover are:
>
> - partitions are naturally ordered but there's a default partition.
> We could store this information and later check if the default
> partition has been pruned or not
> - partitions are not naturally ordered, but become naturally ordered
> if enough partitions are pruned.  I may be wrong, but that doesn't
> seem like a very frequent use case to me.  I'd imagine that in a lot
> of cases either almost no partitions are pruned (or at least not
> enough that the remaining ones are ordered), or all but one partition
> is pruned.  So keeping a low overhead for the
> almost-no-pruned-partitions case with naturally ordered partitions
> still seems like a good idea to me.

I'm objecting to processing for all partitions, but processing for
just non-pruned partitions seems fine to me. If there are 10k
partitions and we pruned none of them, then planning will be slow
anyway. I'm not too worried about slowing it down a further
microsecond or two.  It'll be a drop in the ocean. Once we have the
live_parts field in RelOptInfo, we can allow all of the cases you've
mentioned above: we'll just need to look at the non-pruned partitions,
in partition order, and determine whether any partition's lowest LIST
value sorts earlier than some earlier partition's highest LIST value,
disabling the optimisation in such cases.
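As a rough illustration of that per-query check (this is an invented
helper in Python, not PostgreSQL source code), the idea boils down to a
single walk over the surviving partitions in bound order:

```python
# Illustrative sketch only -- not PostgreSQL code.  Walk the surviving
# LIST partitions in partition bound order and verify that each
# partition's lowest value sorts after the previous partition's highest
# value; any overlap means values are interleaved and the ordered-scan
# optimisation must be disabled.

def partitions_are_ordered(live_partitions):
    """live_partitions: one value-list per unpruned partition, in bound order."""
    prev_highest = None
    for values in live_partitions:
        lowest, highest = min(values), max(values)
        if prev_highest is not None and lowest < prev_highest:
            return False  # interleaved with an earlier partition
        prev_highest = highest
    return True
```

For example, partitions holding {1,2} and {4,5} pass the check, while
adding a partition holding {0,3} between them fails it.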

The flag you've mentioned will become redundant when support is added
for the cases you've mentioned above.  I don't see any reason not to
support all these cases, once the live_parts flag makes it into
RelOptInfo.  I'm also a bit confused at why you think it's so
important to make multi-valued LIST partitions work when no values are
interleaved, but you suddenly don't care about the optimisation when
the interleaved value partitions get pruned. Can you share your
reasoning for that?

If you're really so keen on this flag, can you share the design you
have in mind?    If it's just a single bool flag like "parts_ordered",
and that's set to false, then how would you know there is some natural
order when the DEFAULT partition gets pruned? Or are you proposing
multiple flags, maybe two flags, one for when the default is pruned
and one when it's not?  If so, I'd question why the default partition
is so special? Pruning of any of the other partitions could turn a
naturally unordered LIST partitioned table into a naturally ordered
partitioned table if the pruned partition happened to be the only one
with interleaved values. Handling only the DEFAULT partition in a
special way seems to violate the principle of least astonishment.

But in short, I just really don't like the flags idea and I'm not
really willing to work on it or put my name on it. I'd much rather
wait and then build a proper solution that works in all cases.  I feel the
current patch is worthwhile as it stands.

> The multi-level partitioning case is another
> thing that would need to be handled for instance (and that's the main
> reason I couldn't submit a new patch when I was working on it), and
> I'm definitely not arguing to cover it in this patch.

As far as I'm aware, the multi-level partitioning should work just
fine with the current patch. I added code for that a while ago. There
are regression tests to exercise it. I'm not aware of any cases where
it does not work.

> That being
> said, I'll try to have a look at this patch too, but as I said I have
> a lot of catch-up to do in this part of the code, so I'm afraid that
> I'll not be super efficient.

Thanks for your time on this so far.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Wed, Dec 19, 2018 at 3:01 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Thu, 20 Dec 2018 at 01:58, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> I'm objecting to processing for all partitions, but processing for
> just non-pruned partitions seems fine to me. If there are 10k
> partitions and we pruned none of them, then planning will be slow
> anyway. I'm not too worried about slowing it down a further
> microsecond or two.  It'll be a drop in the ocean. When we have the
> live_parts flag in RelOptInfo then we can allow all of the cases
> you've mentioned above, we'll just need to look at the non-pruned
> partitions, and in partition order, determine if the lowest LIST
> partitioned value sorts earlier than some earlier partition's highest
> LIST value and disable the optimisation for such cases.

My concern is more about a moderate number of partitions (a few
hundred?).  I don't know how expensive that'll be, but it just seems
sad to recompute their ordering each time and waste cycles when we
could do it only once, outside of corner cases.

> The flag you've mentioned will become redundant when support is added
> for the cases you've mentioned above.  I don't see any reason not to
> support all these cases, once the live_parts flag makes it into
> RelOptInfo.  I'm also a bit confused at why you think it's so
> important to make multi-valued LIST partitions work when no values are
> interleaved, but you suddenly don't care about the optimisation when
> the interleaved value partitions get pruned. Can you share your
> reasoning for that?

I never said that I don't care about interleaved partitions being
pruned.  I do think it might not be a super frequent thing, but I
certainly wish we handled it.  I just agree with your argument that the
pruned partitions problem will be better handled with the live_parts
that should be added in another patch.

> If you're really so keen on this flag, can you share the design you
> have in mind?    If it's just a single bool flag like "parts_ordered",
> and that's set to false, then how would you know there is some natural
> order when the DEFAULT partition gets pruned? Or are you proposing
> multiple flags, maybe two flags, one for when the default is pruned
> and one when it's not?

I don't think that the design is a big problem here.  You can either
have a flag that says whether the partitions are ordered regardless of
whether there's a default partition, so callers would have to check if
the default partition is still there, or just store an enum to
distinguish the different cases.
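To sketch the enum variant (all names here are invented for
illustration; nothing like this exists in PostgreSQL), the cached state
and the post-pruning decision might look like:

```python
# Hypothetical sketch of the cached-ordering idea discussed here; the
# enum values and function are invented, not PostgreSQL internals.
from enum import Enum, auto

class PartOrder(Enum):
    UNORDERED = auto()               # interleaved values somewhere
    ORDERED = auto()                 # naturally ordered, no default partition
    ORDERED_EXCEPT_DEFAULT = auto()  # ordered iff the default partition is pruned

def can_use_ordered_scan(state, default_pruned, any_pruned, recheck):
    """recheck: callable that re-examines only the surviving partitions."""
    if state is PartOrder.ORDERED:
        return True  # cached "Yes" lets us skip all later checks
    if state is PartOrder.ORDERED_EXCEPT_DEFAULT:
        return default_pruned
    # UNORDERED: only worth rechecking if pruning removed some partitions,
    # since the interleaving partition might be among those pruned
    return any_pruned and recheck()
```

This matches the scheme as later restated in the thread: a cached "Yes"
is final, while a "No" is only revisited when partitions were pruned.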

>  If so, I'd question why the default partition
> is so special? Pruning of any of the other partitions could turn a
> naturally unordered LIST partitioned table into a naturally ordered
> partitioned table if the pruned partition happened to be the only one
> with interleaved values. Handling only the DEFAULT partition in a
> special way seems to violate the principle of least astonishment.

I'm not sure I'm following you: the default partition is by nature a
special partition, and its mere presence prevents this optimisation.
We can't possibly store all the subsets of partitions that
would make the partitioned table naturally ordered if they were
pruned, so it seems like a different problem.

> But in short, I just really don't like the flags idea and I'm not
> really willing to work on it or put my name on it. I'd much rather
> wait then build a proper solution that works in all cases.  I feel the
> current patch is worthwhile as it stands.

Ok, fine.

> > The multi-level partitioning case is another
> > thing that would need to be handled for instance (and that's the main
> > reason I couldn't submit a new patch when I was working on it), and
> > I'm definitely not arguing to cover it in this patch.
>
> As far as I'm aware, the multi-level partitioning should work just
> fine with the current patch. I added code for that a while ago. There
> are regression tests to exercise it. I'm not aware of any cases where
> it does not work.

Ok.


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Thu, 20 Dec 2018 at 09:48, Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Wed, Dec 19, 2018 at 3:01 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> >  If so, I'd question why the default partition
> > is so special? Pruning of any of the other partitions could turn a
> > naturally unordered LIST partitioned table into a naturally ordered
> > partitioned table if the pruned partition happened to be the only one
> > with interleaved values. Handling only the DEFAULT partition in a
> > special way seems to violate the principle of least astonishment.
>
> I'm not sure I'm following you, the default partition is by nature a
> special partition, and its mere presence prevents this optimisation.
> We can't possibly store all the subsets of partitions that
> would make the partitioned table naturally ordered if they were
> pruned, so it seems like a different problem.

For example:

create table listp (a int) partition by list (a);
create table listp12 partition of listp for values in(1,2);
create table listp03 partition of listp for values in(0,3);
create table listp45 partition of listp for values in(4,5);
create table listpd partition of listp default;

select * from listp where a in(1,2,4,5);

Here we prune all but listp12 and listp45. Since the default is pruned
and listp03 is pruned, there are no interleaved values. By your
proposed design the natural ordering is not detected since we're
storing a flag that says the partitions are unordered due to listp03.
With my idea for using live_parts, we'll process the partitions
looking for interleaved values on each query, after pruning takes
place. In this case, we'll see the partitions are naturally ordered. I
don't really foresee any issues with that additional processing since
it will only be a big effort when there are a large number of
partitions, and in those cases the planner already has lots of work to
do. Such processing is just a drop in the ocean when compared to path
generation for all those partitions.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Wed, Dec 19, 2018 at 11:08 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Thu, 20 Dec 2018 at 09:48, Julien Rouhaud <rjuju123@gmail.com> wrote:
> > On Wed, Dec 19, 2018 at 3:01 PM David Rowley
> > <david.rowley@2ndquadrant.com> wrote:
> > >  If so, I'd question why the default partition
> > > is so special? Pruning of any of the other partitions could turn a
> > > naturally unordered LIST partitioned table into a naturally ordered
> > > partitioned table if the pruned partition happened to be the only one
> > > with interleaved values. Handling only the DEFAULT partition in a
> > > special way seems to violate the principle of least astonishment.
> >
> > I'm not sure I'm following you, the default partition is by nature a
> > special partition, and its mere presence prevents this optimisation.
> > We can't possibly store all the subsets of partitions that
> > would make the partitioned table naturally ordered if they were
> > pruned, so it seems like a different problem.
>
> For example:
>
> create table listp (a int) partition by list (a);
> create table listp12 partition of listp for values in(1,2);
> create table listp03 partition of listp for values in(0,3);
> create table listp45 partition of listp for values in(4,5);
> create table listpd partition of listp default;
>
> select * from listp where a in(1,2,4,5);
>
> Here we prune all but listp12 and listp45. Since the default is pruned
> and listp03 is pruned then there are no interleaved values. By your
> proposed design the natural ordering is not detected since we're
> storing a flag that says the partitions are unordered due to listp03.

No, what I'm proposing is to store if the partitions are naturally
ordered or not, *and* recheck after pruning if that property could
have changed (so if some partitions have been pruned).  So we avoid
extra processing if we already knew that the partitions were ordered
(possibly with the default partition pruning information), or if we
know that the partitions are not ordered and no partitions have been
pruned.


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Thu, 20 Dec 2018 at 18:20, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Wed, Dec 19, 2018 at 11:08 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > create table listp (a int) partition by list (a);
> > create table listp12 partition of listp for values in(1,2);
> > create table listp03 partition of listp for values in(0,3);
> > create table listp45 partition of listp for values in(4,5);
> > create table listpd partition of listp default;
> >
> > select * from listp where a in(1,2,4,5);
> >
> > Here we prune all but listp12 and listp45. Since the default is pruned
> > and listp03 is pruned then there are no interleaved values. By your
> > proposed design the natural ordering is not detected since we're
> > storing a flag that says the partitions are unordered due to listp03.
>
> No, what I'm proposing is to store if the partitions are naturally
> ordered or not, *and* recheck after pruning if that property could
> have changed (so if some partitions have been pruned).  So we avoid
> extra processing if we already knew that the partitions were ordered
> (possibly with the default partition pruning information), or if we
> know that the partitions are not ordered and no partitions have been
> pruned.

I see. So if the flag says "Yes", then we can skip the plan-time
check, if it says "No" and partitions were pruned, then we need to
re-check as the reason the flag says "No" might have been pruned away.

I guess that works, but I had imagined that the test wouldn't be
particularly expensive. Even as more partitions are left unpruned, I
thought the slightly higher cost of such a test would be unlikely to
be measurable, but then I've not written the code yet.

Are you saying that you think this patch should have this? Or are you
happy to leave it until the next round?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Sun, Jan 6, 2019 at 4:24 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Thu, 20 Dec 2018 at 18:20, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> >
> > No, what I'm proposing is to store if the partitions are naturally
> > ordered or not, *and* recheck after pruning if that property could
> > have changed (so if some partitions have been pruned).  So we avoid
> > extra processing if we already knew that the partitions were ordered
> > (possibly with the default partition pruning information), or if we
> > know that the partitions are not ordered and no partitions have been
> > pruned.
>
> I see. So if the flag says "Yes", then we can skip the plan-time
> check, if it says "No" and partitions were pruned, then we need to
> re-check as the reason the flag says "No" might have been pruned away.

Exactly.

> I guess that works, but I had imagined that the test wouldn't have
> been particularly expensive. As more partitions are left unpruned then
> such a test costing a bit more I thought would have been unlikely to
> be measurable, but then I've not written the code yet.

That's where my objection is, I think. IIUC, the tests aren't
especially expensive, partly because the multi-value list
partitioning case is out of scope.  I was thinking that this flag
would allow keeping that case in scope while not adding much
overhead, and it could still be useful for future enhancements
(though optimizing some cycles with a huge number of partitions is
probably, as you said, a drop in the ocean).

> Are you saying that you think this patch should have this? Or are you
> happy to leave it until the next round?

I'd be happy if we can eventually handle ordered partitioned table
scans efficiently, including multi-value list partitioning.  So if
that means that this optimization is not the best way to handle it,
or if it's just not the best time to implement it, I'm perfectly fine
with that.


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
I've attached a rebased patch which fixes up the recent conflicts. No
other changes were made.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Michael Paquier
Date:
On Thu, Jan 31, 2019 at 04:29:56PM +1300, David Rowley wrote:
> I've attached a rebased patch which fixes up the recent conflicts. No
> other changes were made.

Moved to next CF, waiting for review.
--
Michael

Attachment

Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Thu, 31 Jan 2019 at 16:29, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached a rebased patch which fixes up the recent conflicts. No
> other changes were made.

Rebased version due to a new call to create_append_path() added in
ab5fcf2b0. No other changes made.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Antonin Houska
Date:
David Rowley <david.rowley@2ndquadrant.com> wrote:

> On Mon, 5 Nov 2018 at 10:46, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:
> > > I think these conditions are too restrictive:
> > >
> > >         /*
> > >          * Determine if these pathkeys match the partition order, or reverse
> > >          * partition order.  It can't match both, so only go to the trouble of
> > >          * checking the reverse order when it's not in ascending partition
> > >          * order.
> > >          */
> > >         partition_order = pathkeys_contained_in(pathkeys,
> > >                                                 partition_pathkeys);
> > >         partition_order_desc = !partition_order &&
> > >                                 pathkeys_contained_in(pathkeys,
> > >                                                         partition_pathkeys_desc);
> > >
>
> > The problem with doing that is that if the partition keys are better
> > than the pathkeys then we'll most likely fail to generate any
> > partition keys at all due to lack of any existing eclass to use for
> > the pathkeys. It's unsafe to use just the prefix in this case as the
> > eclass may not have been found due to, for example one of the
> > partition keys having a different collation than the required sort
> > order of the query. In other words, we can't rely on a failure to
> > create the pathkey meaning that a more strict sort order is not
> > required.
>
> I had another look at this patch and it seems okay just to add a new
> flag to build_partition_pathkeys() to indicate if the pathkey List was
> truncated or not.  In generate_mergeappend_paths() we can then just
> check that flag before checking if the partition pathkeys are
> contained in pathkeys. It's fine if the partition keys were truncated
> for the reverse of that check.
>
> I've done this in the attached and added additional regression tests
> for this case.

I agree with this approach and I'm also fine with your other comments / code
changes to address my review.

As for the latest version (v8-0001-...) I've only caught a small typo: "When
the first subpath needs sorted ...". It was probably meant "... needs sort
...".

--
Antonin Houska
https://www.cybertec-postgresql.com


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
Thanks a lot for taking the time to look at this.

On Tue, 5 Mar 2019 at 03:03, Antonin Houska <ah@cybertec.at> wrote:
> As for the latest version (v8-0001-...) I've only caught a small typo: "When
> the first subpath needs sorted ...". It was probably meant "... needs sort
> ...".

That was a sort of short way of saying "needs [to be] sorted". I've
added in the missing "to be" in the attached.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Robert Haas
Date:
On Wed, Dec 19, 2018 at 5:08 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> With my idea for using live_parts, we'll process the partitions
> looking for interleaved values on each query, after pruning takes
> place. In this case, we'll see the partitions are naturally ordered. I
> don't really foresee any issues with that additional processing since
> it will only be a big effort when there are a large number of
> partitions, and in those cases the planner already has lots of work to
> do. Such processing is just a drop in the ocean when compared to path
> generation for all those partitions.

I agree that partitions_are_ordered() is cheap enough in this patch
that it probably doesn't matter whether we cache the result.  On the
other hand, that's mostly because you haven't handled the hard cases -
e.g. interleaved list partitions.  If you did, then it would be
expensive, and it probably *would* be worth caching the result.  Now
maybe those hard cases aren't worth handling anyway.

You also seem to be saying that since run-time partition pruning
might change the answer, caching the initial answer is pointless.  But
I think Julien has made a good argument for why that's wrong: if the
initial answer is that the partitions are ordered, which will often be
true, then we can skip all later checks.

So I am OK with the fact that this patch doesn't choose to cache it,
but I don't really buy any of your arguments for why it would be a bad
idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Wed, 6 Mar 2019 at 07:17, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Dec 19, 2018 at 5:08 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > With my idea for using live_parts, we'll process the partitions
> > looking for interleaved values on each query, after pruning takes
> > place. In this case, we'll see the partitions are naturally ordered. I
> > don't really foresee any issues with that additional processing since
> > it will only be a big effort when there are a large number of
> > partitions, and in those cases the planner already has lots of work to
> > do. Such processing is just a drop in the ocean when compared to path
> > generation for all those partitions.
>
> I agree that partitions_are_ordered() is cheap enough in this patch
> that it probably doesn't matter whether we cache the result.  On the
> other hand, that's mostly because you haven't handled the hard cases -
> e.g. interleaved list partitions.  If you did, then it would be
> expensive, and it probably *would* be worth caching the result.  Now
> maybe those hard cases aren't worth handling anyway.

I admit that I didn't understand the idea of the flag at the time,
having failed to see its point, since I thought the flag would be
useless once partitions are plan-time pruned. However, as Julien
explained, "Yes" would mean it's okay to do ordered scans, and "No"
would mean "Recheck if there are pruned partitions, using only the
non-pruned ones".  That seems fine and very sane to me
now that I understand it. FWIW, my moment of realisation came in [1].

However, my thoughts are that adding new flags and the live_parts
field in RelOptInfo raise the bar a bit for this patch.   There's
already quite a number of partition-related fields in RelOptInfo.
Understanding what each of those does is not trivial, so I figured
that this patch would be much easier to consider if I skipped that
part for the first cut version.   I feared a lot of instability of
what fields exist from Amit's planner improvement patches and I didn't
want to deal with dependencies from WIP. I had to deal with that last
year on run-time pruning and it turned out not to be fun.

> You also seem to be saying that since run-time partition pruning
> might change the answer, caching the initial answer is pointless.  But
> I think Julien has made a good argument for why that's wrong: if the
> initial answer is that the partitions are ordered, which will often be
> true, then we can skip all later checks.
>
> So I am OK with the fact that this patch doesn't choose to cache it,
> but I don't really buy any of your arguments for why it would be a bad
> idea.

OK, good.  I agree.  For the record; I want to steer clear of the flag
in this first cut version, especially so now given what time it is.

[1] https://www.postgresql.org/message-id/CAKJS1f_r51OAPsN1oC4i36D7vznnihNk+1wiDFG0qRVb_eOKWg@mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> [ v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch ]

I took a quick look through this and I'm not very happy with it.
It seems to me that the premise ought to be "just use an Append
if we can prove the output would be ordered anyway", but that's not
what we actually have here: instead you're adding more infrastructure
onto Append, which notably involves invasive changes to the API of
create_append_path, which is the main reason why the patch keeps breaking.
(It's broken again as of HEAD, though the cfbot doesn't seem to have
noticed yet.)  Likewise there's a bunch of added complication in
cost_append, create_append_plan, etc.  I think you should remove all that
and restrict this optimization to the case where all the subpaths are
natively ordered --- if we have to insert Sorts, it's hardly going to move
the needle to worry about simplifying the parent MergeAppend to Append.

There also seem to be bits that duplicate functionality of the
drop-single-child-[Merge]Append patch (specifically I'm looking
at get_singleton_append_subpath).  Why do we need that?

The logic in build_partition_pathkeys is noticeably stupider than
build_index_pathkeys, in particular it's not bright about boolean columns.
Maybe that's fine, but if so it deserves a comment explaining why we're
not bothering.  Also, the comment for build_index_pathkeys specifies that
callers should do truncate_useless_pathkeys, which they do; why is that
not relevant here?

            regards, tom lane


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > [ v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch ]
>
> I think you should remove all that
> and restrict this optimization to the case where all the subpaths are
> natively ordered --- if we have to insert Sorts, it's hardly going to move
> the needle to worry about simplifying the parent MergeAppend to Append.

This can be a huge win for queries of the form "ORDER BY partkey LIMIT
x".  Even if the first subpath(s) aren't natively ordered, not all of
the sorts should actually be performed.
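A toy model of that claim (illustrative Python only, with invented
names): when an ordered Append sits under a LIMIT, subpaths are pulled
lazily in order, so a Sort attached to a subpath that is never reached
is never executed:

```python
# Sketch of lazy subpath consumption under a LIMIT.  Each subpath is a
# generator; its body (standing in for an expensive Sort) only runs if
# the executor actually pulls tuples from it.

def ordered_append_limit(subpath_iters, limit):
    out = []
    for it in subpath_iters:      # subpaths already in partition order
        for row in it:
            out.append(row)
            if len(out) == limit:
                return out        # later subpaths never started
    return out

executed = []

def subpath(name, rows):
    def gen():
        executed.append(name)     # records that this subpath's "Sort" ran
        yield from rows
    return gen()

result = ordered_append_limit(
    [subpath("p1", [1, 2]), subpath("p2_needs_sort", [3, 4])], limit=2)
```

Here the limit is satisfied by the first subpath alone, so the second
subpath's generator body never runs at all.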


Re: Ordered Partitioned Table Scans

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think you should remove all that
>> and restrict this optimization to the case where all the subpaths are
>> natively ordered --- if we have to insert Sorts, it's hardly going to move
>> the needle to worry about simplifying the parent MergeAppend to Append.

> This can be a huge win for queries of the form "ORDER BY partkey LIMIT
> x".  Even if the first subpath(s) aren't natively ordered, not all of
> the sorts should actually be performed.

[ shrug... ] We've got no realistic chance of estimating such situations
properly, so I'd have no confidence in a plan choice based on such a
thing.  Nor do I believe that this case is all that important.

            regards, tom lane


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I think you should remove all that
> >> and restrict this optimization to the case where all the subpaths are
> >> natively ordered --- if we have to insert Sorts, it's hardly going to move
> >> the needle to worry about simplifying the parent MergeAppend to Append.
>
> > This can be a huge win for queries of the form "ORDER BY partkey LIMIT
> > x".  Even if the first subpath(s) aren't natively ordered, not all of
> > the sorts should actually be performed.
>
> [ shrug... ] We've got no realistic chance of estimating such situations
> properly, so I'd have no confidence in a plan choice based on such a
> thing.

With all due respect, I'd say that's not even close to being true.

A MergeAppend's startup cost ends up set to the sum of all of its
subplans' startup costs, plus any Sorts that will be required if the
subpaths are not sufficiently ordered already.  An Append's startup cost
will just be the startup cost of its first subpath. This is possible
since, unlike MergeAppend, we don't need to pull the first tuple out
of each subnode to find the lowest one.  In Julien's case, such an
Append plan has the potential of weighing in massively cheaper than a
MergeAppend plan.  Just imagine some large sorts in some later
subpath.
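With some invented numbers (these are not real planner costs, just an
arithmetic sketch of the rule above), the difference in startup cost
looks like this:

```python
# MergeAppend must be able to return the lowest tuple from any subpath,
# so its startup cost sums every subpath's startup cost, including any
# Sorts; a natively ordered Append only pays the first subpath's startup.

subpaths = [
    {"startup": 0.5, "sort_startup": 0.0},    # natively ordered index scan
    {"startup": 0.5, "sort_startup": 0.0},
    {"startup": 2.0, "sort_startup": 950.0},  # later subpath needing a big sort
]

mergeappend_startup = sum(p["startup"] + p["sort_startup"] for p in subpaths)
append_startup = subpaths[0]["startup"] + subpaths[0]["sort_startup"]
```

One expensive sort in a late subpath inflates the MergeAppend's startup
cost, while the ordered Append's startup cost stays at that of its
first subpath, which is exactly what makes it attractive for
"ORDER BY partkey LIMIT x" queries.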

Can you explain why you think that's not properly being estimated in the patch?

> Nor do I believe that this case is all that important.

Can you explain why you believe that?

I see you were the author of b1577a7c78d which was committed over 19
years ago, so I'm surprised to hear you say cheap startup plans are
not important. Or is it that you just don't think they're important for
partitioned tables?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sat, 9 Mar 2019 at 09:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I took a quick look through this and I'm not very happy with it.
> It seems to me that the premise ought to be "just use an Append
> if we can prove the output would be ordered anyway", but that's not
> what we actually have here: instead you're adding more infrastructure
> onto Append, which notably involves invasive changes to the API of
> create_append_path, which is the main reason why the patch keeps breaking.

Can you suggest how else we could teach higher-level paths that an
Append is ordered by some pathkeys without giving the Append those
pathkeys?  That's what pathkeys are for, so I struggle to imagine how
else this could work.  If we don't do this, then how is a MergeJoin
going to know it does not need to sort before joining?

As for "the patch keeps breaking"...  those are just conflicts with
other changes that have been made in master. That seems fairly normal
to me.

> (It's broken again as of HEAD, though the cfbot doesn't seem to have
> noticed yet.)

I think it's not been updating itself for a few days.

>  Likewise there's a bunch of added complication in
> cost_append, create_append_plan, etc.  I think you should remove all that
> and restrict this optimization to the case where all the subpaths are
> natively ordered --- if we have to insert Sorts, it's hardly going to move
> the needle to worry about simplifying the parent MergeAppend to Append.

I think the patch would be less than half as useful if we do that.
Can you explain why you think that fast startup plans are less
important for partitioned tables?

I could perhaps understand an argument against this if the patch added
masses of complex code to achieve the goal, but in my opinion, the
code is fairly easy to understand and there's not very much extra code
added.

> There also seem to be bits that duplicate functionality of the
> drop-single-child-[Merge]Append patch (specifically I'm looking
> at get_singleton_append_subpath).  Why do we need that?

hmm, that patch is separate functionality. The patch you're talking
about, as you know, just removes Appends/MergeAppends that have a
single subpath.  Over here we add smarts to allow conversion of a
MergeAppend into an Append when the order of the partitions matches
the required order of the would-be MergeAppend path.

get_singleton_append_subpath() allows a sub-partitioned table's
MergeAppend or Append subpaths to be pulled up into the top-level
Append when they contain just a single subpath.

An example from the tests:

 Append
   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
   ->  Merge Append
         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def

If the nested MergeAppend path had just a single subpath then
get_singleton_append_subpath() would have pulled that subpath up into
the top-level Append.  However, in this example, since the MergeAppend
has multiple subpaths, the pull-up would be invalid: the top-level
Append can't guarantee the sort order of those MergeAppend subpaths.
In fact, the test directly after that one drops the mcrparted5_def
table, which turns the plan into:

 Append
   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a


> The logic in build_partition_pathkeys is noticeably stupider than
> build_index_pathkeys, in particular it's not bright about boolean columns.
> Maybe that's fine, but if so it deserves a comment explaining why we're
> not bothering.

Good point.  The boolean-column smarts are required to allow cases like:

SELECT * FROM parttable WHERE boolcol = true ORDER BY boolcol, ordercol;

I've fixed that in the attached.

>  Also, the comment for build_index_pathkeys specifies that
> callers should do truncate_useless_pathkeys, which they do; why is that
> not relevant here?

I've neglected to explain that in the comments.  The problem is that
doing so would break cases where we use an Append when the partition
keys are a prefix of the query's pathkeys.  Say we had a
range-partitioned table on (a,b) and an index on (a, b, c):

SELECT * FROM range_ab ORDER BY a, b, c;

With the current patch, we can use an Append for that as no earlier
value of (a,b) can come in a later partition.  If we
truncate_useless_pathkeys() then it'll strip out all pathkeys as the
partition pathkeys are not contained within the query pathkeys.
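
The property being relied on here, that concatenating the partitions in
bound order yields output sorted on the full index ordering, can be
checked with a little simulation (hypothetical bounds and data, not
taken from the patch or the regression tests):

```python
import itertools
import random

random.seed(0)
rows = [(random.randint(0, 9), random.randint(0, 9), random.randint(0, 9))
        for _ in range(200)]

# hypothetical partition lower bounds on (a, b), listed in bound order
bounds = [(0, 0), (3, 5), (7, 2), (10, 0)]

def partition_of(row):
    a, b, _c = row
    for i in range(len(bounds) - 1):
        if bounds[i] <= (a, b) < bounds[i + 1]:
            return i
    raise ValueError("row outside all partitions")

parts = [[] for _ in range(len(bounds) - 1)]
for r in rows:
    parts[partition_of(r)].append(r)

# each partition sorted on (a, b, c), then simply appended in bound order
appended = list(itertools.chain.from_iterable(sorted(p) for p in parts))
assert appended == sorted(rows)   # globally ordered on (a, b, c)
```

Since rows with equal (a,b) always land in the same partition, the
per-partition (a,b,c) sort refines the inter-partition (a,b) ordering,
so no earlier (a,b) value can appear in a later partition.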

Maybe the patch should perform truncate_useless_pathkeys for the check
where we see if the query pathkeys are contained in the partition's
pathkeys. However, we can't do it for the reverse check.

I've added a comment to explain about the lack of
truncate_useless_pathkeys() in the attached.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
I've attached an updated patch which fixes the conflict with 0a9d7e1f6d8

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Robert Haas
Date:
On Fri, Mar 8, 2019 at 3:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I took a quick look through this and I'm not very happy with it.
> It seems to me that the premise ought to be "just use an Append
> if we can prove the output would be ordered anyway", but that's not
> what we actually have here: instead you're adding more infrastructure
> onto Append, which notably involves invasive changes to the API of
> create_append_path, which is the main reason why the patch keeps breaking.
> (It's broken again as of HEAD, though the cfbot doesn't seem to have
> noticed yet.)  Likewise there's a bunch of added complication in
> cost_append, create_append_plan, etc.  I think you should remove all that
> and restrict this optimization to the case where all the subpaths are
> natively ordered --- if we have to insert Sorts, it's hardly going to move
> the needle to worry about simplifying the parent MergeAppend to Append.

Other people have already said that they don't think this is true; I
agree with those people.  Even if you have to sort *every* path,
sorting a bunch of reasonably large data sets individually is possibly
better than sorting all the data together, because (1) you can start
emitting rows sooner, (2) it might make you fit in memory instead of
having to spill to disk, and (3) O(n lg n) is supralinear.  Still, if
that were the only case this handled, I wouldn't be too excited,
because it seems at least plausible that lumping a bunch of small
partitions together and sorting it all at once could save some
start-up and tear-down costs vs. sorting them individually.  But it
isn't; the ability to consider that sort of plan is just a fringe
benefit.  If a substantial fraction of the partitions have indexes --
half, three-quarters, all-but-one -- sorting only the remaining ones
should win big.
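
Point (3) is easy to check numerically; a quick illustrative
calculation with made-up partition sizes:

```python
import math

def comparisons(n):
    # idealized n * lg(n) comparison count for sorting n rows
    return n * math.log2(n)

chunks = [100_000] * 10   # ten equal partitions, sizes chosen arbitrarily
# sorting the chunks separately does measurably less total work than
# sorting everything in one pass
assert sum(comparisons(n) for n in chunks) < comparisons(sum(chunks))
```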

Admittedly, I think this case is less common than it was a few years
ago, because with table inheritance one often ended up with a parent
table that was empty and had no indexes, so it produced a dummy
seqscan in every plan; that's gone with declarative partitioning.
Moreover, because of Alvaro's work on cascaded CREATE INDEX, people
are probably now more likely to have matching indexes on all the
partitions.  Still, it's not that hard to imagine a case where older
data that doesn't change much is more heavily indexed than tables that
are still suffering DML of whatever kind on a regular basis.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sat, 16 Mar 2019 at 04:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached an updated patch which fixes the conflict with 0a9d7e1f6d8

... and here's the one that I should have sent. (renamed to v12 to
prevent confusion)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I think you should remove all that
> >> and restrict this optimization to the case where all the subpaths are
> >> natively ordered --- if we have to insert Sorts, it's hardly going to move
> >> the needle to worry about simplifying the parent MergeAppend to Append.
>
> > This can be a huge win for queries of the form "ORDER BY partkey LIMIT
> > x".  Even if the first subpath(s) aren't natively ordered, not all of
> > the sorts should actually be performed.
>
> [ shrug... ] We've got no realistic chance of estimating such situations
> properly, so I'd have no confidence in a plan choice based on such a
> thing.  Nor do I believe that this case is all that important.

Hi Tom,

Wondering if you can provide more details on why you think there's no
realistic chance of the planner costing these cases correctly?   It
would be unfortunate to reject this patch based on a sentence that
starts with "[ shrug... ]", especially so when three people have stood
up and disagreed with you.

I've explained why I think you're wrong. Would you be able to explain
to me why you think I'm wrong?

You also mentioned that you didn't like the fact I'd changed the API
for create_append_plan().  Could you suggest why you think passing
pathkeys in is the wrong thing to do?  The Append path obviously needs
pathkeys so that upper paths know what order the path guarantees.
Passing pathkeys in allows us to verify that pathkeys are a valid
thing to have for the AppendPath.  They're not valid in the Parallel
Append case, for example, so setting them afterwards does not seem
like an improvement.  They also allow us to cost the cheaper startup
cost properly, however, you did seem to argue that you have no
confidence in cheap startup plans, which I'm still confused by.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> This can be a huge win for queries of the form "ORDER BY partkey LIMIT
>>> x".  Even if the first subpath(s) aren't natively ordered, not all of
>>> the sorts should actually be performed.

>> [ shrug... ] We've got no realistic chance of estimating such situations
>> properly, so I'd have no confidence in a plan choice based on such a
>> thing.  Nor do I believe that this case is all that important.

> Wondering if you can provide more details on why you think there's no
> realistic chance of the planner costing these cases correctly?

The reason why I'm skeptical about LIMIT with a plan of the form
append-some-sorts-together is that there are going to be large
discontinuities in the cost-vs-number-of-rows-returned graph,
ie, every time you finish one child plan and start the next one,
there'll be a hiccup while the sort happens.  This is something
that our plan cost approximation (startup cost followed by linear
output up to total cost) can't even represent.  Sticking a
LIMIT on top of that is certainly not going to lead to any useful
estimate of the actual cost, meaning that if you get a correct
plan choice it'll just be by luck, and if you don't there'll be
nothing to be done about it.
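
That stepwise behaviour can be put in a toy model (illustrative
numbers only, not the real cost formulas):

```python
def actual_cost(k, children):
    """Cost to fetch k rows from sorts-appended-together.
    children: list of (sort_startup_cost, per_row_cost, nrows)."""
    cost, remaining = 0.0, k
    for startup, per_row, nrows in children:
        if remaining <= 0:
            break
        cost += startup                 # the hiccup: the whole sort runs here
        fetched = min(remaining, nrows)
        cost += fetched * per_row
        remaining -= fetched
    return cost

def modelled_cost(k, startup, total, total_rows):
    # the planner's approximation: linear from startup cost to total cost
    return startup + (total - startup) * (k / total_rows)

children = [(0.0, 1.0, 100), (500.0, 1.0, 100)]   # second child needs a sort
total = actual_cost(200, children)                # 700.0
# Fetching row 101 really pays for the second child's entire sort...
assert actual_cost(101, children) == 601.0
# ...but the linear model sees barely more than half the total cost.
assert abs(modelled_cost(101, 0.0, total, 200) - 353.5) < 1e-9
```

The jump at each child boundary is exactly what the two-number
(startup, total) cost representation cannot express.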

If we don't incorporate that, then the situation that the planner
will have to model is a MergeAppend with possibly some sorts in
front of it, and it will correctly cost that as if all the sorts
happen before any output occurs, and so you can hope that reasonable
plan choices will ensue.

I believe that the cases where a LIMIT query actually ought to go
for a fast-start plan are where *all* the child rels have fast-start
(non-sort) paths --- which is exactly the cases we'd get if we only
allow "sorted" Appends when none of the inputs require a sort.
Imagining that we'll get good results in cases where some of them
need to be sorted is just wishful thinking.

> It would be unfortunate to reject this patch based on a sentence that
> starts with "[ shrug... ]", especially so when three people have stood
> up and disagreed with you.

I don't want to reject the patch altogether, I just want it to not
include a special hack to allow Append rather than MergeAppend in such
cases.  I am aware that I'm probably going to be shouted down on this
point, but that will not change my opinion that the shouters are wrong.

            regards, tom lane


Re: Ordered Partitioned Table Scans

From
Simon Riggs
Date:
On Fri, 22 Mar 2019 at 11:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> This can be a huge win for queries of the form "ORDER BY partkey LIMIT
> >>> x".  Even if the first subpath(s) aren't natively ordered, not all of
> >>> the sorts should actually be performed.
> 
> >> [ shrug... ] We've got no realistic chance of estimating such situations
> >> properly, so I'd have no confidence in a plan choice based on such a
> >> thing.  Nor do I believe that this case is all that important.
> 
> > Wondering if you can provide more details on why you think there's no
> > realistic chance of the planner costing these cases correctly?
> 
> The reason why I'm skeptical about LIMIT with a plan of the form
> append-some-sorts-together is that there are going to be large
> discontinuities in the cost-vs-number-of-rows-returned graph,
> ie, every time you finish one child plan and start the next one,
> there'll be a hiccup while the sort happens.  This is something
> that our plan cost approximation (startup cost followed by linear
> output up to total cost) can't even represent.  Sticking a
> LIMIT on top of that is certainly not going to lead to any useful
> estimate of the actual cost, meaning that if you get a correct
> plan choice it'll just be by luck, and if you don't there'll be
> nothing to be done about it.
> 
> If we don't incorporate that, then the situation that the planner
> will have to model is a MergeAppend with possibly some sorts in
> front of it, and it will correctly cost that as if all the sorts
> happen before any output occurs, and so you can hope that reasonable
> plan choices will ensue.
> 
> I believe that the cases where a LIMIT query actually ought to go
> for a fast-start plan are where *all* the child rels have fast-start
> (non-sort) paths --- which is exactly the cases we'd get if we only
> allow "sorted" Appends when none of the inputs require a sort.
> Imagining that we'll get good results in cases where some of them
> need to be sorted is just wishful thinking.
> 
> > It would be unfortunate to reject this patch based on a sentence that
> > starts with "[ shrug... ]", especially so when three people have stood
> > up and disagreed with you.
> 
> I don't want to reject the patch altogether, I just want it to not
> include a special hack to allow Append rather than MergeAppend in such
> cases.  I am aware that I'm probably going to be shouted down on this
> point, but that will not change my opinion that the shouters are wrong.

I agree that the issue of mixing sorts at various points will make nonsense of the startup cost/total cost results.

What you say about LIMIT is exactly right. But ISTM that LIMIT itself is the issue there and it needs more smarts to correctly calculate costs.

I don't see LIMIT costing being broken as a reason to restrict this optimization. I would ask that we allow improvements to the important use case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sat, 23 Mar 2019 at 04:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The reason why I'm skeptical about LIMIT with a plan of the form
> append-some-sorts-together is that there are going to be large
> discontinuities in the cost-vs-number-of-rows-returned graph,
> ie, every time you finish one child plan and start the next one,
> there'll be a hiccup while the sort happens.  This is something
> that our plan cost approximation (startup cost followed by linear
> output up to total cost) can't even represent.  Sticking a
> LIMIT on top of that is certainly not going to lead to any useful
> estimate of the actual cost, meaning that if you get a correct
> plan choice it'll just be by luck, and if you don't there'll be
> nothing to be done about it.

Thanks for explaining.  I see where you're coming from now.  I think
this point would carry more weight if using the Append instead of the
MergeAppend were worse in some cases as we could end up using an
inferior plan accidentally.  However, that's not the case. The Append
plan should always perform better both for startup and pulling a
single row all the way to pulling the final row.  The underlying
subplans are the same in each case, but Append has the additional
saving of not having to perform a sort on the top row from each
subpath.
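
Roughly, the executor-side difference being described is (a Python
analogy, not the real executor code):

```python
import heapq

def merge_append(subplans):
    # like MergeAppend: a binary heap decides which subplan's current
    # row is lowest, costing a heap operation per row returned
    return list(heapq.merge(*subplans))

def ordered_append(subplans):
    # like an ordered Append: exhaust each subplan in turn, no comparisons
    return [row for plan in subplans for row in plan]

# RANGE partitions scanned in bound order yield non-overlapping ordered runs
subplans = [[1, 2, 3], [4, 5], [6, 7, 8]]
assert ordered_append(subplans) == merge_append(subplans) == [1, 2, 3, 4, 5, 6, 7, 8]
```

When the subplans' ranges cannot overlap, both produce identical
output; Append simply does less work per row.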

I also think that cost-vs-number-of-rows-returned is not any worse
than a SeqScan where the required rows are unevenly distributed
throughout the table. In fact, the SeqScan case is much worse as we
could end up choosing that over an index scan, which could be
significantly better, but as mentioned above, and benchmarked in the
initial post of this thread, Append always wins over MergeAppend, so I
don't quite understand your reasoning here.  I could understand it if
Append needed the sorts but MergeAppend did not, but they both need
the sorts if there's not a path that already provides the required
ordering.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I agree that the issue of mixing sorts at various points will make nonsense
> of the startup cost/total cost results.

Right.

> I don't see LIMIT costing being broken as a reason to restrict this
> optimization. I would ask that we allow improvements to the important use
> case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.

There's not time to reinvent LIMIT costing for v12.  I'd be happy to
see some work done on that in the future, and when it does get done,
I'd be happy to see Append planning extended to allow this case.
I just don't think it's wise to ship one without the other.

            regards, tom lane


Re: Ordered Partitioned Table Scans

From
Simon Riggs
Date:
On Fri, 22 Mar 2019 at 11:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I agree that the issue of mixing sorts at various points will make nonsense
> > of the startup cost/total cost results.
> 
> Right.
> 
> > I don't see LIMIT costing being broken as a reason to restrict this
> > optimization. I would ask that we allow improvements to the important use
> > case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.
> 
> There's not time to reinvent LIMIT costing for v12.  I'd be happy to
> see some work done on that in the future, and when it does get done,
> I'd be happy to see Append planning extended to allow this case.
> I just don't think it's wise to ship one without the other.

I was hoping to motivate you to look at this personally, and soon. LIMIT is so broken that any improvements count as bug fixes in my book.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Ordered Partitioned Table Scans

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> Thanks for explaining.  I see where you're coming from now.  I think
> this point would carry more weight if using the Append instead of the
> MergeAppend were worse in some cases as we could end up using an
> inferior plan accidentally.  However, that's not the case. The Append
> plan should always perform better both for startup and pulling a
> single row all the way to pulling the final row.  The underlying
> subplans are the same in each case, but Append has the additional
> saving of not having to perform a sort on the top row from each
> subpath.

Uh, what?  sorted-Append and MergeAppend would need pre-sorts on
exactly the same set of children.  It's true that the Append path
might not have to actually execute some of those sorts, if it's
able to stop in an earlier child.  The problem here is basically
that it's hard to predict whether that will happen.

> Append always wins over MergeAppend, so I
> don't quite understand your reasoning here.

The problem is that the planner is likely to favor a "fast-start"
Append *too much*, and prefer it over some other plan altogether.

In cases where, say, the first child requires no sort but also doesn't
emit very many rows, while the second child requires an expensive sort,
the planner will have a ridiculously optimistic opinion of the cost of
fetching slightly more rows than are available from the first child.
This might lead it to wrongly choose a merge join over a hash for example.

Yes, there are cases where Append-with-some-sorts is preferable to
MergeAppend-with-some-sorts, and maybe I'd even believe that it
always is.  But I don't believe that it's necessarily preferable
to plans that don't require a sort at all, and I'm afraid that we
are likely to find the planner making seriously bad choices when
it's presented with such situations.  I'd rather we leave that
case out for now, until we have some better way of modelling it.

The fact that the patch also requires a lot of extra hacking just
to support this case badly doesn't make me any more favorably
disposed.

            regards, tom lane


Re: Ordered Partitioned Table Scans

From
Robert Haas
Date:
On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In cases where, say, the first child requires no sort but also doesn't
> emit very many rows, while the second child requires an expensive sort,
> the planner will have a ridiculously optimistic opinion of the cost of
> fetching slightly more rows than are available from the first child.
> This might lead it to wrongly choose a merge join over a hash for example.

I think this is very much a valid point, especially in view of the
fact that we already choose supposedly fast-start plans too often.  I
don't know whether it's a death sentence for this patch, but it should
at least make us stop and think hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Ordered Partitioned Table Scans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In cases where, say, the first child requires no sort but also doesn't
>> emit very many rows, while the second child requires an expensive sort,
>> the planner will have a ridiculously optimistic opinion of the cost of
>> fetching slightly more rows than are available from the first child.
>> This might lead it to wrongly choose a merge join over a hash for example.

> I think this is very much a valid point, especially in view of the
> fact that we already choose supposedly fast-start plans too often.  I
> don't know whether it's a death sentence for this patch, but it should
> at least make us stop and think hard.

Once again: this objection is not a "death sentence for this patch".
I simply wish to suppress the option to generate an ordered Append
when some of the inputs would require an added sort step.  As long
as we have pre-ordered paths for all children, go for it.

            regards, tom lane


Re: Ordered Partitioned Table Scans

From
Robert Haas
Date:
On Fri, Mar 22, 2019 at 12:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Once again: this objection is not a "death sentence for this patch".
> I simply wish to suppress the option to generate an ordered Append
> when some of the inputs would require an added sort step.  As long
> as we have pre-ordered paths for all children, go for it.

I stand corrected.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Ordered Partitioned Table Scans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In cases where, say, the first child requires no sort but also doesn't
>> emit very many rows, while the second child requires an expensive sort,
>> the planner will have a ridiculously optimistic opinion of the cost of
>> fetching slightly more rows than are available from the first child.
>> This might lead it to wrongly choose a merge join over a hash for example.

> I think this is very much a valid point, especially in view of the
> fact that we already choose supposedly fast-start plans too often.  I
> don't know whether it's a death sentence for this patch, but it should
> at least make us stop and think hard.

BTW, another thing we could possibly do to answer this objection is to
give the ordered-Append node an artificially pessimistic startup cost,
such as the sum or the max of its children's startup costs.  That's
pretty ugly and unprincipled, but maybe it's better than not having the
ability to generate the plan shape at all?
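
A sketch of that fudge, with toy numbers (nothing here is from the
patch; the child startup costs are invented):

```python
def first_child_startup(child_startups):
    # what an ordered Append would naturally advertise
    return child_startups[0]

def pessimistic_startup(child_startups, mode="max"):
    # the proposed fudge: advertise the max (or sum) of the children's
    # startup costs so a LIMIT cannot over-trust a fast-start Append
    return max(child_startups) if mode == "max" else sum(child_startups)

children = [1.0, 250.0, 5000.0]   # made-up; the last child needs a big sort
assert first_child_startup(children) == 1.0
assert pessimistic_startup(children, "max") == 5000.0
assert pessimistic_startup(children, "sum") == 5251.0
```

The "sum" variant would make the Append's startup cost match what a
MergeAppend over the same children would advertise.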

            regards, tom lane


Re: Ordered Partitioned Table Scans

From
Robert Haas
Date:
On Fri, Mar 22, 2019 at 12:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> In cases where, say, the first child requires no sort but also doesn't
> >> emit very many rows, while the second child requires an expensive sort,
> >> the planner will have a ridiculously optimistic opinion of the cost of
> >> fetching slightly more rows than are available from the first child.
> >> This might lead it to wrongly choose a merge join over a hash for example.
>
> > I think this is very much a valid point, especially in view of the
> > fact that we already choose supposedly fast-start plans too often.  I
> > don't know whether it's a death sentence for this patch, but it should
> > at least make us stop and think hard.
>
> BTW, another thing we could possibly do to answer this objection is to
> give the ordered-Append node an artificially pessimistic startup cost,
> such as the sum or the max of its children's startup costs.  That's
> pretty ugly and unprincipled, but maybe it's better than not having the
> ability to generate the plan shape at all?

Yeah, I'm not sure whether that's a good idea or not.  I think one of
the problems with a cost-based optimizer is that once you start
putting things in with the wrong cost because you think it will give
the right answer, you're sorta playing with fire, because you can't
necessarily predict how things are going to turn out in more
complex scenarios.  On the other hand, it may sometimes be the right
thing to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Fri, Mar 22, 2019 at 7:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Mar 22, 2019 at 12:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> In cases where, say, the first child requires no sort but also doesn't
> > >> emit very many rows, while the second child requires an expensive sort,
> > >> the planner will have a ridiculously optimistic opinion of the cost of
> > >> fetching slightly more rows than are available from the first child.
> > >> This might lead it to wrongly choose a merge join over a hash for example.
> >
> > > I think this is very much a valid point, especially in view of the
> > > fact that we already choose supposedly fast-start plans too often.  I
> > > don't know whether it's a death sentence for this patch, but it should
> > > at least make us stop and think hard.
> >
> > BTW, another thing we could possibly do to answer this objection is to
> > give the ordered-Append node an artificially pessimistic startup cost,
> > such as the sum or the max of its children's startup costs.  That's
> > pretty ugly and unprincipled, but maybe it's better than not having the
> > ability to generate the plan shape at all?
>
> Yeah, I'm not sure whether that's a good idea or not.  I think one of
> the problems with a cost-based optimizer is that once you start
> putting things in with the wrong cost because you think it will give
> the right answer, you're sorta playing with fire, because you can't
> necessarily predict how things are going are going to turn out in more
> complex scenarios.  On the other hand, it may sometimes be the right
> thing to do.

I've been bitten too many times by super-inefficient plans of the
form "let's use the wrong index instead of the good one because I'll
probably find the tuple I want there very quickly", due to LIMIT
assuming an even distribution.  Since those queries can end up taking
dozens of minutes instead of less than a millisecond, without a lot
of control to fix this kind of problem I definitely don't want to
introduce another similar source of pain for users.

However, what we're talking here is still a corner case.  People
having partitioned tables with a mix of partitions with and without
indexes suitable for ORDER BY x LIMIT y queries should already have at
best average performance.  And trying to handle this case cannot hurt
cases where all partitions have suitable indexes, so that may be an
acceptable risk?

I also have mixed feelings about this artificial startup cost
penalty, but if we go this way we can at least accumulate the startup
cost of all the sorts that we think we'll have to perform (according
to each path's estimated rows and the given limit_tuples).  That
probably won't be enough though, especially with fractional paths.
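
A sketch of that idea (illustrative numbers; the limit_tuples handling
is deliberately simplified and ignores fractional paths):

```python
def expected_sort_startup(children, limit_tuples):
    """Sum the sort startup costs of only those children we expect to
    reach before limit_tuples rows have been produced.
    children: list of (sort_startup_cost, estimated_rows)."""
    charged, produced = 0.0, 0
    for startup, rows in children:
        if produced >= limit_tuples:
            break                       # this child's sort should never run
        charged += startup
        produced += rows
    return charged

children = [(0.0, 50), (300.0, 50), (900.0, 50)]
assert expected_sort_startup(children, 40) == 0.0      # first child suffices
assert expected_sort_startup(children, 60) == 300.0    # second sort runs
assert expected_sort_startup(children, 120) == 1200.0  # all three reached
```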


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sat, 23 Mar 2019 at 04:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > Append has the additional
> > saving of not having to perform a sort on the top row from each
> > subpath.
>
> Uh, what?  sorted-Append and MergeAppend would need pre-sorts on
> exactly the same set of children.

I was talking about the binary heap code that MergeAppend uses to
decide which subplan to pull from next.

> In cases where, say, the first child requires no sort but also doesn't
> emit very many rows, while the second child requires an expensive sort,
> the planner will have a ridiculously optimistic opinion of the cost of
> fetching slightly more rows than are available from the first child.
> This might lead it to wrongly choose a merge join over a hash for example.

umm.. yeah, that's a good point.  I seem to have failed to consider
that the fast-startup plan could lower the cost of a merge join with
a limit.  I agree with that concern. I also find it slightly annoying
since we already build other plan shapes that can suffer from similar
problems, e.g. Index Scan + Filter + Limit, but I agree we don't need
any more of those as they're pretty painful when they hit.

I'll change the patch around to pull out the code you've mentioned.

Thanks for spelling out your point to me.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, another thing we could possibly do to answer this objection is to
> give the ordered-Append node an artificially pessimistic startup cost,
> such as the sum or the max of its children's startup costs.  That's
> pretty ugly and unprincipled, but maybe it's better than not having the
> ability to generate the plan shape at all?

I admit to having thought of that while trying to get to sleep last
night, but I was too scared to even suggest it.  It's pretty much how
MergeAppend would cost it anyway.  I agree it's not pretty to lie
about the startup cost, but it does kinda seem silly to fall back on a
more expensive MergeAppend when we know fine well Append is cheaper.
Probably the danger would be that someone pulls it out thinking it's a
bug, so we'd need to clearly comment why we're doing it.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, another thing we could possibly do to answer this objection is to
>> give the ordered-Append node an artificially pessimistic startup cost,
>> such as the sum or the max of its children's startup costs.  That's
>> pretty ugly and unprincipled, but maybe it's better than not having the
>> ability to generate the plan shape at all?

> I admit to having thought of that while trying to get to sleep last
> night, but I was too scared to even suggest it.  It's pretty much how
> MergeAppend would cost it anyway.  I agree it's not pretty to lie
> about the startup cost, but it does kinda seem silly to fall back on a
> more expensive MergeAppend when we know fine well Append is cheaper.

Yeah.  I'm starting to think that this might actually be the way to go,
and here's why: my argument here is basically that a child plan that
has a large startup cost is going to screw up our ability to estimate
whether the parent Append is really a fast-start plan or not.  Now, if
we have to insert a Sort to make a child plan be correctly ordered, that
clearly is a case where the child could have a large startup cost ...
but what if a correctly-ordered child has a large startup cost for
some other reason?  Simply refusing to insert Sort nodes won't keep us
out of the weeds if that's true.  However, if we stick in a hack like
the one suggested above, that will keep us from being too optimistic
about the fast-start properties of the Append node no matter whether
the problem arises from an added Sort node or is intrinsic to the
child plan.
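A minimal sketch of the heuristic under discussion (Python, purely
illustrative; the function name and cost representation are made up,
not the patch's actual API):

```python
def append_startup_cost(child_costs, pessimistic=True):
    # child_costs: (startup_cost, total_cost) per child path, in order.
    # An ordered Append really only pays the first child's startup cost,
    # but taking the max (or sum) of all children's startup costs keeps
    # the planner from being over-optimistic about fast-start behaviour
    # when a later child hides an expensive Sort.
    startups = [startup for startup, _ in child_costs]
    return max(startups) if pessimistic else startups[0]

# The second child needs an expensive Sort; the "honest" startup cost
# of the Append would be 0, which is what misleads upper plan nodes.
children = [(0.0, 50.0), (1000.0, 1200.0), (0.0, 60.0)]
assert append_startup_cost(children) == 1000.0
assert append_startup_cost(children, pessimistic=False) == 0.0
```

The pessimistic value deliberately over-charges the common case to
avoid the ridiculously optimistic LIMIT estimates described above.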

It may well be that as things stand today, this scenario is only
hypothetical, because we can only prove that a plain-Append plan
is correctly sorted if it's arising from a suitably partitioned table,
and the child plans in such cases will all be IndexScans with
minimal startup cost.  But we should look ahead to scenarios where
that's not true.  (mumble maybe a foreign table as a partition
is already a counterexample? mumble)

            regards, tom lane


Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Wed, Dec 19, 2018 at 3:01 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Thu, 20 Dec 2018 at 01:58, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > The multi-level partitioning case is another
> > thing that would need to be handled for instance (and that's the main
> > reason I couldn't submit a new patch when I was working on it), and
> > I'm definitely not arguing to cover it in this patch.
>
> As far as I'm aware, the multi-level partitioning should work just
> fine with the current patch. I added code for that a while ago. There
> are regression tests to exercise it. I'm not aware of any cases where
> it does not work.

Sorry to come back to this so late.  What I was mentioning about
sub-partitioning is that when a whole partition hierarchy is natively
ordered, we could avoid generating merge appends.  But unless I'm
missing something about your patch, that won't happen.

Considering

CREATE TABLE nested (id1 integer, id2 integer, val text) PARTITION BY
LIST (id1);

CREATE TABLE nested_1 PARTITION OF nested FOR VALUES IN (1) PARTITION
BY RANGE (id2);
CREATE TABLE nested_1_1 PARTITION OF nested_1 FOR VALUES FROM (1) TO (100000);
CREATE TABLE nested_1_2 PARTITION OF nested_1 FOR VALUES FROM (100000)
TO (200000);
CREATE TABLE nested_1_3 PARTITION OF nested_1 FOR VALUES FROM (200000)
TO (300000);

CREATE TABLE nested_2 PARTITION OF nested FOR VALUES IN (2) PARTITION
BY RANGE (id2);
CREATE TABLE nested_2_1 PARTITION OF nested_2 FOR VALUES FROM (1) TO (100000);
CREATE TABLE nested_2_2 PARTITION OF nested_2 FOR VALUES FROM (100000)
TO (200000);
CREATE TABLE nested_2_3 PARTITION OF nested_2 FOR VALUES FROM (200000)
TO (300000);

CREATE INDEX ON nested(id1, id2);

ISTM that a query like
SELECT * FROM nested ORDER BY 1, 2;
could simply append all the partitions in the right order (or generate
a tree of ordered appends), but:

                            QUERY PLAN
-------------------------------------------------------------------
 Append
   ->  Merge Append
         Sort Key: nested_1_1.id1, nested_1_1.id2
         ->  Index Scan using nested_1_1_id1_id2_idx on nested_1_1
         ->  Index Scan using nested_1_2_id1_id2_idx on nested_1_2
         ->  Index Scan using nested_1_3_id1_id2_idx on nested_1_3
   ->  Merge Append
         Sort Key: nested_2_1.id1, nested_2_1.id2
         ->  Index Scan using nested_2_1_id1_id2_idx on nested_2_1
         ->  Index Scan using nested_2_2_id1_id2_idx on nested_2_2
         ->  Index Scan using nested_2_3_id1_id2_idx on nested_2_3
(11 rows)


Also, a query like
SELECT * FROM nested_1 ORDER BY 1, 2;
could generate an append path, since the first column is guaranteed to
be identical in all partitions, but instead:

                         QUERY PLAN
-------------------------------------------------------------
 Merge Append
   Sort Key: nested_1_1.id1, nested_1_1.id2
   ->  Index Scan using nested_1_1_id1_id2_idx on nested_1_1
   ->  Index Scan using nested_1_2_id1_id2_idx on nested_1_2
   ->  Index Scan using nested_1_3_id1_id2_idx on nested_1_3
(5 rows)

and of course

# EXPLAIN (costs off) SELECT * FROM nested_1 ORDER BY 2;
             QUERY PLAN
------------------------------------
 Sort
   Sort Key: nested_1_1.id2
   ->  Append
         ->  Seq Scan on nested_1_1
         ->  Seq Scan on nested_1_2
         ->  Seq Scan on nested_1_3
(6 rows)
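A sketch of why plain concatenation would suffice for the first query
(Python, not PostgreSQL code; the nested tuples just mimic the
LIST-then-RANGE hierarchy defined above):

```python
def flatten_ordered(node):
    # Recursively flatten a partition tree into its leaf scans, keeping
    # each level's children in bound order.  When the sort keys follow
    # the partition keys level by level, concatenating the leaves yields
    # globally ordered output -- no Merge Append needed.
    if isinstance(node, list):      # a leaf partition: its sorted rows
        return [node]
    return [leaf for child in node for leaf in flatten_ordered(child)]

# LIST (id1) -> RANGE (id2); rows are (id1, id2) pairs.
nested = (
    ([(1, 10), (1, 20)], [(1, 150000)], [(1, 250000)]),   # nested_1_*
    ([(2, 5)], [(2, 100001)], [(2, 299999)]),             # nested_2_*
)
rows = [row for leaf in flatten_ordered(nested) for row in leaf]
assert rows == sorted(rows)       # ORDER BY id1, id2 holds
```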

I admit that I didn't re-read the whole thread, so maybe I'm missing
something (if that's the case my apologies, and feel free to point me
to any relevant discussion).  I'm just trying to make sure that we
don't miss some cases, as those seem possible and useful to handle.
Or is that out of scope?


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sun, 24 Mar 2019 at 05:16, Julien Rouhaud <rjuju123@gmail.com> wrote:
> ISTM that a query like
> SELECT * FROM nested ORDER BY 1, 2;
> could simply append all the partitions in the right order (or generate
> a tree of ordered appends), but:
>
>                             QUERY PLAN
> -------------------------------------------------------------------
>  Append
>    ->  Merge Append
>          Sort Key: nested_1_1.id1, nested_1_1.id2
>          ->  Index Scan using nested_1_1_id1_id2_idx on nested_1_1
>          ->  Index Scan using nested_1_2_id1_id2_idx on nested_1_2
>          ->  Index Scan using nested_1_3_id1_id2_idx on nested_1_3
>    ->  Merge Append
>          Sort Key: nested_2_1.id1, nested_2_1.id2
>          ->  Index Scan using nested_2_1_id1_id2_idx on nested_2_1
>          ->  Index Scan using nested_2_2_id1_id2_idx on nested_2_2
>          ->  Index Scan using nested_2_3_id1_id2_idx on nested_2_3
> (11 rows)
>
>
> Also, a query like
> SELECT * FROM nested_1 ORDER BY 1, 2;
> could generate an append path, since the first column is guaranteed to
> be identical in all partitions, but instead:
>
>                          QUERY PLAN
> -------------------------------------------------------------
>  Merge Append
>    Sort Key: nested_1_1.id1, nested_1_1.id2
>    ->  Index Scan using nested_1_1_id1_id2_idx on nested_1_1
>    ->  Index Scan using nested_1_2_id1_id2_idx on nested_1_2
>    ->  Index Scan using nested_1_3_id1_id2_idx on nested_1_3
> (5 rows)
>
> and of course
>
> # EXPLAIN (costs off) SELECT * FROM nested_1 ORDER BY 2;
>              QUERY PLAN
> ------------------------------------
>  Sort
>    Sort Key: nested_1_1.id2
>    ->  Append
>          ->  Seq Scan on nested_1_1
>          ->  Seq Scan on nested_1_2
>          ->  Seq Scan on nested_1_3
> (6 rows)

I think both these cases could be handled, but I think the way it
would likely have to be done would be to run the partition constraints
through equivalence class processing. Likely doing that would need
some new field in EquivalenceClass that indicated that the eclass did
not need to be applied to the partition.  If it was done that way then
pathkey_is_redundant() would be true for the id1 column's pathkey in
the sub-partitioned tables. The last plan you show above could also
use an index scan too since build_index_pathkeys() would also find the
pathkey redundant.  Doing this would also mean that a query like
select * from nested_1_1 where id2 = 1; would not apply "id2 = 1" as a
base qual to the partition.  That's good for 2 reasons: 1) no wasted
effort filtering rows that always match; and 2) a Seq Scan can be used
instead of the planner possibly thinking that an index scan might be
useful to filter rows.  Stats might tell the planner that anyway...
but...

I suggested some changes to equivalence classes a few years ago in [1]
but failed to get that idea off the ground.  In some ways this is
similar, as it requires having equivalence classes that are not used
in all cases.  I think trying to get something like that working a
week before code cutoff is a step too far, but certainly, it would be
interesting to look into fixing it in a later release.

[1] https://www.postgresql.org/message-id/flat/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sat, 23 Mar 2019 at 19:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> BTW, another thing we could possibly do to answer this objection is to
> >> give the ordered-Append node an artificially pessimistic startup cost,
> >> such as the sum or the max of its children's startup costs.  That's
> >> pretty ugly and unprincipled, but maybe it's better than not having the
> >> ability to generate the plan shape at all?
>
> > I admit to having thought of that while trying to get to sleep last
> > night, but I was too scared to even suggest it.  It's pretty much how
> > MergeAppend would cost it anyway.  I agree it's not pretty to lie
> > about the startup cost, but it does kinda seem silly to fall back on a
> > more expensive MergeAppend when we know fine well Append is cheaper.
>
> Yeah.  I'm starting to think that this might actually be the way to go,

Here's a version with it done that way.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Sun, Mar 24, 2019 at 11:06 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Sat, 23 Mar 2019 at 19:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > David Rowley <david.rowley@2ndquadrant.com> writes:
> > > On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> BTW, another thing we could possibly do to answer this objection is to
> > >> give the ordered-Append node an artificially pessimistic startup cost,
> > >> such as the sum or the max of its children's startup costs.  That's
> > >> pretty ugly and unprincipled, but maybe it's better than not having the
> > >> ability to generate the plan shape at all?
> >
> > > I admit to having thought of that while trying to get to sleep last
> > > night, but I was too scared to even suggest it.  It's pretty much how
> > > MergeAppend would cost it anyway.  I agree it's not pretty to lie
> > > about the startup cost, but it does kinda seem silly to fall back on a
> > > more expensive MergeAppend when we know fine well Append is cheaper.
> >
> > Yeah.  I'm starting to think that this might actually be the way to go,
>
> Here's a version with it done that way.

FTR this patch doesn't apply since single child [Merge]Append
suppression (8edd0e7946) has been pushed.


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Tue, 26 Mar 2019 at 09:02, Julien Rouhaud <rjuju123@gmail.com> wrote:
> FTR this patch doesn't apply since single child [Merge]Append
> suppression (8edd0e7946) has been pushed.

Thanks for letting me know.  I've attached v14 based on current master.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Tue, Mar 26, 2019 at 3:13 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Tue, 26 Mar 2019 at 09:02, Julien Rouhaud <rjuju123@gmail.com> wrote:
> > FTR this patch doesn't apply since single child [Merge]Append
> > suppression (8edd0e7946) has been pushed.
>
> Thanks for letting me know.  I've attached v14 based on current master.

Thanks!

So, AFAICT everything works as intended.  I don't see any problem in
the code, and the special costing heuristic should avoid dramatically
bad plans.

A few, mostly nitpicking, comments:

+   if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+       partitions_are_ordered(root, rel))

shouldn't the test be IS_PARTITIONED_REL(rel) instead of testing
part_scheme?  I'm thinking of 1d33858406 and related discussions.

+ * partitions_are_ordered
+ *     For the partitioned table given in 'partrel', returns true if the
+ *     partitioned table guarantees that tuples which sort earlier according
+ *     to the partition bound are stored in an earlier partition.  Returns
+ *     false this is not possible, or if we have insufficient means to prove
+ *     it.
[...]
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then we needn't take the key into consideration
+ * when checking if scanning partitions in order can't cause lower-order
+ * values to appear in later partitions.

Maybe it's because I'm not a native English speaker, but I had to read
those comments multiple times.  I'd also add to the
partitions_are_ordered() comments a note about the default partition
(even though the function is super short).

+           if (boundinfo->ndatums +
partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+               return false;

there are a few overly long lines; maybe a pgindent run would be useful.

+ * build_partition_pathkeys
+ *   Build a pathkeys list that describes the ordering induced by the
+ *   partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *   table guarantees that lower order tuples never will be found in a
+ *   later partition.).  Sets *partialkeys to false if pathkeys were only
+ *   built for a prefix of the partition key, otherwise sets it to true.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+                        ScanDirection scandir, bool *partialkeys)

Maybe also add an assert that partitions_are_ordered() holds?


And finally, should this optimisation be mentioned in ddl.sgml (or
somewhere else)?


Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Wed, 27 Mar 2019 at 19:48, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Sorry if this was discussed before, but why does this patch add any new
> code to partprune.c?  AFAICT, there's no functionality changes to the
> pruning code.

You're right. It probably shouldn't be there.  There's a bit of a lack
of a good home for partition code relating to the planner it seems.

> seem like their logic is specialized enough to be confined to pathkeys.c,
> only because it's needed there.

Yeah maybe.

> Regarding
>
> +bool
> +partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
>
> I think this could simply be:
>
> bool
> partitions_are_ordered(PartitionBoundInfo *boundinfo)
>
> and be defined in partitioning/partbounds.c.  If you think any future
> modifications to this will require access to the partition key info in
> PartitionScheme, maybe the following is fine:
>
> bool
> partitions_are_ordered(RelOptInfo *partrel)

It does need to know how many partitions the partitioned table has,
which it gets from partrel->nparts, so yeah, RelOptInfo is probably
needed. I don't think passing in int nparts is a good solution to
that.  The problem with moving it to partbounds.c is that nothing
there knows about RelOptInfo currently.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Wed, 27 Mar 2019 at 21:24, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Noticed a typo.
>
> +                 * multiple subpaths then we can't make guarantees about the
> +                 * order tuples in those subpaths, so we must leave the
>
> order of tuples?

I'll fix that. Thanks.

> Again, sorry if this was discussed, but I got curious about why
> partitions_are_ordered() thinks it can say true even for an otherwise
> ordered list-partitioned table, but containing a null-only partition,
> which is *always* scanned last.  If a query says ORDER BY partkey NULLS
> FIRST, then it's not alright to proceed with assuming partitions are
> ordered even if partitions_are_ordered() said so.

If it's *always* scanned last then it's fine for ORDER BY partkey
NULLS LAST.  If they have ORDER BY partkey NULLS FIRST then we won't
match on the pathkeys.

Or if they do ORDER BY partkey DESC NULLS FIRST, then we're also fine,
since we reverse the order of the subpaths list in that case.  ORDER
BY partkey DESC NULLS LAST is not okay, and we don't optimise that
since it won't match the pathkeys we generate in
build_partition_pathkeys().
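The four cases above reduce to a simple rule.  Sketching it in Python
(illustrative only; the function and parameter names are made up, and
"NULLs stored in the last partition" is an assumption for this sketch):

```python
def ordered_append_applies(desc, nulls_first, nulls_in_last_partition=True):
    # A hierarchy storing NULLs in the last partition yields pathkeys
    # matching "partkey ASC NULLS LAST" (forward scan) or, with the
    # subpath list reversed, "partkey DESC NULLS FIRST" (backward scan).
    # The other two orderings don't match the generated pathkeys.
    if not nulls_in_last_partition:
        return False
    return desc == nulls_first

assert ordered_append_applies(desc=False, nulls_first=False)      # ASC NULLS LAST
assert ordered_append_applies(desc=True, nulls_first=True)        # DESC NULLS FIRST
assert not ordered_append_applies(desc=False, nulls_first=True)   # ASC NULLS FIRST
assert not ordered_append_applies(desc=True, nulls_first=False)   # DESC NULLS LAST
```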

> Related, why does build_partition_pathkeys() pass what it does for
> nulls_first parameter of make_pathkey_from_sortinfo()?
>
>  cpathkey = make_pathkey_from_sortinfo(root,
>                                        keyCol,
>                                        NULL,
>                                        partscheme->partopfamily[i],
>                                        partscheme->partopcintype[i],
>                                        partscheme->partcollation[i],
>                                        ScanDirectionIsBackward(scandir),
>                                  ==>   ScanDirectionIsBackward(scandir),
>                                        0,
>                                        partrel->relids,
>                                        false);
>
> I think null values are almost always placed in the last partition, unless
> the null-accepting list partition also accepts some other non-null value.
> I'm not sure exactly how we can determine the correct value to pass here,
> but what's there in the patch now doesn't seem to be it.

The code looks okay to me. It'll generate pathkeys like ORDER BY
partkey NULLS LAST for forward scans and ORDER BY partkey DESC NULLS
FIRST for backwards scans.

Can you explain what cases you think the code gets wrong?

Here's a preview of the actual and expected behaviour:

# explain (costs off) select * from listp order by a asc nulls last;
                         QUERY PLAN
------------------------------------------------------------
 Append
   ->  Index Only Scan using listp1_a_idx on listp1
   ->  Index Only Scan using listp2_a_idx on listp2
   ->  Index Only Scan using listp_null_a_idx on listp_null
(4 rows)


# explain (costs off) select * from listp order by a asc nulls first;
-- not optimised
             QUERY PLAN
------------------------------------
 Sort
   Sort Key: listp1.a NULLS FIRST
   ->  Append
         ->  Seq Scan on listp1
         ->  Seq Scan on listp2
         ->  Seq Scan on listp_null
(6 rows)

# explain (costs off) select * from listp order by a desc nulls first;
-- subpath list is simply reversed in this case.
                             QUERY PLAN
---------------------------------------------------------------------
 Append
   ->  Index Only Scan Backward using listp_null_a_idx on listp_null
   ->  Index Only Scan Backward using listp2_a_idx on listp2
   ->  Index Only Scan Backward using listp1_a_idx on listp1
(4 rows)


# explain (costs off) select * from listp order by a desc nulls last;
-- not optimised
              QUERY PLAN
--------------------------------------
 Sort
   Sort Key: listp1.a DESC NULLS LAST
   ->  Append
         ->  Seq Scan on listp1
         ->  Seq Scan on listp2
         ->  Seq Scan on listp_null
(6 rows)

We could likely improve the two cases that are not optimized by
putting the NULL partition in the correct place in the append
subpaths, but for now, we don't really have an efficient means to
identify which subpath that is.  I've not looked at your partition
planning improvements patch for a while to see if you're storing a
Bitmapset of the non-pruned partitions in RelOptInfo.  Something like
that would allow us to make this better.  Julien and I have talked
about other possible cases to optimise if we have that. e.g if the
default partition is pruned then we can optimise a RANGE partitioned
table with a default. So there's definitely more to be done on this. I
think there's a general consensus that what we're doing in the patch
already is enough to be useful.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Ordered Partitioned Table Scans

From
Amit Langote
Date:
Hi,

On 2019/03/28 7:29, David Rowley wrote:
> On Wed, 27 Mar 2019 at 19:48, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Sorry if this was discussed before, but why does this patch add any new
>> code to partprune.c?  AFAICT, there's no functionality changes to the
>> pruning code.
> 
> You're right. It probably shouldn't be there.  There's a bit of a lack
> of a good home for partition code relating to the planner it seems.

partprune.c is outside the optimizer sub-directory, but exports
planning-related functions like prune_append_rel_partitions(),
make_partition_pruneinfo(), etc.

Similarly, partbounds.c can grow a bit of planning-related
functionality.

>> Regarding
>>
>> +bool
>> +partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
>>
>> I think this could simply be:
>>
>> bool
>> partitions_are_ordered(PartitionBoundInfo *boundinfo)
>>
>> and be defined in partitioning/partbounds.c.  If you think any future
>> modifications to this will require access to the partition key info in
>> PartitionScheme, maybe the following is fine:
>>
>> bool
>> partitions_are_ordered(RelOptInfo *partrel)
> 
> It does need to know how many partitions the partitioned table has,
> which it gets from partrel->nparts, so yeah, RelOptInfo is probably
> needed. I don't think passing in int nparts is a good solution to
> that.  The problem with moving it to partbounds.c is that nothing
> there knows about RelOptInfo currently.

Maybe, this could be a start.  Also, there is a patch in nearby thread
which adds additional functionality to partbounds.c to be used by
partitionwise join code in the optimizer [1].

Thanks,
Amit

[1] https://commitfest.postgresql.org/22/1553/




Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Thu, 28 Mar 2019 at 14:34, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2019/03/28 7:29, David Rowley wrote:
> > On Wed, 27 Mar 2019 at 19:48, Amit Langote
> > It does need to know how many partitions the partitioned table has,
> > which it gets from partrel->nparts, so yeah, RelOptInfo is probably
> > needed. I don't think passing in int nparts is a good solution to
> > that.  The problem with moving it to partbounds.c is that nothing
> > there knows about RelOptInfo currently.
>
> Maybe, this could be a start.  Also, there is a patch in nearby thread
> which adds additional functionality to partbounds.c to be used by
> partitionwise join code in the optimizer [1].

Thanks for the review.  I've attached a patch that mostly just moved
the code around.

I also changed the comment in build_partition_pathkeys() to explain
the nulls_first argument and why I just pass
ScanDirectionIsBackward(scandir).

I also added another comment in struct PlannerInfo mentioning the
guarantee that append_rel_list keeps partitions in the same order as
they're defined in the PartitionDesc.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Thu, 28 Mar 2019 at 15:40, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Thanks for the review.  I've attached a patch that mostly just moved
> the code around.

Here's one that fixes up the compiler warning from the last one.
Thanks CF bot...

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Amit Langote
Date:
Hi David,

On 2019/03/28 8:04, David Rowley wrote:
> On Wed, 27 Mar 2019 at 21:24, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Noticed a typo.
>>
>> +                 * multiple subpaths then we can't make guarantees about the
>> +                 * order tuples in those subpaths, so we must leave the
>>
>> order of tuples?
> 
> I'll fix that. Thanks.
> 
>> Again, sorry if this was discussed, but I got curious about why
>> partitions_are_ordered() thinks it can say true even for an otherwise
>> ordered list-partitioned table, but containing a null-only partition,
>> which is *always* scanned last.  If a query says ORDER BY partkey NULLS
>> FIRST, then it's not alright to proceed with assuming partitions are
>> ordered even if partitions_are_ordered() said so.
> 
> If it's *always* scanned last then it's fine for ORDER BY partkey
> NULLS LAST.  If they have ORDER BY partkey NULLS FIRST then we won't
> match on the pathkeys.

Sorry, I had meant to say that null values may or may not appear in the
last partition depending on how the null-accepting partition is defined.
I see that the code in partitions_are_ordered() correctly returns false if
null partition is not the last one, for example, for:

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2_null partition of p for values in (2, null);
create table p3 partition of p for values in (3);

Maybe, a small comment regarding how that works correctly would be nice.

> Or if they do ORDER BY partkey DESC NULLS FIRST, then we're also fine,
> since we reverse the order of the subpaths list in that case.  ORDER
> BY partkey DESC NULLS LAST is not okay, and we don't optimise that
> since it won't match the pathkeys we generate in
> build_partition_pathkeys().

OK.

>> Related, why does build_partition_pathkeys() pass what it does for
>> nulls_first parameter of make_pathkey_from_sortinfo()?
>>
>>  cpathkey = make_pathkey_from_sortinfo(root,
>>                                        keyCol,
>>                                        NULL,
>>                                        partscheme->partopfamily[i],
>>                                        partscheme->partopcintype[i],
>>                                        partscheme->partcollation[i],
>>                                        ScanDirectionIsBackward(scandir),
>>                                  ==>   ScanDirectionIsBackward(scandir),
>>                                        0,
>>                                        partrel->relids,
>>                                        false);
>>
>> I think null values are almost always placed in the last partition, unless
>> the null-accepting list partition also accepts some other non-null value.
>> I'm not sure exactly how we can determine the correct value to pass here,
>> but what's there in the patch now doesn't seem to be it.
> 
> The code looks okay to me. It'll generate pathkeys like ORDER BY
> partkey NULLS LAST for forward scans and ORDER BY partkey DESC NULLS
> FIRST for backwards scans.
> 
> Can you explain what cases you think the code gets wrong?
> 
> Here's a preview of the actual and expected behaviour:
> 
> # explain (costs off) select * from listp order by a asc nulls last;
>                          QUERY PLAN
> ------------------------------------------------------------
>  Append
>    ->  Index Only Scan using listp1_a_idx on listp1
>    ->  Index Only Scan using listp2_a_idx on listp2
>    ->  Index Only Scan using listp_null_a_idx on listp_null
> (4 rows)
> 
> 
> # explain (costs off) select * from listp order by a asc nulls first;
> -- not optimised
>              QUERY PLAN
> ------------------------------------
>  Sort
>    Sort Key: listp1.a NULLS FIRST
>    ->  Append
>          ->  Seq Scan on listp1
>          ->  Seq Scan on listp2
>          ->  Seq Scan on listp_null
> (6 rows)
> 
> # explain (costs off) select * from listp order by a desc nulls first;
> -- subpath list is simply reversed in this case.
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  Append
>    ->  Index Only Scan Backward using listp_null_a_idx on listp_null
>    ->  Index Only Scan Backward using listp2_a_idx on listp2
>    ->  Index Only Scan Backward using listp1_a_idx on listp1
> (4 rows)
> 
> 
> # explain (costs off) select * from listp order by a desc nulls last;
> -- not optimised
>               QUERY PLAN
> --------------------------------------
>  Sort
>    Sort Key: listp1.a DESC NULLS LAST
>    ->  Append
>          ->  Seq Scan on listp1
>          ->  Seq Scan on listp2
>          ->  Seq Scan on listp_null
> (6 rows)

Ah, everything seems to be working correctly.  Thanks for the explanation
and sorry about the noise.

> We could likely improve the two cases that are not optimized by
> putting the NULL partition in the correct place in the append
> subpaths, but for now, we don't really have an efficient means to
> identify which subpath that is.  I've not looked at your partition
> planning improvements patch for a while to see if you're storing a
> Bitmapset of the non-pruned partitions in RelOptInfo.  Something like
> that would allow us to make this better.  Julien and I have talked
> about other possible cases to optimise if we have that. e.g if the
> default partition is pruned then we can optimise a RANGE partitioned
> table with a default. So there's definitely more to be done on this. I
> think there's a general consensus that what we're doing in the patch
> already is enough to be useful.

Certainly.  When trying out various combinations of ORDER BY ASC/DESC
NULLS FIRST/LAST yesterday, I wrongly thought the plan came out wrong
in one or two cases, but now I see that that's not the case.

Also, the comment you added in the latest patch sheds some light on the
matter, so that helps too.

Thanks,
Amit




Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Fri, 29 Mar 2019 at 00:00, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2019/03/28 8:04, David Rowley wrote:
> > If it's *always* scanned last then it's fine for ORDER BY partkey
> > NULLS LAST.  If they have ORDER BY partkey NULLS FIRST then we won't
> > match on the pathkeys.
>
> Sorry, I had meant to say that null values may or may not appear in the
> last partition depending on how the null-accepting partition is defined.
> I see that the code in partitions_are_ordered() correctly returns false if
> null partition is not the last one, for example, for:
>
> create table p (a int) partition by list (a);
> create table p1 partition of p for values in (1);
> create table p2_null partition of p for values in (2, null);
> create table p3 partition of p for values in (3);
>
> Maybe, a small comment regarding how that works correctly would be nice.

hmm, but there is a comment. It says:

/*
* LIST partitions can also guarantee ordering, but we'd need to
* ensure that partitions don't allow interleaved values.  We
* could likely check for this looking at each partition, in
* order, and checking which Datums are accepted.  If we find a
* Datum in a partition that's greater than one previously already
* seen, then values could become out of order and we'd have to
* disable the optimization.  For now, let's just keep it simple
* and just accept LIST partitions without a DEFAULT partition
* which only accept a single Datum per partition.  This is cheap
* as it does not require any per-partition processing.  Maybe
* we'd like to handle more complex cases in the future.
*/

Your example above breaks the "don't allow interleaved values" and
"just accept LIST partitions without a DEFAULT partition which only
accept a single Datum per partition."

Do you think I need to add something like "and if there is a NULL
partition, that it only accepts NULL values"?  I think that's implied
already, but if you think it's confusing then maybe it's worth adding
something along those lines.
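For concreteness, here is a sketch (table names invented for illustration) of a LIST setup that does satisfy those conditions, so the ordered scan can apply:

```sql
-- Each partition accepts exactly one Datum, there is no DEFAULT
-- partition, and the NULL-accepting partition accepts only NULL,
-- so PartitionBoundInfo lists it as the last partition.
create table q (a int) partition by list (a);
create table q1 partition of q for values in (1);
create table q2 partition of q for values in (2);
create table q_null partition of q for values in (null);
```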


 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Ordered Partitioned Table Scans

From
Amit Langote
Date:
Hi,

On 2019/03/29 7:59, David Rowley wrote:
> On Fri, 29 Mar 2019 at 00:00, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> On 2019/03/28 8:04, David Rowley wrote:
>>> If it's *always* scanned last then it's fine for ORDER BY partkey
>>> NULLS LAST.  If they have ORDER BY partkey NULLS FIRST then we won't
>>> match on the pathkeys.
>>
>> Sorry, I had meant to say that null values may or may not appear in the
>> last partition depending on how the null-accepting partition is defined.
>> I see that the code in partitions_are_ordered() correctly returns false if
>> null partition is not the last one, for example, for:
>>
>> create table p (a int) partition by list (a);
>> create table p1 partition of p for values in (1);
>> create table p2_null partition of p for values in (2, null);
>> create table p3 partition of p for values in (3);
>>
>> Maybe, a small comment regarding how that works correctly would be nice.
> 
> hmm, but there is a comment. It says:
> 
> /*
> * LIST partitions can also guarantee ordering, but we'd need to
> * ensure that partitions don't allow interleaved values.  We
> * could likely check for this looking at each partition, in
> * order, and checking which Datums are accepted.  If we find a
> * Datum in a partition that's greater than one previously already
> * seen, then values could become out of order and we'd have to
> * disable the optimization.  For now, let's just keep it simple
> * and just accept LIST partitions without a DEFAULT partition
> * which only accept a single Datum per partition.  This is cheap
> * as it does not require any per-partition processing.  Maybe
> * we'd like to handle more complex cases in the future.
> */
> 
> Your example above breaks the "don't allow interleaved values" and
> "just accept LIST partitions without a DEFAULT partition which only
> accept a single Datum per partition."
> 
> Do you think I need to add something like "and if there is a NULL
> partition, that it only accepts NULL values"?  I think that's implied
> already, but if you think it's confusing then maybe it's worth adding
> something along those lines.

How about extending the sentence about when the optimization is disabled
as follows:

"If we find a Datum in a partition that's greater than one previously
already seen, then values could become out of order and we'd have to
disable the optimization.  We'd also need to disable the optimization if
NULL values are interleaved with other Datum values, because the calling
code expects them to be present in the last partition."

Further, extend the "For now..." sentence as:

"For now, let's just keep it simple and just accept a LIST partitioned table
without a DEFAULT partition where each partition only accepts a single
Datum or NULL.  It's OK to always accept the NULL partition in that case,
because PartitionBoundInfo lists it as the last partition."

Thanks,
Amit




Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Tue, 2 Apr 2019 at 14:26, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> How about extending the sentence about when the optimization is disabled
> as follows:
>
> "If we find a Datum in a partition that's greater than one previously
> already seen, then values could become out of order and we'd have to
> disable the optimization.  We'd also need to disable the optimization if
> NULL values are interleaved with other Datum values, because the calling
> code expects them to be present in the last partition."
>
> Further, extend the "For now..." sentence as:
>
> "For now, let's just keep it simple and just accept a LIST partitioned table
> without a DEFAULT partition where each partition only accepts a single
> Datum or NULL.  It's OK to always accept the NULL partition in that case,
> because PartitionBoundInfo lists it as the last partition."

I ended up rewording the entire thing and reworking the header
comment for the function too. I think it previously wasn't that well
defined what "ordered" meant. I added a mention that we expect
NULLs, where they're possible, to come in the last partition.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Amit Langote
Date:
Hi David,

On Tue, Apr 2, 2019 at 8:49 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I ended up rewording the entire thing and working on the header
> comment for the function too. I think previously it wasn't that well
> defined what "ordered" meant. I added a mention that we expect that
> NULLs, if possible must come in the last partition.

Thanks for the updated patch.

New descriptions look good, although I was amused by this:

diff --git a/src/backend/partitioning/partbounds.c
b/src/backend/partitioning/partbounds.c
index bdd0d23854..9dd378d7a0 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -25,6 +25,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
...
+partitions_are_ordered(struct RelOptInfo *partrel)

Maybe, "struct" is unnecessary?

Thanks,
Amit



Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Wed, 3 Apr 2019 at 01:26, Amit Langote <amitlangote09@gmail.com> wrote:
> +#include "nodes/pathnodes.h"
> ...
> +partitions_are_ordered(struct RelOptInfo *partrel)
>
> Maybe, "struct" is unnecessary?

I just left it there so that the signature matched the header file.
Looking around for examples, I see make_partition_pruneinfo() has the
structs only in the header file, so I guess that is how we do things;
I've changed to that in the attached.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Ordered Partitioned Table Scans

From
Amit Langote
Date:
On 2019/04/03 3:10, David Rowley wrote:
> On Wed, 3 Apr 2019 at 01:26, Amit Langote <amitlangote09@gmail.com> wrote:
>> +#include "nodes/pathnodes.h"
>> ...
>> +partitions_are_ordered(struct RelOptInfo *partrel)
>>
>> Maybe, "struct" is unnecessary?
> 
> I just left it there so that the signature matched the header file.
> Looking around for examples I see make_partition_pruneinfo() has the
> structs only in the header file, so I guess that is how we do things,
> so changed to that in the attached.

Ah, I see.  Thanks for updating the patch.

I don't have any more comments.

Thanks,
Amit




Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Wed, 3 Apr 2019 at 14:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I don't have any more comments.

Great. Many thanks for having a look at this.  Going by [1], Julien
also seems pretty happy, so I'm going to change this over to ready for
committer.

[1] https://www.postgresql.org/message-id/CAOBaU_YpTQbFqcP5jYJZETPL6mgYuTwVTVVBZKZKC6XDBTDkfQ@mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
Hi,

On Wed, Apr 3, 2019 at 7:47 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Wed, 3 Apr 2019 at 14:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > I don't have any more comments.
>
> Great. Many thanks for having a look at this.  Going by [1], Julien
> also seems pretty happy, so I'm going to change this over to ready for
> committer.

Yes, I'm very sorry I didn't have time to come back to this thread
yesterday.  I was fine with the patch (I hadn't noticed the
partprune.c part though), and I'm happy with Amit's further comments
and this last patch, so +1 from me!



Re: Ordered Partitioned Table Scans

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On Wed, 3 Apr 2019 at 14:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I don't have any more comments.

> Great. Many thanks for having a look at this.  Going by [1], Julien
> also seems pretty happy, so I'm going to change this over to ready for
> committer.

Pushed with some hacking, mostly trying to improve the comments.
The only really substantive thing I changed was that you'd missed
teaching ExecSetTupleBound about descending through Append.
Now that we can have plans that are Limit-above-Append-above-Sort,
that's important.
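For illustration, a plan of that shape can arise when one partition lacks a suitable index, so its subpath needs an explicit Sort under the Append (table names here are hypothetical):

```sql
-- Hypothetical setup: two range partitions, an index on only one of
-- them, so an ordered Append must sort the other partition's output.
create table t (a int) partition by range (a);
create table t1 partition of t for values from (0) to (100);
create table t2 partition of t for values from (100) to (200);
create index on t1 (a);

-- With a LIMIT on top, plans like Limit -> Append -> (Index Scan,
-- Sort -> Seq Scan) become possible, so the tuple bound has to be
-- pushed down through the Append to the Sort nodes.
explain (costs off) select * from t order by a limit 10;
```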

            regards, tom lane



Re: Ordered Partitioned Table Scans

From
David Rowley
Date:
On Sat, 6 Apr 2019 at 12:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pushed with some hacking, mostly trying to improve the comments.

Great! Many thanks for improving those and pushing it.

Many thanks to Julien and Antonin for their detailed reviews on this.
Thanks Amit for your input on this as well. Much appreciated.

> The only really substantive thing I changed was that you'd missed
> teaching ExecSetTupleBound about descending through Append.
> Now that we can have plans that are Limit-above-Append-above-Sort,
> that's important.

Oops. I wasn't aware of that until now.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Ordered Partitioned Table Scans

From
Julien Rouhaud
Date:
On Sat, Apr 6, 2019 at 2:45 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> On Sat, 6 Apr 2019 at 12:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Pushed with some hacking, mostly trying to improve the comments.
>
> Great! Many thanks for improving those and pushing it.
>
> Many thanks to Julien, Antonin for their detailed reviews on this.
> Thanks Amit for your input on this as well. Much appreciated.

Thanks!  I'm glad that we'll have this optimization for pg12.



Re: Ordered Partitioned Table Scans

From
Amit Langote
Date:
On 2019/04/06 18:06, Julien Rouhaud wrote:
> On Sat, Apr 6, 2019 at 2:45 AM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>>
>> On Sat, 6 Apr 2019 at 12:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Pushed with some hacking, mostly trying to improve the comments.
>>
>> Great! Many thanks for improving those and pushing it.
>>
>> Many thanks to Julien, Antonin for their detailed reviews on this.
>> Thanks Amit for your input on this as well. Much appreciated.
> 
> Thanks!  I'm glad that we'll have this optimization for pg12.

+1, thank you for working on this.  It's nice to see partitioning being
useful to optimize query execution in even more cases.

Regards,
Amit