Thread: Allow ordered partition scans in more cases

Allow ordered partition scans in more cases

From
David Rowley
Date:
Over on [1], Benjamin highlighted that we don't do ordered partition
scans in some cases where we could.

Basically, what was added in 959d00e9d only works when at least one
child path has pathkeys that suit the required query pathkeys.  If the
partitions or partitioned table does not have an index matching the
partitioning columns or some subset thereof, then we'll not add an
ordered Append path.

I've attached a patch.  This is what it does:

create table lp (a int) partition by list(a);
create table lp1 partition of lp for values in(1);
create table lp2 partition of lp for values in(2);

explain (costs off) select * from lp order by a;

master;

            QUERY PLAN
----------------------------------
 Sort
   Sort Key: lp.a
   ->  Append
         ->  Seq Scan on lp1 lp_1
         ->  Seq Scan on lp2 lp_2
(5 rows)

patched:

            QUERY PLAN
----------------------------------
 Append
   ->  Sort
         Sort Key: lp_1.a
         ->  Seq Scan on lp1 lp_1
   ->  Sort
         Sort Key: lp_2.a
         ->  Seq Scan on lp2 lp_2
(7 rows)

There's still something in there that I'm not happy with which relates
to the tests I added in inherit.sql. Anyone looking at the new tests
might expect that the following query should work too:

explain (costs off) select * from range_parted order by a,b,c;

but it *appears* not to.  We do build an AppendPath for that, it's
just that the AppendPath added by the following code seems to win over
it:

/*
* If we found unparameterized paths for all children, build an unordered,
* unparameterized Append path for the rel.  (Note: this is correct even
* if we have zero or one live subpath due to constraint exclusion.)
*/
if (subpaths_valid)
add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
  NIL, NULL, 0, false,
  -1));

I still need to look to see if there's some small amount of data that
can be loaded into the table to help coax the planner into producing
the ordered scan for this one.  It works fine as-is for ORDER BY a,b
and ORDER BY a; so I've put tests in for that.

David

[1] https://postgr.es/m/CABTcpyuXXY1625-Mns=mPFCVSf4aouGiRVyLPiGQQ0doT0PiLQ@mail.gmail.com

Attachment

Re: Allow ordered partition scans in more cases

From
Ronan Dunklau
Date:
Thank you for improving this optimization !

Le mardi 21 février 2023, 04:14:02 CET David Rowley a écrit :
> I still need to look to see if there's some small amount of data that
> can be loaded into the table to help coax the planner into producing
> the ordered scan for this one.  It works fine as-is for ORDER BY a,b
> and ORDER BY a; so I've put tests in for that.

I haven't looked too deeply into it, but it seems reasonable that the whole
sort would cost cheaper than individual sorts on partitions + incremental
sorts, except when the the whole sort would spill to disk much more than the
incremental ones. I find it quite difficult to reason about what that threshold
should be, but I managed to find a case which could fit in a test:

create table range_parted (a int, b int, c int) partition by range(a, b);
create table range_parted1 partition of range_parted for values from (0,0) to
(10,10);
create table range_parted2 partition of range_parted for values from (10,10)
to (20,20);
insert into range_parted(a, b, c) select i, j, k from generate_series(1, 19)
i, generate_series(1, 19) j, generate_series(1, 5) k;
analyze range_parted;
set random_page_cost = 10;
set work_mem = '64kB';
explain (costs off) select * from range_parted order by a,b,c;

It's quite convoluted, because it needs the following:
 - estimate the individual partition sorts to fit into work_mem (even if that's
not the case here at runtime)
 - estimate the whole table sort to not fit into work_mem
 - the difference between the two should be big enough to compensate the
incremental sort penalty (hence raising random_page_cost).

This is completely tangential to the subject at hand, but maybe we have
improvements to do with the way we estimate what type of sort will be
performed ? It seems to underestimate the memory amount needed. I'm not sure
it makes a real difference in real use cases though.

Regards,

--
Ronan Dunklau






Re: Allow ordered partition scans in more cases

From
David Rowley
Date:
On Thu, 23 Feb 2023 at 02:10, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> I haven't looked too deeply into it, but it seems reasonable that the whole
> sort would cost cheaper than individual sorts on partitions + incremental
> sorts, except when the the whole sort would spill to disk much more than the
> incremental ones. I find it quite difficult to reason about what that threshold
> should be, but I managed to find a case which could fit in a test:

Thanks for coming up with that test case.  It's a little disappointing
to see that so many rows had to be added to get the plan to change. I
wonder if it's really worth testing this particular case. ~1800 rows
is a little more significant than I'd have hoped. The buildfarm has a
few dinosaurs that would likely see a noticeable slowdown from that.

What's on my mind now is if turning 1 Sort into N Sorts is a
particularly good idea from a work_mem standpoint. I see that we don't
do tuplesort_end() until executor shutdown, so that would mean that we
could end up using 1 x work_mem per Sort node.  I idly wondered if we
couldn't do tuplesort_end() after spitting out the final tuple when
EXEC_FLAG_REWIND is not set, but that would still mean we could use N
work_mems when EXEC_FLAG_REWIND *is* set. We only really have
visibility of that during execution too, so can't really make a
decision at plan time based on that.

I'm not quite sure if I'm being overly concerned here or not. All it
would take to get a sort per partition today would be to put a
suitable index on just 1 of the partitions. So this isn't exactly a
new problem, it's just making an old problem perhaps a little more
likely.  The problem does also exist for things like partition-wise
joins too for Hash and Merge joins.  Partition-wise joins are disabled
by default, however.

David



Re: Allow ordered partition scans in more cases

From
David Rowley
Date:
On Sat, 4 Mar 2023 at 00:56, David Rowley <dgrowleyml@gmail.com> wrote:
> What's on my mind now is if turning 1 Sort into N Sorts is a
> particularly good idea from a work_mem standpoint. I see that we don't
> do tuplesort_end() until executor shutdown, so that would mean that we
> could end up using 1 x work_mem per Sort node.  I idly wondered if we
> couldn't do tuplesort_end() after spitting out the final tuple when
> EXEC_FLAG_REWIND is not set, but that would still mean we could use N
> work_mems when EXEC_FLAG_REWIND *is* set. We only really have
> visibility of that during execution too, so can't really make a
> decision at plan time based on that.

Because of the above, I'm not planning on pursuing this patch any
further.  We can maybe revisit this if we come up with better ways to
manage the number of work_mems a plan can have in the future. As of
now, I'm a little too worried that this patch will end up consuming
too many work_mems by adding a Sort node per partition.

I'll mark this as withdrawn.

David