Thread: Allow ordered partition scans in more cases
Over on [1], Benjamin highlighted that we don't do ordered partition scans in some cases where we could.

Basically, what was added in 959d00e9d only works when at least one child path has pathkeys that suit the required query pathkeys. If the partitions or partitioned table does not have an index matching the partitioning columns or some subset thereof, then we'll not add an ordered Append path.

I've attached a patch. This is what it does:

create table lp (a int) partition by list(a);
create table lp1 partition of lp for values in(1);
create table lp2 partition of lp for values in(2);

explain (costs off) select * from lp order by a;

master:

            QUERY PLAN
----------------------------------
 Sort
   Sort Key: lp.a
   ->  Append
         ->  Seq Scan on lp1 lp_1
         ->  Seq Scan on lp2 lp_2
(5 rows)

patched:

            QUERY PLAN
----------------------------------
 Append
   ->  Sort
         Sort Key: lp_1.a
         ->  Seq Scan on lp1 lp_1
   ->  Sort
         Sort Key: lp_2.a
         ->  Seq Scan on lp2 lp_2
(7 rows)

There's still something in there that I'm not happy with which relates to the tests I added in inherit.sql. Anyone looking at the new tests might expect that the following query should work too:

explain (costs off) select * from range_parted order by a,b,c;

but it *appears* not to. We do build an AppendPath for that, it's just that the AppendPath added by the following code seems to win over it:

    /*
     * If we found unparameterized paths for all children, build an unordered,
     * unparameterized Append path for the rel.  (Note: this is correct even
     * if we have zero or one live subpath due to constraint exclusion.)
     */
    if (subpaths_valid)
        add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
                                                  NIL, NULL, 0, false, -1));

I still need to look to see if there's some small amount of data that can be loaded into the table to help coax the planner into producing the ordered scan for this one. It works fine as-is for ORDER BY a,b and ORDER BY a; so I've put tests in for that.

David

[1] https://postgr.es/m/CABTcpyuXXY1625-Mns=mPFCVSf4aouGiRVyLPiGQQ0doT0PiLQ@mail.gmail.com
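For context, a minimal sketch of the case 959d00e9d already handles, assuming btree indexes on each partition's key column (the index names are illustrative, and with empty tables the planner may still prefer a different plan):

create index lp1_a_idx on lp1 (a);   -- illustrative index names, not from the patch
create index lp2_a_idx on lp2 (a);
explain (costs off) select * from lp order by a;
-- with suitable child pathkeys available, the expected shape is an Append
-- (or Merge Append) over Index Scans on lp1_a_idx and lp2_a_idx, already
-- sorted by a, rather than a Sort over a plain Append

The patch targets the remaining case, where no child has such a path.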
Thank you for improving this optimization!

On Tuesday, 21 February 2023 at 04:14:02 CET, David Rowley wrote:
> I still need to look to see if there's some small amount of data that
> can be loaded into the table to help coax the planner into producing
> the ordered scan for this one. It works fine as-is for ORDER BY a,b
> and ORDER BY a; so I've put tests in for that.

I haven't looked too deeply into it, but it seems reasonable that the whole sort would be costed as cheaper than individual sorts on partitions + incremental sorts, except when the whole sort would spill to disk much more than the incremental ones. I find it quite difficult to reason about what that threshold should be, but I managed to find a case which could fit in a test:

create table range_parted (a int, b int, c int) partition by range(a, b);
create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
insert into range_parted(a, b, c)
  select i, j, k
  from generate_series(1, 19) i, generate_series(1, 19) j, generate_series(1, 5) k;
analyze range_parted;
set random_page_cost = 10;
set work_mem = '64kB';
explain (costs off) select * from range_parted order by a,b,c;

It's quite convoluted, because it needs the following:

 - estimate the individual partition sorts to fit into work_mem (even if that's not the case here at runtime)
 - estimate the whole-table sort to not fit into work_mem
 - the difference between the two should be big enough to compensate for the incremental sort penalty (hence raising random_page_cost)

This is completely tangential to the subject at hand, but maybe we have improvements to make in the way we estimate what type of sort will be performed? It seems to underestimate the amount of memory needed. I'm not sure it makes a real difference in real use cases, though.

Regards,

--
Ronan Dunklau
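One hedged way to follow up on the estimation question above is to run the same query under EXPLAIN ANALYZE, which reports the sort strategy each Sort (or Incremental Sort) node actually used at run time; this is only a suggestion for investigation, not part of the proposed regression test:

explain (analyze, costs off, timing off, summary off)
select * from range_parted order by a,b,c;
-- each Sort node reports a "Sort Method" line (quicksort vs. external merge)
-- plus the memory or disk consumed, which can be compared against the
-- work_mem = '64kB' setting used above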
On Thu, 23 Feb 2023 at 02:10, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> I haven't looked too deeply into it, but it seems reasonable that the
> whole sort would be costed as cheaper than individual sorts on partitions
> + incremental sorts, except when the whole sort would spill to disk much
> more than the incremental ones. I find it quite difficult to reason about
> what that threshold should be, but I managed to find a case which could
> fit in a test:

Thanks for coming up with that test case. It's a little disappointing to see that so many rows had to be added to get the plan to change. I wonder if it's really worth testing this particular case. ~1800 rows is a little more significant than I'd have hoped. The buildfarm has a few dinosaurs that would likely see a noticeable slowdown from that.

What's on my mind now is whether turning 1 Sort into N Sorts is a particularly good idea from a work_mem standpoint. I see that we don't do tuplesort_end() until executor shutdown, so that would mean that we could end up using 1 x work_mem per Sort node. I idly wondered if we couldn't do tuplesort_end() after spitting out the final tuple when EXEC_FLAG_REWIND is not set, but that would still mean we could use N work_mems when EXEC_FLAG_REWIND *is* set. We only really have visibility of that during execution too, so we can't really make a decision at plan time based on it.

I'm not quite sure if I'm being overly concerned here or not. All it would take to get a sort per partition today would be to put a suitable index on just 1 of the partitions. So this isn't exactly a new problem, it's just making an old problem perhaps a little more likely. The problem also exists for things like partition-wise joins with Hash and Merge joins. Partition-wise joins are disabled by default, however.

David
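A hedged sketch of the pre-existing behaviour described above, reusing the lp tables from the first message (the index name is illustrative, and whether this plan actually wins depends on costing and data volume):

create index lp1_a_idx on lp1 (a);   -- suitable index on just one partition; name is illustrative
explain (costs off) select * from lp order by a;
-- once lp1 offers a path with the required pathkeys, the ordered-append
-- machinery (generate_orderedappend_paths) can already build an Append in
-- which lp1 uses the index and lp2 gets its own explicit Sort node, so a
-- Sort-per-unindexed-partition plan is possible without the patch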
On Sat, 4 Mar 2023 at 00:56, David Rowley <dgrowleyml@gmail.com> wrote:
> What's on my mind now is whether turning 1 Sort into N Sorts is a
> particularly good idea from a work_mem standpoint. I see that we don't
> do tuplesort_end() until executor shutdown, so that would mean that we
> could end up using 1 x work_mem per Sort node. I idly wondered if we
> couldn't do tuplesort_end() after spitting out the final tuple when
> EXEC_FLAG_REWIND is not set, but that would still mean we could use N
> work_mems when EXEC_FLAG_REWIND *is* set. We only really have
> visibility of that during execution too, so we can't really make a
> decision at plan time based on it.

Because of the above, I'm not planning on pursuing this patch any further. We can maybe revisit this if we come up with better ways to manage the number of work_mems a plan can have in the future. As of now, I'm a little too worried that this patch will end up consuming too many work_mems by adding a Sort node per partition.

I'll mark this as withdrawn.

David