Re: apply_scanjoin_target_to_paths and partitionwise join - Mailing list pgsql-hackers
| From | Arne Roland |
|---|---|
| Subject | Re: apply_scanjoin_target_to_paths and partitionwise join |
| Date | |
| Msg-id | cb7e7c64-2ca1-448a-a25d-b3a473eaf93f@malkut.net |
| In response to | Re: apply_scanjoin_target_to_paths and partitionwise join (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: apply_scanjoin_target_to_paths and partitionwise join |
| List | pgsql-hackers |
On 2025-10-29 17:47, Robert Haas wrote:
> On Wed, Oct 29, 2025 at 8:47 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> I think the best shot at coming up with a
>> reproducer here is to study the cost differences in the queries where
>> the plan changes with the fix, particularly Q1 from my prior email.
> So, the really interesting thing about Q1 is that it contains a join
> which inflates the row count by a factor of about 84. We first join
> t1, which has 1001 rows, to t3, which has 1001 rows, and get 1001
> rows. Then we join to t2, which also has 1001 rows, and we get 83503
> rows. It is estimated to be mildly more efficient to perform the t1-t3
> join partitionwise: it costs 98.69 cost units to do it partitionwise
> and 104.56 cost units to do it non-partitionwise. However, the planner
> believes that doing the subsequent join to t2 in partitionwise fashion
> is a bad idea. The query has an ORDER BY clause, which means that
> after we finish doing the partitionwise part of the operation, we need
> to perform a Merge Append to restore the sort order. If we do the
> Merge Append before joining to t2, we only have to Merge Append 1001
> rows, but if we do the Merge Append after joining to t2, we have to
> Merge Append 83503 rows. The estimated cost of Merge Append is
> proportional to the number of input rows, so doing that MergeAppend
> after the join to t2 is estimated to be 84 times as expensive. In some
> situations, we might make up for that loss by being able to do the
> join more efficiently, but here the planner does not believe that to
> be the case: we can do the join to t2 by a Merge Join over an
> Append over one index scan per partition, and there's basically no
> overhead vs. a partitionwise join. Hence, from the planner's point of
> view, doing the join to t2 in partitionwise fashion is a significant
> loss.
>
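(Checking the arithmetic: 83503 / 1001 ≈ 83.4, so under a per-row cost
model the late Merge Append is indeed roughly 84 times as expensive as
the early one.)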
> I had difficulty reproducing this theoretical performance regression
> in practice. I found that doing the whole thing partitionwise, doing
> the whole thing non-partitionwise, and doing only the t1-t3 join
> partitionwise weren't that different in runtime, and the partitionwise
> approach actually seemed to be a little faster. But I constructed the
> following example, similar to but simpler than the one in the
> regression tests, which does show a regression for me in practice:
>
> drop table if exists dupfest;
> create table dupfest (a text) partition by range(a);
> create table dupfest1 partition of dupfest for values from ('0') to ('3');
> create table dupfest2 partition of dupfest for values from ('3') to ('6');
> create table dupfest3 partition of dupfest for values from ('6') to ('A');
> insert into dupfest
> select '0' from generate_series(0, 10000) i
> union all
> select '3' from generate_series(0, 10000) i
> union all
> select '6' from generate_series(0, 10000) i;
> create index on dupfest(a);
> analyze dupfest;
> set max_parallel_workers_per_gather = 0;
>
> My test query was:
>
> select count(*) from (select * from dupfest t1, dupfest t2 where t1.a
> = t2.a order by t1.a offset 0);
Thank you, that is very helpful!
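For anyone replaying this, I assume the two plans are separated by the
stock switch, i.e. something like:

set max_parallel_workers_per_gather = 0;
set enable_partitionwise_join = off;
explain analyze select count(*) from (select * from dupfest t1,
dupfest t2 where t1.a = t2.a order by t1.a offset 0);
set enable_partitionwise_join = on;
explain analyze select count(*) from (select * from dupfest t1,
dupfest t2 where t1.a = t2.a order by t1.a offset 0);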
> I want to start by saying that I haven't tried super-hard to do
> rigorous benchmarking. This is a debug-enabled, assertion-enabled
> build with patched source code. Results may not be fully
> representative. But here's what I found. EXPLAIN ANALYZE of this query
> without a partitionwise join took 30.8 seconds; without EXPLAIN
> ANALYZE, it ran in 15.4 seconds. With partitionwise join, EXPLAIN
> ANALYZE of the query ran in 89.6 seconds; without EXPLAIN ANALYZE, it
> ran in 64.5 seconds. The plan without partitionwise join was a merge
> join with an append of index only scans on both sides and a
> materialize node on the inner side. With partitionwise join, it
> switched to Nested Loop plans with index-only scans on the outer side
> and a materialize node over a sequential scan on the inner side,
> followed by a Merge Append.
The margins are somewhat smaller on my end, but I can clearly reproduce
this locally.
> A notable point here is that the joins take about the same amount of
> time in both plans. In the EXPLAIN ANALYZE output, we see the three
> joins in the partitionwise plan taking a total of 24.6 seconds, and
> the single join in the non-partitionwise plan taking 24 seconds
> (exclusive of times for child nodes). However, the two Append nodes in
> the non-partitionwise plan run for a total of 2.5 *milliseconds* while
> the single Merge Append node in the partitionwise plan runs for 58.2
> seconds (again, exclusive of times for child nodes). Obviously,
> EXPLAIN ANALYZE distorts the runtime a lot, but I think the overall
> point is nonetheless fairly clear: running a lot of tuples through a
> Merge Append node is potentially expensive, and it can be worth
> eschewing a partitionwise join to avoid that.
Can you help me understand why we even need a Merge Append node in the
partitionwise case? The sorted streams we have are sorted by the
partitioning key column, so a simple Append node should suffice anyway, no?
Either way, running fewer tuples through any node can make things
significantly faster. This case happens if we join on something that
includes the partition key but is not unique. I failed to consider that;
my customers almost always joined on the primary key in these
partitionwise cases.
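To make the distinction concrete, here is a hypothetical variant of your
example with a unique join key (uniqfest is my invention, not taken from
your test):

drop table if exists uniqfest;
create table uniqfest (a int primary key) partition by range (a);
create table uniqfest1 partition of uniqfest for values from (0) to (10000);
create table uniqfest2 partition of uniqfest for values from (10000) to (20000);
insert into uniqfest select i from generate_series(0, 19999) i;
analyze uniqfest;
-- joining on the unique key produces at most one match per row, so the
-- (Merge) Append sees ~20000 rows whether it runs above or below the join:
select count(*) from (select * from uniqfest t1 join uniqfest t2 using (a)
order by a offset 0);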
> I also tried running the same test without the "order by t1.a". With
> that change EXPLAIN ANALYZE took 24.3 seconds without partitionwise
> join and 34.8 seconds with partitionwise join. The times without
> EXPLAIN ANALYZE were quite close, around 15 seconds either way, but it
> looks to me as though the partitionwise plan was probably still a bit
> worse. What I think is happening here is that even running a large
> number of tuples through Append can have enough overhead to matter in
> extreme cases, but EXPLAIN ANALYZE significantly increases the cost of
> entering and exiting nodes, so in that case the difference is much
> easier to measure.
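(For reference, the unsorted variant is simply: select count(*) from
(select * from dupfest t1, dupfest t2 where t1.a = t2.a offset 0);)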
While I see a massive degradation in the EXPLAIN ANALYZE case, without
it the unsorted partitionwise variant runs consistently 8-9% faster on
my local machine. The EXPLAIN output points to a quicker Hash Join (I
suspect due to quicker hash table lookups). I generally believe the
EXPLAIN ANALYZE timings here are a bit misleading, because they add very
disproportionate overhead.
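One standard way to bound that instrumentation overhead is to keep the
row counts but skip the per-tuple clock calls:

explain (analyze, timing off) select count(*) from (select * from dupfest
t1, dupfest t2 where t1.a = t2.a order by t1.a offset 0);

That should mostly remove the per-node timing cost, which hits
high-traffic nodes like (Merge) Append hardest.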
> I don't know whether the EDB customer problem that started this thread
> was of the same type demonstrated here or not. It may well have been
> something else. However, unless I've fouled up the test case shown
> above in some way, which is not impossible, this does demonstrate that
> it is possible, at least in corner cases, to run into scenarios where
> a partitionwise join is worse than a non-partitionwise join. In this
> example, the reason it's worse is because postponing the MergeAppend
> until a later stage results in the MergeAppend seeing a much larger
> number of rows.
Either way: despite not having encountered this in the wild myself, your
sorted case doesn't seem too exotic to me, and it demonstrates the need
to do something.
The issue of virtually equivalent queries arises when the join condition
is (almost) unique; the problem of a much larger number of tuples to
process clearly arises when it is not.
Part of me wonders whether it could be worthwhile to do something
differently in the case where the join key is unique. I don't like
introducing another level of complexity, but it's interesting how close
the costs without the (Merge) Append nodes are. We again have less than
a 0.0018% difference on my end, which is essentially asking for problems
across platforms.
Regards
Arne