Re: apply_scanjoin_target_to_paths and partitionwise join - Mailing list pgsql-hackers

From Robert Haas
Subject Re: apply_scanjoin_target_to_paths and partitionwise join
Date
Msg-id CA+TgmoYvJCFF0xYT+v7GLzuX_o_uqN-qrSLhyjjYAEYiG7D=3w@mail.gmail.com
Whole thread Raw
In response to Re: apply_scanjoin_target_to_paths and partitionwise join  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses Re: apply_scanjoin_target_to_paths and partitionwise join
List pgsql-hackers
On Thu, Jan 2, 2025 at 4:41 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> A join between partitions is pushed down if only partitionwise join is
> chosen and a join between partitions won't be pushed down if
> partitionwise join is not chosen. Hence this bug affects pushdown as
> well.
>
> The CF entry shows as waiting for author. But that isn't the right
> status. Will change it to needs review. I think we need a consensus as
> to whether we want to fix this bug or not. Since this bug doesn't
> affect me anymore, I will just withdraw this CF entry if there is no
> interest.

I think it's unhelpful that you keep calling this a "bug" when the
behavior is clearly deliberate. Whether it's the *right* behavior is
debatable, but it's not *accidental* behavior.

I don't actually have a clear understanding of why we need this. In
https://www.postgresql.org/message-id/CAKZiRmyaFFvxyEYGG_hu0F-EVEcqcnveH23MULhW6UY_jwykGw%40mail.gmail.com
Jakub says that an EDB customer experienced a case where the
partitionwise plan took 530+s and the non-partitionwise plan took 23s,
but unfortunately there's no public test case, and in the examples
shared publicly, either the partionwise plan is actually slower but is
mistakenly estimated to be faster, or the two are extremely close to
the same speed so it doesn't really matter. So the customer scenario
(which is not public) is justification for a code-change, but the
publicly-posted examples, as far as I can see, are not.

And what confuses me is -- what could that test case possibly look
like? I mean, how can it be less efficient to perform N small joins
than 1 big join? For instance, suppose we're doing a merge join
between A and B (partitions A_1..A_N and B_1..B_N) and we have to sort
all the data. With a partitionwise join, we have to do 2N sorts of
partitions of some size, let's say K. The cost should be O(2N * K lg
K). If we instead do two really big sorts, the cost is now O(2 * (NK)
lg (NK)), which is more. If we do a hash join, the cost should be
about the same either way, because probing a hash table is roughly
constant time. However, if we do N small hash joins, the hash table is
a lot more likely to fit in memory -- and if the big hash table does
not fit in memory and the small hash tables do, we should win big.
Finally, let's say we're doing a nested loop. If the inner side of the
nested loop is unparameterized, then the cost of the non-partitionwise
nested loop is O(N^2 * K^2), while the cost of the partitionwise
nested loop is O(N * K^2), which is a huge win. If the inner side is
parameterized, then the partitionwise plan involves scanning one
partition for matching values for each outer row, whereas the
non-partitionwise plan involves scanning every partition for matching
values for each outer row, which is again clearly worse.

I'm obviously missing something here, because I'm sure Jakub is quite
right when he says that this actually happened and actually hosed an
EDB customer. But I don't understand HOW it happened, and I think if
we're going to change the code we really ought to understand that and
write some code comments about it. In general, I think that it's very
reasonable to expect that a bunch of small joins will beat one big
join, which is why the code does what it currently does.

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: magical eref alias names
Next
From: Tom Lane
Date:
Subject: Re: Fwd: Re: A new look at old NFS readdir() problems?