Re: apply_scanjoin_target_to_paths and partitionwise join - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: apply_scanjoin_target_to_paths and partitionwise join |
Date | |
Msg-id | CA+TgmoYvJCFF0xYT+v7GLzuX_o_uqN-qrSLhyjjYAEYiG7D=3w@mail.gmail.com Whole thread Raw |
In response to | Re: apply_scanjoin_target_to_paths and partitionwise join (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>) |
Responses |
Re: apply_scanjoin_target_to_paths and partitionwise join
|
List | pgsql-hackers |
On Thu, Jan 2, 2025 at 4:41 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > A join between partitions is pushed down if only partitionwise join is > chosen and a join between partitions won't be pushed down if > partitionwise join is not chosen. Hence this bug affects pushdown as > well. > > The CF entry shows as waiting for author. But that isn't the right > status. Will change it to needs review. I think we need a consensus as > to whether we want to fix this bug or not. Since this bug doesn't > affect me anymore, I will just withdraw this CF entry if there is no > interest. I think it's unhelpful that you keep calling this a "bug" when the behavior is clearly deliberate. Whether it's the *right* behavior is debatable, but it's not *accidental* behavior. I don't actually have a clear understanding of why we need this. In https://www.postgresql.org/message-id/CAKZiRmyaFFvxyEYGG_hu0F-EVEcqcnveH23MULhW6UY_jwykGw%40mail.gmail.com Jakub says that an EDB customer experienced a case where the partitionwise plan took 530+s and the non-partitionwise plan took 23s, but unfortunately there's no public test case, and in the examples shared publicly, either the partionwise plan is actually slower but is mistakenly estimated to be faster, or the two are extremely close to the same speed so it doesn't really matter. So the customer scenario (which is not public) is justification for a code-change, but the publicly-posted examples, as far as I can see, are not. And what confuses me is -- what could that test case possibly look like? I mean, how can it be less efficient to perform N small joins than 1 big join? For instance, suppose we're doing a merge join between A and B (partitions A_1..A_N and B_1..B_N) and we have to sort all the data. With a partitionwise join, we have to do 2N sorts of partitions of some size, let's say K. The cost should be O(2N * K lg K). If we instead do two really big sorts, the cost is now O(2 * (NK) lg (NK)), which is more. If we do a hash join, the cost should be about the same either way, because probing a hash table is roughly constant time. However, if we do N small hash joins, the hash table is a lot more likely to fit in memory -- and if the big hash table does not fit in memory and the small hash tables do, we should win big. Finally, let's say we're doing a nested loop. If the inner side of the nested loop is unparameterized, then the cost of the non-partitionwise nested loop is O(N^2 * K^2), while the cost of the partitionwise nested loop is O(N * K^2), which is a huge win. If the inner side is parameterized, then the partitionwise plan involves scanning one partition for matching values for each outer row, whereas the non-partitionwise plan involves scanning every partition for matching values for each outer row, which is again clearly worse. I'm obviously missing something here, because I'm sure Jakub is quite right when he says that this actually happened and actually hosed an EDB customer. But I don't understand HOW it happened, and I think if we're going to change the code we really ought to understand that and write some code comments about it. In general, I think that it's very reasonable to expect that a bunch of small joins will beat one big join, which is why the code does what it currently does. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: