Re: MergeJoin beats HashJoin in the case of multiple hash clauses - Mailing list pgsql-hackers

From	Andrei Lepikhov
Subject	Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Date	March 3, 2025 11:24:40
Msg-id	8750fa3f-43b6-40db-803f-d6ae471384ef@gmail.com
In response to	MergeJoin beats HashJoin in the case of multiple hash clauses (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Responses	Re: MergeJoin beats HashJoin in the case of multiple hash clauses
List	pgsql-hackers

On 17/2/2025 01:34, Alexander Korotkov wrote:
> Hi, Andrei!
> 
> On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> Thank you for your work on this subject.  I agree with the general
> direction.  While everyone has used conservative estimates for a long
> time, it's better to change them only when we're sure about it.
> However, I'm still not sure I get the conservatism.
> 
> if (innerbucketsize > thisbucketsize)
>      innerbucketsize = thisbucketsize;
> if (innermcvfreq > thismcvfreq)
>     innermcvfreq = thismcvfreq;
> 
> AFAICS, even in the worst case (all columns are totally correlated),
> the overall bucket size should be the smallest bucket size among
> clauses (not the largest).  And the same is true of MCV.  As a mental
> experiment, we can add a new clause to hash join, which is always true
> because columns on both sides have the same value.  In fact, it would
> have almost no influence except for the cost of extracting additional
> columns and the cost of executing additional operators.  But in the
> current model, this additional clause would completely ruin
> thisbucketsize and thismcvfreq, making hash join extremely
> unappealing.  Should we still revise this to calculate minimum instead
> of maximum?
I agree with your point. But I think the code works precisely the way 
you have described.
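
To illustrate, here is a minimal standalone sketch of how the two quoted
comparisons behave when applied once per hash clause. It is not the
actual costsize.c code: the function name, the starting value of 1.0,
and the sample numbers are assumptions made for the example. The point
is only that the comparison keeps the smallest per-clause estimate, so
an always-true clause (estimate 1.0) leaves the result unchanged.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL code: fold per-clause estimates
 * (bucket size fraction or MCV frequency) into a single value by
 * keeping the smallest one.  The fold starts from 1.0, the most
 * pessimistic value (assumed here: all inner tuples in one bucket).
 */
double
fold_min_estimate(const double *per_clause, size_t nclauses)
{
    double      result = 1.0;

    assert(per_clause != NULL || nclauses == 0);

    for (size_t i = 0; i < nclauses; i++)
    {
        /* Same shape as the quoted test: keep the smaller estimate. */
        if (result > per_clause[i])
            result = per_clause[i];
    }
    return result;
}
```

With made-up per-clause estimates {0.01, 0.25}, the fold yields 0.01.
Appending a third clause with estimate 1.0 still yields 0.01.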
> 
> I've slightly revised the patch.  I've run pg_indent and renamed
> s/saveList/origin_rinfos/g for better readability.
Thank you!
> 
> Also, the patch badly needs regression test coverage.  We can't
> include costs in expected outputs.  But that could be some plans,
> which previously were reliably merge joins but now become reliable
> hash joins.
I added one test here. Writing more tests on this feature is hard, but 
feature [1] may provide us with additional tools to reveal extended stat 
internals. I also have thought about injection points, but it seems an 
over-complication.

[1] Showing applied extended statistics in explain Part 2

https://www.postgresql.org/message-id/flat/TYYPR01MB82310B308BA8770838F681619E5E2%40TYYPR01MB8231.jpnprd01.prod.outlook.com

-- 
regards, Andrei Lepikhov

Attachment

v3-0001-Use-extended-statistics-for-precise-estimation-of.patch
