Re: MergeJoin beats HashJoin in the case of multiple hash clauses - Mailing list pgsql-hackers

From Andrei Lepikhov
Subject Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Date
Msg-id c99dc067-9fec-4649-a0f5-62a771404972@gmail.com
Whole thread Raw
In response to Re: MergeJoin beats HashJoin in the case of multiple hash clauses  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On 3/11/2023 23:43, Tomas Vondra wrote:
> On 9/11/23 10:04, Lepikhov Andrei wrote:
>   * Determine bucketsize fraction and MCV frequency for the inner
>   * relation. We use the smallest bucketsize or MCV frequency estimated
>   * for any individual hashclause; this is undoubtedly conservative.
> 
> I'm sure this may lead to inflated cost for "good" cases (where the
> actual bucket size really is a product), which may push the optimizer to
> use the less efficient/slower join method.
Yes, It was contradictory idea, though.
> IMHO the only principled way forward is to get a better ndistinct
> estimate (which this implicitly does), perhaps by using extended
> statistics. I haven't tried, but I guess it'd need to extract the
> clauses for the inner side, and call estimate_num_groups() on it.
And I've done it. Sorry for so long response. This patch employs of 
extended statistics for estimation of the HashJoin bucket_size. In 
addition, I describe the idea in more convenient form here [1].
Obviously, it needs the only ndistinct to make a prediction that allows 
to reduce computational cost of this statistic.

[1] 
https://open.substack.com/pub/danolivo/p/why-postgresql-prefers-mergejoin?r=34q1yy&utm_campaign=post&utm_medium=web

-- 
regards, Andrei Lepikhov

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Is it expected behavior index only scan shows "OPERATOR(pg_catalog." for EXPLAIN?
Next
From: Amit Langote
Date:
Subject: Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions