Re: Use merge-based matching for MCVs in eqjoinsel - Mailing list pgsql-hackers

From Ilia Evdokimov
Subject Re: Use merge-based matching for MCVs in eqjoinsel
Date
Msg-id abd67dfc-8505-4821-937b-0c003e9b5e28@tantorlabs.com
Whole thread Raw
In response to Re: Use merge-based matching for MCVs in eqjoinsel  (David Geier <geidav.pg@gmail.com>)
List pgsql-hackers
On 19.11.2025 18:38, David Geier wrote:
> On 19.11.2025 03:19, Tom Lane wrote:
>
>> I spent some effort on actually measuring timings of the v6 patch,
>> and concluded that this is all splitting hairs that we don't need
>> to split.  The actual crossover between hash-loses and hash-wins
>> is more than what my theoretical argument suggested, but still
>> probably less than 100 MCVs on each side.  I think we should go with
>>
>>     (sslot1.nvalues + sslot2.nvalues) >= 200
>>
>> and call it good.
>>
>> To arrive at this result, I built the v6 patchset with
>> EQJOINSEL_MCV_HASH_THRESHOLD changed to either 0 (to force hashing)
>> or 1000000 (to prevent it).  I then ran the attached scripts with
>> different values of "nstats" and collected timings from the postmaster
>> log output produced by the 0001 patch.
> Thanks for working out the details!
>
> I've ran your script on my development machine with 1000, 100 and 50
> MCVs with the following results. As the runtimes had quite some variance
> I didn't bother trying more variations. I think your proposal to go with
> 200 is fine.
>
> nstats | off INT | off TEXT | on INT | on TEXT
> -------------------------------------|------
> 1000   | 697     | 8907     |  14    |  2417
>   100   |  13.7   |  213     |   2.3  |   239
>    50   |   1.4   |    7.6   |   1.5  |    49
>
> The results suggest that the hash function for the non-deterministic
> collation is really slow. If we could properly include the operator
> cost, we could enable the optimization earlier in the case of simple
> data types such as INT. That can be future work.


LGTM

For simple types (integer columns), both algorithms finish in a couple 
milliseconds when the MCV counts are under 100, and the difference 
between them is very small (JOB results show the same trend). For text 
types, the planning time shifts gradually from one algorithm to the 
other around that range, without any sharp transition. And it seems to 
me that the current criterion is a reasonable compromise, without 
requiring us to complicate the threshold any further.

-- 
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Update timezone to C99
Next
From: Robert Haas
Date:
Subject: Re: pgsql: doc: remove verbiage about "receiving" data from rep. slots