Re: Use merge-based matching for MCVs in eqjoinsel - Mailing list pgsql-hackers

From David Geier
Subject Re: Use merge-based matching for MCVs in eqjoinsel
Date
Msg-id c93be6a4-1a44-47c6-a2a3-4d8ab1ab03d2@gmail.com
In response to Re: Use merge-based matching for MCVs in eqjoinsel  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 19.11.2025 03:19, Tom Lane wrote:
> I wrote:
>> Thinking a bit harder, we are comparing these costs:
>> [ theoretical arguments trimmed ]
> 
> I spent some effort on actually measuring timings of the v6 patch,
> and concluded that this is all splitting hairs that we don't need
> to split.  The actual crossover between hash-loses and hash-wins
> is more than what my theoretical argument suggested, but still
> probably less than 100 MCVs on each side.  I think we should go with
> 
>     (sslot1.nvalues + sslot2.nvalues) >= 200
> 
> and call it good.
> 
> To arrive at this result, I built the v6 patchset with
> EQJOINSEL_MCV_HASH_THRESHOLD changed to either 0 (to force hashing)
> or 1000000 (to prevent it).  I then ran the attached scripts with
> different values of "nstats" and collected timings from the postmaster
> log output produced by the 0001 patch.
> 
> The scripts are designed to test both the cheap-comparisons scenario
> (integer columns) and the expensive-comparisons scenario (text columns
> with a case-insensitive ICU collation).  My motivation for splitting
> them into a setup and a test step was to allow the tests to be run
> repeatedly against the same underlying data.  (Although I soon realized
> that because VACUUM ANALYZE takes a random sample each time, the stats
> we're working from aren't totally the same each time anyway.)  Also
> you'll notice that the test data is based on log(random()), which
> I did to roughly approximate a zipfian distribution.  If you remove
> the log() call you'll get a flat distribution instead, but it didn't
> seem to change the conclusions much.
Thanks for working out the details!

I ran your script on my development machine with 1000, 100, and 50
MCVs; the results are below. As the runtimes had quite some variance,
I didn't bother trying more variations. I think your proposal to go
with 200 is fine.
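
For reference, here is a minimal sketch of the check as I read it; only
EQJOINSEL_MCV_HASH_THRESHOLD and the condition itself come from your
patch and mail, the helper function is made up:

#define EQJOINSEL_MCV_HASH_THRESHOLD 200

/*
 * Decide whether eqjoinsel should match the two MCV lists via a hash
 * table rather than the existing pairwise comparisons.  nvalues1 and
 * nvalues2 are the MCV counts of the two pg_statistic slots.
 */
static bool
use_mcv_hashing(int nvalues1, int nvalues2)
{
    return (nvalues1 + nvalues2) >= EQJOINSEL_MCV_HASH_THRESHOLD;
}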

nstats | hash off, INT | hash off, TEXT | hash on, INT | hash on, TEXT
-------|---------------|----------------|--------------|--------------
  1000 |           697 |           8907 |           14 |          2417
   100 |          13.7 |            213 |          2.3 |           239
    50 |           1.4 |            7.6 |          1.5 |            49

The results suggest that the hash function for the non-deterministic
collation is really slow. If we could properly account for the operator
cost, we could enable the optimization at lower MCV counts for cheap
data types such as INT. That can be future work.
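
To make that concrete, a purely hypothetical sketch (not part of the
patch): scale the fixed threshold by how expensive the hash function is
relative to the equality comparison, so cheap types like INT start
hashing at lower MCV counts while types with slow hash functions, such
as non-deterministic ICU collations, keep using pairwise matching
longer. hash_cost and cmp_cost are assumed per-call cost numbers that
would have to come from the catalogs.

/*
 * Hypothetical only: weight the base threshold of 200 by the
 * hash-to-comparison cost ratio.  A cheap hash function (ratio < 1)
 * lowers the threshold, an expensive one raises it.
 */
static int
mcv_hash_threshold(double hash_cost, double cmp_cost)
{
    double  ratio = hash_cost / (cmp_cost > 0.0 ? cmp_cost : 1.0);

    return (int) (200.0 * ratio);
}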

--
David Geier



