Re: Use merge-based matching for MCVs in eqjoinsel - Mailing list pgsql-hackers

From Ilia Evdokimov
Subject Re: Use merge-based matching for MCVs in eqjoinsel
Date
Msg-id f8e6b122-b598-4adb-b91e-89fb3cdc1db5@tantorlabs.com
Whole thread Raw
In response to Re: Use merge-based matching for MCVs in eqjoinsel  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Use merge-based matching for MCVs in eqjoinsel
List pgsql-hackers

Thanks for the detailed feedback!


On 04.11.2025 00:55, Tom Lane wrote:

Hmm.  Those results sure look like there is a performance regression
up to at least 100 MCVs ... not a large one, but consistently a few
percent.  That's a bit sad for a patch purporting to improve
performance.  It looks to me like perhaps we should stick to the old
algorithm up to 100 or possibly even more MCVs.  Certainly the
threshold needs to be higher than 1, as you have it now.


I re-ran the benchmark on JOB with a threshold of 100.Here are the updated results:

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
1                         | 1.00                | 2320.412            | 2318.377
5                         | 0.99                | 2335.894            | 2360.890
10                        | 1.00                | 2350.612            | 2347.154
25                        | 1.01                | 2365.977            | 2342.312
50                        | 0.99                | 2381.554            | 2405.262
75                        | 1.00                | 2396.481            | 2399.828
100                       | 1.00                | 2410.367            | 2412.456
1000                      | 1.11                | 2850.853            | 2564.303
2500                      | 2.04                | 5571.688            | 2731.545
5000                      | 6.05                | 18850.084           | 3114.692
7500                      | 11.96               | 39160.898           | 3273.688
10000                     | 19.04               | 71334.113           | 3745.955

I eyeballed the patch itself very briefly, and have a couple
quick comments:

* Is hash_msv a typo for hash_mcv?  If not, maybe spell out what
it's supposed to mean.


Yes, that was a typo — fixed.

* The patch would be easier to read if it didn't reindent a couple
large chunks of existing code.  Can we change the factorization
to avoid that?  If not, I'd recommend submitting without that
reindentation, and reminding the committer to reindent at the last
moment.


Fixed as well. I’ve removed all reindentation changes. I will keep that in mind for future submissions.


* The calculation loops in eqjoinsel_inner and eqjoinsel_semi
are not identical, which makes it look quite weird to be
writing just one function that conditionally replaces both.
I wonder if we should refactor to have just one copy (and
just eat the extra cycles of calculating matchprodfreq).


Agreed. I’ve dropped the attempt to merge them into a single function.



* In fact ... I wonder if we should try harder to not do essentially
identical calculations twice, remembering that eqjoinsel_semi is
always used alongside eqjoinsel_inner.  (Of course, we could only do
that if clamped_nvalues2 is the same as sslot2->nvalues, but that's
frequently true.)  I think the reason it's duplicative right now
is that we regarded this semijoin calculation as an experiment and
so didn't want to invest a lot of effort in it ... but this patch
is exactly a lot of effort, so maybe it's time to deal with that
unfinished business.
			regards, tom lane


Good point. I addressed this in a separate patch: eqjoinsel_inner() now saves matchfreq1, matchfreq2, nmatches so that eqjoinsel_semi() can reuse them when (clamped_nvalues2 == sslot2->nvalues). If the MCV list on the RHS is clamped, we still recompute locally. If you have a cleaner idea for how to share these values between the two functions without passing them explicitly, I’d be happy to consider it.

I’m attaching two patches:
1. v4-0001-Avoid-duplicate-MCV-matching-in-eqjoinsel_semi-an.patch - removes redundant MCV matching for semi/anti joins;
2. v4-0002-Optimize-MCV-matching-in-eqjoinsel_inner-and-eqjo.patch - adds hash-based MCV matching with a configurable threshold and includes fixes based on your comments.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

Attachment

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Move SLRU_PAGES_PER_SEGMENT to pg_config_manual.h
Next
From: Daniel Gustafsson
Date:
Subject: Re: Minor adjustment: Update the range of the commit_siblings parameter.