Re: Use merge-based matching for MCVs in eqjoinsel - Mailing list pgsql-hackers
| From | Ilia Evdokimov |
|---|---|
| Subject | Re: Use merge-based matching for MCVs in eqjoinsel |
| Date | |
| Msg-id | f8e6b122-b598-4adb-b91e-89fb3cdc1db5@tantorlabs.com Whole thread Raw |
| In response to | Re: Use merge-based matching for MCVs in eqjoinsel (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses |
Re: Use merge-based matching for MCVs in eqjoinsel
|
| List | pgsql-hackers |
Thanks for the detailed feedback!
On 04.11.2025 00:55, Tom Lane wrote:
Hmm. Those results sure look like there is a performance regression up to at least 100 MCVs ... not a large one, but consistently a few percent. That's a bit sad for a patch purporting to improve performance. It looks to me like perhaps we should stick to the old algorithm up to 100 or possibly even more MCVs. Certainly the threshold needs to be higher than 1, as you have it now.
I re-ran the benchmark on JOB with a threshold of 100.Here are the updated results:
default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
1 | 1.00 | 2320.412 | 2318.377
5 | 0.99 | 2335.894 | 2360.890
10 | 1.00 | 2350.612 | 2347.154
25 | 1.01 | 2365.977 | 2342.312
50 | 0.99 | 2381.554 | 2405.262
75 | 1.00 | 2396.481 | 2399.828
100 | 1.00 | 2410.367 | 2412.456
1000 | 1.11 | 2850.853 | 2564.303
2500 | 2.04 | 5571.688 | 2731.545
5000 | 6.05 | 18850.084 | 3114.692
7500 | 11.96 | 39160.898 | 3273.688
10000 | 19.04 | 71334.113 | 3745.955
I eyeballed the patch itself very briefly, and have a couple quick comments: * Is hash_msv a typo for hash_mcv? If not, maybe spell out what it's supposed to mean.
Yes, that was a typo — fixed.
* The patch would be easier to read if it didn't reindent a couple large chunks of existing code. Can we change the factorization to avoid that? If not, I'd recommend submitting without that reindentation, and reminding the committer to reindent at the last moment.
Fixed as well. I’ve removed all reindentation changes. I will keep that in mind for future submissions.
* The calculation loops in eqjoinsel_inner and eqjoinsel_semi are not identical, which makes it look quite weird to be writing just one function that conditionally replaces both. I wonder if we should refactor to have just one copy (and just eat the extra cycles of calculating matchprodfreq).
Agreed. I’ve dropped the attempt to merge them into a single function.
* In fact ... I wonder if we should try harder to not do essentially identical calculations twice, remembering that eqjoinsel_semi is always used alongside eqjoinsel_inner. (Of course, we could only do that if clamped_nvalues2 is the same as sslot2->nvalues, but that's frequently true.) I think the reason it's duplicative right now is that we regarded this semijoin calculation as an experiment and so didn't want to invest a lot of effort in it ... but this patch is exactly a lot of effort, so maybe it's time to deal with that unfinished business. regards, tom lane
Good point. I addressed this in a separate patch: eqjoinsel_inner() now saves matchfreq1, matchfreq2, nmatches so that eqjoinsel_semi() can reuse them when (clamped_nvalues2 == sslot2->nvalues). If the MCV list on the RHS is clamped, we still recompute locally. If you have a cleaner idea for how to share these values between the two functions without passing them explicitly, I’d be happy to consider it.
I’m attaching two patches:
1. v4-0001-Avoid-duplicate-MCV-matching-in-eqjoinsel_semi-an.patch - removes redundant MCV matching for semi/anti joins;
2. v4-0002-Optimize-MCV-matching-in-eqjoinsel_inner-and-eqjo.patch - adds hash-based MCV matching with a configurable threshold and includes fixes based on your comments.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/
Attachment
pgsql-hackers by date: