Re: Use merge-based matching for MCVs in eqjoinsel - Mailing list pgsql-hackers
| From | Ilia Evdokimov |
|---|---|
| Subject | Re: Use merge-based matching for MCVs in eqjoinsel |
| Date | |
| Msg-id | f0a2c63a-5776-462a-b9b7-434158297a9d@tantorlabs.com Whole thread Raw |
| In response to | Re: Use merge-based matching for MCVs in eqjoinsel (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses |
Re: Use merge-based matching for MCVs in eqjoinsel
|
| List | pgsql-hackers |
This didn't look very much like the refactorization that I had in mind: I thought we should have one copy of the matching code, not two. Also, after looking closer at your patch I realized you were just punting for cross-type comparison operators, which I felt was kind of sad. It's a little bit tricky to get simplehash.h to go along with cross-type hashing, because it wants to use just one hash and one equality function. But since those are interface routines we are going to supply anyway, we can make them deal with the insert and lookup cases differently.
I had considered the cross-type comparison operators and I didn’t see a clean way to support them, so I intentionally excluded cross-type cases from hash probing. Your suggestion to switch the hash function in probe mode is clearly a more correct approach than simply rejecting those cases. Thanks for the explanation.
So after a bit of hacking I ended up with the attached. I split up the refactorization into several steps to make it easier to review. (But I'd anticipate squashing these into one commit in the end, so I didn't spend a lot of time on the commit messages.)
I reviewed patches 0002-0004 with the refactoring, and I think the overall approach is excellent. However, I noticed one issue: in eqjoinsel_semi() the variable 'nmatches' is not initialized and can lead to undefined behavior when clamped_nvalues2 == sslot2->nvalues. Before the refactoring it was initialized by zero.
Also, 0001 in this series is not meant to be committed; what it does is to add some debug logging to ease comparing runtimes of different versions of eqjoinsel. I was able to use that to convince myself that the refactoring steps didn't cost anything meaningful in performance. Perhaps we could use it to investigate the right hashing threshold more carefully, too.
With 0001 patch I tested the selectivity calculation time for SEMI JOIN after applying patches 0002-0004, and the time was cut in half. Thank you for the work on that.
There are still a couple of XXX comments in the attached, denoting
loose ends to look at. In particular, I wondered whether the
hash threshold check
if (Min(sslot1.nvalues, sslot2.nvalues) >= EQJOINSEL_MCV_HASH_THRESHOLD)
should use Max() instead --- that is, it might be safer to hash
if either MCV list is long. Or, holding one's head at a different
angle, perhaps the sum of the list lengths should be what's checked?
regards, tom lane
Hmm… using the sum actually seems like a good idea for me. It may provide a smoother switch-over point between the two MCV-scanning algorithms when both list lengths are below the threshold. But this definitely needs to be validated by measuring different MCV lengths below the threshold using the 0001 patch.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/
pgsql-hackers by date: