Hi,
attached is a patch to address this problem, and the one reported by
Dilip. I ran a lot of TPC-H and other benchmarks, and so far this
addresses all the performance issues, often being noticeably faster than
with the dynahash code.
Comments?
On 2017-03-03 11:23:00 +0530, Kuntal Ghosh wrote:
> On Fri, Mar 3, 2017 at 8:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Mar 3, 2017 at 1:22 AM, Andres Freund <andres@anarazel.de> wrote:
> >> the resulting hash-values aren't actually meaningfully influenced by the
> >> IV. Because we just xor with the IV, most hash-value that without the IV
> >> would have fallen into a single hash-bucket, fall into a single
> >> hash-bucket afterwards as well; just somewhere else in the hash-range.
> >
> > Wow, OK. I had kind of assumed (without looking) that setting the
> > hash IV did something a little more useful than that. Maybe we should
> > do something like struct blah { int iv; int hv; }; newhv =
> > hash_any(&blah, sizeof(blah)).
The hash invocations are already noticeable performancewise, so I'm a
bit hesitant to go there. I'd rather introduce a decent 'hash_combine'
function or such.
> Sounds good. I've seen a post from Thomas Munro suggesting some
> alternative approach for combining hash values in execGrouping.c[1].
Yea, I played with that too - it's a bit better, but doesn't really
address the problem fully either. Seems to generally reduce collisions
in multi-column keys a bit, but not that much.
> >> In addition to that it seems quite worthwhile to provide an iterator
> >> that's not vulnerable to this. An approach that I am, seemingly
> >> successfully, testing is to iterate the hashtable in multiple (in my
> >> case 23, because why not) passes, accessing only every nth element. That
> >> allows the data to be inserted in a lot less "dense" fashion. But
> >> that's more an optimization, so I'll just push something like the patch
> >> mentioned in the thread already.
> >>
> >> Makes some sense?
> >
> > Yep.
> >
> Yes, it makes sense. Quadratic probing is another approach, but it
> would require an extra shift op every time we want to find the next or
> prev location during a collision.
That's not really the big problem with quadratic probing. The issue is
that it leads to further random unpredictable memory accesses in case of
collisions, whereas linear probing has easy to predict accesses.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers