Re: [HACKERS] Hash Functions - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Hash Functions
Date
Msg-id 20170803220823.63iknkhmmqzdgi5f@alap3.anarazel.de
Whole thread Raw
In response to Re: [HACKERS] Hash Functions  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Hash Functions
List pgsql-hackers
On 2017-08-03 17:57:37 -0400, Robert Haas wrote:
> On Thu, Aug 3, 2017 at 5:50 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
> >> For me, the basic point here is that we need a set of hash functions
> >> for hash partitioning that are different than what we use for hash
> >> indexes and hash joins -- otherwise when we hash partition a table and
> >> create hash indexes on each partition, those indexes will have nasty
> >> clustering.  Partitionwise hash joins will have similar problems.  So,
> >> a new set of hash functions specifically for hash partitioning is
> >> quite desirable.
> >
> > Couldn't that just as well solved by being a bit smarter with an IV? I
> > doubt we want to end up with different hashfunctions for sharding,
> > partitioning, hashjoins (which seems to form a hierarchy). Having a
> > working hash-combine function, or even better a hash API that can
> > continue to use the hash's internal state, seems a more scalable
> > solution.
> 
> That's another way to go, but it requires inventing a way to thread
> the IV through the hash opclass interface.

Only if we really want to do it really well :P. Using a hash_combine()
like

/** Combine two hash values, resulting in another hash value, with decent bit* mixing.** Similar to boost's
hash_combine().*/
static inline uint32
hash_combine(uint32 a, uint32 b)
{a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);return a;
}

between hash(IV) and the hashfunction should do the trick (the IV needs
to hashed once, otherwise the bit mix is bad).


> That's actually sort of a
> problem anyway.  Maybe I ought to have started with the question of
> how we're going to make that end of things work.

+1 one for that plan.


> We could:
> 
> - Invent a new hash_partition AM that doesn't really make indexes but
> supplies hash functions for hash partitioning.
> - Add a new, optional support function 2 to the hash AM that takes a
> value of the type *and* an IV as an argument.
> - Something else.

Not arguing for it, but one option could also have pg_type.hash*
function(s).

One thing that I think might be advisable to think about is that we're
atm stuck with a relatively bad hash function for hash indexes (and hash
joins/aggs), and we should probably evolve it at some point. At the same
time there's currently people out there relying on the current hash
functions remaining stable.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [HACKERS] Hash Functions
Next
From: Mark Rofail
Date:
Subject: Re: [HACKERS] GSoC 2017: Foreign Key Arrays