On 2017-08-03 17:57:37 -0400, Robert Haas wrote:
> On Thu, Aug 3, 2017 at 5:50 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
> >> For me, the basic point here is that we need a set of hash functions
> >> for hash partitioning that are different than what we use for hash
> >> indexes and hash joins -- otherwise when we hash partition a table and
> >> create hash indexes on each partition, those indexes will have nasty
> >> clustering. Partitionwise hash joins will have similar problems. So,
> >> a new set of hash functions specifically for hash partitioning is
> >> quite desirable.
> >
> > Couldn't that just as well be solved by being a bit smarter with an IV? I
> > doubt we want to end up with different hashfunctions for sharding,
> > partitioning, hashjoins (which seems to form a hierarchy). Having a
> > working hash-combine function, or even better a hash API that can
> > continue to use the hash's internal state, seems a more scalable
> > solution.
>
> That's another way to go, but it requires inventing a way to thread
> the IV through the hash opclass interface.
Only if we really want to do it really well :P. Using a hash_combine()
like

/*
 * Combine two hash values, resulting in another hash value, with decent bit
 * mixing.
 *
 * Similar to boost's hash_combine().
 */
static inline uint32
hash_combine(uint32 a, uint32 b)
{
	a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
	return a;
}

between hash(IV) and the hashfunction should do the trick (the IV needs
to be hashed once, otherwise the bit mix is bad).
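
To make that concrete, here is a minimal sketch of how the pieces could
fit together. demo_hash_bytes() and demo_hash_with_iv() are hypothetical
names, and the FNV-1a loop is only a stand-in for whatever hash function
the type already has; none of this is existing PostgreSQL code.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t uint32;

/* Same combiner as in the mail, repeated so the sketch is self-contained. */
static inline uint32
hash_combine(uint32 a, uint32 b)
{
	a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
	return a;
}

/*
 * Hypothetical stand-in for a type's existing hash function.
 * This is 32-bit FNV-1a, chosen only because it is short.
 */
static uint32
demo_hash_bytes(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32		h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Hash the IV once, then combine it with the unmodified per-value hash.
 * The per-value hash function itself never sees the IV.
 */
static uint32
demo_hash_with_iv(uint32 iv, const void *data, size_t len)
{
	uint32		iv_hash = demo_hash_bytes(&iv, sizeof(iv));

	return hash_combine(iv_hash, demo_hash_bytes(data, len));
}
```

The point of this shape is that the existing hash functions stay
untouched; only the caller (partitioning, sharding, ...) picks an IV.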
> That's actually sort of a
> problem anyway. Maybe I ought to have started with the question of
> how we're going to make that end of things work.
+1 for that plan.
> We could:
>
> - Invent a new hash_partition AM that doesn't really make indexes but
> supplies hash functions for hash partitioning.
> - Add a new, optional support function 2 to the hash AM that takes a
> value of the type *and* an IV as an argument.
> - Something else.
Not arguing for it, but one option could also have pg_type.hash*
function(s).
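
For comparison, the second option in the quoted list (a function that
takes the value *and* an IV) could look roughly like the sketch below.
The name demo_hash_seeded() and its signature are purely hypothetical,
and FNV-1a again only stands in for a real hash; the difference from the
hash_combine() approach is that the IV perturbs the hash's initial state
instead of being mixed in afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical seeded hash: not an existing PostgreSQL support function.
 * The IV is folded into the starting state of a 32-bit FNV-1a loop.
 */
static uint32_t
demo_hash_seeded(const void *data, size_t len, uint32_t iv)
{
	const unsigned char *p = data;
	uint32_t	h = 2166136261u ^ iv;	/* IV changes the initial state */

	for (size_t i = 0; i < len; i++)
	{
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}
```

With iv = 0 this degenerates to the unseeded function, which might be a
convenient property if the unseeded values have to stay stable.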
One thing that I think might be advisable to think about is that we're
atm stuck with a relatively bad hash function for hash indexes (and hash
joins/aggs), and we should probably evolve it at some point. At the same
time there are currently people out there relying on the current hash
functions remaining stable.
Greetings,
Andres Freund