Re: [HACKERS] Hash Functions - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Hash Functions
Date
Msg-id 20170513230829.rw5oocydnpvlqyfh@alap3.anarazel.de
Whole thread Raw
In response to Re: [HACKERS] Hash Functions  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Hash Functions  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2017-05-13 10:29:09 -0400, Robert Haas wrote:
> On Sat, May 13, 2017 at 12:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Can we think of defining separate portable hash functions which can be
> > used for the purpose of hash partitioning?
> 
> I think that would be a good idea.  I think it shouldn't even be that
> hard.  By data type:
> 
> - Integers.  We'd need to make sure that we get the same results for
> the same value on big-endian and little-endian hardware, and that
> performance is good on both systems.  That seems doable.
> 
> - Floats.  There may be different representations in use on different
> hardware, which could be a problem.  Tom didn't answer my question
> about whether any even-vaguely-modern hardware is still using non-IEEE
> floats, which I suspect means that the answer is "no".  If every bit
> of hardware we are likely to find uses basically the same
> representation of the same float value, then this shouldn't be hard.
> (Also, even if this turns out to be hard for floats, using a float as
> a partitioning key would be a surprising choice because the default
> output representation isn't even unambiguous; you need
> extra_float_digits for that.)
> 
> - Strings.  There's basically only one representation for a string.
> If we assume that the hash code only needs to be portable across
> hardware and not across encodings, a position for which I already
> argued upthread, then I think this should be manageable.
> 
> - Everything Else.  Basically, everything else is just a composite of
> that stuff, I think.

I seriously doubt that's true.  A lot of more complex types have
internal alignment padding and such.  Consider e.g. something like
jsonb, hstore, or postgis types - you *can* convert them to something
that's unambiguous, but it's going to be fairly expensive.  Essentially
you'd have to something like calling the output function, and then
hashing the result of that.  And hash-partitioning is particularly
interesting for e.g. large amounts of points in a geospatial scenario,
because other types of partitioning are quite hard to maintain.

- Andres



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: [HACKERS] Hash Functions
Next
From: Mark Dilger
Date:
Subject: [HACKERS] Event triggers + table partitioning cause server crash in current master