Re: [HACKERS] Hash Functions - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Hash Functions
Date
Msg-id 20170601182522.hu57alga5e5ctn5b@alap3.anarazel.de
Whole thread Raw
In response to Re: [HACKERS] Hash Functions  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Hash Functions
Re: [HACKERS] Hash Functions
Re: [HACKERS] Hash Functions
List pgsql-hackers
On 2017-06-01 13:59:42 -0400, Robert Haas wrote:
> I'm not actually aware of an instance where this has bitten anyone,
> even though it seems like it certainly could have and maybe should've
> gotten somebody at some point.  Has anyone else?

Two comments: First, citus has been doing hash-partitiong and
append/range partitioning for a while now, and I'm not aware of anyone
being bitten by this (although there've been plenty other things ;)),
even though there've been cases upgrading to different collation &
encodings.  Secondly, I think that's to a significant degree caused by
the fact that in practice people way more often partition on types like
int4/int8/date/timestamp/uuid rather than text - there's rarely good
reasons to do the latter.

> Furthermore, neither range nor list partitioning depends on properties
> of the hardware, like how wide integers are, or whether they are
> stored big-endian.  A naive approach to hash partitioning would depend
> on those things.  That's clearly worse.

I don't think our current int4/8 hash functions depend on
FLOAT8PASSBYVAL.


> 3. Implement portable hash functions (Jeff Davis or me, not sure
> which).  Andres scoffed at this idea, but I still think it might have
> legs.  Coming up with a hashing algorithm for integers that produces
> the same results on big-endian and little-endian systems seems pretty
> feasible, even with the additional constraint that it should still be
> fast.

Just to clarify: I don't think it's a problem to do so for integers and
most other simple scalar types. There's plenty hash algorithms that are
endianess independent, and the rest is just a bit of care.  Where I see
a lot more issues is doing so for more complex types like arrays, jsonb,
postgis geometry/geography types and the like, where the fast and simple
implementation is to just hash the entire datum - and that'll very
commonly not be portable at all due to padding and type wideness
differences.


> My personal guess is that most people will prefer the fast
> hash functions over the ones that solve their potential future
> migration problems, but, hey, options are good.

I'm pretty sure that will be the case.  I'm not sure that adding
infrastructure to allow for something that nobody will use in practice
is a good idea.  If there ends up being demand for it, we can still go there.


I think that the number of people that migrate between architectures is
low enough that this isn't going to be a very common issue.  Having some
feasible way around this is important, but I don't think we should
optimize heavily for it by developing new infrastructure / complicating
experience for the 'normal' uses.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [HACKERS] Effect of changing the value forPARALLEL_TUPLE_QUEUE_SIZE
Next
From: Andres Freund
Date:
Subject: Re: [HACKERS] logical replication busy-waiting on a lock