Re: [HACKERS] Hash Functions - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Hash Functions
Date
Msg-id CA+TgmoaTbyeEGP0HQn6uCUJqHWc=eapEL2U3q1wk4Fz-4bGxmA@mail.gmail.com
In response to Re: [HACKERS] Hash Functions  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Hash Functions  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Sat, May 13, 2017 at 1:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Basically, this is simply saying that you're willing to ignore the
> hard cases, which reduces the problem to one of documenting the
> portability limitations.  You might as well not even bother with
> worrying about the integer case, because porting between little-
> and big-endian systems is surely far less common than cases you've
> already said you're okay with blowing off.
>
> That's not an unreasonable position to take, perhaps; doing better
> than that is going to be a lot more work and it's not very clear
> how much real-world benefit results.  But I can't follow the point
> of worrying about endianness but not encoding.

Encoding is a user choice, not a property of the machine.  Or, looking
at it from another point of view, the set of values that can be
represented by an int4 is the same whether they are represented in
big-endian form or in little-endian form, but the set of values that
are representable changes when you switch encodings.  You could argue
that text-under-LATIN1 and text-under-UTF8 aren't really the same data
type at all.  It's one thing to say "you can pick up your data and
move it to a different piece of hardware and nothing will break".
It's quite another thing to say "you can pick up your data and convert
it to a different encoding and nothing will break".  The latter is
generally false already.  Maybe LATIN1 -> UTF8 is no-fail, but what
about UTF8 -> LATIN1 or SJIS -> anything?  Based on previous mailing
list discussions, I'm under the impression that it is sometimes
debatable how a character in one encoding should be converted to some
other encoding, either because it's not clear whether there is a
mapping at all or it's unclear what mapping should be used.  See,
e.g., commit 2dbbf33f4a95cdcce66365bcdb47c885a8858d3c, or
https://www.postgresql.org/message-id/1739a900-30ab-f48e-aec4-2b35475ecf02%402ndquadrant.com
where it was discussed that being able to convert encoding A ->
encoding B does not guarantee the ability to perform the reverse
conversion.
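
To make the asymmetry concrete, here is a minimal Python sketch.  It
uses Python's built-in codecs, not PostgreSQL's own conversion
routines, so it only illustrates the general shape of the problem.
The helper name can_convert is made up for this example.

```python
def can_convert(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Every LATIN1 byte decodes to a code point that UTF-8 can also encode,
# so this direction cannot fail.
all_latin1 = bytes(range(256)).decode("latin-1")
latin1_to_utf8_ok = can_convert(all_latin1, "utf-8")

# The reverse direction fails for any character outside LATIN1's repertoire,
# for example the euro sign (U+20AC).
euro_sign = "\u20ac"
utf8_to_latin1_ok = can_convert(euro_sign, "latin-1")

print("LATIN1 -> UTF8 always possible:", latin1_to_utf8_ok)
print("UTF8 -> LATIN1 possible for the euro sign:", utf8_to_latin1_ok)
```

The set of representable values differs between the two encodings,
which is the sense in which text-under-LATIN1 and text-under-UTF8
behave like different data types.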

Arguing that a given int4 value should hash to the same value on every
platform seems like a request that is at least superficially
reasonable, if possibly practically tricky in some cases.  Arguing
that every currently supported encoding should hash every character
the same way when they don't all have the same set of characters and
the mappings between them are occasionally debatable is asking for the
impossible.  I certainly don't want to commit to a design for hash
partitioning that involves a compatibility break any time somebody
changes any encoding conversion in the system, even if a hash function
that involved translating every character to some sort of universal
code point before hashing it didn't seem likely to be horribly slow.
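
A rough Python sketch of the distinction, under loud assumptions: it
uses 32-bit FNV-1a as a stand-in hash (not PostgreSQL's actual hash
functions), and the helper names hash_int4_portable and
hash_text_bytes are hypothetical.  The point is only that an int4 can
be normalized to one byte order before hashing, whereas hashing the
stored bytes of text gives results that depend on the encoding.

```python
import struct

FNV_OFFSET_BASIS = 0x811C9DC5
FNV_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a, used here only as a simple stand-in hash."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def hash_int4_portable(value: int) -> int:
    """Hash an int4 after serializing it in one fixed (little-endian) order."""
    return fnv1a_32(struct.pack("<i", value))


def hash_text_bytes(text: str, encoding: str) -> int:
    """Hash the bytes that would be stored for text under the given encoding."""
    return fnv1a_32(text.encode(encoding))


# Hashing the in-memory bytes directly depends on the machine's byte order.
raw_little = fnv1a_32(struct.pack("<i", 1))
raw_big = fnv1a_32(struct.pack(">i", 1))
print("raw int4 hashes match across byte orders:", raw_little == raw_big)

# Normalizing the byte order first removes that dependence: the same value
# always yields the same bytes, whatever the hardware.
print("portable int4 hash of 1:", hex(hash_int4_portable(1)))

# For text, the stored bytes themselves change with the encoding.
e_acute = "\u00e9"
same_text_hash = hash_text_bytes(e_acute, "latin-1") == hash_text_bytes(e_acute, "utf-8")
print("e-acute hashes match across encodings:", same_text_hash)
```

Fixing the integer case costs a byte swap on one class of hardware.
Fixing the text case would need an agreed, stable mapping from every
encoding to common code points, which is the part that looks
infeasible.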

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


