Re: [HACKERS] Hash Functions - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Hash Functions
Date
Msg-id CA+TgmobDuGr+cUv4d_LJncbaSM-tYumF8Tp83QyF1Fq+BSrtFw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Hash Functions  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Hash Functions  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [HACKERS] Hash Functions  (Kenneth Marshall <ktm@rice.edu>)
Re: [HACKERS] Hash Functions  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On Fri, May 12, 2017 at 1:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd vote that it's not, which means that this whole approach to hash
> partitioning is unworkable.  I agree with Andres that demanding hash
> functions produce architecture-independent values will not fly.

If we can't produce architecture-independent hash values, then what's
the other option?

One alternative would be to change the way that we dump and restore
the data.  Instead of dumping the data with the individual partitions,
dump it all out for the parent and let tuple routing sort it out at
restore time.  Of course, this isn't very satisfying because now
dump-and-restore hasn't really preserved the state of the database;
indeed, you could make it into a hard failure by creating a foreign
key referencing a partition hash partition.  After dump-and-restore,
the row ends up in some other partition and the foreign key can't be
recreated because the relationship no longer holds.  This isn't
limited to foreign keys, either; similar problems could be created
with CHECK constraints or other per-table properties that can vary
between one child and another.

I basically think it's pretty futile to suppose that we can get away
with having a dump and restore move rows around between partitions
without having that blow up in some cases.

> Maintaining such a property for float8 (and the types that depend on it)
> might be possible if you believe that nobody ever uses anything but IEEE
> floats, but we've never allowed that as a hard assumption before.

I don't know how standard that is.  Is there any hardware that
anyone's likely to be using that doesn't?  TBH, I don't really care if
support for obscure, nearly-dead platforms like VAX or whatever don't
quite work with hash-partitioned tables.  In practice, PostgreSQL only
sorta works on that kind of platform anyway; there are far bigger
problems than this.  On the other hand, if there are servers being
shipped in 2017 that don't use IEEE floats, that's another problem.

What about integers?  I think we're already assuming two's-complement
arithmetic, which I think means that the only problem with making the
hash values portable for integers is big-endian vs. little-endian.
That's sounds solveable-ish.

> Even architecture dependence isn't the whole scope of the problem.
> Consider for example dumping a LATIN1-encoded database and trying
> to reload it into a UTF8-encoded database.  People will certainly
> expect that to be possible, and do you want to guarantee that the
> hash of a text value is encoding-independent?

No, I think that's expecting too much.  I'd be just fine telling
people that if you hash-partition on a text column, it may not load
into a database with another encoding.  If you care about that, don't
use hash-partitioning, or don't change the encoding, or dump out the
partitions one by one and reload all the roads into the parent table
for re-routing, solving whatever problems come up along the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: [HACKERS] renaming "transaction log"
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] Hash Functions